Making code auto vectorizable

Making code auto vectorizable

Stefan Westerfeld
   Hi!

I've built a new beast tree, which thanks to Tim's work now supports
building SSE optimized versions of plugins. I used the compiler flag
-ftree-vectorizer-verbose=5 to see what actually gets vectorized. The
result is somewhat disappointing: not a single plugin benefits from the
auto vectorizer. The tree vectorizer messages are:

bseadder.c:209: note: not vectorized: number of iterations cannot be computed.
bseadder.c:216: note: not vectorized: number of iterations cannot be computed.
bseadder.c:229: note: not vectorized: number of iterations cannot be computed.
bseadder.c:238: note: not vectorized: number of iterations cannot be computed.
bseadder.c:238: note: vectorized 0 loops in function.

The problem is that the standard handling of loop boundaries (and
accessing audio data buffers) that many beast plugins use is this:

  gfloat *bound = output + n_values;
  while (output < bound)
    *output++ = *input++;

which is not (currently) recognized by the tree vectorizer. Rewriting
the loop without this construct, like this:

  int i;
  for (i = 0; i < n_values; i++)
    output[i] = input[i];

leads to a vectorizable loop. Note that this only works if i is signed,
so using a guint for iterating does not enable vectorization (it took me
quite some trial and error to figure that out).
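
For anyone who wants to try this outside of beast, here is a minimal sketch of the two loop shapes side by side. It uses plain float instead of gfloat, and the function names are made up for illustration; which of the two gets vectorized depends on the compiler version, so check the vectorizer output rather than taking it for granted.

```c
#include <assert.h>

/* Pointer-walking form: the loop is bounded by a pointer comparison.
 * This is the shape for which the vectorizer reported "number of
 * iterations cannot be computed" in the messages above. */
void
copy_pointer_walk (float *output, const float *input, int n_values)
{
  float *bound = output + n_values;
  while (output < bound)
    *output++ = *input++;
}

/* Indexed form: a signed counter running from 0 to n_values - 1. */
void
copy_indexed (float *output, const float *input, int n_values)
{
  int i;
  assert (n_values >= 0);
  for (i = 0; i < n_values; i++)
    output[i] = input[i];
}
```

Both functions compute the same result; compiling such a file with -ftree-vectorize and the -ftree-vectorizer-verbose=5 flag mentioned above shows which loop the vectorizer accepts.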

To get the most out of the auto vectorizer, I suggest rewriting vectorizable
inner loops in the way I indicated, as I think it is (generally) not
slower for the non-SIMD case. In fact, which is faster (incrementing all
pointers or using one index variable) will probably depend on quite a few
factors, like the processor type, pipelining, register allocation,
number of channels, the algorithm within the loop and so on.

Below is such a patch for BseAdder.

   Cu... Stefan

cvs server: Diffing .
Index: ChangeLog
===================================================================
RCS file: /cvs/gnome/beast/plugins/ChangeLog,v
retrieving revision 1.163
diff -u -p -r1.163 ChangeLog
--- ChangeLog 12 Apr 2006 01:05:32 -0000 1.163
+++ ChangeLog 12 Apr 2006 13:40:08 -0000
@@ -1,3 +1,8 @@
+Wed Apr 12 15:38:20 2006  Stefan Westerfeld  <[hidden email]>
+
+ * bseadder.c: Rewrote inner loops in a way that can be auto vectorized
+ by the gcc-4.1 auto vectorizer.
+
 Wed Apr 12 02:35:47 2006  Tim Janik  <[hidden email]>
 
  * Makefile.am: added a rule "refresh-Makefile.plugins:" to rebuild the
Index: bseadder.c
===================================================================
RCS file: /cvs/gnome/beast/plugins/bseadder.c,v
retrieving revision 1.30
diff -u -p -r1.30 bseadder.c
--- bseadder.c 23 Jul 2004 18:12:41 -0000 1.30
+++ bseadder.c 12 Apr 2006 13:40:08 -0000
@@ -190,10 +190,10 @@ adder_process (BseModule *module,
   Adder *adder = module->user_data;
   guint n_au1 = BSE_MODULE_JSTREAM (module, BSE_ADDER_JCHANNEL_AUDIO1).n_connections;
   guint n_au2 = BSE_MODULE_JSTREAM (module, BSE_ADDER_JCHANNEL_AUDIO2).n_connections;
-  gfloat *out, *audio_out = BSE_MODULE_OBUFFER (module, BSE_ADDER_OCHANNEL_AUDIO_OUT);
-  gfloat *bound = audio_out + n_values;
+  gfloat *audio_out = BSE_MODULE_OBUFFER (module, BSE_ADDER_OCHANNEL_AUDIO_OUT);
   const gfloat *auin;
   guint i;
+  int n;
 
   if (!n_au1 && !n_au2)
     {
@@ -203,17 +203,13 @@ adder_process (BseModule *module,
   if (n_au1) /* sum up audio1 inputs */
     {
       auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO1, 0);
-      out = audio_out;
-      do
- *out++ = *auin++;
-      while (out < bound);
+      for (n = 0; n < n_values; n++)
+ audio_out[n] = auin[n];
       for (i = 1; i < n_au1; i++)
  {
   auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO1, i);
-  out = audio_out;
-  do
-    *out++ += *auin++;
-  while (out < bound);
+  for (n = 0; n < n_values; n++)
+    audio_out[n] += auin[n];
  }
     }
   else
@@ -223,19 +219,15 @@ adder_process (BseModule *module,
     for (i = 0; i < n_au2; i++)
       {
  auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO2, i);
- out = audio_out;
- do
-  *out++ += *auin++;
- while (out < bound);
+ for (n = 0; n < n_values; n++)
+  audio_out[n] += auin[n];
       }
   else if (n_au2) /*  subtract audio2 inputs */
     for (i = 0; i < n_au2; i++)
       {
  auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO2, i);
- out = audio_out;
- do
-  *out++ -= *auin++;
- while (out < bound);
+ for (n = 0; n < n_values; n++)
+  audio_out[n] -= auin[n];
       }
 }
 
cvs server: Diffing evaluator
cvs server: Diffing freeverb
cvs server: Diffing icons



--
Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan
_______________________________________________
beast mailing list
[hidden email]
http://mail.gnome.org/mailman/listinfo/beast

Re: Making code auto vectorizable

Tim Janik
On Wed, 12 Apr 2006, Stefan Westerfeld wrote:

>   Hi!
>
> I've built a new beast tree, which thanks to Tims work now supports
> building SSE optimized versions of plugins. I used the compiler flag
> -ftree-vectorizer-verbose=5 to see what actually gets vectorized. The
> result is somewhat disappointing: not a single plugin benefits from the
> auto vectorizer. The tree vectorizer messages are:
>
> bseadder.c:209: note: not vectorized: number of iterations cannot be computed.
> bseadder.c:216: note: not vectorized: number of iterations cannot be computed.
> bseadder.c:229: note: not vectorized: number of iterations cannot be computed.
> bseadder.c:238: note: not vectorized: number of iterations cannot be computed.
> bseadder.c:238: note: vectorized 0 loops in function.
>
> The problem is that the standard handling of loop boundaries (and
> accessing audio data buffers) that many beast plugins use is this:
>
>  gfloat *bound = output + n_values;
>  while (output < bound)
>    *output++ = *input++;
>
> which is not (currently) recognized by the tree vectorizer. Rewriting
> the loop without this construct, like this:
>
>  int i;
>  for (i = 0; i < n_values; i++)
>    output[i] = input[i];
>
> leads to a vectorizable loop. Note that this only works if i is signed,
> so using a guint for iterating does not enable vectorization (it took me
> quite some trial and error to figure that out).

what compiler version is this?
does the guint/gint problem persist in gcc-4.2snapshot?

> To get most of the auto vectorizer, I suggest rewriting vectorizable
> inner loops in the way I indicated, as I think it is (generally) not
> slower for the non-SIMD case. In fact, which is faster (incrementing all
> pointers or using one index variable) will probably depend on quite some
> factors, like the processor type, pipelining, register allocation,
> number of channels, the algorithm within the loop and so on.
>
> Below is such a patch for BseAdder.

ok thanks. please feel free to cook up more patches ;)

>   Cu... Stefan
>
> cvs server: Diffing .
> Index: ChangeLog
> ===================================================================
> RCS file: /cvs/gnome/beast/plugins/ChangeLog,v
> retrieving revision 1.163
> diff -u -p -r1.163 ChangeLog
> --- ChangeLog 12 Apr 2006 01:05:32 -0000 1.163
> +++ ChangeLog 12 Apr 2006 13:40:08 -0000
> @@ -1,3 +1,8 @@
> +Wed Apr 12 15:38:20 2006  Stefan Westerfeld  <[hidden email]>
> +
> + * bseadder.c: Rewrote inner loops in a way that can be auto vectorized
> + by the gcc-4.1 auto vectorizer.
> +
> Wed Apr 12 02:35:47 2006  Tim Janik  <[hidden email]>
>
> * Makefile.am: added a rule "refresh-Makefile.plugins:" to rebuild the
> Index: bseadder.c
> ===================================================================
> RCS file: /cvs/gnome/beast/plugins/bseadder.c,v
> retrieving revision 1.30
> diff -u -p -r1.30 bseadder.c
> --- bseadder.c 23 Jul 2004 18:12:41 -0000 1.30
> +++ bseadder.c 12 Apr 2006 13:40:08 -0000
> @@ -190,10 +190,10 @@ adder_process (BseModule *module,
>   Adder *adder = module->user_data;
>   guint n_au1 = BSE_MODULE_JSTREAM (module, BSE_ADDER_JCHANNEL_AUDIO1).n_connections;
>   guint n_au2 = BSE_MODULE_JSTREAM (module, BSE_ADDER_JCHANNEL_AUDIO2).n_connections;
> -  gfloat *out, *audio_out = BSE_MODULE_OBUFFER (module, BSE_ADDER_OCHANNEL_AUDIO_OUT);
> -  gfloat *bound = audio_out + n_values;
> +  gfloat *audio_out = BSE_MODULE_OBUFFER (module, BSE_ADDER_OCHANNEL_AUDIO_OUT);
>   const gfloat *auin;
>   guint i;
> +  int n;

for pure iteration, i,j,k,u,v,x,y,z are more commonly used as loop
variables than letters that usually denote sizes, lengths or
dimensions, like l,m,n,s.
so please use 'j' instead of 'n' here.


>
>   if (!n_au1 && !n_au2)
>     {
> @@ -203,17 +203,13 @@ adder_process (BseModule *module,
>   if (n_au1) /* sum up audio1 inputs */
>     {
>       auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO1, 0);
> -      out = audio_out;
> -      do
> - *out++ = *auin++;
> -      while (out < bound);
> +      for (n = 0; n < n_values; n++)
> + audio_out[n] = auin[n];

and while you're at it, please declare const gfloat *auin=... in the innermost
scope possible.


>       for (i = 1; i < n_au1; i++)
> {
>  auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO1, i);
> -  out = audio_out;
> -  do
> -    *out++ += *auin++;
> -  while (out < bound);
> +  for (n = 0; n < n_values; n++)
> +    audio_out[n] += auin[n];
> }
>     }
>   else
> @@ -223,19 +219,15 @@ adder_process (BseModule *module,
>     for (i = 0; i < n_au2; i++)
>       {
> auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO2, i);
> - out = audio_out;
> - do
> -  *out++ += *auin++;
> - while (out < bound);
> + for (n = 0; n < n_values; n++)
> +  audio_out[n] += auin[n];
>       }
>   else if (n_au2) /*  subtract audio2 inputs */
>     for (i = 0; i < n_au2; i++)
>       {
> auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO2, i);
> - out = audio_out;
> - do
> -  *out++ -= *auin++;
> - while (out < bound);
> + for (n = 0; n < n_values; n++)
> +  audio_out[n] -= auin[n];
>       }
> }
>

the rest looks good. provided it has been properly tested,
this can go into CVS. do we have a feature test for BseAdder
already?

---
ciaoTJ

Re: Making code auto vectorizable

Stefan Westerfeld
   Hi!

On Tue, Apr 18, 2006 at 05:51:52PM +0200, Tim Janik wrote:

> On Wed, 12 Apr 2006, Stefan Westerfeld wrote:
> >I've built a new beast tree, which thanks to Tims work now supports
> >building SSE optimized versions of plugins. I used the compiler flag
> >-ftree-vectorizer-verbose=5 to see what actually gets vectorized. The
> >result is somewhat disappointing: not a single plugin benefits from the
> >auto vectorizer. The tree vectorizer messages are:
> >
> >bseadder.c:209: note: not vectorized: number of iterations cannot be
> >computed.
> >bseadder.c:216: note: not vectorized: number of iterations cannot be
> >computed.
> >bseadder.c:229: note: not vectorized: number of iterations cannot be
> >computed.
> >bseadder.c:238: note: not vectorized: number of iterations cannot be
> >computed.
> >bseadder.c:238: note: vectorized 0 loops in function.
> >
> >The problem is that the standard handling of loop boundaries (and
> >accessing audio data buffers) that many beast plugins use is this:
> >
> > gfloat *bound = output + n_values;
> > while (output < bound)
> >   *output++ = *input++;
> >
> >which is not (currently) recognized by the tree vectorizer. Rewriting
> >the loop without this construct, like this:
> >
> > int i;
> > for (i = 0; i < n_values; i++)
> >   output[i] = input[i];
> >
> >leads to a vectorizable loop. Note that this only works if i is signed,
> >so using a guint for iterating does not enable vectorization (it took me
> >quite some trial and error to figure that out).
>
> what compiler version is this?
> does the guint/gint problem persist in gcc-4.2snapshot?

Yes, it does. And there is another change in gcc-snapshot: it doesn't
vectorize the loop any more, unless __restrict__ is used to declare that
the input and output buffer don't have a data dependency. I've updated
my patch accordingly.
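
To illustrate what the qualifier promises, here is a small standalone sketch (plain float, hypothetical function name, not the beast code itself). __restrict__ is the GCC spelling of C99 "restrict"; the caller guarantees that the memory written through out is not also reached through in while the function runs, so the compiler does not have to assume that a store to out[j] changes a later in[j + 1].

```c
#include <assert.h>

/* Both pointers are restrict-qualified: the caller promises that the
 * two buffers do not overlap. Passing overlapping buffers would be
 * undefined behaviour. */
void
add_block (float *__restrict__ out, const float *__restrict__ in, int n_values)
{
  int j;
  assert (n_values >= 0);
  for (j = 0; j < n_values; j++)
    out[j] += in[j];
}
```

The price of the hint is that overlapping input and output buffers are no longer allowed, which is something each module has to be sure about before adding it.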

> >BSE_ADDER_OCHANNEL_AUDIO_OUT);
> >  const gfloat *auin;
> >  guint i;
> >+  int n;
>
> for pure iteration i,j,k,u,v,x,y,z are more often used as iteration
> variables than those often used to denote certain sizes, lengths or
> dimensions like l,m,n,s.
> i.e. please use 'j' instead of 'n' here.

Ok.

> >-      do
> >- *out++ = *auin++;
> >-      while (out < bound);
> >+      for (n = 0; n < n_values; n++)
> >+ audio_out[n] = auin[n];
>
> and while you're at it, please declare const gfloat *auin=... in the
> innermost
> scope possible.

Ok.

> the rest looks good. provided it has been properly tested,
> this can go into CVS. do we have a feature test for BseAdder
> already?

We do have a feature test and it still passes with the vectorized loop.
Should I commit the updated patch with the __restrict__ keyword added?
We might need to check whether the compiler supports it.

The new patch looks like this:

Index: ChangeLog
===================================================================
RCS file: /cvs/gnome/beast/plugins/ChangeLog,v
retrieving revision 1.168
diff -u -p -u -r1.168 ChangeLog
--- ChangeLog 20 Apr 2006 18:09:27 -0000 1.168
+++ ChangeLog 2 May 2006 20:05:43 -0000
@@ -1,3 +1,8 @@
+Tue May  2 22:04:52 2006  Stefan Westerfeld  <[hidden email]>
+
+ * bseadder.c: Rewrote inner loops in a way that can be auto vectorized
+ by the gcc-4.1 and gcc-snapshot auto vectorizer.
+
 Thu Apr 20 20:08:52 2006  Tim Janik  <[hidden email]>
 
  * bseblockutils.cc: fixed inner variable declarations which erroneously
Index: bseadder.c
===================================================================
RCS file: /cvs/gnome/beast/plugins/bseadder.c,v
retrieving revision 1.30
diff -u -p -u -r1.30 bseadder.c
--- bseadder.c 23 Jul 2004 18:12:41 -0000 1.30
+++ bseadder.c 2 May 2006 20:05:43 -0000
@@ -190,10 +190,9 @@ adder_process (BseModule *module,
   Adder *adder = module->user_data;
   guint n_au1 = BSE_MODULE_JSTREAM (module, BSE_ADDER_JCHANNEL_AUDIO1).n_connections;
   guint n_au2 = BSE_MODULE_JSTREAM (module, BSE_ADDER_JCHANNEL_AUDIO2).n_connections;
-  gfloat *out, *audio_out = BSE_MODULE_OBUFFER (module, BSE_ADDER_OCHANNEL_AUDIO_OUT);
-  gfloat *bound = audio_out + n_values;
-  const gfloat *auin;
+  gfloat *__restrict__ audio_out = BSE_MODULE_OBUFFER (module, BSE_ADDER_OCHANNEL_AUDIO_OUT);
   guint i;
+  gint j;
 
   if (!n_au1 && !n_au2)
     {
@@ -202,18 +201,14 @@ adder_process (BseModule *module,
     }
   if (n_au1) /* sum up audio1 inputs */
     {
-      auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO1, 0);
-      out = audio_out;
-      do
- *out++ = *auin++;
-      while (out < bound);
+      const gfloat *__restrict__ auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO1, 0);
+      for (j = 0; j < n_values; j++)
+ audio_out[j] = auin[j];
       for (i = 1; i < n_au1; i++)
  {
   auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO1, i);
-  out = audio_out;
-  do
-    *out++ += *auin++;
-  while (out < bound);
+  for (j = 0; j < n_values; j++)
+    audio_out[j] += auin[j];
  }
     }
   else
@@ -222,20 +217,16 @@ adder_process (BseModule *module,
   if (n_au2 && !adder->subtract) /* sum up audio2 inputs */
     for (i = 0; i < n_au2; i++)
       {
- auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO2, i);
- out = audio_out;
- do
-  *out++ += *auin++;
- while (out < bound);
+ const gfloat *__restrict__ auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO2, i);
+ for (j = 0; j < n_values; j++)
+  audio_out[j] += auin[j];
       }
   else if (n_au2) /*  subtract audio2 inputs */
     for (i = 0; i < n_au2; i++)
       {
- auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO2, i);
- out = audio_out;
- do
-  *out++ -= *auin++;
- while (out < bound);
+ const gfloat *__restrict__ auin = BSE_MODULE_JBUFFER (module, BSE_ADDER_JCHANNEL_AUDIO2, i);
+ for (j = 0; j < n_values; j++)
+  audio_out[j] -= auin[j];
       }
 }
 

   Cu... Stefan
--
Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan

Re: Making code auto vectorizable

Tim Janik
On Tue, 2 May 2006, Stefan Westerfeld wrote:

>   Hi!
>
> On Tue, Apr 18, 2006 at 05:51:52PM +0200, Tim Janik wrote:

>>> which is not (currently) recognized by the tree vectorizer. Rewriting
>>> the loop without this construct, like this:
>>>
>>> int i;
>>> for (i = 0; i < n_values; i++)
>>>   output[i] = input[i];
>>>
>>> leads to a vectorizable loop. Note that this only works if i is signed,
>>> so using a guint for iterating does not enable vectorization (it took me
>>> quite some trial and error to figure that out).
>>
>> what compiler version is this?
>> does the guint/gint problem persist in gcc-4.2snapshot?
>
> Yes, it does. And there is another change in gcc-snapshot: it doesn't
> vectorize the loop any more, unless __restrict__ is used to declare that
> the input and output buffer don't have a data dependency. I've updated
> my patch accordingly.

>> the rest looks good. provided it has been properly tested,
>> this can go into CVS. do we have a feature test for BseAdder
>> already?
>
> We do have a feature test and it still passes with the vectorized loop.
> Should I commit the updated patch with the __restrict__ keyword added?
> It might be necessary to look whether the compiler has support for it.

no, first, we should define "restrict" to __restrict__ if it is supported
and to nothing otherwise. and second, the bseadder code could be rewritten
in terms of bse_block_copy_float() and bse_block_add_floats(), right?
then, we should use that instead.
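
a rough sketch of what such a define could look like, purely as an assumption about one way to do it: EXAMPLE_RESTRICT and example_copy_floats() are hypothetical names (a separate macro name is used here so the C99 keyword itself isn't redefined), and a real build would more likely decide this in a configure check than by testing compiler macros.

```c
#include <assert.h>

/* EXAMPLE_RESTRICT is a hypothetical macro name, not an existing define. */
#if defined (__GNUC__)
#  define EXAMPLE_RESTRICT __restrict__   /* GCC spelling, also works in C89 mode */
#elif defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#  define EXAMPLE_RESTRICT restrict       /* C99 keyword */
#else
#  define EXAMPLE_RESTRICT                /* unsupported: expands to nothing */
#endif

/* Hypothetical block copy that uses the macro; the buffers must not overlap. */
void
example_copy_floats (float *EXAMPLE_RESTRICT out,
                     const float *EXAMPLE_RESTRICT in,
                     int n_values)
{
  int j;
  assert (n_values >= 0);
  for (j = 0; j < n_values; j++)
    out[j] = in[j];
}
```

with the empty fallback the code still compiles everywhere, it just loses the aliasing hint on compilers that don't know the qualifier.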

>   Cu... Stefan

---
ciaoTJ

Re: Making code auto vectorizable

Stefan Westerfeld
   Hi!

On Tue, May 02, 2006 at 11:04:47PM +0200, Tim Janik wrote:

> On Tue, 2 May 2006, Stefan Westerfeld wrote:
> >On Tue, Apr 18, 2006 at 05:51:52PM +0200, Tim Janik wrote:
>
> >>>which is not (currently) recognized by the tree vectorizer. Rewriting
> >>>the loop without this construct, like this:
> >>>
> >>>int i;
> >>>for (i = 0; i < n_values; i++)
> >>>  output[i] = input[i];
> >>>
> >>>leads to a vectorizable loop. Note that this only works if i is signed,
> >>>so using a guint for iterating does not enable vectorization (it took me
> >>>quite some trial and error to figure that out).
> >>
> >>what compiler version is this?
> >>does the guint/gint problem persist in gcc-4.2snapshot?
> >
> >Yes, it does. And there is another change in gcc-snapshot: it doesn't
> >vectorize the loop any more, unless __restrict__ is used to declare that
> >the input and output buffer don't have a data dependency. I've updated
> >my patch accordingly.
>
> >>the rest looks good. provided it has been properly tested,
> >>this can go into CVS. do we have a feature test for BseAdder
> >>already?
> >
> >We do have a feature test and it still passes with the vectorized loop.
> >Should I commit the updated patch with the __restrict__ keyword added?
> >It might be necessary to look whether the compiler has support for it.
>
> no, first, we should define "restrict" to __restrict__ if it is supported
> and to nothing otherwise. and second, the bseadder code could be rewritten
> in terms of bse_block_copy_float() and bse_block_add_floats(), right?
> then, we should use that instead.

Since BseAdder supports subtracting, we would need to extend the Block
API. I can provide a new patch which does this, and reimplements the
adder on top of it.
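
Roughly, I imagine something like the following sketch. The names and signatures are assumptions made up for this mail (they are not the existing Block API), and plain float stands in for gfloat:

```c
#include <assert.h>

/* Hypothetical block helper: out[j] -= in[j] for every value.
 * The buffers must not overlap. */
void
example_block_sub_floats (float *__restrict__ out,
                          const float *__restrict__ in,
                          int n_values)
{
  int j;
  assert (n_values >= 0);
  for (j = 0; j < n_values; j++)
    out[j] -= in[j];
}

/* Subtract every connected input from the output buffer, in the way the
 * subtract branch of the adder walks over its audio2 inputs. */
void
example_subtract_inputs (float *out,
                         const float *const *inputs,
                         int n_inputs,
                         int n_values)
{
  int i;
  for (i = 0; i < n_inputs; i++)
    example_block_sub_floats (out, inputs[i], n_values);
}
```

The module then contains no inner loop over sample values any more; only the loop over the connected inputs is left.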

However, the point of having an auto vectorizer in the first place is
that you don't have to rewrite all your code; you simply use a compiler
option and everything else happens automatically. We kind of give this
up if we keep extending the block API for every problem that we
run into, and eliminate more and more inner loops from modules, instead of
letting the auto vectorizer do the work.

Of course, it's a question of where to draw the line. Subtracting blocks
could be argued to be reasonably common, so it's not too bad to have
a generic version available.

   Cu... Stefan
--
Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan

Re: Making code auto vectorizable

Tim Janik
On Wed, 3 May 2006, Stefan Westerfeld wrote:

>   Hi!
>
> On Tue, May 02, 2006 at 11:04:47PM +0200, Tim Janik wrote:
>> On Tue, 2 May 2006, Stefan Westerfeld wrote:
>>> On Tue, Apr 18, 2006 at 05:51:52PM +0200, Tim Janik wrote:

>>>> the rest looks good. provided it has been properly tested,
>>>> this can go into CVS. do we have a feature test for BseAdder
>>>> already?
>>>
>>> We do have a feature test and it still passes with the vectorized loop.
>>> Should I commit the updated patch with the __restrict__ keyword added?
>>> It might be necessary to look whether the compiler has support for it.
>>
>> no, first, we should define "restrict" to __restrict__ if it is supported
>> and to nothing otherwise. and second, the bseadder code could be rewritten
>> in terms of bse_block_copy_float() and bse_block_add_floats(), right?
>> then, we should use that instead.
>
> Since BseAdder supports subtracting, we would need to extend the Block
> API. I can provide a new patch which does this, and reimplements the
> adder on top of it.

sure, that'd be great.

> However, the point about having an auto vectorizer in the first place is
> that you don't have to rewrite all your code; you simply use a compiler
> option and everything else happens automatically. We kind-of give this
> up if we go on and on extending the block API for every problem that we
> get, and eliminate inner loops more and more of modules, instead of
> letting the auto vectorizer do the work.
>
> Of course, its the question where to draw the line. Subtracting blocks
> could be argued to be reasonably common, so that its not too bad to have
> a generic version available.

exactly, you're right that we'll have to draw an arbitrary line somewhere.

not relying on the auto-vectorizer but using hand crafted vectorized
functions does have certain advantages though:
- the optimization is less compiler (version) dependent;
- the code is possibly faster, because the programmer can adapt loops and
   associated data structures for vectorized operations, which is more than
   the compiler can do;
- in some cases, hand crafted optimizations may be doable that are
   not available to the auto-vectorizer, such as using small asm-loops or
   other pointer/block-address poking that relies on intrinsic system knowledge.
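
to make "hand crafted" a bit more concrete, here is a minimal sketch of a block add written with SSE intrinsics. the function name is hypothetical, plain float stands in for gfloat, and this is not the actual block utility code; without SSE support it simply falls back to the scalar loop.

```c
#include <assert.h>
#if defined (__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Hypothetical hand-vectorized block add: out[j] += in[j]. */
void
example_add_floats_sse (float *out, const float *in, int n_values)
{
  int j = 0;
  assert (n_values >= 0);
#if defined (__SSE__)
  /* Four floats per iteration. Unaligned loads and stores are used so
   * that no assumption about buffer alignment is needed. */
  for (; j + 4 <= n_values; j += 4)
    {
      __m128 a = _mm_loadu_ps (out + j);
      __m128 b = _mm_loadu_ps (in + j);
      _mm_storeu_ps (out + j, _mm_add_ps (a, b));
    }
#endif
  /* Scalar tail for the remaining 0..3 values; without SSE this loop
   * handles the whole block. */
  for (; j < n_values; j++)
    out[j] += in[j];
}
```

if the buffers were known to be 16-byte aligned, the aligned load/store variants could be used instead, which is the kind of knowledge about the data structures the compiler doesn't have.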

as you discovered in your auto-vectorization tests, changes to the existing
code are required anyway, so i suggest we do the following:
- factor out simple inner loops with high optimization potential, such as
   the adder subtract loop, when this is simple enough to do;
- add "restrict" and "int" loop variables in other cases where this helps
   the auto-vectorizer;
- factor out any block operation that can be optimized but resides in the
   BSE core. that's because the core can't be compiled with SSE or similar
   optimizations like the plugins can.

>   Cu... Stefan

---
ciaoTJ