From: Terje Mathisen <terje.mathisen at tmsw.no>
James Van Buskirk wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"> wrote in message
> news:lc1sg7-5id1.ln1(a)ntp.tmsw.no...
>
>> OTOH, afaik it should definitely be possible to plug length=256 and
>> vector=4 into the FFTW synthesizer and get a very big, completely
>> unrolled, minimum-operation-count piece of code out of it.
>
> FFTW is in no way capable of producing minimum operation count
> code. I beat it every time. The only way that their code

Nice!

> generator can catch up to my algorithms is if they look at my code
> and incorporate its new tricks into the set of transformations
> that their code generator tries.

OK, that's good.

By how many percent would your code beat them for the 256 and 2048
element IMDCT?

>
> Surprising that they took the problem to only one coder since my
> understanding of the situation with SSE2 is that there must be many
> coders out there, each of whom knows a trick or two that the others
> don't that can increase performance by a percent or so. Of course
> each coder would probably want to be paid for revealing their
> secrets and it could end up costing a lot for a fairly small gain
> in performance.

I have the great advantage that I don't make a living from my
optimization work, so I don't need to keep any secrets. :-)

My current daytime job is to be chief architect for the complete swap
operation of the largest Norwegian cell network, i.e. pretty far from
SIMD optimization.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Terje Mathisen <terje.mathisen at tmsw.no>
James Van Buskirk wrote:
> "Terje Mathisen"<"terje.mathisen at tmsw.no"> wrote in message
> news:l1esg7-34e1.ln1(a)ntp.tmsw.no...
>
>> By how many percent would your code beat them for the 256 and 2048 element
>> IMDCT?
>
> That would require my writing code for that transform in the first

OK. IMDCT is particularly interesting these days because nearly every
modern lossy audio codec is built around it.
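
For readers who have not met the transform, here is a minimal sketch of the
direct O(N^2) IMDCT definition, next to the matching forward MDCT. It is a
reference for checking results only, not the unrolled SIMD code discussed in
this thread. The function names, the "N coefficients in, 2N samples out"
convention and the 1/N scaling are assumptions made for illustration; real
codecs differ in scaling and add windowing and overlap-add on top.

```python
import math

def imdct_direct(X):
    """Direct O(N^2) IMDCT: N coefficients in, 2N time samples out.

    Assumed convention: y[n] = (1/N) * sum_k X[k] * cos(pi/N * (n + 1/2 + N/2) * (k + 1/2)).
    """
    N = len(X)
    n0 = 0.5 + N / 2.0  # phase offset shared by MDCT and IMDCT
    return [
        sum(X[k] * math.cos(math.pi / N * (n + n0) * (k + 0.5)) for k in range(N)) / N
        for n in range(2 * N)
    ]

def mdct_direct(x):
    """Direct O(N^2) MDCT: 2N time samples in, N coefficients out (unscaled)."""
    N = len(x) // 2
    n0 = 0.5 + N / 2.0
    return [
        sum(x[n] * math.cos(math.pi / N * (n + n0) * (k + 0.5)) for n in range(2 * N))
        for k in range(N)
    ]

if __name__ == "__main__":
    coeffs = [1.0, -2.0, 0.5, 3.0]      # arbitrary illustrative input, N = 4
    samples = imdct_direct(coeffs)       # 2N = 8 output samples
    print(len(samples), mdct_direct(samples))
```

With this scaling, applying mdct_direct to the output of imdct_direct returns
the original coefficients up to rounding, which makes the pair handy as a
correctness check for a fast implementation at sizes such as 256 or 2048.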

> place. In operation counts, perhaps identical because I haven't
> come up with any new algorithm since the one that I published that
> still holds the minimum for power of 2 FFTs (unless someone else
> has beaten me subsequently), and since that algorithm can be
> considered to be built on DCTs...
>
> Looking at their timing numbers for smaller power-of-2 FFTs it seems
> to me that FFTW doesn't utilize the 2-way SIMD capabilities of at
> least a Core 2 Duo effectively. I am pretty much ignorant of the
> style of project you are working on: is it single-precision, double-

Audio codecs do very well with single precision and 4-way SIMD
processing. Almost all of Vorbis is defined to be fp, but it is of
course possible to write a fixed-point implementation. The main problem
is that at one particular point you must, by codec definition, be
prepared to handle 64 bits of dynamic range...

> precision or integer data? I'm really not very interested in this
> stuff any more; I'm trying to make progress in completely different
> projects.
>

Anything I could help out on?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"