From: Bengt Larsson on
"nedbrek" <nedbrek(a)yahoo.com> wrote:

>Hello all,
>
>"Bengt Larsson" <bengtl8.net(a)telia.NOSPAMcom> wrote in message
>news:a9kqi5tp99eana4uoc2r9d0l998gpuu21g(a)4ax.com...
>> Bengt Larsson <bengtl8.net(a)telia.NOSPAMcom> wrote:
>>
>>>I have an Atom, and I tested with a parallel make (of an editor,
>>>mg2a, in C). With all the files in memory, the make takes 14.4
>>>seconds. With make -j (make -j 3 or 4 seems the most efficient) it
>>>takes 10.7 seconds. That is an improvement of 30-35 percent.
>>
>> Actually that is a bit stupid, since it improves beyond 2 threads.
>> With two threads, I get 11.3 seconds, an improvement of 27%.
>
>I usually do a "make -j N", where N = cores * 1.5 or 2. Compiling often
>gets stuck on disk (even if the source is in memory, and the final output
>is in memory [ramdisk?], are all the temporary outputs in memory? what
>about statically linked libs?).

Well, everything is cached in memory, but there are no other special
arrangements. I should have something less disk-intensive, but I don't
have anything handy.
From: nedbrek on
Hello all,

"Bengt Larsson" <bengtl8.net(a)telia.NOSPAMcom> wrote in message
news:kghui59h6ea26452kbhc8b69rpo3tm4rab(a)4ax.com...
> "nedbrek" <nedbrek(a)yahoo.com> wrote:
>
>>"Bengt Larsson" <bengtl8.net(a)telia.NOSPAMcom> wrote in message
>>news:a9kqi5tp99eana4uoc2r9d0l998gpuu21g(a)4ax.com...
>>> Bengt Larsson <bengtl8.net(a)telia.NOSPAMcom> wrote:
>>>
>>>>I have an Atom, and I tested with a parallel make (of an editor,
>>>>mg2a, in C). With all the files in memory, the make takes 14.4
>>>>seconds. With make -j (make -j 3 or 4 seems the most efficient) it
>>>>takes 10.7 seconds. That is an improvement of 30-35 percent.
>>>
>>> Actually that is a bit stupid, since it improves beyond 2 threads.
>>> With two threads, I get 11.3 seconds, an improvement of 27%.
>>
>>I usually do a "make -j N", where N = cores * 1.5 or 2. Compiling often
>>gets stuck on disk (even if the source is in memory, and the final output
>>is in memory [ramdisk?], are all the temporary outputs in memory? what
>>about statically linked libs?).
>
> Well, everything is cached in memory, but there are no other special
> arrangements. I should have something less disk-intensive, but I don't
> have anything handy.

I think that is a pretty good test. Parallel make is one of the few "real
life" types of benchmarks that people actually use. I was just trying to
explain why you'd get more speedup with more than 2 threads.

Ned


From: Bengt Larsson on
"nedbrek" <nedbrek(a)yahoo.com> wrote:

>I think that is a pretty good test. Parallel make is one of the few "real
>life" type of benchmarks that people actually use. I was just trying to
>explain why you'd get more speedup with more than 2 threads.

Exactly. It's easy to make micro-benchmarks. I already made some, so I
can publish:

First: This is an Acer Aspire One, N270 Atom 1600 MHz, running Cygwin
under Windows XP. The compiler is gcc 4.3.2, with options
-march=native -mfpmath=sse -O2.

A simple multiply-add:

int i; double sum;

for (i=0; i<limit; i++) {
    sum = sum*0.5 + 10.0;
}

Single-thread: 318 MFlops, Two threads: 2*314=628 MFlops, Improvement
in throughput from two threads: 97%

----

Unrolled:

for (i=0; i<limit; i++) {
    sum1 = sum1*0.5 + 10.0;
    sum2 = sum2*0.5 + 10.0;
    sum3 = sum3*0.5 + 10.0;
    sum4 = sum4*0.5 + 10.0;
}

Single-thread: 976 MFlops, Two threads: 2*726=1452 MFlops,
Improvement: 49%

----

Unrolled some more (to fill the SSE registers):

for (i=0; i<limit; i++) {
    sum1 = sum1*0.5 + 10.0;
    sum2 = sum2*0.5 + 10.0;
    sum3 = sum3*0.5 + 10.0;
    sum4 = sum4*0.5 + 10.0;
    sum5 = sum5*0.5 + 10.0;
    sum6 = sum6*0.5 + 10.0;
}

Single-thread: 1118 MFlops, Two threads: 2*793=1586 MFlops,
Improvement: 42%

With two threads, this is quite close to 1600 MFlops, which would be
the maximum: the Atom can issue a double-precision floating-point
multiply only every two cycles. The adds either dual-issue with the
multiplies or issue in the cycles between them.

----

Redo the last benchmark in single precision:

Single-thread: 1888 MFlops, Two threads: 2*1052=2104 MFlops,
Improvement: 11.4%

The Atom can issue a single-precision fp multiply every cycle, so that
limit goes away. This achieves more than 1 Flop/cycle in a single
thread. In two threads, it's 1.3 Flops/cycle.

----

Conclusion: if you use SSE, unless the code is extremely well
scheduled, you gain quite a lot from the second thread.
From: Bengt Larsson on
And some more. Classic FP Math too:

Scalar SSE FP math (MFlops, one thread / two threads):
 318    2*314   simple loop
 976    2*726   unrolled by 4
1118    2*793   unrolled by 6
1888    2*1052  unrolled by 6, single precision

Classic FP (x87):
 318    2*314   simple loop
 309    2*277   unrolled by 4
 268    2*224   unrolled by 6
 268    2*224   unrolled by 6, single precision

Classic FP math doesn't like unrolling. I assume this is especially
bad on an in-order processor.
From: nmm1 on
In article <f0bvi5lpsqpp9s7otcivtga3sqn43tknbu(a)4ax.com>,
Bengt Larsson <bengtl8.net(a)telia.NOSPAMcom> wrote:
>
>Conclusion: if you use SSE, unless the code is extremely well
>scheduled, you gain quite a lot from the second thread.

Sorry, but no. Your testing is fine, but that conclusion does
not follow.

Even micro-benchmarks should bear some relationship to what real
code does. The days when testing the floating-point performance
alone indicated anything useful are long gone. Only the older of
us now remember when Whetstones were a useful comparison of
relative performance ....

Experience with most forms of threading, especially SMT, is that
whether it helps or not depends on memory accesses and not actual
calculation.


Regards,
Nick Maclaren.