From: Chris Morley
>> movl 8(%ebp), %edx
>> movl 12(%ebp), %ecx
>> movb 16(%ebp), %al
>> movl %edx, %edi
>> rep stosb
>>sets up the byte pattern in AL, the destination address in EDX and the
>>count in ECX. "rep stosb" triggers the loop which implements:
>> while (--ECX >= 0)
>> *EDX++ = AL;
> Whoops, "EDX" should have been "EDI" in the above description.
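With that correction, the loop `rep stosb` implements can be sketched in plain C++ as a naive byte-fill (illustrative only, not the library implementation; note the count is tested before each store, so a zero count stores nothing):

```cpp
#include <cstddef>

// Naive byte-fill mirroring "rep stosb":
// while (ECX-- != 0) *EDI++ = AL;
void byte_fill(unsigned char* edi, unsigned char al, std::size_t ecx) {
    while (ecx-- != 0)
        *edi++ = al;
}
```

This is the baseline a hand-optimised loop has to beat.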

Going back to the OP's question about whether you can do better with a hand
loop: typically, yes. There are optimisations which can be made in C/C++
source (or assembler) which involve better use of bus width & cache. Some
are general; others are processor/platform specific & involved.

e.g. on the 386DX(!) it was significantly quicker to use movsd/stosd vs
movsb/stosb as you push 32 bits per access (still true now!)
e.g. unrolling movsd vs rep movsd
e.g. on the Pentium it was worth pushing doubles around (regardless of actual
data type) to move 64 bits (extend for MMX, 3DNow!, then SSE(n) etc.)
e.g. Cache block prefetching
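The first point, wide stores instead of byte stores, is easy to sketch in portable C++ (a simplified illustration; it assumes the destination is 4-byte aligned and the length is a multiple of 4, where a real implementation would handle head/tail bytes separately):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the movsd/stosd idea: replicate the byte into a
// 32-bit word and store one dword per iteration instead of one byte.
// Assumes dst is 4-byte aligned and len is a multiple of 4.
void fill32(std::uint32_t* dst, unsigned char byte, std::size_t len) {
    std::uint32_t pattern = 0x01010101u * byte;  // byte replicated 4x
    for (std::size_t i = 0; i < len / 4; ++i)
        dst[i] = pattern;
}
```

Quadrupling the data moved per store is exactly why stosd beat stosb on the 386DX.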

This is worth a read; while it's from 2002 & AMD specific, it still has
relevance (e.g. page 174+):

Also mentioned in post:

There are example memcpy implementations... the p75 version gets bandwidth of
~1630 Mbytes/sec vs. ~570 Mbytes/sec for the p67 rep movsb version. There is
also a memset using movntq, for example for blocks >512 bytes. You will
however need intrinsics/assembler to do this, which stops being "C++" quite
quickly!!
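To give a flavour of the non-temporal store approach: the sketch below uses the SSE2 intrinsic `_mm_stream_si128` (movntdq, a 128-bit relative of the movntq the guide discusses) to bypass the cache on large fills. It's x86-only and assumes a 16-byte-aligned destination with a length that is a multiple of 16:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Sketch of a non-temporal memset for large blocks. Streaming
// stores bypass the cache, so they avoid polluting it when the
// filled data won't be read again soon.
// Assumes dst is 16-byte aligned and len is a multiple of 16.
void stream_set(void* dst, unsigned char byte, std::size_t len) {
    __m128i v = _mm_set1_epi8(static_cast<char>(byte));
    __m128i* p = static_cast<__m128i*>(dst);
    for (std::size_t i = 0; i < len / 16; ++i)
        _mm_stream_si128(p + i, v);  // non-temporal 16-byte store
    _mm_sfence();  // make the streaming stores globally visible
}
```

For small blocks this is a pessimisation (the stores skip the cache you're about to read from), which is why the guide reserves it for blocks over a size threshold.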

Before people invoke 'portable': if you are targeting a specific platform you
can optimise for that platform and still default to something else for other
platforms.

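That pattern, optimise for the target and fall back elsewhere, might look like this (names and structure are illustrative; the wide path here uses portable 64-bit stores rather than intrinsics, so the fallback always compiles):

```cpp
#include <cstring>
#include <cstddef>
#include <cstdint>

// Illustrative dispatch: a target-specific wide-store path on x86-64
// (the "push doubles around" idea), with std::memset as the portable
// default and for the unaligned tail.
void fill_dispatch(unsigned char* dst, unsigned char byte, std::size_t len) {
#if defined(__x86_64__) || defined(_M_X64)
    std::uint64_t pattern = 0x0101010101010101ull * byte;  // byte x8
    while (len >= 8) {
        std::memcpy(dst, &pattern, 8);  // unaligned-safe 64-bit store
        dst += 8;
        len -= 8;
    }
#endif
    std::memset(dst, byte, len);  // portable default / tail bytes
}
```

On non-x86 builds the preprocessor drops the wide path entirely and you get plain std::memset.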
You _can_ beat the memxxx libraries & std:: functions if you want/need to,
but it's probably not worth the time/effort unless you are actually bandwidth
limited. You would also sacrifice the generality & safety of the std:: funcs,
which other posters point out.


[ comp.lang.c++.moderated. First time posters: Do this! ]