From: Kenneth 'Bessarion' Boyd on
On Dec 7, 1:55 pm, Goran Pusic <gor...(a)cse-semaphore.com> wrote:
> On Dec 6, 1:07 am, Le Chaud Lapin <jaibudu...(a)gmail.com> wrote:
> > Certain x86 memory movement instructions are much faster than calls to
> > memcpy, which simply employs those same instructions internally along
> > with unnecessary overhead.

Back when I actually tested this (Win16, ~1996), it was the DLL-imported
memmove that behaved as if it were using those instructions (and even
then the DLL import overhead was barely measurable). The DLL-imported
version of memcpy was about 1.5x slower than memmove.

So I changed all of my Windows-targeted code to use memmove throughout,
rather than mess with assembly programming. Not having to think about
assembly language also makes porting to *NIX much easier.

> (I find it hard to believe that your concern is code size, it's about
> speed, right? If so...)
>
> Are you sure about that overhead? I just made the smallest possible
> memmove function I could think of (cld, init esi/edi/ecx, rep movsd).

That reads like memcpy to me: memmove has to be safe when the source
and destination blocks overlap, while memcpy doesn't. When I checked,
the speed difference between a hand-written assembly memcpy and a
hand-written assembly memmove on Intel wasn't measurable.
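
Roughly, the distinction is just the direction of the copy. A minimal
sketch (byte-at-a-time, no attempt at the word-sized tricks a real
implementation uses):

#include <cstddef>

// memcpy-style: may assume the blocks don't overlap, so forward is fine.
void* copy_no_overlap(void* dst, const void* src, std::size_t n)
{
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    for (std::size_t i = 0; i < n; ++i)
        d[i] = s[i];
    return dst;
}

// memmove-style: must pick a direction that stays safe under overlap.
void* copy_overlap_safe(void* dst, const void* src, std::size_t n)
{
    char* d = static_cast<char*>(dst);
    const char* s = static_cast<const char*>(src);
    if (d < s) {
        for (std::size_t i = 0; i < n; ++i)      // dst before src: copy forward
            d[i] = s[i];
    } else {
        for (std::size_t i = n; i != 0; --i)     // dst after src: copy backward
            d[i - 1] = s[i - 1];
    }
    return dst;
}

The extra pointer comparison is essentially all memmove pays for, which
fits the "not measurable" result.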


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Goran Pusic on
On Dec 9, 1:41 am, Le Chaud Lapin <jaibudu...(a)gmail.com> wrote:
> Having written quite a bit of x86 assembly in Ye Olden Days, I find it
> hard to believe that the difference is "statistically irrelevant"
> between a movs and the 200+ instructions in the full version of memcpy,
> at least 95 of which get executed for a stock operator=. That's
> excluding stack manipulation and function calls.
>

Did you try to code it faster? It's been a couple of days since this
started ;-).

It's clearly a question of data size. If the block is big enough,
memcpy vs. asm won't matter, because the time needed for rep movsd
(which is what the MS CRT memcpy uses) will swamp everything else.
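
For concreteness, here is roughly the kind of copy I mean, as a sketch
assuming 32-bit MSVC inline assembly and a byte count that is a
multiple of 4 (illustration only, not a drop-in memcpy replacement):

void rep_movsd_copy(void* dst, const void* src, unsigned bytes)
{
    __asm {
        mov   esi, src      ; source pointer
        mov   edi, dst      ; destination pointer
        mov   ecx, bytes    ; byte count...
        shr   ecx, 2        ; ...turned into a dword count
        cld                 ; clear direction flag: copy forward
        rep   movsd         ; copy ecx dwords
    }
}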

But, now that you have voiced your disbelief of my utterly opaque, yet
highly scientific test ;-), I thought perhaps my struct size was too
big (~22K). I tried a smaller one, ~12K. Nope, still the same. (Stock
PC, 32-bit code on 64-bit Windows.) ~8K, same. (Again, inline or not,
it doesn't matter.)

And finally, I saw a strange thing as I approached 4K: suddenly, the
stock operator= (which __is__ memcpy) and a manual memcpy became faster
than my asm! That's something I didn't expect. It must be related to
some hardware effect memcpy knows about and I don't. (Either that, or
there's a flaw in my test.)
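
If someone wants to repeat it, the shape of the test was roughly this
(a sketch only: the struct size and iteration count are placeholders I
varied, clock() is a crude timer, and a real run needs care to keep the
compiler from optimizing the copies away):

#include <cstdio>
#include <cstring>
#include <ctime>

struct Blob { char data[8 * 1024]; };      // placeholder size; this is what I varied

static Blob a, b;
volatile char sink;                        // crude sink so the copies aren't dead code

int main()
{
    const int iterations = 100000;         // placeholder count

    std::clock_t t0 = std::clock();
    for (int i = 0; i < iterations; ++i) {
        b = a;                             // stock operator= on a POD struct
        sink = b.data[i % sizeof b.data];
    }
    std::clock_t t1 = std::clock();
    for (int i = 0; i < iterations; ++i) {
        std::memcpy(&b, &a, sizeof b);     // explicit memcpy for comparison
        sink = b.data[i % sizeof b.data];
    }
    std::clock_t t2 = std::clock();

    std::printf("operator= : %ld ticks\n", static_cast<long>(t1 - t0));
    std::printf("memcpy    : %ld ticks\n", static_cast<long>(t2 - t1));
    return 0;
}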

But now that I have gone through all the moves, I am convinced: you
should leave this optimization idea aside. It's __false__. Try applying
the first rule of optimization:

1. make code faster by changing the design to eliminate hotspots
(preceded by: finding the hotspots)

Goran.


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]