From: gast128 on
Hello all,

This problem may be difficult to explain, and I need some assembly
to show the difference. In a DLL we export some STL containers to
minimize code bloat, like:


template class __declspec(dllexport) std::vector<int>;
typedef std::vector<int> int_vector;


In a simple test program I now see a huge difference in performance.
The C++ function is as follows (it does the same as std::fill, but this
is just an example):


void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
{
    for (size_t n = 0; n != nLoop; ++n)
    {
        const int_vector::iterator itEnd = pVector->end();

        for (int_vector::iterator it = pVector->begin(); it != itEnd; ++it)
        {
            *it = nValue;
        }
    }
}


In the assembly code exception handling has somehow been put in, and
the iterator gets written back to and reloaded from the stack inside
the loop, which is a major performance issue (see '//! <- difference'):


void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
{
00401D30 push 0FFFFFFFFh
00401D32 push offset __ehhandler$?PrfMemoryIterator@@YAXPAV?$vector@HV?$allocator@H@std@@@std@@HI@Z (403718h)
00401D37 mov eax,dword ptr fs:[00000000h]
00401D3D push eax
00401D3E mov dword ptr fs:[0],esp
00401D45 sub esp,4Ch
00401D48 mov eax,dword ptr [___security_cookie (406270h)]
00401D4D xor eax,esp
00401D4F push edi
00401D50 mov edi,ecx

<snip>

for (int_vector::iterator it = pVector->begin(); it != itEnd; ++it)
00401D7D lea ecx,[esp+4]
00401D81 push ecx
00401D82 mov ecx,ebx
00401D84 call dword ptr [__imp_std::vector<int,std::allocator<int> >::begin (404004h)]
00401D8A mov eax,dword ptr [esp+4]
00401D8E cmp eax,dword ptr [esp+8]
00401D92 je PrfMemoryIterator+79h (401DA9h)
{
*it = nValue;
00401D94 mov dword ptr [eax],esi
00401D96 mov eax,dword ptr [esp+4] //! <- difference
00401D9A mov ecx,dword ptr [esp+8] //! <- difference
00401D9E add eax,4
00401DA1 cmp eax,ecx
00401DA3 mov dword ptr [esp+4],eax //! <- difference
00401DA7 jne PrfMemoryIterator+64h (401D94h)


However, if we do not export the STL containers, the generated code is
different:


void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
{
00401F60 sub esp,44h
00401F63 mov eax,dword ptr [___security_cookie (406290h)]
00401F68 xor eax,esp
00401F6A push edi
00401F6B mov edi,ecx

<snip>

for (int_vector::iterator it = pVector->begin(); it != itEnd; ++it)
00401F86 mov eax,dword ptr [ebx+4]
00401F89 cmp eax,ecx
00401F8B je PrfMemoryIterator+39h (401F99h)
00401F8D lea ecx,[ecx]
{
*it = nValue;
00401F90 mov dword ptr [eax],esi
00401F92 add eax,4
00401F95 cmp eax,ecx
00401F97 jne PrfMemoryIterator+30h (401F90h)


I use Visual Studio 2003 here, but I noticed something similar with the
_SECURE_SCL option in Visual Studio 2008, which also makes a difference
from a performance perspective.

Can anyone help? It probably has something to do with exception
handling, but why would that depend on whether the classes are
exported or not?

Thx in advance.
From: Alexander Grigoriev on
Normally, the STL-generated code can get heavily optimized and inlined. But
if you export the code, the no-inline functions will be used.


From: gast128 on
On Mar 12, 4:50 am, "Alexander Grigoriev" <al...(a)earthlink.net> wrote:
> Normally, the STL-generated code can get heavily optimized and inlined. But
> if you export the code, the no-inline functions will be used.

> > 00401D92  je          PrfMemoryIterator+79h (401DA9h)
> >      {
> >         *it = nValue;
> > 00401D94  mov         dword ptr [eax],esi
> > 00401D96  mov         eax,dword ptr [esp+4] //! <- difference
> > 00401D9A  mov         ecx,dword ptr [esp+8] //! <- difference
> > 00401D9E  add         eax,4
> > 00401DA1  cmp         eax,ecx
> > 00401DA3  mov         dword ptr [esp+4],eax //! <- difference
> > 00401DA7  jne         PrfMemoryIterator+64h (401D94h)

Yes, but an optimizer could conclude from the assembly code that it
stores and loads the value of eax again and again at [esp+4].
Even the ecx register gets reloaded every time, without being changed
in the loop. So my conclusion would be that either it is somehow
essential that eax gets written back to [esp+4] in the loop, or
it is a bug. I also do not use the volatile keyword, so
the optimizer is free to use all its power.
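
One possible explanation, offered as an assumption and not confirmed anywhere in this thread: in the exported build begin() is called through the import table and receives the address of the iterator's stack slot (the lea ecx,[esp+4] / push ecx before the call). Once that address has been handed to code the compiler cannot see, it may have to assume that the store *it = nValue could overwrite the iterator itself, so it reloads [esp+4] and [esp+8] after every store. A minimal sketch of a workaround under that assumption is to copy the range into raw pointers whose addresses are never taken. The name PrfMemoryPointer is hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<int> int_vector;

// Hypothetical variant of PrfMemoryIterator. The range is converted to raw
// pointers once per outer pass, so the inner loop only uses locals whose
// addresses are never passed to another function.
void PrfMemoryPointer(int_vector* pVector, int nValue, std::size_t nLoop)
{
    assert(pVector != 0);

    if (pVector->empty())
    {
        return;  // &(*pVector)[0] must not be formed for an empty vector
    }

    for (std::size_t n = 0; n != nLoop; ++n)
    {
        int* p          = &(*pVector)[0];
        int* const pEnd = p + pVector->size();

        for (; p != pEnd; ++p)
        {
            *p = nValue;
        }
    }
}
```

With an exported std::vector<int>, operator[] and size() may still be imported calls, but they run once per outer pass instead of influencing every inner iteration. Whether this actually removes the reloads has to be checked in the generated assembly.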
From: gast128 on
I made 2 changes to the original code:
1) use a const_iterator as the end iterator
2) pulled the iterator declaration out of the for statement

Now the values of the iterators aren't reloaded again and again in
the for loop. I have no idea why; perhaps a compiler specialist could
help here?

void PrfMemoryIterator(int_vector* pVector, int nValue, size_t nLoop)
{
    PRF_FUNCTION();

    for (size_t n = 0; n != nLoop; ++n)
    {
        const int_vector::const_iterator itEnd = pVector->end();
        int_vector::iterator it;

        for (it = pVector->begin(); it != itEnd; ++it)
        {
            *it = nValue;
        }
    }
}
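
To compare variants like this one without reading the assembly each time, a small timing harness can help. The following is only a sketch: TimeFill and FillFn are hypothetical names, PRF_FUNCTION() is left out because its definition is not shown in this thread, and std::clock has coarse resolution, so nLoop needs to be large enough for the result to be meaningful.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <vector>

typedef std::vector<int> int_vector;
typedef void (*FillFn)(int_vector*, int, std::size_t);

// The rewritten loop from above, without the PRF_FUNCTION() macro.
void PrfMemoryIterator(int_vector* pVector, int nValue, std::size_t nLoop)
{
    for (std::size_t n = 0; n != nLoop; ++n)
    {
        const int_vector::const_iterator itEnd = pVector->end();
        int_vector::iterator it;

        for (it = pVector->begin(); it != itEnd; ++it)
        {
            *it = nValue;
        }
    }
}

// Hypothetical helper: runs fn once and returns the CPU time it used,
// in seconds, as measured by std::clock.
double TimeFill(FillFn fn, int_vector* pVector, int nValue, std::size_t nLoop)
{
    assert(fn != 0 && pVector != 0);

    const std::clock_t start = std::clock();
    fn(pVector, nValue, nLoop);
    const std::clock_t stop = std::clock();

    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

Calling TimeFill(&PrfMemoryIterator, &v, 42, nLoop) in both the exported and the non-exported build gives two numbers to compare. The timings depend on compiler, flags and machine, so none are quoted here.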

I also saw another interesting effect (which may or may not be related):
'Inconsistent inlining of C++ class template member functions across
DLLs'
https://connect.microsoft.com/VisualStudio/feedback/details/511979/inconsistent-inlining-of-c-class-template-member-functions-across-dlls