From: Joseph M. Newcomer on
It would not be the first time Intel delivered a chip with lower performance at an
equivalent clock speed. Also, depending on the application, yours may not be a good
match for the particular caching strategies in use. Caches are not just caches.

Note that the Pentium III was notorious for having performance inferior in every way
to the Pentium II and Pentium Pro.

Factors for performance include:
Number of ALUs
Cache size
Cache replacement algorithm
TLB size
TLB replacement algorithm
Prefetch pipe depth
Write pipe depth
Microinstruction pipe depth
Front Side Bus speed
Memory architecture
Memory width
Working set size
Paging policies
Available memory for programs

And those are just the items I can think of off the top of my head. Such observations as
you make are dismaying, to say the least, and seriously disappointing, but you have
essentially assumed that both machines are identical in most ways. And in the single most
critical parameter, total physical memory, the slower machine has half the memory of the
faster machine. Sounds like paging to me.

Sometimes, you can have a program whose access patterns work well with one caching
strategy and work absolutely against another caching strategy. Things like the "stride"
of accesses become factors. A program that hits the caches "wrong" relative to its
replacement algorithm can give an order of magnitude degradation. A factor of 2 is well
within this variance.
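
To make the stride point concrete, here is a minimal sketch in plain C++ (nothing
MFC-specific; the 4096x4096 matrix size is an arbitrary illustration, chosen only to
be much larger than any cache). The two loops do identical arithmetic over the same
data, but the column-major walk touches a new cache line on almost every access:

    // stride_demo.cpp -- illustrative sketch of access stride vs. the cache.
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 4096;  // 4096 x 4096 ints = 64 MB, far larger than any cache
        std::vector<int> m(static_cast<size_t>(N) * N, 1);

        auto time_sum = [&](bool rowMajor) {
            auto t0 = std::chrono::steady_clock::now();
            long long sum = 0;
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j)
                    sum += rowMajor
                        ? m[static_cast<size_t>(i) * N + j]   // stride 1: uses every byte of each line
                        : m[static_cast<size_t>(j) * N + i];  // stride N: a new line almost every access
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - t0).count();
            std::printf("%s: sum=%lld, %lld ms\n",
                        rowMajor ? "row-major   " : "column-major", sum,
                        static_cast<long long>(ms));
        };

        time_sum(true);   // cache-friendly order
        time_sum(false);  // same work, cache-hostile order; often several times slower
        return 0;
    }

The code is the same either way; only the order of the memory references changes.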

Note that "faster access to RAM" is only one of the parameters in the above list. A
faster FSB doesn't necessarily translate to faster memory access for a particular
algorithm. That's because "memory access" time based on raw memory speed is NOT the
operative parameter for algorithm performance; in fact, it is one of the least important
parameters. Cache hit ratio is critical.

Cache behavior can reduce your performance by an order of magnitude if you hit the wrong
patterns.

Back in the days when real people (not teams of thousands) designed caches, a friend of
mine was designing the cache for a high-performance personal workstation. "I'm trying to
increase the cache hit ratio from 97% to 98%," he told me. When I asked what only a 1%
improvement would buy, he said, "A 1% improvement in cache hits means a 30% improvement
in program execution." I think this was in about 1979, when caches were still a pretty
new cool idea (memory had finally gotten cheap enough that we could actually consider
building multilevel memory hierarchies).
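
The arithmetic behind that remark is easier to see from the miss side: going from a
97% to a 98% hit ratio removes a third of the misses, and the misses dominate the
cost. Here is a back-of-the-envelope average-memory-access-time calculation; the
1-cycle hit and 40-cycle miss costs are assumptions for illustration, not figures
from any particular machine:

    // amat_sketch.cpp -- back-of-the-envelope AMAT for two hit ratios.
    #include <cstdio>

    int main() {
        const double hitCost = 1.0, missCost = 40.0;  // illustrative assumptions
        for (double hitRatio : {0.97, 0.98}) {
            double amat = hitRatio * hitCost + (1.0 - hitRatio) * missCost;
            std::printf("hit ratio %.2f -> average access %.2f cycles\n",
                        hitRatio, amat);
        }
        // Prints 2.17 and 1.78 cycles: about an 18% improvement here, and the
        // gain approaches 33% as the miss penalty grows relative to a hit.
        return 0;
    }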

If you have bad paging behavior, you should consider yourself fortunate that you have ONLY
a factor of 2 degradation. Paging can reduce your performance by ORDERS (not just ORDER)
of magnitude.

Note that you have not indicated whether you are using the same OS. You have not
reported the number of page faults your program took during the measurement interval
(trivially available from Task Manager). You have not measured the actual executable
code performance of key subroutines, or reported performance figures (available from
kernel APIs) around those key algorithms.
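
If you want numbers rather than eyeballing Task Manager, the documented Win32 calls
below will bracket a routine with page-fault and wall-clock readings. This is a
minimal sketch; RunKeyAlgorithm() is a hypothetical stand-in for whatever subroutine
you are actually measuring:

    // faultcount_sketch.cpp -- page faults and elapsed time around one routine.
    #include <windows.h>
    #include <psapi.h>   // GetProcessMemoryInfo; link with psapi.lib
    #include <cstdio>

    static void RunKeyAlgorithm() { /* placeholder for the code under test */ }

    int main() {
        PROCESS_MEMORY_COUNTERS before = { sizeof(before) };
        PROCESS_MEMORY_COUNTERS after  = { sizeof(after) };
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);

        GetProcessMemoryInfo(GetCurrentProcess(), &before, sizeof(before));
        QueryPerformanceCounter(&t0);
        RunKeyAlgorithm();
        QueryPerformanceCounter(&t1);
        GetProcessMemoryInfo(GetCurrentProcess(), &after, sizeof(after));

        std::printf("elapsed %.3f s, page faults %lu, peak working set %lu KB\n",
                    double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart),
                    after.PageFaultCount - before.PageFaultCount,
                    static_cast<unsigned long>(after.PeakWorkingSetSize / 1024));
        return 0;
    }

A large page-fault delta on one machine and not the other is the paging signature.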

Note that the same OS could have different policies on the two machines. Working set
configuration for the account would have a profound impact on overall performance.
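
As one concrete example of such a policy, the working-set bounds of a process can be
read (and, quota permitting, raised) with documented Win32 calls. A minimal sketch;
the 64/128 MB figures are arbitrary illustrations, and raising the minimum can fail
without the appropriate quota or privilege:

    // workingset_sketch.cpp -- query and adjust this process's working-set bounds.
    #include <windows.h>
    #include <cstdio>

    int main() {
        SIZE_T minWs = 0, maxWs = 0;
        GetProcessWorkingSetSize(GetCurrentProcess(), &minWs, &maxWs);
        std::printf("working set bounds: min %lu KB, max %lu KB\n",
                    static_cast<unsigned long>(minWs / 1024),
                    static_cast<unsigned long>(maxWs / 1024));

        // Illustrative values only; failure usually means insufficient quota.
        if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                      64 * 1024 * 1024, 128 * 1024 * 1024))
            std::printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());
        return 0;
    }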

There are so many variables involved here that a superficial measurement of front-to-back
execution with no instrumentation effort is meaningless.

You have to make some effort to come up with some of the "why" yourself. You have
presented essentially zero useful information that someone trying to answer this question
could use to say anything meaningful. All we know is that one machine is a Core i5 and
one is a Celeron, plus some raw memory bus timing (largely irrelevant) and disk
performance (relevant only if the disk is involved in the problem); we know nothing
about the operating system in use or the dozens of performance tuning parameters that
exist in the user policies. You
observe that your program is memory-intensive, and the slower machine has half the memory
of the faster machine, which almost immediately screams "paging". If they do not have the
same size memory, comparisons are not going to be particularly meaningful.

I'd look at paging performance first, cache organization second. Those are probably the
two most useful parameters to study at this point. If paging performance differs, you
need to think about either a memory upgrade or looking into paging tuning parameters.
joe


On Fri, 22 Jan 2010 21:29:55 -0600, "Peter Olcott" <NoSpam(a)SeeScreen.com> wrote:

>It is very memory intensive, thus much faster memory and much
>larger cache should make it faster and not slower. It is
>also single threaded and no floating point is used at all. I
>bought the Core i5 specifically because it has faster access
>to RAM.
>
>"Alexander Grigoriev" <alegr(a)earthlink.net> wrote in message
>news:%23BWgrp9mKHA.1548(a)TK2MSFTNGP02.phx.gbl...
>> Is the application multithreaded? Is it floating-point
>> intensive, memory-intensive, and what else?
>>
>> "Peter Olcott" <NoSpam(a)SeeScreen.com> wrote in message
>> news:6PydnYAu9qOHgcfWnZ2dnUVZ_v2dnZ2d(a)giganews.com...
>>>I recently upgraded my computer hardware from a 2.4 GHz
>>>Celeron to a 2.66 GHz Core i5, and an MFC application that
>>>I developed runs only half as fast on the faster machine.
>>>What could be causing this?
>>>
>>> Both machines have identical SATA hard drives, and the
>>> fast machine has much faster RAM: 4.0 GB of DDR3-1333.
>>> The slower machine has 2.0 GB of DDR-333 RAM. Why is the
>>> slower machine twice as fast on the same executable?
>>>
>>
>>
>
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: Tom Serface on
I'd be suspicious that something you're doing is causing it to swap memory
to disk. You could use Task Manager to check what's happening with physical
memory. I'm not sure why that would be the case since you have more memory
on the new machine, but maybe something else is running that's not on the
other machine that could be affecting memory usage or disk speed. Also, is
the program accessing the network at all? Maybe there is a problem with
your network setup.

Tom

"Peter Olcott" <NoSpam(a)SeeScreen.com> wrote in message
news:6PydnYAu9qOHgcfWnZ2dnUVZ_v2dnZ2d(a)giganews.com...
> I recently upgraded my computer hardware from a 2.4 GHz Celeron to a 2.66
> GHz Core i5, and an MFC application that I developed runs only half as fast
> on the faster machine. What could be causing this?
>
> Both machines have identical SATA hard drives, and the fast machine has
> much faster RAM: 4.0 GB of DDR3-1333. The slower machine has 2.0 GB of
> DDR-333 RAM. Why is the slower machine twice as fast on the same executable?
>
From: Woody on
I would look at two aspects of your app:

1) Use SysInternals' VMMap to show memory usage.
2) Use AMD's CodeAnalyst to profile the execution.

By comparing results on the two systems, you may be able to see where
the difference is.

If you are doing disk writes, you should be sure both systems are set
the same in regard to delayed writes. This could cause a drastic
difference in performance, even with identical hardware.
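
If you would rather take the drive's write-cache policy out of the
comparison in code, one option is to open the test file write-through.
This is only a sketch ("test.bin" is an example name), and write-through
I/O is slower by design, but it costs the same on both machines:

    // writethrough_sketch.cpp -- force writes past the delayed-write cache.
    #include <windows.h>
    #include <cstdio>

    int main() {
        // FILE_FLAG_WRITE_THROUGH makes WriteFile push data toward the
        // media before returning, so delayed-write settings stop mattering.
        HANDLE h = CreateFileA("test.bin", GENERIC_WRITE, 0, nullptr,
                               CREATE_ALWAYS, FILE_FLAG_WRITE_THROUGH, nullptr);
        if (h == INVALID_HANDLE_VALUE) {
            std::printf("CreateFile failed: %lu\n", GetLastError());
            return 1;
        }
        char buf[4096] = {};
        DWORD written = 0;
        WriteFile(h, buf, sizeof(buf), &written, nullptr);
        CloseHandle(h);
        return 0;
    }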

You can also use Task Manager to check whether the app is doing disk
activity, such as loading and unloading DLLs, differently on the two
systems. While you're in Task Manager, look at the task's priority.
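
If you'd rather check the priority programmatically than in Task Manager,
it is one call (a trivial sketch using the documented API):

    // priority_sketch.cpp -- report this process's priority class.
    #include <windows.h>
    #include <cstdio>

    int main() {
        DWORD pc = GetPriorityClass(GetCurrentProcess());
        std::printf("priority class: 0x%lx (NORMAL_PRIORITY_CLASS is 0x%lx)\n",
                    pc, static_cast<DWORD>(NORMAL_PRIORITY_CLASS));
        return 0;
    }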
From: Peter Olcott on

"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote in
message news:5s0ll5hlimpptv63bkegcfbig3au6gpc6g(a)4ax.com...
> It would not be the first time Intel delivered a chip with lower
> performance at an equivalent clock speed. Also, depending on the
> application, yours may not be a good match for the particular caching
> strategies in use. Caches are not just caches.
>
> Note that the Pentium III was notorious for having performance inferior
> in every way to the Pentium II and Pentium Pro.
>
> Factors for performance include:
> Number of ALUs
> Cache size
> Cache replacement algorithm
> TLB size
> TLB replacement algorithm
> Prefetch pipe depth
> Write pipe depth
> Microinstruction pipe depth
> Front Side Bus speed
> Memory architecture
> Memory width
> Working set size
> Paging policies
> Available memory for programs
>
> And those are just the items I can think of off the top of my head. Such
> observations as you make are dismaying, to say the least, and seriously
> disappointing, but you have essentially assumed that both machines are
> identical in most ways. And in the single most critical parameter, total
> physical memory, the slower machine has half the memory of the faster
> machine. Sounds like paging to me.

The machine that runs the program slower actually has twice the memory and
the DDR3-1333 RAM; the machine that runs it faster has the DDR-333. The
slower machine is the 2.66 GHz quad-core Core i5, and the faster machine
is the 2.4 GHz Celeron. The process uses a single thread.

> [...]

I found out last night that the difference is related to
video card settings. I was able to make the faster machine
much faster than the slower machine by setting the NVIDIA
9800 GTX to maximize performance over quality. This setting
has now stopped working.

> [...]


From: Alexander Grigoriev on

"Peter Olcott" <NoSpam(a)SeeScreen.com> wrote in message
news:4OKdnQ_BNJzFjMbWnZ2dnUVZ_tWdnZ2d(a)giganews.com...
>
>
> I found out last night that the difference is related to video card
> settings. I was able to make the faster machine much faster than the
> slower machine by setting the NVIDIA 9800 GTX to maximize performance over
> quality. This setting has now stopped working.
>

Boot in VGA-only mode (hit F8) and see if the performance gets better. Maybe
it's the video adapter+driver that is causing excessive interrupts or
unnecessary bus accesses. Does your program use the video card for
high-volume operations?

