|
From: Nick Maclaren on 30 Mar 2008 05:20 In article <4d98c7c6-9019-4d59-af03-82af04489c67(a)f63g2000hsf.googlegroups.com>, "Paul A. Clayton" <paaronclayton(a)earthlink.net> writes: | |> > That's why I have liked the idea of switching threads on a cache miss |> > for a good many years now - given that memory latency is THE problem, |> > a solution that starts by assuming that seems good to me. |> |> Memory latency is the MAIN problem, but unpredictable control flow |> changes are not entirely trivial in some applications. Even data |> dependencies might be significant limitations. Agreed. I didn't mean "the only" but "the dominant". See my other response as to why SMT doesn't usually help much. |> Providing a very wide processor with a 10% larger core to support SMT |> can make as much or more sense than providing a non-SMT version that |> uses the 10% space for another much simpler core. (Sure it would |> generate better throughput at lower power to use only simpler cores; |> but it seems that ILP drives a lot of processor design.) That is actually the key - and the target is not technical. SMT is a compromise between producing a pure benchmarketing SPEC engine, which would have a dire performance in most servers, and one that would look bad but perform well. As such, it is a good technology. But the claims that it actually solves the technical problems are not backed up by evidence. Regards, Nick Maclaren.
From: Nick Maclaren on 30 Mar 2008 05:25 In article <1ielzej.18dmntm1moby6hN%nospam(a)ab-katrinedal.dk>, nospam(a)ab-katrinedal.dk (=?ISO-8859-1?Q?Niels_J=F8rgen_Kruse?=) writes: |> John Dallman <jgd(a)cix.co.uk> wrote: |> |> > Not exactly. A kind of mathematical modelling that has substantial |> > elements of goal-seeking. It tends to search memory in a good simulation |> > of a random pattern for a few micro-seconds, do some moderately serious |> > FP crunching for a vaguely similar length of time, albeit in algorithms |> > rather more complex than a typical HPC inner loop, and then go |> > memory-searching again. It repeats this until it gets an answer - |> > there's no real way to predict how long this will take - and then |> > returners to its caller It bashes memory hardest, FP next and branch |> > prediction third. |> |> Smells like the problem is that you modify the datastructures that you |> search, in order to speed up searches? |> |> You could have 2 versions of the search, one that improve the data |> structures and one that don't. Using the former a percentage of the time |> would be enough to evolve the data structures. Eh? I understood John Dallman to mean that the data structures are essentially read-only, but are being accessed in a most cache-unfriendly way. That is the normal characteristic of such programs - and they are not rare. As I have posted before, there are some fairly basic algorithms that are provably of that form - i.e. mathematically unoptimisable. The only real hope for a considerable speedup in those cases is to solve the problem using an entirely different mathematical model - which is quite often possible, but needs a VERY high calibre of person to do the transformation! Regards, Nick Maclaren.
From: Nick Maclaren on 30 Mar 2008 05:51 In article <fsnlii$p10$1(a)gemini.csx.cam.ac.uk>, nmm1(a)cus.cam.ac.uk (Nick Maclaren) writes: |> In article <52acdbf3-c3ff-4de0-ae56-ed0db2381930(a)59g2000hsb.googlegroups.com>, |> "Paul A. Clayton" <paaronclayton(a)earthlink.net> writes: |> |> |> If the code is heavy |> |> with unpredictable control flow changes, then adding a thread |> |> (whether SMT or FineGrainedMT) will effectively half the branch |> |> misprediction penalty. (FGMT doesn't help much [any?] with data |> |> dependencies unless they are multicycle.) |> |> That is not true - analyse it more carefully. It would be true only |> if no work were needed to recover from a misprediction. But, as |> critical parts of the CPU are typically working flat-out to recover, |> it doesn't help. Sorry - I was being misleading. I should have said "it doesn't effectively halve the branch misprediction penalty". Exactly what effect it will have will depend on a lot of details - starting with the question of how many instruction decode paths there are and how independent they are of the actual execution! What I was thinking of was the comparison between SMT and two independent cores, with the same chip constraints (be they area, transistor count, watts or whatever). And I have seen no decent analysis of that, though I should be interested to see one. Regards, Nick Maclaren.
From: Terje Mathisen on 30 Mar 2008 07:10 Nick Maclaren wrote: > Eh? I understood John Dallman to mean that the data structures are > essentially read-only, but are being accessed in a most cache-unfriendly > way. That is the normal characteristic of such programs - and they are > not rare. I believe the latest HPC oil codes are moving in this direction. > > As I have posted before, there are some fairly basic algorithms that > are provably of that form - i.e. mathematically unoptimisable. The > only real hope for a considerable speedup in those cases is to solve > the problem using an entirely different mathematical model - which is > quite often possible, but needs a VERY high calibre of person to do > the transformation! The big hump is possibly that the parallel friendly algorithms might start out with 10% of the performance, requiring 10 times the hw resources just to recover the initial algorithmic overhead, before you get any gains at all. Terje -- - <Terje.Mathisen(a)hda.hydro.com> "almost all programming can be viewed as an exercise in caching"
From: Nick Maclaren on 30 Mar 2008 07:44
In article <Uu6dneNs3YKI6HLanZ2dnUVZ_oSunZ2d(a)giganews.com>, Terje Mathisen <terje.mathisen(a)hda.hydro.com> writes: |> |> > Eh? I understood John Dallman to mean that the data structures are |> > essentially read-only, but are being accessed in a most cache-unfriendly |> > way. That is the normal characteristic of such programs - and they are |> > not rare. |> |> I believe the latest HPC oil codes are moving in this direction. I didn't know that, but it doesn't surprise me. |> > As I have posted before, there are some fairly basic algorithms that |> > are provably of that form - i.e. mathematically unoptimisable. The |> > only real hope for a considerable speedup in those cases is to solve |> > the problem using an entirely different mathematical model - which is |> > quite often possible, but needs a VERY high calibre of person to do |> > the transformation! |> |> The big hump is possibly that the parallel friendly algorithms might |> start out with 10% of the performance, requiring 10 times the hw |> resources just to recover the initial algorithmic overhead, before you |> get any gains at all. That is a separate matter, but is certainly true :-( Exercise for the reader: work out a parallel-friendly 3-D Fourier transform algorithm, and analyse its performance with respect to a parallel implementation of a conventional 3-D FFT. It is actually very easy to do, and most instructive - if not useful! Regards, Nick Maclaren. |