From: "Andy "Krazy" Glew" on
Terje Mathisen wrote:
> Andy "Krazy" Glew wrote:

> Andy, you really owe it to yourself to take a hard look at h264 and
> CABAC: In approximately the same timeframe as DES was replaced with AES,
> with a stated requirement of being easy to make fast/efficient on a
> PentiumPro cpu, the MPEG working groups decided that a "Context Adaptive
> Binary Arithmetic Coder" was the best choice for a video codec.
>
> CABAC requires 3 or 4 branches for every single _bit_ decoded, and the
> last of these branches depends on the value of that decoded bit.
>
> Until you've made that branch you don't even know which context to apply
> when decoding the next bit!
>
> (I have figured out workarounds (either branchless code or making them
> predictable) for most of those inline branches in the bit decoder, but
> that last context branch is unavoidable.)
>
> The only possible skip ahead is a really big one: You have to locate the
> next key frame and start another core/thread, but this approach is of
> extremely limited value if you are in a realtime situation, e.g. video
> conferencing.
>
> Terje


I have looked at CABAC, and you are right, it seems to be the branch
equivalent of chasing down a hash chain.

It is also the equivalent of good online compression (well, duh) and
encryption, where every bit depends on all previous bits (and possibly
some or all future bits).
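
To make the serial dependency concrete, here is a toy context-adaptive
binary arithmetic decoder. This is a minimal sketch, not bit-exact
H.264 CABAC - the probability update, the renormalization, and the
input bytes are all made up - but it shows the branch structure Terje
describes, including the final dependency, where the value of the
decoded bit selects the context for the next one:

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t p_one;                 /* P(bit == 1), in 1/65536 units */
} Context;

static const uint8_t stream[] = {   /* made-up input bytes */
    0xB5, 0x3C, 0xA7, 0x01, 0x5E, 0x92, 0x00, 0x00
};
static size_t pos;

static uint8_t next_byte(void)      /* returns 0 past end of input */
{
    return pos < sizeof stream ? stream[pos++] : 0;
}

static uint32_t range = 0xFFFFFFFFu, code;

static int decode_bit(Context *ctx)
{
    uint32_t split = (uint32_t)(((uint64_t)range * ctx->p_one) >> 16);
    int bit;

    if (code < split) {             /* branch 1: which symbol was coded? */
        bit = 1;
        range = split;
        ctx->p_one += (65536 - ctx->p_one) >> 5;   /* adapt toward 1 */
    } else {
        bit = 0;
        code -= split;
        range -= split;
        ctx->p_one -= ctx->p_one >> 5;             /* adapt toward 0 */
    }
    while (range < (1u << 24)) {    /* branch 2: renormalization loop */
        range <<= 8;
        code = (code << 8) | next_byte();
    }
    return bit;
}

int main(void)
{
    Context ctx[2] = { { 32768 }, { 32768 } };  /* one context per prev bit */
    int prev = 0;

    code = ((uint32_t)next_byte() << 24) | ((uint32_t)next_byte() << 16)
         | ((uint32_t)next_byte() <<  8) |  (uint32_t)next_byte();

    for (int i = 0; i < 16; i++) {
        int bit = decode_bit(&ctx[prev]);  /* the context for the next */
        prev = bit;                        /* bit is chosen by the bit */
        printf("%d", bit);                 /* we just decoded          */
    }
    printf("\n");
    return 0;
}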

And if you can't skip to the next independent chunk of work - if there
is no independent work to skip to - you are screwed. You have to make
the dependent stuff run faster. Or do nothing at all. You make the
dependent stuff run faster by architectures that make sequential code
run faster - by having faster ALUs or, if it is important enough, by
having dedicated hardware. Is CABAC important enough?

E.g. Terje, you're known to be a Larrabee fan. Can you vectorize CABAC?

I'm not opposed to making sequentially dependent stuff run faster. I'm
just observing that, if device limitations get in the way, there are
lots of workloads that are not dominated by sequentially dependent
stuff (at the fine grain).

As for CABAC, I must admit that I have some hope in algorithmic
techniques similar to those you were recently discussing for
parallelizing encryption.

For example: divide the image up into subblocks, and run CABAC on each
subblock in parallel. To obtain similar compression ratios you would
have to have keyframes less frequently. Bursts at keyframes possibly
could be avoided by skewing them.
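
A minimal sketch of that idea, assuming the encoder emits N fully
independent subblock bitstreams, each with its own contexts and coder
state; decode_slice() is a hypothetical stand-in for the serial bit
decoder loop above:

#include <pthread.h>
#include <stdio.h>

#define NSLICES 4

typedef struct {
    int         id;
    const void *bitstream;    /* this slice's independent bitstream  */
    /* per-slice decoder state (contexts, range, code) would go here */
} Slice;

static void decode_slice(Slice *s)
{
    /* stand-in for the serial CABAC loop; the point is that nothing
     * here touches another slice's contexts or coder state, which is
     * what buys the parallelism */
    printf("slice %d decoded\n", s->id);
}

static void *worker(void *arg)
{
    decode_slice((Slice *)arg);
    return NULL;
}

int main(void)
{
    pthread_t tid[NSLICES];
    Slice     slice[NSLICES];
    int       i;

    for (i = 0; i < NSLICES; i++) {
        slice[i].id = i;
        slice[i].bitstream = NULL;   /* would point into the stream */
        pthread_create(&tid[i], NULL, worker, &slice[i]);
    }
    for (i = 0; i < NSLICES; i++)
        pthread_join(tid[i], NULL);
    return 0;
}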

Moreover, I remain a fan of model based encoding. Although that requires
significantly more computation, it is parallel.
From: "Andy "Krazy" Glew" on
Mayan Moudgill wrote:
> Andy "Krazy" Glew wrote:
>
>> (3) Recall that I am a fan of skip-ahead, speculative multithreading
>> architectures such as Haitham Akkary's DMT. If you can't predict a
>> branch, skip ahead to the next loop iteration or function return, and
>> execute code that you know will be executed with high probability.
>
> I was wondering - how much of the DMT performance improvement is
> because of all the speculative execution, and how much of it is
> because it's acting as an I-cache prefetch engine? IIRC, the
> performance numbers for some of the non-linear I-prefetch schemes
> seem to track the performance improvements reported by DMT.

It's about half and half. Every SpMT / DMT simulator that I have seen
has the option of turning off speculation, and just using skipahead as
an instruction prefetcher. And possibly a data prefetcher. In fact,
good proposals don't bother to store the speculative results for
instructions that can easily be recomputed - it's easier to (re)compute
than it is to look up in a large store.

One might then reasonably ask "Why not take the hardware that is needed
for SpMT, and use it to make your predictor tables larger?" Which is
totally valid, and explains much of the last 10 years in CPU
microarchitecture: we might call this the era of predictors. Aided and
abetted by the fact that applications got *SIMPLER* in the last decade,
as simplistic multimedia codes with simple access patterns became
especially important.

However, most "predictors" are history based. They predict the last
value seen, or a linear stride added to the last few values seen, or
some other extrapolation. Or they rely on something like a Markov model
for state transitions. I suppose that you can add more curve fitting to
your predictor, but the easiest way to see what complicated non-linear
data access patterns may be occurring is to actually execute the code.
Use the real values if possible, or the best predictions for values
that you can't obtain, and execute the intervening code.

"Code based prefetch".

I once worked with a prefetcher guy who really, really, wanted access to
the instruction stream. And the TLBs. What he was doing was executing
chunks of code - by no means the whole code, but just the parts that he
needed - and using that in his prefetcher / address predictor.

There's a spectrum, ranging from (a) speculatively execute nothing,
predict and prefetch everything, through (z) speculatively execute
everything, remembering all speculative results, with intermediate
points such as:

(m) speculatively execute everything, remembering cache miss results
    only; re-execute and verify,
(n) remember cache misses plus computations that a simple timing model
    shows would be on the critical path,
(p) don't remember cache misses - just bias the cache replacement
    policy, and
(g) on the other side, speculatively execute using data value
    predictors and/or whatever stale data you have in the cache.




From: "Andy "Krazy" Glew" on
Robert Myers wrote:
> On Oct 21, 11:54 pm, Joe Pfeiffer <pfeif...(a)cs.nmsu.edu> wrote:
>> We were sure supposed to take it seriously -- didn't Merced actually
>> have an i386 core on it when delivered?
>
> It had something or other, but PIII had to be in the works (Andy would
> know) and it would have stomped anything that came before.

I am not aware of an Itanium shipped or proposed that had an "x86 core
on the side".

There were proposals to add some special-purpose hardware, such as x86
instruction decoders that packed the decoded operations into VLIW
instructions.


> That is to say, I find it hard to believe that anyone took Itanium
> seriously as an x86 competitor.

I can assure you that it was sold that way to Intel senior management.
From: Terje Mathisen on
Andy "Krazy" Glew wrote:
> Terje Mathisen wrote:
>> Andy "Krazy" Glew wrote:
>
>> Andy, you really owe it to yourself to take a hard look at h264 and
>> CABAC: In approximately the same timeframe as DES was replaced with
>> AES, with a stated requirement of being easy to make fast/efficient on
>> a PentiumPro cpu, the MPEG working groups decided that a "Context
>> Adaptive Binary Arithmetic Coder" was the best choice for a video codec.
>>
>> CABAC requires 3 or 4 branches for every single _bit_ decoded, and the
>> last of these branches depends on the value of that decoded bit.
>>
>> Until you've made that branch you don't even know which context to
>> apply when decoding the next bit!
>>
>> (I have figured out workarounds (either branchless code or making them
>> predictable) for most of those inline branches in the bit decoder, but
>> that last context branch is unavoidable.)
>>
>> The only possible skip ahead is a really big one: You have to locate
>> the next key frame and start another core/thread, but this approach is
>> of extremely limited value if you are in a realtime situation, e.g.
>> video conferencing.
>>
>> Terje
>
>
> I have looked at CABAC, and you are right, it seems to be the branch
> equivalent of chasing down a hash chain.
>
> It is also the equivalent of good online compression (well, duh) and
> encryption, where every bit depends on all previous bits (and possibly
> some or all future bits).
>
> And if you can't skip to the next independent chunk of work - if there
> is no independent work to skip to - you are screwed. You have to make
> the dependent stuff run faster. Or do nothing at all. You make the
> dependent stuff run faster by architectures that make sequential code
> run faster - by having faster ALUs or, if it is important enough, by
> having dedicated hardware. Is CABAC important enough?

It is almost certainly important enough that anything remotely
power-sensitive will need dedicated hw to handle at least the CABAC part.
>
> E.g. Terje, you're known to be a Larrabee fan. Can you vectorize CABAC?

Not at all, afaik.
>
> I'm not opposed to making sequentially dependent stuff run faster. I'm
> just observing that, if device limitations get in the way, there are
> lots of workloads that are not dominated by sequentially dependent
> stuff (at the fine grain).
>
> As for CABAC, I must admit that I have some hope in algorithmic
> techniques similar to those you were recently discussing for
> parallelizing encryption.
>
> For example: divide the image up into subblocks, and run CABAC on each
> subblock in parallel. To obtain similar compression ratios you would

This is the only silver lining: Possibly due to the fact that they were
working on PS3 at the time, Sony specified that Blu-ray frames are all
split into 4 independent quadrants, which means that they could
trivially split the job across four of the 7 or 8 Cell cores.

This also reduced the size of each subframe, in 1080i, to about 256 K
pixels (one quadrant of a 1920x540 field is 259,200 pixels). :-)

> have to have keyframes less frequently. Bursts at keyframes possibly
> could be avoided by skewing them.
>
> Moreover, I remain a fan of model based encoding. Although that requires
> significantly more computation, it is parallel.

OK.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: nmm1 on
In article <DcudnTf8XIfmmXzXnZ2dnUVZ_u2dnZ2d(a)metrocastcablevision.com>,
Bill Todd <billtodd(a)metrocast.net> wrote:
>
>The fact that Itanic came so close to world domination *despite* its
>abject failure to deliver on the promises that had seemed to make that
>domination inevitable tends to prove that the attempt to bluff its way
>to success was a daring and risky move but hardly an insane one. ...

Not really. It was a lot further from that than the hype indicated.
It made practical headway in two areas, so let's consider them.

HPC was its most successful area, and something like two sites tried
it and rejected it for every one that delivered a service using it.
The SGI Altix was the main success, though I have heard that Bull
made headway and suspect that Hitachi may have made some, too. And,
when I say rejected, I mean that it was often made a condition on
future tenders - i.e. don't tender IA64, as it will not be
shortlisted.

[ Aside: EU procurement law makes bias by public purchasers illegal,
but Those Of Us With Clue had no difficulty in finding technical and
financial reasons to veto IA64. Like, for example, just WHERE can
you find staff capable of tracking down code-generation bugs in
compilers for parallel IA64 codes? If anyone says "the vendor",
then he clearly doesn't understand HPC. ]

The other was Mission Critical computers for Big Business. I met
people from several of those, and they had all taken the position
that they were going to run it in parallel with their existing
systems for a year or more before making a decision. Asking how
it was going got a very po-faced non-response.

My point here is that, if the Itanic had started to be pushed much
harder, the real heavyweights would have joined the opposition.
It never had an earthly of doing what it was originally hyped to
do (i.e. entirely replace x86).


Regards,
Nick Maclaren.