From: Bernd Paysan on
Andy "Krazy" Glew wrote:
> Some of them think that the problem with VLIW was the instruction set.

The problem of Itanic was the designed-by-committee ISA. Too many "good"
features together aren't good anymore. The scaling concept of Itanic was
wrong, too. Look at current GPGPUs, and how they scale (e.g. ATI/AMD's):
through a relatively simple VLIW instruction set, through SIMD, through
multi-threading, through multi-core. Multi-everything. The ISA itself
seems to be stabilizing, but the interface usually is source-like (when
you use OpenCL, you don't ship precompiled binaries, nor do you when you
use DirectX shader programs or the corresponding OpenGL stuff).
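
An illustration of that "ship source, compile at run time" model: the
following minimal OpenCL host sketch (error handling and buffer setup
omitted, the kernel itself is a made-up example) shows the kernel
travelling as source text and being built by the driver for whatever
device is actually present.

#include <CL/cl.h>
#include <stdio.h>

/* The kernel ships as source text, not as a precompiled binary. */
static const char *src =
    "__kernel void add(__global float *a, __global const float *b) {"
    "    size_t i = get_global_id(0);"
    "    a[i] += b[i];"
    "}";

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* The driver compiles the kernel source for this particular GPU here. */
    cl_program program =
        clCreateProgramWithSource(context, 1, &src, NULL, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    cl_kernel kernel = clCreateKernel(program, "add", &err);
    printf("kernel built: %p\n", (void *)kernel);
    return 0;
}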

Itanic's scaling was at first mainly through added ILP, i.e. scaling up
the VLIW width. I didn't believe that you could scale VLIW beyond some
sweet spot, e.g. the four integer operations per cycle of my 4stack. If
you go further, the returns diminish. The same is true for OOO - you can
extract some parallelism out of a program, but not too much. If you use
up your transistor resources mostly for supporting logic, and not for the
actual work itself, you are going in the wrong direction. OOO and VLIW
are there to maximize the output of a single core. Given that a lot of
software is just written for that, it makes sense - but only up to the
inherent parallelism of that software.

> Some of them think that binary translation will solve all those
> problems. Me, I think that binary translation is a great idea - but
> only if the target microarchitecture makes sense in the absence of
> binary translation, if we lived in an open source world and x86
> compatibility was not an issue.

We live in an open source world, and x86 compatibility isn't an issue - if
you ignore the Windows-dominated desktop ;-). Binary translation matters
for closed source; for open source, it's a non-issue. Below the desktop,
in mobile internet devices, open source is already quite dominant (even
when the whole offering is proprietary, like Apple's iPhone, most parts
are open source), and above the desktop, on servers, the same is true.
Smaller devices have custom applications, i.e. even though these often
have proprietary licenses or simply are trade secrets, the programmer has
the sources readily available.

However, this is 10 years later. And for higher levels of parallelism,
even open source doesn't help. If you want to make GCC use a GPGPU, you
had better rewrite it from scratch (actually, that's my suggestion anyway:
rewrite GCC from scratch, it stinks of rotten source ;-). The same goes
for many other programs. We discussed Excel and its suboptimal algorithms
- why not redesign the spreadsheet around high performance computing? A
column is a vector, a sheet is a matrix? Create data dependencies for
recalculation?
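
A minimal sketch of what that could look like - plain C with made-up
column structures, not a description of any existing spreadsheet engine:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical column-as-vector representation: a column is just a
   vector of doubles. */
typedef struct column {
    double *val;
    size_t  len;
} column;

/* A column-level formula, e.g. C = A + B, applied element-wise.  Such
   bulk operations map naturally onto SIMD units or a GPGPU. */
static void col_add(column *c, const column *a, const column *b)
{
    for (size_t i = 0; i < c->len; i++)
        c->val[i] = a->val[i] + b->val[i];
}

/* Recalculation keeps the column formulas in dependency (topological)
   order and re-runs only the columns downstream of a changed input,
   each as one vector operation instead of cell-by-cell interpretation. */
static void recalc(column *a, column *b, column *c, column *d)
{
    col_add(c, a, b);   /* C depends on A and B */
    col_add(d, c, a);   /* D depends on C and A */
}

int main(void)
{
    double av[4] = {1, 2, 3, 4}, bv[4] = {10, 20, 30, 40}, cv[4], dv[4];
    column A = {av, 4}, B = {bv, 4}, C = {cv, 4}, D = {dv, 4};
    recalc(&A, &B, &C, &D);
    printf("%g\n", D.val[0]);   /* 12 */
    return 0;
}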

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Bernd Paysan on
Terje Mathisen wrote:
> Andy, you really owe it to yourself to take a hard look at h264 and
> CABAC: In approximately the same timeframe as DES was replaced with AES,
> with a stated requirement of being easy to make fast/efficient on a
> PentiumPro cpu, the MPEG working groups decided that a "Context Adaptive
> Binary Arithmetic Coder" was the best choice for a video codec.
>
> CABAC requires 3 or 4 branches for every single _bit_ decoded, and the
> last of these branches depends on the value of that decoded bit.
>
> Until you've made that branch you don't even know which context to apply
> when decoding the next bit!

The solution of course is that CABAC goes to dedicated CABAC-decoding
hardware ;-). CABAC decoders in an FPGA can decode 1 symbol/cycle with
just 1300 LEs (you can only fit two b16s in that space). There's
absolutely no point in doing this on a CPU. Perhaps people should start
putting an FPGA onto the CPU die for this sort of stateful
bit-manipulation task.
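
For reference, here is a stripped-down sketch of the control flow of a
CABAC-style bin decoder - the tables are dummies and the initialization
is simplified, so this is not the H.264 algorithm itself, just an
illustration of where the branches Terje counts come from and why the
context for the next bin depends on the bin just decoded:

#include <stdio.h>

/* Dummy stand-ins for the real H.264 tables (the real ones have 64
   probability states and 4 quantized range columns). */
static const unsigned lps_range[4][4] = {
    {128, 176, 208, 240}, {101, 123, 145, 167},
    { 60,  74,  87, 100}, {  6,   8,   9,  10}
};
static const unsigned next_state_mps[4] = {1, 2, 3, 3};
static const unsigned next_state_lps[4] = {0, 0, 1, 2};

typedef struct { unsigned state, mps; } ctx_t;  /* one context model */

static unsigned range = 510, offset = 0;        /* decoder registers */

static unsigned next_bit(void) { return 0; }    /* toy bitstream     */

static unsigned decode_bin(ctx_t *ctx)
{
    unsigned lps = lps_range[ctx->state][(range >> 6) & 3];
    unsigned bin;

    range -= lps;
    if (offset >= range) {                 /* branch 1: LPS or MPS?   */
        bin = !ctx->mps;
        offset -= range;
        range = lps;
        if (ctx->state == 0)               /* branch 2: flip the MPS? */
            ctx->mps = !ctx->mps;
        ctx->state = next_state_lps[ctx->state];
    } else {
        bin = ctx->mps;
        ctx->state = next_state_mps[ctx->state];
    }
    while (range < 256) {                  /* branch 3: renormalize   */
        range <<= 1;
        offset = (offset << 1) | next_bit();
    }
    return bin;  /* branch 4 sits in the caller: which context to use
                    next depends on this very value */
}

int main(void)
{
    ctx_t ctx = {0, 0};
    for (int i = 0; i < 8; i++)
        printf("%u", decode_bin(&ctx));
    printf("\n");
    return 0;
}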

The other solution is to get working on next-generation codecs, and not
repeat the same mistake. The IMHO right way to do it is to sort the
coefficients by their context (i.e. first do a wavelet transformation),
and then encode with a standard dictionary-based entropy encoder like
LZMA, which is compact and fast to decompress (for mobile encoders, you
probably need an option to go with LZ77 for faster compression but less
dense results, or to simply restrict the size of the dictionary).
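
As a toy illustration of the transform step (no particular wavelet is
implied above; this is just the integer Haar lifting step in 1-D,
whereas a real codec would use a better filter and apply it in 2-D and
temporally):

#include <stdio.h>

/* Integer Haar lifting step: splits the signal into coarse averages
   and detail coefficients.  Coefficients can then be grouped by
   subband - their natural "context" - and handed to a generic entropy
   coder such as LZMA. */
static void haar_step(const int *x, int *lo, int *hi, int n /* even */)
{
    for (int i = 0; i < n / 2; i++) {
        hi[i] = x[2*i + 1] - x[2*i];     /* detail  ("high-pass") */
        lo[i] = x[2*i] + (hi[i] >> 1);   /* average ("low-pass")  */
    }
    /* Invertible: x[2i] = lo[i] - (hi[i] >> 1), x[2i+1] = x[2i] + hi[i]. */
}

int main(void)
{
    int x[8] = {10, 12, 11, 13, 40, 42, 41, 43}, lo[4], hi[4];
    haar_step(x, lo, hi, 8);
    for (int i = 0; i < 4; i++)
        printf("lo=%d hi=%d\n", lo[i], hi[i]);
    return 0;
}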

Wavelet transformations also make it easier to distribute different
quality levels from the same raw data (e.g. stream SD/HD/4k at the same
time, where SD and HD clients only extract the scaled-down versions, and
the HD/4k clients use the SD stream as a base plus additional streams for
the higher resolutions) - also helpful for video editing (preview the SD
stream quickly, render from the 4k stream).

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Bernd Paysan on
Andy "Krazy" Glew wrote:
> OK, OK, OK. This is not my area. But I would love to understand WHY
> something like this cannot work.

The problem with CMOS is that all the transistors have embedded diodes
that need to be reverse-biased to keep them operable. A transistor really
is a four-terminal device (source, drain, gate, *and* bulk).

This sort of low-power reversible computation stuff is more for nano-scale
electronics (using carbon nanotubes and whatever science fiction-like things
you can imagine ;-) than for microelectronics.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/
From: Anton Ertl on
Stephen Fuld <SFuld(a)alumni.cmu.edu.invalid> writes:
>Paul Wallich wrote:
>> I would bet that a huge chunk of the time isn't in doing the actual
>> calculations but in verifying that the calculations can be done.
>> Spreadsheets are pretty much the ultimate in mutably-typed interactive
>> code, and there's very little to prevent a recalculation from requiring
>> a near-universal reparse.
>
>Wow, I hadn't thought of that. But if you are say running multiple
>simulation runs, or something else where the only thing changing is the
>value of some parameters, not the "structure" of the spreadsheet, does
>Excel understand that it can skip at least most of the reparse?

Probably not, because for most users Excel is fast enough even with
slow algorithms. And those for whom it isn't have probably invested so
much in Excel that most of them would not change to a spreadsheet
program with better algorithms even if one were available. So there is
no incentive for other spreadsheet programs to improve their
algorithms, and therefore also no incentive for Excel.

Concerning the structure of the spreadsheet, it changes only cell by
cell, so any parsing should only have to deal with one cell at a time.
And for operations that deal with many cells (say, copying a column or
loading a spreadsheet), it's reasonable that the time taken is
proportional to the size of the change; these operations are not the
most frequent, so it's not so bad if they take a little time on huge
spreadsheets.
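
One way to keep the cost proportional to the change - purely
hypothetical, not a claim about how Excel actually works - is to cache
the parse tree per cell and reparse only the cell being edited:

#include <stdlib.h>
#include <string.h>

struct expr;                                /* parse tree, details elided  */
static struct expr *parse_formula(const char *s)
{
    (void)s;                                /* stub parser for this sketch */
    return NULL;
}

typedef struct cell {
    char         *text;        /* formula source, e.g. "=A1+B1"          */
    struct expr  *parsed;      /* cached parse tree                      */
    int           dirty;       /* needs recalculation                    */
    struct cell **dependents;  /* cells whose formulas refer to this one */
    size_t        ndeps;
} cell;

/* Editing one cell reparses only that cell and marks its dependents
   dirty; the rest of the sheet is untouched, so the work done is
   proportional to the size of the change. */
static void edit_cell(cell *c, const char *new_text)
{
    free(c->text);
    c->text   = strdup(new_text);
    c->parsed = parse_formula(c->text);
    c->dirty  = 1;
    for (size_t i = 0; i < c->ndeps; i++)
        c->dependents[i]->dirty = 1;
}

int main(void)
{
    cell c = {strdup("=A1"), NULL, 0, NULL, 0};
    edit_cell(&c, "=A1+B1");
    return c.dirty ? 0 : 1;
}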

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
From: Anton Ertl on
Robert Myers <rbmyersusa(a)gmail.com> writes:
>I've fiddled a little with the off-the-shelf Itanium compilers, but I
>always assumed that none of those compilers were even remotely good
>enough that you could expect just to run old software through them and
>get anything like hoped-for performance. John Dallman has had a bit
>to say on the subject here.
>
>When I talked about rewriting code, I meant just that, not merely
>recompiling it. I wasn't all that interested in the standard task:
>how do you feed bad code to an Itanium compiler and get acceptable
>performance, because I was pretty sure that the answer was: you
>don't.

Bad code?

For most software, performance is not that much of an issue, and the
developers have already left more performance on the table than
switching between IA-64 and other architectures, switching between
different compilers for IA-64, or coding things in an IA-64-friendly
manner would have bought.

To get an idea of how the Itanium II performs compared to other CPUs on
code that's not tuned for it (at least not more than for any other
architecture), take a look at the first graph (slide 4) in

http://www.complang.tuwien.ac.at/anton/euroforth/ef09/papers/ertl-slides.pdf

This is performance per cycle, and the benchmarks are pretty CPU-bound.
The compilers used are various gcc versions (the fastest code produced
by the gcc versions available on each test machine is shown). The only
Gforth version that does not treat IA-64 as a generic architecture is
0.7.0, and the only thing that's IA-64-specific there is that it knows
how to flush the I-cache.

The performance per cycle of the Itanium II is not particularly good,
but also not particularly bad.

The only ones that are significantly faster on Gforth 0.7.0 are the
IA-32 and AMD64 implementations, and that's because they have
implemented at least a BTB for indirect branch prediction.
Interestingly, the 21264B, which has a similar mechanism for indirect
branch prediction, is barely faster per clock on this code than the
Itanium II.

On Gforth 0.5.0 Itanium II does ok.

On Gforth 0.6.x the comparison is a little unfair, because on some
machines Gforth has the "dynamic superinstruction" optimization, while
on others it doesn't. The Itanium II performs best among the machines
without it, but is much slower than those with it.
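
For context, the indirect branch in question is the interpreter's
dispatch. A stripped-down sketch of the kind of indirect-threaded
dispatch Gforth uses (GNU C computed goto; this is not Gforth's actual
code): the "goto **ip++" is the indirect branch a BTB can predict, and
dynamic superinstructions largely remove it by copying the primitives'
machine code back to back.

#include <stdio.h>

int main(void)
{
    /* Threaded code: a list of label addresses plus inline literals. */
    static void *program[] = { &&lit, (void *)5, &&lit, (void *)7,
                               &&add, &&print, &&halt };
    void **ip = program;
    long stack[16], *sp = stack;

next:
    goto **ip++;                    /* the hard-to-predict indirect branch */

lit:   *sp++ = (long)*ip++;         goto next;
add:   sp--; sp[-1] += sp[0];       goto next;
print: printf("%ld\n", *--sp);      goto next;  /* prints 12 */
halt:  return 0;
}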

OK, this is just one benchmark; you can see another one at
<http://www.complang.tuwien.ac.at/franz/latex-bench>. Here are just
some lines from there:

Machine seconds
- UP1500 21264B 800MHz 8MB L2 cache, RedHat 7.1 (b1) 3.28
- Intel Atom N330, 1.6GHz, 512K L2 Zotac ION A, Knoppix 6.1 32bit 2.323
- Athlon (Thunderbird) 900, Win2K, MikTeX 1.11d 2.306
- Athlon 64 X2 5600+, 2800MHz, 1MB L2, Debian Etch (64-bit) 0.624
- Xeon 5450, 3000MHz, (2*2*)6MB L2, Debian Etch (64-bit) 0.460
- iBook G4 12", 1066MHz 7447A, 512KB L2, Debian Sarge GNU/Linux 2.62
- PowerMac G5, 2000MHz PPC970, Gentoo Linux PPC64 1.47
- Sun Blade 1000, UltraSPARC-IIIi 900MHz Solaris 8 3.09
- HP workstation 900MHz Itanium II, Debian Linux 3.528

Again, not great performance, but not extremely bad, either.

If the others had not beaten it on clock rate, it would have been
competitive even on such applications compiled with gcc.

>RedHat Enterprise still supports Itanium, so far as I know. Open
>source depends on gcc, perhaps the cruftiest bit of code on the
>planet. Yes, gcc will run on Itanium, but with what level of
>performance?

See above.

>Could the open source community, essentially founded on
>x86, turn on a dime and compete with Microsoft running away with
>Itanium? Maybe with IBM's muscle behind Linux, open source would have
>stood a chance, but I'm not so sure. After all, IBM would always have
>preferred an Itanium-free world. Had I been at Microsoft, I might
>have seen a Wintanium future as really attractive.

Microsoft obviously did not see it that way, because they eventually
decided against IA-64 and for AMD64. I don't know why they decided
that way, but I see two flaws in your scenario:

To run away with IA-64, Windows software would have had to run on
IA-64 at all. Most of it is not controlled by Microsoft, and even the
software controlled by Microsoft does not appear to be that portable
(looking at the reported dearth of applications (including
applications from Microsoft) for Windows NT on Alpha). In contrast,
free software tends to be much more portable, so the situation on
IA-64 would have been: Windows with mostly emulated applications
against Linux with native applications.

And would Microsoft have produced a better compiler for IA-64 than
SGI?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton(a)mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html