From: Tim McCaffrey on 11 Aug 2010 16:34

In article <lro9j7-38b.ln1(a)ntp.tmsw.no>, "terje.mathisenattmsw.no" says...
>
> I sort of accept all that, what I don't get is the fact that for more or
> less my entire IT career, I've been told that all the special x86
> instructions with fixed and/or implied register operands made it very
> hard/impossible to generate really good compiled code, and that this
> problem was solved by having more registers and an orthogonal instruction
> set, i.e. RISC. :-)
>
> (Personally I've never really understood what was so hard about x86,
> except for register pressure; mapping algorithms onto the
> register/instruction set has felt quite natural.)

With most compilers I've worked with (not a huge number), there is a
series of transforms the code goes through until you get to the actual
code emitter. At that point, you get intermediate ops something like:

    T2 = VAR1
    T3 = VAR2
    T1 = T2 << T3
    VAR1 = T1

So, the straightforward thing to do is load T2 & T3 into registers:

    MOV EAX,VAR1
    MOV EBX,VAR2

Do the operation (oops, loaded it into the wrong register for that!):

    MOV ECX,EBX
    SHL EAX,CL    ; simple reference counting allows us to reuse EAX
                  ; because we know T2 will not be used again.

And store it:

    MOV VAR1,EAX

Now, most modern code generators will do the correct register assignment,
getting rid of the above MOV, but it requires multiple passes to get the
code right. The code emitter is a fair amount more complicated, especially
when you get into more complex code & register allocation mechanisms.

In a lot of ways, the code emitter is the hardest part of the compiler.
That is because you have a hard target to transform to. In the other
stages of the compiler, if you don't like the target, you can change it to
something more convenient, or create some intermediate stage and translate
back & forth. You also lose a lot of semantic information: when you load a
register with a value, you pretty much lose any range or type information
at that point, unless your emitter is unusually sophisticated.

- Tim
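A minimal sketch of the precoloring idea behind that multi-pass fix,
assuming the same four-op IR as above (a toy C example, not code from any
compiler discussed here): a temp that feeds a fixed-register instruction,
here SHL's count which must live in ECX/CL, gets pinned before general
allocation hands out the remaining registers, so the fix-up MOV is never
emitted in the first place.

    #include <stdio.h>

    enum reg { EAX, EBX, ECX };
    static const char *rname[] = { "EAX", "EBX", "ECX" };

    int main(void)
    {
        /* IR: T2 = VAR1; T3 = VAR2; T1 = T2 << T3; VAR1 = T1 */
        int t2_reg, t3_reg;

        /* Pass 1: T3 is SHL's count operand, which x86 requires in
         * ECX/CL, so pin T3 to ECX before general allocation runs. */
        t3_reg = ECX;

        /* Pass 2: hand out the remaining registers around the pin. */
        t2_reg = EAX;                /* first free general register */

        /* Emit: VAR2 lands in ECX directly, so the single-pass
         * fix-up MOV ECX,EBX disappears.                           */
        printf("MOV %s,VAR1\n", rname[t2_reg]);
        printf("MOV %s,VAR2\n", rname[t3_reg]);
        printf("SHL %s,CL\n",   rname[t2_reg]);
        printf("MOV VAR1,%s\n", rname[t2_reg]);
        return 0;
    }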
From: Terje Mathisen <"terje.mathisen at tmsw.no"> on 12 Aug 2010 01:22

Nick Maclaren wrote:
> In article <s95bj7-ib7.ln1(a)laptop.reistad.name>,
> Morten Reistad <first(a)last.name> wrote:
>> The x86 is a little weird, but not overly so. The various instructions
>> have lots of implied associations, but nothing totally exotic. For
>> exotic, try the VAX, or the Prime 50-series.

The nice thing about x86 is that it is almost totally a load-operate
architecture; the operate-on-mem forms are useful for atomics and for
those situations when you really have to update some seldom-used location.
Avoiding instructions with multiple independent addresses and/or multiple
targets made it easier to create optimized microarchitectures that sped it
up a lot.

> Yes. The same remarks were made by the same dogmatists about the
> System/370 series, and they were even less justified. There were
> some weird instructions, but they were used only by people who wrote
> assembler procedures and run-time systems. The basic instruction set
> was very simple, and that is all that almost all compilers used.

I have been told (read somewhere) that one early IBM compiler was known
for being able to generate, when compiling regular programs, every single
opcode in the instruction set.

I don't remember which language, maybe Cobol or APL?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
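A small illustration of that operate-on-mem-for-atomics point (a
hypothetical example, not from the thread): with GCC or Clang, the builtin
below typically compiles to a single lock-prefixed add straight to memory
on x86, where a pure load/store architecture needs a load-linked/
store-conditional loop instead.

    #include <stdio.h>

    static long counter;

    /* Atomic read-modify-write directly on memory; on x86 this is
     * typically one "lock add" instruction, no separate load/store. */
    void bump(void)
    {
        __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
    }

    int main(void)
    {
        bump();
        printf("counter = %ld\n", counter);
        return 0;
    }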
From: Terje Mathisen <"terje.mathisen at tmsw.no"> on 12 Aug 2010 01:35

Tim McCaffrey wrote:
> In article <lro9j7-38b.ln1(a)ntp.tmsw.no>, "terje.mathisenattmsw.no"
> says...
>> (Personally I've never really understood what was so hard about x86,
>> except for register pressure; mapping algorithms onto the
>> register/instruction set has felt quite natural.)
>
> With most compilers I've worked with (not a huge number), there is a
> series of transforms the code goes through until you get to the actual
> code emitter. At that point, you get intermediate ops something like:
>
>     T2 = VAR1
>     T3 = VAR2
>     T1 = T2 << T3
>     VAR1 = T1
>
> So, the straightforward thing to do is load T2 & T3 into registers:
>
>     MOV EAX,VAR1
>     MOV EBX,VAR2
>
> Do the operation (oops, loaded it into the wrong register for that!):
>
>     MOV ECX,EBX
>     SHL EAX,CL    ; simple reference counting allows us to reuse EAX
>                   ; because we know T2 will not be used again.
>
> And store it:
>
>     MOV VAR1,EAX
>
> Now, most modern code generators will do the correct register
> assignment, getting rid of the above MOV, but it requires multiple
> passes to get the code right.

The way I do this is to do a first pass that looks for any of those fixed
register assignment instructions (mostly ECX/CX/CL today) and check which
are located in the innermost loops. I might need to do the same with
ESI/EDI, so anything that looks like buffer input/output will start out
with those assignments.

The next stage is to figure out the number of live registers at all
locations, then figure out if there are any obvious candidates for
overflow into stack locals.

One of the harder considerations is the frame pointer, EBP: if I use that
for temporary variables, I lose the capability to reload any stack
parameters/temps, at least if I'm also pushing/popping stuff dynamically.

> The code emitter is a fair amount more complicated, especially when you
> get into more complex code & register allocation mechanisms.

Yep.

> In a lot of ways, the code emitter is the hardest part of the compiler.
> That is because you have a hard target to transform to. In the other
> stages of the compiler, if you don't like the target, you can change it
> to something more convenient, or create some intermediate stage and
> translate back & forth. You also lose a lot of semantic information:
> when you load a register with a value, you pretty much lose any range
> or type information at that point, unless your emitter is unusually
> sophisticated.

I.e. human? :-)

I have seen a 16-bit x86 compiler, though, which was smart enough to
rearrange stack variables so as to pair byte/char temps: this allowed it
to automatically use a single 16-bit operation on both variables, while
making sure that all 16-bit temps were properly aligned. Sort of SIMD in
~1985.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
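A toy version of that live-count pass, assuming the four-op IR from Tim's
example (my own sketch, not Terje's actual code): one scan records each
temp's first and last use, and the live count at every op falls out of
comparing those ranges; any op where the count exceeds the register file
size marks a spill candidate.

    #include <stdio.h>

    #define NOPS   4
    #define NTEMPS 3

    /* use[i][j] != 0 means op i reads or writes temp j.
     * IR: T2 = VAR1; T3 = VAR2; T1 = T2 << T3; VAR1 = T1 */
    static const int use[NOPS][NTEMPS] = {
        /* T1 T2 T3 */
        {  0, 1, 0 },   /* T2 = VAR1     */
        {  0, 0, 1 },   /* T3 = VAR2     */
        {  1, 1, 1 },   /* T1 = T2 << T3 */
        {  1, 0, 0 },   /* VAR1 = T1     */
    };

    int main(void)
    {
        int first[NTEMPS], last[NTEMPS];

        /* Record each temp's first and last use. */
        for (int j = 0; j < NTEMPS; j++) {
            first[j] = NOPS;
            last[j]  = -1;
            for (int i = 0; i < NOPS; i++)
                if (use[i][j]) {
                    if (first[j] == NOPS)
                        first[j] = i;
                    last[j] = i;
                }
        }

        /* A temp is live at op i if i falls inside [first, last]. */
        for (int i = 0; i < NOPS; i++) {
            int live = 0;
            for (int j = 0; j < NTEMPS; j++)
                if (first[j] <= i && i <= last[j])
                    live++;
            printf("op %d: %d live temp(s)\n", i, live);
        }
        return 0;
    }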
From: Nick Maclaren on 12 Aug 2010 04:43

In article <d1dcj7-q7j.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>> Yes. The same remarks were made by the same dogmatists about the
>> System/370 series, and they were even less justified. There were
>> some weird instructions, but they were used only by people who wrote
>> assembler procedures and run-time systems. The basic instruction set
>> was very simple, and that is all that almost all compilers used.
>
> I have been told (read somewhere) that one early IBM compiler was known
> for being able to generate, when compiling regular programs, every
> single opcode in the instruction set.
>
> I don't remember which language, maybe Cobol or APL?

Cobol - or, in those days, COBOL :-) And it was for System/360 only, and
only the unprivileged instructions in the system-independent set. I can't
tell you if it was true, but I heard it, too ....

Regards,
Nick Maclaren.
From: Gubbi on 12 Aug 2010 07:55
On Aug 10, 2:15 pm, Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> Brett Davis wrote:
>> So is CMOVE still implemented internally as a branch?
>> (I know this is crazy sounding, but that is what both did...)
>
> Not a real branch, but it did hold up the pipeline for a short while
> afair?

You have three data dependencies: the two source registers and the
condition. With something like CMOVC you end up waiting for the last ALU
op to complete and set the carry before the CMOV can be executed. Compare
that to a correctly predicted branch, where you end up with just one data
dependency.

Cheers
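A quick way to see the dependency shape described here (a hypothetical
example, not from the post): build the function below with gcc -O2 and the
ternary typically becomes cmp + cmov, which cannot produce its result
until both inputs and the flags are ready, while the branchy equivalent
needs only the compare once the predictor has guessed the direction.

    #include <stdio.h>

    /* Typically compiles to cmp + cmov at -O2: three inputs
     * (a, b, and the flags) gate the result.                */
    long max_cmov(long a, long b)
    {
        return a > b ? a : b;
    }

    int main(void)
    {
        printf("%ld\n", max_cmov(3, 7));
        return 0;
    }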