From: Terje Mathisen "terje.mathisen at on
robertwessel2(a)yahoo.com wrote:
> On Jun 4, 3:33 am, Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> The 16450 rs232 chip could be programmed to delay interrupts until a
>> given percentage of the 16-entry FIFO buffer had been consumed, but a
>> receive irq handler could still see that the buffer was non-empty by
>> polling the status.
>
>
> To be pedantic, that was the 16550A. The 16450 was basically an 8250
> clone, with official support for higher speeds.

Thanks, you're right. My memory isn't what it was 25 (or so) years ago.
>
> Of course programming the 16550 and using the buffer was complicated
> by a number of bugs, not least its propensity for the write FIFO to
> get stuck if you put a single byte into it at just the wrong time.

Aha!

So that was the reason _some_ PCs could fail unless the write buffer was
skipped...
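
For the record, the receive side I was describing is just a drain
loop on the line status register. A minimal sketch (COM1 port
constants; enqueue_rx_byte() is a made-up placeholder, and inb() is
the Linux/x86 port-I/O helper):

#include <sys/io.h>         /* inb(): x86 port I/O (needs ioperm()) */

#define UART_BASE 0x3F8             /* COM1 */
#define UART_RBR  (UART_BASE + 0)   /* receiver buffer register */
#define UART_LSR  (UART_BASE + 5)   /* line status register */
#define LSR_DR    0x01              /* "data ready": FIFO non-empty */

extern void enqueue_rx_byte(unsigned char c);   /* placeholder */

/* The FIFO trigger level decides when the IRQ fires, but we keep
 * reading for as long as LSR says there is data, so one interrupt
 * can service many buffered bytes. */
static void com1_rx_isr(void)
{
    while (inb(UART_LSR) & LSR_DR)
        enqueue_rx_byte(inb(UART_RBR));
}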

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Rob Warnock on
Rick Jones <rick.jones2(a)hp.com> wrote:
+---------------
| Rob Warnock <rpw3(a)rpw3.org> wrote:
| > 3. "Coalescing": The device driver interrupt service routine shall
| > *continue* to poll for attention status and continue to service
| > the device until the attention status goes false. [In fact, in
| > some systems it's a good idea if it polls the *first* time into
| > the ISR as well, to deal with potential spurious interrupts. But
| > that's another story...]
|
| I trust that is something other than a PIO Read?
+---------------

Sometimes a PIO Read was all you had, and in that case performance
could still suck, even with all the other optimizations. (*sigh*)

But in all the devices I worked on at SGI & later, the attention/status
"poll" was a main memory read of the completion/status circular queue
that the device updated with DMA, so the poll was quite cheap on the
host CPU side.
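
In outline it looked something like the following (a from-memory
sketch with invented names, not actual SGI driver code; memory
barriers and the device-side details are omitted):

#include <stdint.h>

#define RING_SIZE 256                   /* power of two */

struct completion {
    uint32_t status;
    uint64_t cookie;                    /* identifies the request */
};

/* Written by the device via DMA, read by the host. */
static struct completion ring[RING_SIZE];
static volatile uint32_t producer;      /* device's write index */

static uint32_t consumer;               /* host's read index */

extern void service_completion(struct completion *c);

/* ISR: poll on entry (catches spurious interrupts), then keep
 * consuming until the attention condition goes false.  The "poll"
 * is just the cheap main-memory comparison in the while(). */
void device_isr(void)
{
    while (producer != consumer) {
        service_completion(&ring[consumer & (RING_SIZE - 1)]);
        consumer++;
    }
}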

[Note that there are tricks you can play (and *need* to, for best
performance) to avoid the usual cache line ownership thrashing you
get with DMA that uses naive "producer/consumer" pointers in its
command/status queues. But that's another story...]
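
To give the flavor of one such trick (again with invented names):
give each entry an ownership word that the device writes last, so
the host polls the entries themselves instead of a device-updated
producer index whose cache line the DMA engine keeps stealing back,
and batch the host's own index write-backs instead of posting one
per entry:

#include <stdint.h>

#define RING_SIZE 256

struct completion {
    volatile uint32_t valid;    /* 0 = host owns, 1 = device filled */
    uint32_t status;
    uint64_t cookie;
};

static struct completion ring[RING_SIZE];
static uint32_t consumer;

extern void service_completion(struct completion *c);
extern void post_consumer_index(uint32_t idx);  /* doorbell, made up */

void device_isr(void)
{
    uint32_t batched = 0;
    while (ring[consumer & (RING_SIZE - 1)].valid) {
        struct completion *c = &ring[consumer & (RING_SIZE - 1)];
        service_completion(c);
        c->valid = 0;                   /* hand the slot back */
        consumer++;
        if (++batched == 32) {          /* batched index write-back */
            post_consumer_index(consumer);
            batched = 0;
        }
    }
    if (batched)
        post_consumer_index(consumer);
}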

+---------------
| > In most typical applications, the optimal dally time will be a
| > small fraction of the holdoff time. And in any case, the peaks
| > for "good" values of both parameters are rather broad.
|
| Sounds a little like deciding how long to spin on a mutex before going
| into the "other guy has it" path :)
+---------------

Just so! ;-}


-Rob

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607

From: Morten Reistad on
In article <X8ydnWeXrKN85pbRnZ2dnUVZ_j2dnZ2d(a)speakeasy.net>,
Rob Warnock <rpw3(a)rpw3.org> wrote:
>Rick Jones <rick.jones2(a)hp.com> wrote:
>+---------------
>| Rob Warnock <rpw3(a)rpw3.org> wrote:
>| > 3. "Coalescing": The device driver interrupt service routine shall
>| > *continue* to poll for attention status and continue to service
>| > the device until the attention status goes false. [In fact, in
>| > some systems it's a good idea if it polls the *first* time into
>| > the ISR as well, to deal with potential spurious interrupts. But
>| > that's another story...]
>|
>| I trust that is something other than a PIO Read?
>+---------------
>
>Sometimes a PIO Read was all you had, and in that case performance
>could still suck, even with all the other optimizations. (*sigh*)
>
>But in all the devices I worked on at SGI & later, the attention/status
>"poll" was a main memory read of the completion/status circular queue
>that the device updated with DMA, so the poll was quite cheap on the
>host CPU side.
>
>[Note that there are tricks you can play (and *need* to, for best
>performance) to avoid the usual cache line ownership thrashing you
>get with DMA that uses naive "producer/consumer" pointers in its
>command/status queues. But that's another story...]

All of these I/O methods overlay interrupts and DMA and polls onto
the classic von Neumann model. The costs of these overlays go up
radically as we tune the core CPUs for pipeline length, memory
bandwidth etc.

When we test modern systems for performance we usually end up
benchmarking how the core system I/O is implemented.

The best news in systems design we have seen since the CPU
frequencies stalled somewhere between 2 and 4 GHz has been the
hyperchannel. It is really effective at what it says it will
do.

So why not use that, or a similar mechanism, to do the master
I/O on our systems? Instead of 137 interrupts (really, the last,
large server we tested had this many interrupts enabled) we
can make do with around 5. We can put the vast bulk of the I/O into
the wide, fast, low latency pipe, and have one low-level
driver demultiplex it to where we want the data, and ship from
where we want to ship stuff.
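
The low-level driver does not have to be clever, either. Something
like this (the message format, handler table and pipe_next() are
all invented for illustration):

#include <stddef.h>
#include <stdint.h>

/* Header of a message arriving on the single fast pipe. */
struct msg {
    uint16_t channel;       /* logical stream: net, disk, tty, ... */
    uint16_t len;           /* payload bytes following the header */
};

typedef void (*handler_fn)(const struct msg *m);

#define NCHANNELS 8                     /* "around 5" logical streams */
static handler_fn handlers[NCHANNELS];  /* registered elsewhere */

/* Made up: returns the next DMA-delivered message, or NULL. */
extern const struct msg *pipe_next(void);

/* One driver drains the pipe and fans messages out; nothing above
 * this point ever takes a device interrupt. */
void demux_poll(void)
{
    const struct msg *m;
    while ((m = pipe_next()) != NULL) {
        handler_fn h = handlers[m->channel & (NCHANNELS - 1)];
        if (h)
            h(m);
    }
}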

OK; we still need to schedule, wake other processors up, etc.,
but we don't need all those precise interrupts for that. If the
performance-critical I/O is done, we can send signals to the
instruction dispatcher instead of blowing away the pipeline
with a high priority interrupt. Once timing and high-speed
stuff is dealt with, we can easily live with a 10us latency
for the rest.
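
The deferred part can then be as dumb as a periodic poll. (Sketch
only: nanosleep() stands in for whatever timer tick or monitor/wait
mechanism the platform provides, and the real wakeup jitter will be
bigger than the nominal 10us.)

#include <time.h>

extern void demux_poll(void);   /* the demultiplexer sketched above */

void io_poll_loop(void)
{
    struct timespec t = { 0, 10000 };   /* 10 us */
    for (;;) {
        demux_poll();
        nanosleep(&t, NULL);
    }
}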


Then make some version of the "south/north bridge" chips that
break out the hyperchannel FIFO into the I/O channels we
already know, like ethernet, USB, SATA and PCI, and let the
stock interfaces take it from there.

>+---------------
>| > In most typical applications, the optimal dally time will be a
>| > small fraction of the holdoff time. And in any case, the peaks
>| > for "good" values of both parameters are rather broad.
>|
>| Sounds a little like deciding how long to spin on a mutex before going
>| into the "other guy has it" path :)
>+---------------
>
>Just so! ;-}

For the real high speed I/O, let the hardware do it.

-- mrr
From: Rick Jones on
Morten Reistad <first(a)last.name> wrote:
> The best news in systems design we have seen since the CPU
> frequencies stalled somewhere between 2 and 4 GHz has been the
> hyperchannel. It is really effective at what it says it will
> do.

> So why not use that, or a similar mechanism, to do the master
> I/O on our systems? Instead of 137 interrupts (really, the last,
> large server we tested had this many interrupts enabled) we
> can make do with around 5. We can put the vast bulk of the I/O into
> the wide, fast, low latency pipe, and have one low-level
> driver demultiplex it to where we want the data, and ship from
> where we want to ship stuff.

Ah, the one channel to feed them all. That reminds me of the HP9000
K-Class systems - when they first shipped in the mid-1990s it was
felt that just the one or two "HSC" (aka GSC+) I/O slots would be
sufficient to feed the beast. There was, I believe, a two-slot I/O
expansion card one could install.

Before the system went off the price-list, HP were shipping a
four-slot HSC expansion module for the thing.

Going back farther, there was the "DTC" (aka Avesta) on the PA-RISC
HP3000's (and used with the HP9000s) - all that ugly slow serial stuff
was put out into the DTC and it was linked with the host via a (by the
standards of the day) blazing-fast 10 Mbit/s Ethernet link. The X.25
functionality was offloaded into that thing too.

Trouble was, there ended up being an entire networking stack for
talking to the DTC...

rick jones
--
The computing industry isn't as much a game of "Follow The Leader" as
it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
- Rick Jones
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
From: Morten Reistad on
In article <huje81$5kq$5(a)usenet01.boi.hp.com>,
Rick Jones <rick.jones2(a)hp.com> wrote:
>Morten Reistad <first(a)last.name> wrote:
>> The best news in systems design we have seen since the CPU
>> frequencies stalled somewhere between 2 and 4 GHz has been the
>> hyperchannel. It is really effective at what it says it will
>> do.
>
>> So why not use that, or a similar mechanism, to do the master
>> I/O on our systems? Instead of 137 interrupts (really, the last,
>> large server we tested had this many interrupts enabled) we
>> can make do with around 5. We can put the vast bulk of the I/O into
>> the wide, fast, low latency pipe, and have one low-level
>> driver demultiplex it to where we want the data, and ship from
>> where we want to ship stuff.
>
>Ah, the one channel to feed them all. That reminds me of the HP9000
>K-Class systems - when they first shipped in the mid-1990s it was
>felt that just the one or two "HSC" (aka GSC+) I/O slots would be
>sufficient to feed the beast. There was, I believe, a two-slot I/O
>expansion card one could install.

No, not necessarily one channel. But few channels, and a channel,
not a bus. With some chip like a south bridge on the other side,
to let hardware handle what it does best: eating larger chunks of
data with no hands-on CPU involvement. If the hardware knows where
stuff goes, there is no need to interrupt the CPU.

>Before the system went off the price-list, HP were shipping a
>four-slot HSC expansion module for the thing.
>
>Going back farther, there was the "DTC" (aka Avesta) on the PA-RISC
>HP3000's (and used with the HP9000s) - all that ugly slow serial stuff
>was put out into the DTC and it was linked with the host via a (by the
>standards of the day) blazing-fast 10 Mbit/s Ethernet link. The X.25
>functionality was offloaded into that thing too.
>
>Trouble was, there ended up being an entire networking stack for
>talking to the DTC...

The error they, and Prime, and DEC, and HP made was to think
outboarding, but use slow links. Think FAST link, sufficiently
fast to run the L2->memory interface on. Like hyperchannel, only
better integrated into the hardware. Maybe we can use hyperchannel
without changing the hardware much.

Besides, we use full network stacks today to talk to USB, SCSI,
IP, SATA, PCI, SCI, Firewire, even the sons of PCMCIA. Just use
them through a single-digit number of really fast links and let
hardware demultiplex that. Exactly what we would do with 10G
ethernet, only with a little lower latency.

-- mrr