From: D Yuniskis on
Hi,

I'm designing a system using lots of COTS hardware.
IME, few of these things are ever designed with any
*thought* given to their roles in a *system*. Instead,
someone throws together a set of features, wraps some
sort of syntax around them (in/out) and tosses it out
into the market. :<

As such, it's often difficult to know, with certainty,
that all of the devices you are attached to are
actually working, working *properly* or even POWERED UP!

<big frown>

Essentially, I have a processor, touch panel (EIA232),
printer (probably parallel), display, barcode scanner
(EIA232), electronic scale (EIA232) and *possibly*
a keyboard (probably *never* see a mouse!).

Each of these things has smarts. And, each was designed
without concern for any of the others *or* the processor
that talks to all of them.

So, it is possible for the touch panel to get "hung"
(i.e., you can't count on getting valid input from
it!). Or, the printer. Or the barcode scanner. Or
the scale. Or...

The user interface is designed to be incredibly
lightweight. I.e., the user seldom even *looks*
at the display. Also, the individual
components may not be closely colocated (so, you
can't count on the user to *see* that the printer
isn't working, etc.)

The software is set up with a daemon for each device
to (hopefully) detect communication problems, devices
that are powered down or misconfigured, etc. But,
many devices haven't been designed with keep-alive
protocols in mind. And, most don't formally specify
how they behave when you try talking to them
"regularly" (e.g., polling them with configuration
commands that IN THEORY shouldn't affect normal
operation -- but end up doing so! :< )
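
To make the per-device daemon idea concrete, here is a rough sketch
of the bookkeeping one such daemon might keep. Everything in it
(DeviceMonitor, quiet_limit, max_failed_probes) is a made-up name
for illustration, not anything the devices or my code define. It
assumes the daemon only probes a device after it has been silent
for a while, to limit the chance that "harmless" queries disturb
normal operation.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    OK = "ok"
    SUSPECT = "suspect"   # quiet for too long; might just be idle
    DOWN = "down"         # several probes in a row went unanswered


@dataclass
class DeviceMonitor:
    """Hypothetical health tracker owned by one device daemon."""
    name: str
    quiet_limit: float          # seconds of silence before probing
    max_failed_probes: int = 3  # assumed threshold, tune per device
    last_heard: float = 0.0
    failed_probes: int = 0
    health: Health = Health.OK

    def heard(self, now: float) -> None:
        """Call whenever the device sends anything that parses as valid."""
        self.last_heard = now
        self.failed_probes = 0
        self.health = Health.OK

    def probe_failed(self) -> None:
        """Call when a probe got no reply, or a garbled one."""
        self.failed_probes += 1
        if self.failed_probes >= self.max_failed_probes:
            self.health = Health.DOWN

    def tick(self, now: float) -> bool:
        """Return True if the daemon should send a probe now."""
        if self.health is Health.DOWN:
            return True   # keep probing so we notice when it comes back
        if now - self.last_heard > self.quiet_limit:
            self.health = Health.SUSPECT
            return True
        return False
```

The caller passes in the time, so the logic can be exercised
without a real device or a real clock. What a "probe" actually
is would differ per device, and for some there may be no safe
one at all.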

Each "node" is configured independently of the others.
E.g., one might have a barcode scanner but no printer;
another might have a printer but no touch screen; another
might have a scale but no *display*! Managing the
configuration isn't a problem. *But*, the variations
mean that you can't rely on any particular device
being present at *each* node (except the processor).
I.e., you can't just flash messages on a screen;
or tell someone to type "REBOOT", etc.
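
A small sketch of what that variation means for reporting a
problem: the node has to pick whatever user-visible device it
actually has (and that still works). The preference order and
the function name below are assumptions for illustration only.

```python
# Assumed order of preference for telling the user something is wrong.
ALERT_PREFERENCE = ("display", "printer")


def pick_alert_channel(present, healthy):
    """Return the first preferred device that is both configured at
    this node and currently believed healthy, or None if there is
    nothing user-visible left to report through."""
    for dev in ALERT_PREFERENCE:
        if dev in present and dev in healthy:
            return dev
    return None


# A node with a scale but no display or printer has no channel at all.
print(pick_alert_channel({"scale"}, {"scale"}))
```

The None case is the uncomfortable one: the node knows something
is wrong but has no way to say so locally.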

There are only about 30 of these at each location. But,
there won't be any MTS around to support them. So, if
something doesn't *seem* to be working correctly, I need
a simple protocol for (nontechnical) users to get things
back to a known/running condition.

It *seems* like the only realistic AND INTUITIVE protocol
for "recovery" is to cycle power to the devices in
question. Ideally, to *every* device at a node -- though
remembering to do so may be a problem (so I need to deal
with the possibility that some devices might get reset while
others aren't).
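
One way to cope with a partial reset, sketched under the assumption
that each daemon can somehow tell when its device last restarted
(a power-up banner, silence followed by a return, etc. -- not
every device will offer this). The names are hypothetical.

```python
def not_yet_reset(last_restart, recovery_started, devices):
    """List the devices at this node that have NOT been seen to
    restart since the user began the recovery.

    last_restart     -- dict: device name -> time of last observed
                        restart (missing if never observed)
    recovery_started -- time the recovery attempt began
    devices          -- device names configured at this node
    """
    return sorted(
        d for d in devices
        if last_restart.get(d) is None or last_restart[d] < recovery_started
    )


# The printer was power-cycled after recovery began (t=100);
# the scale was not, and the scanner has never been seen to restart.
seen = {"printer": 105.0, "scale": 40.0}
print(not_yet_reset(seen, 100.0, ["printer", "scale", "scanner"]))
```

Anything still on that list after a reasonable wait is a device
the user probably forgot to cycle.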

And, of course, the software has to take measures to
protect pending transactions as this sort of "problem"
can come up at any time.
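
For the pending transactions, one common approach is a small
append-only journal: record the intent durably *before* acting,
record completion afterwards, and on startup look for anything
begun but never completed. A minimal sketch, assuming a local
file that survives the power cycle; the Journal class and the
record layout are invented for illustration.

```python
import json
import os


class Journal:
    """Hypothetical append-only transaction journal (one JSON record per line)."""

    def __init__(self, path):
        self.path = path

    def _append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            # The leading newline keeps a torn line left by an earlier
            # power cut from swallowing this record; the blank lines it
            # creates are skipped when reading.
            f.write("\n" + json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())   # ask the OS to push it to storage

    def begin(self, txn_id, payload):
        self._append({"id": txn_id, "state": "begin", "payload": payload})

    def commit(self, txn_id):
        self._append({"id": txn_id, "state": "commit"})

    def pending(self):
        """Return {txn_id: payload} for transactions begun but not committed."""
        open_txns = {}
        if not os.path.exists(self.path):
            return open_txns
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    continue   # blank separator or torn write; ignore it
                if rec["state"] == "begin":
                    open_txns[rec["id"]] = rec["payload"]
                elif rec["state"] == "commit":
                    open_txns.pop(rec["id"], None)
        return open_txns
```

What to *do* with a pending transaction after a restart (retry,
void, ask the user) depends on the transaction. The journal only
guarantees you know it was in flight.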

What problems am I failing to foresee? Are there any
other (practical) ways of doing this?

Thx,
--don