From: David Schwartz on
On May 6, 4:40 am, Rainer Weikusat <rweiku...(a)mssgmbh.com> wrote:

> There is none. POLLHUP and POLLERR are two revents-only values which
> report possibly interesting connection state changes other than 'data
> to read available' or 'bufferspace to write data into
> available'. Since both errors and 'hangups' can occur during normal
> I/O operations, the respective input- and output-handlers need to be
> able to deal with these conditions, anyway, and there is no reason to
> care specially for either of both. Because of this, I usually do
> something like
>
>         if (revents & ~(POLLIN | POLLOUT)) revents |= POLLIN;
>
> and let the ordinary input handler deal with it.

I sometimes wonder why 'poll' didn't do that by default, letting you
ignore IN when HUP or ERR is set if you wanted to. Either way would
work, but setting POLLIN would be more consistent with how 'select'
behaves. I wonder if this was felt to be a deficiency in 'select'.

In any event, it's a minor issue. You can certainly treat HUP/ERR the
same way as IN if you want. Sometimes it makes your code more
efficient not to, but it does complicate handling half-open
connections if you need to.
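
In loop form, Rainer's mapping looks something like this (a sketch,
untested; 'handle_input' and 'handle_output' stand in for whatever
per-fd handlers the application has):

  #include <poll.h>

  /* Fold any 'weird' revents bit into POLLIN, then let the ordinary
     input path observe the EOF or the error via read(). */
  void service(struct pollfd *pfds, nfds_t n)
  {
      nfds_t i;

      if (poll(pfds, n, -1) <= 0)
          return;                          /* real code would check errno */

      for (i = 0; i < n; ++i) {
          if (pfds[i].revents & ~(POLLIN | POLLOUT))
              pfds[i].revents |= POLLIN;

          if (pfds[i].revents & POLLIN)
              handle_input(pfds[i].fd);    /* hypothetical handler */
          if (pfds[i].revents & POLLOUT)
              handle_output(pfds[i].fd);   /* hypothetical handler */
      }
  }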

The point is that both 'select' and 'poll' have small wrinkles that
can bite a first-time user. None of this is a big deal though, every
interface is like that.

DS
From: Ersek, Laszlo on
On Thu, 6 May 2010, Rainer Weikusat wrote:

> "Ersek, Laszlo" <lacos(a)caesar.elte.hu> writes:
>> On Wed, 5 May 2010, Rainer Weikusat wrote:
>
> [...]
>
>>> and usually map anything 'weird' which might be returned in revents
>>> to POLLIN. If it is an EOF, read will detect that. The same is true
>>> for any kind of error condition.
>>
>> (I kind of feel a contradiction between this and:
>>
>> ----v----
>> From davids(a)webmaster.com Wed May 5 12:49:19 2010
>> Date: Wed, 5 May 2010 03:49:19 -0700 (PDT)
>> From: David Schwartz <davids(a)webmaster.com>
>> Newsgroups: comp.unix.programmer
>> Subject: Re: experienced opinions
>>
>> [snip]
>>
>> In fact, the only common error I see with select is thinking that
>> 'poll' will return writability or readability if the connection closes
>> or errors.
>> ----^----
>
> [...]
>
> There is none. POLLHUP and POLLERR are two revents-only values which
> report possibly interesting connection state changes other than 'data
> to read available' or 'bufferspace to write data into
> available'. Since both errors and 'hangups' can occur during normal
> I/O operations, the respective input- and output-handlers need to be
> able to deal with these conditions, anyway, and there is no reason to
> care specially for either of both. Because of this, I usually do
> something like
>
> if (revents & ~(POLLIN | POLLOUT)) revents |= POLLIN;
>
> and let the ordinary input handler deal with it.


Oh, now I see it. I misunderstood your topmost sentence:

>> On Wed, 5 May 2010, Rainer Weikusat wrote:

>>> and usually map anything 'weird' which might be returned in revents to
>>> POLLIN. [...]

I interpreted this as

"and usually map anything 'weird' which might be returned in revents to
POLLIN [in *events*]",

while what you actually meant was

"and usually map anything 'weird' which might be returned in revents to
POLLIN [in *revents*]".

The second form corresponds to the code you posted.

The first (misinterpreted) form does match what David describes as an
error, doesn't it? The first form says "I'll just set POLLIN in /events/
and I'll get POLLIN in /revents/ too if anything weird happens". Stuck in
the select() mindset, I am :)
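
Just to fix the correct reading in my head: POLLHUP and POLLERR are
reported whether or not I ask for anything beyond POLLIN. A tiny
sketch, untested:

  #include <poll.h>

  /* POLLHUP/POLLERR cannot be requested via events; poll() reports
     them in revents regardless. */
  int wait_readable(int sock)
  {
      struct pollfd pfd;

      pfd.fd = sock;
      pfd.events = POLLIN;
      pfd.revents = 0;

      if (poll(&pfd, 1, -1) == -1)
          return -1;

      /* Treat hangup/error like readability; the subsequent read()
         returns 0 or -1 and the input handler copes. */
      return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) != 0;
  }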

Thanks again,
lacos
From: Ersek, Laszlo on
On Tue, 4 May 2010, David Schwartz wrote:

> On May 4, 10:01 am, "Ersek, Laszlo" <la...(a)caesar.elte.hu> wrote:
>
>> I like to understand function specifications in depth before calling
>> said functions.
>
> As I recall, 'select' actually takes three FD sets. Does anyone know
> precisely what that third set is for?

SUSv4 XSH 2.10.11 "Socket Receive Queue" [0] and onwards describes
"out-of-band data" and the "out-of-band data mark". The word "segment" is
used there in a logical sense, not as a TCP segment.

The select() spec [1] says:

----v----
The pselect() function shall examine the file descriptor sets whose
addresses are passed in the /readfds/, /writefds/, and /errorfds/
parameters to see whether some of their descriptors are ready for reading,
are ready for writing, or have an exceptional condition pending,
respectively.

[...]

If a socket has a pending error, it shall be considered to have an
exceptional condition pending. Otherwise, what constitutes an exceptional
condition is file type-specific. For a file descriptor for use with a
socket, it is protocol-specific except as noted below. [...]

If a descriptor refers to a socket, the implied input function is the
/recvmsg()/ function with parameters requesting normal and ancillary data,
such that the presence of either type shall cause the socket to be marked
as readable. The presence of out-of-band data shall be checked if the
socket option SO_OOBINLINE has been enabled, as out-of-band data is
enqueued with normal data. [...]

[...]

A socket shall be considered to have an exceptional condition pending if a
receive operation with O_NONBLOCK clear for the open file description and
with the MSG_OOB flag set would return out-of-band data without blocking.
(It is protocol-specific whether the MSG_OOB flag would be used to read
out-of-band data.) A socket shall also be considered to have an
exceptional condition pending if an out-of-band data mark is present in
the receive queue. Other circumstances under which a socket may be
considered to have an exceptional condition pending are protocol-specific
and implementation-defined.
----^----

<rant>

The text lists "state -> exceptional condition" implications. When one
checks a bit in the third fd_set, he needs the reverse direction:
"exceptional condition -> what state?". In my interpretation, at least the
following situations are possible:

(1) Pending error -- use getsockopt(sock, SOL_SOCKET, SO_ERROR, ...).

Continuing with TCP in mind -- TCP should
- support out-of-band data,
- support the out-of-band data mark,
- enqueue out-of-band data at the end of the queue,
- not place ancillary-data-only segments in the queue (that is,
  segments with neither normal nor out-of-band data),

(2) An out-of-band data mark is present in the Receive Queue. (Regardless
of whether SO_OOBINLINE was set.)

Out-of-band data may not be readable without blocking when the third
fd_set fires -- only the mark may be present. Supposing TCP over IP(v4), a
large TCP segment may be fragmented. When the first fragment (containing
the TCP header and thus the urgent pointer) is processed, the mark may be
placed immediately in the queue.

logical segment  | hole to be filled | mark | expected logical segment
with normal data | with normal data  |      | with out-of-band data

For me this means that the third fd_set can't be used at all in a select()
that is meant to block, as such a select may return immediately, and a
subsequent blocking receive call may still block (or a nonblocking one may
still return with -1/EAGAIN (or EWOULDBLOCK), resulting in spinning).

(The figure above is intended to display the in-line enqueueing of
out-of-band data, but it really doesn't matter now -- it would only affect
how that data would be available (or how that data would become lost) once
we got to the mark.)

My approach is to set SO_OOBINLINE. This allows me to work with the first
two sets only. Errors are returned with read()/write(). The finalization
of a pending connect() is signalled as writability, and the result can be
queried via SO_ERROR.
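
For completeness, the SO_ERROR query is the usual getsockopt() dance;
a sketch:

  #include <sys/socket.h>

  /* Fetch (and thereby clear) the pending error on 'sock', e.g. after
     a nonblocking connect() has signalled completion via writability. */
  int pending_error(int sock)
  {
      int err = 0;
      socklen_t len = sizeof err;

      if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &len) == -1)
          return -1;       /* the getsockopt() call itself failed */
      return err;          /* 0 means the connect() succeeded */
  }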

The presence of an out-of-band data *mark* without the OOB data itself
won't wake the select(). When woken and FD_ISSET() reports readability,
out-of-band data is simply checked for with sockatmark() before each
read(). No read() will coalesce normal and OOB data. If sockatmark()
returns 1, the next byte to read() is the urgent byte. (No MSG_OOB is
needed.)
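
In code, the per-read check is roughly this (a sketch, untested):

  #include <sys/socket.h>
  #include <unistd.h>

  /* 'sock' runs with SO_OOBINLINE set; check for the mark before each
     read() so the urgent byte can be recognized in the inline stream. */
  ssize_t read_watching_mark(int sock, char *buf, size_t n, int *at_mark)
  {
      *at_mark = sockatmark(sock);   /* 1: the next byte read is urgent */
      if (*at_mark == -1)
          return -1;
      return read(sock, buf, n);     /* won't coalesce across the mark */
  }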

If the kernel processes multiple TCP headers with urgent pointers before I
get to call sockatmark(), each single urgent byte but the last one will be
inlined in the normal data stream. (I seem to remember that I experimented
with this on Linux, and this was the case even with SO_OOBINLINE turned
off. An urgent byte was only dropped if SO_OOBINLINE was turned off and I
actively read past the mark, before the mark was moved. So I didn't really
care about race conditions.)

(I'm not sure how one could do without SO_OOBINLINE. A TCP segment
carrying a single urgent byte wouldn't wake a select() that didn't pass a
third fd_set. On the other hand, a select() passing a third fd_set could
lead to indefinite spinning, or indefinite blocking.)

The problem with sockatmark() is that it was first introduced in SUSv3. I
wrote my port forwarder for SUSv1. I think I worked around it by calling
select() in a non-blocking manner right before and after the read(), to
see if there was an exceptional condition pending that ceased due to the
read(). (If the "before" condition was true due to an error, then the
"after" condition wasn't checked, because read() returned that error
first.) I chose this as the default workaround.
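
The non-blocking "exceptional condition pending" primitive itself is
simple; a sketch:

  #include <sys/select.h>

  /* Poll the third fd_set with a zero timeout: 1 if an exceptional
     condition is pending on 'sock', 0 if not, -1 on error. Called
     right before and after read(); a 1 -> 0 transition means the
     read() consumed the urgent data. */
  int exceptional_pending(int sock)
  {
      fd_set efds;
      struct timeval tv = { 0, 0 };     /* do not block */

      FD_ZERO(&efds);
      FD_SET(sock, &efds);
      if (select(sock + 1, NULL, NULL, &efds, &tv) == -1)
          return -1;
      return FD_ISSET(sock, &efds) ? 1 : 0;
  }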

I (hopefully) implemented this before-after check "primitive" with recv()
too. Once the SO_OOBINLINE socket signalled readability, I temporarily
turned off SO_OOBINLINE, and tried to read a single byte with recv(...,
MSG_PEEK | MSG_OOB), then turned SO_OOBINLINE back on. A successful
receive meant "at the mark" (and left the urgent byte over to the normal,
subsequent read), -1/EINVAL meant "no mark", and -1/EAGAIN (or
EWOULDBLOCK) meant "mark nearby". (This was no problem, because a
before-after "mark nearby" -> "no mark" transition, due to the normal,
in-lined read(), was interpreted as "urgent data consumed" just the same.)
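
From memory, the probe went roughly like this (a sketch; error checking
of the setsockopt() calls elided):

  #include <errno.h>
  #include <sys/types.h>
  #include <sys/socket.h>

  /* Probe for the OOB mark on a socket that normally runs with
     SO_OOBINLINE set: 1 "at the mark" (the urgent byte is left for the
     next normal read()), 0 "no mark", -1 "mark nearby" or error. */
  int probe_mark(int sock)
  {
      int off = 0, on = 1;
      ssize_t ret;
      char c;

      setsockopt(sock, SOL_SOCKET, SO_OOBINLINE, &off, sizeof off);
      ret = recv(sock, &c, 1, MSG_PEEK | MSG_OOB);
      setsockopt(sock, SOL_SOCKET, SO_OOBINLINE, &on, sizeof on);

      if (ret == 1)
          return 1;                    /* at the mark */
      if (ret == -1 && errno == EINVAL)
          return 0;                    /* no mark */
      return -1;                       /* EAGAIN/EWOULDBLOCK: nearby */
  }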

Looking back, perhaps I could have found a third way: recvmsg() can report
on output, in msg_flags, whether it consumed out-of-band data. Since OOB
(logical) segments don't coalesce with normal (logical) segments,
recvmsg() would have consumed a sole byte at these times. I'm not sure how
I would have had to fiddle with SO_OOBINLINE, though.

I never considered SIOCATMARK.

(If anyone cares, the code is "forward3.c" under [2]; most recently, it's
been running on my workstation since Mar 25 to log SOAP.)

I would never touch out-of-band data again; not with a ten foot pole.

--o--

If we're already talking about what socket functions precisely do, can
anyone judge whether Solaris' and Linux' handling of accept() is
POSIX-conformant when accept() fails with -1/EMFILE (or due to another
resource scarcity)? On Linux, such an event throws away the pending
connection (I think the peer gets a FIN instead of an RST because the
kernel pre-completes the handshake). On Solaris, the incoming connection
remains pending, and one must not FD_SET the listening socket before the
next close(), or else the select()-accept() loop will spin.

Any constructive criticism is greatly appreciated.

</rant>

Cheers,
lacos

[0] http://www.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_10_11
[1] http://www.opengroup.org/onlinepubs/9699919799/functions/select.html
[2] http://lacos.hu
From: David Schwartz on
On May 6, 3:06 pm, "Ersek, Laszlo" <la...(a)caesar.elte.hu> wrote:

> I would never touch out-of-band data again; not with a ten foot pole.

Agreed. It was a bad idea, poorly executed, that has festered since.

> If we're already talking about what socket functions precisely do, can
> anyone judge whether Solaris' and Linux' handling of accept() is
> POSIX-conformant when accept() fails with -1/EMFILE (or due to another
> resource scarcity)? On Linux, such an event throws away the pending
> connection (I think the peer gets a FIN instead of an RST because the
> kernel pre-completes the handshake). On Solaris, the incoming connection
> remains pending, and one must not FD_SET the listening socket before the
> next close(), or else the select()-accept() loop will spin.

The 'select'/'accept' loop should spin in that case. This is why you
must perform operations that reduce resource consumption before
operations that increase them. And if you make no forward progress,
you must implement a rate-limiter.

And if you detect resource exhaustion, you must do something about it!
If the implementation tells you that you have too many open files, it
is not sensible to react by trying to open more files.

I find both behaviors sensible, FWIW. I like Solaris' behavior because,
once you know about it, it's easier to code around deliberately.
Linux's behavior works better if you haven't considered the case
specifically.

Generally, in the case where you can't accept any more incoming
connections, you need/want to close any attempts as quickly as
possible. Sensible clients will interpret this as an overload
condition.
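
One way to keep closing them even under EMFILE is the spare-descriptor
trick; a sketch, and an idiom rather than anything POSIX promises:

  #include <errno.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/socket.h>

  static int spare_fd;   /* open("/dev/null", O_RDONLY) at startup */

  /* On EMFILE, momentarily free the spare descriptor so the pending
     connection can be accepted and immediately closed, instead of
     lingering in (or spinning on) the listen queue. */
  void accept_or_shed(int srv)
  {
      int fd = accept(srv, NULL, NULL);

      if (fd != -1) {
          /* hand 'fd' to the application */
          return;
      }
      if (errno == EMFILE) {
          close(spare_fd);
          fd = accept(srv, NULL, NULL);
          if (fd != -1)
              close(fd);               /* shed the connection */
          spare_fd = open("/dev/null", O_RDONLY);
      }
  }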

DS
From: Ersek, Laszlo on
On Thu, 6 May 2010, David Schwartz wrote:

> Generally, in the case where you can't accept any more incoming
> connections, you need/want to close any attempts as quickly as possible.
> Sensible clients will interpret this as an overload condition.

Thank you for the advice.

I considered closing the server socket and setting it up again, so as to
"flush" all connection requests pending in the listen queue. However, I
was afraid that I might not be able to re-bind the same local address
(even with SO_REUSEADDR, another process might "steal" the port), and then
all would be lost.

I implemented a primitive "rate limiter": no more connections are accepted
until a living socket is closed. A client waiting for an acknowledgement
of its connect() may time out, but for a program of this caliber the
approach seemed workable.
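
In select() terms it amounts to no more than this (a sketch;
'live_sockets' and 'max_sockets' are my bookkeeping, nothing standard):

  #include <sys/select.h>

  /* Only watch the listening socket while below the descriptor budget,
     so select() cannot spin on a pending connection that accept()
     would anyway fail to service. */
  void build_read_set(int srv, fd_set *rfds, int live_sockets,
                      int max_sockets)
  {
      FD_ZERO(rfds);
      if (live_sockets < max_sockets)
          FD_SET(srv, rfds);
      /* ... then FD_SET() the live client sockets as usual ... */
  }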

How could the Solaris implementation be made to refuse new connections
while the "rate limiter" is in effect? Simply by setting up a low backlog
value with the initial listen()? Or by manipulating the backlog
dynamically with repeated listen() calls during peak loads?

(I don't know if that's possible at all. The listen() spec in the SUSv4
[0] doesn't seem to disallow it or to define an error condition for it. It
could be an interesting experiment to see whether, with say 16 connections
pending and waiting for an accept(), a listen(srv, 10) would immediately
reset the last six connection requests; in effect flushing the listen
queue and protecting further clients from waiting.)

Thank you,
lacos

[0] http://www.opengroup.org/onlinepubs/9699919799/functions/listen.html