From: Urs Thuermann on
I am developing a client-server application using TCP sockets. After
connection establishment the server application first writes a
variably-sized small header (typically around 40-60 bytes) followed by
many writes of 256 bytes of data (everything using write(2) on the
socket). The socket doesn't have the O_NONBLOCK flag set and, AFAICS,
no signals are being sent to the process. In case it does matter, the
server has many threads running and serves several clients at the same
time (a couple of threads producing data, one thread listening on the
socket, and one thread per connected client).

I am surprised that sometimes the write system call on the socket
returns with less than 256 bytes written.

My understanding is that according to POSIX, this shouldn't happen:

    When attempting to write to a file descriptor (other than a
    pipe or FIFO) that supports non-blocking writes and cannot
    accept the data immediately:

    * If the O_NONBLOCK flag is clear, write() shall block the
      calling thread until the data can be accepted.

    * If the O_NONBLOCK flag is set, write() shall not block the
      thread. If some data can be written without blocking the
      thread, write() shall write what it can and return the
      number of bytes written. Otherwise, it shall return -1 and
      set errno to [EAGAIN].

Therefore, I expected the write(2) system call to return immediately
with 256 if the send buffer has enough space, or to block until 256
bytes can be written to the send buffer and then also return with 256.


urs
From: Rick Jones on
Perhaps your platform's stack is slightly buggy. Or not strictly
conformant to POSIX. While I don't know of one in this case, POSIX
has been known to have "loopholes."

Drifting...if ever so slightly...

How many of these 256 byte writes does your application make? Is
there a specific reason it is a stream of comparatively tiny 256 byte
writes rather than larger writes? Are they "spread-out" in time or do
they get sent "back-to-back?"

rick jones
--
oxymoron n, Hummer H2 with California Save Our Coasts and Oceans plates
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
From: Urs Thuermann on
Rick Jones <rick.jones2(a)hp.com> writes:

> Perhaps your platform's stack is slightly buggy.

It's a Debian testing Linux with kernel 2.6.32:

urs(a)ha:~$ uname -a
Linux ha 2.6.32-3-686 #1 SMP Thu Feb 25 06:14:20 UTC 2010 i686 GNU/Linux

> How many of these 256 byte writes does your application make? Is
> there a specific reason it is a stream of comparatively tiny 256 byte
> writes rather than larger writes? Are they "spread-out" in time or do
> they get sent "back-to-back?"

There are millions of these small writes. The first few thousand are
written very fast, i.e. as fast as the network connection allows; then
the rate is roughly 24 writes per second, i.e. 48 kbit/s.

In the server several threads are producing data each into its own
FIFO buffer with roughly 48 kbit/s:

char b[4096];
while (1) {
    produce_data(b, sizeof(b));
    write_to_fifo_buffer(b, sizeof(b));
}

For each client connected to the server, I start a thread that selects
one of these buffers and then sends data from it:

char b[256];
select_buffer();
write(sock, small_header, header_size);
while (1) {
    get_from_fifo_buffer(b, sizeof(b));
    n = write(sock, b, sizeof(b));
    if (n < 0) {
        perror("write");
        break;
    } else if (n < sizeof(b)) {
        /* This is the case I didn't expect to hit
         * and currently don't handle properly.
         */
        fprintf(stderr, "Incomplete write...", ...);
    }
}

After a client connects, the buffer typically has some megabytes of
data, so the while loop can send as fast as the network connection
allows. After that, the thread waits in each loop iteration in
get_from_fifo_buffer() on a pthread_mutex and is limited to the 48
kbit/s at which the buffer is filled.
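
Simplified, the consumer side of such a FIFO buffer works roughly
like this; this is only an illustrative sketch, the real code is more
involved and the names and struct layout here are made up:

#include <pthread.h>
#include <stddef.h>

struct fifo {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    char            data[1 << 20];
    size_t          head, tail;     /* free-running byte counters */
};

/* Block on the condition variable until len bytes are buffered,
 * then copy them out. */
void fifo_get(struct fifo *f, char *b, size_t len)
{
    pthread_mutex_lock(&f->lock);
    while (f->head - f->tail < len)
        pthread_cond_wait(&f->nonempty, &f->lock);
    for (size_t i = 0; i < len; i++)
        b[i] = f->data[(f->tail + i) % sizeof f->data];
    f->tail += len;
    pthread_mutex_unlock(&f->lock);
}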

The size of 256 was selected somewhat arbitrarily; I chose a small
value to get less bursty behavior from the thread. But the observed
behavior of the write(2) system call means that I get incomplete
writes and have to change the code to handle the unwritten data in
some way. This would probably also be necessary with a larger buffer
b of, say, 4096 bytes.

I see the incomplete writes very rarely, but when they occur it is
mostly after only a couple of writes on the socket, right after
connection establishment when the rate of writes is still very high.


urs
From: Rick Jones on
Urs Thuermann <urs(a)isnogud.escape.de> wrote:
> Rick Jones <rick.jones2(a)hp.com> writes:
> > Perhaps your platform's stack is slightly buggy.

> It's a Debian testing Linux with kernel 2.6.32:

I cannot say that I do many tests with netperf writing 256 bytes at a
time, but you could try a netperf TCP_STREAM test and/or some "burst
mode" TCP_RR tests with 256 byte sends/requests/responses to see if
you can make it happen with other code.
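
For example, off the top of my head - and note the test-specific -b
burst option is only compiled-in when netperf was configured with
--enable-burst:

netperf -H <remote> -t TCP_RR -l 30 -- -r 256,256 -b 16

That keeps 16 additional 256-byte transactions in flight at one time,
which roughly approximates your back-to-back writes.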

> urs(a)ha:~$ uname -a
> Linux ha 2.6.32-3-686 #1 SMP Thu Feb 25 06:14:20 UTC 2010 i686 GNU/Linux

> > How many of these 256 byte writes does your application make? Is
> > there a specific reason it is a stream of comparatively tiny 256 byte
> > writes rather than larger writes? Are they "spread-out" in time or do
> > they get sent "back-to-back?"

> There are millions of these small writes. The first few thousand are
> written very fast, i.e. as fast as the network connection allows; then
> the rate is roughly 24 writes per second, i.e. 48 kbit/s.

The reason I ask is that sending bulk data 256 bytes at a time isn't
terribly efficient...

> In the server several threads are producing data each into its own
> FIFO buffer with roughly 48 kbit/s:

> char b[4096];
> while (1) {
>     produce_data(b, sizeof(b));
>     write_to_fifo_buffer(b, sizeof(b));
> }

> For each client connected to the server, I start a thread that selects
> one of these buffers and then sends data from it:

> char b[256];
> select_buffer();
> write(sock, small_header, header_size);
> while (1) {
>     get_from_fifo_buffer(b, sizeof(b));
>     n = write(sock, b, sizeof(b));
>     if (n < 0) {
>         perror("write");
>         break;
>     } else if (n < sizeof(b)) {
>         /* This is the case I didn't expect to hit
>          * and currently don't handle properly.
>          */
>         fprintf(stderr, "Incomplete write...", ...);
>     }
> }

> After a client connects, the buffer typically has some megabytes of
> data, so the while loop can send as fast as the network connection
> allows. After that, the thread waits in each loop iteration in
> get_from_fifo_buffer() on a pthread_mutex and is limited to the 48
> kbit/s at which the buffer is filled.

> The size of 256 was selected somewhat arbitrarily; I chose a small
> value to get less bursty behavior from the thread.

Are you also disabling Nagle? That burst of 256 byte sends at the
beginning may get chunked-up by a combination of Nagle and TSO.
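
If you want to rule Nagle out, disabling it is a one-liner - just a
sketch, assuming "sock" is your connected socket:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <stdio.h>

/* Ask TCP to send small writes immediately instead of coalescing
 * them while waiting for ACKs (i.e. disable Nagle). */
int one = 1;
if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
    perror("setsockopt(TCP_NODELAY)");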

You can see the difference in stack efficiency with netperf TCP_STREAM
tests:

netperf -H <remote> -c -C -l 30 -- -m 256

vs

netperf -H <remote> -c -C -l 30 -- -m <something much larger>

If you are setting TCP_NODELAY in your application, add a -D option to
the end of the netperf command lines.

> But the observed behavior of the write(2) system call means that I
> get incomplete writes and have to change the code to handle the
> unwritten data in some way. This would probably also be necessary
> with a larger buffer b of, say, 4096 bytes.

> I see the incomplete writes very rarely, but when they occur it is
> mostly after only a couple of writes on the socket, right after
> connection establishment when the rate of writes is still very high.

That would also be when the Linux stack is still "autotuning" the
socket buffer size - at least if you haven't made an explicit
setsockopt(SO_SNDBUF) call beforehand.

Sending that big burst at the beginning will probably result in a
rather larger than necessary SO_SNDBUF with autotuning - and similarly
for the SO_RCVBUF at the receiver if it recv()s the data as fast as it
arrives.
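
If you would rather not leave that to autotuning in your own code, you
can pin the buffer yourself - a sketch, assuming "sock" is your socket
and the call is made before connecting; note that Linux doubles the
value you request:

#include <sys/socket.h>
#include <stdio.h>

/* Setting SO_SNDBUF by hand switches off send-buffer autotuning
 * for this socket on Linux; the kernel books twice the amount
 * requested to cover overhead. */
int sndbuf = 64 * 1024;
if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
    perror("setsockopt(SO_SNDBUF)");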

If you want explicit socket buffer sizes with netperf, tack on a -s
<size> at the end for the local end and a -S <size> for the remote.

rick jones
--
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
From: David Schwartz on
On Jun 11, 2:39 am, Urs Thuermann <u...(a)isnogud.escape.de> wrote:

> I am surprised that sometimes the write system call on the socket
> returns with less than 256 bytes written.
....
> Therefore, I expected the write(2) system call to return immediately
> with 256 if the send buffer has enough space, or to block until 256
> bytes can be written to the send buffer and then also return with 256.

What happens when you immediately follow up with a 'write' call for
the remaining bytes? Does it succeed? If so, the solution is pretty
obvious (though it's not clear why you should need it): just call
'write' again. You should be doing that anyway.
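
Untested, but the usual shape of that loop is something like the
following (the helper name is mine):

#include <errno.h>
#include <unistd.h>

/* Call write() repeatedly until all of buf has been accepted,
 * retrying on EINTR; returns len on success, -1 on error. */
static ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, just retry */
            return -1;          /* real error, errno is set */
        }
        p += n;
        left -= n;
    }
    return (ssize_t)len;
}

With that in place, a short write just means another trip around the
loop.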

DS