From: Ersek, Laszlo on
On Wed, 30 Jun 2010, arnuld wrote:

> Problem is how can I be sure that particular length of data will arrive
> in recv(). It can come in any of these partial recv()s
>
> 1st recv(): Content-L
> 2nd recv(): ength: 1345
>
> 1st recv(): Conte
> 2nd recv(): -Length: 1234
>
> 1st recv(): Content-Length:
> 2nd recv(): 1234
>
> 1st recv(): Content-Leng
> 2nd recv(): th: 1234

Yes. You'll have a parser state which stands for "parsing the
Content-Length header". This parser state (and the whole parser itself)
can be implemented on various levels of sophistication. Probably one of
the simplest is:


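enum state {
  ST_0,
  ST_1,
  ST_CONTENT_LENGTH,
  ST_READ_BODY,
  /* ... */
};

struct parser
{
  char unsigned *recvbuf;   /* bytes received but not yet parsed */
  size_t alloc, used;       /* allocated size and bytes in use */
  enum state state;         /* current parser state */
  size_t content_length;    /* valid once ST_CONTENT_LENGTH is done */
  /* ... */
};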


and you have a function which adds data to "recvbuf" (managing "alloc" and
"used" as well), and once added, tries to kick the "state" member as far
as possible. (For more, see below.) After processing some data out of
recvbuf, you may memmove() the rest to the beginning of "recvbuf". (This
is terribly inefficient, but easier to implement and is enough for
discussion, hopefully.)

So you have a feed() function which takes the newly received bytes (in
fact you'd probably read() directly into recvbuf, but let's ignore that
for a moment), "appends" them to recvbuf, and then retries to handle the
current state (ie. advance out of the current state, as far as possible,
using the recently received bytes). This can be implemented by
state-dependent function pointers or with switch statements, among others.
The state handler should be level-triggered, like select(). For example,
you should be able to call it consecutively, without adding any new bytes,
with no harm. (This is not a hard requirement at all, just a very basic
idea for discussion).
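
The "append" half of such a feed() function could look roughly like the sketch below. It is only an illustration: the trimmed-down struct, the doubling growth policy, the initial size of 64 bytes and the return convention (0 on success, -1 on failure) are assumptions made here, not something fixed by the description above. After a successful call, the caller would retry the handler for the current state.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down parser: only the buffer bookkeeping. */
struct parser {
  unsigned char *recvbuf; /* heap buffer holding unparsed bytes */
  size_t alloc;           /* bytes allocated for recvbuf */
  size_t used;            /* bytes currently stored in recvbuf */
};

/*
  Append len bytes to p->recvbuf, growing the buffer if needed.
  Returns 0 on success, -1 on allocation failure or size overflow.
*/
static int
feed(struct parser *p, const unsigned char *data, size_t len)
{
  assert(0 != p && p->used <= p->alloc);

  if (len > (size_t)-1 - p->used) {
    return -1; /* p->used + len would not fit in a size_t */
  }

  if (p->used + len > p->alloc) {
    /* assumed policy: start at 64 bytes, then double */
    size_t newalloc = p->alloc ? p->alloc : 64u;
    unsigned char *tmp;

    while (newalloc < p->used + len) {
      if (newalloc > (size_t)-1 / 2u) {
        newalloc = p->used + len; /* cannot double any more */
        break;
      }
      newalloc *= 2u;
    }

    tmp = realloc(p->recvbuf, newalloc);
    if (0 == tmp) {
      return -1; /* old buffer is still valid and still owned by p */
    }
    p->recvbuf = tmp;
    p->alloc = newalloc;
  }

  if (len > 0u) {
    memcpy(p->recvbuf + p->used, data, len);
  }
  p->used += len;
  return 0;
}
```

A zero-initialized struct parser is a valid empty parser for this sketch, because realloc() on a null pointer behaves like malloc().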

Anyway, the following could be a pseudo-implementation that handles the
ST_CONTENT_LENGTH state:

static int
try_to_advance(struct parser *p)
{
  switch (p->state) {
    /* ... */

    case ST_CONTENT_LENGTH:
      {
        /* "Content-Length:" in ASCII */
        static const char unsigned hdr_content_length[] = {
          0x43u, 0x6fu, 0x6eu, 0x74u, 0x65u, 0x6eu, 0x74u, 0x2du, 0x4cu,
          0x65u, 0x6eu, 0x67u, 0x74u, 0x68u, 0x3au
        };
        const char unsigned *newline; /* newline terminating the header */
        size_t advlen; /* size of parsed header, newline included */

        if (sizeof hdr_content_length > p->used) {
          /* header too short, come back with more */
          return 0;
        }

        if (0 != memcmp(hdr_content_length, p->recvbuf,
            sizeof hdr_content_length)) {
          /* header invalid, shed client */
          return -1;
        }

        if (0 == (newline = memchr(p->recvbuf + sizeof hdr_content_length,
            0x0A, p->used - sizeof hdr_content_length))) {
          /* header not yet terminated, come back with more */
          return 0;
        }

        /* We have a terminated line, try to parse the integer */
        if (-1 == x_str_to_size_t(&p->content_length,
            p->recvbuf + sizeof hdr_content_length, newline)) {
          /* malformed header, terminate client's connection in outer loop */
          return -1;
        }

        /*
          p->content_length is set up here. Remove the parsed part, advance
          p->state, and fall through to the next case label (ie. state).
        */
        advlen = (size_t)(newline - p->recvbuf) + 1u;
        p->used -= advlen;
        (void)memmove(p->recvbuf, newline + 1u, p->used);
        p->state = ST_READ_BODY;
      }
      /* FALLTHROUGH */

    case ST_READ_BODY:
      /* we can rely on p->content_length being filled in here */
      /* ... */

    /* ... */
  }
}


Note that theoretically the header-parsing code has a worst-case behavior
that is (at least) quadratic in time. This is because we restart memchr()
from the same point after each recv(). We could save the offset where we
gave up the last time and retry only from there. (More precisely, we could
save the state *within* ST_CONTENT_LENGTH in more detail.) But in this
form, if each recv() reads a single byte before we find the newline, we
check 1 + 2 + 3 + 4 + ... bytes until we succeed.
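
Saving the offset could be sketched as follows. The struct and the "scanned" member are hypothetical (they are not part of the parser struct shown earlier); the idea is only that each byte is passed to memchr() at most once while we wait for the newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scan state; "scanned" is an addition for this sketch. */
struct line_scan {
  const unsigned char *buf; /* start of the unparsed data */
  size_t used;              /* number of valid bytes in buf */
  size_t scanned;           /* leading bytes known to contain no NL */
};

/*
  Return a pointer to the first ASCII NL in the buffer, or a null
  pointer if none has arrived yet. Bytes searched by an earlier call
  are skipped.
*/
static const unsigned char *
find_newline(struct line_scan *s)
{
  const unsigned char *nl;

  assert(s->scanned <= s->used);
  if (s->scanned == s->used) {
    return 0; /* nothing new since the last call */
  }

  nl = memchr(s->buf + s->scanned, 0x0A, s->used - s->scanned);
  if (0 == nl) {
    /* remember how far we got, so the next call resumes from here */
    s->scanned = s->used;
  }
  return nl;
}
```

Whenever parsed data is removed from the front of the buffer (the memmove() step), "scanned" would have to be reset to zero, or reduced by the number of bytes removed.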


Going back to your examples above,

> 1st recv(): Content-L
> 2nd recv(): ength: 1345
>
> 1st recv(): Conte
> 2nd recv(): -Length: 1234
>
> 1st recv(): Content-Length:
> 2nd recv(): 1234
>
> 1st recv(): Content-Leng
> 2nd recv(): th: 1234

the "sizeof hdr_content_length > p->used" check will hold in all cases
after the first recv(). Now suppose

1st recv(): Content-Length: 123
2nd recv(): 4\n

then no newline will be found after the first recv().

x_str_to_size_t() does a number of things. I'd probably base it on
strtol() (although that would require the platform to be ASCII-based,
since our protocol is ASCII-based, and that would simplify the
initialization of hdr_content_length too). x_str_to_size_t() must check
whether the value can be parsed as a size_t (in fact, the allowed range
would be

[1 .. min { LONG_MAX, (size_t)-1 }]

) and that the parsed decimal string ends where we found the newline (to
exclude "Content-Length: 1234XXXX\n"). Whitespace between the colon ":"
and the beginning of the subject sequence of strtol() (ie. the decimal
string) is swallowed by strtol().
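
A minimal sketch of such an x_str_to_size_t(), under the assumptions just listed, might be the following. The 64-byte temporary buffer is an arbitrary limit picked for this example. The copy is needed because strtol() expects a NUL-terminated string and recvbuf is not one.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
  Parse the decimal number found in [begin, end) into *out.
  Returns 0 on success, -1 if the text is malformed or out of range.
*/
static int
x_str_to_size_t(size_t *out, const unsigned char *begin,
    const unsigned char *end)
{
  char tmp[64]; /* assumed upper bound on the length of the value */
  char *stop;
  long val;
  size_t len;

  assert(0 != out && begin <= end);
  len = (size_t)(end - begin);
  if (len >= sizeof tmp) {
    return -1; /* absurdly long value */
  }

  /* make a NUL-terminated copy for strtol() */
  memcpy(tmp, begin, len);
  tmp[len] = '\0';

  errno = 0;
  val = strtol(tmp, &stop, 10);
  if (stop == tmp                          /* no digits at all */
      || stop != tmp + len                 /* trailing garbage, embedded NUL */
      || 0 != errno                        /* ERANGE: beyond LONG_MAX */
      || val < 1                           /* zero or negative length */
      || (unsigned long)val > (size_t)-1) { /* does not fit in a size_t */
    return -1;
  }

  *out = (size_t)val;
  return 0;
}
```

Note that strtol() also accepts a leading sign, so "+1234" passes this check, and that whitespace between the digits and the newline (a carriage return, for example) is rejected as trailing garbage.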

Note that the above "protocol" is not HTTP at all.

This stuff is very messy and it is very easy to introduce undefined
behavior. (I probably did in the code above.) That's why I would suggest
describing a protocol so that the parser's implementation can be
generated. The generated parser should not try to read the data itself,
but expect the programmer to feed it.

lacos
From: Rainer Weikusat on
Nicolas George <nicolas$george(a)salle-s.org> writes:
> Rainer Weikusat wrote in message <87eifosg5m.fsf(a)fever.mssgmbh.com>:
>> When data is read into a buffer of some maximum size and then parsed,
>> anyway, your assertion that 'using \n as line terminator would be
>> annoying' doesn't make any sense anymore, at least to me. Care to
>> elaborate what those 'annoyances' are supposed to be?
>
> Knowing in advance the size of the data avoids all the dynamic
> reallocation:

There is no need to do 'dynamic reallocation' when parsing the contents
of some input buffer provided the buffer size is larger than the
record size. Usually, one would have a start pointer and a 'run'
pointer and whenever the run pointer points to a \n (or any other kind
of 'record terminating marker'), a record with a length of run - start
starting at start has been found.
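
As a rough illustration of the start/run pointer idea, assuming a caller-supplied callback per record (the callback type, the tally struct and the function names are made up for this sketch):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type, invoked once per complete record. */
typedef void record_fn(const char *rec, size_t len, void *ctx);

/* Example context and callback: count records and their total size. */
struct tally {
  size_t records;
  size_t bytes;
};

static void
count_record(const char *rec, size_t len, void *ctx)
{
  struct tally *t = ctx;

  (void)rec; /* contents are not inspected in this example */
  ++t->records;
  t->bytes += len;
}

/*
  Scan the first "used" bytes of buf for '\n'-terminated records and
  report each one (without its terminator) to cb. Returns the number of
  bytes consumed; anything after that is an incomplete record which the
  caller keeps for the next round.
*/
static size_t
scan_records(const char *buf, size_t used, record_fn *cb, void *ctx)
{
  const char *start = buf;      /* first byte of the current record */
  const char *run = buf;        /* byte being examined */
  const char *end = buf + used; /* one past the last valid byte */

  assert(0 != cb);
  while (run < end) {
    if ('\n' == *run) {
      cb(start, (size_t)(run - start), ctx);
      start = run + 1;
    }
    ++run;
  }
  return (size_t)(start - buf);
}
```

With a fixed-size buffer, the caller would move the unconsumed tail to the front before reading more, and treat a full buffer without any terminator as a protocol error.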

[...]

>
>> IIRC, the last time I saw an actual character-based terminal was about
>> a decade ago and it was already a rare curiosity at that time. Also
>
> So what?

You claimed that

,----
| The original idea is that the terminals randomly added \r in what they
| emitted and sometimes required them to display things properly.
|
| Fortunately, the days where network protocols were directly connected to a
| terminal ended a good decade ago.
`----

A decade ago, nobody was using character-based terminals anymore, and
especially not 'connecting them to network protocols' whatever that is
supposed to mean. In addition to this,

>> the original SMTP RFC (822) specifically allowed both \r and \n as
>> part of the user data (this has meanwhile been retracted) and
>> consequently, at least the SMTP line terminator must be something
>> different from both, independently of what you were referring
>> to above.
>
> And what is it supposed to prove?

That SMTP needs to terminate lines with something other than \r or \n
because the original SMTP RFC specifically allowed \r or \n as part of
the data payload. Neither this RFC (nor any other I am aware of)
contains anything regarding the need to work around broken 'terminals'
of any kind and especially not 'broken terminals' which were in use
only a decade ago, as you stated.
From: Ersek, Laszlo on
On Wed, 30 Jun 2010, Ersek, Laszlo wrote:

> On Wed, 30 Jun 2010, arnuld wrote:
>
>> 1st recv(): Content-L
>> 2nd recv(): ength: 1345
>>
>> 1st recv(): Conte
>> 2nd recv(): -Length: 1234
>>
>> 1st recv(): Content-Length:
>> 2nd recv(): 1234
>>
>> 1st recv(): Content-Leng
>> 2nd recv(): th: 1234
>
> the "sizeof hdr_content_length > p->used" check will hold in all cases after
> the first recv().

Except in the third one, sorry. In that case, memchr() will search zero
bytes, and then return with a null pointer. (No ASCII NL found.)

I changed >= to > in the first check, but then failed to update this
section completely.

lacos
From: Nicolas George on
Rainer Weikusat wrote in message <87aaqcsd19.fsf(a)fever.mssgmbh.com>:
> There is no need to do 'dynamic reallocation' when parsing the contents
> of some input buffer provided the buffer size is larger than the
> record size.

I do not know how you like to program, or even if you can program at all,
but when I design a program, I like it to be able to deal with big inputs
when necessary, but not allocate huge amounts of memory each time it reads
a few dozen octets.

Now, I do not know why I need to explain this: either you have already
implemented anything remotely related to network protocols and what I wrote
should be obvious, or you have not and I suggest you try some before
annoying everyone here further.

> A decade ago, nobody was using character-based terminals anymore, and
> especially not 'connecting them to network protocols' whatever that is
> supposed to mean. In addition to this,

> That SMTP needs to terminate lines with something other than \r or \n
> because the original SMTP RFC specifically allowed \r or \n as part of
> the data payload. Neither this RFC (nor any other I am aware of)
> contains anything regarding the need to work around broken 'terminals'
> of any kind and especially not 'broken terminals' which were in use
> only a decade ago, as you stated.

You really do not want to understand anything ever, do you?
From: Rainer Weikusat on
Nicolas George <nicolas$george(a)salle-s.org> writes:
> Rainer Weikusat wrote in message <87aaqcsd19.fsf(a)fever.mssgmbh.com>:
>> There is no need to do 'dynamic reallocation' when parsing the contents
>> of some input buffer provided the buffer size is larger than the
>> record size.
>
> I do not know how you like to program, or even if you can program at
> all, but when I design a program, I like it to be able to deal with big inputs
> when necessary, but not allocate huge amounts of memory each time it reads
> a few dozen octets.

Fine. Back to square one: Assuming you send an a priori unknown record
size which has neither a practical nor a theoretical limit, you
may need to do 'dynamic buffer reallocation' after having received the
length and possibly even while receiving the length, so this buys you
exactly nothing. In the real world, sizes of 'records' used for
network communication are usually bounded, so this issue doesn't exist.

[...]

>> A decade ago, nobody was using character-based terminals anymore, and
>> especially not 'connecting them to network protocols' whatever that is
>> supposed to mean. In addition to this,
>
>> That SMTP needs to terminate lines with something other than \r or \n
>> because the original SMTP RFC specifically allowed \r or \n as part of
>> the data payload. Neither this RFC (nor any other I am aware of)
>> contains anything regarding the need to work around broken 'terminals'
>> of any kind and especially not 'broken terminals' which were in use
>> only a decade ago, as you stated.
>
> You really do not want to understand anything ever, do you?

So far, you have posted a couple of assertions I have refuted two
times and your only 'argument' has been 'being abusive'. I understand
that you are probably just a jerk. Better?