From: tom.rmadilo on
Last week there was discussion of a problem with the http client which
ships with Tcl. The cause of the problem (an ignored timeout) was code
which places the channel into blocking mode while reading
chunked-encoded transfers. Although it seems like a bad idea on first
read, I think the underlying cause was the monolithic nature of the
original code (before chunked handling was added).

Anyway, Alexandre Ferrieux suggested that I should contribute a new
style (maybe coro), or a direct fix. I don't think a direct fix is
possible, so I went with a new style.

This new style does not use coroutines; although they seem
interesting, I have no experience with them, and I don't see the
commands included in the Tcl distribution.

My compromise is to require at least Tcl 8.5 (not sure of the exact
patch level): I use [chan pending] and, unnecessarily, [lsearch
-nocase -index].

[chan pending] allows you to see how much data is available after a
fileevent. This allows you to safely read more than a single byte.
Strangely, 50% of the time a [chan pending] after a fileevent returns
zero, theoretically meaning that no data is available, even though one
byte is guaranteed. If no bytes are read on that fileevent, the next
fileevent is also followed by zero pending bytes. But if you read one
byte, everything seems to work: the buffer fills up again, allowing a
substantial read.
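
To illustrate the pattern (this is just a sketch with made-up handler
and variable names, not the htclient code):

proc ::readable {chan} {
    set n [chan pending input $chan]
    if {$n == 0} {
        # The fileevent fired but zero bytes are pending: read the one
        # byte that is guaranteed so the buffer can refill.
        append ::buf [read $chan 1]
    } else {
        # Read exactly what is buffered; this never blocks.
        append ::buf [read $chan $n]
    }
    if {[eof $chan]} {
        close $chan
    }
}

chan configure $sock -blocking 0 -translation binary
chan event $sock readable [list ::readable $sock]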

The initial experimental version of htclient is available here:

http://www.junom.com/gitweb/gitweb.perl?p=htclient.git

This version prints debugging info (to a selected channel, default
stderr) for every event. The above link includes several logs of
different requests.

The architecture is very different from the http package. The best way
to explain it is to just try it out, look at the log files, or read
the code.

The only thing required to test it is to source the htclient.tcl
file. The end of the file contains a URL to fetch; you can change this
URL, or comment it out and repeat the command at the prompt.
From: Alexandre Ferrieux on
On Oct 25, 2:45 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> Anyway, Alexandre Ferrieux suggested that I should contribute a new
> style (maybe coro), or a direct fix. I don't think a direct fix is
> possible, so I went with a new style.

Thank you for this. I admit I have not looked at the details, but the
separation of the state machines looks cleaner than the current http
package's (but then, the coverage is not the same yet, right ? ;-).

Also, removing vwait from the library is a Good Idea (tm). It could be
the responsibility of an extra layer to provide the blocking variety
of ::http::geturl (with a vwait of course) for compatibility though.
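
Something like this sketch would do (the completion flag and the
content accessor are invented names, not htclient's actual API):

proc ::htclient::geturl {host port url args} {
    set client [addClient $host $port $url {*}$args]
    # hypothetical flag set by the protocol layer on completion
    vwait ::htclient::clientDone($client)
    return $::htclient::clientContent($client)
}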

A few random questions:

(1) you seem to be avoiding [gets] systematically. Why ?

(2) I don't understand the necessity of [chan pending] in this code
(are you targeting DoS attacks as explained in the manpage ?)

(3) you're parsing hex bytes (for chunk lengths) by hand. Any reason
not to use [scan] ?

(4) clientHeaders could benefit from becoming an array instead of a
linearly-scanned list, especially when more headers are taken into
account, don't you think ?

-Alex
From: tom.rmadilo on
On Oct 27, 1:10 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
> On Oct 25, 2:45 am, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
>
>
> > Anyway, Alexandre Ferrieux suggested that I should contribute a new
> > style (maybe coro), or a direct fix. I don't think a direct fix is
> > possible, so I went with a new style.
>
> Thank you for this. I admit I have not looked at the details, but the
> separation of the state machines looks cleaner than the current http
> package's (but then, the coverage is not the same yet, right ? ;-).

Absolutely right, this only does GET (actually that's not completely
true) and doesn't have anywhere near the functionality of the current
http client. I don't really think that the code which creates the
request should be on the same level as the protocol. Here is the basic
API:

::htclient::addClient host port url {method GET} {headers {}} \
    {content {}} {version 1.1} {timeout 0} args

This automatically generates the host header, and if content is
provided, it adds a content-length header. The version can be used to
avoid chunked transfer encoding (use 1.0). So you could use POST or
HEAD. POST probably works, HEAD would require some special code to
realize that no content is expected, but it may work anyway since the
content-length header isn't currently used to find the end of content.
There are a bunch of rules for figuring this out and I didn't care to
spend any time on it.
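
For example (hypothetical calls; I'm not showing the exact
header-list format here):

# plain GET over HTTP/1.1
::htclient::addClient www.example.com 80 /index.html

# POST of a small body over HTTP/1.0 (host and content-length headers
# are generated automatically; empty headers list)
::htclient::addClient www.example.com 80 /submit POST {} {name=tcl} 1.0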

> Also, removing vwait from the library is a Good Idea (tm). It could be
> the responsibility of an extra layer to provide the blocking variety
> of ::http::geturl (with a vwait of course) for compatibility though.
>

If you look at the next-to-last modification in the git repository,
you will notice that I added a vwait on a per-client basis. This
basically serializes requests, so I removed it. My suggestion is to
use wish, or somehow make sure the event loop is already running (or
do requests in batches?).

Something completely missing right now, although it was one of my
original goals, is a timeout: I wanted a non-blocking client with a
timeout, and there is none. However, a fixed timeout value seems not
exactly right. What might be more useful is a rate-limit monitor: if
the short-term transfer rate falls below some value, consider
aborting, or put a timer on and check back later. Again, this doesn't
need to be a part of the protocol layer; there are probably a number
of good ways to do this.
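
A rough sketch of what I mean (the abort command is hypothetical;
clientContentLength is the array described later in this post):

# Check once a second; abort if fewer than minRate bytes arrived
# since the last check.
proc ::rateMonitor {client minRate {last 0}} {
    set now $::htclient::clientContentLength($client)
    if {$now - $last < $minRate} {
        ::htclient::abortClient $client   ;# hypothetical
    } else {
        after 1000 [list ::rateMonitor $client $minRate $now]
    }
}
after 1000 [list ::rateMonitor $clientID 1024]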

The current version of this code will likely become a server-testing
version, and a production version would have nearly all the logging
code removed. One log per event helps in code development and in
understanding what is going on, but it is obviously not useful for
real-world use.

> A few random questions:
>
>  (1) you seem to be avoiding [gets] systematically. Why ?

HTTP status line and headers must be parsed char by char. Many
potential security issues are attributed to not correctly parsing
headers before passing them on. I decided to spend the time parsing
the headers into tokens/words. This doesn't completely validate the
headers, each of which has a special form (who knows why), so each
header will need an individual interpreter.

For instance, the date header: there are three date formats accepted
as valid. Parsing these different formats as a string is not pretty,
but once the date is tokenized, it is easy to figure out which type is
in use, or whether it is invalid. To turn the tokens back into a date,
a simple [format] can be used.
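
For reference, the three accepted formats are RFC 1123, RFC 850, and
C asctime. A string-level sketch using [clock scan] (the shortcut I
avoid, shown here only to illustrate the formats):

proc parseHttpDate {value} {
    foreach fmt {
        {%a, %d %b %Y %H:%M:%S GMT}
        {%A, %d-%b-%y %H:%M:%S GMT}
        {%a %b %e %H:%M:%S %Y}
    } {
        if {![catch {clock scan $value -format $fmt -gmt 1} secs]} {
            return $secs
        }
    }
    return -code error "invalid HTTP date: $value"
}

parseHttpDate {Sun, 06 Nov 1994 08:49:37 GMT}  ;# -> 784111777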

One security issue with headers has to do with whitespace between the
field name and the colon. The standards say you have to reject headers
of this form. I don't know if it is important for a client, but the
code will serve as a template for a server or proxy.
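
The check itself is simple once you have a raw line in hand (a
sketch; htclient actually works char by char):

# Reject a header line whose field name is followed by whitespace
# before the colon.
if {[regexp {^[^:\s]+\s+:} $line]} {
    return -code error "malformed header: whitespace before colon"
}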

Also, tokenizing is required to remove "bad whitespace" and "optional
whitespace" before either passing on the header, or interpreting the
content.

>  (2) I don't understand the necessity of [chan pending] in this code
> (are you targeting DoS attacks as explained in the manpage ?)

Otherwise the [read] would block until all content is received. This
could allow a DoS attack, although I'm not sure if it would keep
other clients from continuing. At some point you have to stop reading
content and start reading the <CR><LF> and then the chunk
size<CR><LF>, so you can't just read everything at once.
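
To show the framing I mean (a deliberately simplified blocking
sketch, using the [gets]/[scan] shortcuts discussed below rather than
the event-driven htclient code):

proc readChunked {chan} {
    set body ""
    while 1 {
        # chunk-size line: hex digits followed by <CR><LF>
        scan [gets $chan] %x size
        if {$size == 0} break
        append body [read $chan $size]
        gets $chan   ;# consume the <CR><LF> after the chunk data
    }
    return $body
}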

>  (3) you're parsing hex bytes (for chunk lengths) by hand. Any reason
> not to use [scan] ?

I guess I could wrap it in a [catch] and use [gets]? You then have to
figure out why it failed. You also get the same issue as with reading
headers: it isn't safe. I also think that a production client would
limit the chunk size and abort when the value gets too big. If you
build the number up slowly, this is easy to check; if you [scan]
first, I'm not sure what the possible problems are.
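
The incremental approach looks roughly like this (names are
illustrative):

# Fold one hex digit into the running chunk size, rejecting a
# runaway value as soon as it crosses a cap.
proc addHexDigit {size char {max 1048576}} {
    set digit [string first [string tolower $char] "0123456789abcdef"]
    if {$digit < 0} {
        return -code error "invalid hex digit: $char"
    }
    set size [expr {$size * 16 + $digit}]
    if {$size > $max} {
        return -code error "chunk size exceeds limit"
    }
    return $size
}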

The main problem is that HTTP is not a line-oriented protocol, at
least the current version isn't. Since you have to parse headers,
reading one char at a time isn't really a problem (the read is from a
buffer).

>  (4) clientHeaders could benefit from becoming an array instead of a
> linearly-scanned list, especially when more headers are taken into
> account, don't you think ?

Unfortunately you have to maintain header order, and in some cases you
have to combine same-named headers into a comma-separated list. But
the basic reason is that the current format allows the developer to use
[lsearch -inline -all -nocase] to search for all headers of a given
name. I provided two examples to demonstrate how to search for
headers. Also, note that clientHeaders is itself an array: each
client's headers are stored in clientHeaders($client). This gives
additional code easy access to the headers if you know the clientID.
There is another array, clientContentLength, with the same keys. As
new data is appended to clientContent, the length is incremented. It
should be easy to set up a trace on this variable and track progress.
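
For example (assuming the field name sits at index 0 of each header
entry; the actual index is in the code):

# all set-cookie headers for one client
set hdrs $::htclient::clientHeaders($client)
set cookies [lsearch -inline -all -nocase -index 0 $hdrs set-cookie]

# progress tracking via a variable trace on the length array
proc ::progress {name1 name2 op} {
    puts stderr "client $name2: content length updated"
}
trace add variable ::htclient::clientContentLength write ::progress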

Try running htclient with multiple requests before you do a vwait (or
use wish). Tcl's very egalitarian event loop gives every client a
turn, although the order gets scrambled over the history of the
responses; this just indicates that each client is independent of the
others (I think?).
From: tom.rmadilo on
On Oct 27, 1:10 am, Alexandre Ferrieux <alexandre.ferri...(a)gmail.com>
wrote:
>  (4) clientHeaders could benefit from becoming an array instead of a
> linearly-scanned list, especially when more headers are taken into
> account, don't you think ?

Shoot, almost forgot: the set-cookie header cannot be combined with
the comma separator; you have to maintain two separate keys with the
same name. Also, the exact header field search is [lsearch -inline
-all -nocase -index ...], which should make it easy to write specific
header handling code.

From: Alexandre Ferrieux on
On Oct 27, 8:04 pm, "tom.rmadilo" <tom.rmad...(a)gmail.com> wrote:
>
> >  (1) you seem to be avoiding [gets] systematically. Why ?
>
> HTTP status line and headers must be parsed char by char. Many
> potential security issues are attributed to not correctly parsing
> headers before passing them on. [...]
> I don't know if it is important for a client, but the
> code will serve as a template for a server or proxy.

Ah OK. Let me just point out that for a generic enabler like the HTTP
client library in Tcl, the dominant pattern is very far from this
corner case. I'd risk a 99.9%-0.1% balance between simple client and
proxy. I'm saying this because this decision to work hard on the
extreme case has a cost in code size and readability, which is too bad
since you came here to improve on the existing code ;-)

> >  (2) I don't understand the necessity of [chan pending] in this code
> > (are you targeting DoS attacks as explained in the manpage ?)
>
> Otherwise the [read] would block until all content is received.

Uh ? You're already using nonblocking reads and fileevents. So ?

> >  (3) you're parsing hex bytes (for chunk lengths) by hand. Any reason
> > not to use [scan] ?
>
> I guess I could wrap it in a [catch] and use [gets]? You then have to
> figure out why it failed. You also get the same issue as with reading
> headers: it isn't safe. I also think that a production client would
> limit the chunk size and abort when the value gets too big. If you
> build the number up slowly, this is easy to check; if you [scan]
> first, I'm not sure what the possible problems are.

So really you're doing all this in case a _server_ would attack the
client ? That's your right of course, but that dimension is orthogonal
to what you set out to fix in the first place (readability and
modularity of the generic HTTP client library).

> The main problem is that HTTP is not a line-oriented protocol, at
> least the current version isn't. Since you have to parse headers,
> reading one char at a time isn't really a problem (the read is from a
> buffer).

At some point heavy char-by-char processing at script level will cost
you more than what the buffered input saves in syscalls...

> >  (4) clientHeaders could benefit from becoming an array instead of a
> > linearly-scanned list, especially when more headers are taken into
> > account, don't you think ?
>
> Unfortunately you have to maintain header order, and in some cases you
> have to combine same-named headers into a comma-separated list.

Header order, yes, but only within the multiple values of a given
header, right ?
This is easily done with

lappend arr($header) $value

> But
> the basic reason is that the current format allows the developer to use
> [lsearch -inline -all -nocase] to search for all headers of a given
> name.

Normalize on writing and you get the same:

lappend arr([string toupper $header]) $value

-Alex