From: Ole-Hjalmar Kristensen on
I see the C++ version uses fgets, which is a reasonably fast way of
getting a line. Is there any reason not to use fgets?
fgets (OpenSolaris version) uses memccpy to actually copy the buffer
until the newline:

43 extern int _filbuf();
44 extern char *memccpy();
45
46 char *
47 fgets(ptr, size, iop)
48 char *ptr;
49 register int size;
50 register FILE *iop;
51 {
52 char *p, *ptr0 = ptr;
53 register int n;
54
55 if ( !(iop->_flag & (_IOREAD|_IORW)) ) {
56 iop->_flag |= _IOERR;
57 return (NULL);
58 }
59
60 for (size--; size > 0; size -= n) {
61 if (iop->_cnt <= 0) { /* empty buffer */
62 if (_filbuf(iop) == EOF) {
63 if (ptr0 == ptr)
64 return (NULL);
65 break; /* no more data */
66 }
67 iop->_ptr--;
68 iop->_cnt++;
69 }
70 n = MIN(size, iop->_cnt);
71 if ((p = memccpy(ptr, (char *) iop->_ptr, '\n', n)) != NULL)
72 n = p - ptr;
73 ptr += n;
74 iop->_cnt -= n;
75 iop->_ptr += n;
76 _BUFSYNC(iop);
77 if (p != NULL)
78 break; /* found '\n' in buffer */
79 }
80 *ptr = '\0';
81 return (ptr0);
82 }
>>>>> "GB" == Georg Bauhaus <rm.dash-bauhaus(a)futureapps.de> writes:

GB> Ole-Hjalmar Kristensen schrieb:
>> Character'read could very well be slower if there is no buffering
>> involved. The way to read efficiently is to read a large buffer
>> (multiple of disk block size) and chop it into lines yourself.
>>

GB> Indeed, stream attributes won't help.

GB> However, replacing Text_IO.Get_Line---which we use now---
GB> might still be considered controversial:

GB> - Is "chopping" an acceptable interpretation of "read line by line"?

GB> - writing a correct double buffering scheme (or a buffer
GB> with end-of-buffer triggers for Read_A_Line_From_Buffer)
GB> is still a bit of work, unless there is something we could
GB> copy.

GB> A package containing a quick getline will definitely be
GB> useful in general Unix programming, I should think.
GB> What is being used in the Socket packages, BTW?

GB> Equally useful will be a package for fast unbounded strings.
GB> Replace_Slice is what prevents showcasing GNAT's Spitbol
GB> pattern matching as one of the fastest things on this planet!

--
C++: The power, elegance and simplicity of a hand grenade.
From: Georg Bauhaus on
Ole-Hjalmar Kristensen schrieb:
> I see the C++ version uses fgets, which is a reasonably fast way of
> getting a line. Is there any reason not to use fgets?
> fgets (OpenSolaris version) uses memccpy to actually copy the buffer
> until the newline:

There were two reasons not to import fgets:

1 - fgets turns out not to be faster than Stream_IO.
2 - It's C. ;-)


The fasta entry at the Shootout which uses Unchecked_Conversion,
not 'Address could even be improved a small bit if we change
it to use 'Address. But then, this adds to the line count.
Is it worth it, for a shorter program?

There was a discussion leading to this
(<4a7b4daa$0$31333$9b4e6d93(a)newsspool4.arcor-online.net>)
early in August.
From: Ole-Hjalmar Kristensen on
Or you could use the Unix read system call and roll your own get_line function
along these lines. It will read 1000000 lines of 60 characters line by line in about a second.

with Interfaces.C_Streams; use Interfaces.C_Streams;
with Ada.Text_Io;

procedure Get_Line_Test is
In_Buf : String(1..8192);
for In_Buf'Alignment use 8192;

First: Integer := In_Buf'last;
Last: Integer := 0;
N : Size_T := 1;

function Read(Fd : Int; Buffer : Voids; Nbyte : Size_T) return Size_T;
pragma Import(C,read);

function Empty_Buffer return Boolean is
begin
return First > Last;
end Empty_Buffer;
pragma Inline(Empty_Buffer);

function End_File return Boolean is
begin
return N <= 0;
end End_File;
pragma Inline(End_File);

-- Get line from stdin, return "" if end of file
function Get_Line return String is
begin
-- Get buffer from file if empty
if Empty_buffer then
N := Read(0,In_Buf(1)'Address, Size_T(In_Buf'Last));
First := 1;
Last := Integer(N);
end if;
if Empty_Buffer then
return "";
end if;

-- Look for end of line and return it
for I in First..Last loop
if In_Buf(I) = Ascii.LF then
declare
Start : Integer := First;
begin
First := I + 1;
return In_Buf(Start..I-1);
end;
end if;
end loop;

-- Did not find end of line, seek further
declare
Copy: string := In_Buf(First..Last);
begin
Last := 0;
return Copy & Get_Line;
end;
end Get_Line;
pragma Inline(Get_Line);

begin
while not End_File loop
declare
Line : String := Get_Line;
begin
-- Do something with line here
null;
end;
end loop;
end Get_Line_Test;

astra06:~/ada/div-ada> uname -a
SunOS astra06 5.10 Generic_137137-09 sun4u sparc SUNW,Sun-Fire-V440
astra06:~/ada/div-ada> ls -l x.txt
-rw------- 1 ok145024 clustra 60000000 Sep 1 12:03 x.txt
astra06:~/ada/div-ada> time ./get_line_test < x.txt
0.49u 0.09s 0:00.59 98.3%

--
C++: The power, elegance and simplicity of a hand grenade.
From: Ole-Hjalmar Kristensen on
>>>>> "GB" == Georg Bauhaus <rm.dash-bauhaus(a)futureapps.de> writes:
OK, I see. In GNAT, the stream_io is a relatively thin layer on top of
the C streams, isn't it? So you will get essentially the same speed as
the C standard IO library. But does Stream_Io have a version of
get_line? I know the C streams interface has a binding to fgets, but
that is just using fgets directly. How did you read line by line with
Stream_Io?

The only way I can think of to improve on fgets is to get
rid of the copying, but would you then strictly be reading line by
line?

I am thinking of something like

procedure get_line(buffer : in out string; first : out natural; last :out natural);

This get_line would only need to copy when it encounters a string which
extends past the end of the buffer, in the normal case it would only
return the indices of the slice.

GB> Ole-Hjalmar Kristensen schrieb:
>> I see the C++ version uses fgets, which is a reasonably fast way of
>> getting a line. Is there any reason not to use fgets?
>> fgets (OpenSolaris version) uses memccpy to actually copy the buffer
>> until the newline:

GB> There were two reasons not to import fgets:

GB> 1 - fgets turns out not to be faster than Stream_IO.
GB> 2 - It's C. ;-)


GB> The fasta entry at the Shootout which uses Unchecked_Conversion,
GB> not 'Address could even be improved a small bit if we change
GB> it to use 'Address. But then, this adds to the line count.
GB> Is it worth it, for a shorter program?

GB> There was a discussion leading to this
GB> (<4a7b4daa$0$31333$9b4e6d93(a)newsspool4.arcor-online.net>)
GB> early in August.

--
C++: The power, elegance and simplicity of a hand grenade.
From: Georg Bauhaus on
Ole-Hjalmar Kristensen schrieb:

> astra06:~/ada/div-ada> uname -a
> SunOS astra06 5.10 Generic_137137-09 sun4u sparc SUNW,Sun-Fire-V440
> astra06:~/ada/div-ada> ls -l x.txt
> -rw------- 1 ok145024 clustra 60000000 Sep 1 12:03 x.txt
> astra06:~/ada/div-ada> time ./get_line_test < x.txt
> 0.49u 0.09s 0:00.59 98.3%
>


Comparing the two programs, they are on a par, almost
exactly.

However, yours is IMHO less messy. It also has

return Copy & Get_Line;

which, I guess, solves the problem of long lines, even
at the expense of copying.

Can we reduce the copying overhead?
The copying would be have O(f(Item'Length)) with
f(n) ~= s*(s+1)/2 where s = n/BUFSIZ, so it is in O(n**2).
We could, I think, use a storage pool:

- whenever copying is needed, allocate another buffer
right next to the current buffer. Use the newly allocated
buffer for the recursive call.

- on reaching of line/file, reinterpret the sequence of
buffers in the storage pool as a String. Finally, Free.

Can this be done?

(This won't make a Shootout program any smaller, but
seems good for the general case...)