From: Le Chaud Lapin on
Hi All,

I doth seek to remain a sloth...

I have an application that does a huge number of ReadFile calls against
a 470MB file. I have to read the entire file from start to finish,
always sequentially. Each read can be anywhere from a few bytes to
several kilobytes. Process Explorer shows roughly 41,000,000 reads
during an 8-minute run. The mean read is roughly 50 bytes, with a
standard deviation of, I'm guessing, maybe 10 bytes.

I realize that performance will improve dramatically once I eliminate
so many user/kernel transitions by doing block reads, but in the
meantime, I was wondering how much improvement to expect from a kind of
priming, where I pump all blocks into RAM at least once. I just found
the following link on setting the cache size on Windows:

http://support.microsoft.com/kb/837331

I would essentially write a bit of code that slammed the entire 470MB
into RAM before doing my ReadFiles, which, btw, is supporting a kind
of serialization.

I'd like to know what I can expect for improvement (roughly of
course).

TIA,

-Le Chaud Lapin-
From: Mihai N. on
> I realized that performance will improve dramatically when I eliminate
> so many U/K transitions by doing block reads,

Somewhere at a low level there are already block reads,
so I don't expect you will gain much.

It depends a lot on the read pattern. Is it sequential, or
do you have to jump back and forth? Do you read "records" of known
length, or do you read lines (looking for "\n")?

First rule of optimization: measure, to make sure you know what to optimize.
(ok, it is not the first rule, I think it is the 3rd :-)
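
For what it's worth, even something this crude around the read loop
tells you where the time goes (a rough sketch):

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);

        // ... the loop with the 41,000,000 ReadFile calls goes here ...

        QueryPerformanceCounter(&t1);
        printf("elapsed: %.3f s\n",
               (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);
        return 0;
    }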


--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

From: Uwe Sieber on

http://support.microsoft.com/kb/837331 says
"There is no limit to the physical cache size"
and this seems to be true.
I've no idea what "Virtual cache size" means.
The values 512 MB and 960 MB sound familiar;
those seem to be the cache's maximum working-set
sizes.

To get the desired 'all in RAM' effect, open
the file with FILE_FLAG_RANDOM_ACCESS and
read it from start to end. If there is enough
free memory, it will be held completely in
RAM.
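
Something along these lines would do it (a rough sketch, error
handling omitted, and the function name is just made up):

    #include <windows.h>

    // Read the whole file once so the cache manager keeps it resident.
    // FILE_FLAG_RANDOM_ACCESS keeps the cache manager from throwing
    // pages away behind the read cursor the way a sequential-scan
    // hint would.
    void PrimeFileCache(const wchar_t* path)
    {
        HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_RANDOM_ACCESS, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return;

        char  buf[64 * 1024];
        DWORD got = 0;
        while (ReadFile(h, buf, sizeof(buf), &got, NULL) && got != 0)
            ;   // discard the data, we only want it cached

        CloseHandle(h);
    }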


Uwe



Le Chaud Lapin wrote:
> [...]
> I was wondering how much improvement to expect from a kind of
> priming, where I pump all blocks into RAM at least once.
> [...]
From: Paul Baker [MVP, Windows Desktop Experience] on
If it's sequential, consider using FILE_FLAG_SEQUENTIAL_SCAN.

If it's random access, consider using FILE_FLAG_RANDOM_ACCESS.

For performance reasons, I wrote a class many years ago that uses a
buffer, 64 KB by default, when reading sequentially. At the time it
seemed to improve performance a little on Windows 9x hard drives and a
lot on Windows NT 4.0 floppy drives. But in other cases there was no
noticeable difference, and on modern systems I doubt there is any
difference at all. Mihai is correct: the system is caching reads and
ought to be able to take care of this for you.
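
Nothing fancy; the idea was roughly like this (a from-memory sketch,
not the actual class):

    #include <windows.h>
    #include <string.h>

    // Minimal buffered reader: one 64 KB ReadFile feeds many small
    // Read() calls, so most of them never cross into the kernel.
    class BufferedReader
    {
    public:
        explicit BufferedReader(HANDLE file, DWORD bufSize = 64 * 1024)
            : m_file(file), m_buf(new char[bufSize]),
              m_cap(bufSize), m_pos(0), m_len(0) {}
        ~BufferedReader() { delete[] m_buf; }

        // Copies up to 'size' bytes into 'dst'; returns the bytes copied
        // (less than 'size' only at end of file).
        DWORD Read(void* dst, DWORD size)
        {
            char* out    = static_cast<char*>(dst);
            DWORD copied = 0;
            while (copied < size)
            {
                if (m_pos == m_len && !Refill())
                    break;                          // end of file
                DWORD n = size - copied;
                if (n > m_len - m_pos)
                    n = m_len - m_pos;
                memcpy(out + copied, m_buf + m_pos, n);
                m_pos  += n;
                copied += n;
            }
            return copied;
        }

    private:
        bool Refill()
        {
            m_pos = 0;
            m_len = 0;
            return ReadFile(m_file, m_buf, m_cap, &m_len, NULL) && m_len != 0;
        }

        HANDLE m_file;
        char*  m_buf;
        DWORD  m_cap, m_pos, m_len;
    };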

I never saw any improvement under any circumstances with a buffer over
64 KB. Don't make the mistake of using a huge buffer in a single call to
ReadFile; that can actually hurt performance as well as system resources
and stability. We saw this problem recently. On some systems I could use
a buffer of hundreds of megabytes without issue, whereas on others I
could not use a buffer size over about 64 MB. After digging through
low-level documentation I am unable to explain it fully, but it has
something to do with the fact that the device driver needs contiguous
physical memory and, depending on the type of buffer management used,
may allocate a new buffer equal in size to the caller's. And that buffer
may come from kernel memory (nonpaged pool), which is a scarce resource.

However, the only way to truly know what the performance is, as Mihai
said, is to measure it. See if any of the above helps. Don't forget to
reboot between tests and repeat them multiple times to ensure you are
getting consistent results. I believe Sysinternals has a tool that
clears the system cache, but rebooting is easy and eliminates some
other possible causes of interference too.

If you want to "slam" the entire file into memory, why not use a memory
mapped file?

I don't think I would mess with system wide caching settings. The defaults
should be fine.

Paul

"Le Chaud Lapin" <jaibuduvin(a)gmail.com> wrote in message
news:74647255-58cc-4a84-b8af-bdc6a93e166a(a)d4g2000vbm.googlegroups.com...
> [...]
> I would essentially write a bit of code that slammed the entire 470MB
> into RAM before doing my ReadFiles, which, btw, is supporting a kind
> of serialization.
> [...]


From: Chris M. Thomasson on
"Le Chaud Lapin" <jaibuduvin(a)gmail.com> wrote in message
news:74647255-58cc-4a84-b8af-bdc6a93e166a(a)d4g2000vbm.googlegroups.com...
> [...]
> I would essentially write a bit of code that slammed the entire 470MB
> into RAM before doing my ReadFiles, which, btw, is supporting a kind
> of serialization.
>
> I'd like to know what I can expect for improvement (roughly of
> course).

Use a memory-mapped file, and process the memory in sequential order
(e.g., from base to base + size_of_file). BTW, what type of processing
are you doing? Does processing one part of the file always depend on the
results of processing a previous portion of the file?
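
A rough sketch of what I mean (error handling mostly omitted; the
checksum just stands in for whatever your deserialization does):

    #include <windows.h>

    void ProcessWholeFile(const wchar_t* path)
    {
        HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (file == INVALID_HANDLE_VALUE)
            return;

        LARGE_INTEGER size;
        GetFileSizeEx(file, &size);

        HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READONLY,
                                            0, 0, NULL);
        if (mapping != NULL)
        {
            const char* base = static_cast<const char*>(
                MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
            if (base != NULL)
            {
                // One sequential pass over the mapped view, front to back.
                unsigned long long sum = 0;
                for (LONGLONG i = 0; i < size.QuadPart; ++i)
                    sum += (unsigned char)base[i];

                UnmapViewOfFile(base);
            }
            CloseHandle(mapping);
        }
        CloseHandle(file);
    }

The OS faults the pages in as you touch them, so you get the block reads
and the caching without any explicit priming pass.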