From: sasuke on
Hello to all Java programmers out there. :-)

I was just wondering what would be the most time / space efficient way
of concatenating contents of different files to a single file. Sample
usage would be:
java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

Using threads to open a stream to the source files is out of the question
since the data needs to be written in the same order in which it
exists in the source files, i.e. no ad hoc writing. Reading the entire
contents of the file into memory (by using a StringBuffer /
StringBuilder) also isn't a good choice considering that we can come
across really large text files (~10 MB, typical for db dumps). Reading
the source file line by line doesn't seem attractive given that it
would increase I/O and again for really large files might turn out to
be an I/O bottleneck. One solution which comes to mind is to read the
file in chunks; i.e. read the data in char array of 8KB or a string
array of size 100.

My question here is: Is there any ideal solution which comes to
mind when solving this problem or does the solution really depend on
the domain in consideration and the kind of sacrifices we are ready to
make (e.g. lose the ordering of data, memory trade off when reading
entire file in a buffer, I/O hit)?

Pardon me for asking such a trivial / silly question but just a
thought. :-)

Regards,
/~sasuke
From: RedGrittyBrick on
sasuke wrote:
> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

The most efficient usage of your time is not to reinvent wheels.

>
> Using threads to open a stream to the source files is out of the question
> since the data needs to be written in the same order in which it
> exists in the source files, i.e. no ad hoc writing.

Having multiple threads doing I/O to the same disk is likely to slow
things down.


> Reading the entire
> contents of the file into memory (by using a StringBuffer /
> StringBuilder) also isn't a good choice considering that we can come
> across really large text files (~10 MB, typical for db dumps).

I see no benefit in reading a whole file into memory.


> Reading
> the source file line by line doesn't seem attractive given that it
> would increase I/O and again for really large files might turn out to
> be an I/O bottleneck.

You don't need the JVM to be doing conversion to UTF-16, or pointless
line-oriented processing (e.g. scanning for line-endings).


> One solution which comes to mind is to read the
> file in chunks; i.e. read the data in char array of 8KB or a string
> array of size 100.
>
> My question here is: Is there any ideal solution which comes to
> mind when solving this problem

:-)

cat sourceFileOne.txt sourceFileTwo.txt ... > targetFile.txt

or

copy sourceFileOne.txt+sourceFileTwo.txt ... targetFile.txt

depending on operating system

> or does the solution really depend on
> the domain in consideration and the kind of sacrifices we are ready to
> make (e.g. lose the ordering of data, memory trade off when reading
> entire file in a buffer, I/O hit)?


I wouldn't reinvent this wheel but if you are doing it I suggest you
treat the files as binary not as text (especially not using anything
that translates encodings). Reading in large fixed-size chunks would
seem to be sensible. Given that the task is I/O bound I wouldn't try too
hard to optimise anything else.
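
A minimal sketch of that chunked binary copy, assuming Java 7 or later
(try-with-resources and java.nio.file, both newer than this thread). The
class name ChunkedConcat and the 64 KB buffer size are illustrative
choices, not anything prescribed above:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; copies raw bytes, so no charset decoding happens.
final class ChunkedConcat {
    // Assumed chunk size; any reasonably large fixed size should behave similarly.
    private static final int CHUNK_SIZE = 64 * 1024;

    static void concat(Path target, List<Path> sources) throws IOException {
        byte[] buffer = new byte[CHUNK_SIZE];
        // Creates the target, or truncates it if it already exists.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path source : sources) {          // sources are appended in list order
                try (InputStream in = Files.newInputStream(source)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);   // write only the bytes actually read
                    }
                }
            }
        }
    }

    // Usage: java ChunkedConcat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java ChunkedConcat target source...");
            return;
        }
        List<Path> sources = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            sources.add(Paths.get(args[i]));
        }
        concat(Paths.get(args[0]), sources);
    }
}
```

Memory use stays at one buffer regardless of file size, and the single
sequential loop keeps the source order intact.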

--
RGB
From: Zig on
On Wed, 02 Jul 2008 12:51:55 -0400, sasuke <database666(a)gmail.com> wrote:

> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

What encoding are your text files in? If the source and target files are
in the same encoding, and do not have a BOM character at the beginning of
the file, then a binary transfer is the way to go. Take a look at
java.nio.channels.FileChannel.transferTo / transferFrom
http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.html#transferTo(long, long, java.nio.channels.WritableByteChannel)

Those methods should give you very fast transfer of file contents for
binary data.
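
A rough sketch of a transferTo-based concatenation, assuming Java 7 or
later for FileChannel.open and try-with-resources. The class name
ChannelConcat is made up for illustration:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical class name; a binary transfer with no decoding.
final class ChannelConcat {
    static void concat(Path target, List<Path> sources) throws IOException {
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            for (Path source : sources) {
                try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
                    long position = 0;
                    long size = in.size();
                    // transferTo may move fewer bytes than requested, so loop
                    // until the whole source has been appended to the target.
                    while (position < size) {
                        long moved = in.transferTo(position, size - position, out);
                        if (moved <= 0) {
                            break; // defensive: source shrank or nothing could be moved
                        }
                        position += moved;
                    }
                }
            }
        }
    }
}
```

Bytes are written at the target channel's current position, which
advances after each transfer, so the sources land one after another.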

> One solution which comes to mind is to read the
> file in chunks; i.e. read the data in char array of 8KB or a string
> array of size 100.

If you need to deal with different encodings (from your example usage, you
might check to see if your source files were using different BOMs), then
reading a block of characters (decoding from source), and writing them
back to the target (encoding them with the target file's encoding) may be
more appropriate. If they all have the same encoding, but use BOMs, then
you can use a binary transfer, skipping the BOM character from all but the
first source file.
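
One way the BOM-skipping binary transfer might look, assuming every file
is UTF-8 (BOM bytes EF BB BF) and Java 7 or later. Other encodings use
different BOMs and are not handled here. The class name BomAwareConcat
is hypothetical:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical class name; assumes all sources share the UTF-8 encoding.
final class BomAwareConcat {
    static void concat(Path target, List<Path> sources) throws IOException {
        byte[] buffer = new byte[8192];
        try (OutputStream out = Files.newOutputStream(target)) {
            boolean first = true;
            for (Path source : sources) {
                try (InputStream in = new BufferedInputStream(Files.newInputStream(source))) {
                    if (!first) {
                        skipUtf8Bom(in);   // keep the BOM of the first file only
                    }
                    first = false;
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }

    // Consumes a leading EF BB BF if present; otherwise rewinds to the start.
    private static void skipUtf8Bom(InputStream in) throws IOException {
        in.mark(3);
        byte[] head = new byte[3];
        int read = 0;
        while (read < 3) {
            int n = in.read(head, read, 3 - read);
            if (n == -1) {
                break;   // file shorter than three bytes
            }
            read += n;
        }
        boolean isBom = read == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom) {
            in.reset();  // no BOM: the bytes just read belong to the content
        }
    }
}
```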

Reading & decoding blocks of data will also give you the flexibility to
support more options, such as reading separately gzip'ed log files, and
writing them out as a single gzip'ed text file.
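
As an illustration of the gzip case, here is a small sketch that
decompresses each source and recompresses everything into one gzip
stream, using java.util.zip. It assumes Java 7 or later, and the class
name GzipConcat is invented for the example:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical class name; every source is assumed to be a valid gzip file.
final class GzipConcat {
    static void concat(Path target, List<Path> sources) throws IOException {
        byte[] buffer = new byte[8192];
        // One compressor for the whole output, so the result is a single gzip stream.
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            for (Path source : sources) {
                try (InputStream in = new GZIPInputStream(Files.newInputStream(source))) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);   // uncompressed bytes, recompressed on write
                    }
                }
            }
        }
    }
}
```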

HTH,

-Zig
From: Eric Sosman on
sasuke wrote:
> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...
> [...]

The fastest and most efficient way of all is -- Don't Do That.
Do you really *need* a second copy of the contents of all those
files? Or could you use a java.io.SequenceInputStream to read
the originals /in situ/?
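
A sketch of the SequenceInputStream idea, assuming Java 8 or later (for
UncheckedIOException). The class name SequenceRead and the lazy
Enumeration are illustrative choices, and each file is opened only when
the previous one is exhausted:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

// Hypothetical class name; presents several files as one logical stream.
final class SequenceRead {
    static InputStream openAll(List<Path> sources) {
        final Iterator<Path> it = sources.iterator();
        Enumeration<InputStream> streams = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return it.hasNext();
            }

            @Override
            public InputStream nextElement() {
                try {
                    return Files.newInputStream(it.next());
                } catch (IOException e) {
                    // Enumeration cannot throw checked exceptions, so wrap it.
                    throw new UncheckedIOException(e);
                }
            }
        };
        // Reads the sources in list order; no second copy is written to disk.
        return new SequenceInputStream(streams);
    }
}
```

The caller reads the returned stream like any other InputStream and
closes it when done.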

If you actually do need to concatenate, it's highly unlikely
that anything you can do in Java will be as fast as the platform's
own file-concatenation utility. Those beasts tend to be heavily
optimized, using platform-specific trickery and undocumented APIs
to move the data from hither to yon at great speed. If it's speed
you care about, spend your time figuring out how to launch the
native utility instead of spending it trying to optimize an
alternative that's hobbled by portability concerns.
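
Launching the native utility could look roughly like this. The sketch
assumes a Unix-like system with cat on the PATH and Java 7 or later for
ProcessBuilder.redirectOutput. The class name NativeConcat is made up,
and a Windows variant would need a different command:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; delegates the copying to the platform's cat.
final class NativeConcat {
    // Returns the exit code of cat; 0 conventionally means success.
    static int concat(Path target, List<Path> sources)
            throws IOException, InterruptedException {
        List<String> command = new ArrayList<>();
        command.add("cat");   // assumption: cat is available on the PATH
        for (Path source : sources) {
            // Absolute paths avoid a file name being mistaken for an option.
            command.add(source.toAbsolutePath().toString());
        }
        Process process = new ProcessBuilder(command)
                .redirectOutput(target.toFile())               // plays the role of "> target"
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        return process.waitFor();
    }
}
```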

--
Eric.Sosman(a)sun.com
From: Abhijat Vatsyayan on
sasuke wrote:
> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...
>
> Using threads to open a stream to the source files is out of the question
> since the data needs to be written in the same order in which it
> exists in the source files, i.e. no ad hoc writing. Reading the entire
> contents of the file into memory (by using a StringBuffer /
> StringBuilder) also isn't a good choice considering that we can come
> across really large text files (~10 MB, typical for db dumps). Reading
> the source file line by line doesn't seem attractive given that it
> would increase I/O and again for really large files might turn out to
> be an I/O bottleneck. One solution which comes to mind is to read the
> file in chunks; i.e. read the data in char array of 8KB or a string
> array of size 100.
>
> My question here is: Is there any ideal solution which comes to
> mind when solving this problem or does the solution really depend on
> the domain in consideration and the kind of sacrifices we are ready to
> make (e.g. lose the ordering of data, memory trade off when reading
> entire file in a buffer, I/O hit)?
>
> Pardon me for asking such a trivial / silly question but just a
> thought. :-)
>
> Regards,
> /~sasuke
Why not use the concat task that comes with Ant? Or if you can use a shell on
a *nix box, use "cat". Or install the cat binary from Cygwin on the Windows
box (the list goes on). There are many solutions out there, the least
recommended being writing something like this from scratch (unless you
are doing this just for learning or for fun).
Abhijat