From: sasuke on 2 Jul 2008 12:51

Hello to all Java programmers out there. :-)

I was just wondering what would be the most time / space efficient way of concatenating the contents of different files into a single file. Sample usage would be:

java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

Using threads to open a stream to each source file is out of the question, since the data needs to be written in the order in which it exists in the source files, i.e. no ad hoc writing. Reading the entire contents of a file into memory (by using a StringBuffer / StringBuilder) also isn't a good choice, considering that we can come across really large text files (~10 MB, typical for db dumps). Reading the source file line by line doesn't seem attractive, given that it would increase I/O and for really large files might turn out to be an I/O bottleneck. One solution which comes to mind is to read the file in chunks, i.e. read the data into a char array of 8 KB or a string array of size 100.

My question here is: is there any ideal solution which comes to mind when solving this problem, or does the solution really depend on the domain in consideration and the kind of sacrifices we are ready to make (e.g. lose the ordering of data, memory trade-off when reading an entire file into a buffer, I/O hit)?

Pardon me for asking such a trivial / silly question, but just a thought. :-)

Regards,
/~sasuke
From: RedGrittyBrick on 2 Jul 2008 13:18

sasuke wrote:
> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

The most efficient usage of your time is not to reinvent wheels.

> Using threads to open a stream to the source files is out of question
> since the data needs to be written in a ordered manner in which it
> exists in the source files i.e. no ad hoc writing.

Having multiple threads doing I/O to the same disk is likely to slow things down.

> Reading the entire
> contents of the file into memory (by using a StringBuffer /
> StringBuilder) also isn't a good choice considering that we can come
> across really large text files (~10 MB, typical for db dumps).

I see no benefit in reading a whole file into memory.

> Reading
> the source file line by line doesn't seem attractive given that it
> would increase I/O and again for really large files might turn out to
> be a I/O bottleneck.

You don't need the JVM to be doing conversion to UTF-16, or pointless line-oriented processing (e.g. scanning for line-endings).

> One solution which comes to mind is to read the
> file in chunks; i.e. read the data in char array of 8KB or a string
> array of size 100.
>
> My question here is: is there any ideal solution which comes to
> mind when solving this problem

:-)

cat sourceFileOne.txt sourceFileTwo.txt ... > targetFile.txt

or

copy sourceFileOne.txt+sourceFileTwo.txt ... targetFile.txt

depending on operating system.

> or does the solution really depend on
> the domain in consideration and the kind of sacrifices we are ready to
> make (e.g. lose the ordering of data, memory trade off when reading
> entire file in a buffer, I/O hit)?

I wouldn't reinvent this wheel but if you are doing it I suggest you treat the files as binary not as text (especially not using anything that translates encodings). Reading in large fixed-size chunks would seem to be sensible. Given that the task is I/O bound I wouldn't try too hard to optimise anything else. -- RGB
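
The advice above (treat the files as bytes, read in large fixed-size chunks) could be sketched roughly as follows. This is only an illustration, not code posted in the thread: the class name ChunkedConcat and the 64 KB buffer size are assumptions, and it uses the java.nio.file helpers from Java 7 and later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, chosen for this sketch only.
class ChunkedConcat {
    // Assumed buffer size; the thread only says "large fixed-size chunks".
    private static final int BUFFER_SIZE = 64 * 1024;

    // Appends each source to the target, in list order, as raw bytes.
    static void concat(Path target, List<Path> sources) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        // newOutputStream creates the target or truncates an existing one.
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path source : sources) {
                try (InputStream in = Files.newInputStream(source)) {
                    int n;
                    // read returns -1 at end of file; otherwise n bytes are valid.
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }

    // Usage: java ChunkedConcat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java ChunkedConcat target source...");
            return;
        }
        List<Path> sources = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            sources.add(Paths.get(args[i]));
        }
        concat(Paths.get(args[0]), sources);
    }
}
```

No character decoding happens here, so the result is only meaningful as text if all sources share one encoding.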
From: Zig on 2 Jul 2008 16:32

On Wed, 02 Jul 2008 12:51:55 -0400, sasuke <database666(a)gmail.com> wrote:
> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

What encoding are your text files in? If the source and target files are in the same encoding, and do not have a BOM character at the beginning of the file, then a binary transfer is the way to go. Take a look at java.nio.channels.FileChannel.transferTo / transferFrom:

http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.html#transferTo(long, long, java.nio.channels.WritableByteChannel)

Those methods should give you very fast transfer of file content for binary data.

> One solution which comes to mind is to read the
> file in chunks; i.e. read the data in char array of 8KB or a string
> array of size 100.

If you need to deal with different encodings (from your example usage, you might check to see if your source files were using different BOMs), then reading a block of characters (decoding from the source) and writing them back to the target (encoding them with the target file's encoding) may be more appropriate. If they all have the same encoding, but use BOMs, then you can use a binary transfer, skipping the BOM character from all but the first source file.

Reading & decoding blocks of data will also give you the flexibility to support more options, such as reading separately gzip'ed log files and writing them out as a single gzip'ed text file.

HTH,
-Zig
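
A minimal sketch of the transferTo approach, including the "skip the BOM on all but the first file" idea, might look like this. The class name ChannelConcat and the helper utf8BomLength are made up for illustration, and the sketch assumes every source is UTF-8, so the only BOM it recognises is the three bytes EF BB BF. It relies on FileChannel.open from Java 7 and later.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical class name, chosen for this sketch only.
class ChannelConcat {
    // Hypothetical helper: returns 3 if the file starts with the UTF-8 BOM
    // (EF BB BF), otherwise 0. Assumes UTF-8 sources; other BOMs are ignored.
    static long utf8BomLength(FileChannel in) throws IOException {
        ByteBuffer head = ByteBuffer.allocate(3);
        // Positional reads leave the channel's own position untouched.
        while (head.hasRemaining()) {
            if (in.read(head, head.position()) < 0) {
                break; // file is shorter than three bytes
            }
        }
        boolean bom = head.position() == 3
                && head.get(0) == (byte) 0xEF
                && head.get(1) == (byte) 0xBB
                && head.get(2) == (byte) 0xBF;
        return bom ? 3 : 0;
    }

    // Binary concatenation; a leading UTF-8 BOM is kept only for the first source.
    static void concat(Path target, List<Path> sources) throws IOException {
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            boolean first = true;
            for (Path source : sources) {
                try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
                    long position = first ? 0 : utf8BomLength(in);
                    long size = in.size();
                    // transferTo may move fewer bytes than requested, so loop.
                    while (position < size) {
                        long n = in.transferTo(position, size - position, out);
                        if (n <= 0) {
                            break; // nothing more could be transferred
                        }
                        position += n;
                    }
                }
                first = false;
            }
        }
    }
}
```

Whether this actually beats a plain buffered copy depends on the platform and JVM, so it is worth measuring rather than assuming.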
From: Eric Sosman on 2 Jul 2008 17:57

sasuke wrote:
> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...
> [...]

The fastest and most efficient way of all is -- Don't Do That. Do you really *need* a second copy of the contents of all those files? Or could you use a java.io.SequenceInputStream to read the originals /in situ/?

If you actually do need to concatenate, it's highly unlikely that anything you can do in Java will be as fast as the platform's own file-concatenation utility. Those beasts tend to be heavily optimized, using platform-specific trickery and undocumented APIs to move the data from hither to yon at great speed. If it's speed you care about, spend your time figuring out how to launch the native utility instead of spending it trying to optimize an alternative that's hobbled by portability concerns.

-- Eric.Sosman(a)sun.com
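
For the "read the originals in situ" suggestion, one possible shape is sketched below. The class name SequenceRead and the method openAll are invented for this example. It opens each file only when the previous one is exhausted, which is a design choice of the sketch rather than something the post specifies, and it uses UncheckedIOException from Java 8.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

// Hypothetical class name, chosen for this sketch only.
class SequenceRead {
    // Returns one stream that yields the bytes of each source in list order,
    // without writing a concatenated copy to disk.
    static InputStream openAll(List<Path> sources) {
        final Iterator<Path> it = sources.iterator();
        Enumeration<InputStream> streams = new Enumeration<InputStream>() {
            @Override
            public boolean hasMoreElements() {
                return it.hasNext();
            }

            @Override
            public InputStream nextElement() {
                try {
                    // Opened lazily, when SequenceInputStream asks for the next stream.
                    return Files.newInputStream(it.next());
                } catch (IOException e) {
                    // Enumeration cannot throw checked exceptions.
                    throw new UncheckedIOException(e);
                }
            }
        };
        // SequenceInputStream closes each underlying stream once it is exhausted.
        return new SequenceInputStream(streams);
    }
}
```

A caller would read from the returned stream as if it were a single file, closing it in a try-with-resources block when done.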
From: Abhijat Vatsyayan on 2 Jul 2008 19:30

sasuke wrote:
> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...
> [...]

Why not use the concat task that comes with Ant? Or, if you can use a shell on a *nix box, use "cat". Or install the cat binary from Cygwin on the Windows box (the list goes on). There are many solutions out there, the least recommended being writing something like this from scratch (unless you are doing this just for learning or for fun).

Abhijat