From: John Kelly on
On Tue, 15 Jun 2010 16:46:41 +0200, Tuxedo <tuxedo(a)mailinator.com>
wrote:

>Is there any other possible way to split such a huge file into, for example,
>three or four parts?

Yes, with text tools, as I suggested. But not many people want to devise
your solution for free. You might try the awk group; they sometimes go
beyond the call of duty to help.


>I guess no one anticipated that users' mailboxes would grow to the size
>they actually do after years of usage, until one day they simply stop
>working and cannot easily be restored.

Some people like to archive mail, but it's dangerous. Eliot Spitzer
advised against email. Same goes for Usenet. Never say anything you
don't want repeated in a court of law.



--
Web mail, POP3, and SMTP
http://www.beewyz.com/freeaccounts.php

From: John Kelly on
On Tue, 15 Jun 2010 17:06:58 +0200, Tuxedo <tuxedo(a)mailinator.com>
wrote:

>John Kelly wrote:

>> I was not talking about text editors, where you read the whole file into
>> memory all at once. Tools like grep, sed, and awk read one line at a
>> time. Or you could write a simple while loop in bash to read a file one
>> line at a time.
>
>Aah. I'm familiar with grep procedures and have used sed and awk but I'm
>not really any good at it. But it sounds like the right solution to my
>problem!

>I tried Chris's idea but got an error, "bigmbox is not an mbox
>file..", described in my previous post. I'm not quite sure what the cause of
>the error may be.

The problem with canned solutions is, they will choke on improper data.
You may have some inconsistency in the data, where you need to get down
into the dirty details to figure it out and fix it.

Read the Wikipedia article on mbox; it explains how to recognize mbox
message boundaries. Understanding that, you may be able to make some
progress.
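
As a rough sketch of that boundary rule: assuming the file follows the
common convention that every line starting with "From " (with the
trailing space) opens a new message, and that such lines inside message
bodies were escaped as ">From ", awk can cut the mailbox into pieces of
N messages while reading one line at a time. The function name
split_mbox, its arguments, and the output naming are made up for
illustration.

```shell
# Split an mbox into pieces holding at most N messages each.
# Assumption: any line beginning with "From " is a message separator.
split_mbox() {
    # $1 = input mbox, $2 = messages per piece, $3 = output file prefix
    awk -v n="$2" -v prefix="$3" '
        /^From / {
            # Every n-th separator starts a new output file.
            if (count % n == 0) {
                if (out != "") close(out)
                part++
                out = sprintf("%s.%03d", prefix, part)
            }
            count++
        }
        # Lines before the first separator are dropped (out is still empty).
        out != "" { print > out }
    ' "$1"
}
```

Something like "split_mbox bigmbox 5000 bigmbox.part" should then leave
bigmbox.part.001, bigmbox.part.002, and so on. If the file contains
unescaped "From " lines inside message bodies, some pieces will be cut
in the wrong place.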




From: John Kelly on
On Tue, 15 Jun 2010 15:21:51 +0000, John Kelly <jak(a)isp2dial.com> wrote:

>Read the Wikipedia article on mbox; it explains how to recognize mbox
>message boundaries. Understanding that, you may be able to make some
>progress.

And for a wild idea, try this:

Use dd to copy the file into 10 equally sized pieces. That will break
the messages on the tail and nose of each piece, but then you can use a
text editor to extract the broken pieces and put them back together
where they belong.

And if you do have some inconsistency in the data, you can learn which
of the 10 pieces has the problem, and proceed to fix it.

Divide and conquer. It rarely fails.
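
A minimal sketch of the dd step, assuming a POSIX shell and a dd that
accepts the usual if=, of=, bs=, skip= and count= operands. The function
name dd_split, its arguments, and the .partNN file naming are only
illustrative. It works in 1 MiB blocks by default, so no single read has
to hold a whole piece in memory.

```shell
# Cut a file into at most N pieces of equal size (the last may be shorter).
dd_split() {
    file=$1                 # input file, e.g. myBigCrapBox
    pieces=$2               # how many pieces to aim for, e.g. 10
    bs=${3:-1048576}        # dd block size in bytes (default 1 MiB)

    size=$(wc -c < "$file")
    blocks=$(( (size + bs - 1) / bs ))          # file length in blocks, rounded up
    per=$(( (blocks + pieces - 1) / pieces ))   # blocks per piece, rounded up

    i=0
    # Stop early if rounding up leaves nothing for the trailing pieces.
    while [ "$i" -lt "$pieces" ] && [ $(( i * per )) -lt "$blocks" ]; do
        out=$(printf '%s.part%02d' "$file" "$i")
        # skip= and count= are both measured in units of bs.
        # 2>/dev/null hides the transfer statistics (and any error text).
        dd if="$file" of="$out" bs="$bs" skip=$(( i * per )) count="$per" 2>/dev/null
        i=$(( i + 1 ))
    done
}
```

Running "dd_split myBigCrapBox 10" should leave myBigCrapBox.part00
through myBigCrapBox.part09, and "cat myBigCrapBox.part* > rejoined"
should give back the original byte for byte. The cuts fall at arbitrary
byte offsets, so the message at each seam is split across two pieces.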




From: Tuxedo on
John Kelly wrote:

[...]

> And for a wild idea, try this:
>
> Use dd to copy the file into 10 equally sized pieces. That will break
> the messages on the tail and nose of each piece, but then you can use a
> text editor to extract the broken pieces and put them back together
> where they belong.
>
> And if you do have some inconsistency in the data, you can learn which
> of the 10 pieces has the problem, and proceed to fix it.
>
> Divide and conquer. It rarely fails.

It sounds like the perfect poor man's solution :-).

I did 'man dd' but remain somewhat puzzled... I'm not familiar with the
command line options. I know I'm asking for free advice here, but how would
you divide a file named, for example, myBigCrapBox into ten pieces using dd?

Many thanks,
Tuxedo





From: Tuxedo on
Chris Nehren wrote:

> On 2010-06-15, Tuxedo scribbled these curious markings:
> > Chris Nehren wrote:
> >
> >> On 2010-06-15, Tuxedo scribbled these curious markings:
> > Thanks for this tip! An interesting and useful tool. While this worked
> > on a smaller mbox, with or without the print $message->body('Body') bit,
> > it also failed on the large mailbox, unfortunately:
> >
> > #!/usr/bin/perl
> > use strict;
> > use warnings;
> >
> > use lib "/tmp/perl/lib/perl5/site_perl/5.8.8";
> >
> > use Email::Folder;
> >
> > my $folder = Email::Folder->new("bigmbox");
> > while (my $message = $folder->next_message) {
> >     print $message->header('Subject'), "\n";
> >     print $message->body('Body'), "\n";
> > }
> >
> > Is there any other possible way to split such a huge file into, for
> > example, three or four parts? Then it would probably be "small enough" to
> > process.
>
> Yes, basically you'd read n messages, write them to a file, read n more,
> write them to another file. I'm out of fish for the day, though: please
> see e.g. http://p3rl.org/bp for a proper tutorial on Perl if you don't
> know where to start.

I like reading Perl tutorials, especially since there are so many :-)

> > Actually, I incurred a different error with the above perl procedure
> > (not "Out of memory"). While the system was working really hard and
> > after a bit less than a minute of file crunching, the error was:
> > "bigmbox is not an mbox file at
> > /tmp/perl/lib/perl5/site_perl/5.8.8/Email/Folder.pm line 81".
> > (The /tmp path just refers to my temporary external module
> > installation).
> >
> > Maybe there is something wrong with my 'bigmbox'. The original file (on
> > a Windows drive) is reported to be 2.8 GB while the version I
> > transferred to my Linux box is only 2.0G for some reason. I downloaded
> > it via lan using Mozilla (as I think FTP has a maximum file size
> > transfer limit).
>
> The file wasn't transferred completely, yes, you hit the file size
> limit. What does "downloaded it via lan" mean? You'll need to download
> the file in toto before being able to process it in toto. Please keep
> in mind that this is not the fault of the Perl code but rather that you
> have an incomplete file on your hands.

Thanks for advising me on this too, one more mystery solved...

Tuxedo