From: John Kelly on
On Tue, 15 Jun 2010 06:26:28 +0000 (UTC), Chris Nehren
<apeiron(a)invalid.isuckatdomains.localhost.net> wrote:

>> It's easy to shoot yourself in the foot with Perl.

>It's easy to shoot yourself in the foot with any language. The above
>code assumes that the file is already stored in the variable $mboxfile.
>Any language--even $YOUR_FAVORITE_LANGUAGE--can do this. Please don't
>spread FUD about something that's possible with any tool.

Sorry, I couldn't resist. But it is "easy" with Perl ;-)


--
Web mail, POP3, and SMTP
http://www.beewyz.com/freeaccounts.php

From: John Kelly on
On Tue, 15 Jun 2010 06:26:28 +0000 (UTC), Chris Nehren
<apeiron(a)invalid.isuckatdomains.localhost.net> wrote:

>Back on topic: please don't parse email with regex. You don't know what
>you're doing and you will get it wrong

I filter mail with procmail regexes. Works for me.



From: Tuxedo on
Chris Nehren wrote:

> On 2010-06-15, Tuxedo scribbled these curious markings:
> > Chris Nehren wrote:
> >
> >> Use a module, like Mail::Box or
> >> Email::Folder::Mbox, something that's been tested and in production use
> >> at large ESPs for decades.
> >
> > How can I use these Perl modules to split the mbox? Will they not also
> > attempt to read the entire file in one go and run out of memory...
>
> Borrowing from the Email::Folder docs:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> use Email::Folder;
>
> my $folder = Email::Folder->new("some_file");
> while (my $message = $folder->next_message) {
>     print $message->header('Subject'), "\n";
> }
>
> Or thereabouts. No, it will not read the entire file all at once, unless
> you call ->messages on the Email::Folder object. For more information on
> what you can do with the $message object, see Email::Simple's docs.
>
> Mail::Box not covered here because, while it is the swiss-army chainsaw
> of mail modules, it's also more complex with a higher learning curve.
>

Thanks for this tip! An interesting and useful tool. While this worked on a
smaller mbox with or without the print $message->body('Body') bit it also
failed on the large mailbox unfortunately:

#!/usr/bin/perl
use strict;
use warnings;

use lib "/tmp/perl/lib/perl5/site_perl/5.8.8";

use Email::Folder;

my $folder = Email::Folder->new("bigmbox");
while (my $message = $folder->next_message) {
    print $message->header('Subject'), "\n";
    print $message->body('Body'), "\n";
}

Is there any other possible way to split such a huge file into, for example,
three or four parts? Then it would probably be "small enough" to process.

Actually, I incurred a different error with the above perl procedure (not
"Out of memory"). While the system was working really hard and after a bit
less than a minute of file crunching, the error was:
"bigmbox is not an mbox file at
/tmp/perl/lib/perl5/site_perl/5.8.8/Email/Folder.pm line 81".
(The /tmp path just refers to my temporary external module installation).

Maybe there is something wrong with my 'bigmbox'. The original file (on a
Windows drive) is reported to be 2.8 GB while the version I transferred to
my Linux box is only 2.0G for some reason. I downloaded it via lan using
Mozilla (as I think FTP has a maximum file size transfer limit).

Also, the previous "Out of memory" error that occurred while processing with
formail, what could it relate to specifically? Ram, CPU, swap? I tested
processing the file on a Windows box which has 4GB RAM, which I guess
should be large enough for the 2.8GB file and any concurrent processes.

In theory, given a fast enough system, I guess any regular Mozilla
application should be able to process and generate the index needed to view
and modify the troubled mbox.

It may be fair to think that to allow such large files to accrue is down to
user error or even stupidity, but at the same time, non-technical users do
not necessarily think about what size their mailboxes may be or that one
single file in mbox format could grow so large. After all, to them it would
only be logical to think whatever fits on a drive should be Ok to store in
a 'mail folder'. In this sense, GUI mail app. designers have failed to make
this very clear in user interfaces. Maybe the advice is given in some
readme document, which nobody reads. I guess no one anticipated that users'
mailboxes would grow to the size they actually do after years of usage,
until one day they simply stop working and cannot easily be restored.

Tuxedo

From: Chris Nehren on
On 2010-06-15, Tuxedo scribbled these curious markings:
> Chris Nehren wrote:
>
>> On 2010-06-15, Tuxedo scribbled these curious markings:
> Thanks for this tip! An interesting and useful tool. While this worked on a
> smaller mbox with or without the print $message->body('Body') bit it also
> failed on the large mailbox unfortunately:
>
> #!/usr/bin/perl
> use strict;
> use warnings;
>
> use lib "/tmp/perl/lib/perl5/site_perl/5.8.8";
>
> use Email::Folder;
>
> my $folder = Email::Folder->new("bigmbox");
> while (my $message = $folder->next_message) {
>     print $message->header('Subject'), "\n";
>     print $message->body('Body'), "\n";
> }
>
> Is there any other possible way to split such a huge file into, for example,
> three or four parts? Then it would probably be "small enough" to process.

Yes, basically you'd read n messages, write them to a file, read n more,
write them to another file. I'm out of fish for the day, though: please
see e.g. http://p3rl.org/bp for a proper tutorial on Perl if you don't
know where to start.
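
A rough sketch of that read-n, write, repeat loop, written here with awk
inside a shell function rather than with Email::Folder. The name split_mbox,
its arguments, and the PREFIX.NNN output naming are made up for illustration.
It assumes a message starts at a line beginning with "From " that is either
the first line of the file or follows a blank line, so an unescaped "From "
line right after a blank line inside a message body would be mistaken for a
boundary.

```shell
# split_mbox INPUT PER_FILE PREFIX   (hypothetical helper, not from the thread)
# Writes PREFIX.001, PREFIX.002, ... each holding at most PER_FILE messages.
# PER_FILE must be a positive integer. Lines before the first "From "
# separator, if any, are dropped.
split_mbox() {
    awk -v per="$2" -v prefix="$3" '
        # Message boundary: "From " at the start of the file or after a blank line.
        /^From / && (NR == 1 || prev == "") {
            if (count % per == 0) {
                # Current part is full (or none is open yet): start the next one.
                if (out != "") close(out)
                out = sprintf("%s.%03d", prefix, ++part)
            }
            count++
        }
        # Copy every line into the current part, unchanged.
        out != "" { print > out }
        # Remember the previous line, ignoring a trailing CR from DOS line endings.
        { prev = $0; sub(/\r$/, "", prev) }
    ' "$1"
}
```

Something like `split_mbox bigmbox 5000 part` would then produce part.001,
part.002 and so on. awk reads one line at a time, so memory use should stay
small regardless of the input size, but the input still has to be a complete
file.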

> Actually, I incurred a different error with the above perl procedure (not
> "Out of memory"). While the system was working really hard and after a bit
> less than a minute of file crunching, the error was:
> "bigmbox is not an mbox file at
> /tmp/perl/lib/perl5/site_perl/5.8.8/Email/Folder.pm line 81".
> (The /tmp path just refers to my temporary external module installation).
>
> Maybe there is something wrong with my 'bigmbox'. The original file (on a
> Windows drive) is reported to be 2.8 GB while the version I transferred to
> my Linux box is only 2.0G for some reason. I downloaded it via lan using
> Mozilla (as I think FTP has a maximum file size transfer limit).

The file wasn't transferred completely, yes, you hit the file size
limit. What does "downloaded it via lan" mean? You'll need to download
the file in toto before being able to process it in toto. Please keep
in mind that this is not the fault of the Perl code but rather that you
have an incomplete file on your hands.
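
One way to check whether the copy is complete is to compare byte counts on
both machines. A minimal sketch follows. The function names are hypothetical,
the expected size has to be read off the original file on the Windows side by
hand, and the 2 GiB check assumes the limit in question is the common signed
32-bit one (the thread does not say which tool imposed it).

```shell
# Hypothetical helpers; the names are illustrative and not from the thread.

# Print the size of a file in bytes (tr strips padding some wc versions add).
file_size() {
    wc -c < "$1" | tr -d ' '
}

# Succeed if a byte count sits at the signed 32-bit boundary (2^31 - 1 or 2^31),
# where tools without large-file support commonly stop.
size_hits_32bit_limit() {
    [ "$1" -eq 2147483647 ] || [ "$1" -eq 2147483648 ]
}

# check_transfer LOCAL_FILE EXPECTED_BYTES
# Report whether the local copy matches the size of the original.
check_transfer() {
    actual=$(file_size "$1")
    if [ "$actual" -eq "$2" ]; then
        echo "complete: $actual bytes"
    elif size_hits_32bit_limit "$actual"; then
        echo "truncated at the 2 GiB boundary: $actual of $2 bytes"
        return 1
    else
        echo "size mismatch: $actual of $2 bytes"
        return 1
    fi
}
```

Matching sizes do not prove the contents are identical, only that nothing was
cut off; comparing checksums on both sides would be the stricter test.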

--
Thanks and best regards,
Chris Nehren
Unless noted, all content I post is CC-BY-SA.

From: Tuxedo on
John Kelly wrote:

> On Tue, 15 Jun 2010 09:39:16 +0200, Tuxedo <tuxedo(a)mailinator.com>
> wrote:
>
> >John Kelly wrote:
>
> >> IOW, it's not hard to identify message boundaries. You can use common
> >> text processing tools to split the big file into smaller ones.
> >
> >Thanks for the tip but I'm not sure what processing tools can be used to
> >split the file into smaller ones? At least no editor that I know will
> >open the file. It's simply too big.
>
> I was not talking about text editors, where you read the whole file into
> memory all at once. Tools like grep, sed, and awk read one line at a
> time. Or you could write a simple while loop in bash to read a file one
> line at a time.

Aah. I'm familiar with grep procedures and have used sed and awk but I'm
not really any good at it. But it sounds like the right solution to my
problem!

>
> while read -r; do
>
> # each line is in $REPLY
> # do something with it
>
> done < mybigfile
>
> If you don't have enough knowledge of these tools to devise a solution,
> Chris's idea of Email::Folder may work for you.

I tried Chris's idea but got the error "bigmbox is not an mbox file.."
described in my previous post. I'm not quite sure what the cause of the
error may be.
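
One quick thing to look at, without assuming anything about how Email::Folder
decides what counts as an mbox, is whether the file begins with a "From "
separator line and how many such lines it contains. The helper names below
are hypothetical.

```shell
# Succeed if the first line of the file starts with "From ",
# the usual mbox message separator.
starts_like_mbox() {
    head -n 1 "$1" | grep -q '^From '
}

# Print how many lines start with "From ". This is a rough upper bound on the
# message count, since unescaped "From " lines inside bodies are counted too.
count_from_lines() {
    grep -c '^From ' "$1" || true
}
```

Both read the file as a stream, so they should be usable on a file that no
editor will open, e.g. `starts_like_mbox bigmbox && count_from_lines bigmbox`.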

Thanks,
Tuxedo