From: Chris Nehren on
On 2010-06-15, John Kelly scribbled these curious markings:
> On Mon, 14 Jun 2010 21:17:26 -0400, Maxwell Lol <nospam(a)com.invalid>
> wrote:
>
>>John Kelly <jak(a)isp2dial.com> writes:
>>
>>> On Tue, 15 Jun 2010 01:18:19 +0200, Tuxedo <tuxedo(a)mailinator.com>
>>> wrote:
>
>>>>customer just gave me a massive mail file in mbox format
>>>>Are there any particular Unix tools to split such huge message files
>
>>> IOW, it's not hard to identify message boundaries. You can use common text
>>> processing tools to split the big file into smaller ones.
>>
>>You can even use perl and use something like
>>
>> @mail = split(/\nFrom /,$mboxfile);
>
> That will read it into memory all at once, which may cause thrashing
> with his 3GB file. In his scenario, better to read and write one line
> at a time, and open a new output file every so many messages.
>
> It's easy to shoot yourself in the foot with Perl.

It's easy to shoot yourself in the foot with any language. The above
code assumes that the file is already stored in the variable $mboxfile.
Any language--even $YOUR_FAVORITE_LANGUAGE--can do this. Please don't
spread FUD about something that's possible with any tool. Beyond that,
the above regex will run into problems parsing mailboxes.
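
To illustrate the kind of problem meant here (an assumption; the post does
not spell it out), the sketch below mimics the Perl split in Python on a
made-up two-message mailbox. The addresses and message text are
hypothetical. It shows two effects: the separator's "From " prefix is
consumed, and an unescaped body line beginning with "From " is treated as
a message boundary.

```python
import re

# Hypothetical two-message mbox. The second body contains an unescaped
# line beginning with "From ", which some mbox writers leave as-is.
mbox_text = (
    "From alice@example.com Mon Jun 14 21:17:26 2010\n"
    "Subject: first\n"
    "\n"
    "hello\n"
    "\n"
    "From bob@example.com Tue Jun 15 01:18:19 2010\n"
    "Subject: second\n"
    "\n"
    "See you later.\n"
    "From now on, use the new address.\n"
)

# Python equivalent of the Perl split(/\nFrom /, $mboxfile)
pieces = re.split(r"\nFrom ", mbox_text)

print(len(pieces))                 # 3 pieces for 2 messages
print(pieces[1].splitlines()[0])   # separator line without its "From " prefix
print(pieces[2])                   # the tail of bob's body, split off as a "message"
```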

Back on topic: please don't parse email with regex. You don't know what
you're doing and you will get it wrong (Google "Email Hates the Living"*
and watch the Google Video to see why). Use a module, like Mail::Box or
Email::Folder::Mbox, something that's been tested and in production use
at large ESPs for decades.

* Disclaimer: I used to work for that man and have seen first-hand the
true nature of email.

--
Thanks and best regards,
Chris Nehren
From: Tuxedo on
Michael Vilain wrote:

[...]

> This is the problem with not enforcing email quotas. I have a friend
> who does just what your user did--kept everything in a single mbox file

The problem is that the mail application runs locally on a Windows PC.
The users do what they want there.

[...]

> messages. You could use vi or emacs to read the mbox file and chop it
> into monthly parts, then see if formail will work on that. I think
> that's the best you're going to do for this user.

I'm not sure how to do that with vi or emacs. I don't think any editor
will actually open the file. Or is there a command-line sequence for emacs
or the like? Yearly batches would probably work OK, but as soon as a
procedure tries to read the full file into memory, the process just runs out
of memory and terminates.

> Users who do this sort of thing give me gas.

The annoying thing is that it's not the first time this same customer has
done this. The mail applications are not idiot-proof or designed for
non-technical people. Perhaps there should be a warning in Mozilla GUI
applications, such as:
"Your mailbox will soon be larger than your system can handle; please divide
it into separate segments".

Tuxedo
From: Tuxedo on
John Kelly wrote:

[...]

> IOW, it's not hard to identify message boundaries. You can use common text
> processing tools to split the big file into smaller ones.

Thanks for the tip, but I'm not sure which text processing tools can be used
to split the file into smaller ones. At least no editor that I know of will
open the file. It's simply too big.
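
A minimal sketch of the line-at-a-time approach suggested earlier in the
thread, written in Python rather than Perl or shell. The function name
`split_mbox`, the `per_file` and `prefix` parameters, and the output file
naming are all hypothetical. It treats a line starting with "From " that
follows a blank line (or opens the file) as a message boundary, which is a
heuristic and can still misfire on mailboxes whose writer did not escape
such lines in message bodies.

```python
def split_mbox(path, per_file=1000, prefix="part"):
    """Copy the mbox at `path` into numbered files of `per_file` messages.

    Only one line is held in memory at a time, so file size should not matter.
    Returns the list of output file names.
    """
    out = None          # currently open output file
    count = 0           # messages seen so far
    parts = []          # names of the files written
    prev_blank = True   # the start of the file counts as "after a blank line"
    with open(path, "rb") as src:
        for line in src:
            if prev_blank and line.startswith(b"From "):
                # Start a new output file every `per_file` messages.
                if count % per_file == 0:
                    if out is not None:
                        out.close()
                    name = f"{prefix}-{len(parts) + 1:04d}.mbox"
                    out = open(name, "wb")
                    parts.append(name)
                count += 1
            if out is not None:     # ignore anything before the first separator
                out.write(line)
            prev_blank = line in (b"\n", b"\r\n")
    if out is not None:
        out.close()
    return parts
```

Calling `split_mbox("Inbox", per_file=5000)` would, under these
assumptions, write `part-0001.mbox`, `part-0002.mbox`, and so on into the
current directory, leaving the original file untouched.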

Tuxedo

From: Tuxedo on
Chris Nehren wrote:

> Back on topic: please don't parse email with regex. You don't know what
> you're doing and you will get it wrong (Google "Email Hates the Living"*
> and watch the Google Video to see why).

Sounds like a good film - will watch!

> Use a module, like Mail::Box or
> Email::Folder::Mbox, something that's been tested and in production use
> at large ESPs for decades.

How can I use these Perl modules to split the mbox? Will they not also
attempt to read the entire file in one go and run out of memory?

Thanks for any further tips.
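
For comparison, here is a sketch of the library-based approach using
Python's standard `mailbox` module instead of the Perl modules named above;
it says nothing about how Mail::Box or Email::Folder::Mbox behave. As far
as I know, `mailbox.mbox` first records message offsets and then loads one
message at a time, but that has not been checked here against a 3GB file.
The function name `split_by_year`, the `prefix` parameter, and the output
naming are hypothetical, and the year is taken from the last field of the
"From " separator line, which is an assumption about how the mailbox was
written.

```python
import mailbox

def split_by_year(path, prefix="mail"):
    """Append each message in the mbox at `path` to `<prefix>-<year>.mbox`.

    Returns a dict mapping each year (or "unknown") to its message count.
    """
    src = mailbox.mbox(path, create=False)
    outputs = {}   # year -> open mailbox.mbox
    counts = {}    # year -> number of messages written
    try:
        for msg in src:   # messages are parsed one at a time
            # The separator line usually ends with a four-digit year.
            fields = msg.get_from().split()
            year = fields[-1] if fields and fields[-1].isdigit() else "unknown"
            if year not in outputs:
                outputs[year] = mailbox.mbox(f"{prefix}-{year}.mbox")
            outputs[year].add(msg)   # appends to the output file
            counts[year] = counts.get(year, 0) + 1
    finally:
        for box in outputs.values():
            box.close()
        src.close()
    return counts
```

Because the messages pass through a parser and are written back out, the
copies may not be byte-for-byte identical to the originals, so keep the
source file until the results have been checked.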

Tuxedo

From: Tuxedo on
Chris F.A. Johnson wrote:

[...]

> Use formail:
>
> formail -s savemail < "$mbox"
>
> Where savemail is a script containing:
>
> cat > $(date +%Y-%m-%d_%H:%M:%S)-$(uuidgen)
>
> This will put each message in a separate file. Adjust to taste if
> you want to put more than one message into each file or to use
> different filenames.

Thanks for this procedure; it works fine on a not-too-large mbox. However,
it fails with the huge file: the system runs out of memory, as I guess cat
or formail tries to read in the full file to process it. But it's a good
example of how to split an mbox into individual files. I will probably
use this idea for something else.

Many thanks,
Tuxedo.