From: Tuxedo on
Hi,

A customer just gave me a massive mail file in mbox format which has
accrued over several years. The file was rescued from an old drive of a
previous but now broken system, and so I would like to restore the mailbox
in a mail application on a new system.

The mail file was readable on the previous system in Mozilla Thunderbird,
as there it had a corresponding .msf index. However, the .msf file no
longer exists and the mbox itself is nearly 3GB. When placing this in a new
T-Bird mail folder, the mail application tries but soon fails to generate
the index which is necessary to display the messages.

At first I thought the file may be corrupt so I tried running:
formail -zds < big_mbox >> fixed_mbox

But soon after formail began munching its way into the big_mbox there was
an "Out of memory" error returned by the shell, which I guess was also what
the mail client silently did.

I guess I need more RAM to process such a big file and that any mail
application, formail included, simply needs more than the file size, which
unfortunately I do not have. In any case, I think the file is probably OK
since it worked fine on the previous system.

What methods exist to process and restore this huge file? How about, for
example, splitting it into parts, such as 5 or 10 different files, obviously
cut at the right points between messages? I guess the individual mbox files
could then easily be read in more or less any mail application. Can this
be done via the shell and if so how?

Are there any particular Unix tools to split such huge message files or
create an .msf index without running out of memory in the process?

Many thanks for any ideas and advice.

Tuxedo
From: John Kelly on
On Tue, 15 Jun 2010 01:18:19 +0200, Tuxedo <tuxedo(a)mailinator.com>
wrote:

>customer just gave me a massive mail file in mbox format

>Are there any particular Unix tools to split such huge message files


http://en.wikipedia.org/wiki/Mbox says:

mbox is a generic term for a family of related file formats used for
holding collections of electronic mail messages. All messages in an
mbox mailbox are concatenated and stored as plain text in a single
file. The beginning of each message is indicated by a line whose first
five characters consist of "From" followed by a space (the so-called
"From_ line" or "'From ' line") and the return path e-mail address. A
blank line is appended to the end of each message.

IOW, it's not hard to identify message boundaries. You can use common text
processing tools to split the big file into smaller ones.
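
As a rough illustration of how easy the boundaries are to find, the sketch
below counts the messages in an mbox by counting lines that start with
"From ". The function name count_messages is made up for this example, and
big_mbox is the file name from the original post. It assumes that body lines
beginning with "From " were escaped as ">From " when the mail was stored;
if they were not, the count will come out too high.

```shell
#!/bin/sh
# Hypothetical helper: count mbox message separators in the file "$1".
# grep reads the file line by line, so memory use should stay small
# regardless of the size of the mailbox.
count_messages() {
    grep -c '^From ' "$1"
}
```

Calling `count_messages big_mbox` gives a message total that can be used to
decide how many messages to put in each smaller file.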



--
Web mail, POP3, and SMTP
http://www.beewyz.com/freeaccounts.php

From: Chris F.A. Johnson on
On 2010-06-14, Tuxedo wrote:
> Hi,
>
> A customer just gave me a massive mail file in mbox format which has
> accrued over several years. The file was rescued from an old drive of a
> previous but now broken system, and so I would like to restore the mailbox
> in a mail application on a new system.
>
> The mail file was readable on the previous system in Mozilla Thunderbird,
> as there it had a corresponding .msf index. However, the .msf file no
> longer exists and the mbox itself is nearly 3GB. When placing this in a new
> T-Bird mail folder, the mail application tries but soon fails to generate
> the index which is necessary to display the messages.
>
> At first I thought the file may be corrupt so I tried running:
> formail -zds < big_mbox >> fixed_mbox
>
> But soon after formail began munching its way into the big_mbox there was
> an "Out of memory" error returned by the shell, which I guess was also what
> the mail client silently did.
>
> I guess I need more RAM to process such a big file and that any mail
> application, formail included, simply needs more than the file size, which
> unfortunately I do not have. In any case, I think the file is probably OK
> since it worked fine on the previous system.
>
> What methods exist to process and restore this huge file? How about, for
> example, splitting it into parts, such as 5 or 10 different files, obviously
> cut at the right points between messages? I guess the individual mbox files
> could then easily be read in more or less any mail application. Can this
> be done via the shell and if so how?
>
> Are there any particular Unix tools to split such huge message files or
> create an .msf index without running out of memory in the process?

Use formail:

formail -s savemail < "$mbox"

Where savemail is a script containing:

cat > "$(date +%Y-%m-%d_%H:%M:%S)-$(uuidgen)"

This will put each message in a separate file. Adjust to taste if
you want to put more than one message into each file or to use
different filenames.
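
One possible way to put several messages into each file is sketched below.
It relies on the FILENO environment variable, which the formail man page
describes as holding the zero-padded number of the message currently being
output while splitting. The function name save_chunked and the PER and
PREFIX variables are invented for this sketch and are not part of formail.

```shell
#!/bin/sh
# Hypothetical helper: append the message on stdin to a chunk file.
# PER    = messages per chunk file (assumed default: 1000)
# PREFIX = output file name prefix (assumed default: chunk)
save_chunked() {
    per=${PER:-1000}
    prefix=${PREFIX:-chunk}
    n=${FILENO:-0}
    # FILENO is zero-padded (e.g. 007); strip leading zeros so the
    # shell arithmetic below does not treat the value as octal.
    while [ "${n#0}" != "$n" ] && [ "${#n}" -gt 1 ]; do
        n=${n#0}
    done
    # Messages 0..PER-1 go to PREFIX.000, the next PER to PREFIX.001, etc.
    cat >> "$(printf '%s.%03d' "$prefix" $(( n / per )))"
}
```

Saved as the savemail script with a final line that calls save_chunked, this
could be run as `PER=500 formail -s ./savemail < "$mbox"`. Appending is only
safe because formail -s, without -n, waits for each program to finish before
starting the next one.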


--
Chris F.A. Johnson, author <http://shell.cfajohnson.com/>
===================================================================
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)

From: Maxwell Lol on
John Kelly <jak(a)isp2dial.com> writes:

> On Tue, 15 Jun 2010 01:18:19 +0200, Tuxedo <tuxedo(a)mailinator.com>
> wrote:
>
>>customer just gave me a massive mail file in mbox format
>
>>Are there any particular Unix tools to split such huge message files
>
>
> http://en.wikipedia.org/wiki/Mbox says:
>
> mbox is a generic term for a family of related file formats used for
> holding collections of electronic mail messages. All messages in an
> mbox mailbox are concatenated and stored as plain text in a single
> file. The beginning of each message is indicated by a line whose first
> five characters consist of "From" followed by a space (the so-called
> "From_ line" or "'From ' line") and the return path e-mail address. A
> blank line is appended to the end of each message.
>
> IOW, it's not hard to identify message boundaries. You can use common text
> processing tools to split the big file into smaller ones.
>

You can even use perl and use something like


@mail = split(/\nFrom /,$mboxfile);

That assumes your mail system uses the "put a '>' before 'From' in all
email" option.
From: John Kelly on
On Mon, 14 Jun 2010 21:17:26 -0400, Maxwell Lol <nospam(a)com.invalid>
wrote:

>John Kelly <jak(a)isp2dial.com> writes:
>
>> On Tue, 15 Jun 2010 01:18:19 +0200, Tuxedo <tuxedo(a)mailinator.com>
>> wrote:

>>>customer just gave me a massive mail file in mbox format
>>>Are there any particular Unix tools to split such huge message files

>> IOW, it's not hard to identify message boundaries. You can use common text
>> processing tools to split the big file into smaller ones.
>
>You can even use perl and use something like
>
> @mail = split(/\nFrom /,$mboxfile);

That will read it into memory all at once, which may cause thrashing
with his 3GB file. In his scenario, better to read and write one line
at a time, and open a new output file every so many messages.

It's easy to shoot yourself in the foot with Perl.
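
A minimal sketch of the line-at-a-time approach described above, using awk
from the shell. The function name split_mbox, its three arguments, and the
PREFIX.NNN output naming are assumptions made for this example. It also
assumes that body lines starting with "From " were escaped as ">From ";
any lines before the first "From " separator are dropped.

```shell
#!/bin/sh
# Hypothetical helper: split_mbox FILE PER_FILE PREFIX
# Streams FILE one line at a time and starts a new output file
# (PREFIX.000, PREFIX.001, ...) every PER_FILE messages.
split_mbox() {
    awk -v per="$2" -v prefix="$3" '
        /^From / {
            # A separator line: switch output file every "per" messages.
            if (n % per == 0) {
                if (out != "") close(out)
                out = sprintf("%s.%03d", prefix, n / per)
            }
            n++
        }
        # Copy the current line to the current output file, if any.
        out != "" { print > out }
    ' "$1"
}
```

For example, `split_mbox big_mbox 5000 part` would write part.000, part.001
and so on, each holding up to 5000 messages. Only one line and one open
output file are held at a time, so memory use should not depend on the size
of the mailbox.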

