Huge mailbox split? [Shell]


From: Janis Papanagnou on 16 Jun 2010 02:27

Tuxedo wrote:
> Janis Papanagnou wrote:
>
> [...]
>
>>> awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' mbox
>> To avoid matching a message body line starting with "From [...]", you can
>> define the pattern more accurately; instead of /^From / specify (for example)...
>>
>> /^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ {...}
>>
>> or perhaps just
>>
>> NF==7 && /^From / {...}
>>
>>> (If the number of created files will exceed some number of allowed open
>>> file descriptors, please tell us, then the code needs some adjustments.)
>>>
>>> Janis
>
> Thanks for this awk tip!
>
> But you are right, the first one catches message body text that simply
> begins a line with "From":

(ITYM, a line starting with "From ".)

> awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' mbox
>
> The other versions, however, I get some errors with. I presume I am
> replicating it in some wrong way:
>
> awk '/^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ { f
> = "mbox_"$NF"-"$4 } { print > f }' mbox

Have you put the whole statement in a single line? Or in more lines as
follows...

awk '
/^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ {
f = "mbox_"$NF"-"$4
}
{ print > f }
' mbox

>
> The error for the above is "redirection has null string value".

Please compare the defined awk pattern with your "From " line in the mbox
file.

grep '^From ' mbox | head -1 | od -c

might help to spot invisible white space characters.
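
The "null string value" message usually means a line was printed before the pattern had matched even once, so f was still empty. A rough sketch of a guarded variant follows. The file name sample.mbox, the part_ prefix and the sample lines are made up for the demonstration, and the separator test is the NF==7 one from above. After close(), ">>" is used so that a second message from the same month appends to the file instead of truncating it.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in mailbox: one junk line, then two messages from different months.
printf '%s\n' \
  'junk before the first message' \
  'From - Wed Jun 16 02:27:00 2010' \
  'Subject: one' \
  '' \
  'From the body' \
  'From - Sat Jul 17 03:08:00 2010' \
  'Subject: two' > sample.mbox

awk '
  # Separator line: exactly 7 fields and starting with "From ".
  NF == 7 && /^From / {
      if (f != "") close(f)      # release the previous output file
      f = "part_" $NF "-" $4     # e.g. part_2010-Jun
  }
  f != "" { print >> f }         # lines before the first separator are skipped
' sample.mbox

ls part_*
```

Because of the ">>", running it twice in the same directory would append to the existing part files, so stale output should be removed first.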

>
> awk 'NF==7 && /^From { f = "mbox_"$NF"-"$4 } { print > f }' sent-mail

awk 'NF==7 && /^From / { f = "mbox_"$NF"-"$4 } { print > f }' sent-mail

>
> The error here is "unterminated regexp".
>
> Perhaps you can correct the above or type your two last examples in full?

Please retry.

Janis

>
> Thanks,
> Tuxedo

From: Christian on 15 Jun 2010 10:09

For that purpose, I usually run

awk '/^From / {n++} {print >"msg" n ".mbx" }' big_mbox

but I have never tried it on such big files.
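
For a mailbox with a great many messages, one file per message can run into the open-file limit mentioned elsewhere in the thread. A possible variation, only a sketch: put a fixed number of messages into each output file and close the previous file before starting the next. The names big_mbox, chunk_NNN.mbx and the variable per are chosen for the demo, and the bare /^From / test will still be fooled by body lines that start with "From ".

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in mailbox with five tiny messages (three lines each).
for i in 1 2 3 4 5; do
  printf 'From sender%s@example.invalid Wed Jun 16 02:27:00 2010\nSubject: msg %s\n\n' "$i" "$i"
done > big_mbox

# per = messages per output file (2 here only to keep the demo small).
awk -v per=2 '
  /^From / {
      if (n % per == 0) {              # time to start a new chunk
          if (out != "") close(out)    # keep at most one output file open
          out = sprintf("chunk_%03d.mbx", ++chunk)
      }
      n++
  }
  out != "" { print > out }            # skip anything before the first message
' big_mbox

ls chunk_*.mbx
```

With per set to a few thousand, a multi-gigabyte text mbox would end up as a handful of files, which is roughly what the original question asked for.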

"Tuxedo" <tuxedo(a)mailinator.com> wrote in message news:
hv6dbr$hn0$00$1(a)news.t-online.com...
> Hi,
>
> A customer just gave me a massive mail file in mbox format which has
> accrued over several years. The file was rescued from an old drive of a
> previous but now broken system, and so I would like to restore the mailbox
> in a mail application on a new system.
>
> The mail file was readable on the previous system in Mozilla Thunderbird,
> as there it had a corresponding .msf index. However, the .msf file no
> longer exists and the mbox itself is nearly 3GB. When placing this in a
> new T-Bird mail folder, the mail application tries but soon fails to
> generate the index which is necessary to display the messages.
>
> At first I thought the file may be corrupt so I tried running:
> formail -zds < big_mbox >> fixed_mbox
>
> But soon after formail began munching its way into the big_mbox there was
> an "Out of memory" error returned by the shell, which I guess was also
> what the mail client silently did.
>
> I guess I need more RAM to process such a big file and that any mail
> application, formail included, simply needs more memory than the file size,
> which unfortunately I do not have. In any case, I think the file is probably
> OK since it worked fine on the previous system.
>
> What methods exist to process and restore this huge file? How about, for
> example, splitting it into parts, such as 5 or 10 different files,
> obviously cut at the right points between messages? I guess the individual
> mbox files would then be easily readable in more or less any mail
> application. Can this be done via the shell, and if so, how?
>
> Are there any particular Unix tools to split such huge message files or
> create an .msf index without running out of memory in the process?
>
> Many thanks for any ideas and advice.
>
> Tuxedo

From: Tuxedo on 16 Jun 2010 03:08

John Kelly wrote:

[..]

> 100 bytes is not enough to see the big picture. Try more, 1,000 or
> 10,000, or whatever it takes until you see some data that looks like
> mail messages. Then use the skip feature of dd to read past that when
> copying.

I thought the mbox format was meant to begin with "From" on the first line
of the file. At least that's how mboxes look on my Linux box. But who knows
what could have been inserted by some Windows application.

So I tried the larger values, bs=10000 etc, but same result.
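
For reference, the dd procedure described above looks roughly like this. It is only a sketch: the file names suspect_mbox and trimmed_mbox, the 3000 bytes of padding and the block size are invented for the demo, and on a real file the right offset has to be found by looking at the dumps.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in file: 3000 bytes of NUL padding followed by one ordinary message.
{
  dd if=/dev/zero bs=1000 count=3 2>/dev/null
  printf 'From - Wed Jun 16 03:08:00 2010\nSubject: test\n\nbody\n'
} > suspect_mbox

# Inspect one 1000-byte block at a time; skip= counts blocks of size bs.
dd if=suspect_mbox bs=1000 skip=3 count=1 2>/dev/null | od -c | head -n 3

# Once the start of the real mail is known, copy everything from there on.
dd if=suspect_mbox of=trimmed_mbox bs=1000 skip=3 2>/dev/null
head -n 1 trimmed_mbox
```

If no block anywhere in the file shows readable headers, skipping will not help, since there is then no plain-text mail to skip to.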

The likely broken mbox file appears to be all binary, yet it doesn't look
like a typical binary file in my editor. I'm not sure what it is...

I tested the awk trick posted by Janis Papanagnou today:
awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' myBigCrapBox

While this works on another good mailbox, the output I get with
myBigCrapBox is:
awk: cmd. line:1: fatal: grow_iop_buffer: iop->buf: can't allocate
-2147483646 bytes of memory (Cannot allocate memory)

The computer worked away at its maximum power for 30 seconds or so until
the process died without managing to find a single occurrence of a ^From
string.
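
A guess at the cause: awk reads its input one record (line) at a time, so a file with hardly any newlines may make it try to hold an enormous "line" in memory. One cheap check that never buffers a whole line is to compare the byte count with the newline count. This is only a sketch; good_mbox and bad_mbox are stand-in files created for the demo.

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-ins: a two-line text mailbox and a blob with no line breaks at all.
printf 'From - Wed Jun 16 02:27:00 2010\nSubject: x\n' > good_mbox
printf 'binary-looking blob without any line breaks' > bad_mbox

for f in good_mbox bad_mbox; do
  bytes=$(wc -c < "$f")
  newlines=$(tr -cd '\n' < "$f" | wc -c)   # tr streams bytes; no line is buffered
  echo "$f: $((bytes)) bytes, $((newlines)) newlines"
done > report.txt

cat report.txt
```

A text mailbox should show one newline every few dozen bytes on average. A count near zero for a multi-gigabyte file suggests it is not line-oriented text.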

Tuxedo

From: Tuxedo on 16 Jun 2010 03:24

John Kelly wrote:

[...]

> But maybe it's not really mbox format, and there is extra garbage
> between each message. Or worse, some kind of compressed format where
> you can't really see what you have just by looking at the data.

Yes, I think the file must be in some compressed format. There are other
smaller mail folders from the same Windows drive that were used with the
same mail application but which are plain text. It's only the huge 2.8GB
file that is unreadable. Perhaps Mozilla Thunderbird is compressing any and
only those mboxes that exceed a certain file size, but if so, they do not
get a suffix, like .zip, .gz etc. If this theory is true, I'm still not
sure what compression format is used, in case it's even a standard format.
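
One way to test the compression theory without relying on a file suffix is to look at the first few bytes, since the common formats start with fixed signatures (gzip: 1f 8b, zip: "PK" 03 04, bzip2: "BZh"). A minimal sketch follows. The function name sniff and the demo files are made up, and only a handful of formats are covered; where it is installed, the file(1) utility does the same job far more thoroughly.

```shell
# sniff FILE: hypothetical helper that names a few well-known signatures.
sniff() {
  # First four bytes as a hex string, e.g. "1f8b0800".
  magic=$(dd if="$1" bs=4 count=1 2>/dev/null | od -An -tx1 | tr -d ' \n')
  case $magic in
    1f8b*)     echo gzip ;;
    504b0304*) echo zip ;;
    425a68*)   echo bzip2 ;;
    46726f6d*) echo "starts with From" ;;   # plausibly a plain mbox
    *)         echo unknown ;;
  esac
}

# Demo on stand-in files in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '\037\213\010\000rest' > a.bin            # gzip signature bytes
printf 'From - Wed Jun 16 03:24:00 2010\n' > b.mbox
printf 'something else entirely' > c.bin

for f in a.bin b.mbox c.bin; do
  echo "$f: $(sniff "$f")"
done
```

An "unknown" result does not prove much either way; it only rules out the few formats listed in the case statement.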

Tuxedo

From: Tuxedo on 16 Jun 2010 03:50

Janis Papanagnou wrote:

> Tuxedo wrote:
> > Janis Papanagnou wrote:
> >
> > [...]
> >
> >>> awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' mbox
> >> To avoid matching a message body line starting with "From [...]", you
> >> can define the pattern more accurately; instead of /^From / specify
> >> (for example)...
> >>
> >> /^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/
> >> {...}
> >>
> >> or perhaps just
> >>
> >> NF==7 && /^From / {...}
> >>
> >>> (If the number of created files will exceed some number of allowed
> >>> open file descriptors, please tell us, then the code needs some
> >>> adjustments.)
> >>>
> >>> Janis
> >
> > Thanks for this awk tip!
> >
> > But you are right, the first one catches message body text that simply
> > begins a line with "From":
>
> (ITYM, a line starting with "From ".)
>
> > awk '/^From / { f = "mbox_"$NF"-"$4 } { print > f }' mbox
> >
> > The other versions, however, I get some errors with. I presume I am
> > replicating it in some wrong way:
> >
> > awk '/^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/
> > { f = "mbox_"$NF"-"$4 } { print > f }' mbox
>
> Have you put the whole statement in a single line? Or in more lines as
> follows...
>
> awk '
> /^From - [A-Z][a-z][a-z] [A-Z][a-z][a-z] .* [0-9][0-9][0-9][0-9]$/ {
> f = "mbox_"$NF"-"$4
> }
> { print > f }
> ' mbox
>
> >
> > The error for the above is "redirection has null string value".
>
> Please compare the defined awk pattern with your "From " line in the mbox
> file.
>
> grep '^From ' mbox | head -1 | od -c
>
> might help to spot invisible white space characters.
>
> >
> > awk 'NF==7 && /^From { f = "mbox_"$NF"-"$4 } { print > f }' sent-mail
>
> awk 'NF==7 && /^From / { f = "mbox_"$NF"-"$4 } { print > f }' sent-mail
>
> >
> > The error here is "unterminated regexp".
> >
> > Perhaps you can correct the above or type your two last examples in
> > full?
>
> Please retry.
>
> Janis
>
> >
> > Thanks,
> > Tuxedo

Thanks, it works. Not with the actual huge mailbox, because I've just found
that something else is wrong with that one (it's probably compressed). But the
above examples work great with any 'normal' mbox I have.

Very useful to know.

Tuxedo
