From: Nix on
On 6 Dec 2009, Martin Gregorie verbalised:
> I run my own PostgreSQL-based mail archive, automatically fed from
> Postfix via the magic of the 'always_bcc' directive. Its benefits are:
> - spam isn't archived
> - effectively unlimited archival storage
> - fast searching and retrieving by any combination of address, subject,
> date range and body text.
>
> I'm planning to make it available shortly: details can be found at
> www.libelle-systems.com

This looks very neat. I'm halfway through an INN backend to do this to
newsfeeds: maybe I can arrange to use a compatible schema...
From: Martin Gregorie on
On Mon, 21 Dec 2009 22:24:20 +0000, Nix wrote:

> On 6 Dec 2009, Martin Gregorie verbalised:
>> I run my own PostgreSQL-based mail archive, automatically fed from
>> Postfix via the magic of the 'always_bcc' directive. Its benefits are:
>> - spam isn't archived
>> - effectively unlimited archival storage
>> - fast searching and retrieving by any combination of address, subject,
>> date range and body text.
>>
>> I'm planning to make it available shortly: details can be found at
>> www.libelle-systems.com
>
> This looks very neat. I'm halfway through an INN backend to do this to
> newsfeeds: maybe I can arrange to use a compatible schema...

The schema should work as it stands, since by definition it can handle a
pure text (non-MIME) mail message which is indexed by the address(es),
date sent and subject. Searches operate on these three terms plus the
plaintext part of the body, so searching and message retrieval should
also work with few if any changes.
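
As a rough illustration of the indexing described above (not the actual
MailArchive schema or code, which is Java and PostgreSQL based), the
following Python sketch pulls the address(es), date sent and subject out
of a plain-text message and filters on any combination of those terms
plus the body text. The names ArchivedMessage, index_message and search
are hypothetical, and an in-memory list stands in for the database.

```python
from dataclasses import dataclass
from datetime import datetime
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


@dataclass
class ArchivedMessage:
    """Hypothetical record: the three index terms plus the plain-text body."""
    addresses: set      # every address found in From, To and Cc, lower-cased
    sent: datetime      # parsed value of the Date header
    subject: str
    body: str


def index_message(raw: str) -> ArchivedMessage:
    """Parse a non-MIME message and extract the terms it is indexed by."""
    msg = message_from_string(raw)
    headers = (msg.get_all("From", []) + msg.get_all("To", [])
               + msg.get_all("Cc", []))
    addresses = {addr.lower() for _name, addr in getaddresses(headers) if addr}
    return ArchivedMessage(
        addresses=addresses,
        sent=parsedate_to_datetime(msg["Date"]),
        subject=msg["Subject"] or "",
        body=msg.get_payload(),   # a plain string for a non-multipart message
    )


def search(archive, address=None, subject=None, since=None, until=None,
           text=None):
    """Return messages matching every criterion that was supplied."""
    hits = []
    for m in archive:
        if address and address.lower() not in m.addresses:
            continue
        if subject and subject.lower() not in m.subject.lower():
            continue
        if since and m.sent < since:
            continue
        if until and m.sent > until:
            continue
        if text and text.lower() not in m.body.lower():
            continue
        hits.append(m)
    return hits
```

A news article carries the same From, Date and Subject headers, so the
same extraction would presumably apply to it, with Newsgroups taking the
place of To and Cc. That last point is an assumption, not something
tested here.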

Header parsing is handled by JavaMail. I know that also handles NNTP
traffic but have no understanding of the detail of how it does that,
since I'm not currently planning to handle NNTP.

If you go the same way I have, the loader path will need modification, and
so will the retrieval method: my MTA Bcc's all mail to a POP3 mailbox and
a cron job batch-loads that into the MailArchive database. A search and
retrieve tool selects matching messages for inspection and retrieves
interesting ones by mailing them to the search user.
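
The batch-load step described above could be sketched along these lines
in Python, using the standard poplib module. This is only an assumed
shape for such a loader, not the MailArchive one: drain_mailbox,
load_from_server and the store callback are hypothetical names, and the
host and credentials would have to come from whatever the cron job is
configured with.

```python
import poplib
from email import message_from_bytes


def drain_mailbox(pop, store):
    """Fetch every message from an open POP3 session, hand each one to
    store(), and mark it for deletion. Returns the number fetched."""
    count, _size = pop.stat()
    fetched = 0
    for i in range(1, count + 1):          # POP3 message numbers start at 1
        _resp, lines, _octets = pop.retr(i)
        msg = message_from_bytes(b"\r\n".join(lines))
        store(msg)                         # e.g. insert into the database
        pop.dele(i)                        # removed by the server at QUIT
        fetched += 1
    return fetched


def load_from_server(host, user, password, store):
    """Open a POP3-over-SSL session, drain the mailbox, then log out."""
    pop = poplib.POP3_SSL(host)
    try:
        pop.user(user)
        pop.pass_(password)
        return drain_mailbox(pop, store)
    finally:
        pop.quit()
```

Deleting only after store() has returned means that a message whose load
fails stays in the mailbox for the next run, provided the session ends
without QUIT; as written, the finally clause still issues QUIT, so a
stricter loader would call pop.rset() before quitting on error.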

Both database operations are quick. Last night the loader scanned 117
messages in 10 seconds, loading 83 of them and discarding the rest - the
loader discards anything that SA has marked as spam and anything that
doesn't pass a set of configurable address filters, e.g. I don't archive
system messages such as logwatch, archive loader or backup reports. A
single forename body text search (over all 73,600 messages in the
database) took 36 seconds to pull all 30 matches out. That's searching a
PostgreSQL database on an 866 MHz single P3 box with the search tool on a
1.6 GHz CoreDuo on the other side of my 100 Mbit LAN.
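
The discard rules mentioned above (spam-marked mail, plus anything
failing the address filters) might look something like this Python
sketch. It assumes SpamAssassin's usual "X-Spam-Flag: YES" header is
what marks a message as spam, and it uses shell-style patterns on the
sender address as a stand-in for the configurable filters. The names
should_archive and EXCLUDED_SENDERS and the patterns themselves are
hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr
from fnmatch import fnmatch

# Hypothetical filter list: senders whose mail is not worth archiving.
EXCLUDED_SENDERS = ["logwatch@*", "backup@*"]


def should_archive(msg, excluded=EXCLUDED_SENDERS):
    """Return False for spam-flagged mail and for excluded senders."""
    # Assumption: spam is recognised by the X-Spam-Flag header.
    if msg.get("X-Spam-Flag", "").strip().upper() == "YES":
        return False
    sender = parseaddr(msg.get("From", ""))[1].lower()
    return not any(fnmatch(sender, pattern) for pattern in excluded)
```

For example, a message from "Logwatch <logwatch@host.example.org>" is
rejected by the first pattern, while ordinary mail without the spam flag
passes through to the loader.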


--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org