From: Wietse Venema
Victor Duchovni:
> On Sat, Mar 27, 2010 at 08:53:03PM -0400, Wietse Venema wrote:
>
> > Currently, sites that send valid UTF-8 in MAIL/RCPT commands can
> > make meaningful LDAP queries in Postfix. Lots of MTAs are 8-bit
> > clean internally, so this can actually work today.
> >
> > Do we want to remove this ability from Postfix, or should we add
> > a valid_utf_8() routine in anticipation of a future standardization
> > of UTF8SMTP?
>
> I am a bit reluctant at this time to assume that untyped data coming in
> that looks like UTF-8 really is UTF-8. Even if the LDAP lookup returns
> plausibly useful results, will the UTF-8 envelope survive related
> processing in Postfix?
>
> - PCRE lookups don't currently request UTF-8 support

Meaning it will blow up, or what?

> - Logs don't support non-destructive recording of UTF-8
> envelopes.

I expect that in the long term, UTF-8 will be the canonical
representation of text in *NIX files, and that we should plan
for that future.
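
For what it is worth, the valid_utf_8() routine mentioned above needs
little more than a byte-class check. A minimal sketch (illustrative
only, not Postfix code; it rejects truncated sequences, overlong
forms, surrogates, and values above U+10FFFF):

    #include <stddef.h>

    /* Return non-zero when str[0..len-1] is well-formed UTF-8. */
    int     valid_utf_8(const unsigned char *str, size_t len)
    {
        size_t  i = 0;

        while (i < len) {
            unsigned char c = str[i];

            if (c < 0x80) {                     /* ASCII */
                i += 1;
            } else if ((c & 0xE0) == 0xC0) {    /* 2-byte sequence */
                if (c < 0xC2 || i + 1 >= len    /* overlong, truncated */
                    || (str[i + 1] & 0xC0) != 0x80)
                    return (0);
                i += 2;
            } else if ((c & 0xF0) == 0xE0) {    /* 3-byte sequence */
                if (i + 2 >= len
                    || (str[i + 1] & 0xC0) != 0x80
                    || (str[i + 2] & 0xC0) != 0x80
                    || (c == 0xE0 && str[i + 1] < 0xA0) /* overlong */
                    || (c == 0xED && str[i + 1] > 0x9F))/* surrogate */
                    return (0);
                i += 3;
            } else if ((c & 0xF8) == 0xF0) {    /* 4-byte sequence */
                if (c > 0xF4 || i + 3 >= len    /* above U+10FFFF, truncated */
                    || (str[i + 1] & 0xC0) != 0x80
                    || (str[i + 2] & 0xC0) != 0x80
                    || (str[i + 3] & 0xC0) != 0x80
                    || (c == 0xF0 && str[i + 1] < 0x90) /* overlong */
                    || (c == 0xF4 && str[i + 1] > 0x8F))/* above U+10FFFF */
                    return (0);
                i += 4;
            } else {                            /* stray continuation, 0xFE, 0xFF */
                return (0);
            }
        }
        return (1);
    }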

Wietse

From: Victor Duchovni
On Wed, Apr 14, 2010 at 12:54:47PM -0400, Wietse Venema wrote:

> > I am a bit reluctant at this time to assume that untyped data coming in
> > that looks like UTF-8 really is UTF-8. Even if the LDAP lookup returns
> > plausibly useful results, will the UTF-8 envelope survive related
> > processing in Postfix?
> >
> > - PCRE lookups don't currently request UTF-8 support
>
> Meaning it will blow up, or what?

When passing UTF-8 data to a regexp engine, we need to tell the engine
that it is handling UTF-8 data, or it may produce match sub-expressions
that consist of pieces of characters. Should "a.b" match a Unicode string
where there is a multibyte character between "a" and "b"? What should ${1}
be for "(a*.)" when "a" is followed by a multi-byte character?
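
To make that concrete: with PCRE the difference is whether the pattern
is compiled with the PCRE_UTF8 option. A self-contained sketch, assuming
a PCRE build that has UTF-8 support compiled in (it is optional):

    #include <stdio.h>
    #include <string.h>
    #include <pcre.h>

    int     main(void)
    {
        const char *error;
        int     erroffset;
        int     ovector[30];

        /* "a", then U+00E9 as the two bytes C3 A9, then "b". */
        const char *subject = "a\xc3\xa9" "b";

        /* Byte mode: "." matches one BYTE, so "^a.b$" does not match. */
        pcre   *bytes = pcre_compile("^a.b$", 0, &error, &erroffset, NULL);

        /* UTF-8 mode: "." matches one CHARACTER, so "^a.b$" matches. */
        pcre   *chars = pcre_compile("^a.b$", PCRE_UTF8, &error, &erroffset, NULL);

        if (bytes == NULL || chars == NULL) {
            fprintf(stderr, "pcre_compile: %s\n", error);
            return (1);
        }
        printf("byte mode: %s\n", pcre_exec(bytes, NULL, subject,
            (int) strlen(subject), 0, 0, ovector, 30) >= 0 ?
            "match" : "no match");
        printf("char mode: %s\n", pcre_exec(chars, NULL, subject,
            (int) strlen(subject), 0, 0, ovector, 30) >= 0 ?
            "match" : "no match");
        return (0);
    }

In byte mode nothing stops "." or "a*." from matching the first byte of
the C3 A9 pair and leaving the second byte behind, which is exactly how
${1} ends up holding a fragment of a character.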

More generally, the issue is that we need a larger design in which we
have a canonical data representation inside all the pieces of Postfix,
and conversion logic at all system boundaries. This is much bigger than
LDAP lookups.

> > - Logs don't support non-destructive recording of UTF-8
> > envelopes.
>
> I expect that in the long term, UTF-8 will be the canonical
> representation of text in *NIX files, and that we should plan
> for that future.

Yes, of course. The LDAP IS_ASCII check will be easy to remove, and
LDAP supports Unicode, so that will be the easy part, but first we need a
"contract" that all inputs to the dictionary layer are UTF-8, and the
"dict_<your-type-here>" clients will need to ensure that this is so.

After that, we can just let the UTF-8 data flow into the database engine
if supported, or try to translate to the database charset if not. Probably
each table's charset would be declared as part of the table configuration,
and the generic dictionary layer would handle translation of inputs and
outputs...

Anyway, I am still reluctant to make use of UTF-8 without a larger
context in which this makes sense.

--
Viktor.

From: Wietse Venema
Victor Duchovni:
> On Wed, Apr 14, 2010 at 12:54:47PM -0400, Wietse Venema wrote:
>
> > > I am a bit reluctant at this time to assume that untyped data coming in
> > > that looks like UTF-8 really is UTF-8. Even if the LDAP lookup returns
> > > plausibly useful results, will the UTF-8 envelope survive related
> > > processing in Postfix?
> > >
> > > - PCRE lookups don't currently request UTF-8 support
> >
> > Meaning it will blow up, or what?
>
> When passing UTF-8 data to a regexp engine, we need to tell the engine
> that it is handling UTF-8 data, or it may produce match sub-expressions
> that consist of pieces of characters. Should "a.b" match a Unicode string
> where there is a multibyte character between "a" and "b"? What should ${1}
> be for "(a*.)" when "a" is followed by a multi-byte character?
>
> More generally, the issue is that we need a larger design in which we
> have a canonical data representation inside all the pieces of Postfix,
> and conversion logic at all system boundaries. This is much bigger than
> LDAP lookups.

Speaking of canonical representation, Postfix by design strips off
the encapsulation on input (CRLF in SMTP, newline in local submission,
and length+value in QMQP) and adds the encapsulation back upon delivery.
This is sufficient for 7BIT or 8BITMIME content as we know it today.

Note that by doing this, Postfix normalizes only the end-of-line
convention, not the payload of the message. This means that with
well-formed mail, the SMTP input is guaranteed to be identical to
the SMTP output (ignoring the extra Received: header), and so on.
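
In code terms, the encapsulation handling amounts to no more than this
(function names invented for illustration; only the line terminator is
rewritten, never the payload):

    #include <stdio.h>

    /* Input side: strip the transport's end-of-line encapsulation. */
    size_t  strip_eol(const char *line, size_t len)
    {
        if (len >= 2 && line[len - 2] == '\r' && line[len - 1] == '\n')
            return (len - 2);           /* SMTP: CRLF */
        if (len >= 1 && line[len - 1] == '\n')
            return (len - 1);           /* local submission: newline */
        return (len);
    }

    /* Output side: add the output transport's encapsulation back. */
    void    smtp_put_line(FILE *fp, const char *line, size_t len)
    {
        fwrite(line, 1, len, fp);
        fputs("\r\n", fp);              /* SMTP output: CRLF */
    }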

I don't think it is necessarily a good idea to "normalize" message
and envelope content into a canonical format (UTF-8 or otherwise),
do all processing in the canonical domain, and then do another
transformation on delivery.

More likely, one would transform a non-ASCII lookup string into
the character set of the lookup table mechanism and back, whatever
that character set might be, and return "not found" when the
transformation is not possible or when it is not implemented.
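
As an illustration, such a transformation could sit on top of iconv(3).
The function name and the error policy are invented here; a null result
is what the lookup code would report as "not found":

    #include <stdlib.h>
    #include <string.h>
    #include <iconv.h>

    /*
     * Sketch: convert a UTF-8 lookup key into the table's charset.
     * Returns a malloc()ed string, or a null pointer when the key
     * cannot be represented in that charset.
     */
    char   *key_to_table_charset(const char *table_charset, const char *key)
    {
        iconv_t cd = iconv_open(table_charset, "UTF-8");
        size_t  inleft = strlen(key);
        size_t  outleft = 4 * inleft + 1;       /* worst-case growth */
        char   *out, *inp = (char *) key, *outp;

        if (cd == (iconv_t) -1)
            return (0);                         /* unknown charset */
        outp = out = malloc(outleft);
        if (out == 0
            || iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
            iconv_close(cd);
            free(out);                          /* untranslatable key */
            return (0);
        }
        *outp = 0;
        iconv_close(cd);
        return (out);
    }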

Although gateway MTAs have a choice to either downgrade 8BITMIME
to 7BIT or return mail as undeliverable, there is no equivalent
choice for envelope addresses with non-ASCII localparts. A gateway
into today's SMTP world would have to return envelopes with non-ASCII
localparts as undeliverable.

I would not be surprised if someone comes up with the equivalent
of RFC 2047 for SMTP envelope localparts, so that mail can be
tunneled through a legacy SMTP infrastructure, between systems that
support 8-bit usernames.
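
Purely as a sketch of what such tunneling could look like (nothing here
is standardized; the =?UTF-8?Q?...?= framing is borrowed from message
headers, and the function is invented for illustration):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Speculative illustration: wrap a UTF-8 localpart in an RFC
     * 2047-style encoded-word, so that an ASCII-only SMTP hop can
     * carry it unharmed. Returns a malloc()ed ASCII string.
     */
    char   *encode_localpart(const char *localpart)
    {
        static const char hex[] = "0123456789ABCDEF";
        const unsigned char *p = (const unsigned char *) localpart;

        /* Worst case: every byte becomes "=XX", plus framing and null. */
        char   *buf = malloc(3 * strlen(localpart) + sizeof("=?UTF-8?Q?" "?="));
        char   *bp;

        if (buf == 0)
            return (0);
        bp = buf + sprintf(buf, "=?UTF-8?Q?");
        for ( /* void */ ; *p; p++) {
            if (*p >= 0x80 || *p == '=' || *p == '?' || *p == ' ') {
                *bp++ = '=';
                *bp++ = hex[*p >> 4];
                *bp++ = hex[*p & 0xF];
            } else {
                *bp++ = *p;
            }
        }
        strcpy(bp, "?=");
        return (buf);
    }

With this, the localpart "rené" would travel as the ASCII token
"=?UTF-8?Q?ren=C3=A9?=" and be decoded again by an 8-bit-capable
receiving system.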

Wietse