From: Peter Olcott on
On 5/31/2010 11:35 AM, Daniel T. wrote:
> Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>
>> I used the two tables from this link as the basis for my design:
>> http://en.wikipedia.org/wiki/UTF-8
>
> I suggest you use http://unicode.org/ for your source. Why use a
> secondary source when the primary source is easily available?

Wading though all that to get what I need takes too long.

>
>> I would like this reviewed for algorithm correctness:
>
> Surely your tests have already shown whether the algorithm is correct.
>
>> void UnicodeEncodingConversion::
>> toUTF8(std::vector<uint32_t>& UTF32, std::vector<uint8_t>& UTF8) {
>> uint8_t Byte;
>> uint32_t CodePoint;
>> UTF8.reserve(UTF32.size() * 4); // worst case
>> for (uint32_t N = 0; N< UTF32.size(); N++) {
>> CodePoint = UTF32[N];
>
> I suggest you use an iterator instead of an integer for the loop. That
> way you wont need the extraneous variable.
>
>> if (CodePoint<= 0x7F) {
>> Byte = CodePoint;
>> UTF8.push_back(Byte);
>> }
>> else if (CodePoint<= 0x7FF) {
>> Byte = 0xC0 | (CodePoint>> 6);
>> UTF8.push_back(Byte);
>> Byte = 0x80 | (CodePoint& 0x3F);
>> UTF8.push_back(Byte);
>> }
>> else if (CodePoint<= 0xFFFF) {
>> Byte = 0xE0 | (CodePoint>> 12);
>> UTF8.push_back(Byte);
>> Byte = 0x80 | ((CodePoint>> 6)& 0x3F);
>> UTF8.push_back(Byte);
>> Byte = 0x80 | (CodePoint& 0x3F);
>> UTF8.push_back(Byte);
>> }
>> else if (CodePoint<= 0x10FFFF) {
>
> The codes 10FFFE and 10FFFF are guaranteed not to be unicode
> characters...

So then Wikipedia is wrong?
http://en.wikipedia.org/wiki/Unicode
16 100000�10FFFF Supplementary Private Use Area-B

>
>> Byte = 0xF0 | (CodePoint>> 18);
>> UTF8.push_back(Byte);
>> Byte = 0x80 | ((CodePoint>> 12)& 0x3F);
>> UTF8.push_back(Byte);
>> Byte = 0x80 | ((CodePoint>> 6)& 0x3F);
>> UTF8.push_back(Byte);
>> Byte = 0x80 | (CodePoint& 0x3F);
>> UTF8.push_back(Byte);
>> }
>> else
>> printf("%d is outside of the Unicode range!\n", CodePoint);
>
> Throw is more appropriate here.
>
>> }
>> }

So it looks otherwise correct?

I am guessing that I should probably also screen out the high and low
surrogates.

From: Giovanni Dicanio on
"Joseph M. Newcomer" <newcomer(a)flounder.com> wrote:

>> UTF8.reserve(UTF32.size() * 4); // worst case
> ****
> Note that this will call malloc(), which will involve setting a lock, then
> searching for a
> block to allocate, then releasing the lock. Since you have been a fanatic
> about
> performance, why is it you put a very expensive operation like 'reserve'
> in your code?
>
> While it is perfectly reasonable, it seems inconsistent with your
> previously-stated goals.

Joe: I'm not sure if you are ironic or something :) ... but I believe that
std::vector::reserve() with a proper capacity value, followed by several
push_back()s, is very efficient.
Sure, not as efficient as a static stack-allocated array, but very
efficient.


> No, the CORRECT way to write such code is to either throw an exception (if
> you are in C++,
> which you clearly are) or return a value indicating the error (for
> example, in C, an

In this case, I'm for exception.
Thanks to exception, you could use the precious function return value to
actually return the resulting buffer (UTF8 string), instead of passing it as
a reference to the function:

// Updated prototype:
// - use 'const' correctness for utf32
// - return resulting utf8
// - may throw on error
std::vector<uint8_t> toUTF8(const std::vector<uint32_t> & utf32);

Note that thanks to the move semantics (i.e. the new "&&" thing of C++0x,
available in VC10 a.k.a. VS2010), you don't pay for extra useless copies in
returning potentially big objects.

Giovanni



From: Daniel T. on
Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
> On 5/31/2010 11:35 AM, Daniel T. wrote:

> > The codes 10FFFE and 10FFFF are guaranteed not to be unicode
> > characters...
>
> So then Wikipedia is wrong?
> http://en.wikipedia.org/wiki/Unicode
> 16 100000�10FFFF Supplementary Private Use Area-B

According to unicode.org, apparently yes. You'd know that if you hadn't
been lazy and only consulted a secondary source.

> So it looks otherwise correct?

Does it pass all your tests? You do have tests don't you?
From: Daniel T. on
"Leigh Johnston" <leigh(a)i42.co.uk> wrote:
> "Daniel T." <daniel_t(a)earthlink.net> wrote:
> > Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
> >
> > > void UnicodeEncodingConversion::
> > > toUTF8(std::vector<uint32_t>& UTF32, std::vector<uint8_t>& UTF8) {
> > > uint8_t Byte;
> > > uint32_t CodePoint;
> > > UTF8.reserve(UTF32.size() * 4); // worst case
> > > for (uint32_t N = 0; N < UTF32.size(); N++) {
> > > CodePoint = UTF32[N];
> >
> > I suggest you use an iterator instead of an integer for the loop. That
> > way you wont need the extraneous variable.
>
> Then the iterator would be extraneous surely? Unless you mean CodePoint is
> the extraneous variable which it isn't as it is accessed multiple times and
> dereferencing an iterator multiple times would not be as efficient modulo
> any compiler optimizations; it certainly is not as clear as using a
> temporary (IMO).

Our opinions must differ then. Such micro-optimizations would be barely
perceptible even in a contrived example.

"Every piece of knowledge must have a single, unambiguous, authoritative
representation within a system." -- Andrew Hunt

CodePoint and UTF32[N] are two representations that both refer to the
same piece of knowledge. Why the unnecessary duplication?
From: Peter Olcott on
On 5/31/2010 1:16 PM, Giovanni Dicanio wrote:
> "Joseph M. Newcomer" <newcomer(a)flounder.com> wrote:
>
>>> UTF8.reserve(UTF32.size() * 4); // worst case
>> ****
>> Note that this will call malloc(), which will involve setting a lock,
>> then searching for a
>> block to allocate, then releasing the lock. Since you have been a
>> fanatic about
>> performance, why is it you put a very expensive operation like
>> 'reserve' in your code?
>>
>> While it is perfectly reasonable, it seems inconsistent with your
>> previously-stated goals.
>
> Joe: I'm not sure if you are ironic or something :) ... but I believe
> that std::vector::reserve() with a proper capacity value, followed by
> several push_back()s, is very efficient.
> Sure, not as efficient as a static stack-allocated array, but very
> efficient.

He needed to find some excuse to denigrate my code. He has had a
personal grudge against me for several months. I don't really know what
I said to offend him, but, it must have occurred sometime after he sung
very high praises about my patent a few months ago.

>> No, the CORRECT way to write such code is to either throw an exception
>> (if you are in C++,
>> which you clearly are) or return a value indicating the error (for
>> example, in C, an
>

The "correct" way to handle an error when testing code for the first
time is to use a printf() statement, or other easy to use debugging
construct. When the code moves to production, then either of the other
two suggestions may be appropriate.

> In this case, I'm for exception.
> Thanks to exception, you could use the precious function return value to
> actually return the resulting buffer (UTF8 string), instead of passing
> it as a reference to the function:
>
> // Updated prototype:
> // - use 'const' correctness for utf32
> // - return resulting utf8
> // - may throw on error
> std::vector<uint8_t> toUTF8(const std::vector<uint32_t> & utf32);

For most compilers this requires making an extra copy.

>
> Note that thanks to the move semantics (i.e. the new "&&" thing of
> C++0x, available in VC10 a.k.a. VS2010), you don't pay for extra useless
> copies in returning potentially big objects.
>
> Giovanni
>
>
>
Counting on this results in code that does not have the same performance
characteristics across multiple platforms.