From: Peter Olcott on
On 5/31/2010 1:24 PM, Daniel T. wrote:
> Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>> On 5/31/2010 11:35 AM, Daniel T. wrote:
>
>>> The codes 10FFFE and 10FFFF are guaranteed not to be unicode
>>> characters...
>>
>> So then Wikipedia is wrong?
>> http://en.wikipedia.org/wiki/Unicode
>> 16 100000–10FFFF Supplementary Private Use Area-B
>
> According to unicode.org, apparently yes. You'd know that if you hadn't
> been lazy and only consulted a secondary source.

I simply don't have the time to read all of the Unicode stuff to find
the two or three paragraphs that I really need to know. I already know
about High and Low surrogates. Why are the two code points that you
specified not valid?
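
For reference on this point: 10FFFE and 10FFFF do fall inside the plane 16 range that Wikipedia lists, but the Unicode Standard designates the last two code points of every plane, together with U+FDD0..U+FDEF, as noncharacters (66 in total). They are still valid code points and can be encoded in UTF-8; they are just never assigned to characters. A minimal sketch of the check follows; the name isNoncharacter is hypothetical and does not appear in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not from the thread.
// Returns true for the 66 noncharacters: U+FDD0..U+FDEF, plus the last two
// code points of each of the 17 planes (low 16 bits equal to FFFE or FFFF).
constexpr bool isNoncharacter(std::uint32_t cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) ||
           (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE);
}
```

Whether a converter should reject these values is a policy choice; the function only identifies them.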

>
>> So it looks otherwise correct?
>
> Does it pass all your tests? You do have tests don't you?

I am using the results of this function to exhaustively cross-check
the results of another function that does the conversion in the other
direction. These tests pass.
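
A round-trip check of the kind described might be sketched as follows. Neither of the real functions is shown in full in the thread, so encodeOne, decodeOne and roundTripAll are hypothetical stand-ins, and the decoder does no validation because it only ever sees the encoder's own output.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-code-point encoder; cp must be a valid scalar value.
std::vector<std::uint8_t> encodeOne(std::uint32_t cp) {
    std::vector<std::uint8_t> b;
    if (cp <= 0x7F) {
        b.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp <= 0x7FF) {
        b.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        b.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp <= 0xFFFF) {
        b.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        b.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        b.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        b.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        b.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        b.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        b.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return b;
}

// Hypothetical decoder for one well-formed, non-empty sequence (1 to 4 bytes).
std::uint32_t decodeOne(const std::vector<std::uint8_t>& b) {
    switch (b.size()) {
        case 1:  return b[0];
        case 2:  return ((b[0] & 0x1Fu) << 6) | (b[1] & 0x3Fu);
        case 3:  return ((b[0] & 0x0Fu) << 12) | ((b[1] & 0x3Fu) << 6) |
                        (b[2] & 0x3Fu);
        default: return ((b[0] & 0x07u) << 18) | ((b[1] & 0x3Fu) << 12) |
                        ((b[2] & 0x3Fu) << 6) | (b[3] & 0x3Fu);
    }
}

// Checks every scalar value (0..10FFFF, surrogates skipped) in both directions.
bool roundTripAll() {
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;  // surrogates: not encodable
        if (decodeOne(encodeOne(cp)) != cp) return false;
    }
    return true;
}
```

A passing round trip only shows that the two functions are inverses of each other. A matching mistake in both directions would go unnoticed, so a few fixed byte sequences (for example U+20AC as E2 82 AC) are worth checking as well.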

From: Peter Olcott on
On 5/31/2010 1:41 PM, Daniel T. wrote:
> "Leigh Johnston"<leigh(a)i42.co.uk> wrote:
>> "Daniel T."<daniel_t(a)earthlink.net> wrote:
>>> Peter Olcott<NoSpam(a)OCR4Screen.com> wrote:
>>>
>>>> void UnicodeEncodingConversion::
>>>> toUTF8(std::vector<uint32_t>& UTF32, std::vector<uint8_t>& UTF8) {
>>>> uint8_t Byte;
>>>> uint32_t CodePoint;
>>>> UTF8.reserve(UTF32.size() * 4); // worst case
>>>> for (uint32_t N = 0; N< UTF32.size(); N++) {
>>>> CodePoint = UTF32[N];
>>>
>>> I suggest you use an iterator instead of an integer for the loop. That
>>> way you won't need the extraneous variable.
>>
>> Then the iterator would be extraneous surely? Unless you mean CodePoint is
>> the extraneous variable which it isn't as it is accessed multiple times and
>> dereferencing an iterator multiple times would not be as efficient modulo
>> any compiler optimizations; it certainly is not as clear as using a
>> temporary (IMO).
>
> Our opinions must differ then. Such micro-optimizations would be barely
> perceptible even in a contrived example.
>
> "Every piece of knowledge must have a single, unambiguous, authoritative
> representation within a system." -- Andrew Hunt
>
> CodePoint and UTF32[N] are two representations that both refer to the
> same piece of knowledge. Why the unnecessary duplication?

Here is the best reason:

bool UnicodeEncodingConversion::toUTF8
(const std::vector<uint32_t>& UTF32,
std::vector<uint8_t>& UTF8) {

(see the added const?)

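
For illustration, here is one way the const-qualified version could look in full. Only the signature comes from the post; the body is an assumption, and it is written as a free function because the definition of UnicodeEncodingConversion is not shown in the thread. The input can no longer be modified through UTF32, and each element is read exactly once into a local const.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: free-function stand-in for the member function in the post.
// Returns false on the first value that is not a Unicode scalar value.
bool toUTF8(const std::vector<std::uint32_t>& UTF32,
            std::vector<std::uint8_t>& UTF8) {
    UTF8.reserve(UTF8.size() + UTF32.size() * 4);  // worst case: 4 bytes each
    for (std::size_t N = 0; N < UTF32.size(); ++N) {
        const std::uint32_t CodePoint = UTF32[N];  // single read of the element
        if (CodePoint <= 0x7F) {
            UTF8.push_back(static_cast<std::uint8_t>(CodePoint));
        } else if (CodePoint <= 0x7FF) {
            UTF8.push_back(static_cast<std::uint8_t>(0xC0 | (CodePoint >> 6)));
            UTF8.push_back(static_cast<std::uint8_t>(0x80 | (CodePoint & 0x3F)));
        } else if (CodePoint <= 0xFFFF) {
            if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
                return false;  // surrogates are not scalar values
            UTF8.push_back(static_cast<std::uint8_t>(0xE0 | (CodePoint >> 12)));
            UTF8.push_back(static_cast<std::uint8_t>(0x80 | ((CodePoint >> 6) & 0x3F)));
            UTF8.push_back(static_cast<std::uint8_t>(0x80 | (CodePoint & 0x3F)));
        } else if (CodePoint <= 0x10FFFF) {
            UTF8.push_back(static_cast<std::uint8_t>(0xF0 | (CodePoint >> 18)));
            UTF8.push_back(static_cast<std::uint8_t>(0x80 | ((CodePoint >> 12) & 0x3F)));
            UTF8.push_back(static_cast<std::uint8_t>(0x80 | ((CodePoint >> 6) & 0x3F)));
            UTF8.push_back(static_cast<std::uint8_t>(0x80 | (CodePoint & 0x3F)));
        } else {
            return false;  // beyond the last code point, 10FFFF
        }
    }
    return true;
}
```

On failure the bytes already appended stay in UTF8; a caller that needs all-or-nothing behaviour would have to truncate the output itself.
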
From: Leigh Johnston on


"Daniel T." <daniel_t(a)earthlink.net> wrote in message
news:daniel_t-A330C0.14413831052010(a)70-3-168-216.pools.spcsdns.net...
> "Leigh Johnston" <leigh(a)i42.co.uk> wrote:
>> "Daniel T." <daniel_t(a)earthlink.net> wrote:
>> > Peter Olcott <NoSpam(a)OCR4Screen.com> wrote:
>> >
>> > > void UnicodeEncodingConversion::
>> > > toUTF8(std::vector<uint32_t>& UTF32, std::vector<uint8_t>& UTF8) {
>> > > uint8_t Byte;
>> > > uint32_t CodePoint;
>> > > UTF8.reserve(UTF32.size() * 4); // worst case
>> > > for (uint32_t N = 0; N < UTF32.size(); N++) {
>> > > CodePoint = UTF32[N];
>> >
>> > I suggest you use an iterator instead of an integer for the loop. That
>> > way you won't need the extraneous variable.
>>
>> Then the iterator would be extraneous surely? Unless you mean CodePoint is
>> the extraneous variable which it isn't as it is accessed multiple times and
>> dereferencing an iterator multiple times would not be as efficient modulo
>> any compiler optimizations; it certainly is not as clear as using a
>> temporary (IMO).
>
> Our opinions must differ then. Such micro-optimizations would be barely
> perceptible even in a contrived example.

The optimization could be perceptible if the compiler was not optimizing
multiple iterator dereferences and a lot of data was being processed. I
don't consider it to be just an optimization: the code is clearer when the
value is stored in a temporary. Scattering "*it" through the code forces the
reader to check visually that "it" does not change inside the loop.

>
> "Every piece of knowledge must have a single, unambiguous, authoritative
> representation within a system." -- Andrew Hunt

I disagree: the same piece of "knowledge" can have more than one
representation in a system, e.g. "a person's name" represented as UTF-8 in a
lower-level "model" class and the same "person's name" represented as UTF-16
in a higher-level "edit box GUI" class. The definition of "knowledge"
perhaps requires some refinement.

>
> CodePoint and UTF32[N] are two representations that both refer to the
> same piece of knowledge. Why the unnecessary duplication?

It is not unnecessary *if* there is a noticeable performance improvement. I
agree, however, that premature optimization should be avoided (obviously),
which is why profiling should be performed. But it is also a matter of
writing clear code that is easy to parse (understand). There is no real
disadvantage to storing the result of "*it" or "UTF32[N]" in a temporary.

/Leigh

From: Sam on
Peter Olcott writes:

> On 5/31/2010 1:41 PM, Daniel T. wrote:
>> CodePoint and UTF32[N] are two representations that both refer to the
>> same piece of knowledge. Why the unnecessary duplication?
>
> Here is the best reason:
>
> bool UnicodeEncodingConversion::toUTF8
> (const std::vector<uint32_t>& UTF32,
> std::vector<uint8_t>& UTF8) {
>
> (see the added const ?)

And even better:

template<typename input_iter_t, typename output_iter_t>
bool toUTF8(input_iter_t beg_iter, input_iter_t end_iter,
output_iter_t output_iter)

So that your masterpiece could be used with not just vectors, but any
container, or any suitable stream.
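
A rough sketch of how such a template might be filled in is below. The signature is the one given above; the body is an assumption, including the choice to return false at the first value that is not a Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Sketch only: works with any input range of 32-bit code points and any
// output iterator that accepts bytes (for example std::back_inserter).
template <typename input_iter_t, typename output_iter_t>
bool toUTF8(input_iter_t beg_iter, input_iter_t end_iter,
            output_iter_t output_iter) {
    for (; beg_iter != end_iter; ++beg_iter) {
        const std::uint32_t cp = *beg_iter;  // one dereference per element
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;  // not a scalar value; earlier output stays written
        if (cp <= 0x7F) {
            *output_iter++ = static_cast<std::uint8_t>(cp);
            continue;
        }
        // Number of continuation bytes, and the matching lead-byte marker.
        const int extra = cp <= 0x7FF ? 1 : cp <= 0xFFFF ? 2 : 3;
        const std::uint32_t lead = extra == 1 ? 0xC0 : extra == 2 ? 0xE0 : 0xF0;
        *output_iter++ = static_cast<std::uint8_t>(lead | (cp >> (6 * extra)));
        for (int i = extra - 1; i >= 0; --i)
            *output_iter++ =
                static_cast<std::uint8_t>(0x80 | ((cp >> (6 * i)) & 0x3F));
    }
    return true;
}
```

It could then be called as toUTF8(in.begin(), in.end(), std::back_inserter(out)) for a vector, a list, or a plain array via std::begin and std::end.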

But, I'm sure you have no time to learn all this complicated stuff.


From: Giovanni Dicanio on
"Peter Olcott" <NoSpam(a)OCR4Screen.com> wrote:

> He needed to find some excuse to denigrate my code. He has had a
> personal grudge against me for several months. I don't really know what
> I said to offend him, but it must have occurred sometime after he sang
> very high praises about my patent a few months ago.

I don't think so.

Joe helps lots of people here (and is a nice guy in person!).

You must have misunderstood.


>> std::vector<uint8_t> toUTF8(const std::vector<uint32_t> & utf32);
>
> For most compilers this requires making an extra copy.

Even before move semantics, I think several C++ compilers already implemented
the RVO (return value optimization), which elides that copy.
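
As a sketch of what the by-value interface could look like: the signature is the one quoted above, while the body and the decision to throw std::invalid_argument on bad input are assumptions. The code uses C++11 syntax. Returning the named local directly is what makes the function eligible for the named form of RVO; the language permits that elision but does not require it, and C++11 falls back to a move when the compiler does not elide.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch only: by-value variant; error handling by exception is an assumption.
std::vector<std::uint8_t> toUTF8(const std::vector<std::uint32_t>& utf32) {
    std::vector<std::uint8_t> utf8;
    utf8.reserve(utf32.size() * 4);  // worst case: 4 bytes per code point
    static const std::uint8_t lead[] = {0x00, 0xC0, 0xE0, 0xF0};
    for (std::uint32_t cp : utf32) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        // Number of continuation bytes that follow the lead byte.
        const int extra = cp <= 0x7F ? 0 : cp <= 0x7FF ? 1 : cp <= 0xFFFF ? 2 : 3;
        utf8.push_back(static_cast<std::uint8_t>(lead[extra] | (cp >> (6 * extra))));
        for (int i = extra - 1; i >= 0; --i)
            utf8.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> (6 * i)) & 0x3F)));
    }
    return utf8;  // single named local on every return path
}
```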

Giovanni