From: Peter Olcott on
I am aiming to produce the best balance of speed and space efficiency
against reliability and maintainability placing a higher weight on the
latter criteria.
http://www.ocr4screen.com/UTF8.h

If there are any improvements that can be made within the criteria
provided, or other improvements that do not detract from the above
criteria input would be welcomed.

This support file is required for compiling. Please do not critique the
support file it is not in final form.
http://www.ocr4screen.com/Array2D.h

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Nick Hounsome on
On 8 June, 10:54, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> I am aiming to produce the best balance of speed and space efficiency
> against reliability and maintainability placing a higher weight on the
> latter criteria.
> http://www.ocr4screen.com/UTF8.h
>
> If there are any improvements that can be made within the criteria
> provided, or other improvements that do not detract from the above
> criteria input would be welcomed.

1) uint8_t and uint32_t are already standard types so why are you
typedefing your own?
2) Precalculate the state table - It is logically a static const
State[][].
3) Eliminate all the global stuff - It's not comp.lang.c here.
4) Your class is logically stateless hence methods (as well as the
state array) should be probably be static
5) It is highly unlikely that these methods would be inlined by the
compiler (inline is a hint not an order) and if they were it could be
a space wasting disaster. Make them non-inline - The call overhead
will be dwarfed by the processing in almost all cases.
6) Use vector::reserve so that push_back wont do multiple
reallocations/copies
7) Template the methods and use iterators rather than collections -
it's more flexible (you can keep the current signatures as overloads)
8) Why are you using uint8_t as a State instead of an enum???????
NextState = 0 !!!???? Same goes for ActionCode.



--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Goran on
On Jun 8, 11:54 am, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> I am aiming to produce the best balance of speed and space efficiency
> against reliability and maintainability placing a higher weight on the
> latter criteria.
> http://www.ocr4screen.com/UTF8.h

A better solution to your problem seems to exist since some time.

http://utfcpp.sourceforge.net/

Goran.


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Peter Olcott on
On 6/8/2010 12:44 PM, Nick Hounsome wrote:
> On 8 June, 10:54, Peter Olcott<NoS...(a)OCR4Screen.com> wrote:
>> I am aiming to produce the best balance of speed and space efficiency
>> against reliability and maintainability placing a higher weight on the
>> latter criteria.
>> http://www.ocr4screen.com/UTF8.h
>>
>> If there are any improvements that can be made within the criteria
>> provided, or other improvements that do not detract from the above
>> criteria input would be welcomed.
>
> 1) uint8_t and uint32_t are already standard types so why are you
> typedefing your own?

Because they are not defined on my platform. MS VS 2008.

> 2) Precalculate the state table - It is logically a static const
> State[][].

I did it this way so that it would be small enough to fit on the stack,
and also so that it could be included as a header file.

> 3) Eliminate all the global stuff - It's not comp.lang.c here.
I don't know what you mean, everything is a member of a class.

> 4) Your class is logically stateless hence methods (as well as the
> state array) should be probably be static
Maybe, what is the advantage?

> 5) It is highly unlikely that these methods would be inlined by the
> compiler (inline is a hint not an order) and if they were it could be
> a space wasting disaster. Make them non-inline - The call overhead
> will be dwarfed by the processing in almost all cases.
The compiler requires inlining if the methods are included in the
header. I hate creating the typically required pair of files it seems so
redundant.

> 6) Use vector::reserve so that push_back wont do multiple
> reallocations/copies
Yes. I had initially allocated the worst case memory, then I thought
that this wasted memory. When I benchmarked it on 100 instances of the
entire Unicode set, it only took about 30% more time, but, 50% less
memory. It might be a good idea to at least reserve() the best case
memory requirement.

> 7) Template the methods and use iterators rather than collections -
> it's more flexible (you can keep the current signatures as overloads)
> 8) Why are you using uint8_t as a State instead of an enum???????
> NextState = 0 !!!???? Same goes for ActionCode.
>
>
Yes this is the best suggestion.


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Kenneth 'Bessarion' Boyd on
On Jun 8, 4:54 am, Peter Olcott <NoS...(a)OCR4Screen.com> wrote:
> I am aiming to produce the best balance of speed and space efficiency
> against reliability and maintainability placing a higher weight on the
> latter criteria.
> http://www.ocr4screen.com/UTF8.h
>
> If there are any improvements that can be made within the criteria
> provided, or other improvements that do not detract from the above
> criteria input would be welcomed.

Ignoring the existence of other libraries:

1) Include-guard #ifndef, please? Especially since this is intended
to be a header-only implementation? (There is no other naively
obvious reason for not using a *.cpp file to avoid needing inline
function specifiers to get past the linker.)

2) Encoding should use NextByteMask as well.

3) While the action codes should be stored as uint8_t (more generally
unsigned char) for memory efficiency reasons, it would be better not
to gamble on the compiler's optimizer: this should be an enumeration.

4) uint8_t and uint32_t collide with #include <stdint.h> (and possibly
#include <cstdint>) if that particular C++0X support is present. If
you have stdint.h, I think uint_least8_t and uint_least32_t are better
for portability to exotic hardware. In any case, have some sort of
project configuration check for stdint.h and use it if present.
4a) As only analytically non-conformant hardware would force
CHAR_BIT<8 , I'd consider just using unsigned char in place of uint8_t/
uint_least8_t.
4b) Preprocessor detection (using limits.h) of a suitable integer type
for uint_least32_t is painful; my attempts to hand-roll this have been
unconvincing.

5) The bootstrapping of State is a speed/maintainability trade off.
Speed would expect this to be a static const uint8_t State[9][256],
but
* avoiding linking errors from initializing this requires either
templating, or a *.cpp file to hold the initialization. Assuming
header-only implementation leaves templating.
* Debug-mode build would have to fully check (e.g., assert macros)
that the table is correctly initialized anyway.
* the proofreading behind both the static initialization, and the
debug-mode checking, is awful and would strongly bias towards
automatic code generation -- which simply replicates the existing
constructor code.
5a) I'm not sold on a state table for this, either. The comment
describing the table is perfectly good pseudocode for a static member
function, and I have a hard time seeing that function checking in at
more than 1152 (9*128, easiest simple compressed format viable) bytes
object size.
5b) As maintainability is important, forget about memset/std::memset
as a replacement for the loops. If you actually *have* a speed
problem you should reimplement State as one of static data, or static
member function.


--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]