From: Tegiri Nenashi on
String tokenizer is the simplest parser of all. In java, arguably, it
is much more frequently used than regular expressions. Yet I fail to
see any parsing theory book ever mentioning it.

In my practical experience it is easer write a scanner on string
tokenizer foundation rather than to use off the shellf reg exp engine.
I'm interested however what is the language theory perspective onto
this phenomenon.

Formally, a set of terminals is partitioned into a separator or a set
of separators, and the rest of terminals. Then, string tokenizer
translates a given word into a set (or list) of words. Here we have
the first technical difficulty, what exactly this translation is? Is
it a mapping from Monoid to Idempotent Semiring? Certainly not if we
insist on the result being a list of words, and not the set. Then if
we try exrtend this mapping naturally to the domain of sets, then we
have to deal with set of sets (or set of lists) on the range side of
the mapping?

Consider an example, an alphabet of terminals {a,b,c} with the "a"
being a separator. Can you suggest a grammar that be able to tokenize
the word "babcac" into {b,bc,c}? Sure something like

s >= a
t >= b
t >= c
w >= wt
w >= 1
u >= ws
v >= 1
v >= uv

would work, but does this 8(!) rule grammar present any new insight to
what string tokenizor is? Besides, assuming this grammar produces an
unambigious parse tree, there is still an extra step of extracting the
set of words from the tree.
From: Mitch on
On Apr 8, 1:44 pm, Tegiri Nenashi <TegiriNena...(a)gmail.com> wrote:
> String tokenizer is the simplest parser of all. In java, arguably, it
> is much more frequently used than regular expressions. Yet I fail to
> see any parsing theory book ever mentioning it.

'String tokenizer', used by software engineering people, is a synonym
for 'lexical analyzer', which is the term tending to be used by the
parsing/theory crowd. Just a difference in language use.

> In my practical experience it is easer write a scanner on string
> tokenizer foundation rather than to use off the shellf reg exp engine.

A string tokenizer (or lexical analyzer) may use regular expressions
to define the tokens, the off-the-shelf tool you want to use is
something like lex, flex, javacc.

> I'm interested however what is the language theory perspective onto
> this phenomenon.
>
> Formally, a set of terminals is partitioned into a separator or a set
> of separators, and the rest of terminals. Then, string tokenizer
> translates a given word into a set (or list) of words.

If all you're doing is managing delimiters (as opposed to keywords and
floating point), then that's modeled by a pretty simple regular
expression and so I can see that it'd be much easier to write one
yourself than use an off-the-shelf package. (but doesn't even C's
'strtok' take care of that?

> Here we have the first technical difficulty, what exactly this
> translation is? Is it a mapping from Monoid to Idempotent Semiring?

The map is from monoid to monoid; the implementation of the map uses a
semiring (the set described using a regexp).

....
> does this 8(!) rule grammar present any new insight to
> what string tokenizor is?

I don't think so.

> Besides, assuming this grammar produces an unambigious parse tree,
> there is still an extra step of extracting the set of words from the
> tree.

A regular language (described by a reg exp) always has an unambiguous
parse tree, and that tree is really an unnested list. As to extra
step, that is just a matter of how you implement your parsing. You
could extract each token as you go along, rather than producing a tree
then scanning it a second time to produce the separate tokens.

I think the idea of tokenizing via regular expressions is used because
it is so easy to do that part of the parsing on-line, as each
character comes in (no backtracking, you don't have to complete the
token parsing before you know you have the correct token. (of course
this doesn't apply to extended regular expressions with variables).

Mitch
From: Hans-Peter Diettrich on
Tegiri Nenashi wrote:

> String tokenizer is the simplest parser of all. In java, arguably, it
> is much more frequently used than regular expressions. Yet I fail to
> see any parsing theory book ever mentioning it.

Just a shot into the dark:

Tokenizers typically are described as automatons, which are not bound
to the restrictions of grammars [questionable, I know ;-]. With
regards to the Chomsky hierarchy, a tokenizer is not bound or
restricted to some specific Chomsky class.

The most important difference IMO is the goal, where an automaton is
allowed to stop *before* reaching the end of some input, whereas a
grammar requires that only the *entire* input can be a valid element
of the language. OTOH an automaton is allowed to never stop, what at
least is undesireable for an tokenizer, and for an parser as well.

Of course one can argue that the language, recognized by an
tokenizer/lexer, simply consists of a concatenation of valid tokens.
Even then I'd prefer to distinguish between "words" and "sentences" of
a language, where the tokenizer recognizes the words, and a parser
recognizes sentences, consisting of words. Kind of a multi-level
grammar, whose levels can be analyzed separately. At least I found
considerable problems in the transformation of languages, which were
*designed* for distinct lexer and parser "grammars", into a single
grammar for an scannerless parser.

I leave it to somebody else, to find a more "mathematical" definition
for my thoughts, so that my "obvious" differences can be discussed in a
more formal way ;-)


> Consider an example, an alphabet of terminals {a,b,c} with the "a"
> being a separator. Can you suggest a grammar that be able to tokenize
> the word "babcac" into {b,bc,c}?
[...]

goal ::= [separator] { word [separator] } EOF.
separator ::= "a".
word ::= { "b" | "c" }.

> would work, but does this 8(!) rule grammar present any new insight to
> what string tokenizor is?

IMO that's a matter of the grammar notation.

> Besides, assuming this grammar produces an
> unambigious parse tree, there is still an extra step of extracting the
> set of words from the tree.

That's why a lexer does not normally produce an parse tree, but a
stream of tokens. The lexer can be considered as a filter, sitting in
a pipeline in between the input and the parser. In the practice, of
e.g. an C parser, more filters sit in the tool chain, implementing
e.g. the preprocessor with macro expansion and include file
handling. But again this only is an argument for the separation of an
parser into distinct parts, for easy modelling and implementation, not
a formal discussion of the underlying math or algebra.

DoDi
From: sipayi on
On Apr 11, 11:15 am, Hans-Peter Diettrich <DrDiettri...(a)aol.com>
wrote:
> Tegiri Nenashi wrote:
> > String tokenizer is the simplest parser of all. In java, arguably, it
> > is much more frequently used than regular expressions. Yet I fail to
> > see any parsing theory book ever mentioning it.


Isn't this a case of simple Mealy machine?
Q = {q0, q1}
Q0 = q0
S = {1, b, c}
O = {"b", "c", "bb", "bc", "cc", "cb".... }
D =
q0 q1
+------------------------
a | q0 q0
b | q1 q1
c | q1 q1


G =
q0 q1
+------------------------
q0 | print() print()
q0 | append() append()
q0 | append() append()


=> Class-3, RegEx?

-sip