|
From: Tegiri Nenashi on 8 Apr 2008 13:44 String tokenizer is the simplest parser of all. In java, arguably, it is much more frequently used than regular expressions. Yet I fail to see any parsing theory book ever mentioning it. In my practical experience it is easer write a scanner on string tokenizer foundation rather than to use off the shellf reg exp engine. I'm interested however what is the language theory perspective onto this phenomenon. Formally, a set of terminals is partitioned into a separator or a set of separators, and the rest of terminals. Then, string tokenizer translates a given word into a set (or list) of words. Here we have the first technical difficulty, what exactly this translation is? Is it a mapping from Monoid to Idempotent Semiring? Certainly not if we insist on the result being a list of words, and not the set. Then if we try exrtend this mapping naturally to the domain of sets, then we have to deal with set of sets (or set of lists) on the range side of the mapping? Consider an example, an alphabet of terminals {a,b,c} with the "a" being a separator. Can you suggest a grammar that be able to tokenize the word "babcac" into {b,bc,c}? Sure something like s >= a t >= b t >= c w >= wt w >= 1 u >= ws v >= 1 v >= uv would work, but does this 8(!) rule grammar present any new insight to what string tokenizor is? Besides, assuming this grammar produces an unambigious parse tree, there is still an extra step of extracting the set of words from the tree.
From: Mitch on 11 Apr 2008 10:17 On Apr 8, 1:44 pm, Tegiri Nenashi <TegiriNena...(a)gmail.com> wrote: > String tokenizer is the simplest parser of all. In java, arguably, it > is much more frequently used than regular expressions. Yet I fail to > see any parsing theory book ever mentioning it. 'String tokenizer', used by software engineering people, is a synonym for 'lexical analyzer', which is the term tending to be used by the parsing/theory crowd. Just a difference in language use. > In my practical experience it is easer write a scanner on string > tokenizer foundation rather than to use off the shellf reg exp engine. A string tokenizer (or lexical analyzer) may use regular expressions to define the tokens, the off-the-shelf tool you want to use is something like lex, flex, javacc. > I'm interested however what is the language theory perspective onto > this phenomenon. > > Formally, a set of terminals is partitioned into a separator or a set > of separators, and the rest of terminals. Then, string tokenizer > translates a given word into a set (or list) of words. If all you're doing is managing delimiters (as opposed to keywords and floating point), then that's modeled by a pretty simple regular expression and so I can see that it'd be much easier to write one yourself than use an off-the-shelf package. (but doesn't even C's 'strtok' take care of that? > Here we have the first technical difficulty, what exactly this > translation is? Is it a mapping from Monoid to Idempotent Semiring? The map is from monoid to monoid; the implementation of the map uses a semiring (the set described using a regexp). .... > does this 8(!) rule grammar present any new insight to > what string tokenizor is? I don't think so. > Besides, assuming this grammar produces an unambigious parse tree, > there is still an extra step of extracting the set of words from the > tree. A regular language (described by a reg exp) always has an unambiguous parse tree, and that tree is really an unnested list. As to extra step, that is just a matter of how you implement your parsing. You could extract each token as you go along, rather than producing a tree then scanning it a second time to produce the separate tokens. I think the idea of tokenizing via regular expressions is used because it is so easy to do that part of the parsing on-line, as each character comes in (no backtracking, you don't have to complete the token parsing before you know you have the correct token. (of course this doesn't apply to extended regular expressions with variables). Mitch
From: Hans-Peter Diettrich on 11 Apr 2008 11:15 Tegiri Nenashi wrote: > String tokenizer is the simplest parser of all. In java, arguably, it > is much more frequently used than regular expressions. Yet I fail to > see any parsing theory book ever mentioning it. Just a shot into the dark: Tokenizers typically are described as automatons, which are not bound to the restrictions of grammars [questionable, I know ;-]. With regards to the Chomsky hierarchy, a tokenizer is not bound or restricted to some specific Chomsky class. The most important difference IMO is the goal, where an automaton is allowed to stop *before* reaching the end of some input, whereas a grammar requires that only the *entire* input can be a valid element of the language. OTOH an automaton is allowed to never stop, what at least is undesireable for an tokenizer, and for an parser as well. Of course one can argue that the language, recognized by an tokenizer/lexer, simply consists of a concatenation of valid tokens. Even then I'd prefer to distinguish between "words" and "sentences" of a language, where the tokenizer recognizes the words, and a parser recognizes sentences, consisting of words. Kind of a multi-level grammar, whose levels can be analyzed separately. At least I found considerable problems in the transformation of languages, which were *designed* for distinct lexer and parser "grammars", into a single grammar for an scannerless parser. I leave it to somebody else, to find a more "mathematical" definition for my thoughts, so that my "obvious" differences can be discussed in a more formal way ;-) > Consider an example, an alphabet of terminals {a,b,c} with the "a" > being a separator. Can you suggest a grammar that be able to tokenize > the word "babcac" into {b,bc,c}? [...] goal ::= [separator] { word [separator] } EOF. separator ::= "a". word ::= { "b" | "c" }. > would work, but does this 8(!) rule grammar present any new insight to > what string tokenizor is? IMO that's a matter of the grammar notation. > Besides, assuming this grammar produces an > unambigious parse tree, there is still an extra step of extracting the > set of words from the tree. That's why a lexer does not normally produce an parse tree, but a stream of tokens. The lexer can be considered as a filter, sitting in a pipeline in between the input and the parser. In the practice, of e.g. an C parser, more filters sit in the tool chain, implementing e.g. the preprocessor with macro expansion and include file handling. But again this only is an argument for the separation of an parser into distinct parts, for easy modelling and implementation, not a formal discussion of the underlying math or algebra. DoDi
From: sipayi on 15 Apr 2008 08:31 On Apr 11, 11:15 am, Hans-Peter Diettrich <DrDiettri...(a)aol.com> wrote: > Tegiri Nenashi wrote: > > String tokenizer is the simplest parser of all. In java, arguably, it > > is much more frequently used than regular expressions. Yet I fail to > > see any parsing theory book ever mentioning it. Isn't this a case of simple Mealy machine? Q = {q0, q1} Q0 = q0 S = {1, b, c} O = {"b", "c", "bb", "bc", "cc", "cb".... } D = q0 q1 +------------------------ a | q0 q0 b | q1 q1 c | q1 q1 G = q0 q1 +------------------------ q0 | print() print() q0 | append() append() q0 | append() append() => Class-3, RegEx? -sip
|
Pages: 1 Prev: jobs Next: How to determine whether an expression equals to zero |