From: David Crow on
Have you tried AfxExtractSubString()?
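
Something along these lines, perhaps (an untested sketch; note that
AfxExtractSubString() splits on the bare separator, so it will not
respect quoted fields that contain embedded commas):

   CStdioFile file(_T("data.csv"),  // hypothetical file name
                   CFile::modeRead | CFile::typeText);
   CString line;
   while (file.ReadString(line))
   {
      CString field;
      for (int i = 0; AfxExtractSubString(field, line, i, _T(',')); i++)
      {
         // use each comma-separated field here
      }
   }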

"Stanza" <stanza(a)devnull.com> wrote in message
news:-OqdnbEEGM-UF8XWnZ2dnUVZ8r-dnZ2d(a)brightview.com...
> What is the easiest way of reading a line at a time through a textual CSV
> file, and then extracting the comma-separated elements from each line?
>


From: Hector Santos on
Tom Serface wrote:

> One thing most parsers don't handle correctly, that I've seen, is
> double double quotes for strings if you want to have a quote as part of
> the string like:
>
> "This is my string "Tom" that I am using", "Next token", "Next token"
>
> In the above, from my perspective, the parser should read the entire
> first string since we didn't come to a delimiter yet, but a lot of
> tokenizers choke on this sort of thing.


Often, it takes two to tango. A writer needs to escape tokens in
order to reach some level of sanity, e.g., borrowing the C backslash
for \":

"This is my string \"Tom\" that I am using"

Or use some encoding method, e.g., HTTP escaping! :)

The above is simple if you are just delimiting by comma, but then
watching for an embedded comma is required. For example:

"This is my string "Tom, Hector" that I am using"

That can be easily handled if the design assumption is each field is
double quoted. The first token:

"This is my string "Tom,

does not end in a double quote, so you continue by concatenating the
next token,

Hector" that I am using"

to complete the first field.
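
In code, the concatenation idea might look something like this (my
sketch, untested; it assumes every field is double quoted, and leaves
out whitespace trimming and error recovery):

   #include <string>
   #include <sstream>
   #include <vector>

   // Split one CSV line on commas, gluing pieces back together while
   // the accumulated field still has an unmatched opening quote.
   std::vector<std::string> SplitQuotedCsv(const std::string &line)
   {
      std::vector<std::string> fields;
      std::istringstream in(line);
      std::string piece, field;
      while (std::getline(in, piece, ',')) {
         field = field.empty() ? piece : field + "," + piece;
         // a complete field starts and ends with a double quote
         if (field.size() >= 2 && field[0] == '"' &&
             field[field.size() - 1] == '"') {
            fields.push_back(field);
            field.clear();
         }
      }
      if (!field.empty())
         fields.push_back(field); // unterminated quote: a syntax error
      return fields;
   }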

But overall, I've found that unless it's really simple, it helps if
you have the field type definitions known beforehand.


--
HLS
From: Joseph M. Newcomer on
See below...
On Thu, 21 Jan 2010 22:52:07 -0500, Hector Santos <sant9442(a)nospam.gmail.com> wrote:

>Stanza wrote:
>
>> What is the easiest way of reading a line at a time through a textual
>> CSV file, and then extracting the comma-separated elements from each line?
>
>"Easiest" depends on what language and framework you are using and how
>you hold, store, process the data in memory.
>
>Assuming C language, the traditional implementation is to use
>strtok(). Here is a simple C/C++ example:
>
>// File: d:\wc5beta\testtok.cpp
>
>// compile with: cl testtok.cpp
>
>#include <stdio.h>
>#include <afx.h>
>
>int main(char argc, char *argv[])
***
This should be _tmain, the first argument is int, and the second argument is _TCHAR *
argv[].
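That is:

   int _tmain(int argc, _TCHAR* argv[])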
****
>{
>   //
>   // get file name from command line
>   //
>
>   char *pfn = (argc>1)?argv[1]:NULL;
****
char is so yesterday. It should not be used to teach anything any longer. TCHAR,
LPCTSTR, LPTSTR are appropriate, or for purists, WCHAR, LPWSTR, LPCWSTR. In addition,
string parsing in terms of character arrays is so obsolete; CString or std::string should
be used for any examples.
****
>
>   if (!pfn) {
>      printf("- syntax: testtok csv_filename\n");
>      return 1;
>   }
>
>   //
>   // open text file for reading
>   //
>
>   FILE *fv = fopen(pfn,"rt");
>   if (!fv) {
>      printf("ERROR %d Opening file\n",GetLastError());
>      return 1;
>   }
>
>   //
>   // read each line using fgets() and parse
>   // the "," and cr/lf (\r\n) token characters.
>   //
>
>   char *tok = ",\r\n";
>
>   int nLine = 0;
>   char szLine[1024];
****
INSTANTLY, we see completely obsolete, dangerous, teaching-away-from-best-practice code
here. NEVER allocate a fixed buffer on the stack.
****
>   memset(&szLine,sizeof(szLine),0);
****
This is totally useless. Since the buffer is about to be overwritten with input, zeroing
it is silly. (Note also that the arguments are reversed: memset() takes (dest, value,
count), so as written this sets zero bytes anyway.)
****
>   while (fgets(szLine,sizeof(szLine)-1,fv)) {
****
sizeof() is bad teaching. _countof() would be appropriate, but at the VERY least, the
correct code would be in terms of
(sizeof(szLine)/sizeof(TCHAR)) - 1
This code looks like something from the first edition of K&R.
****
>      nLine++;
>      printf("# %d | %s",nLine, szLine);
****
_tprintf(_T("# %5d | %s\n"), nLine, szLine);

Unicode-aware, keeps columns aligned, has a newline at the end.
****
>
>      //
>      // parse the line by the tok characters
>      //
>      char *fld = strtok(szLine, tok);
****
strtok is bad practice. strtok_s, or _tcstok_s, is a better choice, because these have a
separate context that can be maintained, allowing several ...tok calls to be applied at
the same time (for example, subscanning a number looking for a decimal point). The old
strtok was what could best be called a "childish" design, with a single, implicit,
internal static context pointer. I would actively teach against ever using strtok (or
even _tcstok) in any program today. If you want to have locale-specific parsing, you may
even want to look at _tcstok_s_l, which allows a locale specification.
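
For instance, the strtok() loop below might become (a quick sketch,
untested):

   TCHAR *context = NULL;
   TCHAR *fld = _tcstok_s(szLine, _T(",\r\n"), &context);
   while (fld != NULL)
   {
      _tprintf(_T("- [%s]\n"), fld);
      fld = _tcstok_s(NULL, _T(",\r\n"), &context);
   }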
****
>      while(fld) {
>         printf("- [%s]\n",fld);
>         fld = strtok(NULL, tok);
>      }
>   }
>
>   fclose(fv);
>   return 0;
>}
>
>So for example, given a testdata.csv file containing these lines:
>
>hector santos,email1(a)whatever.com
>stanza,email2(a)whatever2.com
>Joe Newcomer,email3(a)whatever3.com
>
>compiling and running testtok testdata.csv, you get:
>
># 1 | hector santos,email1(a)whatever.com
>- [hector santos]
>- [email1(a)whatever.com]
># 2 | stanza,email2(a)whatever2.com
>- [stanza]
>- [email2(a)whatever2.com]
># 3 | Joe Newcomer,email3(a)whatever3.com
>- [Joe Newcomer]
>- [email3(a)whatever3.com]
>
>This is very simplistic and doesn't address many design issues in
>regard to parsing CSV-based files.
>
>The #1 design issue is the idea of "escaping" the token character you
>are using to separate fields, in this case the comma (','), because it
>is possible to have the comma within the field strings. That depends on
>the type and data specifications. Maybe your program doesn't expect
>them, and maybe the creator of the file will never ADD them and/or
>escapes them. All this is implementation based.
>
>For example, the data file can have a 3rd field that is a
>description-like field, OR the name field can have commas in it, thus
>introducing the idea that escaping is required. i.e., the data file
>can look like this:
>
>hector santos,email1(a)whatever.com,whatever,whatever,whatever
>stanza,email2(a)whatever2.com,"whatever,whatever,whatever"
>Joe Newcomer,email3(a)whatever3.com
>Serlace, tom,email4(a)whatever4.com
>
>So you can roll up your sleeves and use the above simple C/C++ code as
>a basis to fine-tune the reading requirements for your CSV by adding
>token-escaping concepts, or you can use the 3rd party libraries and
>functions available to do these things, with the requirement that
>these 3rd party libraries and functions have the feature of escaping
>tokens.
>
>Now, I purposely created the testdata.csv above with what would
>normally be considered bad formatting that doesn't promote or help
>good CSV reading. A good practice is to surround the fields with
>double quotes, and that MAY be enough for escaping embedded commas.
>For example, the first line has a 3rd field:
>
> whatever,whatever,whatever
>
>well, if you parse only by comma, the field results in just
>"whatever". So what is normally done is to use lines like the 2nd
>line, where the 3rd field is quoted:
>
> "whatever,whatever,whatever"
>
>The same issue arises with the 4th line, where the first "expected"
>field has:
>
> Serlace, tom,
>
>and this causes your fields to be shifted and misaligned.
>
>There are other concepts to deal with, namely how you are reading
>into memory storage, if that is needed, or whether you are processing
>each line and forgetting about it.
>
>So writing a robust CSV reader that takes into account things such as:
>
> - escaping and embedded tokens
> - reading into memory
>
>are common design requirements here. It really isn't that hard. I
>would encourage you to learn and gain the rewarding experience of
>programming this yourself. It covers ideas that will be common
>throughout a programmer's life. I will say that sometimes it pays to
>do just a byte-stream parser instead of using strtok(), checking each
>possible token and delimiter, double-quoted strings, etc. For example,
>instead of the strtok block of lines, you can use something like:
>
>   char *p = szLine;
>   while (*p) {
>      switch(*p) {
>      case '\r':
>         ... add logic for this ...
>         break;
>      case '\n':
>         ... add logic for this ...
>         break;
>      case '\"':
>         ... add logic for this ...
>         break;
>      case ',':
>         ... add logic for this ...
>         break;
>      }
>      p++;
>   }
>
>It can be simple or complex depending on the CSV reading requirements.
****
This is an overly-simplified example of the Finite State Machine recognizer pattern. For
example, you can do something like

typedef enum {S0, Sign, Digit, Decimal, Fraction} States;

States state = S0; // the initial FSM state is always called S0 for historical reasons
int sign = 1;
LPCTSTR token;

while(*p != _T('\0'))
{
   switch(state)
   {
   case S0:
      switch(*p)
      {
      case _T(' '):
      case _T('\t'):
      case _T('\r'):
         p++;
         continue;
      case _T('\n'):
         ... handling here depends on what you have as input
         ... if it is guaranteed to be a single line, this is just like
         ... \r; otherwise, you terminate the parse and set up so
         ... the next parse starts the next line in state S0
         ... return, continue, here, as appropriate
      case _T('+'):
      case _T('-'):
         state = Sign;
         sign = (*p == _T('-')) ? -1 : 1;
         token = p;
         p++;
         continue;
      case _T('0'):
      ...
      case _T('9'):
         state = Digit;
         token = p;
         p++;
         continue;
      case _T('.'): // note: localize this test!
         state = Decimal;
         token = p;
         p++;
         continue;
      default:
         // report error
         return FALSE; // or whatever your error recovery is
      }
   case Sign:
      switch(*p)
      {
      case _T('0'):
      ...
      case _T('9'):
         state = Digit;
         p++;
         continue;
      case _T('.'): // note: localize this test!
         state = Decimal;
         p++;
         continue;
      case _T(' '):
      case _T('\t'):
      ... other whitespace cases
         p++;
         continue;
      case _T(','):
         handle token just parsed
         ... return, continue, etc. as appropriate
      default:
         // error, + or - not followed by digit or decimal pt
      }


(I have to leave for a concert at this point, leave the rest as An Exercise For The
Reader)

Overall, I find that parsers that are based on simplistic models that simply look for a
delimiter and assume that everything between the delimiters is syntactically correct are
naive, and certainly not robust enough for real programs. Generalizations include
extending this to recognize strings, quoted strings (allowing embedded commas inside the
quotes), etc. To me, correctness is essential.
joe
****

>
>Anyway, if you just wish to get a solution, you can use one of the
>many 3rd party libraries and classes that will do these things for
>you.
>
>If you are using another language, the same ideas apply, but some
>languages already have a good library, like .NET perhaps. It has an
>excellent text I/O reader class in its collections library; see
>OpenTextFieldParser(). It supports CSV reading and covers the two
>important ideas above for escaping and storage.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
From: David Wilkinson on
Tom Serface wrote:
> One thing most parsers don't handle correctly, that I've seen, is
> double double quotes for strings if you want to have a quote as part of
> the string like:
>
> "This is my string "Tom" that I am using", "Next token", "Next token"
>
> In the above, from my perspective, the parser should read the entire
> first string since we didn't come to a delimiter yet, but a lot of
> tokenizers choke on this sort of thing.

Another thing is tolerating files that have \n or \r line endings rather than \r\n.
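
For example, a read-line helper along these lines copes with all three
(my sketch, untested):

   #include <cstdio>
   #include <string>

   // Read one "line" from fv, treating \r\n, bare \n, or bare \r as
   // the terminator. Returns false at end of file.
   bool ReadAnyLine(FILE *fv, std::string &line)
   {
      line.clear();
      int ch;
      while ((ch = fgetc(fv)) != EOF) {
         if (ch == '\n')
            return true;                // \n, or the tail of \r\n
         if (ch == '\r') {
            int next = fgetc(fv);
            if (next != '\n' && next != EOF)
               ungetc(next, fv);        // bare \r: push back what follows
            return true;
         }
         line += static_cast<char>(ch);
      }
      return !line.empty();             // last line had no terminator
   }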

--
David Wilkinson
Visual C++ MVP
From: Joseph M. Newcomer on
Many of these issues depend on what you consider valid syntax.

For example, one possible implementation is to consider a line that does not contain a
matching quote to be syntactically incorrect, and be rejected as bad data, with an error
message indicating a fundamental failure of the data format.

Escape notations have interesting variations. For example, some languages (like XML) accept
either single quote delimiters or double quote delimiters, and you use the opposite of the
one you want:

"He said 'This is really bad' quite loudly"

'He said "This is really bad" quite loudly'

but this doesn't work in a case like

'He shouted "I can't do this!" quite loudly'

It may surprise people to realize that the escape convention of \" is comparatively rare,
limited mostly to C and its descendants. A far more popular language, SQL, requires you to
double the quotes. [I once offered expert testimony in a legal case where company A said
company B stole their code, and as evidence showed that the allegedly stolen code had a
subroutine to double quote marks. I showed that the two algorithms were quite different,
producing different results for the same input (an issue of interpretation of the input
syntax: were existing double quotes as the first and last character doubled again, or
eliminated? The "stolen" code dropped them). Also, the person who was the opposing
"expert" claimed that there was no interface to any other code, but the subroutine was
mandated by the fact that SQL, which is the other code both applications interfaced to,
DEMANDS that quotes be doubled, and consequently ANY code that talked to SQL would have
to have a double-the-quotes subroutine.]
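
The doubling itself is a trivial subroutine; a sketch (mine, not the
code from either party in that case):

   #include <string>

   // Wrap s in single quotes, doubling any embedded single quote,
   // as SQL string literals require.
   std::string SqlQuote(const std::string &s)
   {
      std::string out = "'";
      for (size_t i = 0; i < s.size(); ++i) {
         out += s[i];
         if (s[i] == '\'')
            out += '\'';   // '' stands for a literal '
      }
      out += '\'';
      return out;
   }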

Similarly, there is an issue of delimiters. For example, if there is no escape
convention, you can use something like the C/C++ string concatenation:

'He shouted "I can' "'t do this!" '" quite loudly", 12345

Depending on what font you have, it may be hard to tell where I used double-single and
single-double (in Arial, they are really hard to tell apart), but you can implement a rule
that a comma separator is required to delimit sequences of strings, where it is always
legal to have a quoted string separated by 0 or more non-end-of-line whitespace characters
from another quoted string, and these are "compile-time concatenated".

I recently did a project (my PowerPoint Indexer) where I decided that a doubled comma
would stand for a literal comma. So if you wrote

item1, item2, item3

this was a sequence of three items,

item1
item2
item3

but if you wrote

item1,, item2, item3

this was treated as a sequence of two items:

item1, item2
item3

or you could write

item1, item2,, item3

which became two items

item1
item2, item3
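
That rule is also easy to implement; a rough sketch (mine, not the
Indexer's actual code; whitespace trimming is left out):

   #include <string>
   #include <vector>

   // Split on single commas; a doubled comma stands for a literal
   // comma inside an item.
   std::vector<std::string> SplitDoubledComma(const std::string &s)
   {
      std::vector<std::string> items;
      std::string item;
      for (size_t i = 0; i < s.size(); ++i) {
         if (s[i] == ',') {
            if (i + 1 < s.size() && s[i + 1] == ',') {
               item += ',';             // ",," escapes a literal comma
               ++i;
            } else {
               items.push_back(item);   // single comma separates items
               item.clear();
            }
         } else {
            item += s[i];
         }
      }
      items.push_back(item);
      return items;
   }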

So it is important to decide what you mean when you define the syntax.
joe

On Fri, 22 Jan 2010 15:37:30 -0500, Hector Santos <sant9442(a)nospam.gmail.com> wrote:

>Tom Serface wrote:
>
>> One thing most parsers don't handle correctly, that I've seen, is
>> double double quotes for strings if you want to have a quote as part of
>> the string like:
>>
>> "This is my string "Tom" that I am using", "Next token", "Next token"
>>
>> In the above, from my perspective, the parser should read the entire
>> first string since we didn't come to a delimiter yet, but a lot of
>> tokenizers choke on this sort of thing.
>
>
>Often, it takes two to tango. A writer needs to escape tokens in
>order to reach some level of sanity, e.g., borrowing the C backslash
>for \":
>
> "This is my string \"Tom\" that I am using"
>
>Or use some encoding method, e.g., HTTP escaping! :)
>
>The above is simple if you are just delimiting by comma, but then
>watching for an embedded comma is required. For example:
>
> "This is my string "Tom, Hector" that I am using"
>
>That can be easily handled if the design assumption is each field is
>double quoted. The first token:
>
> "This is my string "Tom,
>
>does not end in a double quote, so you continue by concatenating the
>next token,
>
> Hector" that I am using"
>
>to complete the first field.
>
>But overall, I've found that unless it's really simple, it helps if
>you have the field type definitions known beforehand.
Joseph M. Newcomer [MVP]
email: newcomer(a)flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm