Parsing CSV files [MFC]

From: Goran on 22 Jan 2010 08:10

On Jan 22, 12:04 pm, Hector Santos <sant9...(a)nospam.gmail.com> wrote:

In the interest of honesty: I, too, wrote CSV parsers (or participated
in writing them) in my time.

My bias is clear, however: was that my, or my employer's, time well
spent? I don't think so.

> Goran wrote:
> > Effort in learning __that__ certainly beats effort of rolling your
> > own. Of course, that's provided that fits your use-case and that there
> > is a similar library. But that's done by Googling, newsgrouping and
> > reading.
>
> Still a learning curve for most.

It's still better to learn to call some new functions, than do the
silly grunt work of writing e.g. strtok-s. (And that, for something as
common as CSV parsing.)

> You know the old saying "Teach a man
> how to fish...." moral.

What, as in "if I write a CSV parser, I learned something"? If you are
learning and want to exercise on CSV, OK (I said that in my first
post). But otherwise, there's more to learn by looking at an existing
CSV parser than in rolling your own. Especially if one finds something
comprehensive. And even if one finds/corrects bugs in it.

Goran.

From: Hector Santos on 22 Jan 2010 11:14

Goran wrote:

> On Jan 22, 12:04 pm, Hector Santos <sant9...(a)nospam.gmail.com> wrote:
>
> In the interest of honesty: I, too, wrote CSV parsers (or participated
> in writing them) in my time.
>
> My bias is clear, however: was that my, or my employer's, time well
> spent? I don't think so.

It depends, Goran. I'd rather have someone be able to think for himself
and solve problems without having to depend too much on 3rd party
solutions. Generally, when a library is used as a "tool", like a hammer
or screwdriver, it is because you already know how to use the hammer or
screwdriver. IOW, if you know what you are doing, then go ahead and
get that library, rather than getting the library because you (speaking
in general) lack an understanding of the problem and how to solve it.
Then it becomes a crutch.

I personally believe the IDE and evolution of component (modular)
engineering has placated sound engineering thinking. The technology
was meant not only to increase productivity, but to merge disciplines
and lower the cost of expertise.

>> Still a learning curve for most.
>
> It's still better to learn to call some new functions, than do the
> silly grunt work of writing e.g. strtok-s. (And that, for something as
> common as CSV parsing.)

Yes, when you already know what you are doing. That's the mark of a good
programmer with insight into problem solving.

>> You know the old saying "Teach a man how to fish...." moral.
>
> What, as in "if I write CSV parser, I learned something?"

Sure, if you never did it before.

> If you are learning and want to exercise on CSV, OK (I said that
> in my first post). But otherwise, there's more to learn by looking
> at an existing CSV parser than in rolling your own.

We have to respectfully agree to disagree. :) I sincerely doubt most
people will understand how to use a library if they don't have a
fundamental understanding of what to look for and how to use it.

> Especially if one finds something
> comprehensive. And even if one finds/corrects bugs in it.

Fix-after-the-fact programming. Love it! :) I have fired a well known
developer for that mindset. Whatever happened to the QA engineering mantra?

"Getting it right... the first time!"

--
HLS

From: Tom Serface on 22 Jan 2010 13:40

That's one of the things that MFC really has going for it. There is a lot
of code available and you typically get source with it so, even if there is
some learning curve, you still get a jump start on getting your job done
even if you just see how it's done in the sample code.

Tom

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:#A$vtG0mKHA.5464(a)TK2MSFTNGP02.phx.gbl...
> Goran,
>
> Many times even with 3rd party libraries, you still have to learn how to
> use them. Many times, the attempt to generalize does not cover all bases.
> What if there is a bug? Many times with CSV, it might require upfront
> field definitions or it's all viewed as strings. So the "easiest" does not
> always mean using a 3rd party solution.
>
> Of course the devil is in the details and it helps when the OP provides
> info, like what language and platform. If he said .NET, as I mentioned, the
> MS .NET collection library has a pretty darn good reader class with the
> benefit of supporting OOP as well, which allows you to create a data
> "class" that you pass to the line reader.
>
> Guess what? There is still a learning curve here to understand the
> interface and use it right, as there would be with any library.
>
> So the easiest? For me, it all depends - a simple text reader and
> strtok() parser, with the escaping issues worked in, can be both very easy
> and super fast, with no dependency on 3rd party QA issues.
>
> For me, I have never come across a library or class that could handle
> everything, and if it did, it required a data definition interface of some
> sort - like the .NET collection class offers. If he is using .NET, then I
> recommend using this class as the "easiest."
>
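
A side note on the strtok() route mentioned in the quoted post: strtok()
treats a run of delimiters as a single delimiter, so empty CSV fields
silently vanish. Below is a minimal C++ sketch of the difference. The names
split_strtok and split_keep_empty are hypothetical helpers, not something
from this thread, and neither one handles quoting or escaping.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Split with std::strtok. Consecutive commas count as ONE delimiter,
// so an empty field between them is dropped.
std::vector<std::string> split_strtok(std::string line) {
    std::vector<std::string> out;
    // strtok writes '\0' into its input, so we work on the by-value copy.
    for (char* tok = std::strtok(&line[0], ","); tok != nullptr;
         tok = std::strtok(nullptr, ",")) {
        out.emplace_back(tok);
    }
    return out;
}

// Split on every comma and keep empty fields. Still no quote handling.
std::vector<std::string> split_keep_empty(const std::string& line) {
    std::vector<std::string> out;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = line.find(',', start);
        if (pos == std::string::npos) {
            out.push_back(line.substr(start));  // last (or only) field
            break;
        }
        out.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return out;
}
```

For the input a,,c the strtok() version returns two tokens, while the manual
split returns three with an empty middle field. Whether that matters depends
on the data, but it is one of the "escaping issues" to settle up front.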

From: Tom Serface on 22 Jan 2010 13:42

One thing most parsers that I've seen don't handle correctly is double
double quotes for strings if you want to have a quote as part of the string
like:

"This is my string "Tom" that I am using", "Next token", "Next token"

In the above, from my perspective, the parser should read the entire first
string since we didn't come to a delimiter yet, but a lot of tokenizers
choke on this sort of thing.

Tom

"Hector Santos" <sant9442(a)nospam.gmail.com> wrote in message
news:eeMYgc0mKHA.5464(a)TK2MSFTNGP02.phx.gbl...
> Case in point.
>
> Even with the excellent .NET text I/O class and a CSV reader wrapper, it
> only offers a generalized method to parse fields. This still requires
> proper setup for the conditions that might occur. It might require specific
> additional logic to handle situations it does not cover, like when
> fields span multiple lines. For example:
>
> 1,2,3,4,5,"hector
> , santos",6
> 7,8
> 9,10
>
> That might be 1 data record with 10 fields.
>
> However, even if the library allows you to do this, in my opinion, only an
> experienced implementer knows what to look for and how to use
> the library to properly address it.
>
> Here is a VB.NET test program I wrote a few years back for a VERY long
> thread on this topic, showing how to handle the situation for a fella
> who needed fields spanning multiple rows.
>
> ------------- CUT HERE -------------------
> '--------------------------------------------------------------
> ' File : D:\Local\wcsdk\wcserver\dotnet\Sandbox\readcsf4.vb
> ' About:
> '--------------------------------------------------------------
> Option Strict Off
> Option Explicit On
>
> Imports System
> Imports System.Diagnostics
> Imports System.Console
> Imports System.Reflection
> Imports System.Collections.Generic
> Imports System.Text
>
> Module module1
>
>     '
>     ' Dump an object
>     '
>
>     Sub dumpObject(ByVal o As Object)
>         Dim t As Type = o.GetType()
>         WriteLine("Type: {0} Fields: {1}", t, t.GetFields().Length)
>         For Each s As FieldInfo In t.GetFields()
>             Dim ft As Type = s.FieldType()
>             WriteLine("- {0,-10} {1,-15} => {2}", s.Name, ft, s.GetValue(o))
>         Next
>     End Sub
>
>     '
>     ' Data definition "TRecord" class, for this example
>     ' 9 fields are expected per data record.
>     '
>
>     Public Class TRecord
>         Public f1 As String
>         Public f2 As String
>         Public f3 As String
>         Public f4 As String
>         Public f5 As String
>         Public f6 As String
>         Public f7 As String
>         Public f8 As String
>         Public f9 As String
>
>         ' Copy the parsed strings into the public fields, in order
>         Public Sub Convert(ByRef flds As List(Of String))
>             Dim fi As FieldInfo() = Me.GetType().GetFields()
>             Dim i As Integer = 0
>             For Each s As FieldInfo In fi
>                 Dim tt As Type = s.FieldType()
>                 If (i < flds.Count) Then
>                     If TypeOf (s.GetValue(Me)) Is Integer Then
>                         s.SetValue(Me, CInt(flds.Item(i)))
>                     Else
>                         s.SetValue(Me, flds.Item(i))
>                     End If
>                 End If
>                 i += 1
>             Next
>         End Sub
>
>         Public Sub New()
>         End Sub
>
>         Public Sub New(ByVal flds As List(Of String))
>             Convert(flds)
>         End Sub
>
>         ' These conversions are what make "tr = flds" work below
>         ' (implicit narrowing needs Option Strict Off)
>         Public Shared Narrowing Operator CType( _
>                 ByVal flds As List(Of String)) As TRecord
>             Return New TRecord(flds)
>         End Operator
>
>         Public Shared Narrowing Operator CType( _
>                 ByVal flds As String()) As TRecord
>             Dim sl As New List(Of String)
>             For i As Integer = 1 To flds.Length
>                 sl.Add(flds(i - 1))
>             Next
>             Return New TRecord(sl)
>         End Operator
>     End Class
>
>     Public Class ReaderCVS
>
>         Public Shared data As New List(Of TRecord)
>
>         '
>         ' Read csv file with max_fields, optional eolfilter
>         '
>         Public Function ReadCSV( _
>                 ByVal fn As String, _
>                 Optional ByVal max_fields As Integer = 0, _
>                 Optional ByVal eolfilter As Boolean = True) As Boolean
>             Try
>                 Dim tr As New TRecord
>                 ' Default to the TRecord field count only when the caller
>                 ' did not pass a value (it used to be always overwritten)
>                 If max_fields <= 0 Then
>                     max_fields = tr.GetType().GetFields().Length()
>                 End If
>                 data.Clear()
>
>                 Dim rdr As FileIO.TextFieldParser
>                 rdr = My.Computer.FileSystem.OpenTextFieldParser(fn)
>                 rdr.SetDelimiters(",")
>                 Dim flds As New List(Of String)
>                 ' Keep collecting fields across lines until a full
>                 ' record of max_fields has been gathered
>                 While Not rdr.EndOfData()
>                     Dim lines As String() = rdr.ReadFields()
>                     For Each fld As String In lines
>                         If eolfilter Then
>                             fld = fld.Replace(vbCr, " ").Replace(vbLf,"")
>                         End If
>                         flds.Add(fld)
>                         If flds.Count = max_fields Then
>                             tr = flds
>                             data.Add(tr)
>                             flds = New List(Of String)
>                         End If
>                     Next
>                 End While
>                 ' Keep a trailing, incomplete record as well
>                 If flds.Count > 0 Then
>                     tr = flds
>                     data.Add(tr)
>                 End If
>                 rdr.Close()
>                 Return True
>
>             Catch ex As Exception
>                 WriteLine(ex.Message)
>                 WriteLine(ex.StackTrace)
>                 Return False
>             End Try
>         End Function
>
>         Public Sub Dump()
>             WriteLine("------- DUMP ")
>             Debug.WriteLine("Dump")
>             For i As Integer = 1 To data.Count
>                 dumpObject(data(i - 1))
>             Next
>         End Sub
>
>     End Class
>
>     Sub main(ByVal args() As String)
>         Dim csv As New ReaderCVS
>         csv.ReadCSV("test1.csf")
>         csv.Dump()
>     End Sub
>
> End Module
> ------------- CUT HERE -------------------
>
> Mind you, the above was written 2 years ago while I was still learning the
> .NET library and was participating in support questions to teach myself how
> to do common things in the .NET environment.
>
> Is the above simple for most beginners? I wouldn't say so, but then
> again, I tend to be a "tools" writer and try to generalize a tool, which is
> why I spent the time to implement a data class and an object dump
> function to debug it all. Not everyone needs this. Most of the time, the
> field types are known so a reduction can be done, or better yet, you can
> take the above, have it read the first line as the field definition line
> and generalize the TRecord class to make it all dynamic.
>
> --
> HLS
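
Since the thread is about MFC, here is a rough C++ counterpart to the idea
in the quoted VB.NET program: ignore physical line breaks as record
boundaries and regroup fields into records of a fixed size. It is only a
sketch under assumptions. Record and read_records are hypothetical names, a
doubled quote inside a quoted field is taken as one literal quote, and
fields_per_record == 0 means "put everything into one record".

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Record;

// Parse CSV text and regroup its fields into fixed-size records.
// Newlines inside quoted fields are kept as part of the field.
std::vector<Record> read_records(const std::string& text,
                                 std::size_t fields_per_record) {
    std::vector<Record> records;
    Record rec;
    std::string field;
    bool in_quotes = false;
    bool line_has_data = false;  // avoids a bogus field on empty lines

    // Close the current field, and the record too once it is full.
    auto end_field = [&]() {
        rec.push_back(field);
        field.clear();
        if (fields_per_record != 0 && rec.size() == fields_per_record) {
            records.push_back(rec);
            rec.clear();
        }
    };

    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (in_quotes) {
            if (c != '"') {
                field += c;                  // includes CR/LF in quotes
            } else if (i + 1 < text.size() && text[i + 1] == '"') {
                field += '"';                // doubled quote -> literal
                ++i;
            } else {
                in_quotes = false;           // closing quote
            }
        } else if (c == '\n') {
            if (line_has_data) end_field();  // line end closes a field
            line_has_data = false;
        } else if (c == '\r') {
            // ignore CR outside quotes
        } else {
            line_has_data = true;
            if (c == '"') in_quotes = true;
            else if (c == ',') end_field();
            else field += c;
        }
    }
    if (line_has_data) end_field();          // last line without newline
    if (!rec.empty()) records.push_back(rec);  // trailing partial record
    return records;
}
```

Run over the four-line sample quoted above, this parser sees 11 fields in
total, so fields_per_record = 11 gives a single record, while 9 (the size of
TRecord) gives one full record plus a partial one with the last two fields.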

From: Tom Serface on 22 Jan 2010 13:44

I'd say, "it depends". For example, I have a program where I have
specialized parsing needs and the program needs to be really small and not
include any external code. I wrote my own specialized parser and it was a
good use of time imo. I've found that most of the parsers that are "in the
box" libraries are very limited in scope.

Tom

"Goran" <goran.pusic(a)gmail.com> wrote in message
news:14051f46-8820-46bd-9cc8-10705b7b402e(a)l19g2000yqb.googlegroups.com...
> On Jan 22, 12:04 pm, Hector Santos <sant9...(a)nospam.gmail.com> wrote:
>
> In the interest of honesty: I, too, wrote CSV parsers (or participated
> in writing them) in my time.
>
..
