From: Ashley Sheridan on
On Thu, 2010-03-18 at 12:57 -0400, Paul M Foster wrote:

> On Thu, Mar 18, 2010 at 04:15:33PM +0000, Ashley Sheridan wrote:
>
> > On Thu, 2010-03-18 at 12:12 -0400, Paul M Foster wrote:
> >
> > On Thu, Mar 18, 2010 at 08:57:00AM -0700, Tommy Pham wrote:
> >
> > <snip>
> >
> > >
> > > Personally, I find working with fixed widths is best. The text file
> > > might be larger but I don't have to worry about escaping any type of
> > > characters ;)
> >
> > I find this impossible, since I never know the largest width of all the
> > fields in a file. And a simple explode() call allows pulling all the
> > fields into an array, based on a common delimiter.
> >
> > Paul
> >
> > --
> > Paul M. Foster
> >
> >
> >
> > Explode won't work in the case of a comma in a field value.
>
> That's why I convert the files to tab-delimited first. explode() does
> work in that case.
>
> >
> > Also, newlines can exist within a field value, so a line in the file doesn't
> > equate to a row of data
>
> I've never seen this in the files I receive.
>
> >
> > The best way is just to start parsing at the beginning of the file and break it
> > into fields one by one from there.
> >
> > The bit I don't like about characters other than a comma being used in a "comma
> > separated values" file is that you can't automatically tell what character has
> > been used as the delimiter. Hence being asked by spreadsheet programs what the
> > delimiter is if a comma doesn't give up what it recognises as valid fields.
>
> I've honestly never seen a "CSV" or "Comma-separated Values" which used
> tabs for delimiters. At that point, it's really not a *comma* separated
> value file.
>
> My application for all this is accepting mailing lists from customers
> which I have to convert into DBFs for a commercial mailing list program.
> Because most of my customers can barely find the on/off switch on their
> computers, I never know what I'm going to get. So before I string
> together the filters to process the file, I have to actually look at and
> analyze the file to find out what it is. Could be a fixed-field length
> file, a CSV, a tab-delimited file, or anything in between. Once I've
> selected the filters, the sequence they will be put together in, and the
> fields from the file I want to capture, I hit the button. After it's all
> done, I now have to look at the result to ensure that the requested
> fields ended up where they were supposed to.
>
> Paul
>
> --
> Paul M. Foster
>


But surely whatever character is used as the delimiter could also appear
in a field's value?

I hadn't even known that newlines could exist within field values, until
one broke a script of mine!
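
For what it's worth, PHP's built-in fgetcsv() seems to cope with quoted
fields that contain the delimiter or embedded newlines. A rough sketch
(file name invented for the example):

<?php
// Rough sketch: fgetcsv() honours the enclosure character, so a quoted
// field may contain commas or even newlines without breaking the record.
$fh = fopen('customers.csv', 'r');          // hypothetical file name
if ($fh !== false) {
    while (($row = fgetcsv($fh, 0, ',', '"')) !== false) {
        // $row holds one logical record, even if it spanned several
        // physical lines in the file.
        print_r($row);
    }
    fclose($fh);
}
?>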

And I believe that when MS Office saves a CSV out with a character other
than a comma as the delimiter, it still saves it as a .csv by default.

Thanks,
Ash
http://www.ashleysheridan.co.uk


From: Andrew Ballard on
On Thu, Mar 18, 2010 at 1:00 PM, Ashley Sheridan
<ash(a)ashleysheridan.co.uk> wrote:
[snip]
> And I believe that when MS Office saves a CSV out with a character other
> than a comma as the delimiter, it still saves it as a .csv by default.

Nope. If you save as CSV, it is comma-separated with double-quotes as
the text qualifier. There is also an option to save in tab-delimited
format, but the default extension for that is .txt.

The only issue I have with Excel handling text files is with columns
like ZIP code that should be treated as text (they are string
sequences that happen to contain only numeric digits where leading
zeros are significant) but are interpreted as numbers.

Andrew
From: Paul M Foster on
On Thu, Mar 18, 2010 at 05:00:24PM +0000, Ashley Sheridan wrote:

> On Thu, 2010-03-18 at 12:57 -0400, Paul M Foster wrote:
>
> On Thu, Mar 18, 2010 at 04:15:33PM +0000, Ashley Sheridan wrote:
>
> > On Thu, 2010-03-18 at 12:12 -0400, Paul M Foster wrote:
> >
> > On Thu, Mar 18, 2010 at 08:57:00AM -0700, Tommy Pham wrote:
> >

<snip>

> > Explode won't work in the case of a comma in a field value.
>
> That's why I convert the files to tab-delimited first. explode() does
> work in that case.
>
> >
> > Also, newlines can exist within a field value, so a line in the file
> > doesn't equate to a row of data
>
> I've never seen this in the files I receive.
>
> >
> > The best way is just to start parsing at the beginning of the file and
> > break it into fields one by one from there.
> >
> > The bit I don't like about characters other than a comma being used in
> > a "comma separated values" file is that you can't automatically tell
> > what character has been used as the delimiter. Hence being asked by
> > spreadsheet programs what the delimiter is if a comma doesn't give up
> > what it recognises as valid fields.
>
> I've honestly never seen a "CSV" or "Comma-separated Values" which used
> tabs for delimiters. At that point, it's really not a *comma* separated
> value file.
>
> My application for all this is accepting mailing lists from customers
> which I have to convert into DBFs for a commercial mailing list program.
> Because most of my customers can barely find the on/off switch on their
> computers, I never know what I'm going to get. So before I string
> together the filters to process the file, I have to actually look at and
> analyze the file to find out what it is. Could be a fixed-field length
> file, a CSV, a tab-delimited file, or anything in between. Once I've
> selected the filters, the sequence they will be put together in, and the
> fields from the file I want to capture, I hit the button. After it's all
> done, I now have to look at the result to ensure that the requested
> fields ended up where they were supposed to.
>
> Paul
>
> --
> Paul M. Foster
>
>
>
> But surely whatever character is used as the delimiter could also appear
> in a field's value?

Well, remember I shove these into tab-delimited files. It does
occasionally happen that someone will slip a tab into a field. When that
happens, I can tell because the final result is off. Then I do a hex dump
of the file (in PHP) to determine whether it actually is a tab. If so, I
prepend a filter to the chain which removes tabs from the original CSV
file, then proceed as before.
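
The hex dump is nothing elaborate; something along these lines (file name
invented for the example) is enough to make a stray 0x09 stand out:

<?php
// Sketch of a quick hex dump: print each byte so a literal tab (0x09)
// buried in a field stands out from the expected delimiters.
$data = file_get_contents('customer_list.txt');   // hypothetical file name
foreach (str_split($data, 16) as $i => $chunk) {
    printf("%08x  %-48s %s\n",
           $i * 16,
           chunk_split(bin2hex($chunk), 2, ' '),
           preg_replace('/[^\x20-\x7e]/', '.', $chunk));
}
?>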

Occasionally someone will send me a file in a "label" format which
contains \x0C characters or somesuch at page boundaries. I actually have
to look at the file and find out what they've inserted. I have filters
for most anything I find like that.
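
The filters themselves are mostly one-liners; the \x0C one, for instance,
amounts to something like this (file names invented for the example):

<?php
// Sketch: strip form feeds (\x0C) and any other non-printing bytes,
// apart from tab, CR and LF, before the file enters the normal chain.
$raw   = file_get_contents('label_format.txt');   // hypothetical file name
$clean = preg_replace('/[^\x09\x0A\x0D\x20-\x7E]/', '', $raw);
file_put_contents('label_format_clean.txt', $clean);
?>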

Paul

--
Paul M. Foster
From: Jason Pruim on

On Mar 18, 2010, at 1:26 PM, Andrew Ballard wrote:

> On Thu, Mar 18, 2010 at 1:00 PM, Ashley Sheridan
> <ash(a)ashleysheridan.co.uk> wrote:
> [snip]
>> And I believe that when MS Office saves a CSV out with a character
>> other
>> than a comma as the delimiter, it still saves it as a .csv by
>> default.
>
> Nope. If you save as CSV, it is comma-separated with double-quotes as
> the text qualifier. There is also an option to save in tab-delimited
> format, but the default extension for that is .txt.
>
> The only issue I have with Excel handling text files is with columns
> like ZIP code that should be treated as text (they are string
> sequences that happen to contain only numeric digits where leading
> zeros are significant) but are interpreted as numbers.
>
> Andrew


Hi Andrew,

As a fellow mailing list processor I can feel your pain... One thing I
have found is that when you are importing the data, you can select the zip
column and change the format from "general" to "text", and it will
maintain the leading zeros. Or a simple filter applied afterwards will
help too.
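
The "simple filter" can be as little as padding the column back out; a
rough sketch, assuming plain 5-digit zips:

<?php
// Sketch: pad a zip that lost its leading zeros back out to 5 digits.
// Assumes plain 5-digit zips, not ZIP+4.
function fix_zip($zip)
{
    return str_pad(trim($zip), 5, '0', STR_PAD_LEFT);
}

echo fix_zip('123');   // prints "00123"
?>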

But if you have decent CASS software then it should add the zip back
in hehe :)

From: Ashley Sheridan on
On Fri, 2010-03-19 at 18:01 -0400, Jason Pruim wrote:

> On Mar 18, 2010, at 1:26 PM, Andrew Ballard wrote:
>
> > On Thu, Mar 18, 2010 at 1:00 PM, Ashley Sheridan
> > <ash(a)ashleysheridan.co.uk> wrote:
> > [snip]
> >> And I believe that when MS Office saves a CSV out with a character
> >> other
> >> than a comma as the delimiter, it still saves it as a .csv by
> >> default.
> >
> > Nope. If you save as CSV, it is comma-separated with double-quotes as
> > the text qualifier. There is also an option to save in tab-delimited
> > format, but the default extension for that is .txt.
> >
> > The only issue I have with Excel handling text files is with columns
> > like ZIP code that should be treated as text (they are string
> > sequences that happen to contain only numeric digits where leading
> > zeros are significant) but are interpreted as numbers.
> >
> > Andrew
>
>
> Hi Andrew,
>
> As a fellow mailing list processor I can feel your pain... One thing I
> have found is that when you are importing the data, you can select the
> zip column and change the format from "general" to "text", and it will
> maintain the leading zeros. Or a simple filter applied afterwards will
> help too.
>
> But if you have decent CASS software then it should add the zip back
> in hehe :)
>
>


It's not really just that. In the CSV format, a field value of 00123 (I
don't really know what zip code formats are) is perfectly valid.
Unfortunately, Excel (and Calc) tries to be clever and strips out
leading zeros on a field it recognises as all numbers. This is annoying
for things like zip codes and phone numbers (which in the UK mostly all
start with a 0).

I think enclosing the field in quote marks, to signify it's a string and
not something the software should guess at, is about the only way to
ensure it works effectively.
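
At least when writing the file out, PHP's fputcsv() adds the quote marks
wherever a field contains the delimiter, the enclosure or a newline. A
rough sketch, though I suspect Excel may still coerce a quoted all-digit
field to a number on import:

<?php
// Rough sketch: fputcsv() encloses a field in quotes when it contains the
// delimiter, the enclosure character or a newline. A spreadsheet may still
// decide a quoted "00123" is the number 123 when it imports the file.
$rows = array(
    array('00123', 'Smith, John', "Flat 1\nHigh Street"),
);
$fh = fopen('out.csv', 'w');   // hypothetical output file
foreach ($rows as $row) {
    fputcsv($fh, $row, ',', '"');
}
fclose($fh);
?>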

Thanks,
Ash
http://www.ashleysheridan.co.uk