From: Tim on
Hullo
CSV is a very common format for publishing data as a form of primitive
integration. It's an annoyingly brittle approach, so I'd like to
ensure that I capture errors as soon as possible, so that I can get
the upstream processes fixed, or at worst put in some correction
mechanisms and avoid getting polluted data into my analyses.

A symptom of several types of errors is that the number of fields
being interpreted varies over a file (e.g. from wrongly embedded quote
strings or mishandled embedded newlines). My preferred approach would
be to get DictReader to throw an exception when encountering such
oddities, but at the moment it seems to try to patch over the error
and fill in the blanks for short lines, or ignore long lines. I know
that I can use the restval parameter and then check for what's been
parsed when I get my results back, but this seems brittle as whatever
I use for restval could legitimately be in the data.
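
For reference, a minimal sketch of the behaviour described above, using a made-up three-column sample (the data is purely illustrative). Note that in current Python versions the surplus fields of a long line are not dropped: they are collected in a list under restkey, which defaults to None.

```python
import csv
import io

# Hypothetical sample: three header fields, one short row, one long row.
sample = "a,b,c\n1,2\n1,2,3,4\n"

rows = list(csv.DictReader(io.StringIO(sample)))

# Short row: the missing field is filled with restval (None by default).
print(rows[0])

# Long row: surplus fields are stored as a list under restkey (None by default).
print(rows[1])
```

Neither case raises an exception, which is the problem being asked about.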

Is there any way to get csv.DictReader to throw an exception on such
simple line errors, or am I going to have to use csv.reader and
explicitly check for the number of fields read in on each line?

cheers

Tim
From: Peter Otten on
Tim wrote:

> CSV is a very common format for publishing data as a form of primitive
> integration. It's an annoyingly brittle approach, so I'd like to
> ensure that I capture errors as soon as possible, so that I can get
> the upstream processes fixed, or at worst put in some correction
> mechanisms and avoid getting polluted data into my analyses.
>
> A symptom of several types of errors is that the number of fields
> being interpreted varies over a file (e.g. from wrongly embedded quote
> strings or mishandled embedded newlines). My preferred approach would
> be to get DictReader to throw an exception when encountering such
> oddities, but at the moment it seems to try to patch over the error
> and fill in the blanks for short lines, or ignore long lines. I know
> that I can use the restval parameter and then check for what's been
> parsed when I get my results back, but this seems brittle as whatever
> I use for restval could legitimately be in the data.
>
> Is there any way to get csv.DictReader to throw an exception on such
> simple line errors, or am I going to have to use csv.reader and
> explicitly check for the number of fields read in on each line?

I think you have to use csv.reader. Untested:

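import csv

def DictReader(f, fieldnames=None, *args, **kw):
    reader = csv.reader(f, *args, **kw)
    if fieldnames is None:
        # Take the header from the first row; stop quietly on empty input.
        fieldnames = next(reader, None)
        if fieldnames is None:
            return
    for row in reader:
        if row:  # csv.reader yields [] for blank lines; skip them
            if len(fieldnames) != len(row):
                raise ValueError("line %d: expected %d fields, got %d"
                                 % (reader.line_num, len(fieldnames), len(row)))
            yield dict(zip(fieldnames, row))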
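
A self-contained variant of the same idea with a small usage example, in case it helps. The name strict_dict_reader and the sample strings are made up for illustration. It is only a sketch, and it assumes that skipping blank lines and reporting the position via reader.line_num is what you want.

```python
import csv
import io


def strict_dict_reader(f, fieldnames=None, *args, **kw):
    """Yield one dict per row; raise ValueError on a field-count mismatch."""
    reader = csv.reader(f, *args, **kw)
    if fieldnames is None:
        # Use the first row as the header; an empty input yields nothing.
        fieldnames = next(reader, None)
        if fieldnames is None:
            return
    for row in reader:
        if not row:
            continue  # csv.reader yields [] for blank lines
        if len(row) != len(fieldnames):
            raise ValueError(
                "line %d: expected %d fields, got %d"
                % (reader.line_num, len(fieldnames), len(row))
            )
        yield dict(zip(fieldnames, row))


# Hypothetical well-formed input (with one blank line) parses normally.
good = "a,b\n1,2\n\n3,4\n"
good_rows = list(strict_dict_reader(io.StringIO(good)))
print(good_rows)

# Hypothetical input with a short final row is rejected.
bad = "a,b\n1,2\n3\n"
try:
    list(strict_dict_reader(io.StringIO(bad)))
    error = None
except ValueError as exc:
    error = str(exc)
    print("rejected:", error)
```

Because this is a generator, the error only surfaces once the offending row is reached, so rows before it will already have been yielded.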

Peter