From: gareth erskine-jones on
On Sun, 06 Jun 2010 22:32:58 GMT, "Geoff Schaller"
<geoffx(a)softxwareobjectives.com.au> wrote:


Celko's manner certainly seems to annoy many people... but I've found
his books to be very useful, so I find it hard to join in with the
Joe-bashing...

In this case, other people have given the OP practical solutions - the
actual syntax needed to remove the offending rows. It's also been
pointed out that situations like this do in fact occur in the real
world: duplicates in relational databases make no sense (the set
{A, A} is the same as the set {A}), but data often has to be imported
from files, and those files do contain duplicates.

Nonetheless, Celko does have a point. The OP said,

>What I need to do is to delete one of the rows that has itemnum = 1 and
>firstname = 'keith'. I don't care which one is deleted. It does NOT matter.
>The resultant table needs to have itemnum and firstname unique together.

So he cares about itemnum and firstname, but not about the notes
column.

The first solution given eliminates the duplicates nondeterministically
- that is, which row is eliminated depends on the order in which the
rows happen to be returned. The second eliminates duplicates based on
the value of the notes column: a note of "foo" will always be removed
in favour of "bar".
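
For what it's worth, a deterministic delete of that kind might look
roughly like this - just a sketch, written against a hypothetical
#items table rather than the OP's actual table, and not necessarily
the exact query posted earlier in the thread:

-- Ordering by notes makes the choice of surviving row deterministic;
-- ordering by (SELECT NULL) instead would give the "whichever row
-- happens to come back first" behaviour of the first solution.
;WITH numbered AS
(
    SELECT itemnum, firstname, notes,
           ROW_NUMBER() OVER (PARTITION BY itemnum, firstname
                              ORDER BY notes) AS rn
    FROM #items
)
DELETE FROM numbered
WHERE rn > 1;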

I agree absolutely that in the "real world" this sort of thing
happens. Surely, though, when importing the data and insisting on
things like itemnum and firstname uniquely identifying a row, we
generally want to be able to say "this is good data", as opposed to
"the data in the notes column is more or less random".

If the OP does not care about the contents of the notes column, then
the simplest solution is to lose it:

select distinct itemnum, firstname, '' as notes from ...

This is perfectly practical; unlike the first solution suggested it's
well defined, and unlike the second it doesn't favour notes beginning
with 'a' over those beginning with 'b'.

Another alternative, if the notes column does contain valuable
information, is to concatenate the contents, preserving it all.
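
Something along these lines, say - again only a sketch against the
same hypothetical #items table, using the FOR XML PATH trick for the
string aggregation (and ignoring the caveat that XML-special
characters in the notes would need extra handling):

-- Collapse each (itemnum, firstname) group into one row, joining the
-- notes values with "; " and stripping the leading separator.
SELECT i.itemnum,
       i.firstname,
       STUFF((SELECT '; ' + i2.notes
              FROM #items AS i2
              WHERE i2.itemnum = i.itemnum
                AND i2.firstname = i.firstname
              FOR XML PATH('')), 1, 2, '') AS notes
FROM #items AS i
GROUP BY i.itemnum, i.firstname;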



I really don't know - perhaps he is. Perhaps not. I recently worked on
an application where the first of those solutions (basically, pick one
of the rows arbitrarily) was used. I rewrote some queries to improve
performance and found I had a different result set, which made things
a lot harder to test, and also harder to explain to the customer.
This is the point when you say to the customer - look, we have these
two rows, different apart from the "notes" column, which should we
use? Then the customer can think about whether he really means it when
he says he doesn't care about which one is selected. Certainly you
should prefer a solution like Erland's, which is deterministic and
isn't going to change with a service pack release or something.
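
The rows to put in front of the customer are easy enough to list,
something like this (a sketch again, same hypothetical table):

-- Show every row whose (itemnum, firstname) also appears with a
-- different notes value, so the conflict is visible side by side.
-- (NULL notes would need extra care; omitted here for brevity.)
SELECT i.itemnum, i.firstname, i.notes
FROM #items AS i
WHERE EXISTS (SELECT *
              FROM #items AS dup
              WHERE dup.itemnum = i.itemnum
                AND dup.firstname = i.firstname
                AND dup.notes <> i.notes)
ORDER BY i.itemnum, i.firstname, i.notes;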

..

>Hmmmm.... exposing your inexperience again?
..
If someone asks "how do I do X?", it is occasionally appropriate to
say, "You don't want to do X; you actually want to do Y". This
involves telling someone that they're wrong, which can be done very
gently or otherwise. Celko generally does it "otherwise", but he is
often right....

GSEJ
>
>Out here in the real world there are actually quite a lot of
>circumstances where duplicates are quite natural. One is an auctioning
>style system. Multiple entries are created against the same item but at
>some point, competing values are removed, resulting in one unique
>transaction. Another is a data matching service. For example name
>cleaning. You start with 20 'similar' names. The process discovers that
>15 are the same and renders common fields common and determines the
>common 'key'. The duplicates are removed. Haven't you heard of de-duping
>processes?
>
>Geoff
>
>
>
>"--CELKO--" <jcelko212(a)earthlink.net> wrote in message
>news:3cd59f1e-eb54-470b-a3b4-3f4745e583db(a)j8g2000yqd.googlegroups.com:
>
>> >> What I need to do is to delete one of the rows that has item_nbr = 1 and first_name = 'keith'. I don't care which one is deleted. It does NOT matter.<<
>>
>>
>> This is a reallllly bad data management policy. Please care and find
>> out why you are getting duplicates in your process. Mop the floor, but
>> always fix the leak.