From: Bill Mudry on
Back about May 10, I presented a difficulty I was having trying to
add in more (distinct) records from one
MySQL table into another that permanently holds the most important
data that I have. Despite help from
a couple of you in the forum and numerous efforts by me since,
success on my efforts have been denied
<big sigh!>.

First, some review and introduction:
This is for a knowledge base on all woods around the world. The more
woods can be reported,
the more successful is the project. I have named the project 'TAXA"
(..short for taxonomy). Many of the
other features I have wanted to add have been coming on quite well
but this task is still evading any success.
If you care to directly view what this is about, take a look at
http://www.prowebcanada.com/taxa/.
In fact, you can escape the frames menu and go right to where species
only are displayed:
http://www.prowebcanada.com/taxa/alphaspecies.php.
There are 4 major entry points to all the data but you have to be
concerned with the 'species' level.

The data there is stored in a table named 'species' (ie. -
species.sql). All tables in the project are in database 'taxa'. The
more (botanical) names of woods that I can find and add in, the more
successful is the whole project. So far I am reporting just over
6,200 different woods (as I say, all in species.sql).

The Thorny Problem
Presently I use all kinds of sources to find new woods. The largest
list of woods I have is in a sql file
called 'tervuren', with 11,337 wood names. That is WAAYY larger than
what is in species.sql so there
is a huge interest in copying over these records to species.sql It
would be a huge boost that even with
duplicates avoided, that would almost double the number of woods I
could report!

That is easy said but I have almost being tearing my hair out (...
and I am almost bald anyway ;-) )
trying to accomplish this. To prepare for this:

- I protected the integrity of my original files
(species.sql and tervuren.sql) by copying them
as working files species_master.sql and
tervuren_target.sql. It is with these that I will
(somehow) copy all wood species names that are NOT in the
species_master table that
are, however, in the tervuren_target file. (Later after
species_master is expanded, I can
rename it back again to 'species').

- The key column for comparison in both files is called
'species_name' in both files.

- I have already treated the tervuren_target table so that
there are no duplicate records
within itself with respect to column species_name.

- There are a couple columns I do not have to carry over to
the species_master table from
the tervuren_target table. For the columns that do have
to, I made sure
they are named exactly the same in both files and that
the collation is the same in both.

- The field names common to both the source for my data (ie.
tervuren_target) and for
the destination for the new data (species_master) are:

genus_name, species_name, authorities_species, source

- The species_master table has about 18 total fields but
except for the ones above, none of the
other fields I have not listed will receive any new data
whatsoever. They are (so to speak) just
along for the ... ride. All new data will go into the 4
fields listed above exclusively.

- In any attempt to avoid new duplicates forming as data
comes over into species_master,
at NO TIME WHATSOEVER can we allow any records in
species_master to be deleted!
The reason is because meaningful data in the other fields
in species_master would be
lost and data integrity would be violated.

Just adding ALL records from tervuren_ target was easy but it sure
did not give me the result i need. Both
the tervuren_target table and the species_master table have wood
names in them that are common to each.
As copying records over is happening, great care has to be taken to
avoid NEW duplicates forming via
the combining of records from both tables. In fact, just merging all
files from tervuren_target over into
species_master created over 3,000 NEW DUPLICATES! The result was an
unusable mess!

===> .... and that is where all the problem is that has stopped me
from being successful!

I am hoping that an SQL statement would do all the work fine .... but
if that will not work, I am also open to
adding any PHP lines for more versatility if needed.

I have made a few column name changes since approaching the group
back around May 10 but other than that,
I tried out a script tonight suggested to me back then:

SELECT DISTINCT tervuren_target.genus_name,
tervuren_target.species_name, tervuren_target.authorities_species,
tervuren_target.source
FROM tervuren_target
LEFT JOIN species_master
ON tervuren_target.species_name = species_master.species_name
LIMIT 0,12000

I had to add in the LIMIT statement or for some reason the server
would time out every time. The script ran ok
but when I looked at the number of records it created (11,377) that
was EXACTLY how many records
there are in the original Tervuren file! In other words, there was no
success in avoiding new duplicates at all
and the data did NOT actually merge into species_master <big sigh!>.
I also predicted that the DISTINCT
statement would only work on the tervuren_target table (which was
already cleaned of duplicates within itself)
and I believe I was right. it seems to be extraneous and not the
needed answer.

..... I had never thought it should be that hard just to add in new
data to a table of central important without
introducing new duplicate records --- but it is.

Therefore I come in humble request to all of you, hoping someone can
figure out what actually will work.
The solution is harder than may first seem. My email address if you
need it is billmudry at rogers.com

With thanks in advance,

Fingers crossed, toes crossed, eyeballs crossed .......

Bill Mudry
Mississauga, Ontario Canada.


From: Niel Archer on
> Back about May 10, I presented a difficulty I was having trying to
> add in more (distinct) records from one
> MySQL table into another that permanently holds the most important
> data that I have. Despite help from
> a couple of you in the forum and numerous efforts by me since,
> success on my efforts have been denied
> <big sigh!>.
>
> First, some review and introduction:
> This is for a knowledge base on all woods around the world. The more
> woods can be reported,
> the more successful is the project. I have named the project 'TAXA"
> (..short for taxonomy). Many of the
> other features I have wanted to add have been coming on quite well
> but this task is still evading any success.
> If you care to directly view what this is about, take a look at
> http://www.prowebcanada.com/taxa/.
> In fact, you can escape the frames menu and go right to where species
> only are displayed:
> http://www.prowebcanada.com/taxa/alphaspecies.php.
> There are 4 major entry points to all the data but you have to be
> concerned with the 'species' level.
>
> The data there is stored in a table named 'species' (ie. -
> species.sql). All tables in the project are in database 'taxa'. The
> more (botanical) names of woods that I can find and add in, the more
> successful is the whole project. So far I am reporting just over
> 6,200 different woods (as I say, all in species.sql).
>
> The Thorny Problem
> Presently I use all kinds of sources to find new woods. The largest
> list of woods I have is in a sql file
> called 'tervuren', with 11,337 wood names. That is WAAYY larger than
> what is in species.sql so there
> is a huge interest in copying over these records to species.sql It
> would be a huge boost that even with
> duplicates avoided, that would almost double the number of woods I
> could report!
>
> That is easy said but I have almost being tearing my hair out (...
> and I am almost bald anyway ;-) )
> trying to accomplish this. To prepare for this:
>
> - I protected the integrity of my original files
> (species.sql and tervuren.sql) by copying them
> as working files species_master.sql and
> tervuren_target.sql. It is with these that I will
> (somehow) copy all wood species names that are NOT in the
> species_master table that
> are, however, in the tervuren_target file. (Later after
> species_master is expanded, I can
> rename it back again to 'species').
>
> - The key column for comparison in both files is called
> 'species_name' in both files.
>
> - I have already treated the tervuren_target table so that
> there are no duplicate records
> within itself with respect to column species_name.
>
> - There are a couple columns I do not have to carry over to
> the species_master table from
> the tervuren_target table. For the columns that do have
> to, I made sure
> they are named exactly the same in both files and that
> the collation is the same in both.
>
> - The field names common to both the source for my data (ie.
> tervuren_target) and for
> the destination for the new data (species_master) are:
>
> genus_name, species_name, authorities_species, source
>
> - The species_master table has about 18 total fields but
> except for the ones above, none of the
> other fields I have not listed will receive any new data
> whatsoever. They are (so to speak) just
> along for the ... ride. All new data will go into the 4
> fields listed above exclusively.
>
> - In any attempt to avoid new duplicates forming as data
> comes over into species_master,
> at NO TIME WHATSOEVER can we allow any records in
> species_master to be deleted!
> The reason is because meaningful data in the other fields
> in species_master would be
> lost and data integrity would be violated.
>
> Just adding ALL records from tervuren_ target was easy but it sure
> did not give me the result i need. Both
> the tervuren_target table and the species_master table have wood
> names in them that are common to each.
> As copying records over is happening, great care has to be taken to
> avoid NEW duplicates forming via
> the combining of records from both tables. In fact, just merging all
> files from tervuren_target over into
> species_master created over 3,000 NEW DUPLICATES! The result was an
> unusable mess!
>
> ===> .... and that is where all the problem is that has stopped me
> from being successful!
>
> I am hoping that an SQL statement would do all the work fine .... but
> if that will not work, I am also open to
> adding any PHP lines for more versatility if needed.
>
> I have made a few column name changes since approaching the group
> back around May 10 but other than that,
> I tried out a script tonight suggested to me back then:
>
> SELECT DISTINCT tervuren_target.genus_name,
> tervuren_target.species_name, tervuren_target.authorities_species,
> tervuren_target.source
> FROM tervuren_target
> LEFT JOIN species_master
> ON tervuren_target.species_name = species_master.species_name
> LIMIT 0,12000
>
> I had to add in the LIMIT statement or for some reason the server
> would time out every time. The script ran ok
> but when I looked at the number of records it created (11,377) that
> was EXACTLY how many records
> there are in the original Tervuren file! In other words, there was no
> success in avoiding new duplicates at all
> and the data did NOT actually merge into species_master <big sigh!>.
> I also predicted that the DISTINCT
> statement would only work on the tervuren_target table (which was
> already cleaned of duplicates within itself)
> and I believe I was right. it seems to be extraneous and not the
> needed answer.
>
> .... I had never thought it should be that hard just to add in new
> data to a table of central important without
> introducing new duplicate records --- but it is.
>
> Therefore I come in humble request to all of you, hoping someone can
> figure out what actually will work.
> The solution is harder than may first seem. My email address if you
> need it is billmudry at rogers.com
>
> With thanks in advance,
>
> Fingers crossed, toes crossed, eyeballs crossed .......
>
> Bill Mudry
> Mississauga, Ontario Canada.

This would be much better taken to the DB list, where the entire
audience is knowledgable about databases, so answers are more likely ;-)

First, what database engine are you using? All engines have specific
idiosyncrasies that need to be accounted for, and knowing the engine
will allow more specific answers to your problems.

--
Niel Archer


From: Bill Mudry on
At 10:10 AM 13/06/2010, you wrote:
<cut for brevity>
> > Fingers crossed, toes crossed, eyeballs crossed .......
> >
> > Bill Mudry
> > Mississauga, Ontario Canada.
>
>This would be much better taken to the DB list, where the entire
>audience is knowledgable about databases, so answers are more likely ;-)

Advice taken. Will seek them out.


>First, what database engine are you using? All engines have specific
>idiosyncrasies that need to be accounted for, and knowing the engine
>will allow more specific answers to your problems.

At one point I had changed the tables I was working with to InnoDb
but thought I changed them all back to MYISAM and found this
morning that didn't stick. I just made sure they are all MYISAM now.

Bill

>--
>Niel Archer
>
>
>
>--
>PHP Windows Mailing List (http://www.php.net/)
>To unsubscribe, visit: http://www.php.net/unsub.php