From: Ryan Chan on
Hello,

Consider the case:

You have 200 lines of mapping to replace, in a csv format, e.g.

apple,orange
boy,girl
...

You have a 500MB file, you want to replace all 200 lines of mapping,
what would be the most efficient way to do it?

Thanks.
From: The Natural Philosopher on
Ryan Chan wrote:
> Hello,
>
> Consider the case:
>
> You have 200 lines of mapping to replace, in a csv format, e.g.
>
> apple,orange
> boy,girl
> ...
>
> You have a 500MB file, you want to replace all 200 lines of mapping,
> what would be the most efficient way to do it?
>
> Thanks.
Replace what with what?
From: pk on
Ryan Chan wrote:

> Consider the case:
>
> You have 200 lines of mapping to replace, in a csv format, e.g.
>
> apple,orange
> boy,girl
> ...
>
> You have a 500MB file, you want to replace all 200 lines of mapping,
> what would be the most efficient way to do it?

Not sure about "most efficient", but with awk you can do all of that in a
single pass (almost) over the data:

awk -F, 'NR==FNR { a[$1] = $2; next }          # first file: load the mapping
         { for (i in a) gsub(i, a[i]); print }  # data file: apply every mapping
        ' mapfile datafile

However, that has at least two problems, which may or may not be relevant
for your scenario:

1) It does not know about "words", so if "pineapple" appears in the data, it
will become "pineorange";

2) It assumes that none of the strings contain regex metacharacters, so it
will likely produce wrong results if one of the words to replace is, say,
"a.*b" or similar.
From: John Hasler on
man sed
--
John Hasler
jhasler(a)newsguy.com
Dancing Horse Hill
Elmwood, WI USA
From: Chris Davies on
Ryan Chan <ryanchan404(a)gmail.com> wrote:
> You have 200 lines of mapping to replace, in a csv format, e.g.

> You have a 500MB file, you want to replace all 200 lines of mapping,
> what would be the most efficient way to do it?

Define "efficiency". For example, is this a one-off, where you want to
make the most efficient use of your people's time? Or is it going to run
multiple times per hour, so that it is worth having someone spend a
significant amount of time working out and implementing a scheme that
runs in as short a time as is realistically possible?

Chris