Finding the common elements in two strings . [General Programming]

Prev: Analysis of detecting loops in a singly linked list.
Next: "Why Scripting Is Evil" -- comments?

From: bart.c on 30 Apr 2010 08:54

"mohangupta13" <mohangupta13(a)gmail.com> wrote in message
news:2964019b-3f82-4c28-833f-d3752f3eb75f(a)h37g2000pra.googlegroups.com...
> Problem: Given two strings S1 & S2 , return the common elements in
> both the strings ??
>
> One way I thought is to take a b-bit vector ( assuming the range of
> elements in the strings is 0-b) ...now in the first pass through the
> string S1 set all those bit positions that have an element equal to
> them . i.e set i'th bit if we an element with value i..
> Now go through S2 and and AND the i'th bit if we have an element equal
> to i. Now return the elements for which we have a bit set.
>
> Now there are various problems with this 1) It assumes elements in the
> range 0-b .2) It just returns one instance of the common element .i.e
> if S1= AABCA and S2= AAABCBA o/p= ABC and not AAABC .
>
> Can anyone please provide a better way to do this ( in possibly linear
> time and as small space as possible)...??

If by better you mean, generating the right output, I used this:

s1:="AABCA"
s2:="AAABCBA"

hist1:=hist2:=(0,)*256 #ie. histograms

foreach c in s1 do ++hist1[asc(c)] od
foreach c in s2 do ++hist2[asc(c)] od

s3:=""
for i:='A' to 'Z' do # or 0 to 255
if hist1[i] and hist2[i] then
s3+ := chr(i)*(hist1[i] min hist2[i])
fi
od

println "S1=",s1
println "S2=",s2
println "S3=",s3

(The if-statement is not essential, as "A"*0 is "", but is kept for
clarity.)

The process is linear I think, but requires two vectors of 256 (or perhaps
26, depending on your strings) elements.

--
Bartc

From: mohangupta13 on 30 Apr 2010 11:01

On 30 Apr, 14:54, "bart.c" <ba...(a)freeuk.com> wrote:
> "mohangupta13" <mohangupt...(a)gmail.com> wrote in message
>
> news:2964019b-3f82-4c28-833f-d3752f3eb75f(a)h37g2000pra.googlegroups.com...
>
>
>
>
>
> > Problem: Given two strings S1 & S2 , return the common elements in
> > both the strings ??
>
> > One way I thought is to take a b-bit vector ( assuming the range of
> > elements in the strings is 0-b) ...now in the first pass through the
> > string S1 set all those bit positions that have an element equal to
> > them . i.e set i'th bit if we an element with value i..
> > Now go through S2 and and AND the i'th bit if we have an element equal
> > to i. Now return the elements for which we have a bit set.
>
> > Now there are various problems with this 1) It assumes elements in the
> > range 0-b .2) It just returns one instance of the common element .i.e
> > if S1= AABCA and S2= AAABCBA o/p= ABC and not AAABC .
>
> > Can anyone please provide a better way to do this ( in possibly linear
> > time and as small space as possible)...??

I did meant the intersection of the multisets represented by S1 &
S2.
>
> If by better you mean, generating the right output, I used this:

By better I mean ...an algorithm that runs in linear time and uses as
little space as possible ( should ofcourse produce right output)..
>
> s1:="AABCA"
> s2:="AAABCBA"
>
> hist1:=hist2:=(0,)*256 #ie. histograms

to further reduce the number of lists used to just one I would rather
do ( suppose the length of the strings is at max n )
then i will take an x to be the smallest power of 10 such that 10^x
<n .
>
> foreach c in s1 do ++hist1[asc(c)] od

foreach c in s1 do hist1[asc(c)]+=10^x;

> foreach c in s2 do ++hist2[asc(c)] od
the seocnd loop remains same (but on hist1)

......foreach c in s2 do ++hist1[asc(c)]

now
for i:A to Z
s3+=chr(i)* ( hist1[i]/ 10^x min hist1[i]% 10^x)

>
> s3:=""
> for i:='A' to 'Z' do # or 0 to 255
> if hist1[i] and hist2[i] then
> s3+ := chr(i)*(hist1[i] min hist2[i])
> fi
> od
>
> println "S1=",s1
> println "S2=",s2
> println "S3=",s3
>
> (The if-statement is not essential, as "A"*0 is "", but is kept for
> clarity.)
>
> The process is linear I think, but requires two vectors of 256 (or perhaps
> 26, depending on your strings) elements.
>
so here i use only a single vector ......hope this works (does it ???)

Mohan

> --
> Bartc

From: bart.c on 30 Apr 2010 11:31

mohangupta13 wrote:
> On 30 Apr, 14:54, "bart.c" <ba...(a)freeuk.com> wrote:
>> "mohangupta13" <mohangupt...(a)gmail.com> wrote in message
>> news:2964019b-3f82-4c28-833f-d3752f3eb75f(a)h37g2000pra.googlegroups.com...

>>> Problem: Given two strings S1 & S2 , return the common elements in
>>> both the strings ??

>>> Now there are various problems with this 1) It assumes elements in
>>> the range 0-b .2) It just returns one instance of the common
>>> element .i.e if S1= AABCA and S2= AAABCBA o/p= ABC and not AAABC .
>>
>>> Can anyone please provide a better way to do this ( in possibly
>>> linear time and as small space as possible)...??
>
> I did meant the intersection of the multisets represented by S1 &
> S2.

>> If by better you mean, generating the right output, I used this:
>
> By better I mean ...an algorithm that runs in linear time and uses as
> little space as possible ( should ofcourse produce right output)..
>>
>> s1:="AABCA"
>> s2:="AAABCBA"
>>
>> hist1:=hist2:=(0,)*256 #ie. histograms
>
> to further reduce the number of lists used to just one I would rather
> do ( suppose the length of the strings is at max n )
> then i will take an x to be the smallest power of 10 such that 10^x
> <n .
>>
>> foreach c in s1 do ++hist1[asc(c)] od
>
> foreach c in s1 do hist1[asc(c)]+=10^x;
>
>> foreach c in s2 do ++hist2[asc(c)] od
> the seocnd loop remains same (but on hist1)
>
> .....foreach c in s2 do ++hist1[asc(c)]
>
> now
> for i:A to Z
> s3+=chr(i)* ( hist1[i]/ 10^x min hist1[i]% 10^x)

In other words, you're just packing two counts into each histogram element?

I'm not sure it saves much space, as it means the count elements might have
to be twice as wide (so 32 bits instead of 16, or 64 instead of 32).

And space would only be an issue if the elements in each string could be
much larger than 8 bits: at 16-bits, the vectors needed are still small; at
32-bits, this method is no longer practical, the count vector(s) would
require several GBytes.

> so here i use only a single vector ......hope this works (does it ???)

It probably would, but I haven't tried it. (In practice, you would use a
shifted binary value, or a structure, instead of decimal shifting, but as I
said, it does't really save much. Unless perhaps in your implementation
language, arrays are in short supply)

--
Bartc

From: Ben Bacarisse on 30 Apr 2010 22:16

mohangupta13 <mohangupta13(a)gmail.com> writes:

> On 30 Apr, 14:54, "bart.c" <ba...(a)freeuk.com> wrote:
>> "mohangupta13" <mohangupt...(a)gmail.com> wrote in message
>> news:2964019b-3f82-4c28-833f-d3752f3eb75f(a)h37g2000pra.googlegroups.com...
>>
>> > Problem: Given two strings S1 & S2 , return the common elements in
>> > both the strings ??
>>
>> > One way I thought is to take a b-bit vector ( assuming the range of
>> > elements in the strings is 0-b) ...now in the first pass through the
>> > string S1 set all those bit positions that have an element equal to
>> > them . i.e set i'th bit if we an element with value i..
>> > Now go through S2 and and AND the i'th bit if we have an element equal
>> > to i. Now return the elements for which we have a bit set.
>>
>> > Now there are various problems with this 1) It assumes elements in the
>> > range 0-b .2) It just returns one instance of the common element .i.e
>> > if S1= AABCA and S2= AAABCBA o/p= ABC and not AAABC .
>>
>> > Can anyone please provide a better way to do this ( in possibly linear
>> > time and as small space as possible)...??
>
> I did meant the intersection of the multisets represented by S1 &
> S2.
<snip>
> so here i use only a single vector ......hope this works (does it ???)

You need only one vector of counts if you count up as you scan one string
and count down as you scan the second:

void intersect(char *result, const unsigned char *s1, const unsigned char *s2)
{
size_t count[UCHAR_MAX] = {0};
while (*s1)
++count[*s1++];
for (; *s2; ++s2)
if (count[*s2]) {
--count[*s2]
*result++ = *s2;
}
}

Of course, you don't get the result in sorted order but that was not
part of the original specification. (size_t is an unsigned type BTW).

--
Ben.

From: mohangupta13 on 3 May 2010 02:51

On Apr 30, 7:16 pm, Ben Bacarisse <ben.use...(a)bsb.me.uk> wrote:
> mohangupta13 <mohangupt...(a)gmail.com> writes:
> > On 30 Apr, 14:54, "bart.c" <ba...(a)freeuk.com> wrote:
> >> "mohangupta13" <mohangupt...(a)gmail.com> wrote in message
> >>news:2964019b-3f82-4c28-833f-d3752f3eb75f(a)h37g2000pra.googlegroups.com....
>
> >> > Problem: Given two strings S1 & S2 , return the common elements in
> >> > both the strings ??
>
> >> > One way I thought is to take a b-bit vector ( assuming the range of
> >> > elements in the strings is 0-b) ...now in the first pass through the
> >> > string S1 set all those bit positions that have an element equal to
> >> > them . i.e set i'th bit if we an element with value i..
> >> > Now go through S2 and and AND the i'th bit if we have an element equal
> >> > to i. Now return the elements for which we have a bit set.
>
> >> > Now there are various problems with this 1) It assumes elements in the
> >> > range 0-b .2) It just returns one instance of the common element .i.e
> >> > if S1= AABCA and S2= AAABCBA o/p= ABC and not AAABC .
>
> >> > Can anyone please provide a better way to do this ( in possibly linear
> >> > time and as small space as possible)...??
>
> > I did meant the intersection of the multisets represented by S1 &
> > S2.
> <snip>
> > so here i use only a single vector ......hope this works (does it ???)
>
> You need only one vector of counts if you count up as you scan one string
> and count down as you scan the second:
Oh I am so stupid ..i did think about this up and down scan but
didn't got that working ..

>
> void intersect(char *result, const unsigned char *s1, const unsigned char *s2)
> {
> size_t count[UCHAR_MAX] = {0};
> while (*s1)
> ++count[*s1++];
> for (; *s2; ++s2)
> if (count[*s2]) {
> --count[*s2]
> *result++ = *s2;
> }
>
> }
>
> Of course, you don't get the result in sorted order but that was not
> part of the original specification. (size_t is an unsigned type BTW).

Thanks Ben,

>
> --
> Ben.

This might be the fourth time I am asking it here (havn't got
satisfactory answers yet).

How can one prove (either mathematically or just by intuition ) that
the run time of an algorithm or the memory use of it is the best we
can do for the particular problem ??? ( I mean how can one say that a
particular algorithm is the best for the given problem )???

I know its not always possible to say such things "for some
problems"..but for major class of problems its know what is the best
possible like for a comparison sort the lower bound is well know (nlg
n).

Any reference to some resource will also do ... ( some one previously
referred me to " MIT lectures on Structure and Interpretation of
computer programs" ..i did that but still the doubt persists.)
Thanks
Mohan.

First | Prev | Next | Last
Pages: 1 2 3
Prev: Analysis of detecting loops in a singly linked list.
Next: "Why Scripting Is Evil" -- comments?