From: Chris Riesbeck on
M Stefan wrote:
> On Jun 22, 2:34 am, Brad <byte8b...(a)gmail.com> wrote:
>>
>> Basically, I'm comparing a randomly generated string to a vector of
>> strings by iterating through the vector (as seen above). Here is the
>> problem: The vector may have one or millions of strings and execution
>> time slows considerably when I go above a few thousand strings.
>>
>> Is there a more efficient way to approach this? ...
>>
>> Brad
>>
>
> Implementing a trie data structure will provide very efficient prefix
> and exact-match lookups once the strings have been inserted into the
> trie.
>
> http://en.wikipedia.org/wiki/Trie
> http://en.wikipedia.org/wiki/Patricia_trie

Let me second the trie approach. One programming exercise I give is to
create a trie for an English dictionary to speed up a Boggle word
finder. The speedup over linear search is quite impressive.
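
For concreteness, here is a minimal sketch of such a trie (assuming
lowercase ASCII words, and written in C++11 style for brevity; the node
layout and names are illustrative, not from any particular library):

#include <array>
#include <memory>
#include <string>

struct TrieNode
{
    std::array<std::unique_ptr<TrieNode>, 26> children{};  // one slot per letter
    bool is_word = false;
};

class Trie
{
public:
    void insert(const std::string& word)
    {
        TrieNode* node = &root_;
        for (char c : word)  // assumes c is in 'a'..'z'
        {
            auto& child = node->children[c - 'a'];
            if (!child)
                child.reset(new TrieNode());
            node = child.get();
        }
        node->is_word = true;
    }

    // Cost is O(length of the word), independent of how many words are
    // stored -- that is where the speedup over scanning comes from.
    bool contains(const std::string& word) const
    {
        const TrieNode* node = &root_;
        for (char c : word)
        {
            node = node->children[c - 'a'].get();
            if (!node)
                return false;  // fail fast at the first missing child
        }
        return node->is_word;
    }

private:
    TrieNode root_;
};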

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Brad on
On Jun 24, 8:44 am, Chris Riesbeck <Chris.Riesb...(a)gmail.com> wrote:

> Let me second the trie approach. One programming exercise I give is to
> create a trie for an English dictionary to speed up a Boggle word
> finder. The speedup over linear search is quite impressive.

I will test a trie as well. I just experimented with
boost::unordered_set, and that made a significant difference on large
sets, but it was a bit slower than a sorted vector with binary search
on small sets. Here are some numbers:

1 string: binary_search (28 seconds) unordered_set (36 seconds)

1,000,000 strings: binary_search (127 seconds) unordered_set (60 seconds)
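
For reference, the two lookups being compared would look roughly like
this (a sketch; the helper names and setup are mine, not from the
original test):

#include <algorithm>
#include <string>
#include <vector>
#include <boost/unordered_set.hpp>

// Requires the vector to have been sorted once up front
// with std::sort(v.begin(), v.end()).
bool in_sorted_vector(const std::vector<std::string>& sorted,
                      const std::string& s)
{
    return std::binary_search(sorted.begin(), sorted.end(), s);  // O(log n)
}

bool in_hash_set(const boost::unordered_set<std::string>& set,
                 const std::string& s)
{
    return set.find(s) != set.end();  // O(1) on average
}

The crossover matches what you would expect: binary search costs
O(log n) string comparisons per lookup, while the hash set does a
near-constant amount of work per lookup but pays a fixed hashing
overhead that dominates on small sets.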



From: dhruv on
On Jun 22, 4:34 am, Brad <byte8b...(a)gmail.com> wrote:
> I think there are better (more efficient) ways to do this, but I've
> only ever needed this simple approach:
>
> void compare(const std::vector<std::string>& strings)
> {
>     // Normally, this is randomly generated. It is a C-style string.
>     char the_string[4] = "abc";
>
>     std::vector<std::string>::const_iterator send(strings.end());
>
>     for (std::vector<std::string>::const_iterator sit = strings.begin();
>          sit != send; ++sit)
>     {
>         if (the_string == *sit)
>             std::clog << "MATCH:\t" << *sit << std::endl;
>     }
> }
>
> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and execution
> time slows considerably when I go above a few thousand strings.
>
> Is there a more efficient way to approach this? My code works fine on
> smaller vectors; I just never expected to handle millions, and it may
> not be possible to do that many as quickly as a few, but I wanted to
> ask. Perhaps there is something in <algorithm> that I could use. Just
> point me to it. I prefer standard C++ wherever possible.

This seems to be more of an algorithm-related question than a
C++-related one.

Many approaches have already been mentioned in this thread.

1. I, however, think you should hash every string and then do the
brute-force compare only against those strings that hash to the same
value and have the same length (see the sketch after this list).

2. The trie approach may not work well for strings longer than a
handful (10-15 bytes).

3. There is another data structure called a suffix tree which will
help you in this case if you can afford to build it for each of the
strings you want to compare with. A suffix array is an equivalent data
structure, but much easier to implement and bumps up the runtime
complexity by a factor of just log(n). You might want to check them
out as well.
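
Here is a rough sketch of approach (1), using std::hash to pick the
bucket and comparing only within it. The class and names are
illustrative, and this is essentially what a hash set such as
boost::unordered_set already does for you internally:

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class HashIndex
{
public:
    void insert(const std::string& s)
    {
        buckets_[std::hash<std::string>{}(s)].push_back(s);
    }

    bool contains(const std::string& s) const
    {
        auto it = buckets_.find(std::hash<std::string>{}(s));
        if (it == buckets_.end())
            return false;
        // Brute-force compare only the strings that share the hash
        // value; the length check is implicit in std::string's
        // operator==.
        for (const std::string& candidate : it->second)
            if (candidate == s)
                return true;
        return false;
    }

private:
    std::unordered_map<std::size_t, std::vector<std::string>> buckets_;
};

In practice a ready-made hash set (like the boost::unordered_set you
already tried) does exactly this for you; the sketch just makes the
mechanism explicit.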

Regards,
-Dhruv.
