Comparing Multiple Strings More Efficiently [C++]

Prev: New types defined in a return type
Next: Generating a derived class from a base class

From: Jens Auer on 22 Jun 2010 08:39

On 22 Jun., 01:34, Brad <byte8b...(a)gmail.com> wrote:
> I think there are better (more efficient) ways to do this, but I've
> only ever needed this simple approach:
>
> void compare(const std::vector<std::string>& strings)
> {
> // Normally, this is randomly generated. It is a C style
> string.
> char the_string[4] = "abc";
>
> std::vector<std::string>::const_iterator send(strings.end());
>
> for (std::vector<std::string>::const_iterator sit = strings.begin();
> sit != send; ++sit)
> {
> if (the_string == *sit)
> std::clog << "MATCH:\t" << *sit << std::endl;
> }
>
> }
>
> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and execution
> time slows considerably when I go above a few thousand strings.
>
> Is there a more efficient way to approach this? My code works fine on
> smaller vectors, I just never expected to do millions and it may not
> be possible to do that many as quickly as a few, but I wanted to ask.
> Perhaps there is something in algorithm that I could use. Just point
> me to it. I prefer standard C++ wherever possible.

You could use std::binary_search if you sort the vector before or
std::set to find string in logarithmic time (ignoring the cost for the
non-constant string comparison). If you use the new standard, you
could also employ a hash table which may be faster or slower.

Further, if we leave standard c++ and the string vector is a fixed set
of strings you could use a suffix tree to search for a string of
length m in O(m) time.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Ulrich Eckhardt on 22 Jun 2010 08:32

Brad wrote:
> std::vector<std::string>::const_iterator send(strings.end());
> for (std::vector<std::string>::const_iterator sit = strings.begin();
> sit != send; ++sit)
[...]

Just for the record, you can declare multiple objects in the header of a for
loop:

for(my_vector::const_iterator it=strings.begin(), end=strings.end();
it!=end;
++it)

> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and execution
> time slows considerably when I go above a few thousand strings.
>
> Is there a more efficient way to approach this? My code works fine on
> smaller vectors, I just never expected to do millions and it may not
> be possible to do that many as quickly as a few, but I wanted to ask.

Not enough info. If you don't care for the order of the strings, you could
use a multiset<string> instead of a vector, which would give you better
lookup times. You might be able to exploit other properties of the vector
and its content, too.

In any case, what you should do first is to create tests that tell you how
fast your code runs, so you can compare different optimisations and also
use a profiler to find out where you code spends most of its time.

Uli

--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: bf on 22 Jun 2010 08:33

On Jun 22, 1:34 am, Brad <byte8b...(a)gmail.com> wrote:

> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and execution
> time slows considerably when I go above a few thousand strings.

Given the constraints mentioned, I don't think you can do much better.
You can
shorten the code slightly as:

bool has_string(const std::vector<std::string> &haystack, const
std::string& needle)
{
return std::find(haystack.begin(), haystack.end(), needle) !=
haystack.end();
}

> Is there a more efficient way to approach this? My code works fine on
> smaller vectors, I just never expected to do millions and it may not
> be possible to do that many as quickly as a few, but I wanted to ask.
> Perhaps there is something in algorithm that I could use. Just point
> me to it. I prefer standard C++ wherever possible.

If you can change your vector of strings into a std::set (or
unordered_set) you'll be much better off (O(log2(n)) for set, and
ideally(O(1) for unordered_set, although its worst case is O(n))
compared to O(n) for your current implementation.

If you can't do that, but you can control how the vector is filled, to
make it always sorted, you can use std::binary_search() (using it the
same way as std::find() above. Also O(log2(n))).

Departing from the standard, a radix tree is likely to serve you well,
but then you'd need to write a lot of non-trivial code, so I'd avoid
that unless deemed necessary (or as an exercise.)
_
/Bjorn

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Mathias Gaunard on 22 Jun 2010 08:56

On Jun 22, 12:34 am, Brad <byte8b...(a)gmail.com> wrote:
> [...]
> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and execution
> time slows considerably when I go above a few thousand strings.
>
> Is there a more efficient way to approach this?

Yes.
Searching for an entry in a list (or even better, set) can be done in
logarithmic time by sorting the list and doing a binary search, or in
average constant-time with hashing techniques.

You may want to take a basic algorithmics class.

Note this is not a problem tied to C++. C++ does, however, provide
everything you might need already, so you don't have to reimplement
anything.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Mathias Gaunard on 22 Jun 2010 09:09

On Jun 22, 1:28 pm, George Neuner <gneun...(a)comcast.net> wrote:

> RegEx? A dedicated string search algorithm such as Boyer-Moore or
> Knuth-Morris-Pratt or Rabin-Karp?

Those are methods to find a substring within a string.
What he's looking for is a particular string among a collection of
strings.

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

First | Prev | Next | Last
Pages: 1 2 3 4
Prev: New types defined in a return type
Next: Generating a derived class from a base class