Comparing Multiple Strings More Efficiently [C++]

Prev: New types defined in a return type
Next: Generating a derived class from a base class

From: Brad on 21 Jun 2010 08:34

I think there are better (more efficient) ways to do this, but I've
only ever needed this simple approach:

void compare(const std::vector<std::string>& strings)
{
// Normally, this is randomly generated. It is a C style
string.
char the_string[4] = "abc";

std::vector<std::string>::const_iterator send(strings.end());

for (std::vector<std::string>::const_iterator sit = strings.begin();
sit != send; ++sit)
{
if (the_string == *sit)
std::clog << "MATCH:\t" << *sit << std::endl;
}
}

Basically, I'm comparing a randomly generated string to a vector of
strings by iterating through the vector (as seen above). Here is the
problem: The vector may have one or millions of strings and execution
time slows considerably when I go above a few thousand strings.

Is there a more efficient way to approach this? My code works fine on
smaller vectors, I just never expected to do millions and it may not
be possible to do that many as quickly as a few, but I wanted to ask.
Perhaps there is something in algorithm that I could use. Just point
me to it. I prefer standard C++ wherever possible.

Thanks,

Brad

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Seungbeom Kim on 21 Jun 2010 21:27

On 2010-06-21 16:34, Brad wrote:
> I think there are better (more efficient) ways to do this, but I've
> only ever needed this simple approach:
>
> void compare(const std::vector<std::string>& strings)
> {
> // Normally, this is randomly generated. It is a C style string.
> char the_string[4] = "abc";
>
> std::vector<std::string>::const_iterator send(strings.end());
>
> for (std::vector<std::string>::const_iterator sit = strings.begin(); sit != send; ++sit)
> {
> if (the_string == *sit)
> std::clog<< "MATCH:\t"<< *sit<< std::endl;
> }
> }
>
> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and execution
> time slows considerably when I go above a few thousand strings.
>
> Is there a more efficient way to approach this? My code works fine on
> smaller vectors, I just never expected to do millions and it may not
> be possible to do that many as quickly as a few, but I wanted to ask.
> Perhaps there is something in algorithm that I could use. Just point
> me to it. I prefer standard C++ wherever possible.

As it is given, the function above prints the (same) target string as
many times as it is found in the given vector. That task cannot be done
faster than a running time of O(strings.size() * strlen(the_string)),
and I don't see anything in the given function that makes the running
time worse than that.

Some implementations may be less than optimal; e.g. the library that
comes with g++-4.4 computes strlen(the_string) each time operator==
is called, so you could make the function faster by constructing an
std::string object before the loop and using it for comparison, but I
don't think the difference would be very noticeable unless the_string
is quite long. And that just reduces the constant in O(strings.size()),
still leaving the algorithm as a linear-time one.

Depending on your needs, if you need to do the same thing many times,
you could store the strings in an std::multiset<std::string> instead of
an std::vector<std::string>, and a call to std::multiset<std::string>
..count(the_string) should give you the count in a logarithmic time.

Anyway, the task the function above is doing is quite dumb, and I
suspect you're actually doing something much more interesting than that,
but too much detail has been left out.

--
Seungbeom Kim

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: George Neuner on 21 Jun 2010 21:28

On Mon, 21 Jun 2010 17:34:18 CST, Brad <byte8bits(a)gmail.com> wrote:

>I think there are better (more efficient) ways to do this, but I've
>only ever needed this simple approach:
>
>void compare(const std::vector<std::string>& strings)
>{
> // Normally, this is randomly generated. It is a C style
>string.
> char the_string[4] = "abc";
>
> std::vector<std::string>::const_iterator send(strings.end());
>
> for (std::vector<std::string>::const_iterator sit = strings.begin();
>sit != send; ++sit)
> {
> if (the_string == *sit)
> std::clog << "MATCH:\t" << *sit << std::endl;
> }
>}
>
>Basically, I'm comparing a randomly generated string ...

What use is a random string?

> ... to a vector of
>strings by iterating through the vector (as seen above). Here is the
>problem: The vector may have one or millions of strings and execution
>time slows considerably when I go above a few thousand strings.
>
>Is there a more efficient way to approach this?

RegEx? A dedicated string search algorithm such as Boyer-Moore or
Knuth-Morris-Pratt or Rabin-Karp?

>My code works fine on smaller vectors, I just never expected to do
>millions and it may not be possible to do that many as quickly as
>a few, but I wanted to ask.

You can't possibly check a long stream as quickly as a short one.

All the different search algorithms require preprocessing of the
search string ... what is best to do depends on buffering, on how many
distinct patterns you need to search for, how often the search set is
expected to change and how many times you expect to reuse existing
search sets.

Boyer-Moore is the fastest method. BM searches forward but matches
the pattern in reverse order, so it requires the ability to "back up"
in the input stream. That's a simple buffering issue, but it slightly
complicates use vs other methods that match forward. The running time
of BM is, on average, sublinear because BM typically does not need to
examine every input character (it does so only in the worst case).

Knuth-Morris-Pratt matches forward, so it's useful in limited memory
situations. KMP examines every input character once.

RegEx matches forward and can search simultaneously for multiple
strings. RegEx examines every input character once.

Rabin-Karp matches forward and can search simultaneously for multiple
strings. RK examines every potential match character (not necessarily
every character). BM computes a hash over each potential match
target, so it is best used when the input is clearly delimited. You
can control the granularity of matches to tune performance (e.g., for
a text document you could hash entire sentences or you could hash by
word and string them together to match sentence fragments).

There are other algorithms but these are the most well known.

>Perhaps there is something in algorithm that I could use. Just point
>me to it. I prefer standard C++ wherever possible.

Current compilers should have RegEx either in std or in tr1.
Alternatively, Boost has RegEx.

The dedicated search algorithms you will have to code yourself (or
find them already coded somewhere).

George

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Daniel Krügler on 22 Jun 2010 08:29

On 22 Jun., 01:34, Brad <byte8b...(a)gmail.com> wrote:
> I think there are better (more efficient) ways to do this, but I've
> only ever needed this simple approach:
>
> void compare(const std::vector<std::string>& strings)
> {
> // Normally, this is randomly generated. It is a C style
> string.
> char the_string[4] = "abc";
>
> std::vector<std::string>::const_iterator send(strings.end());
>
> for (std::vector<std::string>::const_iterator sit = strings.begin();
> sit != send; ++sit)
> {
> if (the_string == *sit)
> std::clog << "MATCH:\t" << *sit << std::endl;
> }
> }
>
> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and execution
> time slows considerably when I go above a few thousand strings.
>
> Is there a more efficient way to approach this? My code works fine on
> smaller vectors, I just never expected to do millions and it may not
> be possible to do that many as quickly as a few, but I wanted to ask.
> Perhaps there is something in algorithm that I could use. Just point
> me to it. I prefer standard C++ wherever possible.

It depends. If you can ensure that the input-vector is sorted it
could improve the performance drastically by instead using
a binary search algorithm of the random string within the
sequence (e.g. lower_bound). Take care that the predicate
used for the sorting is the same (at least equivalent) to that
used in the binary search.

The advantage of this approach is especially dominant, if
you don't need to sort the container each time.

HTH & Greetings from Bremen,

Daniel Kr�gler

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

From: Nick Hounsome on 22 Jun 2010 08:28

On 22 June, 00:34, Brad <byte8b...(a)gmail.com> wrote:
> I think there are better (more efficient) ways to do this, but I've
> only ever needed this simple approach:
>
> void compare(const std::vector<std::string>& strings)
> {
> // Normally, this is randomly generated. It is a C style
> string.
> char the_string[4] = "abc";
>
> std::vector<std::string>::const_iterator send(strings.end());
>
> for (std::vector<std::string>::const_iterator sit = strings.begin();
> sit != send; ++sit)
> {
> if (the_string == *sit)
> std::clog << "MATCH:\t" << *sit << std::endl;
> }
>
> }
>
> Basically, I'm comparing a randomly generated string to a vector of
> strings by iterating through the vector (as seen above). Here is the
> problem: The vector may have one or millions of strings and execution
> time slows considerably when I go above a few thousand strings.
>
> Is there a more efficient way to approach this? My code works fine on
> smaller vectors, I just never expected to do millions and it may not
> be possible to do that many as quickly as a few, but I wanted to ask.
> Perhaps there is something in algorithm that I could use. Just point
> me to it. I prefer standard C++ wherever possible.

If you are comparing random strings against the same vector multiple
times then the way to go is hashing.
If the vector is replaced by a hashed set (new std::unordered_set)
then the lookup will be much faster in general.

Even plain old std::set will be faster (find must be logarithmic or
better).

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

| Next | Last
Pages: 1 2 3 4
Prev: New types defined in a return type
Next: Generating a derived class from a base class