From: Fencer on
Hello, I have a large set of strings and I need to find all combinations
of two of those string and perform an algorithm on them (the algorithm
is actually an XQuery that runs over a large dataset so I won't post
that here). The combinations foo-bar and bar-foo are considered to be
equal so only one should be generated.

I wrote a static class method that takes an array of Strings and returns
all combinations in a Vector. The Vector stores a class Pair which I've
written myself too. The complete code is posted below with output from a
sample run. It seems to work, but how can it be improved and get rid of
the Pair class? Thanks!

package utility;

import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

public class MiscUtils {

public static void main(String[] args) {
Vector<Pair> combos = findCombinations(new String[]{"chebi",
"kegg.compound"});
System.out.println(combos.size());
System.out.println(combos);
}

public static Vector<Pair> findCombinations(String[] strings) {
Vector<Pair> combinations = new Vector<Pair>();
Set<String> alreadyPaired = new HashSet<String>();

for (int i = 0; i < strings.length; ++i) {
for (int j = 0; j < strings.length; ++j) {
if (i != j && !alreadyPaired.contains(strings[j])) {
combinations.add(new Pair(strings[i], strings[j]));
}
}

alreadyPaired.add(strings[i]);
}

return combinations;
}
}

class Pair {
public Pair(String s1, String s2) {
this.s1 = s1;
this.s2 = s2;
}

public String getFirst() {
return s1;
}

public String getSecond() {
return s2;
}

public String toString() {
return s1 + ":" + s2;
}

private String s1 = null;
private String s2 = null;
}

Sample run:
1
[chebi:kegg.compound]

- Fencer
From: Peter Duniho on
Fencer wrote:
> Hello, I have a large set of strings and I need to find all combinations
> of two of those string and perform an algorithm on them (the algorithm
> is actually an XQuery that runs over a large dataset so I won't post
> that here). The combinations foo-bar and bar-foo are considered to be
> equal so only one should be generated.

Do you mean you perform a single operation on the entire set of string
pairs? Or that you iterate through the set of string pairs, performing
some operation on each pair?

If the latter, then one of the biggest improvements you might make is to
not keep the pairs in a collection at all. Just feed them one pair at a
time to the algorithm as you generate them, discarding the pair once
it's been processed.

> I wrote a static class method that takes an array of Strings and returns
> all combinations in a Vector. The Vector stores a class Pair which I've
> written myself too. The complete code is posted below with output from a
> sample run. It seems to work, but how can it be improved and get rid of
> the Pair class? Thanks!

It seems to me that another significant improvement you could easily
make is to get rid of the Set<String> for excluding pairs you already
have. Instead, just don't generate the reversed pairs in the first place:

for (int i = 0; i < strings.length; i++)
{
for (int j = i + 1; j < strings.length; j++)
{
combinations.add(new Pair(strings[i], strings[j]));
}
}

The above does not account for the possibility of duplicate strings in
the input. In that case, you would still get a duplicated pair. That
can be resolved by removing the duplicates in the input before
generating the pairs.

Of course, if you have huge amounts of RAM, then you may find that
reducing the memory overhead is less important than reducing processing
overhead. In that case, you could deal with duplicates using a
Set<Pair> just as you had been before, but still at least avoiding
generating reversed pairs as above.

If you're guaranteed the input won't have duplicate strings, then of
course dealing with duplicates is not necessary.

There may be more "clever" improvements you could make. But I think
it's a good idea to deal with the more basic inefficiencies in the
algorithm first.

Pete
From: Fencer on
On 2010-07-29 20:07, Peter Duniho wrote:
>
> Do you mean you perform a single operation on the entire set of string
> pairs? Or that you iterate through the set of string pairs, performing
> some operation on each pair?

Thanks for your reply my Duniho. I should have been clearer: I iterate
through all my combinations and runs the all algorithm on each
combination. The algorithm is a complicated XQuery that runs over a
large data set. The number of combinations I have is not that large
actually, I just didn't want to do the pairing by hand. It's true that I
don't really need to keep the combinations around but I would actually
like to do that.
The input data could actually contain duplicates but that would be an
error on my part. I mostly want to get rid of the Pair class. :)

>
> If the latter, then one of the biggest improvements you might make is to
> not keep the pairs in a collection at all. Just feed them one pair at a
> time to the algorithm as you generate them, discarding the pair once
> it's been processed.
>
>> I wrote a static class method that takes an array of Strings and
>> returns all combinations in a Vector. The Vector stores a class Pair
>> which I've written myself too. The complete code is posted below with
>> output from a sample run. It seems to work, but how can it be improved
>> and get rid of the Pair class? Thanks!
>
> It seems to me that another significant improvement you could easily
> make is to get rid of the Set<String> for excluding pairs you already
> have. Instead, just don't generate the reversed pairs in the first place:
>
> for (int i = 0; i < strings.length; i++)
> {
> for (int j = i + 1; j < strings.length; j++)
> {
> combinations.add(new Pair(strings[i], strings[j]));
> }
> }
>
> The above does not account for the possibility of duplicate strings in
> the input. In that case, you would still get a duplicated pair. That can
> be resolved by removing the duplicates in the input before generating
> the pairs.
>
> Of course, if you have huge amounts of RAM, then you may find that
> reducing the memory overhead is less important than reducing processing
> overhead. In that case, you could deal with duplicates using a Set<Pair>
> just as you had been before, but still at least avoiding generating
> reversed pairs as above.
>
> If you're guaranteed the input won't have duplicate strings, then of
> course dealing with duplicates is not necessary.
>
> There may be more "clever" improvements you could make. But I think it's
> a good idea to deal with the more basic inefficiencies in the algorithm
> first.
>
> Pete

From: Fencer on
I rewrote it like this:
package utility;

import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class MiscUtils {

public static void main(String[] args) {
Map<String, String> combos =
findCombinations(new String[]{"chebi", "kegg.compound", "chebi"});
System.out.println(combos.size());
System.out.println(combos);
}

public static Map<String, String> findCombinations(String[] strings) {
Set<String> set = new HashSet<String>(Arrays.asList(strings));

if (set.size() < strings.length) {
System.err.println("Input array contained duplicates.");
strings = (String[])set.toArray();
}

Map<String, String> combos = new LinkedHashMap<String, String>();

for (int i = 0; i < strings.length; i++)
{
for (int j = i + 1; j < strings.length; j++)
{
combos.put(strings[i], strings[j]);
}
}

return combos;
}
}

but apparently I can't handle duplicates the way I try handle them,
because I get following runtime error:
Input array contained duplicates.
Exception in thread "main" java.lang.ClassCastException:
[Ljava.lang.Object; cannot be cast to [Ljava.lang.String;
at utility.MiscUtils.findCombinations(MiscUtils.java:23)
at utility.MiscUtils.main(MiscUtils.java:13)

This line strings = (String[])set.toArray(); is what the runtime system
is unhappy with.


- Fencer
From: Screamin Lord Byron on
On 29.07.2010 19:52, Fencer wrote:
> Hello, I have a large set of strings and I need to find all combinations
> of two of those string and perform an algorithm on them (the algorithm
> is actually an XQuery that runs over a large dataset so I won't post
> that here). The combinations foo-bar and bar-foo are considered to be
> equal so only one should be generated.
>
> I wrote a static class method that takes an array of Strings and returns
> all combinations in a Vector. The Vector stores a class Pair which I've
> written myself too. The complete code is posted below with output from a
> sample run. It seems to work, but how can it be improved and get rid of
> the Pair class? Thanks!

If you need a concept of pair, then you need some class that represents
it. Why would you want to get rid of it?

What is the reason you are using a Vector (synchronization)? Why not
implement your equals() and hashcode() methods of the Pair class, and
simply use the HashSet for storing pairs? Then you wouldn't have to
worry about duplicates at all.


> for (int i = 0; i < strings.length; ++i) {
> for (int j = 0; j < strings.length; ++j) {

for (int j = i+1;.....