From: Taliesin Nuin on

Hi,

I've been out of proper programming for a while (ended up doing mostly
database design work) and am attempting to put together a system to
model some complex networks (in fact, relationships between some
business entities). I need to find the best way to model these networks
which are essentially hundreds to thousands of nodes with a potentially
unlimited number of lines between any of them of different types (in
practice, likely between 0 and 20 between any given two nodes, and the
number of nodes connected too *typically* between 1 and 30).

Queries on this data will commonly take the form of picking a subject
node and tracing all connected nodes out to a chosen depth where the
connecting line is of a chosen type.

It's a little hard to describe this, but harder still to draw diagrams
in ASCII. ;) Hopefully the scenario is clear from the above.

I'm trying to find the best way of modelling this. The model will be
read much more often than it is written to, but be able to add and
remove nodes and edges in reasonable time nonetheless. And obviously it
needs to be persistent.

The only solution I am aware of are to model this is in a relational
database as an Adjaceny List. I have serious concerns about the
performance of this approach, but at least I understand it (well, I have
a copy of Joe Celko's Trees and Hierarchies in SQL next to me ;) and
it offers the persistence and interfaces that I need.

I have a vague awareness that there are such things as GraphML. I have a
concern that such methods can't deal with the data piece-meal however
(i.e. read part of the graph into memory from the file and work with it
without having to instantiate the entire monster) which I can do with
the database approach.

The lines will have attached information which also points toward the
database approach, but this has to scale and a relational database
doesn't seem to be suited to this task. I will be caching the results of
as many queries as possible but I obviously can't cache them all and
these will need to be rebuilt after each update also.

So that's the size of the task I've bitten off. I'm obviously not
looking for implementation details, but any suggestions people can make
about what technologies are appropriate or what is the right approach
would be hugely useful. I am aware of how abstract this question is, but
that's because unfortunately I may not know enough to ask the right
questions, yet.

For preference, technologies accessible from C++ or Python are ideal for
me, but I'll muddle through in whatever is necessary. I could code
something up myself in theory, but I am sure this is a a known problem
that better programmers have found the best answers to before me.

Thanks very much for any replies.

Taliesin.
From: Gene on
On Nov 13, 11:51 am, Taliesin Nuin <tn...(a)bath.ac.uk> wrote:
> Hi,
>
> I've been out of proper programming for a while (ended up doing mostly
> database design work) and am attempting to put together a system to
> model some complex networks (in fact, relationships between some
> business entities). I need to find the best way to model these networks
> which are essentially hundreds to thousands of nodes with a potentially
> unlimited number of lines between any of them of different types (in
> practice, likely between 0 and 20 between any given two nodes, and the
> number of nodes connected too *typically* between 1 and 30).
>
> Queries on this data will commonly take the form of picking a subject
> node and tracing all connected nodes out to a chosen depth where the
> connecting line is of a chosen type.
>
> It's a little hard to describe this, but harder still to draw diagrams
> in ASCII. ;) Hopefully the scenario is clear from the above.
>
> I'm trying to find the best way of modelling this. The model will be
> read much more often than it is written to, but be able to add and
> remove nodes and edges in reasonable time nonetheless. And obviously it
> needs to be persistent.
>
> The only solution I am aware of are to model this is in a relational
> database as an Adjaceny List. I have serious concerns about the
> performance of this approach, but at least I understand it (well, I have
>   a copy of Joe Celko's Trees and Hierarchies in SQL next to me ;) and
> it offers the persistence and interfaces that I need.
>
> I have a vague awareness that there are such things as GraphML. I have a
> concern that such methods can't deal with the data piece-meal however
> (i.e. read part of the graph into memory from the file and work with it
> without having to instantiate the entire monster) which I can do with
> the database approach.
>
> The lines will have attached information which also points toward the
> database approach, but this has to scale and a relational database
> doesn't seem to be suited to this task. I will be caching the results of
> as many queries as possible but I obviously can't cache them all and
> these will need to be rebuilt after each update also.
>
> So that's the size of the task I've bitten off. I'm obviously not
> looking for implementation details, but any suggestions people can make
> about what technologies are appropriate or what is the right approach
> would be hugely useful. I am aware of how abstract this question is, but
> that's because unfortunately I may not know enough to ask the right
> questions, yet.
>
> For preference, technologies accessible from C++ or Python are ideal for
> me, but I'll muddle through in whatever is necessary. I could code
> something up myself in theory, but I am sure this is a a known problem
> that better programmers have found the best answers to before me.

You didn't say if persistence needs to be continuous. If so, the
relational DB is about the best you will do. Associate an adjacency
list with each node (by indexed join) giving the list of edges
(represesented as node pairs).

You can get _much_ better speed if by persistence you mean that it's
possible to freeze the graph by writing it to a file and rebuild it
later. In this case the data structure is the same: nodes containing
adjacency lists of edges, which are pairs of pointers of nodes. It's
easy to write such a graph to disk in XML or some other convenient
format by using a depth-first-search algorithm to discover all the
nodes and edges exactly once. Use unique node and edge IDs to
facilitate. One convenient idea is to keep nodes and edges in arrays
and use the array index as the respective ID. Then reading the graph
back in is absolutely straightforward.