I'd like to write an open-source clustering system (for computation and
general use), with automated configuration/deployment, in Python. Its
main purpose is to be used in academic environments. It would be
something like running numpy/simpy code (and other custom Python code)
on a set of machines in a distributed fashion, e.g. splitting a job
into sub-tasks and running some sub-tasks on some machines and others
elsewhere.

The cluster could be used in at least two ways:
- submit code/files via a web interface, monitor the task via the web
interface and download the results from the master node (user<>web
interface<>master)
- run code directly on the cluster from another machine, as if it were
a local subprocess (a sketch of such a client API follows this list)
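
For the second mode, I imagine something like this on the client side
(the module name cluster_client and everything called on it are made
up to illustrate the idea, not an existing library):

    from cluster_client import connect  # hypothetical client module

    # Key-based auth, matching the requirements below.
    cluster = connect("master.example.org",
                      keyfile="~/.cluster/id_rsa")
    job = cluster.submit(script="simulate.py", files=["params.csv"])
    job.wait()                    # block until the job finishes
    job.download_results("out/")  # fetch results from the master node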


Requirements (so far):
- support the Ubuntu Linux distribution in the initial iteration
- be easy to extend to other OS-es and package managers
- support Python 2.5-2.6, and aim for 2.x/3.x dual compatibility
where possible
- document the changes required to make the 2.x-only code run on 3.x
- make it easy to submit code to the cluster directly from Python
scripts (with the right credentials)
- support key-based authentication for job submission
- should talk to at least one type of RDBMS to store various types of
data
- the cluster should be able to kill a task on a node automatically if
it runs for too long or uses too much memory (both configurable; a
sketch follows this list)
- be modular (one should be able to use the automation/configuration
part and the clustering part independently)
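
For the kill-on-limits requirement, here is a rough sketch of what a
node could do on POSIX systems (the limits are illustrative defaults,
not a proposed configuration format):

    import os, signal, subprocess, time

    try:
        import resource  # POSIX-only
    except ImportError:
        resource = None

    def run_limited(argv, wall_secs=3600,
                    mem_bytes=512 * 1024 * 1024):
        def set_limits():
            if resource is not None:
                # Cap the task's address space; allocations past
                # this limit will fail inside the task.
                resource.setrlimit(resource.RLIMIT_AS,
                                   (mem_bytes, mem_bytes))

        proc = subprocess.Popen(argv, preexec_fn=set_limits)
        deadline = time.time() + wall_secs
        while proc.poll() is None:
            if time.time() > deadline:
                # Wall-clock limit exceeded: kill the task.
                os.kill(proc.pid, signal.SIGKILL)
                proc.wait()
                break
            time.sleep(1)
        return proc.returncode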


Therefore, I'd like to know a few things:

Is there a clustering toolkit already available for Python?

What would the recommended architecture be?

How should the "user" code interface with the clustering system's
code?

How should the results be stored (at the node and master level)?
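
On storage: sqlite3 has been in the standard library since Python 2.5,
so each node could keep results locally in SQLite while the master
talks to a server RDBMS. A sketch, with an illustrative schema:

    import sqlite3

    conn = sqlite3.connect("cluster.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS results (
            task_id  TEXT NOT NULL,
            node     TEXT NOT NULL,
            status   TEXT NOT NULL,   -- e.g. 'ok', 'failed', 'killed'
            payload  BLOB,            -- pickled or raw result data
            finished TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )""")
    conn.execute(
        "INSERT INTO results (task_id, node, status) VALUES (?, ?, ?)",
        ("task-42", "node-1", "ok"))
    conn.commit()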

Should threading be supported in the tasks?

How should results be returned to the master node(s)? (polled by the
master, pushed by the nodes, etc.)

What libraries should be used for this? (e.g. Fabric as a library,
Pyro, etc.)
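
On Pyro specifically, its remote-call model seems to map nicely onto
master/node communication. A minimal Pyro4 sketch of a node exposing a
worker object (the Worker class and its method are placeholders):

    # On a node: expose a worker object over the network.
    import Pyro4

    @Pyro4.expose
    class Worker(object):
        def run_task(self, payload):
            return sum(payload)   # stand-in for real task execution

    daemon = Pyro4.Daemon()           # listens on a free port
    uri = daemon.register(Worker())   # PYRO uri for this object
    print("worker ready at %s" % uri)
    daemon.requestLoop()

    # On the master, the call looks local:
    #   worker = Pyro4.Proxy(uri)
    #   worker.run_task([1, 2, 3])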

Any other suggestions and pieces of advice?

Should Fabric be used in this clustering system for automation? If
not, what else? Would a simple Python wrapper around the 'ssh' binary
be OK?
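
If Fabric turns out to be too heavy, a thin wrapper around the ssh
binary is not much code. Something like this (the user name and the
BatchMode option are illustrative choices; it assumes key-based auth
is already set up):

    import subprocess

    def ssh_run(host, command, user="cluster"):
        # BatchMode makes ssh fail instead of prompting for a password.
        proc = subprocess.Popen(
            ["ssh", "-o", "BatchMode=yes",
             "%s@%s" % (user, host), command],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = proc.communicate()
        return proc.returncode, out, err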

Would the following architecture be OK?
Master: splits tasks into sub-tasks and sends them to nodes (provided
a node's load isn't above a certain threshold), collects and stores
results, and stores and provides configuration for the nodes.
Node: runs code, applies configuration, and submits results to the
master.
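
The load check on the master could start as simple as picking the
least-loaded node under a threshold, with nodes reporting their load
periodically (the data structures here are assumptions for
illustration):

    def pick_node(node_loads, max_load=0.8):
        # node_loads maps hostname -> last reported load fraction.
        candidates = [(load, host)
                      for host, load in node_loads.items()
                      if load < max_load]
        if not candidates:
            return None  # all nodes busy: queue the sub-task instead
        return min(candidates)[1]

    # e.g. pick_node({"node-1": 0.9, "node-2": 0.3}) -> "node-2"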

If this system actually gets Python-level code submission, how should
it work?
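
The simplest thing I can think of is shipping the source text and
executing it under an agreed entry point. A sketch (the main(args)
convention is an assumption, and this does no sandboxing; a real node
would run it inside the resource-limited subprocess sketched above):

    def run_submitted(source, args):
        # Compile and run user-submitted source in a fresh namespace.
        namespace = {"__name__": "__cluster_task__"}
        exec(compile(source, "<submitted>", "exec"), namespace)
        # By convention, the submitted code defines main(args).
        return namespace["main"](args)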

The reason I posted this set of questions and ideas is that I'd like
this to be as flexible and usable as possible.

Thanks.