public inbox for gentoo-cluster@lists.gentoo.org
 help / color / mirror / Atom feed
* [gentoo-cluster] cluster or distributed queue, general question
@ 2008-01-10 13:59 Jos Houtman
  2008-01-10 14:12 ` Panagiotis Christopoulos
  2008-01-10 14:30 ` Robin H. Johnson
  0 siblings, 2 replies; 5+ messages in thread
From: Jos Houtman @ 2008-01-10 13:59 UTC (permalink / raw
  To: gentoo-cluster

List,
 
For my master thesis I took up a project that requires mapping of a number of statically defined parallel jobs into a more dynamic environment that allows better scaling.  
The situation as described below let me to believe a cluster or distributed queue (DrQueue?) solution is necessary. For the situation see [situation] at the end of this email.
 
Because I am new in this field I ask for a bit of your time to help me get my bearings on current work in the field and good documentation. 
 
To be able to see if there are any suitable (or near enough) environments, I made a list of capabilities that this environment should have:
* Dynamic load balancing, either by process migration or stopping jobs and starting them somewhere else.
* Dynamic decision on the degree of parallelism, according to the dataset that needs to be processed (growing/shrinking).
* Failover of the jobs when node failure happens.
* The guarantee that a job runs only once in the cluster, even during node failure. 
* Limiting jobs to a class of nodes (subset of the total of nodes)
 
Do you know of any projects that have these capabilities? 
HPC clustering seems to come close but I don't know about the dynamic degree of parallelism, isn't that defined at job submission?
And when using process migration, do IP connections also migrate, in other words will database connections stay intact during process migration?
 
I also hope you have some favorite resources on the subject, especially on methods that can be used for these capabilities.
 
 
[situation]
Over the years we have created a small (about 40) number of jobs that support the main function of our business ( an online social community). Typical jobs include aggregation of data, Queue processing, automated email notifications, video/photo rendering. A common factor is the need for database connections for all these scripts. 
 
For scalability issues most of these jobs are parallelized, sometimes the dataset is partitioned, and processing is done in manageable chunks.  Each job basically run in a while(true) with a bit of sleep after a chunk is processed, so not to overwhelm the machine's when all data is processed.  Some jobs, though, cannot be split and therefore cannot run in parallel since this would cause data corruption.
 
To run these job we have about 10 nodes, configuration is done statically through a configuration file. The configuration defines how many instances there need to run, sometimes even where to run (crude load balancing).  Because of our growing volume of users there is a need to identify which job cannot keep up and adjust the configuration accordingly. This is a cumbersome job that has grown out of habit and introduces in efficient use of the resources (both human and machine alike).
 
With regards,

Jos Houtman
System administrator Hyves.nl
email: jos@hyves.nl


--
gentoo-cluster@lists.gentoo.org mailing list



^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-01-11 12:02 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-01-10 13:59 [gentoo-cluster] cluster or distributed queue, general question Jos Houtman
2008-01-10 14:12 ` Panagiotis Christopoulos
2008-01-10 14:30 ` Robin H. Johnson
2008-01-11 10:26   ` Jos Houtman
2008-01-11 12:02     ` Hanni Ali

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox