[gentoo-cluster] cluster or distributed queue, general question

* [gentoo-cluster] cluster or distributed queue, general question
@ 2008-01-10 13:59 Jos Houtman
  2008-01-10 14:12 ` Panagiotis Christopoulos
  2008-01-10 14:30 ` Robin H. Johnson
  0 siblings, 2 replies; 5+ messages in thread
From: Jos Houtman @ 2008-01-10 13:59 UTC
  To: gentoo-cluster

List,

For my master's thesis I took up a project that requires mapping a number of statically defined parallel jobs onto a more dynamic environment that allows better scaling.
The situation described below led me to believe that a cluster or distributed queue (DrQueue?) solution is necessary. For the situation, see [situation] at the end of this email.

Because I am new to this field, I am asking for a bit of your time to help me get my bearings on current work in the field and on good documentation.

To be able to see if there are any suitable (or near enough) environments, I made a list of capabilities that this environment should have:
* Dynamic load balancing, either by process migration or by stopping jobs and starting them somewhere else.
* Dynamic decision on the degree of parallelism, according to the size of the dataset that needs to be processed (growing/shrinking).
* Failover of jobs when a node fails.
* A guarantee that a job runs only once in the cluster, even during node failure.
* Limiting jobs to a class of nodes (a subset of all nodes).
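
To make the node-class and failover items concrete, here is a minimal Python sketch. It is not taken from any existing scheduler: the names Node, place and fail_over, and the class labels, are hypothetical, and the cluster state is kept in memory purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    node_class: str              # hypothetical label, e.g. "db" or "render"
    alive: bool = True
    jobs: list = field(default_factory=list)


def place(job, required_class, nodes):
    """Assign job to the least-loaded live node of the required class."""
    candidates = [n for n in nodes if n.alive and n.node_class == required_class]
    if not candidates:
        raise RuntimeError(f"no live node of class {required_class!r}")
    target = min(candidates, key=lambda n: len(n.jobs))
    target.jobs.append(job)
    return target


def fail_over(dead, nodes):
    """Mark a node dead and move its jobs to other nodes of the same class."""
    dead.alive = False
    orphaned, dead.jobs = dead.jobs, []
    return [place(job, dead.node_class, nodes) for job in orphaned]
```

Note that fail_over simply assumes the dead node has really stopped working. On its own it does not give the run-only-once guarantee from the list above.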

Do you know of any projects that have these capabilities? 
HPC clustering seems to come close, but I don't know about the dynamic degree of parallelism. Isn't that fixed at job submission?
And when process migration is used, do IP connections migrate as well? In other words, will database connections stay intact during process migration?

I also hope you have some favorite resources on the subject, especially on methods that can be used for these capabilities.

[situation]
Over the years we have created a small number of jobs (about 40) that support the main function of our business (an online social community). Typical jobs include data aggregation, queue processing, automated email notifications, and video/photo rendering. A common factor is that all these scripts need database connections.

For scalability reasons most of these jobs are parallelized; sometimes the dataset is partitioned and processing is done in manageable chunks.  Each job basically runs in a while(true) loop with a bit of sleep after each chunk is processed, so as not to overwhelm the machines once all data has been processed.  Some jobs, though, cannot be split and therefore cannot run in parallel, since this would cause data corruption.
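
For readers who want to picture the loop described above, a minimal Python sketch follows. The function name run_worker and its parameters (fetch_chunk, process, pause, should_stop) are made up for illustration; the real jobs are not shown in this thread and may differ.

```python
import time


def run_worker(fetch_chunk, process, pause=1.0,
               should_stop=lambda: False, sleep=time.sleep):
    """Fetch and process chunks until should_stop() returns True."""
    handled = 0
    while not should_stop():
        chunk = fetch_chunk()    # may be empty once all data is processed
        if chunk:
            process(chunk)
            handled += 1
        sleep(pause)             # pause after every pass so an empty queue does not spin the CPU
    return handled
```

The should_stop and sleep hooks are only there so the loop can be stopped and tested; with the defaults it behaves like the plain while(true) loop.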

To run these jobs we have about 10 nodes, configured statically through a configuration file. The configuration defines how many instances need to run, and sometimes even where they run (crude load balancing).  Because our user volume keeps growing, we regularly have to identify which jobs cannot keep up and adjust the configuration accordingly. This is a cumbersome task that has grown out of habit and leads to inefficient use of resources (both human and machine).
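
As an illustration of what replacing the static instance count could look like, here is a small Python sketch that derives the number of instances from the current backlog. The function desired_instances and all of its parameters are assumptions for the sake of the example, not part of our current setup.

```python
import math


def desired_instances(backlog, per_instance_rate, target_seconds,
                      max_instances, splittable=True):
    """Return how many instances to run so the backlog drains in target_seconds.

    backlog            -- number of items waiting to be processed
    per_instance_rate  -- items one instance handles per second
    """
    if not splittable:
        return 1                 # unsplittable jobs must never run in parallel
    if backlog <= 0:
        return 1                 # keep one instance polling for new work
    needed = math.ceil(backlog / (per_instance_rate * target_seconds))
    return max(1, min(needed, max_instances))
```

For example, a backlog of 1500 items at 10 items per second per instance and a 60 second target gives 3 instances. A real system would also need to measure the backlog and the rate, which is left out here.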

With regards,

Jos Houtman
System administrator Hyves.nl
email: jos@hyves.nl

--
gentoo-cluster@lists.gentoo.org mailing list


* Re: [gentoo-cluster] cluster or distributed queue, general question
  2008-01-10 13:59 [gentoo-cluster] cluster or distributed queue, general question Jos Houtman
@ 2008-01-10 14:12 ` Panagiotis Christopoulos
  2008-01-10 14:30 ` Robin H. Johnson
  1 sibling, 0 replies; 5+ messages in thread
From: Panagiotis Christopoulos @ 2008-01-10 14:12 UTC
  To: gentoo-cluster

On 14:59 Thu 10 Jan, Jos Houtman wrote:
> ...

It may be a better idea to join the Beowulf mailing list
(http://beowulf.org/) and ask there. It is a more general
list, but very active.

Panagiotis Christopoulos(pchrist)
Technological Educational Institute of Athens.

* Re: [gentoo-cluster] cluster or distributed queue, general question
  2008-01-10 13:59 [gentoo-cluster] cluster or distributed queue, general question Jos Houtman
  2008-01-10 14:12 ` Panagiotis Christopoulos
@ 2008-01-10 14:30 ` Robin H. Johnson
  2008-01-11 10:26   ` Jos Houtman
  1 sibling, 1 reply; 5+ messages in thread
From: Robin H. Johnson @ 2008-01-10 14:30 UTC
  To: gentoo-cluster

On Thu, Jan 10, 2008 at 02:59:27PM +0100, Jos Houtman wrote:
> For my master's thesis I took up a project that requires mapping a number of statically defined parallel jobs onto a more dynamic environment that allows better scaling.
> The situation described below led me to believe that a cluster or distributed queue (DrQueue?) solution is necessary. For the situation, see [situation] at the end of this email.
Off the top of my head, many of your requirements are covered by two
totally different apps:
- Gearman, written by Brad Fitzpatrick @ LiveJournal. It is mainly Perl,
  but I think there are interfaces for other languages as well.
- Torque/PBS - somewhat less of a fit; I'm not certain how well it
  handles perpetual jobs.

You may also need some degree of STONITH to guarantee that a job runs only
once when a node fails. (Say the job manager crashes: the job is still
running, but you have no control over it. You need to zap it hard.)
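
STONITH proper means powering off or resetting the failed node, which cannot be shown in a few lines. As a related but different technique, here is a minimal Python sketch of a time-limited lease that a job must hold in order to run. LeaseTable and its methods are hypothetical names, and the in-memory dictionary stands in for a shared store such as a database row.

```python
import time


class LeaseTable:
    """In-memory stand-in for a shared lease store (an assumption for this sketch)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._leases = {}        # job name -> (owner, expiry time)

    def acquire(self, job, owner, ttl):
        """Grant or renew the lease unless another owner holds an unexpired one."""
        now = self._clock()
        holder = self._leases.get(job)
        if holder and holder[0] != owner and holder[1] > now:
            return False
        self._leases[job] = (owner, now + ttl)
        return True

    def holds(self, job, owner):
        """True while owner's lease on job has not expired."""
        holder = self._leases.get(job)
        return bool(holder) and holder[0] == owner and holder[1] > self._clock()
```

A worker would renew its lease with acquire() and check holds() before each chunk, while a replacement may only start after the old lease has expired. A worker that stalls between the check and its write can still overlap with its successor, which is exactly the case where hard fencing is needed.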

-- 
Robin Hugh Johnson
Gentoo Linux Developer & Infra Guy
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* RE: [gentoo-cluster] cluster or distributed queue, general question
  2008-01-10 14:30 ` Robin H. Johnson
@ 2008-01-11 10:26   ` Jos Houtman
  2008-01-11 12:02     ` Hanni Ali
  0 siblings, 1 reply; 5+ messages in thread
From: Jos Houtman @ 2008-01-11 10:26 UTC
  To: gentoo-cluster

Thanks for the replies,

I will certainly try the Beowulf mailing list next week.
The suggestions are great. Gearman seems to fit, except for being Perl
oriented (at least that's what I have read so far).
Torque/PBS I don't know about yet.

Thanks again,

Jos Houtman
System administrator Hyves.nl
email: jos@hyves.nl


* Re: [gentoo-cluster] cluster or distributed queue, general question
  2008-01-11 10:26   ` Jos Houtman
@ 2008-01-11 12:02     ` Hanni Ali
  0 siblings, 0 replies; 5+ messages in thread
From: Hanni Ali @ 2008-01-11 12:02 UTC
  To: gentoo-cluster

You mentioned DrQueue; that batch processing app fulfills many of your
requirements. I also wrote an ebuild for it, which is on bugs.gentoo.

Hanni

-- 
E-mail: hanni.ali@gmail.com
Mobile: +44 (0) 7985580147
My Blog: http://ainkaboot.co.uk/blogs/hanni/
Website: http://ainkaboot.co.uk http://drqueue.org
