From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from pigeon.gentoo.org ([69.77.167.62] helo=lists.gentoo.org)
	by finch.gentoo.org with esmtp (Exim 4.60)
	(envelope-from <gentoo-cluster+bounces-429-garchives=archives.gentoo.org@lists.gentoo.org>)
	id 1JCxwr-0001xE-T0
	for garchives@archives.gentoo.org; Thu, 10 Jan 2008 13:59:58 +0000
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id 8560CE0B46;
	Thu, 10 Jan 2008 13:59:56 +0000 (UTC)
Received: from exc03vs1.exchange.cysonet.com (exc03vs1.exchange.cysonet.com [85.158.200.86])
	by pigeon.gentoo.org (Postfix) with ESMTP id 42C84E0B46
	for <gentoo-cluster@lists.gentoo.org>; Thu, 10 Jan 2008 13:59:56 +0000 (UTC)
Received: from hyves1.exchange.cysonet.com ([85.158.200.92]) by exc03vs1.exchange.cysonet.com with Microsoft SMTPSVC(6.0.3790.3959);
	 Thu, 10 Jan 2008 14:59:55 +0100
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
Precedence: bulk
List-Post: <mailto:gentoo-cluster@lists.gentoo.org>
List-Help: <mailto:gentoo-cluster+help@lists.gentoo.org>
List-Unsubscribe: <mailto:gentoo-cluster+unsubscribe@lists.gentoo.org>
List-Subscribe: <mailto:gentoo-cluster+subscribe@lists.gentoo.org>
List-Id: Gentoo Linux mail <gentoo-cluster.gentoo.org>
X-BeenThere: gentoo-cluster@lists.gentoo.org
Reply-to: gentoo-cluster@lists.gentoo.org
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Subject: [gentoo-cluster] cluster or distributed queue, general question
Date: Thu, 10 Jan 2008 14:59:27 +0100
Message-ID: <AD5924F6A589EF419765D39834C6BAFE699FBA@hyves1.exchange.cysonet.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: cluster or distributed queue, general question
Thread-Index: AchTkQNOvsk+K/QbSNGTjuOtHdlWXA==
From: "Jos Houtman" <jos@hyves.nl>
To: <gentoo-cluster@lists.gentoo.org>
X-OriginalArrivalTime: 10 Jan 2008 13:59:55.0458 (UTC) FILETIME=[14372220:01C85391]
X-Archives-Salt: 2a524f37-f343-4d8d-ad00-e1a1d70c600d
X-Archives-Hash: 810c4124b6b27ac6cd15ea1578556f38

List,
=A0
For my master's thesis I took up a project that requires mapping of a =
number of statically defined parallel jobs into a more dynamic =
environment that allows better scaling. =20
The situation as described below led me to believe a cluster or =
distributed queue (DrQueue?) solution is necessary. For the situation =
see [situation] at the end of this email.
=A0
Because I am new to this field, I ask for a bit of your time to help me =
get my bearings on current work in the field and good documentation.=20
=A0
To be able to see if there are any suitable (or near enough) =
environments, I made a list of capabilities that this environment should =
have:
* Dynamic load balancing, either by process migration or stopping jobs =
and starting them somewhere else.
* Dynamic decision on the degree of parallelism, according to the =
dataset that needs to be processed (growing/shrinking).
* Failover of jobs when a node fails.
* The guarantee that a job runs only once in the cluster, even during =
node failure.=20
* Limiting jobs to a class of nodes (subset of the total of nodes)
=A0
Do you know of any projects that have these capabilities?=20
HPC clustering seems to come close, but I don't know about the dynamic =
degree of parallelism: isn't that defined at job submission?
And when using process migration, do IP connections also migrate? In =
other words, will database connections stay intact during process =
migration?
=A0
I also hope you have some favorite resources on the subject, especially =
on methods that can be used for these capabilities.
=A0
=A0
[situation]
Over the years we have created a small number (about 40) of jobs that =
support the main function of our business (an online social community). =
Typical jobs include aggregation of data, queue processing, automated =
email notifications, and video/photo rendering. A common factor is the =
need for database connections in all these scripts.=20
=A0
For scalability reasons most of these jobs are parallelized; sometimes =
the dataset is partitioned and processing is done in manageable chunks. =
Each job basically runs in a while(true) loop with a bit of sleep =
after each chunk is processed, so as not to overwhelm the machines =
once all data is processed. Some jobs, though, cannot be split and =
therefore cannot run in parallel, since this would cause data corruption.
=A0
To run these jobs we have about 10 nodes; configuration is done =
statically through a configuration file. The configuration defines how =
many instances need to run, and sometimes even where they run (crude =
load balancing).  Because of our growing volume of users there is a need =
to identify which jobs cannot keep up and adjust the configuration =
accordingly. This is a cumbersome task that has grown out of habit and =
leads to inefficient use of resources (human and machine alike).
=A0
With regards,

Jos Houtman
System administrator Hyves.nl
email: jos@hyves.nl


-- 
gentoo-cluster@lists.gentoo.org mailing list