Subject: [gentoo-cluster] cluster or distributed queue, general question
Date: Thu, 10 Jan 2008 14:59:27 +0100
From: "Jos Houtman"
Reply-to: gentoo-cluster@lists.gentoo.org

List,

For my master's thesis I took up a project that requires mapping a number of statically defined parallel jobs onto a more dynamic environment that allows better scaling.

The situation described below led me to believe that a cluster or distributed queue (DrQueue?) solution is necessary.
For the situation, see [situation] at the end of this email.

Because I am new to this field, I ask for a bit of your time to help me get my bearings on current work in the field and on good documentation.

To see whether there are any suitable (or near enough) environments, I made a list of capabilities that this environment should have:

* Dynamic load balancing, either by process migration or by stopping jobs and starting them somewhere else.
* Dynamic decision on the degree of parallelism, according to the dataset that needs to be processed (growing/shrinking).
* Failover of jobs when a node fails.
* The guarantee that a job runs only once in the cluster, even during node failure.
* Limiting jobs to a class of nodes (a subset of all nodes).

Do you know of any projects that have these capabilities?
HPC clustering seems to come close, but I don't know about the dynamic degree of parallelism; isn't that defined at job submission?
And when using process migration, do IP connections also migrate? In other words, will database connections stay intact during process migration?

I also hope you have some favorite resources on the subject, especially on methods that can be used for these capabilities.

[situation]
Over the years we have created a small number (about 40) of jobs that support the main function of our business (an online social community). Typical jobs include aggregation of data, queue processing, automated email notifications, and video/photo rendering. A common factor is that all these scripts need database connections.

For scalability reasons most of these jobs are parallelized; sometimes the dataset is partitioned and processing is done in manageable chunks. Each job basically runs in a while(true) loop with a bit of sleep after each chunk is processed, so as not to overwhelm the machines once all data has been processed.
Some jobs, though, cannot be split and therefore cannot run in parallel, since this would cause data corruption.

To run these jobs we have about 10 nodes; configuration is done statically through a configuration file. The configuration defines how many instances need to run, and sometimes even where they run (crude load balancing). Because of our growing volume of users, we need to identify which jobs cannot keep up and adjust the configuration accordingly. This is a cumbersome task that has grown out of habit and leads to inefficient use of resources (both human and machine).

With regards,

Jos Houtman
System administrator
Hyves.nl
email: jos@hyves.nl
--
gentoo-cluster@lists.gentoo.org mailing list