From: "J. Roeleveld"
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] Re: Recommendations for scheduler
Date: Mon, 04 Aug 2014 12:40:12 +0200
Message-ID: <4871526.Mj2HT7lMQH@andromeda>

On Monday, August 04, 2014 10:11:41 AM Martin Vaeth wrote:
> J. Roeleveld wrote:
> > These schedules then also can't be restarted from the beginning
> > when they stop halfway through without risking massive consistency
> > problems in the final data.
>
> So you have a command which might break due to hardware error
> and cannot be rerun. I cannot see how any general-purpose scheduler
> might help you here: You either need to be able to split your command
> into several (sequential) commands or you need something adapted
> for your particular command.

A general-purpose scheduler can work; such schedulers do exist, albeit
with a price tag. In the OSS world there is, to my knowledge, none.
Yours seems to be the most promising, as it looks like the missing
features shouldn't be too difficult to add.

The commands are relatively simple, but they deal with large amounts of
data. I am talking about ETL processes that, due to the amount of data
being processed, can easily take several hours per step. If, during one
of these steps, the database or the ETL process crashes, the work done
by the ETL process needs to be rolled back to a point from which it can
be restarted. I am not talking about simple schedules related to the
day-to-day maintenance of a few servers.
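To make that restart requirement concrete, here is a minimal sketch of
the step-wise checkpointing I have in mind. The step names, the shell
commands and the state-file path are all made up for illustration; the
point is only that completed steps are recorded atomically, so a rerun
skips them and resumes at the step that failed:

import json
import os
import subprocess
import sys

STATE = "/var/lib/etl/pipeline.state"  # hypothetical state file

# Ordered ETL steps; names and commands are invented for this example.
STEPS = [
    ("extract", ["./extract.sh"]),
    ("transform", ["./transform.sh"]),
    ("load", ["./load.sh"]),
]

def load_done():
    # Read the set of steps completed by an earlier run, if any.
    if os.path.exists(STATE):
        with open(STATE) as f:
            return set(json.load(f))
    return set()

def save_done(done):
    # Write to a temp file and rename: the state file is never
    # half-written, even if we crash mid-write.
    tmp = STATE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, STATE)

done = load_done()
for name, cmd in STEPS:
    if name in done:
        continue  # finished in an earlier run: skip on restart
    if subprocess.run(cmd).returncode != 0:
        sys.exit("step '%s' failed; roll back, then rerun to resume here" % name)
    done.add(name)
    save_done(done)

Rerunning the script after a crash (and after rolling the database back
to the last recorded step) then picks up exactly where it left off.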
> > And then multiple of those starting at random times with
> > occasionally a whole bunch of the same schedule put into the
> > queue with dependencies to the previous run.
>
> That's not a problem. Only if the granularity of one command is
> not fine enough, it becomes a problem.

If nothing goes wrong, it can all be stuck into a single script and the
end result will be the same. The problems start because the real world
is not 100% reliable.

> > If, during that time, one of the machines has a hardware failure
> > or the scheduling process crashes on one or more of the servers,
> > the last state needs to be recoverable.
>
> One must distinguish two cases:
>
> 1. The machine running "schedule-server" has a hardware failure.
> (Let us assume that "schedule-server" does not have a software failure -
> otherwise, you have problems anyway.)
> 2. Some other machine has a hardware failure.
>
> Case 2. is not bad (as concerns the scheduling): Of course, the
> machine will not report that it completed the job, and you will
> have to think how to complete the job. But it is clear that in
> such exceptional cases you have to interfere manually in some sense.

Agreed; this happens more often than you might think.

> In order to deal with case 1., you can regularly (e.g. each minute)
> dump the output of "schedule list" (possibly suppressing non-important
> data through the options to keep it short).

Or all the necessary information is kept in sync on persistent storage.
That would also allow easy fail-over if the master schedule node fails:
a second machine could quickly take over. (A rough sketch of what I
mean is at the end of this mail.)

> One could add a logging option to decrease the possible race of 1 minute,
> but in case of hardware failure a possible race cannot be excluded anyway.
>
> In case 1. you manually have to re-queue the jobs and think what to do
> with the already started jobs. However, I cannot imagine that this
> occurs so frequently that this exceptional case becomes something
> one should seriously think about.

As I mentioned above, with BI infrastructure (large databases, complex
ETL processes, interactive report services, ...), the scheduler is busy
24/7. The number of tasks, schedules, dependencies, states, ... that
needs to be kept track of can easily lead to unforeseen issues and bugs.
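The promised sketch of the fail-over idea: a loop that dumps the
scheduler state to shared storage once a minute. Only the "schedule
list" command is taken from your description; the shared mount at
/mnt/shared, the file names and the interval handling are assumptions
for illustration:

import os
import subprocess
import time

DUMP = "/mnt/shared/schedule-state.txt"  # assumed shared-storage path

while True:
    # Capture the current scheduler state; the options for trimming
    # non-important data are omitted here.
    state = subprocess.run(["schedule", "list"],
                           capture_output=True, text=True).stdout
    # Write-then-rename so a standby node never reads a partial dump.
    tmp = DUMP + ".tmp"
    with open(tmp, "w") as f:
        f.write(state)
    os.replace(tmp, DUMP)
    time.sleep(60)  # once a minute, as you suggested

A standby node watching the same file could then re-queue from the last
dump if the master disappears, instead of losing the whole state.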