[gentoo-user] Recommendations for scheduler


* [gentoo-user] Recommendations for scheduler
@ 2014-08-01 17:32 Alan McKinnon
  2014-08-01 17:49 ` Сергей
                   ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: Alan McKinnon @ 2014-08-01 17:32 UTC (permalink / raw
  To: gentoo-user

Hi,

Up-front disclaimer: Mostly [OT] post. But at least I'll test drive it
on Gentoo before putting it in production :-)

New job, new environment. Existing persons suffer from
5-year-old-with-a-hammer syndrome and assume cron is the solution to all
ills. Result: a towering edifice of cron jobs that may or may not
clobber each other's work, may or may not work at all, and implement no
error handling at all. But my god, can they spew out mail from STDOUT.

But cron has only one event trigger: wall-clock time. And it's a very
blunt weapon. I'm looking for recommendations of alternative schedulers
that satisfy real-world business needs that need some other event
trigger than what the time is right now.

For those familiar with it, I'm looking for something with the useful
feature set, without the useless features and without the price tag of
ControlM

Anyone care to share experiences?

-- 
Alan McKinnon
alan.mckinnon@gmail.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Recommendations for scheduler
  2014-08-01 17:32 [gentoo-user] Recommendations for scheduler Alan McKinnon
@ 2014-08-01 17:49 ` Сергей
  2014-08-01 17:50   ` Сергей
  2014-08-01 18:17 ` [gentoo-user] " James
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 52+ messages in thread
From: Сергей @ 2014-08-01 17:49 UTC (permalink / raw
  To: gentoo-user

For example in crontab */3 means "every three hours/minutes/etc".
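
To make the step syntax concrete, here is a small Python sketch that expands one crontab field into the values it matches. It is not part of cron, and the helper name expand_field is made up for illustration. It assumes the usual Vixie-cron reading, where */3 means every third value counted from the low end of the field's range, so in the hour field it fires at 0, 3, 6 and so on rather than "three hours after the last run".

```python
def expand_field(field, low, high):
    """Expand one crontab field (e.g. '*/3', '1-10/2', '5') into matching values.

    low/high are the bounds of the field: 0-59 for minutes, 0-23 for hours.
    Month and weekday names are not handled, and a lone number with a step
    is treated as just that number.
    """
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if step < 1:
            raise ValueError("step must be a positive integer")
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = end = int(rng)
        if not (low <= start <= end <= high):
            raise ValueError(f"{part!r} is outside {low}-{high}")
        values.update(range(start, end + 1, step))
    return sorted(values)


print(expand_field("*/3", 0, 23))   # hour field: [0, 3, 6, 9, 12, 15, 18, 21]
print(expand_field("*/20", 0, 59))  # minute field: [0, 20, 40]
```

Either way, the trigger is still the wall clock.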




* Re: [gentoo-user] Recommendations for scheduler
  2014-08-01 17:49 ` Сергей
@ 2014-08-01 17:50   ` Сергей
  2014-08-01 19:10     ` Alan McKinnon
  0 siblings, 1 reply; 52+ messages in thread
From: Сергей @ 2014-08-01 17:50 UTC (permalink / raw
  To: gentoo-user

Also you can have a look at anacron.



* [gentoo-user] Re: Recommendations for scheduler
  2014-08-01 17:32 [gentoo-user] Recommendations for scheduler Alan McKinnon
  2014-08-01 17:49 ` Сергей
@ 2014-08-01 18:17 ` James
  2014-08-01 19:19   ` Alan McKinnon
  2014-08-01 21:17   ` J. Roeleveld
  2014-08-01 21:02 ` Martin Vaeth
  2014-08-01 21:13 ` [gentoo-user] " J. Roeleveld
  3 siblings, 2 replies; 52+ messages in thread
From: James @ 2014-08-01 18:17 UTC (permalink / raw
  To: gentoo-user

Alan McKinnon <alan.mckinnon <at> gmail.com> writes:


> New job, new environment. Existing persons suffer from
> 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
> ills. Result: a towering edifice of cron jobs that may or may not
> clobber each other's work, may or may not work at all, and implement no
> error handling at all. But my god, can they spew out mail from STOUT

Sounds like a department full of computer scientists I inherited a few
decades ago...........

I know nothing about chronos, but I find it an interesting read....ymmv.


http://nerds.airbnb.com/introducing-chronos/
http://airbnb.github.io/chronos/
https://github.com/airbnb/chronos


cheers mate!

James






* Re: [gentoo-user] Recommendations for scheduler
  2014-08-01 17:50   ` Сергей
@ 2014-08-01 19:10     ` Alan McKinnon
  2014-08-03  9:27       ` Bruce Schultz
  0 siblings, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-01 19:10 UTC (permalink / raw
  To: gentoo-user

On 01/08/2014 19:50, Сергей wrote:
> Also you can have a look at anacron.
> 
> 
> 

Unfortunately, anacron doesn't suit my needs at all. Here's how anacron
works:

This bunch of jobs will all happen today, regardless of what time it is.
That's not what I need; I need something that has very little to do with
time. Example:

1. Start backup job on db server A
2. When complete, copy backup to server B and do a test import
3. If import succeeds, move backup to permanent storage and log the fact
4. If import fails, raise an alert and trigger the whole cycle to start
again at 1
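
As a rough illustration of the four steps above, here is a minimal Python sketch of the cycle as an event-driven sequence rather than a set of timed jobs. Everything in it is hypothetical: the function name, the four callables, and especially max_attempts, which is an added assumption so that a permanently broken import cannot loop forever.

```python
def run_backup_cycle(backup, test_import, archive, alert, max_attempts=3):
    """Run backup -> test import -> archive, restarting from the backup on failure.

    Every argument except max_attempts is a callable supplied by the caller:
      backup()           -> path of the new dump on server A
      test_import(path)  -> True if the trial import on server B succeeded
      archive(path)      -> move the dump to permanent storage and log it
      alert(message)     -> notify a human
    Returns the number of attempts used, or raises after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        path = backup()                      # step 1
        if test_import(path):                # step 2
            archive(path)                    # step 3
            return attempt
        alert(f"test import of {path} failed (attempt {attempt})")  # step 4
    raise RuntimeError(f"backup cycle failed {max_attempts} times")


if __name__ == "__main__":
    outcomes = iter([False, True])           # first import fails, second succeeds
    log = []
    used = run_backup_cycle(
        backup=lambda: "/tmp/db.dump",
        test_import=lambda path: next(outcomes),
        archive=lambda path: log.append(("archived", path)),
        alert=lambda msg: log.append(("alert", msg)),
    )
    print(used, log)
```

Nothing in the sketch looks at the clock; each step starts because the previous one finished.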

Meanwhile,

1. All servers are regularly doing apt-get update and downloading .debs,
and applying security packages. Delay this on the db server if a backup
is in progress.
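
The "delay this if a backup is in progress" rule can be pictured as a conflict check made by whatever coordinates the jobs, instead of a flag file each script has to remember to test. A minimal in-memory Python sketch follows; the Job and Coordinator names and the blocked_by field are invented for illustration and do not come from any real scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    host: str
    # Names of jobs that must not be running on the same host when this one starts.
    blocked_by: frozenset = frozenset()


class Coordinator:
    def __init__(self):
        self.running = set()  # (host, job name) pairs currently in progress

    def try_start(self, job):
        """Start the job unless a conflicting job is running on its host."""
        if any((job.host, other) in self.running for other in job.blocked_by):
            return False      # caller should retry later
        self.running.add((job.host, job.name))
        return True

    def finish(self, job):
        self.running.discard((job.host, job.name))


coord = Coordinator()
backup = Job("backup", "db1")
update_db = Job("apt-update", "db1", frozenset({"backup"}))
update_web = Job("apt-update", "web1", frozenset({"backup"}))

coord.try_start(backup)
print(coord.try_start(update_db))   # False: backup is running on db1
print(coord.try_start(update_web))  # True: no backup on web1
coord.finish(backup)
print(coord.try_start(update_db))   # True: backup has finished
```

A real tool would also need to persist this state and retry deferred jobs, which the sketch leaves out.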

Meanwhile there is the regular Friday 5am code-publish cycle and
month-end finance runs - this is a DevOps environment.

Yes, I know I can hack something together with bash scripts and cron
with a truly insane number of flag files. But this doesn't work for sane
definitions of work involving other people. I can't expect my support
crew to read bash scripts they found from crontabs and figure out what
they mean. They need a picture that shows what will happen when and what
the environment looks like.

So basically I need something to replace bash and cron the same way
puppet replaces scp and for loops.

-- 
Alan McKinnon
alan.mckinnon@gmail.com


* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-01 18:17 ` [gentoo-user] " James
@ 2014-08-01 19:19   ` Alan McKinnon
  2014-08-01 19:35     ` covici
  2014-08-01 21:17   ` J. Roeleveld
  1 sibling, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-01 19:19 UTC (permalink / raw
  To: gentoo-user

On 01/08/2014 20:17, James wrote:
> Alan McKinnon <alan.mckinnon <at> gmail.com> writes:
> 
> 
>> New job, new environment. Existing persons suffer from
>> 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
>> ills. Result: a towering edifice of cron jobs that may or may not
>> clobber each other's work, may or may not work at all, and implement no
>> error handling at all. But my god, can they spew out mail from STOUT
> 
> Sounds like a department full of computer scientist I inherited a few
> decades ago...........

I've met folks like that....
Brilliant in their chosen field but completely useless outside it? The
kind of fellows who see nothing wrong with eating a barbeque'd steak
with a spoon because they can get a result?

> 
> I know nothing bout chronos, but I find it an interesting read....ymmv.
> 
> 
> http://nerds.airbnb.com/introducing-chronos/
> http://airbnb.github.io/chronos/
> https://github.com/airbnb/chronos

Aaaaaaaah, now this sounds like something I can use. Proper dependency
chains, RESTful JSON interface so the devs can write code to drive it in
automation.

Good find, thanks!
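
For a feel of what "driving it from code" might look like, here is a hedged Python sketch that only builds the JSON bodies for a scheduled job and a job that depends on it. The endpoint paths and field names (schedule, parents, owner) are written from memory of the Chronos README and should be treated as assumptions to check against the project's documentation; the job names, commands and owner address are made up, and nothing is actually sent.

```python
import json

# Assumed endpoint paths; verify against the Chronos documentation before use.
SCHEDULED_ENDPOINT = "/scheduler/iso8601"
DEPENDENT_ENDPOINT = "/scheduler/dependency"


def scheduled_job(name, command, schedule, owner):
    """Body for a job triggered by an ISO 8601 repeating interval."""
    return {"name": name, "command": command, "schedule": schedule, "owner": owner}


def dependent_job(name, command, parents, owner):
    """Body for a job triggered when its parent jobs have completed."""
    return {"name": name, "command": command, "parents": list(parents), "owner": owner}


# Hypothetical jobs: a daily backup, then a test import that runs after it.
backup = scheduled_job(
    "db-backup", "/usr/local/bin/backup-db",
    "R/2014-08-02T02:00:00Z/P1D", "ops@example.com",
)
verify = dependent_job(
    "db-test-import", "/usr/local/bin/test-import",
    ["db-backup"], "ops@example.com",
)

for endpoint, job in ((SCHEDULED_ENDPOINT, backup), (DEPENDENT_ENDPOINT, verify)):
    print("POST", endpoint, json.dumps(job, sort_keys=True))
```

The point is the second body: the test import names its parent instead of guessing a time at which the backup ought to be done.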




> 
> 
> cheers mate!
> 
> James
> 
> 
> 
> 
> 
> 


-- 
Alan McKinnon
alan.mckinnon@gmail.com




* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-01 19:19   ` Alan McKinnon
@ 2014-08-01 19:35     ` covici
  2014-08-02  9:18       ` Alan McKinnon
  0 siblings, 1 reply; 52+ messages in thread
From: covici @ 2014-08-01 19:35 UTC (permalink / raw
  To: gentoo-user

Alan McKinnon <alan.mckinnon@gmail.com> wrote:

> On 01/08/2014 20:17, James wrote:
> > Alan McKinnon <alan.mckinnon <at> gmail.com> writes:
> > 
> > 
> >> New job, new environment. Existing persons suffer from
> >> 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
> >> ills. Result: a towering edifice of cron jobs that may or may not
> >> clobber each other's work, may or may not work at all, and implement no
> >> error handling at all. But my god, can they spew out mail from STOUT
> > 
> > Sounds like a department full of computer scientist I inherited a few
> > decades ago...........
> 
> I've met folks like that....
> Brilliant in their chosen field but completely useless outside it? The
> kind of fellows who see nothing wrong with eating a barbeque'd steak
> with a spoon because they can get a result?
> 
> > 
> > I know nothing bout chronos, but I find it an interesting read....ymmv.
> > 
> > 
> > http://nerds.airbnb.com/introducing-chronos/
> > http://airbnb.github.io/chronos/
> > https://github.com/airbnb/chronos
> 
> Aaaaaaaah, now this sounds like something I can use. Proper dependency
> chains, Restful JSON interface so the devs can write code to drive it in
> automation.
> 
> Good find, thanks!

Unless I am missing something, chronos is not in the tree at all.

-- 
Your life is like a penny.  You're going to lose it.  The question is:
How do you spend it?

         John Covici
         covici@ccs.covici.com



* [gentoo-user] Re: Recommendations for scheduler
  2014-08-01 17:32 [gentoo-user] Recommendations for scheduler Alan McKinnon
  2014-08-01 17:49 ` Сергей
  2014-08-01 18:17 ` [gentoo-user] " James
@ 2014-08-01 21:02 ` Martin Vaeth
  2014-08-01 21:22   ` J. Roeleveld
  2014-08-02  9:27   ` Alan McKinnon
  2014-08-01 21:13 ` [gentoo-user] " J. Roeleveld
  3 siblings, 2 replies; 52+ messages in thread
From: Martin Vaeth @ 2014-08-01 21:02 UTC (permalink / raw
  To: gentoo-user

Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>
> But cron has only one event trigger: wall-clock time. And it's a very
> blunt weapon. I'm looking for recommendations of alternative schedulers
> that satisfy real-world business needs that need some other event
> trigger than what the time is right now.

I had a similar need recently, and since the discussion in

https://forums.gentoo.org/viewtopic-t-992780-highlight-.html

had led to nothing satisfactory for me, I have written a
scheduler tool which serves my needs
(which might very well differ from yours...):

The corresponding tool is still in beta testing phase:
https://github.com/vaeth/schedule/

You can install it from the mv overlay (available over layman).




* Re: [gentoo-user] Recommendations for scheduler
  2014-08-01 17:32 [gentoo-user] Recommendations for scheduler Alan McKinnon
                   ` (2 preceding siblings ...)
  2014-08-01 21:02 ` Martin Vaeth
@ 2014-08-01 21:13 ` J. Roeleveld
  2014-08-02  9:33   ` Alan McKinnon
  3 siblings, 1 reply; 52+ messages in thread
From: J. Roeleveld @ 2014-08-01 21:13 UTC (permalink / raw
  To: gentoo-user

On 1 August 2014 19:32:36 CEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>Hi,
>
>Up-front disclaimer: Mostly [OT] post. But at least I'll test drive it
>on Gentoo before putting it in production :-)
>
>New job, new environment. Existing persons suffer from
>5-year-old-with-a-hammer syndrome and assume cron is the solution to
>all
>ills. Result: a towering edifice of cron jobs that may or may not
>clobber each other's work, may or may not work at all, and implement no
>error handling at all. But my god, can they spew out mail from STOUT
>
>
>But cron has only one event trigger: wall-clock time. And it's a very
>blunt weapon. I'm looking for recommendations of alternative schedulers
>that satisfy real-world business needs that need some other event
>trigger than what the time is right now.
>
>For those familiar with it, I'm looking for something with the useful
>feature set, without the useless features and without the price tag of
>ControlM
>
>Anyone care to share experiences?

I'm also looking for a free alternative.
At most of my clients, I see Tivoli Workload Scheduler (TWS) being used a lot.

It has most of what you want from an intelligent multi-host scheduler. Unfortunately, it also comes with a corresponding price tag.

If anyone knows of an open-source project with comparable features, please let me know.
Failing this, it is on my list to start writing one myself when I get some spare time.

--
Joost
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-01 18:17 ` [gentoo-user] " James
  2014-08-01 19:19   ` Alan McKinnon
@ 2014-08-01 21:17   ` J. Roeleveld
  1 sibling, 0 replies; 52+ messages in thread
From: J. Roeleveld @ 2014-08-01 21:17 UTC (permalink / raw
  To: gentoo-user

On 1 August 2014 20:17:05 CEST, James <wireless@tampabay.rr.com> wrote:
>Alan McKinnon <alan.mckinnon <at> gmail.com> writes:
>
>
>> New job, new environment. Existing persons suffer from
>> 5-year-old-with-a-hammer syndrome and assume cron is the solution to
>all
>> ills. Result: a towering edifice of cron jobs that may or may not
>> clobber each other's work, may or may not work at all, and implement
>no
>> error handling at all. But my god, can they spew out mail from STOUT
>
>Sounds like a department full of computer scientist I inherited a few
>decades ago...........
>
>I know nothing bout chronos, but I find it an interesting read....ymmv.
>
>
>http://nerds.airbnb.com/introducing-chronos/
>http://airbnb.github.io/chronos/
>https://github.com/airbnb/chronos
>
>
>cheers mate!
>
>James

Looks interesting, apart from it requiring a clustered environment (Mesos).

Unless I misunderstand the part where it says it runs on top of Mesos?

--
Joost
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-01 21:02 ` Martin Vaeth
@ 2014-08-01 21:22   ` J. Roeleveld
  2014-08-01 22:06     ` Martin Vaeth
  2014-08-02  9:27   ` Alan McKinnon
  1 sibling, 1 reply; 52+ messages in thread
From: J. Roeleveld @ 2014-08-01 21:22 UTC (permalink / raw
  To: gentoo-user

On 1 August 2014 23:02:11 CEST, Martin Vaeth <martin@mvath.de> wrote:
>Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>>
>> But cron has only one event trigger: wall-clock time. And it's a very
>> blunt weapon. I'm looking for recommendations of alternative
>schedulers
>> that satisfy real-world business needs that need some other event
>> trigger than what the time is right now.
>
>I had a similar need recently, and since the discussion in
>
>https://forums.gentoo.org/viewtopic-t-992780-highlight-.html
>
>had led to nothing satisfactory for me, I have written a
>scheduler tool which serves my needs
>(which might very well differ from yours...):
>
>The corresponding tool is still in beta testing phase:
>https://github.com/vaeth/schedule/
>
>You can install it from the mv overlay (available over layman).

Going to have a look at this soon.

What features does it already have, and what are you planning to add?

--
Joost
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



* [gentoo-user] Re: Recommendations for scheduler
  2014-08-01 21:22   ` J. Roeleveld
@ 2014-08-01 22:06     ` Martin Vaeth
  0 siblings, 0 replies; 52+ messages in thread
From: Martin Vaeth @ 2014-08-01 22:06 UTC (permalink / raw
  To: gentoo-user

J. Roeleveld <joost@antarean.org> wrote:
>>https://github.com/vaeth/schedule/
>
> What are the features it currently has already

This is hard to answer, since at first glance the whole thing
does not even look like a scheduler: it looks more like a means to
communicate with some server. But after the discussions in the
Gentoo forums, it became clear to my surprise that this is all
that is needed for the use cases I had in mind:
the "real" scheduler driving the whole thing can be a tiny script
(in shell or any other language) which just communicates with
that server.

To understand whether this can solve your problems, it is
probably best if you look at the examples in the README
(and/or the mentioned discussion in the gentoo forum).

> and what are you planning on adding?

Since it is sufficient for my purposes, I am currently not
planning to add anything (except possibly bug fixes or if I run
into a problem which I cannot solve with it).
Patches for extensions are welcome, of course.
(Also suggestions without patches are welcome, but my time is
currently very limited, and I do not make any promises.)


* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-01 19:35     ` covici
@ 2014-08-02  9:18       ` Alan McKinnon
  2014-08-02 13:34         ` J. Roeleveld
  0 siblings, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-02  9:18 UTC (permalink / raw
  To: gentoo-user

On 01/08/2014 21:35, covici@ccs.covici.com wrote:
> Alan McKinnon <alan.mckinnon@gmail.com> wrote:
> 
>> On 01/08/2014 20:17, James wrote:
>>> Alan McKinnon <alan.mckinnon <at> gmail.com> writes:
>>>
>>>
>>>> New job, new environment. Existing persons suffer from
>>>> 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
>>>> ills. Result: a towering edifice of cron jobs that may or may not
>>>> clobber each other's work, may or may not work at all, and implement no
>>>> error handling at all. But my god, can they spew out mail from STOUT
>>>
>>> Sounds like a department full of computer scientist I inherited a few
>>> decades ago...........
>>
>> I've met folks like that....
>> Brilliant in their chosen field but completely useless outside it? The
>> kind of fellows who see nothing wrong with eating a barbeque'd steak
>> with a spoon because they can get a result?
>>
>>>
>>> I know nothing bout chronos, but I find it an interesting read....ymmv.
>>>
>>>
>>> http://nerds.airbnb.com/introducing-chronos/
>>> http://airbnb.github.io/chronos/
>>> https://github.com/airbnb/chronos
>>
>> Aaaaaaaah, now this sounds like something I can use. Proper dependency
>> chains, Restful JSON interface so the devs can write code to drive it in
>> automation.
>>
>> Good find, thanks!
> 
> Unless I am missing something, chronos is not in the tree at all.
> 

Correct, it isn't in the tree. But there's nothing stopping me from
getting it in there.

-- 
Alan McKinnon
alan.mckinnon@gmail.com




* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-01 21:02 ` Martin Vaeth
  2014-08-01 21:22   ` J. Roeleveld
@ 2014-08-02  9:27   ` Alan McKinnon
  1 sibling, 0 replies; 52+ messages in thread
From: Alan McKinnon @ 2014-08-02  9:27 UTC (permalink / raw
  To: gentoo-user

On 01/08/2014 23:02, Martin Vaeth wrote:
> Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>>
>> But cron has only one event trigger: wall-clock time. And it's a very
>> blunt weapon. I'm looking for recommendations of alternative schedulers
>> that satisfy real-world business needs that need some other event
>> trigger than what the time is right now.
> 
> I had a similar need recently, and since the discussion in
> 
> https://forums.gentoo.org/viewtopic-t-992780-highlight-.html

Interesting thread :-)

Conceptually, your needs are the same as mine - sequence defined by
something other than wall-clock time.
The responders there do the same thing as I experience - tunnel vision
with regard to cron. Sysadmins are used to cron and sadly most of us
want to ram a purely cron-based solution into places where it most
certainly does not belong.

Business rules very seldom fit easily into a cron model; they usually
rely on a defined sequence.


> 
> had led to nothing satisfactory for me, I have written a
> scheduler tool which serves my needs
> (which might very well differ from yours...):
> 
> The corresponding tool is still in beta testing phase:
> https://github.com/vaeth/schedule/
> 
> You can install it from the mv overlay (available over layman).

Nice, thanks for the link :-)

Now I have two projects to evaluate.


-- 
Alan McKinnon
alan.mckinnon@gmail.com




* Re: [gentoo-user] Recommendations for scheduler
  2014-08-01 21:13 ` [gentoo-user] " J. Roeleveld
@ 2014-08-02  9:33   ` Alan McKinnon
  2014-08-02 13:31     ` J. Roeleveld
  2014-08-03 13:02     ` [gentoo-user] " Tanstaafl
  0 siblings, 2 replies; 52+ messages in thread
From: Alan McKinnon @ 2014-08-02  9:33 UTC (permalink / raw
  To: gentoo-user

On 01/08/2014 23:13, J. Roeleveld wrote:
> On 1 August 2014 19:32:36 CEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>> Hi,
>>
>> Up-front disclaimer: Mostly [OT] post. But at least I'll test drive it
>> on Gentoo before putting it in production :-)
>>
>> New job, new environment. Existing persons suffer from
>> 5-year-old-with-a-hammer syndrome and assume cron is the solution to
>> all
>> ills. Result: a towering edifice of cron jobs that may or may not
>> clobber each other's work, may or may not work at all, and implement no
>> error handling at all. But my god, can they spew out mail from STOUT
>>
>>
>> But cron has only one event trigger: wall-clock time. And it's a very
>> blunt weapon. I'm looking for recommendations of alternative schedulers
>> that satisfy real-world business needs that need some other event
>> trigger than what the time is right now.
>>
>> For those familiar with it, I'm looking for something with the useful
>> feature set, without the useless features and without the price tag of
>> ControlM
>>
>> Anyone care to share experiences?
> 
> I'm also looking for a free alternative.
> At most of my clients, I see Tivoli Workload Scheduler (TWS) being used a lot. 
> 
> It has most things what you want from an intelligent multi host scheduler. Unfortunately,  it also comes with a corresponding price tag.

I have an unusual boss. He's a business owner and quite naturally
profit-driven. He also employs smart people and expects us to maintain
systems in-house.

He's also a zealous FLOSS fan.

So when I present him a price tag for software his first question is
always "is there any free as in freedom software suited for the job?"

I'm still trying to wrap my brains around dealing with a boss that
thinks like this :-)

> 
> If anyone knows of an OS project with comparable features, please let me know. 
> Failing this, it is on my list to start writing one myself when I get some spare time. 
> 
> --
> Joost
> 


-- 
Alan McKinnon
alan.mckinnon@gmail.com




* Re: [gentoo-user] Recommendations for scheduler
  2014-08-02  9:33   ` Alan McKinnon
@ 2014-08-02 13:31     ` J. Roeleveld
  2014-08-02 14:03       ` Alan McKinnon
  2014-08-03  7:50       ` Martin Vaeth
  2014-08-03 13:02     ` [gentoo-user] " Tanstaafl
  1 sibling, 2 replies; 52+ messages in thread
From: J. Roeleveld @ 2014-08-02 13:31 UTC (permalink / raw
  To: gentoo-user


On Saturday, August 02, 2014 11:33:30 AM Alan McKinnon wrote:
> On 01/08/2014 23:13, J. Roeleveld wrote:
> > On 1 August 2014 19:32:36 CEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
> >> Hi,
> >>
> >> Up-front disclaimer: Mostly [OT] post. But at least I'll test drive it
> >> on Gentoo before putting it in production :-)
> >>
> >> New job, new environment. Existing persons suffer from
> >> 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
> >> ills. Result: a towering edifice of cron jobs that may or may not
> >> clobber each other's work, may or may not work at all, and implement no
> >> error handling at all. But my god, can they spew out mail from STOUT
> >>
> >>
> >> But cron has only one event trigger: wall-clock time. And it's a very
> >> blunt weapon. I'm looking for recommendations of alternative schedulers
> >> that satisfy real-world business needs that need some other event
> >> trigger than what the time is right now.
> >>
> >> For those familiar with it, I'm looking for something with the useful
> >> feature set, without the useless features and without the price tag of
> >> ControlM
> >>
> >> Anyone care to share experiences?
> >
> > I'm also looking for a free alternative.
> > At most of my clients, I see Tivoli Workload Scheduler (TWS) being used a lot.
> >
> > It has most things what you want from an intelligent multi host scheduler.
> > Unfortunately,  it also comes with a corresponding price tag.
> I have an unusual boss. He's a business owner and quite naturally
> profit-driven. He also employs smart people and expects us to maintain
> systems in-house.
> 
> He's also a zealous FLOSS fan.
> 
> So when I present him a price tag for software his first question is
> always "is there any free as in freedom software suited for the job?"

Depends on the specific requirements.
If you want:
- time based start of a schedule
- dependencies in said schedules and between schedules which can delay 
the actual start
- stop of schedule if error occurs
- ability to restart schedule from crashed point
- have schedules operate over multiple machines (eg. part run on 
database, some on a compute-cluster, some other bit making nice graphs 
and printing it,...)

Then you might be out of luck.
If anyone has something that is already going along these lines, please let
me know. I am more than willing to spend time and effort to assist in the
development. Doing a project like that on my own in my extremely limited
free time is not really an option.
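
To show what two of the requirements above amount to (stop on error, restart from the crashed point), here is a minimal single-host Python sketch. It is only an illustration under assumptions: run_schedule, the step names and the JSON state file are all made up, and the time-based start, cross-schedule dependencies and multi-machine parts are not covered.

```python
import json
import os


def run_schedule(steps, state_path):
    """Run (name, callable) steps in order, skipping those already recorded as done.

    Stops at the first step that raises; the state file then lets a later
    call resume at that step. Returns the names of the steps run this time.
    """
    done = []
    if os.path.exists(state_path):
        with open(state_path) as fh:
            done = json.load(fh)
    ran = []
    for name, action in steps:
        if name in done:
            continue                      # finished in an earlier run
        action()                          # an exception here aborts the schedule
        done.append(name)
        ran.append(name)
        with open(state_path, "w") as fh:
            json.dump(done, fh)           # checkpoint after every completed step
    if os.path.exists(state_path):
        os.remove(state_path)             # schedule finished; next run starts fresh
    return ran
```

Calling run_schedule again with the same steps and state file after fixing the failure picks up at the step that raised, without repeating the ones that already completed.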

> I'm still trying to wrap my brains around dealing with a boss that
> thinks like this :-)

Hehe :)

--
Joost


* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-02  9:18       ` Alan McKinnon
@ 2014-08-02 13:34         ` J. Roeleveld
  0 siblings, 0 replies; 52+ messages in thread
From: J. Roeleveld @ 2014-08-02 13:34 UTC (permalink / raw
  To: gentoo-user


On Saturday, August 02, 2014 11:18:32 AM Alan McKinnon wrote:
> On 01/08/2014 21:35, covici@ccs.covici.com wrote:
> > Alan McKinnon <alan.mckinnon@gmail.com> wrote:
> >> On 01/08/2014 20:17, James wrote:
> >>> Alan McKinnon <alan.mckinnon <at> gmail.com> writes:
> > >>>> New job, new environment. Existing persons suffer from
> > >>>> 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
> > >>>> ills. Result: a towering edifice of cron jobs that may or may not
> > >>>> clobber each other's work, may or may not work at all, and implement no
> > >>>> error handling at all. But my god, can they spew out mail from STOUT
> > >>>
> > >>> Sounds like a department full of computer scientist I inherited a few
> > >>> decades ago...........
> > >>
> > >> I've met folks like that....
> > >> Brilliant in their chosen field but completely useless outside it? The
> > >> kind of fellows who see nothing wrong with eating a barbeque'd steak
> > >> with a spoon because they can get a result?
> > >>
> > >>> I know nothing bout chronos, but I find it an interesting read....ymmv.
> > >>>
> > >>>
> > >>> http://nerds.airbnb.com/introducing-chronos/
> > >>> http://airbnb.github.io/chronos/
> > >>> https://github.com/airbnb/chronos
> > >>
> > >> Aaaaaaaah, now this sounds like something I can use. Proper dependency
> > >> chains, Restful JSON interface so the devs can write code to drive it in
> >> automation.
> >> 
> >> Good find, thanks!
> > 
> > Unless I am missing something, chronos is not in the tree at all.
> 
> Correct, it isn't in the tree. But there's nothing stopping me from
> getting it in there

Neither are the dependencies.

If you get it to work, don't forget to write up a nice howto; from what I
found online, the documentation is incomplete and out of date.

--
Joost


* Re: [gentoo-user] Recommendations for scheduler
  2014-08-02 13:31     ` J. Roeleveld
@ 2014-08-02 14:03       ` Alan McKinnon
  2014-08-02 16:53         ` [gentoo-user] " James
  2014-08-03  7:50       ` Martin Vaeth
  1 sibling, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-02 14:03 UTC (permalink / raw
  To: gentoo-user

On 02/08/2014 15:31, J. Roeleveld wrote:
> Depends on the specific requirements.
> 
> If you want:
> 
> - time based start of a schedule
> 
> - dependencies in said schedules and between schedules which can delay
> the actual start
> 
> - stop of schedule if error occurs
> 
> - ability to restart schedule from crashed point
> 
> - have schedules operate over multiple machines (eg. part run on
> database, some on a compute-cluster, some other bit making nice graphs
> and printing it,...)
> 
>  
> 
> Then you might be out of luck.
> 
> If anyone has something that is already going along these lines, please
> let me know. I am more then willing to spend time and effort to assist
> in the development. Doing a project like that on my own in my extremely
> limited free time is not really an option.
> 


Well, we've found 2 projects that at least in part seek to achieve our
general goals - chronos and Martin's new project.

Why don't we both fool around with them for a bit and get a sense of
what it will take to add features etc? Then we can meet back here and
discuss.

Always better to build on an existing foundation.

-- 
Alan McKinnon
alan.mckinnon@gmail.com




* [gentoo-user] Re: Recommendations for scheduler
  2014-08-02 14:03       ` Alan McKinnon
@ 2014-08-02 16:53         ` James
  2014-08-03  7:23           ` Joost Roeleveld
  0 siblings, 1 reply; 52+ messages in thread
From: James @ 2014-08-02 16:53 UTC (permalink / raw
  To: gentoo-user

Alan McKinnon <alan.mckinnon <at> gmail.com> writes:


> Well, we've found 2 projects that at least in part seek to achieve our
> general goals - chronos and Martin's new project.
> Why don't we both fool around with them for a bit and get a sense of
> what it will take to add features etc? Then we can meet back here and
> discuss. Always better to build on an existing foundation

Mesos looks promising for a variety of (Apache) reasons. Some key
related technologies folks may want to google:

Quincy (fair scheduler)
Chronos (scheduler)
Hadoop (scheduler)
HDFS (clustered file system)
http://gpo.zugaina.org/sys-cluster/apache-hadoop-common

Zookeeper (fault tolerance)
Spark (optimized for iterative jobs where a dataset is reused in many
parallel operations: advanced math/science and many other apps)
https://spark.apache.org/

Dryad, Torque, MPICH2, MPI
Globus Toolkit

mesos_tech_report.pdf

It looks as though Amazon, Google, Facebook and many other large
players in the cluster/cloud arena are using Mesos......?

So let's all post what we find, particularly in overlays.

hth,
James




* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-02 16:53         ` [gentoo-user] " James
@ 2014-08-03  7:23           ` Joost Roeleveld
  2014-08-03 12:16             ` Alan McKinnon
  2014-08-05 19:57             ` James
  0 siblings, 2 replies; 52+ messages in thread
From: Joost Roeleveld @ 2014-08-03  7:23 UTC (permalink / raw
  To: gentoo-user

On Saturday 02 August 2014 16:53:26 James wrote:
> Alan McKinnon <alan.mckinnon <at> gmail.com> writes:
> > Well, we've found 2 projects that at least in part seek to achieve our
> > general goals - chronos and Martin's new project.
> > Why don't we both fool around with them for a bit and get a sense of
> > what it will take to add features etc? Then we can meet back here and
> > discuss. Always better to build on an existing foundation
> 
> Mesos looks promising for a variety of (Apache) reasons. Some related key
> technologies folks may want to google:
> 
> Quincy (fair scheduler)
> Chronos (scheduler)
> Hadoop (scheduler)

Hadoop is not a scheduler. It's a framework for a Big Data clustered database.

> HDFS (clustered file system)

Unless it's changed recently, it is not suitable for anything other than Hadoop
and contains a single point of failure.

> http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
> 
> Zookeeper (Fault tolerance)
> Spark (optimized for iterative jobs where a dataset is reused in many
> parallel operations: advanced math/science and many other apps)
> https://spark.apache.org/
> 
> Dryad, Torque, MPICH2, MPI
> Globus Toolkit
> 
> mesos_tech_report.pdf
> 
> It looks as though Amazon, Google, Facebook and many other large
> players in the cluster/cloud arena are using Mesos?
> 
> So let's all post what we find, particularly in overlays.

Unless you are dealing with Big Data projects, like Google, Facebook, Amazon, 
big banks,... you don't have much use for those projects.

Mesos looks like a nice project, just like Hadoop and related are also nice. 
But for most people, they are as useful as using Exalytics.

A scheduler should not have a large set of dependencies that you wouldn't use 
otherwise. That makes Chronos a non-option to me.

Martin's project looks promising, but doesn't store the schedules internally. 
For repeating schedules, like what Alan was describing, you need to put those 
into scripts and start those from an existing cron.

Of the 2, I think improving Martin's project is the most likely option for me 
as it doesn't have additional dependencies and seems to be easily implemented.

--
Joost

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [gentoo-user] Re: Recommendations for scheduler
  2014-08-02 13:31     ` J. Roeleveld
  2014-08-02 14:03       ` Alan McKinnon
@ 2014-08-03  7:50       ` Martin Vaeth
  2014-08-03  8:06         ` J. Roeleveld
  1 sibling, 1 reply; 52+ messages in thread
From: Martin Vaeth @ 2014-08-03  7:50 UTC (permalink / raw
  To: gentoo-user

J. Roeleveld <joost@antarean.org> wrote:
>
> Depends on the specific requirements.
> If you want:

In a sense, most of what you require can be done with the "schedule"
tool I mentioned, although perhaps not in the way you expected.
I have reordered your points for a clearer explanation:

> - have schedules operate over multiple machines (eg. part run on 
> database, some on a compute-cluster, some other bit making nice graphs 
> and printing it,...)

Since "schedule" can use TCP for communication, this should not be
a problem if you let "schedule-server" listen on all interfaces
(export SCHEDULE_SERVER_OPTS=-a0.0.0.0).

For the actual scheduling you must set up your machines accordingly:
Queue on one machine the task doing the database access you want
(with "schedule -a[serveraddress] queue command_to_access_database")
and similarly on the other machines.
Of course, ssh or anything else can be used to do this without
physically accessing the machines.

Then, on one machine (not necessarily that of the server),
you run an appropriate "driver" script.

> - time based start of a schedule
> - dependencies in said schedules and between schedules which can delay
> the actual start
> - stop of schedule if error occurs

All this is not a problem, since the "driver" script is just a
shell script which calls "schedule" to start the tasks,
wait for them to finish, and/or check their exit status.
This is perhaps inconvenient, but it has the advantage of being
completely flexible:
You can use all the usual Linux tools like "sleep" (or also at or cron)
to get any delays you want, do tests more powerful than checking
the exit status, etc.
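
A minimal sketch of the driver idea described above, written in Python
rather than shell and deliberately not calling "schedule" itself (its
command-line options are not assumed here). The helper names
run_pipeline and fake_job, and the step names, are made up for
illustration; the "jobs" are harmless one-liners that only exit with a
chosen status.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run (name, argv) steps in order; stop at the first failure.

    Returns a list of (name, exit_status) for the steps that actually ran.
    """
    results = []
    for name, argv in steps:
        status = subprocess.run(argv).returncode
        results.append((name, status))
        if status != 0:
            # Later steps depend on this one, so do not start them.
            break
    return results


def fake_job(exit_status):
    """Hypothetical stand-in for a real job: exits with the given status."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_status})"]


if __name__ == "__main__":
    steps = [
        ("backup", fake_job(0)),
        ("copy", fake_job(0)),
        ("test-import", fake_job(3)),
        ("archive", fake_job(0)),
    ]
    for name, status in run_pipeline(steps):
        print(name, "ok" if status == 0 else f"failed ({status})")
```

In this example the "archive" step is never started because
"test-import" exits non-zero; a real driver could raise an alert or
restart the cycle at that point.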

> - ability to restart schedule from crashed point

Running not-yet-started jobs after a crash is not a problem:
you just edit your "driver" script appropriately and restart it.
Jobs which were already running need to be re-queued if they
should run again.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-03  7:50       ` Martin Vaeth
@ 2014-08-03  8:06         ` J. Roeleveld
  2014-08-03 12:10           ` Martin Vaeth
  0 siblings, 1 reply; 52+ messages in thread
From: J. Roeleveld @ 2014-08-03  8:06 UTC (permalink / raw
  To: gentoo-user

On Sunday, August 03, 2014 07:50:57 AM Martin Vaeth wrote:
> J. Roeleveld <joost@antarean.org> wrote:
> > Depends on the specific requirements.
> 
> > If you want:
> In a sense, most you require can be done with my mentioned "schedule"
> tool, although perhaps the usage is not in the way you expected.

I agree, based on a quick look.

> I reorder your points for a clearer explanation:

<snipped explanation>

A useful addition to your schedule-tool would be to store the scripts in a way 
that makes editing simpler and then add an editing tool to make this process 
simpler.
Add monitoring (email alerts, webpage, front-end) to check the status of all 
the batch-jobs.

I might be mistaken, but I think the server keeps the entire queue in-memory 
and when the process dies, the status is lost?
Or is it kept somewhere?

--
Joost

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Recommendations for scheduler
  2014-08-01 19:10     ` Alan McKinnon
@ 2014-08-03  9:27       ` Bruce Schultz
  2014-08-03 12:08         ` Alan McKinnon
  0 siblings, 1 reply; 52+ messages in thread
From: Bruce Schultz @ 2014-08-03  9:27 UTC (permalink / raw
  To: gentoo-user



On 2 August 2014 5:10:43 AM AEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>On 01/08/2014 19:50, Сергей wrote:
>> Also you can have a look at anacron.
>> 
>> 
>> 
>
>
>Unfortunately, anacron doesn't suit my needs at all. Here's how anacron
>works:
>
>this bunch of jobs will all happen today regardless of what time it is.
>That's not what I need, I need something that has very little to do with
>time. Example:
>
>1. Start backup job on db server A
>2. When complete, copy backup to server B and do a test import
>3. If import succeeds, move backup to permanent storage and log the fact
>4. If import fails, raise an alert and trigger the whole cycle to start
>again at 1
>
>Meanwhile,
>
>1. All servers are regularly doing apt-get update and downloading .debs,
>and applying security packages. Delay this on the db server if a backup
>is in progress.
>
>Meanwhile there is the regular Friday 5am code-publish cycle and
>month-end finance runs - this is a DevOps environment.

I'm not sure if it's quite what you have in mind, and it comes with a bit of a steep learning curve, but cfengine might fit the bill.

http://cfengine.com

Bruce


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Recommendations for scheduler
  2014-08-03  9:27       ` Bruce Schultz
@ 2014-08-03 12:08         ` Alan McKinnon
  2014-08-04  3:07           ` Bruce Schultz
  0 siblings, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-03 12:08 UTC (permalink / raw
  To: gentoo-user

On 03/08/2014 11:27, Bruce Schultz wrote:
> 
> 
> On 2 August 2014 5:10:43 AM AEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>> On 01/08/2014 19:50, Сергей wrote:
>>> Also you can have a look at anacron.
>>>
>>>
>>>
>>
>>
>> Unfortunately, anacron doesn't suit my needs at all. Here's how anacron
>> works:
>>
>> this bunch of jobs will all happen today regardless of what time it is.
>> That's not what I need, I need something that has very little to do with
>> time. Example:
>>
>> 1. Start backup job on db server A
>> 2. When complete, copy backup to server B and do a test import
>> 3. If import succeeds, move backup to permanent storage and log the fact
>> 4. If import fails, raise an alert and trigger the whole cycle to start
>> again at 1
>>
>> Meanwhile,
>>
>> 1. All servers are regularly doing apt-get update and downloading .debs,
>> and applying security packages. Delay this on the db server if a backup
>> is in progress.
>>
>> Meanwhile there is the regular Friday 5am code-publish cycle and
>> month-end finance runs - this is a DevOps environment.
> 
> I'm not sure if it's quite what you have in mind, and it comes with a bit of a steep learning curve, but cfengine might fit the bill.
> 
> http://cfengine.com

Hi Bruce,

Thanks for the reply.

I only worked with cfengine once, briefly, years ago, and we quickly
decided to roll our own deployment solution to solve that very specific
vertical problem.


Isn't cfengine a deployment framework, similar in ideals to puppet and
chef?

I don't want to deploy code or manage state, I want to run code
(backups, database maintenance, repair of dodgy data in databases and
code publish in a devops environment)


-- 
Alan McKinnon
alan.mckinnon@gmail.com



^ permalink raw reply	[flat|nested] 52+ messages in thread

* [gentoo-user] Re: Recommendations for scheduler
  2014-08-03  8:06         ` J. Roeleveld
@ 2014-08-03 12:10           ` Martin Vaeth
  2014-08-03 13:36             ` J. Roeleveld
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Vaeth @ 2014-08-03 12:10 UTC (permalink / raw
  To: gentoo-user

J. Roeleveld <joost@antarean.org> wrote:
>
> A useful addition to your schedule-tool would be to store the
> scripts in a way that makes editing simpler

Since it is an arbitrary script in an arbitrary language,
I think this is outside the scope of this project.
In most cases where I have used it so far, 1-2 more or less complex lines
(maybe a few more if they were not complex)
in an interactive zsh were enough, and these are simple
enough to edit in zsh, i.e. I did not even write a script "file"
in the classical sense.

> I might be mistaken, but I think the server keeps the entire
> queue in-memory and when the process dies, the status is lost?

Yes, the server process must not die.

If it dies, not only is the queue lost, but the waiting processes
(that is: queued but not yet started) can no longer be reached either:
These waiting processes do not have their own TCP socket but just
keep their established connection with the server's socket until
the server tells them through this connection to start or to cancel;
if this connection gets lost, the waiting processes die.
What else could they reasonably do?

The already started processes have a unique ID (into which the
server's process is encoded): They reestablish the connection to report
the exit status according to this ID. If the server is stopped,
they cannot report this status, of course, and moreover,
a new server does not know their IDs either and thus will ignore these
"status reports".

Maybe this "protocol" is not the most clever solution, but it is
one which could be implemented without lots of overhead:
Mainly, I was after a "quick" solution which works well enough
for me: If the server has no bugs, why should it die?
Moreover, if the server dies for some strange reason, it is probably
safer to re-queue the jobs anyway.
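
The consequence of encoding the server into the job ID can be
illustrated with a toy in-memory model. This is only a sketch under
stated assumptions: the Server class, the random token and the ID
format are hypothetical and are not taken from the schedule source code.

```python
import itertools
import uuid


class Server:
    """Toy in-memory model of the queue server described above."""

    def __init__(self):
        # Hypothetical per-instance token; stands in for whatever the real
        # server encodes into its job IDs.
        self.token = uuid.uuid4().hex
        self._counter = itertools.count(1)
        self.status = {}  # job_id -> "started" or an integer exit status

    def start_job(self):
        """Issue a new job ID that is tied to this server instance."""
        job_id = f"{self.token}:{next(self._counter)}"
        self.status[job_id] = "started"
        return job_id

    def report(self, job_id, exit_status):
        """Record an exit status; return False if the ID is unknown."""
        if job_id not in self.status:
            return False  # issued by another server instance: ignored
        self.status[job_id] = exit_status
        return True


if __name__ == "__main__":
    old = Server()
    job = old.start_job()
    new = Server()  # models a restart: the old server's state is gone
    print("old server accepts report:", old.report(job, 0))
    print("new server accepts report:", new.report(job, 0))
```

A restarted server starts with an empty status table and a different
token, so a report carrying an ID from the previous instance has nothing
to attach to.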

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-03  7:23           ` Joost Roeleveld
@ 2014-08-03 12:16             ` Alan McKinnon
  2014-08-03 13:33               ` J. Roeleveld
  2014-08-05 19:57             ` James
  1 sibling, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-03 12:16 UTC (permalink / raw
  To: gentoo-user

On 03/08/2014 09:23, Joost Roeleveld wrote:
> On Saturday 02 August 2014 16:53:26 James wrote:
>> Alan McKinnon <alan.mckinnon <at> gmail.com> writes:
>>> Well, we've found 2 projects that at least in part seek to achieve our
>>> general goals - chronos and Martin's new project.
>>> Why don't we both fool around with them for a bit and get a sense of
>>> what it will take to add features etc? Then we can meet back here and
>>> discuss. Always better to build on an existing foundation
>>
>> Mesos looks promising for a variety of (Apache) reasons. Some related key
>> technologies folks may want to google:
>>
>> Quincy (fair scheduler)
>> Chronos (scheduler)
>> Hadoop (scheduler)
> 
> Hadoop is not a scheduler. It's a framework for a Big Data clustered database.
> 
>> HDFS (clustered file system)
> 
> Unless it's changed recently, it is not suitable for anything other than Hadoop
> and contains a single point of failure.
> 
>> http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
>>
>> Zookeeper (Fault tolerance)
>> Spark (optimized for iterative jobs where a dataset is reused in many
>> parallel operations: advanced math/science and many other apps)
>> https://spark.apache.org/
>>
>> Dryad, Torque, MPICH2, MPI
>> Globus Toolkit
>>
>> mesos_tech_report.pdf
>>
>> It looks as though Amazon, Google, Facebook and many other large
>> players in the cluster/cloud arena are using Mesos?
>>
>> So let's all post what we find, particularly in overlays.
> 
> Unless you are dealing with Big Data projects, like Google, Facebook, Amazon, 
> big banks,... you don't have much use for those projects.


My wife works in BigData for real, she and Joost speak the same
language, I don't :-)
She reckons Big Data is like teenage sex - everyone says they are doing
it and no-one really does ;-D


> Mesos looks like a nice project, just like Hadoop and related are also nice. 
> But for most people, they are as useful as using Exalytics.

A bit OT, but it might be worthwhile for interested persons to get good
ebuilds going for these projects. Someone will use it on Gentoo, and it
will add value to the project. Much like gems and other
business-oriented packages benefit


> 
> A scheduler should not have a large set of dependencies that you wouldn't use 
> otherwise. That makes Chronos a non-option to me.
> 
> Martin's project looks promising, but doesn't store the schedules internally. 
> For repeating schedules, like what Alan was describing, you need to put those 
> into scripts and start those from an existing cron.

Sounds like a small feature-add. If Martin did his groundwork
correctly[1] then the core logic will work and it's just a case of
adding some persistence and loading the data back in on demand
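
What "adding some persistence" might look like, as a hedged sketch: the
job list is written to a JSON file and read back on startup. The file
layout, the field names and the save_queue/load_queue helpers are
assumptions made for this example, not part of Martin's tool.

```python
import json
import os
import tempfile


def save_queue(path, jobs):
    """Write the job list to `path` atomically (temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(jobs, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Rename within one directory, so readers see old or new, never half.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def load_queue(path):
    """Return the saved job list, or an empty list if nothing was saved."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    queue_file = os.path.join(workdir, "queue.json")
    save_queue(queue_file, [{"id": 1, "cmd": ["backup"], "state": "queued"}])
    print(load_queue(queue_file))
```

Writing to a temporary file and renaming it means a crash in the middle
of a save leaves the previous queue file intact.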

> Of the 2, I think improving Martin's project is the most likely option for me 
> as it doesn't have additional dependencies and seems to be easily implemented.

Don't forget Martin is the guy who does eix.
Street cred? check
Knows Gentoo? check





[1] I only say it this way as I haven't evaluated his code at all yet so
have no idea how far Martin has taken it


-- 
Alan McKinnon
alan.mckinnon@gmail.com



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Recommendations for scheduler
  2014-08-02  9:33   ` Alan McKinnon
  2014-08-02 13:31     ` J. Roeleveld
@ 2014-08-03 13:02     ` Tanstaafl
  1 sibling, 0 replies; 52+ messages in thread
From: Tanstaafl @ 2014-08-03 13:02 UTC (permalink / raw
  To: gentoo-user

On 8/2/2014 5:33 AM, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
> I have an unusual boss. He's a business owner and quite naturally
> profit-driven. He also employs smart people and expects us to maintain
> systems in-house.
>
> He's also a zealous FLOSS fan.
>
> So when I present him a price tag for software his first question is
> always "is there any free as in freedom software suited for the job?"
>
> I'm still trying to wrap my brains around dealing with a boss that
> thinks like this:-)

I am *sooooooooooo* jealous... ;)


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-03 12:16             ` Alan McKinnon
@ 2014-08-03 13:33               ` J. Roeleveld
  0 siblings, 0 replies; 52+ messages in thread
From: J. Roeleveld @ 2014-08-03 13:33 UTC (permalink / raw
  To: gentoo-user

On Sunday, August 03, 2014 02:16:37 PM Alan McKinnon wrote:
> On 03/08/2014 09:23, Joost Roeleveld wrote:
> > On Saturday 02 August 2014 16:53:26 James wrote:
> >> Alan McKinnon <alan.mckinnon <at> gmail.com> writes:

<snipped>

> > Unless you are dealing with Big Data projects, like Google, Facebook,
> > Amazon, big banks,... you don't have much use for those projects.
> 
> My wife works in BigData for real, she and Joost speak the same
> language, I don't :-)
> She reckons Big Data is like teenage sex - everyone says they are doing
> it and no-one really does ;-D

I know a few companies that actually do use it.
But the biggest issue with the whole "Big Data" thing is that no one really 
agrees on what it actually is.

> > Mesos looks like a nice project, just like Hadoop and related are also
> > nice. But for most people, they are as useful as using Exalytics.
> 
> A bit OT, but it might be worthwhile for interested persons to get good
> ebuilds going for these projects. Someone will use it on Gentoo, and it
> will add value to the project. Much like gems and other
> business-oriented packages benefit

I agree, but just to implement a decent scheduler, I still think it's 
overkill.

> > A scheduler should not have a large set of dependencies that you wouldn't
> > use otherwise. That makes Chronos a non-option to me.
> > 
> > Martin's project looks promising, but doesn't store the schedules
> > internally. For repeating schedules, like what Alan was describing, you
> > need to put those into scripts and start those from an existing cron.
> 
> Sounds like a small feature-add. If Martin did his groundwork
> correctly[1] then the core logic will work and it's just a case of
> adding some persistence and loading the data back in on demand

The code looks clean and I think it shouldn't be too much work to add it.

> > Of the 2, I think improving Martin's project is the most likely option for
> > me as it doesn't have additional dependencies and seems to be easily
> > implemented.
> Don't forget Martin is the guy who does eix.
> Street cred? check
> Knows Gentoo? check
> 
> [1] I only say it this way as I haven't evaluated his code at all yet so
> have no idea how far Martin has taken it

The code is clean and does what Martin says it does.

--
Joost


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-03 12:10           ` Martin Vaeth
@ 2014-08-03 13:36             ` J. Roeleveld
  2014-08-03 20:04               ` Alan McKinnon
  2014-08-04  8:41               ` Martin Vaeth
  0 siblings, 2 replies; 52+ messages in thread
From: J. Roeleveld @ 2014-08-03 13:36 UTC (permalink / raw
  To: gentoo-user

On Sunday, August 03, 2014 12:10:49 PM Martin Vaeth wrote:
> J. Roeleveld <joost@antarean.org> wrote:
> > A useful addition to your schedule-tool would be to store the
> > scripts in a way that makes editing simpler
> 
> Since it is an arbitrary script in an arbitrary language,
> I think this is not in the scope of this project to do this.
> In most cases I used it so far, 1-2 more or less complex lines
> (maybe a few more if they would not be complex)
> in an interactive zsh were enough, and these are very simple
> enough to edit in zsh, i.e. I even did not write any script "file"
> in the classical sense.
> 
> > I might be mistaken, but I think the server keeps the entire
> > queue in-memory and when the process dies, the status is lost?
> 
> Yes, the server process must not die.
> 
> If it dies, not only the queue is lost but also the waiting processes
> (that is: queued but not yet started) cannot be reached anymore:
> These waiting processes do not have their own TCP socket but just
> keep their established connection with the server's socket until
> the server tells them through this connection to start or to cancel;
> if this connection gets lost, the waiting processes die:
> What else could they do, reasonably?
> 
> The already started processes have a unique ID (into which the
> server's process is encoded): They reestablish the connection to report
> the exit status according to this ID. If the server is stopped,
> they cannot report this status, of course, and moreover,
> a new server does not know their IDs either and thus will ignore these
> "status reports".
> 
> Maybe this "protocol" is not the most clever solution, but it is
> one which could be implemented without lots of overhead:
> Mainly, I was up to a "quick" solution which is working good enough
> for me: If the server has no bugs, why should it die?
> Moreover, if the server dies for some strange reasons, it is probably
> safer to re-queue the jobs again, anyway.

With the kind of schedules I am working with (and I believe Alan will also end 
up with), restarting the whole process from the start can lead to issues.
Finding out how far the process got before the service crashed can become 
rather complex.

--
Joost


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-03 13:36             ` J. Roeleveld
@ 2014-08-03 20:04               ` Alan McKinnon
  2014-08-03 20:23                 ` J. Roeleveld
  2014-08-04  8:41               ` Martin Vaeth
  1 sibling, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-03 20:04 UTC (permalink / raw
  To: gentoo-user

On 03/08/2014 15:36, J. Roeleveld wrote:
>> Maybe this "protocol" is not the most clever solution, but it is
>> one which could be implemented without lots of overhead:
>> Mainly, I was up to a "quick" solution which is working good enough
>> for me: If the server has no bugs, why should it die?
>> Moreover, if the server dies for some strange reasons, it is probably
>> safer to re-queue the jobs again, anyway.

> With the kind of schedules I am working with (and I believe Alan will also end 
> up with), restarting the whole process from the start can lead to issues.
> Finding out how far the process got before the service crashed can become 
> rather complex.

Yes, very much so. My first concern is the database cleanups - without
scheduler guarantees I'd need transactions in MySQL.


-- 
Alan McKinnon
alan.mckinnon@gmail.com



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-03 20:04               ` Alan McKinnon
@ 2014-08-03 20:23                 ` J. Roeleveld
  2014-08-03 20:57                   ` Alan McKinnon
  0 siblings, 1 reply; 52+ messages in thread
From: J. Roeleveld @ 2014-08-03 20:23 UTC (permalink / raw
  To: gentoo-user

On Sunday, August 03, 2014 10:04:50 PM Alan McKinnon wrote:
> On 03/08/2014 15:36, J. Roeleveld wrote:
> >> Maybe this "protocol" is not the most clever solution, but it is
> >> one which could be implemented without lots of overhead:
> >> Mainly, I was up to a "quick" solution which is working good enough
> >> for me: If the server has no bugs, why should it die?
> >> Moreover, if the server dies for some strange reasons, it is probably
> >> safer to re-queue the jobs again, anyway.
> > 
> > With the kind of schedules I am working with (and I believe Alan will also
> > end up with), restarting the whole process from the start can lead to
> > issues. Finding out how far the process got before the service crashed
> > can become rather complex.
> 
> Yes, very much so. My first concern is the database cleanups - without
> scheduler guarantees I'd need transactions in MySQL.

Or you migrate to PostgreSQL, but that is OT :)

--
Joost


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-03 20:23                 ` J. Roeleveld
@ 2014-08-03 20:57                   ` Alan McKinnon
  2014-08-03 21:10                     ` J. Roeleveld
  0 siblings, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-03 20:57 UTC (permalink / raw
  To: gentoo-user

On 03/08/2014 22:23, J. Roeleveld wrote:
> On Sunday, August 03, 2014 10:04:50 PM Alan McKinnon wrote:
>> On 03/08/2014 15:36, J. Roeleveld wrote:
>>>> Maybe this "protocol" is not the most clever solution, but it is
>>>> one which could be implemented without lots of overhead:
>>>> Mainly, I was up to a "quick" solution which is working good enough
>>>> for me: If the server has no bugs, why should it die?
>>>> Moreover, if the server dies for some strange reasons, it is probably
>>>> safer to re-queue the jobs again, anyway.
>>>
>>> With the kind of schedules I am working with (and I believe Alan will also
>>> end up with), restarting the whole process from the start can lead to
>>> issues. Finding out how far the process got before the service crashed
>>> can become rather complex.
>>
>> Yes, very much so. My first concern is the database cleanups - without
>> scheduler guarantees I'd need transactions in MySQL.
> 
> Or you migrate to PostgreSQL, but that is OT :)


Maybe, but also valid :-)

I took one look at the schemas here and wondered "Why MySQL? This is
Postgres territory". It's a case of LAMP tunnel vision.





-- 
Alan McKinnon
alan.mckinnon@gmail.com



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-03 20:57                   ` Alan McKinnon
@ 2014-08-03 21:10                     ` J. Roeleveld
  0 siblings, 0 replies; 52+ messages in thread
From: J. Roeleveld @ 2014-08-03 21:10 UTC (permalink / raw
  To: gentoo-user

On Sunday, August 03, 2014 10:57:06 PM Alan McKinnon wrote:
> On 03/08/2014 22:23, J. Roeleveld wrote:
> > On Sunday, August 03, 2014 10:04:50 PM Alan McKinnon wrote:
> >> On 03/08/2014 15:36, J. Roeleveld wrote:
> >>>> Maybe this "protocol" is not the most clever solution, but it is
> >>>> one which could be implemented without lots of overhead:
> >>>> Mainly, I was up to a "quick" solution which is working good enough
> >>>> for me: If the server has no bugs, why should it die?
> >>>> Moreover, if the server dies for some strange reasons, it is probably
> >>>> safer to re-queue the jobs again, anyway.
> >>> 
> >>> With the kind of schedules I am working with (and I believe Alan will
> >>> also end up with), restarting the whole process from the start can lead
> >>> to issues. Finding out how far the process got before the service
> >>> crashed can become rather complex.
> >> 
> >> Yes, very much so. My first concern is the database cleanups - without
> >> scheduler guarantees I'd need transactions in MySQL.
> > 
> > Or you migrate to PostgreSQL, but that is OT :)
> 
> Maybe, but also valid :-)
> 
> I took one look at the schemas here and wondered "Why MySQL? This is
> Postgres territory". It's a case of LAMP tunnel vision.

That and that people who start with LAMP don't learn SQL.
This leads to code that is near impossible to port to a different database and 
when people actually want to do all the work to get the SQL to work on any 
database, the projects involved refuse the patches.

--
Joost


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Recommendations for scheduler
  2014-08-03 12:08         ` Alan McKinnon
@ 2014-08-04  3:07           ` Bruce Schultz
  0 siblings, 0 replies; 52+ messages in thread
From: Bruce Schultz @ 2014-08-04  3:07 UTC (permalink / raw
  To: gentoo-user

On 3 August 2014 10:08:39 PM AEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>On 03/08/2014 11:27, Bruce Schultz wrote:
>> 
>> 
>> On 2 August 2014 5:10:43 AM AEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>>> On 01/08/2014 19:50, Сергей wrote:
>>>> Also you can have a look at anacron.
>>>>
>>>>
>>>>
>>>
>>>
>>> Unfortunately, anacron doesn't suit my needs at all. Here's how anacron
>>> works:
>>>
>>> this bunch of jobs will all happen today regardless of what time it is.
>>> That's not what I need, I need something that has very little to do with
>>> time. Example:
>>>
>>> 1. Start backup job on db server A
>>> 2. When complete, copy backup to server B and do a test import
>>> 3. If import succeeds, move backup to permanent storage and log the fact
>>> 4. If import fails, raise an alert and trigger the whole cycle to start
>>> again at 1
>>>
>>> Meanwhile,
>>>
>>> 1. All servers are regularly doing apt-get update and downloading .debs,
>>> and applying security packages. Delay this on the db server if a backup
>>> is in progress.
>>>
>>> Meanwhile there is the regular Friday 5am code-publish cycle and
>>> month-end finance runs - this is a DevOps environment.
>> 
>> I'm not sure if it's quite what you have in mind, and it comes with a bit of a steep learning curve, but cfengine might fit the bill.
>> 
>> http://cfengine.com
>
>Hi Bruce,
>
>Thanks for the reply.
>
>I only worked with cfengine once, briefly, years ago, and we quickly
>decided to roll our own deployment solution to solve that very specific
>vertical problem.
>
>
>Isn't cfengine a deployment framework, similar in ideals to puppet and
>chef?
>
>I don't want to deploy code or manage state, I want to run code
>(backups, database maintenance, repair of dodgy data in databases and
>code publish in a devops environment)

Cfengine can run arbitrary commands at scheduled times, so it is capable of replacing cron. It also has package management built in for your package updates.

It is in the same vein as chef & puppet, but "deployment framework" is not the way I would describe it. Deployment is only a subset of what you can do with it.

Cfengine3 was a major rewrite over version 2. The community edition is open source and should be available in Debian. The Gentoo ebuild is a bit out of date currently. It also comes as a supported enterprise version which adds some sort of framework around the core - I've never personally looked into the enterprise features though.

Bruce

-- 
:B


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [gentoo-user] Re: Recommendations for scheduler
  2014-08-03 13:36             ` J. Roeleveld
  2014-08-03 20:04               ` Alan McKinnon
@ 2014-08-04  8:41               ` Martin Vaeth
  2014-08-04  9:02                 ` J. Roeleveld
  1 sibling, 1 reply; 52+ messages in thread
From: Martin Vaeth @ 2014-08-04  8:41 UTC (permalink / raw
  To: gentoo-user

J. Roeleveld <joost@antarean.org> wrote:
>
> With the kind of schedules I am working with (and I believe Alan will
> also end up with), restarting the whole process from the start can
> lead to issues.
> Finding out how far the process got before the service crashed can become
> rather complex.

I am not sure whether I understand this correctly:
schedule has no problem displaying which tasks have
finished, failed, or are still running at any time.
Of course, a finer granularity than tasks is not possible ("how far
has a certain task got?") because this would require knowledge
about the task and how to check it. You need to be able to
split your tasks into more shell commands to make a finer granularity
available to "schedule".

You can just rerun your "driving" script with the effect that the
tasks which have already finished/failed will not actually be
restarted; they behave as if they finished immediately
and report that they finished/failed. (If you plan to do this,
I would suggest scheduling things like "sleep" as separate tasks,
too, rather than building them into the "driving" script.)
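
The re-run behaviour described above can be modelled roughly as
follows. In this sketch a plain dictionary stands in for the server's
record of job states; run_once and drive are hypothetical names, and
nothing here calls the real tool.

```python
def run_once(name, action, state):
    """Run `action` unless `state` already holds a result for `name`.

    `state` maps task names to exit statuses and plays the role of the
    server's job list in this sketch.
    """
    if name in state:
        return state[name]  # behaves as if the task finished immediately
    state[name] = action()
    return state[name]


def drive(tasks, state):
    """Run (name, action) tasks in order; return the first failed name."""
    for name, action in tasks:
        if run_once(name, action, state) != 0:
            return name
    return None


if __name__ == "__main__":
    state = {}
    tasks = [
        ("backup", lambda: 0),
        ("test-import", lambda: 1),  # pretend the import fails
        ("archive", lambda: 0),
    ]
    print("first run stopped at:", drive(tasks, state))
    print("second run stopped at:", drive(tasks, state))
    print("recorded state:", state)
```

Deleting an entry from the state before re-running corresponds to
replacing a failed job with a freshly queued one.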

If there is an unexpected problem and, e.g., you want to re-run
a failed task anyway, you can just re-queue your new task in
the same place as the previous task, e.g.
	schedule remove jobnr
	schedule -j jobnr queue command to do your task
Then the old job (and its state) is replaced by the newly queued job,
and your driving script (identical to before) will start it instead
of assuming that the job has already finished.

In order to avoid races, I would recommend doing the above only
while your driving script is not running (e.g., you can suspend it
with Ctrl-z if you have written it in (...) or if it is really a
"classical" script, and then continue it with "fg"; or you can even
stop it completely with Ctrl-c and re-run it, depending on what you
want): The problem is that between the above two commands the jobs
after "jobnr" are renumbered.
Alternatively, you can insert your new job at the end of the joblist
and then use something like (untested)
	schedule -jjobnr insert 0 jobnr+1:-1
	schedule remove 0
to re-sort your job list: The "insert" is race-free,
and having an extra job at the end for some time will hopefully not
disturb anything.
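
To see why the remove/queue pair is racy, here is a toy model in Python.
It is not how "schedule" is implemented; the helper names are made up and
job numbers are simply assumed to be 1-based positions in a list:

```python
# Toy model only: job numbers are 1-based positions in a list.
# This is NOT the "schedule" implementation; it only illustrates renumbering.

def remove(jobs, jobnr):
    """Remove job number jobnr; every later job moves up by one."""
    del jobs[jobnr - 1]

def queue_at(jobs, jobnr, command):
    """Queue a new command so that it becomes job number jobnr."""
    jobs.insert(jobnr - 1, command)

def number_of(jobs, command):
    """Return the current job number of a command."""
    return jobs.index(command) + 1

jobs = ["extract", "transform", "load", "report"]
before = number_of(jobs, "load")      # job 3

remove(jobs, 2)                       # drop the failed "transform"
during = number_of(jobs, "load")      # job 2: anything asking for job 3 now gets "report"

queue_at(jobs, 2, "transform-retry")
after = number_of(jobs, "load")       # job 3 again

print(before, during, after)          # prints: 3 2 3
```

A driving script that looks up a job by number between the two steps would
pick the wrong job, which is why it should not be running at that moment.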

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-04  8:41               ` Martin Vaeth
@ 2014-08-04  9:02                 ` J. Roeleveld
  2014-08-04 10:11                   ` Martin Vaeth
  0 siblings, 1 reply; 52+ messages in thread
From: J. Roeleveld @ 2014-08-04  9:02 UTC (permalink / raw
  To: gentoo-user

On 4 August 2014 10:41:04 CEST, Martin Vaeth <martin@mvath.de> wrote:
>J. Roeleveld <joost@antarean.org> wrote:
>>
>> With the kind of schedules I am working with (and I believe Alan will
>> also end up with), restarting the whole process from the start can
>> lead to issues.
>> Finding out how far the process got before the service crashed can become
>> rather complex.
>
>I am not sure whether I understand this correctly:

The schedules I am used to dealing with easily span 8 - 14 hours, and occasionally even more than a week.
These schedules can't be restarted from the beginning when they stop halfway through without risking massive consistency problems in the final data.

On top of that, several of them start at random times, and occasionally a whole bunch of runs of the same schedule are put into the queue, each depending on the previous run.

If, during that time, one of the machines has a hardware failure or the scheduling process crashes on one or more of the servers, the last state needs to be recoverable.

If you have to clean up the environment and bring it back to a state where you can restart the schedules, it saves time if you know which commands and tasks were actually running at the time.

For this, the schedules, queues and current state for each node need to be stored on persistent storage.
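
A minimal sketch of what "state on persistent storage" could look like,
using Python's built-in sqlite3 module. The table layout, the state names
and the node/job names are all assumptions made for illustration; no
existing scheduler is being described here:

```python
import os
import sqlite3
import tempfile

def open_store(path):
    """Open (or create) the on-disk state store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS job_state ("
        " node TEXT NOT NULL,"
        " job TEXT NOT NULL,"
        " state TEXT NOT NULL,"   # queued / running / finished / failed
        " PRIMARY KEY (node, job))"
    )
    db.commit()
    return db

def set_state(db, node, job, state):
    """Record a state change and commit it before returning."""
    with db:  # commits on success, rolls back on an exception
        db.execute(
            "INSERT OR REPLACE INTO job_state (node, job, state) VALUES (?, ?, ?)",
            (node, job, state),
        )

def unfinished(db):
    """Jobs that were queued or running when state was last written."""
    rows = db.execute(
        "SELECT node, job, state FROM job_state"
        " WHERE state IN ('queued', 'running') ORDER BY node, job"
    )
    return rows.fetchall()

# Demo: write some state, "crash", reopen the file and see what was in flight.
path = os.path.join(tempfile.mkdtemp(), "state.db")
db = open_store(path)
set_state(db, "etl1", "extract_sales", "finished")
set_state(db, "etl1", "load_sales", "running")
db.close()                     # stands in for a scheduler crash or restart

db = open_store(path)
print(unfinished(db))          # the jobs somebody has to look at after the crash
```

Because every state change is committed before the function returns, the
file reflects the last known state even if the scheduling process dies.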

Hope this clarifies it all a bit.
--
Joost


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [gentoo-user] Re: Recommendations for scheduler
  2014-08-04  9:02                 ` J. Roeleveld
@ 2014-08-04 10:11                   ` Martin Vaeth
  2014-08-04 10:40                     ` J. Roeleveld
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Vaeth @ 2014-08-04 10:11 UTC (permalink / raw
  To: gentoo-user

J. Roeleveld <joost@antarean.org> wrote:
>
> These schedules then also can't be restarted from the beginning
> when they stop halfway through without risking massive consistency
> problems in the final data.

So you have a command which might break due to hardware error
and cannot be rerun. I cannot see how any general-purpose scheduler
might help you here: You either need to be able to split your command
into several (sequential) commands or you need something adapted
for your particular command.

> And then multiple of those starting at random times with
> occasionally a whole bunch of the same schedule put into the
> queue with dependencies to the previous run.

That's not a problem. It only becomes a problem if the
granularity of one command is not fine enough.

> If, during that time, one of the machines has a hardware failure
> or the scheduling process crashes on one or more of the servers,
> the last state needs to be recoverable.

One must distinguish two cases:

1. The machine running "schedule-server" has a hardware failure.
   (Let us assume that "schedule-server" does not have a software failure -
   otherwise, you have problems anyway.)
2. Some other machine has a hardware failure.

Case 2. is not bad (as far as the scheduling is concerned): Of course, the
machine will not report that it completed the job, and you will
have to think about how to complete the job. But it is clear that in
such exceptional cases you have to intervene manually in some sense.

In order to deal with case 1., you can regularly (e.g. every minute)
dump the output of "schedule list" (possibly suppressing unimportant
data through the options to keep it short).
One could add a logging option to shrink the possible race window of 1 minute,
but in case of hardware failure a possible race cannot be excluded anyway.

In case 1. you manually have to re-queue the jobs and think what to do
with the already started jobs. However, I cannot imagine that this
occurs so frequently that this exceptional case becomes something
one should seriously think about.
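
A minimal sketch of such a periodic dump in Python. Only the command
"schedule list" comes from the text above; the target path, the interval
and the function names are assumptions. The snapshot is written to a
temporary file first and then renamed, so a crash in the middle of a dump
never leaves a half-written file behind:

```python
import os
import subprocess
import tempfile
import time

def snapshot(command, target):
    """Run command and atomically replace target with its output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    directory = os.path.dirname(os.path.abspath(target))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(result.stdout)
        handle.flush()
        os.fsync(handle.fileno())   # make sure the data is on disk
    os.replace(tmp, target)         # readers see the old or the new file, never half
    return result.stdout

def dump_forever(command, target, interval=60):
    """Refresh the snapshot every interval seconds until interrupted."""
    while True:
        snapshot(command, target)
        time.sleep(interval)

# Hypothetical usage (path and interval are assumptions):
# dump_forever(["schedule", "list"], "/var/tmp/schedule-list.txt")
```

The file should of course live on a different machine or disk than the one
whose failure you want to survive.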

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-04 10:11                   ` Martin Vaeth
@ 2014-08-04 10:40                     ` J. Roeleveld
  2014-08-04 13:31                       ` Martin Vaeth
  0 siblings, 1 reply; 52+ messages in thread
From: J. Roeleveld @ 2014-08-04 10:40 UTC (permalink / raw
  To: gentoo-user

On Monday, August 04, 2014 10:11:41 AM Martin Vaeth wrote:
> J. Roeleveld <joost@antarean.org> wrote:
> > These schedules then also can't be restarted from the beginning
> > when they stop halfway through without risking massive consistency
> > problems in the final data.
> 
> So you have a command which might break due to hardware error
> and cannot be rerun. I cannot see how any general-purpose scheduler
> might help you here: You either need to be able to split your command
> into several (sequential) commands or you need something adapted
> for your particular command.

A general-purpose scheduler can work, as they do exist. (With a price tag)
In the OSS world, there is, to my knowledge, none.
Yours seems to be the most promising as it looks like the missing features 
shouldn't be too difficult to add.

The commands are relatively simple, but they deal with large amounts of data. 
I am talking about ETL processes that, due to the amount of data being 
processed, can easily take several hours per step.
If, during one of these steps, the database or ETL process suffers a crash, 
the activities of the ETL process need to be rolled back to the point where 
you can restart it.

I am not talking about simple schedules related to day-to-day maintenance of a 
few servers.

> > And then multiple of those starting at random times with
> > occasionally a whole bunch of the same schedule put into the
> > queue with dependencies to the previous run.
> 
> That's not a problem. Only if the granularity of one command is
> not fine enough, it becomes a problem.

If nothing happens, it can all be stuck into a single script and the end 
result will be the same. Problems start because the real world is not 100% 
reliable.

> > If, during that time, one of the machines has a hardware failure
> > or the scheduling process crashes on one or more of the servers,
> > the last state needs to be recoverable.
> 
> One must distinguish two cases:
> 
> 1. The machine running "schedule-server" has a hardware failure.
>    (Let us assume tha "schedule-server" does not have a software failure -
>    otherwise, you have problems anyway.)
> 2. Some other machine has a hardware failure.
> 
> Case 2. is not bad (as concerns the scheduling): Of course, the
> machine will not report that it completed the job, and you will
> have to think how to complete the job. But it is clear that in
> such exceptional cases you have to interfere manually in some sense.

Agreed, this happens more often than you might think.

> In order to deal with case 1., you can regularly (e.g. each minute)
> dump the output of "schedule list" (possibly suppressing non-important
> data through the options to keep it short).

Or all the necessary information is kept in-sync on persistent storage. This 
would then also allow easy fail-over if the master-schedule-node fails. A 2nd 
machine could quickly take over.

> One could add a logging option to decrease the possible race of 1 minute,
> but in case of hardware failure a possible race cannot be excluded anyway.
> 
> In case 1. you manually have to re-queue the jobs and think what to do
> with the already started jobs. However, I cannot imagine that this
> occurs so frequently that this exceptional case becomes something
> one should seriously think about.

As I mentioned above, with BI infrastructure (large databases, complex ETL
processes, interactive report services, ...), the scheduler is busy 24/7. The
number of tasks, schedules, dependencies, states, ... that need to be kept track
of can easily lead to unforeseen issues and bugs.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [gentoo-user] Re: Recommendations for scheduler
  2014-08-04 10:40                     ` J. Roeleveld
@ 2014-08-04 13:31                       ` Martin Vaeth
  2014-08-04 13:35                         ` Alan McKinnon
  2014-08-04 19:54                         ` J. Roeleveld
  0 siblings, 2 replies; 52+ messages in thread
From: Martin Vaeth @ 2014-08-04 13:31 UTC (permalink / raw
  To: gentoo-user

J. Roeleveld <joost@antarean.org> wrote:
>>
>> So you have a command which might break due to hardware error
>> and cannot be rerun. I cannot see how any general-purpose scheduler
>> might help you here: You either need to be able to split your command
>> into several (sequential) commands or you need something adapted
>> for your particular command.
>
> A general-purpose scheduler can work, as they do exist.

I doubt that they can solve your problem.
Let me repeat: You have a single program which accesses the database
in a complex way and somewhere in the course of accessing it, the
machine (or program) crashes.
No general-purpose program can recover from this: You need
particular knowledge of the database and the program if you even
want to have a *chance* to recover from such a situation.
A program with such a particular knowledge can hardly be called
"general-purpose".

> If, during one of these steps, the database or ETL process suffers a
> crash, the activities of the ETL process need to be rolled back to
> the point where you can restart it.

I agree, but you need particular knowledge of the database and
your tasks to do this, which is far beyond the job of a scheduler.
As already mentioned by someone in this thread, your problem needs
to be solved on the level of the database (using
snapshot capabilities etc.).

>> In order to deal with case 1., you can regularly (e.g. each minute)
>> dump the output of "schedule list" (possibly suppressing non-important
>> data through the options to keep it short).
>
> Or all the necessary information is kept in-sync on persistent storage.
> This would then also allow easy fail-over if the master-schedule-node
> fails

No, it wouldn't, since jobs just finishing and wanting to report their
status cannot do this when there is no server. You would need a rather
involved protocol to deal with such situations dynamically.
It can certainly be done, but it is not something which can
easily be "added" as a feature: If this is required, it has to be the
fundamental concept from the very beginning and everything else has to
follow this first aim. You need different protocols than TCP sockets,
to start with; something like "dbus over IP" with servers being able
to announce their new presence, etc.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-04 13:31                       ` Martin Vaeth
@ 2014-08-04 13:35                         ` Alan McKinnon
  2014-08-04 19:46                           ` J. Roeleveld
  2014-08-04 19:54                         ` J. Roeleveld
  1 sibling, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-04 13:35 UTC (permalink / raw
  To: gentoo-user

On 04/08/2014 15:31, Martin Vaeth wrote:
> J. Roeleveld <joost@antarean.org> wrote:
>>>
>>> So you have a command which might break due to hardware error
>>> and cannot be rerun. I cannot see how any general-purpose scheduler
>>> might help you here: You either need to be able to split your command
>>> into several (sequential) commands or you need something adapted
>>> for your particular command.
>>
>> A general-purpose scheduler can work, as they do exist.
> 
> I doubt that they can solve your problem.
> Let me repeat: You have a single program which accesses the database
> in a complex way and somewhere in the course of accessing it, the
> machine (or program) crashes.
> No general-purpose program can recover from this: You need
> particular knowledge of the database and the program if you even
> want to have a *chance* to recover from such a situation.
> A program with such a particular knowledge can hardly be called
> "general-purpose".


Joost,

Either make the ETL tool pick up where it stopped and continue, as it is
the only one that knows what it was doing and how far it got. Or wrap the
entire script in a single transaction.


-- 
Alan McKinnon
alan.mckinnon@gmail.com



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-04 13:35                         ` Alan McKinnon
@ 2014-08-04 19:46                           ` J. Roeleveld
  2014-08-04 20:38                             ` Alan McKinnon
  0 siblings, 1 reply; 52+ messages in thread
From: J. Roeleveld @ 2014-08-04 19:46 UTC (permalink / raw
  To: gentoo-user

On 4 August 2014 15:35:41 CEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>On 04/08/2014 15:31, Martin Vaeth wrote:
>> J. Roeleveld <joost@antarean.org> wrote:
>>>>
>>>> So you have a command which might break due to hardware error
>>>> and cannot be rerun. I cannot see how any general-purpose scheduler
>>>> might help you here: You either need to be able to split your command
>>>> into several (sequential) commands or you need something adapted
>>>> for your particular command.
>>>
>>> A general-purpose scheduler can work, as they do exist.
>> 
>> I doubt that they can solve your problem.
>> Let me repeat: You have a single program which accesses the database
>> in a complex way and somewhere in the course of accessing it, the
>> machine (or program) crashes.
>> No general-purpose program can recover from this: You need
>> particular knowledge of the database and the program if you even
>> want to have a *chance* to recover from such a situation.
>> A program with such a particular knowledge can hardly be called
>> "general-purpose".
>
>
>Joost,
>
>Either make the ETL tool pick up where it stopped and continue as it is
>the only that knows what it was doing and how far it got. Or, wrap the
>entire script in a single transaction.

Alan,

That would be the ideal solution.
However, a single transaction dealing with around 500,000,000 rows will get me shot by the DBAs :)
(Never mind that the performance of this will be such that having it all done by an office full of secretaries might be quicker.)

Having the ETL process clever enough to be able to pick up from any point requires a degree of forward thinking and planning that is never done in real life.
I would love to design it like that as it isn't too difficult. But I always get brought into these projects when implementing these structures will require a full rewrite and getting the original architects to admit their design can't be made restartable without human intervention.

At which point the business simply says it is acceptable to have people do a manual rollback and restart the schedules from wherever it went wrong.

I'm sure your wife has similar experiences as this is why these projects are always late to deliver and over budget.

--
Joost

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-04 13:31                       ` Martin Vaeth
  2014-08-04 13:35                         ` Alan McKinnon
@ 2014-08-04 19:54                         ` J. Roeleveld
  2014-08-05  6:33                           ` Martin Vaeth
  1 sibling, 1 reply; 52+ messages in thread
From: J. Roeleveld @ 2014-08-04 19:54 UTC (permalink / raw
  To: gentoo-user

On 4 August 2014 15:31:40 CEST, Martin Vaeth <martin@mvath.de> wrote:
>J. Roeleveld <joost@antarean.org> wrote:
>>>
>>> So you have a command which might break due to hardware error
>>> and cannot be rerun. I cannot see how any general-purpose scheduler
>>> might help you here: You either need to be able to split your command
>>> into several (sequential) commands or you need something adapted
>>> for your particular command.
>>
>> A general-purpose scheduler can work, as they do exist.
>
>I doubt that they can solve your problem.
>Let me repeat: You have a single program which accesses the database
>in a complex way and somewhere in the course of accessing it, the
>machine (or program) crashes.
>No general-purpose program can recover from this: You need
>particular knowledge of the database and the program if you even
>want to have a *chance* to recover from such a situation.
>A program with such a particular knowledge can hardly be called
>"general-purpose".

The scheduler needs to be able to show which process failed/didn't finish. 
Then humans need to ensure that part finishes/reruns properly.
Then humans need to be able to mark the failed process as succeeded.

At which point the scheduler continues with the schedule(s)
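
A minimal Python sketch of that workflow, with made-up step names and an
in-memory state dictionary standing in for whatever the scheduler really
stores. The run stops at the first failing step, a human repairs things and
marks the step as succeeded, and the next run continues after it:

```python
def run_schedule(steps, state):
    """Run steps in order, skipping those already marked 'succeeded'.

    steps is a list of (name, callable) pairs, state maps name -> status.
    Returns the name of the step that failed, or None if all succeeded.
    """
    for name, action in steps:
        if state.get(name) == "succeeded":
            continue
        state[name] = "running"
        try:
            action()
        except Exception:
            state[name] = "failed"
            return name
        state[name] = "succeeded"
    return None

calls = []

def extract():
    calls.append("extract")

def load():
    calls.append("load")
    raise RuntimeError("database went away")

def report():
    calls.append("report")

steps = [("extract", extract), ("load", load), ("report", report)]
state = {}

failed = run_schedule(steps, state)   # stops at "load"
state[failed] = "succeeded"           # operator finished the load by hand
run_schedule(steps, state)            # only "report" is still to be run
print(calls)                          # prints: ['extract', 'load', 'report']
```

The scheduler never needs to know how the load was repaired; it only needs
a way to be told that it was.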

>> If, during one of these steps, the database or ETL process suffers a
>> crash, the activities of the ETL process need to be rolled back to
>> the point where you can restart it.
>
>I agree, but you need particular knowledge of the database and
>your tasks to do this which is far beyond the job of a scheduler.
>As already mentioned by someone in this thread, your problem needs
>to be solved on the level of the database (using
>snapshopt capabilities etc.)

Or human intervention, which requires a clear indication of where it went wrong and a simple action to continue the schedule from where it stopped once those humans have solved the issues and ensured consistency.

>>> In order to deal with case 1., you can regularly (e.g. each minute)
>>> dump the output of "schedule list" (possibly suppressing non-important
>>> data through the options to keep it short).
>>
>> Or all the necessary information is kept in-sync on persistent storage.
>> This would then also allow easy fail-over if the master-schedule-node
>> fails
>
>No, it wouldn't, since jobs just finishing and wanting to report their
>status cannot do this when there is no server. You would need a rather
>involved protocol to deal with such situations dynamically.
>It can certainly be done, but it is not something which can
>easily be "added" as a feature: If this is required, it has to be the
>fundamental concept from the very beginning and everything else has to
>follow this first aim. You need different protocols than TCP sockets,
>to start with; something like "dbus over IP" with servers being able
>to announce their new presence, etc.

I think it's doable with standard networking protocols.
But either you have a master server which controls everything, or you have a master process which has failover functionality using classical distributed software techniques.

These emails are actually quite useful, as I am getting a clear picture in my head of how I could approach this properly.

Thanks,

Joost



^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-04 19:46                           ` J. Roeleveld
@ 2014-08-04 20:38                             ` Alan McKinnon
  2014-08-05 11:42                               ` J. Roeleveld
  0 siblings, 1 reply; 52+ messages in thread
From: Alan McKinnon @ 2014-08-04 20:38 UTC (permalink / raw
  To: gentoo-user

On 04/08/2014 21:46, J. Roeleveld wrote:
> On 4 August 2014 15:35:41 CEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
>> On 04/08/2014 15:31, Martin Vaeth wrote:
>>> J. Roeleveld <joost@antarean.org> wrote:
>>>>>
>>>>> So you have a command which might break due to hardware error
>>>>> and cannot be rerun. I cannot see how any general-purpose scheduler
>>>>> might help you here: You either need to be able to split your command
>>>>> into several (sequential) commands or you need something adapted
>>>>> for your particular command.
>>>>
>>>> A general-purpose scheduler can work, as they do exist.
>>>
>>> I doubt that they can solve your problem.
>>> Let me repeat: You have a single program which accesses the database
>>> in a complex way and somewhere in the course of accessing it, the
>>> machine (or program) crashes.
>>> No general-purpose program can recover from this: You need
>>> particular knowledge of the database and the program if you even
>>> want to have a *chance* to recover from such a situation.
>>> A program with such a particular knowledge can hardly be called
>>> "general-purpose".
>>
>>
>> Joost,
>>
>> Either make the ETL tool pick up where it stopped and continue as it is
>> the only that knows what it was doing and how far it got. Or, wrap the
>> entire script in a single transaction.
> 
> Alan,
> 
> That would be the ideal solution.

You have the same concerns I do - how do you make a transaction around
500 million rows? So I asked the in-house expert - Mrs Alan :-)


> However, a single transaction dealing with around 500,000,000 rows will get me shot by the DBAs :)
> (Never mind that the performance of this will be such that having it all done by an office full of secretaries might be quicker.)

She reckons an ETL job *must* be self-contained; if it isn't then it's
broken by design. It must be idempotent too, which can be as simple as
"Truncate, Load, Commit"

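A toy-scale sketch of "Truncate, Load, Commit" with Python's built-in
sqlite3 module. The table and column names are invented, SQLite has no
TRUNCATE so a plain DELETE stands in for it, and whether a real TRUNCATE
can be rolled back depends on the database. It says nothing about how this
behaves with 500 million rows:

```python
import sqlite3

def reload_table(db, rows):
    """Replace the whole target table inside one transaction."""
    with db:  # commit on success, roll back on any exception
        db.execute("DELETE FROM sales")  # stand-in for TRUNCATE
        db.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount INTEGER)")

source = [(1, 100), (2, 250)]
reload_table(db, source)
reload_table(db, source)      # running it twice gives the same result

try:
    reload_table(db, [(1, 1), (1, 2)])   # duplicate key: the load fails midway
except sqlite3.IntegrityError:
    pass                                 # the DELETE was rolled back as well

print(db.execute("SELECT id, amount FROM sales ORDER BY id").fetchall())
# prints: [(1, 100), (2, 250)]
```

Restarting such a job from the beginning is always safe, because a failed
run leaves the table exactly as the last successful run left it.
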
> Having the ETL process clever enough to be able to pick up from any point requires a degree of forward thinking and planning that is never done in real life.
> I would love to design it like that as it isn't too difficult. But I always get brought into these projects when implementing these structures will require a full rewrite and getting the original architects to admit their design can't be made restartable without human intervention.


I agree with that design actually - it's the job of the hardware and OS
guys to make stuff reliable that the application layer can rely on. When
a SAN connection goes away, it usually comes back and the app layer just
carries on (never mind that it retried 100 times meanwhile).

Sometimes this doesn't work out. The easiest, cheapest and quickest way
to handle it is to just restart the whole job from the beginning. This
offends the engineer in us sometimes, but it really is the best way and
all of Unix is built on this very idea :-)

If the SAN goes away too often and it causes issues, then maybe the best
approach is to get the SAN and facilities guys to get their act together.

> At which point the business simply says it is acceptable to have people do a manual rollback and restart the schedules from wherever it went wrong.

Exactly. One of the few cases where business has the correct idea.
There are only so many pennies to spend and so many dollars to be delivered.


> 
> I'm sure your wife has similar experiences as this is why these projects are always late to deliver and over budget.

She says her projects are subject to the same universal inviolate rule
as mine:

time and cost are always the best engineering estimate times pi

We learn to deal with it. Which brings us back to Martin's initial
statement: a scheduler cannot deal with any of this, the job itself
must. It's an unpredictable event and schedulers can only deal with
predictable events.


-- 
Alan McKinnon
alan.mckinnon@gmail.com



^ permalink raw reply	[flat|nested] 52+ messages in thread

* [gentoo-user] Re: Recommendations for scheduler
  2014-08-04 19:54                         ` J. Roeleveld
@ 2014-08-05  6:33                           ` Martin Vaeth
  2014-08-05 11:32                             ` J. Roeleveld
  0 siblings, 1 reply; 52+ messages in thread
From: Martin Vaeth @ 2014-08-05  6:33 UTC (permalink / raw
  To: gentoo-user

J. Roeleveld <joost@antarean.org> wrote:
>>
>>No, it wouldn't, since jobs just finishing and wanting to report their
>>status cannot do this when there is no server. You would need a rather
>>involved protocol to deal with such situations dynamically.
>>It can certainly be done, but it is not something which can
>>easily be "added" as a feature: If this is required, it has to be the
>>fundamental concept from the very beginning and everything else has to
>>follow this first aim. You need different protocols than TCP sockets,
>>to start with; something like "dbus over IP" with servers being able
>>to announce their new presence, etc.
>
> I think it's doable with standard networking protocols.

Yes, you can "tunnel" such a protocol over existing protocols,
but "essentially" you must use a different one.
Unless you want a static setup (use server A; if that fails, use
server B, and server A reports everything to server B),
it cannot be done so simply that only one port needs to be
open on the server: The client also needs a port open
to be informed about the "current" server. Even worse, you need
a "daemon" running for each client to handle this port.
In such a case, you might make each client its own server,
by spreading all changes to all clients immediately.

> But, either you have a master server which controls everything.
> Or you have a master process which has failover functionality
> using classical distributed software techniques.

This summarizes it quite well.
The concept of my "schedule" is to follow the first path (with the
advantage of being simple, having only one part, and clients doing
nothing while their "task" is running).
If you want to follow the latter, you need a rather different CLI
and a different protocol - which is practically everything "schedule"
consists of; so it is probably simpler to rewrite this from scratch.
As I said: It is not a "feature" you can easily add later on; it is a
fundamental decision you must make at the very beginning.
When you are at it, you should probably also encrypt the communication
and establish methods for authentication, which is also something
I currently omitted in "schedule" for simplicity (although this might
be easier to add later on).

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-05  6:33                           ` Martin Vaeth
@ 2014-08-05 11:32                             ` J. Roeleveld
  2014-08-08 23:21                               ` Martin Vaeth
  0 siblings, 1 reply; 52+ messages in thread
From: J. Roeleveld @ 2014-08-05 11:32 UTC (permalink / raw
  To: gentoo-user

On Tuesday, August 05, 2014 06:33:59 AM Martin Vaeth wrote:
> J. Roeleveld <joost@antarean.org> wrote:
> >>No, it wouldn't, since jobs just finishing and wanting to report their
> >>status cannot do this when there is no server. You would need a rather
> >>involved protocol to deal with such situations dynamically.
> >>It can certainly be done, but it is not something which can
> >>easily be "added" as a feature: If this is required, it has to be the
> >>fundamental concept from the very beginning and everything else has to
> >>follow this first aim. You need different protocols than TCP sockets,
> >>to start with; something like "dbus over IP" with servers being able
> >>to announce their new presence, etc.
> >>
> > I think it's doable with standard networking protocols.
> 
> Yes, you can "tunnel" such a protocol over existing protocols,
> but "essentially" you must use a different one.
> Unless you want a static setup (use server A, if that fail use
> server B, and server A reports everything to server B)
> it cannot be done in a simple way that you have only
> one port open on the server: The client also needs a port open
> to be informed about the "current" server. Even worse, you need
> a "daemon" running for each client to handle this port.
> In such a case, you might make each client its own server,
> by spreading all changes to all clients immediately.

Not necessarily: the client listens on a port and the server connects to the
clients it maintains. It then also knows when a client is dead and the
corresponding jobs have an issue.
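
A rough sketch of the "server connects to the clients" idea in Python,
using nothing but plain TCP. The function names and the client table are
hypothetical. Note that a successful connect only shows that something is
listening on the client's port, not that the job itself is healthy:

```python
import socket

def is_alive(host, port, timeout=1.0):
    """Return True if a TCP connection to the client's port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def dead_clients(clients, timeout=1.0):
    """clients maps a name to (host, port); return the names that did not answer."""
    return sorted(
        name
        for name, (host, port) in clients.items()
        if not is_alive(host, port, timeout)
    )
```

The server would call dead_clients() at regular intervals and flag the jobs
assigned to any client in the returned list.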

> > But, either you have a master server which controls everything.
> > Or you have a master process which has failover functionality
> > using classical distributed software techniques.
> 
> This summarizes it quite good.
> The concept of my "schedule" is to follow the first path (with the
> advantage of being simple, having only one part, clients do nothing
> while their "task" is runnning).
> If you want to follow the latter, you need a rather different CLI
> and a different protocol - which is practically everything "schedule"
> consists of; so it is probably simpler to rewrite this from scratch.
> As I said: It is not a "feature" you can easily add later on; it is a
> fundamental decision you must choose from the very beginning.
> When you are at it you should probably also encrypt the communication
> and establish methods for authentification which is also something
> I currently omitted in "schedule" for simplicity (although this might
> be easier to add later on).

I agree. "schedule" is good for most uses we might encounter. For the business 
case I have, I will need to write something myself.

Thanks to this discussion we've been having, I now have a much better idea on 
how to approach this project. For that I am very thankful.

--
Joost


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-04 20:38                             ` Alan McKinnon
@ 2014-08-05 11:42                               ` J. Roeleveld
  0 siblings, 0 replies; 52+ messages in thread
From: J. Roeleveld @ 2014-08-05 11:42 UTC (permalink / raw
  To: gentoo-user

On Monday, August 04, 2014 10:38:57 PM Alan McKinnon wrote:
> On 04/08/2014 21:46, J. Roeleveld wrote:
> > On 4 August 2014 15:35:41 CEST, Alan McKinnon <alan.mckinnon@gmail.com> wrote:
> >> Either make the ETL tool pick up where it stopped and continue as it is
> >> the only that knows what it was doing and how far it got. Or, wrap the
> >> entire script in a single transaction.
> > 
> > Alan,
> > 
> > That would be the ideal solution.
> 
> You have the same concerns I do - how do you make a transaction around
> 500 million rows. So I asked the in-house expert - Mrs Alan :-)

Have a very large temporary tablespace on the database server.

> > However, a single transaction dealing with around 500,000,000 rows will
> > get me shot by the DBAs :) (Never mind that the performance of this will
> > be such that having it all done by an office full of secretaries might be
> > quicker.)
> She reckons an ETL job *must* be self-contained; if it isn't then it's
> broken by design. It must be idempotent too, which can be as simple as
> "Truncate, Load, Commit"

Most common tactic (done by humans):
- delete from <target table> where INS_PCS_ID = <crashed run-id>;
- update target table set VLD_TO = null where UPD_PCS_ID = <crashed run-id>;
Then, restart the crashed run-id.

For this, you need to know which command failed, so that you can find the 
actual run-id you need to roll back.
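
A minimal sketch of the two statements above as one scripted step, assuming the history columns named in this mail (INS_PCS_ID, UPD_PCS_ID, VLD_TO). The function name `rollback_run`, the table `target` and the SQLite backend are illustrative assumptions, not part of any real ETL tool; a production warehouse would use its own database driver.

```python
import sqlite3


def rollback_run(conn, table, run_id):
    """Undo the effects of one crashed ETL run on a history-tracked table."""
    # Table names cannot be bound as parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    with conn:  # one transaction: both statements commit together or not at all
        # Remove the rows the crashed run inserted.
        deleted = conn.execute(
            f"DELETE FROM {table} WHERE INS_PCS_ID = ?", (run_id,)
        ).rowcount
        # Re-open the rows the crashed run had closed off.
        reopened = conn.execute(
            f"UPDATE {table} SET VLD_TO = NULL WHERE UPD_PCS_ID = ?", (run_id,)
        ).rowcount
    return deleted, reopened


# Hypothetical demo data: run 42 crashed after touching rows 2 and 3.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target "
    "(id INTEGER, VLD_TO TEXT, INS_PCS_ID INTEGER, UPD_PCS_ID INTEGER)"
)
conn.executemany(
    "INSERT INTO target VALUES (?, ?, ?, ?)",
    [
        (1, None, 41, None),        # untouched by run 42
        (2, "2014-08-04", 41, 42),  # closed off by run 42
        (3, None, 42, None),        # inserted by run 42
    ],
)
print(rollback_run(conn, "target", 42))  # (1, 1)
```

Running both statements inside one transaction means a crash during the clean-up itself leaves the table as it was, so the clean-up can simply be repeated.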

> > Having the ETL process clever enough to be able to pick up from any point
> > requires a degree of forward thinking and planning that is never done in
> > real life. I would love to design it like that as it isn't too difficult.
> > But I always get brought into these projects when implementing these
> > structures will require a full rewrite and getting the original
> > architects to admit their design can't be made restartable without human
> > intervention.
> I agree with that design actually - it's the job of the hardware and OS
> guys to make stuff reliable that the application layer can rely on. When
> a SAN connection goes away, it usually comes back and the app layer just
> carries on (never mind that it retried 100 times meanwhile).

Yes, until you find out the clustered FS being used causes the crashes... 
(Yes, been in that situation...)

> Sometimes this doesn't work out. The easiest, cheapest and quickest way
> to handle it is to just restart the whole job from the beginning. This
> offends the engineer in us sometimes, but it really is the best way and
> all of Unix is built on this very idea :-)

Which is generally done, usually requiring a manual clean-up prior to 
restart. If done properly, the ETL process has the capability to roll back 
the failed run prior to redoing it.
This, however, requires extensive planning and design at the initial 
implementation phase.

> If the SAN goes away too often and it causes issues, then maybe the best
> approach is to get the SAN and facilities guys to get their act together

Instead of finger-pointing.

> > At which point the business simply says it is acceptable to have people do
> > a manual rollback and restart the schedules from wherever it went wrong.
> Exactly. One of the few cases where business has the correct idea.
> There's only so many pennies to spend and so many dollars to be delivered.

Nightly processes that fail and then have to wait for the day-shift to arrive 
often cost the business more because the reports are delayed.

> > I'm sure your wife has similar experiences as this is why these projects
> > are always late to deliver and over budget.
> She says her projects are subject to the same universal inviolate rule
> as mine:
> 
> time and cost is always best engineering estimate times pi

"Overhead, testing, maintenance, ....", yes, it all adds up.

> We learn to deal with it. Which brings us back to Martin's initial
> statement: a scheduler cannot deal with any of this, the job itself
> must. It's an unpredictable event and schedulers can only deal with
> predictable events

True, but keeping the schedules and state stored in a way to make it easy to 
find out how far the whole process got makes recovery simpler.
Otherwise it's often quicker to simply roll back the entire schedule and 
restart. Even if only the last 2 of the 50 commands didn't run yet.
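
To illustrate the point about stored state, here is a minimal sketch of a runner that records each finished step so a restart resumes where the previous attempt stopped. Everything in it (`run_schedule`, the JSON state file, the step names) is a hypothetical illustration and not how "schedule" or any other tool in this thread works; it also assumes each step cleans up after itself when it fails.

```python
import json
import os
import tempfile


def run_schedule(steps, state_path):
    """Run named steps in order, skipping those an earlier attempt completed."""
    done = []
    if os.path.exists(state_path):
        with open(state_path) as fh:
            done = json.load(fh)
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier attempt
        action()  # an exception here leaves the state file at the last good step
        done.append(name)
        # Write to a temp file and rename, so a crash never leaves half-written state.
        tmp = state_path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(done, fh)
        os.replace(tmp, state_path)
    return done


# Hypothetical three-step schedule whose middle step fails on the first attempt.
log = []
attempts = {"load": 0}


def flaky_load():
    attempts["load"] += 1
    if attempts["load"] == 1:
        raise RuntimeError("SAN went away")
    log.append("load")


steps = [
    ("extract", lambda: log.append("extract")),
    ("load", flaky_load),
    ("report", lambda: log.append("report")),
]

state = os.path.join(tempfile.mkdtemp(), "schedule-state.json")
try:
    run_schedule(steps, state)
except RuntimeError:
    pass  # first attempt dies in "load"; "extract" is already recorded
run_schedule(steps, state)  # second attempt resumes at "load"
print(log)  # ['extract', 'load', 'report']
```

The state file only says which steps completed; deciding how to undo a half-finished step is still the job's responsibility, as discussed above.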

--
Joost



* [gentoo-user] Re: Recommendations for scheduler
  2014-08-03  7:23           ` Joost Roeleveld
  2014-08-03 12:16             ` Alan McKinnon
@ 2014-08-05 19:57             ` James
  2014-08-05 20:43               ` J. Roeleveld
  1 sibling, 1 reply; 52+ messages in thread
From: James @ 2014-08-05 19:57 UTC (permalink / raw
  To: gentoo-user

Joost Roeleveld <joost <at> antarean.org> writes:

> > Mesos looks promising for a variety of (Apache) reasons. Some key
> > technologies folks may want to google that are related:
> > 
> > Quincy (fair scheduler)
> > Chronos (scheduler)
> > Hadoop (scheduler)
> 
> Hadoop is not a scheduler. It's a framework for a Big Data clustered
> database.

> > HDFS (clustered file system)
> Unless it's changed recently, not suitable for anything else than Hadoop 
> and contains a single point of failure.

I'm curious to learn more about this 'single point of failure'. Can
you be more specific or provide links?

On this resource: 

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html

The section on JournalNode machines talks about surviving faults:

"increase the number of failures the system can tolerate, you should run an
odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N
JournalNodes, the system can tolerate at most (N - 1) / 2 failures and
continue to function normally. "
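
The arithmetic in that quote can be checked with a few lines of Python. This is only a worked example of the (N - 1) / 2 rule using integer division; `tolerated_failures` is a made-up helper name, not a Hadoop API.

```python
def tolerated_failures(journal_nodes):
    """Number of JournalNode failures survivable per the (N - 1) / 2 rule."""
    if journal_nodes < 1:
        raise ValueError("need at least one JournalNode")
    return (journal_nodes - 1) // 2  # integer division rounds down


for n in (3, 4, 5, 6, 7):
    print(n, "JournalNodes tolerate", tolerated_failures(n), "failure(s)")
```

Going from 3 to 4 nodes does not raise the count, which is consistent with the advice to run an odd number.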

> 
> > http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
> > 
> > Zookeeper (Fault tolerance)
> > Spark (optimized for iterative jobs where a dataset is reused in many
> > parallel operations; advanced math/science and many other apps)
> > https://spark.apache.org/
> > 
> > Dryad  Torque   MPICH2 MPI
> > Globus Toolkit
> > 
> > mesos_tech_report.pdf
> > 
> > It looks as though Amazon, google, facebook and many others
> > large in the Cluster/Cloud arena are using Mesos......?
> > 
> > So let's all post what we find, particularly in overlays.
> 
> Unless you are dealing with Big Data projects, like Google, Facebook,
> Amazon, big banks,... you don't have much use for those projects.

Many scientific applications are using the cluster (cloud) or big data
approach to all sorts of problems. Furthermore, as GPU and the new
Arm systems with dozens and dozens of cpu cores inside one computer become
readily available, the cluster-cloud (big data) approach will become much
more pervasive in the next few years, imho.

http://blog.rescale.com/reservoir-simulation-moves-to-the-cloud/

There are thousands of small companies needing reservoir simulation, not to 
mention the millions of folks working on carbon sequestration.....
Anything to do with Biological or Chemical Science is using or moving
to the Cloud-Clustered world. For me, a Cluster is just a cloud managed
internally, rather than outsourced to others; ymmv.

> Mesos looks like a nice project, just like Hadoop and related are also 
> nice. But for most people, they are as useful as using Exalytics.

I'm not excited about an Oracle solution to anything. Many of the folks
I know consult on moving technologies away from Oracle's sphere of influence,
not limited to MySQL; ymmv. I know of one very large communications company
that went broke and had to merge because of those ridiculous Oracle fees.
Caveat Emptor; long live PostgreSQL.

> A scheduler should not have a large set of dependencies that you wouldn't
> use otherwise. That makes Chronos a non-option to me.

Those other technologies are often useful to folks who would be attracted to
something like chronos.

> Martin's project looks promising, but doesn't store the schedules 
> internally. For repeating schedules, like what Alan was describing, you 
> need to put those into scripts and start those from an existing cron.
> Of the 2, I think improving Martin's project is the most likely option 
> for me as it doesn't have additional dependencies and seems to be 
> easily implemented.
> Joost

Understood.
Like others, I'll be curious to follow what develops out of Martin's work.

For me Chronos, Mesos and the other aforementioned technologies look to be
more viable; particularly if one is preparing for a clustered world with
CPUs, GPUs, SoCs and Arm machines distributed about the ethernet  as
resources to be scheduled and utilized in a variety of schema. It's the
quest for one-infrastructure to solve many problems where scenarios compete. 

Big data is not the only reason for cloud-clusters. Theoretically,
(Clustered) systems can have a far greater resource utilization of networked
resources than traditional (distributed) approaches. I grant you that this
is a work in progress, but I personally know of dozens of mathematically
complex distributed systems that are  migrating to the clustered approach
rather than something custom or traditionally distributed.

Granted, Cloud <--> Clustered <--> Distributed are all overlapping approaches
to big problems. I do appreciate the candor of this thread.

James


* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-05 19:57             ` James
@ 2014-08-05 20:43               ` J. Roeleveld
  2014-08-05 21:29                 ` Alan McKinnon
  2014-08-06  8:29                 ` Peter Humphrey
  0 siblings, 2 replies; 52+ messages in thread
From: J. Roeleveld @ 2014-08-05 20:43 UTC (permalink / raw
  To: gentoo-user

On 5 August 2014 21:57:56 CEST, James <wireless@tampabay.rr.com> wrote:
>Joost Roeleveld <joost <at> antarean.org> writes:
>
>
>> > Mesos looks promising for a variety of (Apache) reasons. Some key
>> > technologies folks may want to google that are related:
>> > 
>> > Quincy (fair scheduler)
>> > Chronos (scheduler)
>> > Hadoop (scheduler)
>> 
>> Hadoop is not a scheduler. It's a framework for a Big Data clustered
>> database.
>
>> > HDFS (clustered file system)
>> Unless it's changed recently, not suitable for anything else than Hadoop
>> and contains a single point of failure.
>
>I'm curious to learn more about this 'single point of failure'.
>Can you be more specific or provide links?
>
>On this resource: 
>
>http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html
>
>The section on JournalNode machines talks about surviving faults:
>
>"increase the number of failures the system can tolerate, you should
>run an
>odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N
>JournalNodes, the system can tolerate at most (N - 1) / 2 failures and
>continue to function normally. "

Just read that part. Looks like they solved it partly since 2.2.
The problem lies with the NameNodes.
Prior to 2.2, you only had 1. If that one dies, you lose the entire cluster. If that one is unrecoverable, you lose all the data.

After 2.2, you can configure a standby NameNode. However, it still requires manual restart.

Considering that Hadoop is most often running on old machines, chances for hardware failure are higher when compared with clusters using newer hardware.

I'm not sure how other cluster FSs deal with this, but I consider it a design flaw if, when a single machine in a 100+ node cluster dies, the entire cluster ends up in a broken state.
It's like running a single Raid5 with 100+ drives.
Anyone stupid enough to do that deserves to lose their data.

>> > http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
>> > 
>> > Zookeeper (Fault tolerance)
>> > Spark (optimized for iterative jobs where a dataset is reused in many
>> > parallel operations; advanced math/science and many other apps)
>> > https://spark.apache.org/
>> > 
>> > Dryad  Torque   MPICH2 MPI
>> > Globus Toolkit
>> > 
>> > mesos_tech_report.pdf
>> > 
>> > It looks as though Amazon, google, facebook and many others
>> > large in the Cluster/Cloud arena are using Mesos......?
>> > 
>> > So let's all post what we find, particularly in overlays.
>> 
>> Unless you are dealing with Big Data projects, like Google, Facebook,
>> Amazon, big banks,... you don't have much use for those projects.
>
>Many scientific applications are using the cluster (cloud) or big data
>approach to all sorts of problems. Furthermore, as GPU and the new
>Arm systems with dozens and dozens of cpu cores inside one computer
>become
>readily available, the cluster-cloud (big data) approach will become
>much
>more pervasive in the next few years, imho.
>
>http://blog.rescale.com/reservoir-simulation-moves-to-the-cloud/
>
>There are thousands of small companies needing reservoir simulation,
>not to 
>mention the millions of folks working on carbon sequestration.....
>Anything to do with Biological or Chemical Science is using or moving
>to the Cloud-Clustered world. For me, a Cluster is just a cloud managed
>internally, rather than outsourced to others; ymmv.

My apologies. I forgot the scientific research here. But that was mostly because they have been dealing with really large datasets and correspondingly large compute clusters for decades.

The term Big Data is generally applied to financial and social media data.

>> Mesos looks like a nice project, just like Hadoop and related are also
>> nice. But for most people, they are as useful as using Exalytics.
>
>I'm not excited about an Oracle solution to anything. Many of the folks
>I know consult on moving technologies away from Oracle's sphere of
>influence, not limited to MySQL; ymmv. I know of one very large
>communications company that went broke and had to merge because of those
>ridiculous Oracle fees.
>Caveat Emptor; long live PostgreSQL.

I'd be interested in the name of that company. Even offlist.

And I definitely agree. PostgreSQL is often a valid alternative. Unfortunately, it is rarely possible to use it as a back end to enterprise software as these are all designed to be used with databases from the usual suspects (Oracle, IBM, Microsoft, ....)

Same goes for OSS projects. The developers are often unable to properly code the SQL layer and end up simply using MySQL and its broken SQL implementation.

>> A scheduler should not have a large set of dependencies that you
>wouldn't
>> use otherwise. That makes Chronos a non-option to me.
>
>Those other technologies are often useful to folks who would be
>attracted to
>something like chronos.

If you already use Mesos, using Chronos makes sense.
If you're only interested in a scheduler, installing Mesos just to use Chronos doesn't make sense.

>> Martin's project looks promising, but doesn't store the schedules 
>> internally. For repeating schedules, like what Alan was describing,
>you 
>> need to put those into scripts and start those from an existing cron.
>> Of the 2, I think improving Martin's project is the most likely
>option 
>> for me as it doesn't have additional dependencies and seems to be 
>> easily implemented.
>> Joost
>
>Understood.
>Like others, I'll be curious to follow what develops out of Martin's
>work.

I believe Martin's scheduler will be very valuable. Even for me.
I am very likely going to start using this for some of my regular maintenance activities on the home network.

But as the rest of the thread shows, I wouldn't be able to use it as a scheduler for large projects where the schedules can get very complex very quickly.

The type of scheduler needed for these requires a different approach, which would be overkill for the home network environment where Martin's excels. 

>For me Chronos, Mesos and the other aforementioned technologies look to
>be
>more viable; particularly if one is preparing for a clustered world
>with
>CPUs, GPUs, SoCs and Arm machines distributed about the ethernet  as
>resources to be scheduled and utilized in a variety of schema. It's the
>quest for one-infrastructure to solve many problems where scenarios
>compete. 

I fully agree, see my comment above where I state Chronos makes sense when Mesos does as well.

>Big data is not the only reason for cloud-clusters. Theoretically,
>(Clustered) systems can have a far greater resource utilization of
>networked
>resources than traditional (distributed) approaches. I grant you that
>this
>is a work in progress, but I personally know of dozens of
>mathematically
>complex distributed systems that are  migrating to the clustered
>approach
>rather than something custom or traditionally distributed.

I still remember running seti@home and similar programs in the past. Those were large clusters, but with a very badly designed network.

There is a use-case for large well integrated clusters, loosely coupled clusters and big machines.

Here lies the difference between horizontal (many machines) and vertical (1 really big machine) clustering:
the vertical kind only has clustering between different processes.

>Granted, Cloud <--> Clustered <--> Distributed are all overlapping
>approaches
>to big problems. I do appreciate the candor of this thread.

They are. It started with distributed computing in a lab, then moved onto the internet.
Then people started to build a mini internet with a lot of old computers and Clusters were born.
Then that ended up back on the internet with clusters being made accessible online. And this is what is considered to be "The Cloud".
If you take the general definition of The Cloud, which is along the lines of "being able to access your data anywhere using any device", then running your own server and accessing the data on it from anywhere with internet access using your laptop, smartphone, tablet, ... means you are using the cloud.

If anyone is actually planning to implement Mesos and Chronos on Gentoo, I would be interested in joining the effort as it does sound like fun. I just don't have the time to do a lot of work on that at the moment.

--
Joost

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-05 20:43               ` J. Roeleveld
@ 2014-08-05 21:29                 ` Alan McKinnon
  2014-08-06  8:29                 ` Peter Humphrey
  1 sibling, 0 replies; 52+ messages in thread
From: Alan McKinnon @ 2014-08-05 21:29 UTC (permalink / raw
  To: gentoo-user

On 05/08/2014 22:43, J. Roeleveld wrote:
> I believe Martin's scheduler will be very valuable. Even for me.
> I am very likely going to start using this for some of my regular maintenance activities on the home network.
> 
> But as the rest of the thread shows, I wouldn't be able to use it as a scheduler for large projects where the schedules can get very complex very quickly.


Martin will be happy to know I think his work will fit my needs just
nicely :-)



-- 
Alan McKinnon
alan.mckinnon@gmail.com




* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-05 20:43               ` J. Roeleveld
  2014-08-05 21:29                 ` Alan McKinnon
@ 2014-08-06  8:29                 ` Peter Humphrey
  2014-08-06 10:26                   ` J. Roeleveld
  1 sibling, 1 reply; 52+ messages in thread
From: Peter Humphrey @ 2014-08-06  8:29 UTC (permalink / raw
  To: gentoo-user

On Tuesday 05 August 2014 22:43:42 J. Roeleveld wrote:

> I still remember running seti@home and similar programs in the past. Those
> were large clusters, but with a very badly designed network.

Was that in the days before BOINC, Joost? Do you think it's any better now? I 
run 5 BOINC projects here in the same general area as SETI. They seem to work 
all right, except for getting changes in what they call computing preferences 
propagated around the projects.

(Just an aside - I don't want to hijack this interesting thread.)

-- 
Regards
Peter




* Re: [gentoo-user] Re: Recommendations for scheduler
  2014-08-06  8:29                 ` Peter Humphrey
@ 2014-08-06 10:26                   ` J. Roeleveld
  0 siblings, 0 replies; 52+ messages in thread
From: J. Roeleveld @ 2014-08-06 10:26 UTC (permalink / raw
  To: gentoo-user

On Wednesday, August 06, 2014 09:29:53 AM Peter Humphrey wrote:
> On Tuesday 05 August 2014 22:43:42 J. Roeleveld wrote:
> > I still remember running seti@home and similar programs in the past. Those
> > were large clusters, but with a very badly designed network.
> 
> Was that in the days before BOINC, Joost? Do you think it's any better now?
> I run 5 BOINC projects here in the same general area as SETI. They seem to
> work all right, except for getting changes in what they call computing
> preferences propagated around the projects.
> 
> (Just an aside - I don't want to hijack this interesting thread.)

Yes, I did it for a short period sometime in 1999.

It worked all right; I just meant that running it on thousands of personal 
computers using dial-up to the internet is a "badly designed network" for a 
cluster.

--
Joost



* [gentoo-user] Re: Recommendations for scheduler
  2014-08-05 11:32                             ` J. Roeleveld
@ 2014-08-08 23:21                               ` Martin Vaeth
  0 siblings, 0 replies; 52+ messages in thread
From: Martin Vaeth @ 2014-08-08 23:21 UTC (permalink / raw
  To: gentoo-user

> On Tuesday, August 05, 2014 06:33:59 AM Martin Vaeth wrote:
>
>> When you are at it you should probably also encrypt the communication

schedule-0.15 is finally able to use encryption, hence the current mild
security risks will practically vanish, even if listening to a
world-wide port.

schedule-1.0 will probably soon be ready with encryption strengthened
even more.


end of thread, other threads:[~2014-08-08 23:21 UTC | newest]

Thread overview: 52+ messages
2014-08-01 17:32 [gentoo-user] Recommendations for scheduler Alan McKinnon
2014-08-01 17:49 ` Сергей
2014-08-01 17:50   ` Сергей
2014-08-01 19:10     ` Alan McKinnon
2014-08-03  9:27       ` Bruce Schultz
2014-08-03 12:08         ` Alan McKinnon
2014-08-04  3:07           ` Bruce Schultz
2014-08-01 18:17 ` [gentoo-user] " James
2014-08-01 19:19   ` Alan McKinnon
2014-08-01 19:35     ` covici
2014-08-02  9:18       ` Alan McKinnon
2014-08-02 13:34         ` J. Roeleveld
2014-08-01 21:17   ` J. Roeleveld
2014-08-01 21:02 ` Martin Vaeth
2014-08-01 21:22   ` J. Roeleveld
2014-08-01 22:06     ` Martin Vaeth
2014-08-02  9:27   ` Alan McKinnon
2014-08-01 21:13 ` [gentoo-user] " J. Roeleveld
2014-08-02  9:33   ` Alan McKinnon
2014-08-02 13:31     ` J. Roeleveld
2014-08-02 14:03       ` Alan McKinnon
2014-08-02 16:53         ` [gentoo-user] " James
2014-08-03  7:23           ` Joost Roeleveld
2014-08-03 12:16             ` Alan McKinnon
2014-08-03 13:33               ` J. Roeleveld
2014-08-05 19:57             ` James
2014-08-05 20:43               ` J. Roeleveld
2014-08-05 21:29                 ` Alan McKinnon
2014-08-06  8:29                 ` Peter Humphrey
2014-08-06 10:26                   ` J. Roeleveld
2014-08-03  7:50       ` Martin Vaeth
2014-08-03  8:06         ` J. Roeleveld
2014-08-03 12:10           ` Martin Vaeth
2014-08-03 13:36             ` J. Roeleveld
2014-08-03 20:04               ` Alan McKinnon
2014-08-03 20:23                 ` J. Roeleveld
2014-08-03 20:57                   ` Alan McKinnon
2014-08-03 21:10                     ` J. Roeleveld
2014-08-04  8:41               ` Martin Vaeth
2014-08-04  9:02                 ` J. Roeleveld
2014-08-04 10:11                   ` Martin Vaeth
2014-08-04 10:40                     ` J. Roeleveld
2014-08-04 13:31                       ` Martin Vaeth
2014-08-04 13:35                         ` Alan McKinnon
2014-08-04 19:46                           ` J. Roeleveld
2014-08-04 20:38                             ` Alan McKinnon
2014-08-05 11:42                               ` J. Roeleveld
2014-08-04 19:54                         ` J. Roeleveld
2014-08-05  6:33                           ` Martin Vaeth
2014-08-05 11:32                             ` J. Roeleveld
2014-08-08 23:21                               ` Martin Vaeth
2014-08-03 13:02     ` [gentoo-user] " Tanstaafl
