[gentoo-user] monit and friends.

* [gentoo-user] monit and friends.
@ 2017-10-16 12:11 Alan McKinnon
  2017-10-16 15:08 ` [gentoo-user] " Ian Zimmerman
  0 siblings, 1 reply; 9+ messages in thread
From: Alan McKinnon @ 2017-10-16 12:11 UTC
  To: gentoo-user

I'm about to embark on a biggish rollout of local watchdogs in my
monitoring solutions - about 100 hosts or so.

First tool I reached for was my trusty monit, been using it for years.
Before I start though, I figured I should ask around if anyone has
experience with a package that does what monit does better than monit does
it. I find I type way too much stuff into monitrc: too many hard-coded
file paths, too much stuff I have to look up in long-form to put into
monitrc. Unfortunately, systemd with its respawn feature isn't a global
option; too many systems are not systemd. SysVInit is the common denominator.

My needs here are pretty simple:
a local watchdog that checks if a program is running and restarts it if
not. If that fails 3 times or so, alert me.
Maybe a few file/dir/fifo monitors as well. Not much else.
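
[Editorial sketch, not part of the original message: the "file/dir/fifo
monitors" mentioned above could be as small as the Python 3 snippet below.
The names path_kind and check_paths, and the example paths, are hypothetical;
this only illustrates the idea and is not how monit or supervisord do it.]

```python
import os
import stat


def path_kind(path):
    """Classify a path as 'file', 'dir', 'fifo', 'other', or 'missing'."""
    try:
        mode = os.stat(path).st_mode  # stat() does not open the FIFO, so no blocking
    except FileNotFoundError:
        return "missing"
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISDIR(mode):
        return "dir"
    if stat.S_ISFIFO(mode):
        return "fifo"
    return "other"


def check_paths(expected):
    """Given {path: expected_kind}, return {path: actual_kind} for mismatches."""
    problems = {}
    for path, wanted in expected.items():
        actual = path_kind(path)
        if actual != wanted:
            problems[path] = actual
    return problems
```

A caller might run check_paths({"/var/run/myapp.fifo": "fifo"}) from cron
and alert on a non-empty result; that path is made up for illustration.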

I don't need any of monit's graphing features or M/monit; I have other
tools for that. And mostly don't even need its HTTP API either.

-- 
Alan McKinnon
alan.mckinnon@gmail.com

* [gentoo-user] Re: monit and friends.
  2017-10-16 12:11 [gentoo-user] monit and friends Alan McKinnon
@ 2017-10-16 15:08 ` Ian Zimmerman
  2017-10-16 15:12   ` Alan McKinnon
  0 siblings, 1 reply; 9+ messages in thread
From: Ian Zimmerman @ 2017-10-16 15:08 UTC
  To: gentoo-user

On 2017-10-16 14:11, Alan McKinnon wrote:

> My needs here are pretty simple:
> a local watchdog that checks if a program is running and restarts it if
> not. If that fails 3 times or so, alert me.
> Maybe a few file/dir/fifo monitors as well. Not much else.
> 
> I don't need any of monit's graphing features or M/monit; I have other
> tools for that. And mostly don't even need its HTTP API either.

supervisor (aka supervisord)

http://supervisord.org/

python based, not sure if that's okay with you

-- 
Please don't Cc: me privately on mailing lists and Usenet,
if you also post the followup to the list or newsgroup.
Do obvious transformation on domain to reply privately _only_ on Usenet.


* Re: [gentoo-user] Re: monit and friends.
  2017-10-16 15:08 ` [gentoo-user] " Ian Zimmerman
@ 2017-10-16 15:12   ` Alan McKinnon
  2017-10-16 15:41     ` Mick
  0 siblings, 1 reply; 9+ messages in thread
From: Alan McKinnon @ 2017-10-16 15:12 UTC
  To: gentoo-user

On 16/10/2017 17:08, Ian Zimmerman wrote:
> On 2017-10-16 14:11, Alan McKinnon wrote:
> 
>> My needs here are pretty simple:
>> a local watchdog that checks if a program is running and restarts it if
>> not. If that fails 3 times or so, alert me.
>> Maybe a few file/dir/fifo monitors as well. Not much else.
>>
>> I don't need any of monit's graphing features or M/monit; I have other
>> tools for that. And mostly don't even need its HTTP API either.
> 
> supervisor (aka supervisord)
> 
> http://supervisord.org/
> 
> python based, not sure if that's okay with you
> 

I forgot about supervisord. Like monit, it runs everywhere and might be
easier for the team-mates to understand and work with.

Python is not a problem, all these hosts are ansible-managed anyway, so
they all have to run python-2.7

Good find, thanks!

-- 
Alan McKinnon
alan.mckinnon@gmail.com



* Re: [gentoo-user] Re: monit and friends.
  2017-10-16 15:12   ` Alan McKinnon
@ 2017-10-16 15:41     ` Mick
  2017-10-16 15:50       ` Alan McKinnon
  0 siblings, 1 reply; 9+ messages in thread
From: Mick @ 2017-10-16 15:41 UTC
  To: gentoo-user

On Monday, 16 October 2017 16:12:53 BST Alan McKinnon wrote:
> On 16/10/2017 17:08, Ian Zimmerman wrote:
> > On 2017-10-16 14:11, Alan McKinnon wrote:
> >> My needs here are pretty simple:
> >> a local watchdog that checks if a program is running and restarts it if
> >> not. If that fails 3 times or so, alert me.
> >> Maybe a few file/dir/fifo monitors as well. Not much else.
> >> 
> >> I don't need any of monit's graphing features or M/monit; I have other
> >> tools for that. And mostly don't even need its HTTP API either.
> > 
> > supervisor (aka supervisord)
> > 
> > http://supervisord.org/
> > 
> > python based, not sure if that's okay with you
> 
> I forgot about supervisord. Like monit, it runs everywhere and might be
> easier for the team-mates to understand and work with.
> 
> Python is not a problem, all these hosts are ansible-managed anyway, so
> they all have to run python-2.7
> 
> Good find, thanks!

I've used Nagios in the past, but have not kept up with its development and
the many plugins it provides.  It could do any of the above tasks and much
more.  It can run scripts (Perl or bash) via daemons (NRPE) on the remote
systems to restart applications, et al.  The Nagios server had the ability
to set up quite intelligent monitoring and alert hierarchies with
multilayered comms structures to make sure you are not woken up at 2 a.m. by
your boss just because a ping failed to his home NAS.  I also found the logs,
which can also be stored in SQL, quite useful both in troubleshooting problems
and in producing reports.  It can monitor network connectivity, remote OS
parameters and applications.  Writing your own plugin/module to monitor quite
specialised use cases is not particularly difficult either.

I expect you may find Nagios more complicated to set up than monit, at least
initially, so if you don't have the luxury of time to invest in setting up
Nagios, monit may be a better fit.  I don't have in-depth experience with other
monitoring software to comment, so something else may better suit your
specific needs.
-- 
Regards,
Mick


* Re: [gentoo-user] Re: monit and friends.
  2017-10-16 15:41     ` Mick
@ 2017-10-16 15:50       ` Alan McKinnon
  2017-10-16 16:10         ` Ralph Seichter
                           ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Alan McKinnon @ 2017-10-16 15:50 UTC
  To: gentoo-user

On 16/10/2017 17:41, Mick wrote:
> On Monday, 16 October 2017 16:12:53 BST Alan McKinnon wrote:
>> On 16/10/2017 17:08, Ian Zimmerman wrote:
>>> On 2017-10-16 14:11, Alan McKinnon wrote:
>>>> My needs here are pretty simple:
>>>> a local watchdog that checks if a program is running and restarts it if
>>>> not. If that fails 3 times or so, alert me.
>>>> Maybe a few file/dir/fifo monitors as well. Not much else.
>>>>
>>>> I don't need any of monit's graphing features or M/monit; I have other
>>>> tools for that. And mostly don't even need its HTTP API either.
>>>
>>> supervisor (aka supervisord)
>>>
>>> http://supervisord.org/
>>>
>>> python based, not sure if that's okay with you
>>
>> I forgot about supervisord. Like monit, it runs everywhere and might be
>> easier for the team-mates to understand and work with.
>>
>> Python is not a problem, all these hosts are ansible-managed anyway, so
>> they all have to run python-2.7
>>
>> Good find, thanks!
> 
> I've used Nagios in the past, but have not kept up with its development and
> the many plugins it provides.  It could do any of the above tasks and much
> more.  It can run scripts (Perl or bash) via daemons (NRPE) on the remote
> systems to restart applications, et al.  The Nagios server had the ability
> to set up quite intelligent monitoring and alert hierarchies with
> multilayered comms structures to make sure you are not woken up at 2 a.m. by
> your boss just because a ping failed to his home NAS.  I also found the logs,
> which can also be stored in SQL, quite useful both in troubleshooting problems
> and in producing reports.  It can monitor network connectivity, remote OS
> parameters and applications.  Writing your own plugin/module to monitor quite
> specialised use cases is not particularly difficult either.
> 
> I expect you may find Nagios more complicated to set up than monit, at least
> initially, so if you don't have the luxury of time to invest in setting up
> Nagios, monit may be a better fit.  I don't have in-depth experience with other
> monitoring software to comment, so something else may better suit your
> specific needs.
> 


Nagios and I go way back, way way waaaaaay back. I now recommend it
never be used unless there really is no other option. There are just so
many problems with actually using the bloody thing, but let's not get
into that :-)

I have a full monitoring system that tracks and reports on the state of
most things, but as it's a monitoring system it is forbidden to make
changes of any kind at all, and that includes restarting failed daemons.
Turns out that daemons failing for no good reason are becoming more
and more common in this day and age, mostly because we treat them like
cattle not pets and use virtualization and containers so much. And
there's our old friend the Linux oom-killer....

What I need here is a small app that will be a constrained,
single-purpose watchdog. If a daemon fails, the watchdog attempts 3
restarts to get it going, and records the fact that it did so (that goes into
the big monitoring system as a reportable fact). If the restarts fail,
then a human needs to attend to it, as the problem is serious or beyond the
scope of a watchdog.
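
[Editorial sketch, not part of the original message: the behaviour described
above (check, try up to 3 restarts, record what happened, escalate to a
human) might look roughly like this in Python. The function watchdog_pass
and its is_running/restart hooks are hypothetical; in practice they would
wrap something like a pidfile check and an init script. This is not how
monit or supervisord implement it.]

```python
import time


def watchdog_pass(name, is_running, restart, max_attempts=3, settle_seconds=0.0):
    """Run one watchdog pass for a single daemon.

    is_running and restart are caller-supplied callables (assumed hooks).
    Returns a small record that could be forwarded to the monitoring system.
    """
    record = {"daemon": name, "attempts": 0, "status": "ok"}
    if is_running():
        return record  # nothing to do, nothing to report
    for attempt in range(1, max_attempts + 1):
        record["attempts"] = attempt
        restart()
        time.sleep(settle_seconds)  # give the daemon a moment to come up
        if is_running():
            record["status"] = "restarted"
            return record
    # All restart attempts failed: beyond the scope of a watchdog.
    record["status"] = "needs_human"
    return record
```

The returned record keeps the watchdog itself free of alerting logic: the
caller decides whether "restarted" is logged and "needs_human" pages someone.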

Like you, I'm tired of being woken at 2am because something dropped 1
ping when the nightly database maintenance fired up on the vmware
cluster :-)


-- 
Alan McKinnon
alan.mckinnon@gmail.com



* Re: [gentoo-user] Re: monit and friends.
  2017-10-16 15:50       ` Alan McKinnon
@ 2017-10-16 16:10         ` Ralph Seichter
  2017-10-16 16:18           ` Alan McKinnon
  2017-10-17  0:13         ` Michael Orlitzky
  2017-10-18 13:45         ` skyclan
  2 siblings, 1 reply; 9+ messages in thread
From: Ralph Seichter @ 2017-10-16 16:10 UTC
  To: gentoo-user

On 16.10.2017 17:50, Alan McKinnon wrote:

> Nagios and I go way back, way way waaaaaay back. I now recommend it
> never be used unless there really is no other option.

Have you tried Icinga 2 (*) yet? It originally started as a Nagios fork
and uses plugins to monitor, but the rule-based configuration mechanism
of Icinga 2 is IMO more powerful and easier than Nagios' mechanism. I've
used both Nagios and Icinga for years, and I definitely prefer Icinga 2.

-Ralph

(*) https://www.icinga.com/products/icinga-2/

* Re: [gentoo-user] Re: monit and friends.
  2017-10-16 16:10         ` Ralph Seichter
@ 2017-10-16 16:18           ` Alan McKinnon
  0 siblings, 0 replies; 9+ messages in thread
From: Alan McKinnon @ 2017-10-16 16:18 UTC
  To: gentoo-user

On 16/10/2017 18:10, Ralph Seichter wrote:
> On 16.10.2017 17:50, Alan McKinnon wrote:
> 
>> Nagios and I go way back, way way waaaaaay back. I now recommend it
>> never be used unless there really is no other option.
> 
> Have you tried Icinga 2 (*) yet? It originally started as a Nagios fork
> and uses plugins to monitor, but the rule-based configuration mechanism
> of Icinga 2 is IMO more powerful and easier than Nagios' mechanism. I've
> used both Nagios and Icinga for years, and I definitely prefer Icinga 2.
> 
> -Ralph
> 
> (*) https://www.icinga.com/products/icinga-2/
> 

Yes, I know Icinga as well. It fixes many of Nagios' shortcomings - the
first batch of commits after the fork took care of many of those - but
still suffers from all of Nagios' design faults.

In short, I'm not interested in going back to Nagios after a year's
migration to get away from it. Same for Icinga, Shinken, Sensu and all
the many other Nagios forks out there. Also Zabbix.

My current monitoring is SNMP-based, and all I need monit for is as a
very narrowly-defined single-purpose watchdog.

-- 
Alan McKinnon
alan.mckinnon@gmail.com

* Re: [gentoo-user] Re: monit and friends.
  2017-10-16 15:50       ` Alan McKinnon
  2017-10-16 16:10         ` Ralph Seichter
@ 2017-10-17  0:13         ` Michael Orlitzky
  2017-10-18 13:45         ` skyclan
  2 siblings, 0 replies; 9+ messages in thread
From: Michael Orlitzky @ 2017-10-17  0:13 UTC
  To: gentoo-user

On 10/16/2017 11:50 AM, Alan McKinnon wrote:
> 
> What I need here is a small app that will be a constrained,
> single-purpose watchdog. If a daemon fails, the watchdog attempts 3
> restarts to get it going, and records the fact that it did so (that goes into
> the big monitoring system as a reportable fact). If the restarts fail,
> then a human needs to attend to it, as the problem is serious or beyond the
> scope of a watchdog.
> 

Can the daemon be run in the foreground?

  start-app; start-app; start-app; /usr/local/bin/page-alan
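
[Editorial sketch, not part of the original message: the one-liner above
relies on the daemon staying in the foreground, so each `start-app` only
returns when the daemon dies. A rough Python equivalent, with the
hypothetical name run_then_page, could be:]

```python
import subprocess


def run_then_page(app_cmd, page_cmd, runs=3):
    """Run app_cmd in the foreground `runs` times, then run page_cmd.

    Each run blocks until the process exits, like `a; a; a; page` in a shell.
    Returns the exit codes of the app runs (negative if killed by a signal).
    """
    codes = []
    for _ in range(runs):
        codes.append(subprocess.run(app_cmd).returncode)
    subprocess.run(page_cmd)  # all runs have exited: time to wake a human
    return codes
```

For example, run_then_page(["start-app"], ["/usr/local/bin/page-alan"])
mirrors the shell line; both command names come from the message above and
are placeholders.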



* Re: [gentoo-user] Re: monit and friends.
  2017-10-16 15:50       ` Alan McKinnon
  2017-10-16 16:10         ` Ralph Seichter
  2017-10-17  0:13         ` Michael Orlitzky
@ 2017-10-18 13:45         ` skyclan
  2 siblings, 0 replies; 9+ messages in thread
From: skyclan @ 2017-10-18 13:45 UTC
  To: gentoo-user

Hi Alan,

This isn't exactly what you describe for your needs but have you 
considered using auto-remediation outside of the box?  I've been using 
StackStorm https://stackstorm.com/ for the last year in an environment 
of ~1500 physical servers for this purpose and it's been quite successful.

It has been handling cases like restarting SNMP daemons that segfault,
Hadoop instances that lose contact with the ZooKeeper cluster, and
nginx daemons that stop responding to requests (detected by analysing
the last write date in nginx's access logs); the list goes on.
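
[Editorial sketch, not part of the original message: a "last write date"
check on an access log can be approximated with the file's modification
time. The function names and the threshold are hypothetical, and this is not
StackStorm code. Note that a genuinely idle server with no traffic would
also look stalled, so the threshold has to fit the site's traffic.]

```python
import os
import time


def seconds_since_last_write(path, now=None):
    """Return how many seconds ago the file at `path` was last modified."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime


def looks_stalled(path, max_idle_seconds, now=None):
    """True if the log has not been written to for longer than the threshold."""
    return seconds_since_last_write(path, now) > max_idle_seconds
```

A remediation rule could then restart the daemon when
looks_stalled("/var/log/nginx/access.log", 300) is true; the path and the
300-second threshold are assumptions for illustration.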

StackStorm is an event-driven platform that has many integrations available,
allowing it to interact with internal and external service providers.
It's Python-based and can use ssh to execute remote commands, which
sounds like an acceptable approach since you're using ansible.

Connecting SNMP traps up to StackStorm's event bus to trigger automated
responses based on the trap contents would be in line with common use cases.

Regards,
Carlos

On 16/10/17 17:50, Alan McKinnon wrote:
> Nagios and I go way back, way way waaaaaay back. I now recommend it
> never be used unless there really is no other option. There are just so
> many problems with actually using the bloody thing, but let's not get
> into that :-)
> 
> I have a full monitoring system that tracks and reports on the state of
> most things, but as it's a monitoring system it is forbidden to make
> changes of any kind at all, and that includes restarting failed daemons.
> Turns out that daemons failing for no good reason are becoming more
> and more common in this day and age, mostly because we treat them like
> cattle not pets and use virtualization and containers so much. And
> there's our old friend the Linux oom-killer....
> 
> What I need here is a small app that will be a constrained,
> single-purpose watchdog. If a daemon fails, the watchdog attempts 3
> restarts to get it going, and records the fact that it did so (that goes into
> the big monitoring system as a reportable fact). If the restarts fail,
> then a human needs to attend to it, as the problem is serious or beyond the
> scope of a watchdog.
> 
> Like you, I'm tired of being woken at 2am because something dropped 1
> ping when the nightly database maintenance fired up on the vmware
> cluster :-)
