From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gentoo-user+bounces-180399-garchives=archives.gentoo.org@lists.gentoo.org>
Received: from lists.gentoo.org (pigeon.gentoo.org [208.92.234.80])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by finch.gentoo.org (Postfix) with ESMTPS id 69B461396D9
	for <garchives@archives.gentoo.org>; Wed, 18 Oct 2017 13:45:35 +0000 (UTC)
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id 6F605E0ED5;
	Wed, 18 Oct 2017 13:45:28 +0000 (UTC)
Received: from mail.kiwifrog.net (mail.kiwifrog.net [82.241.90.118])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by pigeon.gentoo.org (Postfix) with ESMTPS id 095F3E0ECC
	for <gentoo-user@lists.gentoo.org>; Wed, 18 Oct 2017 13:45:24 +0000 (UTC)
Received: from [127.0.0.1] (mail.kiwifrog.net [192.168.7.250])
	by mail.kiwifrog.net (Postfix) with ESMTP id 8318C11C235
	for <gentoo-user@lists.gentoo.org>; Wed, 18 Oct 2017 15:45:22 +0200 (CEST)
Subject: Re: [gentoo-user] Re: monit and friends.
To: gentoo-user@lists.gentoo.org
References: <96762772-dd49-7464-da0c-c0a878a6e7de@gmail.com>
 <20171016150852.u6w6ekna7la4pxhb@matica.foolinux.mooo.com>
 <e417e7dd-c20d-a389-dfc4-0981f9a3c03c@gmail.com>
 <6623947.tej94tQTD2@dell_xps>
 <e57aa55a-c651-2c39-34a8-ab13a4a981f1@gmail.com>
From: skyclan@gmx.net
Message-ID: <a7d03aee-988d-f100-8ed3-3e6e9bfb4ecb@kiwifrog.net>
Date: Wed, 18 Oct 2017 15:45:23 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.4.0
Precedence: bulk
List-Post: <mailto:gentoo-user@lists.gentoo.org>
List-Help: <mailto:gentoo-user+help@lists.gentoo.org>
List-Unsubscribe: <mailto:gentoo-user+unsubscribe@lists.gentoo.org>
List-Subscribe: <mailto:gentoo-user+subscribe@lists.gentoo.org>
List-Id: Gentoo Linux mail <gentoo-user.gentoo.org>
X-BeenThere: gentoo-user@lists.gentoo.org
Reply-to: gentoo-user@lists.gentoo.org
MIME-Version: 1.0
In-Reply-To: <e57aa55a-c651-2c39-34a8-ab13a4a981f1@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-GB
Content-Transfer-Encoding: 7bit
X-Archives-Salt: 05cc1397-753d-4c6d-8250-c9a9ed87795a
X-Archives-Hash: e36faf9fe58db050b604d892bdd43920

Hi Alan,

This isn't exactly what you describe for your needs but have you 
considered using auto-remediation outside of the box?  I've been using 
StackStorm https://stackstorm.com/ for the last year in an environment 
of ~1500 physical servers for this purpose and it's been quite successful.

It has been handling cases like restarting SNMP daemons that segfault, 
Hadoop instances that lose contact with the ZooKeeper cluster, and 
restarting nginx daemons that stop responding to requests, which we 
detect by analysing the last write date in nginx's access logs. The 
list goes on.
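
To give a rough idea of the access log check, here is a minimal Python 
sketch. It is not our actual StackStorm code; the function name, the 
log path and the threshold are all made up for illustration.

```python
import os
import time


def log_is_stale(path, max_age_seconds, now=None):
    """Return True if the file at `path` was last written more than
    `max_age_seconds` ago. `now` can be passed in to make testing easy."""
    if now is None:
        now = time.time()
    # getmtime() gives the last modification time in seconds since the epoch
    return (now - os.path.getmtime(path)) > max_age_seconds


if __name__ == "__main__":
    # Hypothetical path and threshold; adjust to the traffic you expect.
    access_log = "/var/log/nginx/access.log"
    if os.path.exists(access_log) and log_is_stale(access_log, 300):
        print("nginx looks stuck: no access log writes in the last 5 minutes")
```

A check like this only makes sense on hosts busy enough that a quiet 
log really does mean trouble.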

StackStorm is an event-driven platform with many integrations available, 
allowing it to interact with internal and external service providers. 
It's Python based and can use ssh to execute remote commands, which 
sounds like an acceptable approach since you're using Ansible.

Connecting SNMP traps up to StackStorm's event bus to trigger automated 
responses based on the trap contents would be in line with common use 
cases.

Regards,
Carlos

On 16/10/17 17:50, Alan McKinnon wrote:
> Nagios and I go way back, way way waaaaaay back. I now recommend it
> never be used unless there really is no other option. There is just so
> many problems with actually using the bloody thing, but let's not get
> into that:-)
> 
> I have a full monitoring system that tracks and reports on the state of
> most things, but as it's a monitoring system it is forbidden to make
> changes of any kind at all, and that includes restarting failed daemons.
> Turns out that daemons that failed for no good reason are becoming more
> and more common in this day and age, mostly because we treat them like
> cattle not pets and use virtualization and containers so much. And
> there's our old friend the Linux oom-killer....
> 
> What I need here is a small app that will be a constrained,
> single-purpose watchdog. If a daemon fails, the watchdog attempts 3
> restarts to get it going, and records the fact it did it (that goes into
> the big monitoring system as a reportable fact). If the restart fails,
> then a human needs to attend to it as it is seriously or beyond the
> scope of a watchdog.
> 
> Like you, I'm tired of being woken at 2am because something dropped 1
> ping when the nightly database maintenance fired up on the vmware
> cluster:-)
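
For what it's worth, the restart logic you describe above (up to 3 
attempts, record each one, hand over to a human if they all fail) could 
be sketched roughly like this in Python. The function and the three 
callables are hypothetical names of mine, not part of StackStorm or any 
existing tool; you would plug in your own check, restart and reporting 
code.

```python
def watchdog(is_running, restart, record, max_attempts=3):
    """Try to bring a failed daemon back up.

    is_running: callable returning True if the daemon is healthy
    restart:    callable that attempts one restart
    record:     callable(attempt_number, succeeded) that reports the attempt
                to the monitoring system
    Returns True if the daemon is running at the end, False if a human
    needs to look at it.
    """
    if is_running():
        return True  # nothing to do
    for attempt in range(1, max_attempts + 1):
        restart()
        ok = is_running()
        record(attempt, ok)  # every attempt becomes a reportable fact
        if ok:
            return True
    return False  # all attempts failed: escalate to a human


if __name__ == "__main__":
    # Toy daemon that only comes back on the second restart.
    state = {"restarts": 0}

    def fake_restart():
        state["restarts"] += 1

    recovered = watchdog(
        is_running=lambda: state["restarts"] >= 2,
        restart=fake_restart,
        record=lambda n, ok: print("attempt", n, "ok" if ok else "failed"),
    )
    print("recovered" if recovered else "needs a human")
```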