Subject: Re: [gentoo-user] Re: monit and friends.
To: gentoo-user@lists.gentoo.org
From: Alan McKinnon
Date: Mon, 16 Oct 2017 17:50:07 +0200

On 16/10/2017 17:41, Mick wrote:
> On Monday, 16 October 2017 16:12:53 BST Alan McKinnon wrote:
>> On 16/10/2017 17:08, Ian Zimmerman wrote:
>>> On 2017-10-16 14:11, Alan McKinnon wrote:
>>>> My needs here are pretty simple:
>>>> local watchdog that checks if a program is running and restart it if
>>>> not. If that fails 3 times or so, alert me.
>>>> Maybe a few file/dir/fifo monitors as well. Not much else.
>>>>
>>>> I don't need any of monit's graphing features or M/monit, I have other
>>>> tools for that. And mostly don't even need it's http API either.
>>>
>>> supervisor (aka supervisord)
>>>
>>> http://supervisord.org/
>>>
>>> python based, not sure if that's okay with you
>>
>> I forgot about supervisord. Like monit, it runs everywhere and might be
>> easier for the team-mates to understand and work with.
>>
>> Python is not a problem, all these hosts are ansible-managed anyway, so
>> they all have to run python-2.7
>>
>> Good find, thanks!
>
> I've used Nagios in the past, but have not kept up with its development and
> the many plugins it provides. It could do any of the above tasks and much
> more. It can run scripts (perl, or bash) via daemons (nrpe) on the remote
> systems to restart applications, et al. The Nagios server possessed the
> ability to set up quite intelligent monitoring and alert hierarchies with
> multilayered comms structures to make sure you are not woken up at 2 a.m. by
> your boss, just because a ping failed to his home NAS. I also found the logs
> which can be also stored on SQL quite useful both in troubleshooting problems
> and in producing reports. It can monitor network connectivity, remote OS
> parameters and applications. Writing your own plugin/module to monitor quite
> specialised use cases is not particularly difficult either.
>
> I expect you may find Nagios more complicated to set up than monit, at least
> initially, but if you don't have the luxury of time to invest on setting up
> Nagios monit may be a better fit. I don't have in depth experience with other
> monitoring software to comment, so something else may suit better your
> specific needs.
>

Nagios and I go way back, way way waaaaaay back. I now recommend it never be
used unless there really is no other option.
There are just so many problems with actually using the bloody thing, but
let's not get into that :-)

I have a full monitoring system that tracks and reports on the state of
most things, but as it's a monitoring system it is forbidden to make
changes of any kind at all, and that includes restarting failed daemons.

Turns out that daemons that fail for no good reason are becoming more and
more common in this day and age, mostly because we treat them like cattle,
not pets, and use virtualization and containers so much. And there's our
old friend the Linux oom-killer....

What I need here is a small app that will be a constrained, single-purpose
watchdog. If a daemon fails, the watchdog attempts 3 restarts to get it
going, and records the fact that it did so (that goes into the big
monitoring system as a reportable fact). If the restarts fail, then a human
needs to attend to it, as it is seriously broken or beyond the scope of a
watchdog.

Like you, I'm tired of being woken at 2am because something dropped 1 ping
when the nightly database maintenance fired up on the vmware cluster :-)

-- 
Alan McKinnon
alan.mckinnon@gmail.com
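
For illustration only, the watchdog behaviour described above (try up to 3 restarts, record what happened, hand over to a human if that fails) could be sketched roughly as below. This is not how monit or supervisord do it. The function name `watchdog_check` and its four callables are hypothetical names invented for this sketch, and how a daemon is actually probed or restarted is deliberately left to the caller.

```python
from typing import Callable

MAX_RESTARTS = 3  # "attempts 3 restarts", taken from the description above


def watchdog_check(
    is_running: Callable[[], bool],   # hypothetical probe, supplied by caller
    restart: Callable[[], None],      # hypothetical restart action
    record: Callable[[str], None],    # feeds the big monitoring system
    alert: Callable[[str], None],     # gets a human involved
    max_restarts: int = MAX_RESTARTS,
) -> bool:
    """Run one watchdog pass; return True if the daemon is up afterwards."""
    if is_running():
        return True  # healthy: nothing to do and nothing to report

    for attempt in range(1, max_restarts + 1):
        restart()
        if is_running():
            # A successful restart is recorded as a reportable fact.
            record(f"restarted after {attempt} attempt(s)")
            return True

    # All attempts used up: record it and escalate, then stop trying.
    record(f"gave up after {max_restarts} restart attempts")
    alert("daemon still down, needs a human")
    return False
```

Passing the probe and restart actions in as callables keeps the watchdog itself constrained and single-purpose. On a real host they might wrap something like `pgrep -x` and the init system's restart command, but that wiring is an assumption and is left out here.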