From: fire-eyes
Date: Thu, 22 Sep 2005 10:34:06 -0400
Message-ID: <4332C0DE.5060503@fire-eyes.org>
To: gentoo-server@lists.gentoo.org
Reply-to: gentoo-server@lists.gentoo.org
List-Id: Gentoo Linux mail
Subject: [gentoo-server] Server goes down twice in two days, looking for input

Hi there!

I have had our Gentoo server go down twice in under two days. I am currently trying to figure out what is happening.

Facts:
- Dual PIII 933 MHz system (ServerWorks OSB4)
- 3.5GB RAM
- 2.6.11.2-grsec-20050614 kernel (self rolled)
- SCSI: Adaptec AIC-7892P, 32MB cache
+ Disks
  + For Operating System
    - 2x IBM DDYS-T09170N SCSI U160 10KRPM 9.1GB in a RAID1, 1x of the same for hotspare
  + For storage etc
    - 3x IBM IC35L036UWD210-0 SCSI U160 10KRPM
    - 1x IBM DDYS-T36950N SCSI U160 10KRPM
    - In a RAID5

Tuesday afternoon, I was informed that there might be problems with this server. I had just been working on it via shell. I went back and found it unresponsive. I went into the server room, only to catch it finishing a reboot and almost fully back up. It behaved for the rest of the day. I was not able to find any indication of problems in the logs.

Wednesday evening, I was again working on the system via ssh, and it stopped responding. I got into the server room fast enough this time. I tried to log in as root, and could not. I could type the username, but upon hitting enter, nothing happened. That was true for every console. I have syslogd output *.* to console 10, so I flipped over there and saw nothing out of the ordinary. The last log entry, from around the time I noticed the system stop responding, was a simple run-of-the-mill firewall log.

After a few more minutes, the system was completely unresponsive, save for SysRq. I Synced, tErmed, Synced again, remounted everything read-only and forced it to reboot.

Again, I was not able to find any logs indicating any errors at all.

I see only two possibilities. The first is that I was goofing with samba at various points on both days. However, samba was not running either time the system went down.

The other, more interesting one is that both times the system went down, I was creating a tar.bz2 out of a kernel source tree. In both cases the hang came well after I had started the tar job.

Wondering about the disks, I threw smartctl -a at both of the arrays (sda, sdb), which didn't show anything out of the ordinary.
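
For anyone who wants to repeat that check, here is a minimal sketch of it as a small loop. The function name check_smart and the SMARTCTL override variable are made up for illustration and are not part of smartmontools; /dev/sda and /dev/sdb are simply the device names used in this post. It prints the exit status after each call, since a non-zero status from smartctl is a hint that something went wrong (see smartctl(8) for what the individual bits mean).

```shell
#!/bin/sh
# Sketch only: run the same smartctl query against several devices.
# SMARTCTL is a hypothetical override hook, e.g. SMARTCTL=echo for a dry run.
SMARTCTL="${SMARTCTL:-smartctl}"

check_smart() {
    # $1 = smartctl option (for example -a), remaining arguments = devices
    opt="$1"
    shift
    for dev in "$@"; do
        echo "== $dev =="
        $SMARTCTL "$opt" "$dev"
        # Report the exit status of the smartctl call just made
        echo "exit status: $?"
    done
}

# Example call (needs root and real devices):
#   check_smart -a /dev/sda /dev/sdb
```

Setting SMARTCTL=echo before calling the function shows the commands that would run without touching the disks.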

However, when I run smartctl -t offline, -t short or -t long on sda or sdb, it immediately reports a failure on stdout. I find this odd, because I have run these tests in the past. Granted, that was on a different kernel, which I no longer have around. Here is an example:

# smartctl -t short /dev/sda
smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Short Background Self Test Failed

Looking at the logs, including dmesg, I don't see anything strange.

I am worried by the smartctl results; however, I realize there is a small possibility that they are due to kernel changes.

Any ideas out there?

Thank you for reading this! I *LOVE* Gentoo in production.

-- 
gentoo-server@gentoo.org mailing list