From: Mark Knecht
To: gentoo-amd64@lists.gentoo.org
Subject: Re: [gentoo-amd64] Re: RAID1 boot - no bootable media found
Date: Fri, 2 Apr 2010 10:18:59 -0700

Good stuff. I'll snip out the less important parts to keep the response
shorter, but don't think for a second that I didn't appreciate it. I do!

On Fri, Apr 2, 2010 at 2:43 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Mark Knecht posted on Thu, 01 Apr 2010 11:57:47 -0700 as excerpted:
>
> Making the titles different is a very good idea. It's what I ended up
> doing too, as otherwise, it can get confusing pretty fast.
>
> Something else you might want to do, for purposes of identifying the
> drives at the grub boot prompt if something goes wrong or you are
> otherwise trying to boot something on another drive, is create a (probably
> empty) differently named file on each one, say grub.sda, grub.sdb, etc.

I'll consider that, once I get the hard problems solved.

>> Roughly speaking 1TB read at 100MB/S should take 10,000 seconds or 2.7
>> hours. I'm at 18% after 28 minutes so that seems about right. (With no
>> errors so far assuming I'm using the right command)
>
> I used the -w switch here, which actually goes over the disk a total of 8
> times, alternating writing and then reading back to verify the written
> pattern, for four different write patterns (0xaa, 0x55, 0xff, 0x00, which
> is alternating 10101010, alternating 01010101, all ones, all zeroes).

OK, makes sense then. I ran one pass of badblocks on each of the
drives. No problems found.

I know some Linux folks don't like Spinrite but I've had good luck
with it so that's running now.
The problem is that it cannot run the drives at the same time, and it
looks like it wants at least 24 hours to do a whole drive, so using it
on all three would take 3 days. I will likely let it run through the
first drive (I'm busy today) and then tomorrow drop back into Linux and
possibly try your badblocks -w approach on all 3 drives. I'm not overly
concerned about losing the install.

>
> [8 second spin-down timeouts]
>
>> Very true. Here is the same drive model I put in a new machine for my
>> dad. It's been powered up and running Gentoo as a typical desktop
>> machine for about 50 days. He doesn't use it more than about an hour a
>> day on average. It's already hit 31K load/unload cycles. At 10% of 300K
>> that about 1.5 years of life before I hit that spec. I've watched his
>> system a bit and his system seems to add 1 to the count almost exactly
>> every 2 minutes on average. Is that a common cron job maybe?
>
> It's unlikely to be a cron job. But check your logging, and check what
> sort of atime you're using on your mounts (relatime is the new kernel
> default, but it was atime until relatively recently, say 2.6.30 or .31 or
> some such, and noatime is recommended unless you have something that
> actually depends on atime, alpine is known to need it for mail, and some
> backup software uses it, tho little else on a modern system will, I always
> use noatime on my real disk mounts, as opposed to say tmpfs, here). If
> there's something writing to the log every two minutes or less, and the
> buffers are set to timeout dirty data and flush to disk every two
> minutes... And simply accessing a file will change the atime on it if you
> have that turned on, thus necessitating a write to disk to update the
> atime, with those dirty buffers flushed every X minutes or seconds as well.

Here is fstab from my dad's machine, which racks up 30
Load_Cycle_Counts an hour:

# NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
LABEL="myboot"          /boot           ext2            noauto,noatime  1 2
LABEL="myroot"          /               ext3            noatime         0 1
LABEL="myswap"          none            swap            sw              0 0
LABEL="homeherb"        /home/herb      ext3            noatime         0 1
/dev/cdrom              /mnt/cdrom      auto            noauto,ro       0 0
#/dev/fd0               /mnt/floppy     auto            noauto          0 0

# glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
# POSIX shared memory (shm_open, shm_unlink).
# (tmpfs is a dynamically expandable/shrinkable ramdisk, and will
#  use almost no memory if not populated with files)
shm                     /dev/shm        tmpfs           nodev,nosuid,noexec     0 0

On the other hand, there is some cron stuff going on every 10 minutes
or so. Possibly it's not 1 event every 2 minutes but maybe 5 events
every 10 minutes?

Apr 2 07:10:01 gandalf cron[6310]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:20:01 gandalf cron[6322]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:30:01 gandalf cron[6335]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:40:01 gandalf cron[6348]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:50:01 gandalf cron[6361]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 07:59:01 gandalf cron[6374]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Apr 2 08:00:01 gandalf cron[6376]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:10:01 gandalf cron[6388]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:20:01 gandalf cron[6401]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:30:01 gandalf cron[6414]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:40:01 gandalf cron[6427]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:50:01 gandalf cron[6440]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 08:59:01 gandalf cron[6453]: (root) CMD (rm -f /var/spool/cron/lastrun/cron.hourly)
Apr 2 09:00:01 gandalf cron[6455]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 09:10:01 gandalf cron[6467]: (root) CMD (test -x /usr/sbin/run-crons && /usr/sbin/run-crons )
Apr 2 09:18:01 gandalf sshd[6479]: Accepted keyboard-interactive/pam for root from 67.188.27.80 port 51981 ssh2
Apr 2 09:18:01 gandalf sshd[6479]: pam_unix(sshd:session): session opened for user root by (uid=0)

>
> Note that I don't have #193, the load-cycle counts. There's a couple
> different technologies here. The ramp-type load/unload yours uses is
> typical of the smaller 2.5" laptop drives. These are designed for far
> shorter idle/standby timeouts and thus a far higher cycle count, load
> cycles, typical rating 300,000 to 600,000. Standard desktop/server drives
> use a contact park method and a lower power cycle count, typically 50,000
> or so. That's the difference.

I also purchased two Enterprise Edition drives - the 500GB size. They
are also spec'ed at 300K load/unload cycles:

http://www.wdc.com/en/products/products.asp?DriveID=489

My intention was to use them in a RAID0 and then back them up daily to
RAID1 for more safety. However, I'm starting to think this TLER feature
may well be part of this problem. I don't want to start using them
until I understand this 30-per-hour load-cycle issue. No reason to wear
everything out!

>
> One thing they recommend with RAID, which I did NOT do, BTW, and which I'm
> beginning to worry about since I'm approaching the end of my 5 year
> warranties, is buying either different brands or models, or at least
> ensuring you're getting different lot numbers of the same model. The idea
> being, if they're all the same model and lot number, and they're all part
> of the same RAID so in similar operating conditions, they're all likely to
> go out pretty close to each other.
> That's one reason to be glad I'm
> running 4-way RAID-1, I suppose, as one hopes that when they start going,
> even if they are the same model and lot number, at least one of the four
> can hang on long enough for me to buy replacements and transfer the
> critical data.

Exactly! My plan for this box is a 3 disk RAID1, as 3 disks is all it
will hold. Most folks don't understand that if 1 drive has a 1% chance
of failing, then with 3 drives it's more like a 3% chance that one of
them fails, assuming they are truly independent. If they all come from
the same lot and 1 fails, then it's logically more likely that the
other 2 will fail in the next few days or weeks. Certainly much sooner
than if I had gotten them from different companies.

>>
>> INFO: task kjournald:5064 blocked for more than 120 seconds. "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>
> [snipped the trace]
>
> Ouch! Blocked for 2 minutes...
>
> Yes, between the logs and the 2-minute hung-task, that does look like some
> serious issues, chipset or other...
>
> Talking about which...
>
> Can you try different SATA cables? I'm assuming you and your dad aren't
> using the same cables. Maybe it's the cables, not the chipset.

Now that's an interesting thought. On my other machines I used the
cables Intel shipped with the MB. However, in this case I couldn't
because the SATA connectors don't point upward but come out
horizontally. Due to proximity to the drive container I had to get 90
degree cables, and all 3 drives are using those right now. I can switch
two of the drives to the Intel cables.

That said, Spinrite has been running for hours without any problem at
all, and it will tell me if there are delays, sectors not found, etc.
So if it was as blatant a problem as it appears to be when running
Linux, then I really think I would have seen it by now. I would have
guessed I would have seen it running badblocks also, but possibly not.

>
> Also, consider slowing the data down.
> Disable UDMA or reduce it to a
> lower speed, or check the pinouts and try jumpering OPT1 to force SATA-1
> speeds (150 MB/sec instead of 300 MB/sec) as detailed here (watch the
> wrap!):
>
> http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_adp.php?
> p_faqid=1337
>
> If that solves the issue, then you know it's related to signal timing.

Will try it.

>
> Unfortunately, this can be mobo related. I had very similar issues with
> memory at one point, and had to slow it down from the rated PC3200, to
> PC3000 speed (declock it from 200 MHz to 183 MHz), in the BIOS.
> Unfortunately, initially the BIOS didn't have a setting for that; it
> wasn't until a BIOS update that I got it. Until I got the update and
> declocked it, it would work most of the time, but was borderline. The
> thing was, the memory was solid and tested so in memtest86+, but that
> tests memory cells, not speed, and at the rated speed, that memory and
> that board just didn't like each other, and there'd be occasional issues
> (bunzip2 erroring out due to checksum mismatch was a common one, and
> occasional crashes). Ultimately, I fixed the problem when I upgraded
> memory.

OK, so I have 6 of these drives and multiple PCs. While not a perfect
test I can try putting a couple into another machine and building a 2
drive RAID1 just to see what happens.

>
> So having experienced the issue with memory, I know exactly how
> frustrating it can be. But if you slow it down with the jumper and it
> works, then you can try different cables, or take off the jumper and try
> lower UDMA speeds (but still higher than SATA-1/150MB/sec), using hdparm
> or something. Or exchange either the drives or the mobo, if you can, or
> buy an add-on SATA card and disable the onboard one.
>
> Oh, and double-check the kernel driver you are using for it as well.
> Maybe there's another that'll work better, or driver options you can feed
> to it, or something.
The kernel driver is ahci. I don't know that I have any alternatives
when booting AHCI from BIOS, but I can look at the other BIOS modes
with other drivers and see if the problem still occurs. That's a bit of
work but probably worth it.

This is all a big table of experiments that eventually narrows the
problem down to a single location. (Hopefully!)

>
> Oh, and if you hadn't re-fdisked, re-created new md devices, remkfsed, and
> reloaded the system from backup, after you switched to AHCI, try that.
> AHCI and the kernel driver for it is almost certainly what you want, not
> compatibility mode, but that could potentially screw things up too, if you
> switched it and didn't redo the disk afterward.
>
> I do wish you luck! Seeing those errors brought back BAD memories of the
> memory problems I had, so while yours is disk not memory, I can definitely
> sympathize!

As always, thanks for the help. I'm very interested, and yes, even a
little frustrated! ;-)

Cheers,
Mark
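
P.S. For anyone who wants to redo the back-of-the-envelope numbers in
this thread, here is a minimal Python sketch. It only uses figures
quoted above (31K load cycles in about 50 days, a 300K rating, a 1%
per-drive failure chance). The function names are made up for
illustration, and both calculations assume a constant average rate and
fully independent drives, which real hardware may not honor.

```python
def days_until_rated_limit(cycles_so_far, days_elapsed, rated_cycles):
    """Days left before Load_Cycle_Count reaches the rated figure,
    assuming the average rate observed so far simply continues."""
    cycles_per_day = cycles_so_far / days_elapsed
    return (rated_cycles - cycles_so_far) / cycles_per_day


def prob_at_least_one_failure(p_single, n_drives):
    """Chance that at least one of n drives fails, assuming each fails
    independently with probability p_single over the same period."""
    return 1 - (1 - p_single) ** n_drives


if __name__ == "__main__":
    # Figures from the thread: 31K cycles after ~50 days, 300K rating.
    remaining = days_until_rated_limit(31_000, 50, 300_000)
    print(f"days left at the observed rate: {remaining:.0f}")

    # Three drives, each with a hypothetical 1% chance of failing.
    print(f"P(at least one of 3 fails): {prob_at_least_one_failure(0.01, 3):.4f}")
```

At the observed average rate the rating is reached in a bit over a year
from now, and the three-drive figure comes out just under 3%, so both
rough estimates in the thread are in the right ballpark.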