From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] Re: fsck seems to screw up my harddisk
Date: Sun, 3 Dec 2006 17:04:39 +0000 (UTC)
References: <7573e9640611271044k23c6a0afwbb2317e23ffbc54d@mail.gmail.com> <7573e9640611271749k6764b39fh88f20086e2edfc8e@mail.gmail.com>
User-Agent: pan 0.120 (Plate of Shrimp)

"Guido Doornberg" posted eb2db630612030540k38b658b2q3571a712efc510d0@mail.gmail.com, excerpted below, on Sun, 03 Dec 2006 14:40:11 +0100:

> Well, I downloaded and started a fresh 2006.1 livecd, repartitioned the
> hdd, and started mke2fs, this time with the -c option.
>
> So it started checking, and after about 15 minutes this kept showing up
> on my screen:
>
> ata1: error=0x40 {uncorrectable error} ata1: translated ATA stat/err
> 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
>
> After a while I got a couple of other messages, and now it keeps
> talking about "Buffer I/O error on device sda3", followed by various
> sector and block numbers.
>
> I did check my power supply and I'm 99% sure that's not the problem.
> So, correct me if I'm wrong, but that would mean my harddisk is the
> problem? But how then is it possible that I can use it normally if I
> don't let fsck check it?
>
> I know this isn't really Gentoo specific anymore, but if anyone knows
> what to do I'm happy to hear it.

Your suspicions seem correct to me as well. I've had several hard drives go partially bad over the last several years. The last one I know was due to heat: I'm in Phoenix, AZ, with summer highs approaching 50 C (122 F), and my AC went out. Since it followed the same basic pattern as an earlier failure, I expect heat was the problem with that previous drive as well, though I'm not positive.

What happens when a drive overheats is that the platters expand and the heads crash into them, digging grooves in the platters (which I could see when I took the drive apart later).
The data will of course be destroyed on those disk cylinders, basically wherever the head seeked while the platter was hot enough to crash it, but the rest of the drive is recoverable and, in my experience, somewhat stable, provided the drive doesn't overheat again. Due to the way I have my system set up (see below) and what was damaged, I was actually able to continue using the system for some time. Nevertheless, get anything you want saved off it ASAP, preferably leaving the drive shut off until you can, just in case. After that you can work around the problem if you wish, marking the badblocks and using the disk from then on only for temporary data or data that's always backed up elsewhere.

Drives used in mobile applications in particular can suffer similar head crashes from dropping the laptop or whatever, and there may be other ways to produce the same damage pattern as well.

How to work around the issue? First, as I said, back up the disk, or at least anything of value on it. That likely won't apply here, since you were setting up a new system on it anyway, but for completeness...

If you run into areas that won't copy easily, and you want to recover the data if possible, there's a package for this, sys-apps/dd-rescue, and it should be available on any good recovery LiveCD. (I doubt it's on the Gentoo install CDs, but you can check.) dd_rescue is the same idea as the normal Unix dd utility, but the rescue version is designed to read from the beginning of a partition forward until it runs into problems, then from the end backward, then in the middle of anything still left unread, until it has copied as much of the partition as possible. You can then fsck the recovered copy and see what can be repaired. Note, however, that this process will take a while, hours or possibly days depending on how much of the disk is damaged, as the drive retries each failing read several times and the software then has it try /again/ several times more. Depending on your I/O subsystem, you aren't likely to be able to do much else with the system while this is going on, as the try-fail-retry cycles tend to lock things up pretty badly, and they are repeated for every bad block. Recovery of all the data is obviously not guaranteed in any case, and you may simply decide it's not worth the hassle. Google or see the dd_rescue manpage for details. It's worth noting that dd_rescue can be configured to report the badblocks as it goes, so if you use it to recover existing data, save its badblocks report for reuse later and skip the badblocks mapping step below.

If you skip the data recovery attempt, or simply want to test a disk before you use it, you'll want another tool, badblocks, likely installed already as part of sys-fs/e2fsprogs. badblocks can scan the disk in a non-destructive read-only mode, a non-destructive read/write-back/read-back/compare mode, or a destructive write-pattern/read-back/compare mode. Do NOT use the destructive mode if there's anything on the partition you want to keep, as it WILL be overwritten.

However you generate the badblocks report, whether from dd_rescue or from badblocks, you then use that information when setting up the disk again. It's probably wise to set up multiple partitions, leaving the large bad areas unpartitioned. For smaller bad areas of just a handful of blocks, one of the parameters you can feed mkfs is a badblocks list.
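To make that concrete, here's roughly what the commands look like. The device names and file paths are hypothetical, and dd_rescue's options vary between versions (and differ from the similarly named GNU ddrescue), so check the manpages first:

  # copy as much of the damaged partition as possible onto a good disk
  dd_rescue /dev/sda3 /mnt/spare/sda3.img

  # map the bad blocks: the default mode is a read-only test, -n is the
  # non-destructive read-write test, and -w is the DESTRUCTIVE write test
  badblocks -b 4096 -sv -o /root/sda3.bad /dev/sda3

  # feed the list to mke2fs so those blocks are avoided from the start;
  # the -b block size must match between the two commands, which is why
  # simply letting mke2fs -c run the check itself, as in the quoted post
  # above, is the foolproof route
  mke2fs -j -b 4096 -l /root/sda3.bad /dev/sda3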
Again, check the manpages or Google for the details, but when you're done, you should be left with a working and fsck-able set of partitions once again, since the badblocks are either excluded from the partitioned area entirely, or listed in the superblock area of the filesystem you created with that mkfs parameter, and therefore avoided.

---

* For reliability purposes, I had my system set up with multiple copies of most of my partitions. The idea was that periodically, when the system seemed stable, I'd back up my main working copy of all the critical partitions, and could therefore boot a not-too-old backup copy in the event something broke on the working copy. Basically, all it took (and all it continues to take) to boot the root mirror is appending a different root= parameter to the kernel command line.

Thus, when portions of the drive were damaged, they were naturally the portions the head had tried to seek to while the drive was overheated, which means they were in the partitions mounted at the time. The unmounted partitions were therefore undamaged, and after finding the system crashed due to the overheating, once I cooled things back down, I could boot the backup partitions and resume from there. As it happened, only a couple of my working partitions were damaged, and I was able to use the working copies of all the other partitions.

In terms of partitioning strategy: with my old system I made the mistake of separating /var and /usr onto their own partitions, and then trying to mix and match backup partitions with working-copy partitions. That didn't work so well, because portage's records of what was installed came from the backup, and therefore outdated, /var partition, while /usr and root were the working copies, so portage had the wrong package versions listed as installed. Since I use FEATURES=buildpkg and had all the packages available in binary form, it was easy enough to reinstall everything from those, updating the portage database; but because the database hadn't been accurate, portage couldn't unmerge the no-longer-existing old versions, so I ended up with a bunch of stale and orphaned files strewn around.

When I upgraded from that disk, which I did as soon as I could since I didn't trust it even though it was working, I therefore set things up a bit differently. What I'd suggest today is keeping /var and /usr on your root partition, but putting /var/log, /var/tmp, /usr/portage, and /usr/src, as well as stuff such as /home, on other partitions. (You can use a single partition and either use mount --move or simply symlink, if you want to put several dirs from different places in the tree on the same partition; see the sketch below.) Basically, anything that portage installs to, along with its database in /var/db, should be kept on the same partition, so every backup of that partition will have the portage database in sync with what's actually installed.

Here, my / partition and its backup snapshots are 10 GB each. That's plenty of room to spare for me, since less than 2 GB is actually used. I'd recommend a total of three copies: the working copy and two snapshot backups of exactly the same partition size. The idea is to alternate backups, so that even if something happens after you've erased one backup in preparation for copying over a new snapshot, leaving that backup empty or incomplete at the same time the working copy dies, you'll still have the other backup to fall back on.
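The mechanics of all this amount to just a few commands. Everything below is a sketch with hypothetical device names (say /dev/hda5 is the working root, and /dev/hda6 and /dev/hda7 are the two snapshot partitions); adjust to your own layout:

  # refresh the older of the two root snapshots
  mount /dev/hda7 /mnt/rootsnap
  rsync -aHx --delete / /mnt/rootsnap/   # -x stays on the one filesystem
  umount /mnt/rootsnap

  # to boot a snapshot instead of the working root, edit the kernel line
  # at the bootloader prompt, changing only the root= parameter:
  kernel /boot/vmlinuz root=/dev/hda5    <- normal working root
  kernel /boot/vmlinuz root=/dev/hda7    <- fall back to a snapshot

  # gathering several dirs onto one big partition, either by symlink or
  # by moving an already-mounted tree into place:
  ln -s /mnt/big/portage /usr/portage
  mount --move /mnt/big/log /var/log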
Backups work similarly for partitions such as /home and /usr/local that hold data I want to be sure to keep: two or three copies of each, one working copy and one or two backups. /var/log you probably don't need a copy of. The same goes for wherever you keep your portage tree, since you can always just sync to get a fresh one if it's destroyed, and for /tmp and /var/tmp, since that's temporary data anyway and doesn't need a redundant copy.

Actually, while all of that can be implemented well on one or two disks, I eventually got tired of hard drive problems, and I'm now running a four-disk kernel-based SATA RAID here (Seagate drives with the 5-year warranty, although they aren't quite as fast as some of the others you can buy). Booting requires RAID-1, so I have a small RAID-1 partition mirrored across all four drives; that's /boot. Most of my system is RAID-6, which on a four-drive system is effectively a two-way stripe with two parity stripes as well, so I can lose any two of the four drives and anything on the RAID-6 will still be recoverable. Stuff like /tmp and the portage tree, which is either easily re-downloaded off the net or temporary anyway, is on a four-way RAID-0 for speed. If any of the four drives goes down, all that data is lost, but that's fine, since it's either temporary or easily recovered. Likewise, my swap is four-way striped. (A rough sketch of the mdadm commands for a layout like this is at the bottom of this post.)

Disk read/write speed on the four-way striped area is incredibly fast (for hard drive access): since drives are so much slower than the bus connecting them to the system, the system can keep the bus busy doing I/O to all four devices at once, instead of issuing it to one device and then waiting on that single slow drive. The problem with RAID-0, however, is that while it's far faster, it's also far riskier, since you lose everything on it if you lose any of the component devices. Fortunately, the data that's easiest to replace is generally also the most speed-critical, so it works out quite well. =8^)

So: RAID-1 mirroring for /boot, RAID-6 for safety for most of my system, and RAID-0 for speed where I don't care if the data dies. On top of that, for the parts of the system I really care about, I keep several snapshots around on the RAID-6, protecting me both from fat-finger deletions (where RAID won't help, unfortunately) with the multiple snapshots, and from device failure with the RAID-6.

As an added bonus, since I'm running kernel RAID, it's not hardware specific. If the SATA chip dies, all I have to do is buy a new four-port SATA board, plug the existing drives into it, and compile a new kernel (from a LiveCD or whatever) with the appropriate new SATA drivers, and I'm up and running again. If I had gone with hardware RAID and it died, I'd have to find another controller just like it if I wanted to recover my data, something I don't have to worry about with kernel RAID. =8^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
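And here's the promised sketch of what creating a layout like mine looks like with the kernel RAID tools. These are not my actual setup commands; the device names and partition numbers are hypothetical, so check the mdadm manpage before running anything:

  # /boot: small RAID-1, mirrored across all four drives
  mdadm --create /dev/md0 --level=1 --raid-devices=4 \
    /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

  # main system: RAID-6 across all four, effectively a two-way stripe
  # plus two parity stripes, so any two drives can die
  mdadm --create /dev/md1 --level=6 --raid-devices=4 \
    /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2

  # /tmp, the portage tree, swap, etc: four-way RAID-0 stripe for speed,
  # with no redundancy at all
  mdadm --create /dev/md2 --level=0 --raid-devices=4 \
    /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3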