* [gentoo-user] recovery from /var corruption?
@ 2010-02-26 3:33 Mark Knecht
  2010-02-26 9:09 ` Neil Bothwick
  ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Mark Knecht @ 2010-02-26 3:33 UTC
To: gentoo-user

So I got my wife's machine booted today using an install disk and
played a bit with e2fsck. The machine stopped being happy last night
due to some sort of corruption on the /var partition. e2fsck
complained about 3 or 4 files and then repaired the partition. The
machine booted cleanly as far as I can tell.

So, something went bad and I managed to sneak around it for a while,
and now I'm sort of living with the machine wondering what to do.

Do I just watch the logs looking for problems? I have no way of
knowing right now whether this was a disk problem that's going to come
back, a one-time deal due to power, or something else entirely.

For cheap machines like these that don't use RAID, what's the right
way to go? emerge -e @world and then wait for the next event? Do
nothing and wait?

We've got decent personal data backups as well as basic /etc data.

Thanks,
Mark
* Re: [gentoo-user] recovery from /var corruption?
From: Neil Bothwick @ 2010-02-26 9:09 UTC
To: gentoo-user

On Thu, 25 Feb 2010 19:33:23 -0800, Mark Knecht wrote:

> So I got my wife's machine booted today using an install disk and
> played a bit with e2fsck. The machine stopped being happy last night
> due to some sort of corruption on the /var partition. e2fsck
> complained about 3 or 4 files and then repaired the partition. The
> machine booted cleanly as far as I can tell.
>
> So, something went bad and I managed to sneak around it for a while,
> and now I'm sort of living with the machine wondering what to do.

Check the disk with smartmontools.

-- 
Neil Bothwick

All mail what i send is thoughly proof-red, definately!
* Re: [gentoo-user] recovery from /var corruption?
From: Alex Schuster @ 2010-02-26 9:46 UTC
To: gentoo-user

Mark Knecht writes:

> Do I just watch the logs looking for problems? I have no way of
> knowing right now whether this was a disk problem that's going to come
> back, a one-time deal due to power, or something else entirely.
>
> For cheap machines like these that don't use RAID, what's the right
> way to go? emerge -e @world and then wait for the next event? Do
> nothing and wait?

Emerge smartmontools, then:

  smartctl -H /dev/sda           # get overview of what the drive thinks about itself

  smartctl -t short /dev/sda     # start short self test
  Wait
  smartctl -l selftest /dev/sda  # see results

  smartctl -t long /dev/sda      # start long self test
  Wait a lot longer
  smartctl -l selftest /dev/sda  # see results

You can continue working in the meantime; the self tests run inside
the drive, so there should be little performance impact. You will see
something like this in the log:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error        00%       2275        -
# 2  Extended offline    Completed without error        00%       2270        -
# 3  Extended offline    Completed without error        00%       1799        -
# 4  Extended offline    Completed without error        00%        197        -
# 5  Extended offline    Completed without error        00%         26        -

If you have a '-' in the right column, the disk has found no errors.
If there is a number, then it's the position of the first error.

There's also badblocks, which will check every block and output the
bad ones:

  badblocks -sv /dev/sda

badblocks -svn /dev/sda will do a non-destructive read-write test. In
case of a bad block, the drive should exchange it with a spare one.
Maybe this happens already in read-only mode, I am not sure.

Also watch for errors in syslog or via dmesg; there should be some
when bad blocks are being accessed.

	Wonko
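The rule for reading the self-test log ('-' in the last column means no error, a number is the first failing LBA) can be checked mechanically. The following is only a sketch, assuming the log keeps the layout shown in the message above, with result rows starting with `# N`. The function name `selftest_errors` is made up for illustration and is not part of smartmontools.

```shell
# Read `smartctl -l selftest` output on stdin and print every result row
# whose last column (LBA_of_first_error) is something other than '-'.
# Returns 0 if all rows are clean, 1 if any row reports an LBA.
selftest_errors() {
    awk '
        /^#[ ]*[0-9]+/ {
            if ($NF != "-") { print; bad = 1 }
        }
        END { exit bad + 0 }
    '
}
```

It would be used as `smartctl -l selftest /dev/sda | selftest_errors`, which prints nothing and returns success when every recorded test finished without a failing LBA.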
* Re: [gentoo-user] recovery from /var corruption?
From: Mark Knecht @ 2010-02-26 15:17 UTC
To: gentoo-user

On Fri, Feb 26, 2010 at 1:46 AM, Alex Schuster <wonko@wonkology.org> wrote:
> Mark Knecht writes:
>
>> Do I just watch the logs looking for problems? I have no way of
>> knowing right now whether this was a disk problem that's going to come
>> back, a one-time deal due to power, or something else entirely.
>>
>> For cheap machines like these that don't use RAID, what's the right
>> way to go? emerge -e @world and then wait for the next event? Do
>> nothing and wait?
>
> Emerge smartmontools, then:
>
>   smartctl -H /dev/sda           # get overview of what the drive thinks about itself
>
>   smartctl -t short /dev/sda     # start short self test
>   Wait
>   smartctl -l selftest /dev/sda  # see results
>
>   smartctl -t long /dev/sda      # start long self test
>   Wait a lot longer
>   smartctl -l selftest /dev/sda  # see results
>
> You can continue working in the meantime; the self tests run inside
> the drive, so there should be little performance impact. You will see
> something like this in the log:
>
> === START OF READ SMART DATA SECTION ===
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error        00%       2275        -
> # 2  Extended offline    Completed without error        00%       2270        -
> # 3  Extended offline    Completed without error        00%       1799        -
> # 4  Extended offline    Completed without error        00%        197        -
> # 5  Extended offline    Completed without error        00%         26        -
>
> If you have a '-' in the right column, the disk has found no errors.
> If there is a number, then it's the position of the first error.
>
> There's also badblocks, which will check every block and output the
> bad ones: badblocks -sv /dev/sda
>
> badblocks -svn /dev/sda will do a non-destructive read-write test. In
> case of a bad block, the drive should exchange it with a spare one.
> Maybe this happens already in read-only mode, I am not sure.
>
> Also watch for errors in syslog or via dmesg; there should be some
> when bad blocks are being accessed.
>
>         Wonko
>

Hi Wonko,
   Yes, I do use smartctl on some other machines, although I'm not
very good about it, and your write-up is helpful, so thanks for that.

   My wife's machine is older and I don't think SMART is supported on
her drive. Note the lack of a * on the SMART line in hdparm -I:

dragonfly ~ # hdparm -I /dev/hda

/dev/hda:

ATA device, with non-removable media
        Model Number:       WDC WD1600BB-00FTA0
        Serial Number:      WD-WMAES2091586
        Firmware Revision:  15.05R15
Standards:
        Supported: 6 5 4
        Likely used: 6
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  312581808
        Logical/Physical Sector size:           512 bytes
        device size with M = 1024*1024:      152627 MBytes
        device size with M = 1000*1000:      160041 MBytes (160 GB)
        cache/buffer size  = 2048 KBytes (type=DualPortCache)
Capabilities:
        LBA, IORDY(can be disabled)
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 128, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
                SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
                Automatic Acoustic Management feature set
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
Security:
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
        not     supported: enhanced erase
HW reset results:
        CBLID- above Vih
        Device num = 0 determined by CSEL
Checksum: correct
dragonfly ~ #

dragonfly ~ # smartctl -H /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Disabled. Use option -s with argument 'on' to enable it.
dragonfly ~ # smartctl -s on /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
Error SMART Enable failed: Input/output error
Smartctl: SMART Enable Failed.

A mandatory SMART command failed: exiting. To continue, add one or
more '-T permissive' options.
dragonfly ~ #

I've not tried the -T permissive options.

I've never used badblocks as it seems I should only do that off-line.
This might be a good time to boot with a CD and try it out.

Maybe I should just get a new drive that supports SMART?

- Mark
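In the `hdparm -I` listing the asterisk column is headed "Enabled", so a missing `*` on "SMART feature set" suggests the feature is present but switched off. That reading matches the "Available ... Disabled" lines smartctl prints later in the thread. The sketch below shows one way to pull that state out of the listing. The function name `smart_feature_state` and its three output strings are invented for this example.

```shell
# Read `hdparm -I` output on stdin and report whether the
# "SMART feature set" line carries the '*' that marks an enabled feature.
# Prints "enabled", "supported but disabled", or "not listed".
smart_feature_state() {
    awk '
        /SMART feature set/ {
            seen = 1
            if ($1 == "*") enabled = 1
        }
        END {
            if (!seen)        print "not listed"
            else if (enabled) print "enabled"
            else              print "supported but disabled"
        }
    '
}
```

For example, `hdparm -I /dev/hda | smart_feature_state` on the listing in this message would be expected to report the feature as supported but disabled, since the line has no leading `*`.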
* Re: [gentoo-user] recovery from /var corruption?
From: Alex Schuster @ 2010-02-26 16:01 UTC
To: gentoo-user

Mark Knecht writes:

> Yes, I do use smartctl on some other machines, although I'm not
> very good about it, and your write-up is helpful, so thanks for that.
>
> My wife's machine is older and I don't think SMART is supported on
> her drive. Note the lack of a * on the SMART line in hdparm -I:

Okay, but it still states:

>    *    SMART error logging
>    *    SMART self-test

So maybe smartctl -t long /dev/hda still works? Just give it a try.

> dragonfly ~ # smartctl -H /dev/hda
> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> SMART Disabled. Use option -s with argument 'on' to enable it.
> dragonfly ~ # smartctl -s on /dev/hda
> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF ENABLE/DISABLE COMMANDS SECTION ===
> Error SMART Enable failed: Input/output error
> Smartctl: SMART Enable Failed.
>
> A mandatory SMART command failed: exiting. To continue, add one or
> more '-T permissive' options.
> dragonfly ~ #
>
> I've not tried the -T permissive options.

I would :)
There is also a BIOS setting for SMART, but I think this does not
matter here; it's only for being able to report a failing drive
before booting.

> I've never used badblocks as it seems I should only do that off-line.
> This might be a good time to boot with a CD and try it out.

In read-only mode, you can use it while the system is running. Only
the write test (option -n) refuses to run if partitions are mounted
from the drive. So I'd do the 'badblocks -sv /dev/hda' right now, if
you do not need the drive at full speed for a while. You can
interrupt it at any point with Ctrl-Z and continue with the fg command.

> Maybe I should just get a new drive that supports SMART?

If the drive is so old that it does not support SMART, you can
probably get one ten times as large for much less than it cost you.
And I would trust a new drive much more than such an old one. It
depends on how important the data is. If a total loss would not be
too painful, I had backups, and I did not need more speed or size, I
would keep it as long as it shows no errors.

	Wonko
* Re: [gentoo-user] recovery from /var corruption?
From: Mark Knecht @ 2010-02-26 16:53 UTC
To: gentoo-user

On Fri, Feb 26, 2010 at 8:01 AM, Alex Schuster <wonko@wonkology.org> wrote:
> Mark Knecht writes:
>
>> Yes, I do use smartctl on some other machines, although I'm not
>> very good about it, and your write-up is helpful, so thanks for that.
>>
>> My wife's machine is older and I don't think SMART is supported on
>> her drive. Note the lack of a * on the SMART line in hdparm -I:
>
> Okay, but it still states:
>
>>    *    SMART error logging
>>    *    SMART self-test
>
> So maybe smartctl -t long /dev/hda still works? Just give it a try.

No, -t long fails the same way. Basically every time I try to use
smartctl on the drive it seems to issue one of these 3-line reports
about SectorIdNotFound in dmesg. My other machines don't do this. Not
a good sign, I think...

hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=16777008, sector=18446744073709551615
hda: possibly failed opcode: 0xb0
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=262192, sector=18446744073709551615
hda: possibly failed opcode: 0xb0
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=48, sector=18446744073709551615
hda: possibly failed opcode: 0xb0

These commands create the same sort of lines in dmesg:

dragonfly ~ # smartctl -i /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar family
Device Model:     WDC WD1600BB-00FTA0
Serial Number:    WD-WMAES2091586
Firmware Version: 15.05R15
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Feb 26 08:49:00 2010 PST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

SMART Disabled. Use option -s with argument 'on' to enable it.
dragonfly ~ # smartctl -P show /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Drive found in smartmontools Database.  Drive identity strings:
MODEL:              WDC WD1600BB-00FTA0
FIRMWARE:           15.05R15
match smartmontools Drive Database entry:
MODEL REGEXP:       ^WDC WD(2|3|4|6|8|10|12|16|18|20|25)00BB-.*$
FIRMWARE REGEXP:    .*
MODEL FAMILY:       Western Digital Caviar family
ATTRIBUTE OPTIONS:  None preset; no -v options are required.
dragonfly ~ #

<SNIP>
>>
>> I've not tried the -T permissive options.
>
> I would :)
> There is also a BIOS setting for SMART, but I think this does not
> matter here; it's only for being able to report a failing drive
> before booting.

Tried -T permissive and -T verypermissive. Same result: more lines of
output, and I'm still told it's not turning on.

Could this have ANYTHING to do with kernel configuration? Is there
anything required at the kernel level that I might not have turned on?

>
>> I've never used badblocks as it seems I should only do that off-line.
>> This might be a good time to boot with a CD and try it out.
>
> In read-only mode, you can use it while the system is running. Only
> the write test (option -n) refuses to run if partitions are mounted
> from the drive. So I'd do the 'badblocks -sv /dev/hda' right now, if
> you do not need the drive at full speed for a while. You can
> interrupt it at any point with Ctrl-Z and continue with the fg command.
>

OK, I've started that test and will report back later what it says.

Thanks!

- Mark
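The three-line kernel reports above repeat with different `LBAsect` values. When there are many of them it can help to reduce the log to the distinct addresses. The sketch below does that, assuming the old IDE driver message format shown in this message. `sector_not_found_lbas` is a hypothetical helper name. Opcode 0xb0 is the ATA SMART command, so here the numbers most likely reflect rejected SMART requests rather than failed data reads. For errors hit during normal reads the same list would point at suspect areas of the disk.

```shell
# Read kernel log text on stdin and print each distinct LBA reported
# alongside a SectorIdNotFound error, one per line, in numeric order.
sector_not_found_lbas() {
    grep 'SectorIdNotFound' |
        sed -n 's/.*LBAsect=\([0-9][0-9]*\).*/\1/p' |
        sort -n -u
}
```

Typical use would be `dmesg | sector_not_found_lbas`.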
* Re: [gentoo-user] recovery from /var corruption?
From: Alex Schuster @ 2010-02-26 17:27 UTC
To: gentoo-user

Mark Knecht writes:

> On Fri, Feb 26, 2010 at 8:01 AM, Alex Schuster <wonko@wonkology.org>
> wrote:

> > Okay, but it still states:
> >>    *    SMART error logging
> >>    *    SMART self-test
> >
> > So maybe smartctl -t long /dev/hda still works? Just give it a try.
>
> No, -t long fails the same way. Basically every time I try to use
> smartctl on the drive it seems to issue one of these 3-line reports
> about SectorIdNotFound in dmesg. My other machines don't do this. Not
> a good sign, I think...
>
> hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: task_no_data_intr: error=0x10 { SectorIdNotFound },
> LBAsect=16777008, sector=18446744073709551615
> hda: possibly failed opcode: 0xb0

Uh-oh. Okay, I guess it just won't work then.

> Could this have ANYTHING to do with kernel configuration? Is there
> anything required at the kernel level that I might not have turned on?

I'm pretty sure it has nothing to do with the kernel, but with your
drive being incapable of the SMART commands.

But I guess using badblocks is not that different in the end. The
SMART self test runs in the background and does not create disk I/O
from the host, but I think what it does is not much different from
badblocks.

	Wonko
* Re: [gentoo-user] recovery from /var corruption?
From: Mark Knecht @ 2010-02-26 17:51 UTC
To: gentoo-user

On Fri, Feb 26, 2010 at 9:27 AM, Alex Schuster <wonko@wonkology.org> wrote:
> Mark Knecht writes:
>
>> On Fri, Feb 26, 2010 at 8:01 AM, Alex Schuster <wonko@wonkology.org>
>> wrote:
>
>> > Okay, but it still states:
>> >>    *    SMART error logging
>> >>    *    SMART self-test
>> >
>> > So maybe smartctl -t long /dev/hda still works? Just give it a try.
>>
>> No, -t long fails the same way. Basically every time I try to use
>> smartctl on the drive it seems to issue one of these 3-line reports
>> about SectorIdNotFound in dmesg. My other machines don't do this. Not
>> a good sign, I think...
>>
>> hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
>> hda: task_no_data_intr: error=0x10 { SectorIdNotFound },
>> LBAsect=16777008, sector=18446744073709551615
>> hda: possibly failed opcode: 0xb0
>
> Uh-oh. Okay, I guess it just won't work then.
>
>
>> Could this have ANYTHING to do with kernel configuration? Is there
>> anything required at the kernel level that I might not have turned on?
>
> I'm pretty sure it has nothing to do with the kernel, but with your
> drive being incapable of the SMART commands.
>
> But I guess using badblocks is not that different in the end. The
> SMART self test runs in the background and does not create disk I/O
> from the host, but I think what it does is not much different from
> badblocks.
>
>         Wonko
>
>

The machine _mostly_ crashed while running badblocks. I say mostly
because the mouse is still alive but I can no longer ssh in and cannot
open a terminal on my wife's desktop or get to the console.

I tried to Ctrl-C out of badblocks here (this was running shelled in)
before I figured out it was a total crash, which messed up the
terminal a bit, but you can see what it was reporting before the crash:

dragonfly ~ # badblocks -sv /dev/hda
Checking blocks 0 to 156290903
Checking for bad blocks (read-only test): 89360960done, 35:00 elapsed
89360961done, 35:09 elapsed
89360962
89360963
^C^C18% done, 35:27 elapsed

So, there seem to be problems, possibly with the drive, or maybe it's
some sort of overheating problem on the processor and this was just
the way the processor failed before the crash?

I ran memtest86 the night before last for 8 hours and had no memory
problems. I'll remove memory and PCI cards, reseat everything, and
then see what happens.

- Mark
* Re: [gentoo-user] recovery from /var corruption?
From: Volker Armin Hemmann @ 2010-02-26 17:59 UTC
To: gentoo-user

On Friday 26 February 2010, Mark Knecht wrote:

>
> The machine _mostly_ crashed while running badblocks. I say mostly
> because the mouse is still alive but I can no longer ssh in and cannot
> open a terminal on my wife's desktop or get to the console.

Because it has not crashed; it is waiting for the IDE timeouts.

>
> I tried to Ctrl-C out of badblocks here (this was running shelled in)
> before I figured out it was a total crash, which messed up the
> terminal a bit, but you can see what it was reporting before the crash:
>
> dragonfly ~ # badblocks -sv /dev/hda
> Checking blocks 0 to 156290903
> Checking for bad blocks (read-only test): 89360960done, 35:00 elapsed
> 89360961done, 35:09 elapsed
> 89360962
> 89360963
> ^C^C18% done, 35:27 elapsed
>
> So, there seem to be problems, possibly with the drive, or maybe it's
> some sort of overheating problem on the processor and this was just
> the way the processor failed before the crash?
>
> I ran memtest86 the night before last for 8 hours and had no memory
> problems. I'll remove memory and PCI cards, reseat everything, and
> then see what happens.

Protip: if you are running badblocks (or ddrescue) on a probably
damaged device, attach it with a USB adapter. That way your box is
still usable.

/me hates the Linux kernel for making processes in D state unkillable
and for sucking very much at disk I/O.
* Re: [gentoo-user] recovery from /var corruption?
From: Paul Hartman @ 2010-02-26 18:19 UTC
To: gentoo-user

On Fri, Feb 26, 2010 at 11:59 AM, Volker Armin Hemmann
<volkerarmin@googlemail.com> wrote:
> Protip: if you are running badblocks (or ddrescue) on a probably
> damaged device, attach it with a USB adapter. That way your box is
> still usable.

+1, I had a bad drive and it's so much easier to unplug/replug the USB
instead of rebooting and so on.
* Re: [gentoo-user] recovery from /var corruption?
From: Mark Knecht @ 2010-02-26 18:26 UTC
To: gentoo-user

On Fri, Feb 26, 2010 at 9:59 AM, Volker Armin Hemmann
<volkerarmin@googlemail.com> wrote:
> On Friday 26 February 2010, Mark Knecht wrote:
>
>>
>> The machine _mostly_ crashed while running badblocks. I say mostly
>> because the mouse is still alive but I can no longer ssh in and cannot
>> open a terminal on my wife's desktop or get to the console.
>
> Because it has not crashed; it is waiting for the IDE timeouts.

So if I let it continue running, is it going to come back in the next
hour or two? I am assuming the IDE timeouts are because the drive is
having trouble, correct? That's the theory here? If so, then unless
the software can mark the blocks bad and somehow create good files out
of bad ones, I'm still left with a machine that is going to need
serious work done before it's a happy box again, correct?

On the other hand, because I have reasonably good user backups
(although no real system backups) right now, if I bite the bullet and
rebuild the machine then when my wife gets it back it's hopefully
going to be more reliable, won't it?

I'm thinking that maybe I just copy a little stuff off the box, /etc
and the like, and then boot the machine with the Gentoo install CD or
System Rescue CD and see what the drive is doing?

That doesn't cost me anything to look around, but if SMART won't turn
on and badblocks is suggesting the drive is having trouble, maybe
running something like badblocks and actually __marking__ blocks as
bad and then reloading Gentoo would work in the long run? (A lot of
work though.)

I'm really not interested in buying a new drive because the machine is
ATA100/133, and if it's not the drive then the money is wasted that
could have gone toward a new machine. The cheapest at NewEgg is about
$40. Why spend the bucks on an old Intel Centrino machine?

>
>>
>> I tried to Ctrl-C out of badblocks here (this was running shelled in)
>> before I figured out it was a total crash, which messed up the
>> terminal a bit, but you can see what it was reporting before the crash:
>>
>> dragonfly ~ # badblocks -sv /dev/hda
>> Checking blocks 0 to 156290903
>> Checking for bad blocks (read-only test): 89360960done, 35:00 elapsed
>> 89360961done, 35:09 elapsed
>> 89360962
>> 89360963
>> ^C^C18% done, 35:27 elapsed
>>
>> So, there seem to be problems, possibly with the drive, or maybe it's
>> some sort of overheating problem on the processor and this was just
>> the way the processor failed before the crash?
>>
>> I ran memtest86 the night before last for 8 hours and had no memory
>> problems. I'll remove memory and PCI cards, reseat everything, and
>> then see what happens.
>
> Protip: if you are running badblocks (or ddrescue) on a probably
> damaged device, attach it with a USB adapter. That way your box is
> still usable.
>
> /me hates the Linux kernel for making processes in D state unkillable
> and for sucking very much at disk I/O.
>
>

Good inputs. Thanks!

Cheers,
Mark
* Re: [gentoo-user] recovery from /var corruption?
From: Volker Armin Hemmann @ 2010-02-26 18:37 UTC
To: gentoo-user

On Friday 26 February 2010, Mark Knecht wrote:
> On Fri, Feb 26, 2010 at 9:59 AM, Volker Armin Hemmann
>
> <volkerarmin@googlemail.com> wrote:
> > On Friday 26 February 2010, Mark Knecht wrote:
> >> The machine _mostly_ crashed while running badblocks. I say mostly
> >> because the mouse is still alive but I can no longer ssh in and cannot
> >> open a terminal on my wife's desktop or get to the console.
> >
> > Because it has not crashed; it is waiting for the IDE timeouts.
>
> So if I let it continue running, is it going to come back in the next
> hour or two?

yes

> I am assuming the IDE timeouts are because the drive is
> having trouble, correct? That's the theory here?

yes

> If so, then unless the software can mark the blocks bad and somehow
> create good files out of bad ones, I'm still left with a machine that
> is going to need serious work done before it's a happy box again,
> correct?

and with 'serious work' you mean 'replace the hard disk' ...

>
> On the other hand, because I have reasonably good user backups
> (although no real system backups) right now, if I bite the bullet and
> rebuild the machine then when my wife gets it back it's hopefully
> going to be more reliable, won't it?

yes

>
> I'm thinking that maybe I just copy a little stuff off the box, /etc
> and the like, and then boot the machine with the Gentoo install CD or
> System Rescue CD and see what the drive is doing?

you could do that.

>
> That doesn't cost me anything to look around, but if SMART won't turn
> on and badblocks is suggesting the drive is having trouble, maybe
> running something like badblocks and actually __marking__ blocks as
> bad and then reloading Gentoo would work in the long run? (A lot of
> work though.)

You would need to save the badblocks output to a file, then feed that
file to mkfs.

And you are still not safe, because when a drive starts to have bad
blocks the chance that more will pop up soon is pretty high. So you
might be lucky and the drive is able to run for a long while (maybe
even mapping out bad blocks while testing them, so always run
badblocks twice), but you have at least as good a chance that the
whole thing starts over in a couple of weeks.

>
> I'm really not interested in buying a new drive because the machine is
> ATA100/133, and if it's not the drive then the money is wasted that
> could have gone toward a new machine. The cheapest at NewEgg is about
> $40. Why spend the bucks on an old Intel Centrino machine?

You could take the drive with you when you buy a new machine. Moving
hard disks is not that hard. Or put it in a USB enclosure when you
don't need it anymore. IDE-USB enclosures are cheap.
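The advice to run badblocks twice raises a practical question: did the second pass find anything the first did not? The sketch below compares two saved lists. The commented invocation uses `badblocks -o` to write the list and `mke2fs -l` to read it. The device name, file names, and block size are placeholders, and `new_bad_blocks` is a name invented here. The list is only meaningful to mke2fs if both tools use the same block size and badblocks was run on the partition being formatted.

```shell
# Typical invocation (not run here; device and file names are placeholders):
#   badblocks -sv -b 4096 -o run1.txt /dev/hdaN
#   badblocks -sv -b 4096 -o run2.txt /dev/hdaN
#   mke2fs -b 4096 -l run2.txt /dev/hdaN   # destroys the filesystem on hdaN

# Print the block numbers that appear in the second list but not in the
# first, i.e. blocks that went bad between two badblocks runs.
new_bad_blocks() {
    awk '
        NF == 0                { next }               # skip blank lines
        FILENAME == ARGV[1]    { seen[$1] = 1; next } # remember first run
        !($1 in seen)          { print $1 }           # new in second run
    ' "$1" "$2"
}
```

If `new_bad_blocks run1.txt run2.txt` prints anything, the drive is still losing blocks, which is the situation the message above warns about.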
* Re: [gentoo-user] recovery from /var corruption?
From: Mark Knecht @ 2010-02-26 18:48 UTC
To: gentoo-user

On Fri, Feb 26, 2010 at 10:26 AM, Mark Knecht <markknecht@gmail.com> wrote:
<SNIP>
>
> On the other hand, because I have reasonably good user backups
> (although no real system backups) right now, if I bite the bullet and
> rebuild the machine then when my wife gets it back it's hopefully
> going to be more reliable, won't it?
>
> I'm thinking that maybe I just copy a little stuff off the box, /etc
> and the like, and then boot the machine with the Gentoo install CD or
> System Rescue CD and see what the drive is doing?
>
<SNIP>

As a related idea, I dug out an old copy of Spinrite, which I'll run on
all the partitions just to see what it says.

However, if the problem is currently one partition (/var) which is
still mostly readable, could I not just create a new var partition
(the drive has space free), copy the important stuff from old var to
new var, change fstab, and then basically just go on from there?

Cheers,
Mark
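The "change fstab" step in the plan above is small enough to sketch. The function below prints a copy of an fstab file with the /var entry pointed at a different device and leaves every other line alone. The name `retarget_var` and the device names in the usage line are hypothetical. The copy itself (for example with `cp -a` from a rescue CD) is not shown. Moving /var to another partition on the same disk does not address the underlying drive problem discussed earlier in the thread.

```shell
# Print a copy of an fstab file in which the entry mounted on /var points
# at a new device. Comment lines and all other entries pass through
# unchanged; the rewritten line is re-joined with single spaces.
# Usage: retarget_var /etc/fstab /dev/hdaN > fstab.new
retarget_var() {
    awk -v dev="$2" '
        $1 !~ /^#/ && $2 == "/var" { $1 = dev }
        { print }
    ' "$1"
}
```

Writing to a new file and reviewing it before replacing /etc/fstab keeps a typo from leaving the machine unbootable.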
* Re: [gentoo-user] recovery from /var corruption? 2010-02-26 3:33 [gentoo-user] recovery from /var corruption? Mark Knecht 2010-02-26 9:09 ` Neil Bothwick 2010-02-26 9:46 ` Alex Schuster @ 2010-02-26 11:47 ` daid kahl 2010-02-26 17:38 ` daid kahl 3 siblings, 0 replies; 16+ messages in thread From: daid kahl @ 2010-02-26 11:47 UTC (permalink / raw To: gentoo-user On 26 February 2010 12:33, Mark Knecht <markknecht@gmail.com> wrote: > So I got my wife's machine booted today using a install disk and > played a bit with e2fsck. The machine stopped being happy last night > due to some sort of corruption on the /var partition. e2fsck > complained about 3 or 4 files and then repaired the partition. The > machine booted cleanly as far as I can tell. Hey buddy! This happened to me, too! See below for my savage ranting for a good laugh. My rule for this is rsnapshot my present system as it is, grab a disk image backup (taken less frequently), and then go to town with portage. I emerged 620 packages today. (Much more in fact if I count rebuilding and stuff.) Only OO.o update is remaining in world. I don't think there's a good and safe way around it. I find inode corruption can be sneaky and hit other stuff. Assuming your backs all exist and stuff, then you can hit up stuff like rsync with the update flag for your personal files between newest and safest backups. Rant: Okay, so Mac OS is getting it to the face now, officially, and forever in my world. I've almost kind of said this before, and I can't remember why I don't follow my own advice, but nothing can be worse than twice-monthly 10% inode corruption. Now check this out: The e2fs program is told "do not mount sda3" and "if you ever do, mount it ro." Even though Mac OS is crazy enough not to use /etc/fstab, it will still (supposedly) listen to rules in here. I found some very retarded way of effectively serial-device referencing sda3, and I said, "do not mount this drive at boot, and if you do, do it ro." 
Then I went into a Disk Utility thing. I told it the same thing. So that's three times I've said, "Never touch this drive with a 10 foot pole, plz thx!" Yeah, please explain to me how an unmounted, only ro drive can receive rectal examination of 11.4% inode corruption. Others, please take this as a lesson (in some form or another). I think it's the badly coded e2fs program, but that thing is so bad that if it is to blame, it happened after I tried to uninstall the program too, so who knows. So I'm going to put a tiny Tiger install this weekend so I can get a nice boot, a few firmware accesses (kill the silly booting sound, and delay an annoying 20 second boot delay in the case there is no EFI partition...ugh). And then I am going to never look at its ugly face again. System Rescue CD, partimage, and rsnapshot are my friends! (I had so many packages because over the holidays I didn't do sync and world updates, and then I decided to go back to the wonderful ~x86, but since I was super busy and I don't like backing up a system that's untested, then I didn't have good backups of the updates. Maybe a poor choice, but in any case, that was not the reason I was trying to kick myself in the face.) Be bloody lucky, or don't use retarded softwarez--- daid > > So, something went bad and I managed to sneak around it for a while > and now I'm sort of living with the machine wondering what to do. > > Do I just watch the logs looking for problems? I have no way of > knowing right now whether this was a disk problem that's going to come > back, a 1 time deal due to power, or something else entirely. > > As these cheap machines that don't use RAID what's the right way to > go? emerge -e @world and then wait for the next event? Do nothing and > wait? > > We've got decent personal data backups as well as basic /etc data. > > Thanks, > Mark > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [gentoo-user] recovery from /var corruption? 2010-02-26 3:33 [gentoo-user] recovery from /var corruption? Mark Knecht ` (2 preceding siblings ...) 2010-02-26 11:47 ` daid kahl @ 2010-02-26 17:38 ` daid kahl 2010-02-26 18:57 ` Mark Knecht 3 siblings, 1 reply; 16+ messages in thread From: daid kahl @ 2010-02-26 17:38 UTC (permalink / raw To: gentoo-user On 26 February 2010 12:33, Mark Knecht <markknecht@gmail.com> wrote: > So I got my wife's machine booted today using a install disk and > played a bit with e2fsck. The machine stopped being happy last night > due to some sort of corruption on the /var partition. e2fsck > complained about 3 or 4 files and then repaired the partition. The > machine booted cleanly as far as I can tell. > > So, something went bad and I managed to sneak around it for a while > and now I'm sort of living with the machine wondering what to do. > > Do I just watch the logs looking for problems? I have no way of > knowing right now whether this was a disk problem that's going to come > back, a 1 time deal due to power, or something else entirely. > > As these cheap machines that don't use RAID what's the right way to > go? emerge -e @world and then wait for the next event? Do nothing and > wait? > > We've got decent personal data backups as well as basic /etc data. > > Thanks, > Mark > I reconsidered your problem, and I actually wonder if emerging world is a valid notion in this case, as the world file is under /var and this is reported as corrupt. In this sense, it may be entirely non-trivial to regenerate (without backup) the correct world-file for a system. Am I out in the deep end, or is this, in fact, the critical point that needs consideration here? ~daid ^ permalink raw reply [flat|nested] 16+ messages in thread
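[Editor's sketch] On the point above about regenerating a lost world file: one rough starting point, assuming Portage's installed-package database (normally /var/db/pkg, laid out as category/package-version directories) is still readable, is to list every installed package as a candidate. This is not the real world file, which holds only explicitly requested packages, so the output would need pruning by hand. The function name `installed_atoms` and the version-stripping regex are assumptions made for illustration and only approximate Portage's version syntax.

```python
import re
from pathlib import Path
from typing import List

# Approximation of a trailing Gentoo version such as -4.4.3-r2,
# -0.9.8l or -2.2_rc62. Not a full implementation of Portage's rules.
_VERSION = re.compile(r"-\d+(\.\d+)*[a-z]?(_(alpha|beta|pre|rc|p)\d*)*(-r\d+)?$")


def installed_atoms(pkg_db: Path) -> List[str]:
    """List category/package names found under a /var/db/pkg style tree."""
    atoms = set()
    for category in pkg_db.iterdir():
        if not category.is_dir():
            continue
        for pkg in category.iterdir():
            # Skip stray files and Portage's temporary -MERGING- directories.
            if not pkg.is_dir() or pkg.name.startswith("-"):
                continue
            atoms.add(f"{category.name}/{_VERSION.sub('', pkg.name)}")
    return sorted(atoms)


if __name__ == "__main__":
    for atom in installed_atoms(Path("/var/db/pkg")):
        print(atom)
```

Since the package database itself lives under /var, this only helps when the corruption has not reached it. A saved copy of the world file remains the simpler route.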
* Re: [gentoo-user] recovery from /var corruption? 2010-02-26 17:38 ` daid kahl @ 2010-02-26 18:57 ` Mark Knecht 0 siblings, 0 replies; 16+ messages in thread From: Mark Knecht @ 2010-02-26 18:57 UTC (permalink / raw To: gentoo-user On Fri, Feb 26, 2010 at 9:38 AM, daid kahl <daidxor@gmail.com> wrote: > On 26 February 2010 12:33, Mark Knecht <markknecht@gmail.com> wrote: >> So I got my wife's machine booted today using a install disk and >> played a bit with e2fsck. The machine stopped being happy last night >> due to some sort of corruption on the /var partition. e2fsck >> complained about 3 or 4 files and then repaired the partition. The >> machine booted cleanly as far as I can tell. >> >> So, something went bad and I managed to sneak around it for a while >> and now I'm sort of living with the machine wondering what to do. >> >> Do I just watch the logs looking for problems? I have no way of >> knowing right now whether this was a disk problem that's going to come >> back, a 1 time deal due to power, or something else entirely. >> >> As these cheap machines that don't use RAID what's the right way to >> go? emerge -e @world and then wait for the next event? Do nothing and >> wait? >> >> We've got decent personal data backups as well as basic /etc data. >> >> Thanks, >> Mark >> > > I reconsidered your problem, and I actually wonder if emerging world > is a valid notion in this case, as the world file is under /var and > this is reported as corrupt. > > In this sense, it may be entirely non-trivial to regenerate (without > backup) the correct world-file for a system. > > Am I out in the deep end, or is this, in fact, the critical point that > needs consideration here? > > ~daid Hi daid, In general you are correct. If I didn't have a copy of the world file then it would be a bit hit and miss. In this case I do have it saved elsewhere so it's actually quite easy. This failure is more (it seems) a few bad blocks on one partition and not a total drive failure. 
I'm leaning toward a new /var partition and just ignoring the partition that has problems. It will sit on the disk but it's only 10GB out of 160GB so it's not the end of the world by any means. Thanks! - Mark ^ permalink raw reply [flat|nested] 16+ messages in thread
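[Editor's sketch] Copying from a partition with a few bad blocks means some reads may fail partway through. The sketch below shows one way to copy what is readable and keep a list of what is not, so the failures can be restored from backup or re-emerged later. The function name `salvage_tree` and the mount points in the usage line are hypothetical. It handles regular files and directories only and does not preserve ownership, so a real migration would more likely use a tool such as `cp -a` or `rsync -a`.

```python
import os
import shutil
from pathlib import Path
from typing import List


def salvage_tree(src: Path, dst: Path) -> List[Path]:
    """Copy src into dst, skipping anything that raises an I/O error.
    Returns the list of source paths that could not be copied."""
    failed: List[Path] = []

    def note_walk_error(err: OSError) -> None:
        # Called by os.walk when a directory cannot be listed.
        failed.append(Path(err.filename))

    for dirpath, _dirnames, filenames in os.walk(src, onerror=note_walk_error):
        target_dir = dst / Path(dirpath).relative_to(src)
        target_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            source = Path(dirpath) / name
            try:
                # copy2 keeps timestamps and mode bits; symlinks are copied as links.
                shutil.copy2(source, target_dir / name, follow_symlinks=False)
            except OSError:
                failed.append(source)
    return failed


if __name__ == "__main__":
    # Hypothetical mount points for the old and new /var partitions.
    for lost in salvage_tree(Path("/mnt/oldvar"), Path("/mnt/newvar")):
        print("could not copy:", lost)
```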