[gentoo-user] recovery from /var corruption?

public inbox for gentoo-user@lists.gentoo.org

* [gentoo-user] recovery from /var corruption?
@ 2010-02-26  3:33 Mark Knecht
  2010-02-26  9:09 ` Neil Bothwick
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Mark Knecht @ 2010-02-26  3:33 UTC
  To: gentoo-user

So I got my wife's machine booted today using an install disk and
played a bit with e2fsck. The machine stopped being happy last night
due to some sort of corruption on the /var partition. e2fsck
complained about 3 or 4 files and then repaired the partition. The
machine booted cleanly as far as I can tell.

So, something went bad and I managed to sneak around it for a while
and now I'm sort of living with the machine wondering what to do.

Do I just watch the logs looking for problems? I have no way of
knowing right now whether this was a disk problem that's going to come
back, a 1 time deal due to power, or something else entirely.

With cheap machines like these that don't use RAID, what's the right way to
go? emerge -e @world and then wait for the next event? Do nothing and
wait?

We've got decent personal data backups as well as basic /etc data.

Thanks,
Mark


* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26  3:33 [gentoo-user] recovery from /var corruption? Mark Knecht
@ 2010-02-26  9:09 ` Neil Bothwick
  2010-02-26  9:46 ` Alex Schuster
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Neil Bothwick @ 2010-02-26  9:09 UTC
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 645 bytes --]

On Thu, 25 Feb 2010 19:33:23 -0800, Mark Knecht wrote:

> So I got my wife's machine booted today using an install disk and
> played a bit with e2fsck. The machine stopped being happy last night
> due to some sort of corruption on the /var partition. e2fsck
> complained about 3 or 4 files and then repaired the partition. The
> machine booted cleanly as far as I can tell.
> 
> So, something went bad and I managed to sneak around it for a while
> and now I'm sort of living with the machine wondering what to do.

Check the disk with smartmontools.


-- 
Neil Bothwick

All mail what i send is thoughly proof-red, definately!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]


* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26  3:33 [gentoo-user] recovery from /var corruption? Mark Knecht
  2010-02-26  9:09 ` Neil Bothwick
@ 2010-02-26  9:46 ` Alex Schuster
  2010-02-26 15:17   ` Mark Knecht
  2010-02-26 11:47 ` daid kahl
  2010-02-26 17:38 ` daid kahl
  3 siblings, 1 reply; 16+ messages in thread
From: Alex Schuster @ 2010-02-26  9:46 UTC
  To: gentoo-user

Mark Knecht writes:

> Do I just watch the logs looking for problems? I have no way of
> knowing right now whether this was a disk problem that's going to come
> back, a 1 time deal due to power, or something else entirely.
> 
> As these cheap machines that don't use RAID what's the right way to
> go? emerge -e @world and then wait for the next event? Do nothing and
> wait?

Emerge smartmontools, then:

smartctl -H /dev/sda  # get the drive's overall health self-assessment

smartctl -t short /dev/sda     # start short self test
Wait
smartctl -l selftest /dev/sda  # see results

smartctl -t long /dev/sda      # start long self test
Wait a lot longer
smartctl -l selftest /dev/sda  # see results

You can continue working in the meantime; there should be no noticeable
performance impact. You will see something like this in the log:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description   Status              Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error   00%    2275       -
# 2  Extended offline   Completed without error   00%    2270       -
# 3  Extended offline   Completed without error   00%    1799       -
# 4  Extended offline   Completed without error   00%     197       -
# 5  Extended offline   Completed without error   00%      26       -

If you have a '-' in the right column, the disk has found no errors. If
there is a number, then it's the position of the first error.
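
The check described above can be scripted. This is a minimal sketch, assuming
the log layout shown in this message; the function name failed_selftests is
made up for illustration and is not part of smartmontools.

```shell
# Hypothetical helper: read `smartctl -l selftest` output on stdin and
# print every log entry whose last column (LBA_of_first_error) is not "-".
# Assumes log entries start with "# <number>" as in the sample above.
failed_selftests() {
    awk '/^# *[0-9]+ / && $NF != "-" { print }'
}
```

Possible usage: smartctl -l selftest /dev/sda | failed_selftests
No output would mean that no logged self-test recorded an error position.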

There's also badblocks, this will check every block and output the bad 
ones: badblocks -sv /dev/sda

badblocks -svn /dev/sda will do a read-write test. In case of a bad block, 
the drive should exchange it with a spare one. Maybe this happens already 
in read-only mode, I am not sure.

Also watch for errors in syslog or via dmesg, there should be some when 
bad blocks are being accessed.

	Wonko




* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26  9:46 ` Alex Schuster
@ 2010-02-26 15:17   ` Mark Knecht
  2010-02-26 16:01     ` Alex Schuster
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Knecht @ 2010-02-26 15:17 UTC
  To: gentoo-user

On Fri, Feb 26, 2010 at 1:46 AM, Alex Schuster <wonko@wonkology.org> wrote:
> Mark Knecht writes:
>
>> Do I just watch the logs looking for problems? I have no way of
>> knowing right now whether this was a disk problem that's going to come
>> back, a 1 time deal due to power, or something else entirely.
>>
>> As these cheap machines that don't use RAID what's the right way to
>> go? emerge -e @world and then wait for the next event? Do nothing and
>> wait?
>
> Emerge smartmontools, then:
>
> smartctl -H /dev/sda  # get the drive's overall health self-assessment
>
> smartctl -t short /dev/sda     # start short self test
> Wait
> smartctl -l selftest /dev/sda  # see results
>
> smartctl -t long /dev/sda      # start long self test
> Wait a lot longer
> smartctl -l selftest /dev/sda  # see results
>
> You can continue working in the meanwhile, there will be no performance
> impact. You will see something like this in the log:
>
> === START OF READ SMART DATA SECTION ===
> SMART Self-test log structure revision number 1
> Num  Test_Description   Status              Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline      Completed without error   00%    2275       -
> # 2  Extended offline   Completed without error   00%    2270       -
> # 3  Extended offline   Completed without error   00%    1799       -
> # 4  Extended offline   Completed without error   00%     197       -
> # 5  Extended offline   Completed without error   00%      26       -
>
> If you have a '-' in the right column, the disk has found no errors. If
> there is a number, then it's the position of the first error.
>
> There's also badblocks, this will check every block and output the bad
> ones: badblocks -sv /dev/sda
>
> badblocks -svn /dev/sda will do a read-write test. In case of a bad block,
> the drive should exchange it with a spare one. Maybe this happens already
> in read-only mode, I am not sure.
>
> Also watch for errors in syslog or via dmesg, there should be some when
> bad blocks are being accessed.
>
>        Wonko
>
>

Hi Wonko,
   Yes, I do use smartctl on some other machines although I'm not very
good about it and your write-up is helpful so thanks for that.

   My wife's machine is older and I don't think SMART is
supported on her drive. Note the lack of a * on the SMART line in
hdparm -I:

dragonfly ~ # hdparm -I /dev/hda

/dev/hda:

ATA device, with non-removable media
	Model Number:       WDC WD1600BB-00FTA0
	Serial Number:      WD-WMAES2091586
	Firmware Revision:  15.05R15
Standards:
	Supported: 6 5 4
	Likely used: 6
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  312581808
	Logical/Physical Sector size:           512 bytes
	device size with M = 1024*1024:      152627 MBytes
	device size with M = 1000*1000:      160041 MBytes (160 GB)
	cache/buffer size  = 2048 KBytes (type=DualPortCache)
Capabilities:
	LBA, IORDY(can be disabled)
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Recommended acoustic management value: 128, current value: 254
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	    	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	    	SET_MAX security extension
	    	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
Security:
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
	not	supported: enhanced erase
HW reset results:
	CBLID- above Vih
	Device num = 0 determined by CSEL
Checksum: correct
dragonfly ~ #

dragonfly ~ # smartctl -H /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Disabled. Use option -s with argument 'on' to enable it.
dragonfly ~ # smartctl -s on /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
Error SMART Enable failed: Input/output error
Smartctl: SMART Enable Failed.

A mandatory SMART command failed: exiting. To continue, add one or
more '-T permissive' options.
dragonfly ~ #

I've not tried the -T permissive options.

I've never used badblocks as it seems I should only do that off-line.
This might be a good time to boot with a CD and try it out.

Maybe I should just get a new drive that supports SMART?

- Mark




* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 15:17   ` Mark Knecht
@ 2010-02-26 16:01     ` Alex Schuster
  2010-02-26 16:53       ` Mark Knecht
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Schuster @ 2010-02-26 16:01 UTC
  To: gentoo-user

Mark Knecht writes:

>    Yes, I do use smartctl on some other machines although I'm not very
> good about it and your write-up is helpful so thanks for that.
> 
>    My wife's machine is older and I don't think SMART is
> supported on her drive. Note the lack of a * on the SMART line in
> hdparm -I:

Okay, but it still states:

> 	   *	SMART error logging
> 	   *	SMART self-test

So maybe smartctl -t long /dev/hda still works? Just give it a try.

> dragonfly ~ # smartctl -H /dev/hda
> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> SMART Disabled. Use option -s with argument 'on' to enable it.
> dragonfly ~ # smartctl -s on /dev/hda
> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> === START OF ENABLE/DISABLE COMMANDS SECTION ===
> Error SMART Enable failed: Input/output error
> Smartctl: SMART Enable Failed.
> 
> A mandatory SMART command failed: exiting. To continue, add one or
> more '-T permissive' options.
> dragonfly ~ #
> 
> I've not tried the -T permissive options.

I would :)  There is also a BIOS setting for SMART, but I think this does 
not matter here, and it's only for being able to report a failing drive 
before booting.

> I've never used badblocks as it seems I should only do that off-line.
> This might be a good time to boot with a CD and try it out.

In read-only mode, you can use it when the system is running. Only the 
write test (option -n) refuses to run if partitions are mounted from the 
drive. So I'd do the 'badblocks -sv /dev/hda' right now, if you do not 
need the drive at full speed for a while. You can interrupt it at any 
point with Ctrl-Z and continue with the fg command.

> Maybe I should just get a new drive that supports SMART?

If the drive is so old that it does not support SMART, you can probably get
one ten times as large for much less than this one cost you. And I would
trust a new drive much more than such an old one. It depends on how important
the data is: if a total loss would not be too painful, I had backups,
and I did not need more speed or size, I would keep it as long as it shows no
errors.

	Wonko


* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 16:01     ` Alex Schuster
@ 2010-02-26 16:53       ` Mark Knecht
  2010-02-26 17:27         ` Alex Schuster
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Knecht @ 2010-02-26 16:53 UTC
  To: gentoo-user

On Fri, Feb 26, 2010 at 8:01 AM, Alex Schuster <wonko@wonkology.org> wrote:
> Mark Knecht writes:
>
>>    Yes, I do use smartctl on some other machines although I'm not very
>> good about it and your write-up is helpful so thanks for that.
>>
>>    My wife's machine is older and I don't think SMART is
>> supported on her drive. Note the lack of a * on the SMART line in
>> hdparm -I:
>
> Okay, but it still states:
>
>>          *    SMART error logging
>>          *    SMART self-test
>
> So maybe smartctl -t long /dev/hda still works? Just give it a try.

No, -t long fails the same way. Basically every time I try to use
smartctl on the drive it seems to issue one of these 3-line reports
about SectorIDNotFound in dmesg. My other machines don't do this. Not
a good sign I think...

hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound },
LBAsect=16777008, sector=18446744073709551615
hda: possibly failed opcode: 0xb0
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound },
LBAsect=262192, sector=18446744073709551615
hda: possibly failed opcode: 0xb0
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=48,
sector=18446744073709551615
hda: possibly failed opcode: 0xb0
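
For reading these lines: the status value is a bit field, and opcode 0xb0 is
the ATA SMART command, so each report is the drive rejecting a SMART request
rather than a failed data read. The sketch below decodes a status byte into
flag names. The function name is made up, and the names for the bits that do
not appear in the log above are assumptions written in the same style.

```shell
# Hypothetical helper: decode an ATA status-register byte into flag names.
# Bit positions follow the ATA status register layout; only DriveReady,
# SeekComplete and Error are confirmed by the dmesg lines above, the
# other labels are assumed.
decode_ata_status() {
    status=$(( $1 ))
    out=""
    for pair in 128:Busy 64:DriveReady 32:DeviceFault 16:SeekComplete \
                8:DataRequest 4:CorrectedError 2:Index 1:Error; do
        bit=${pair%%:*}      # numeric bit mask
        name=${pair#*:}      # label for that bit
        if [ $(( status & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
}
```

Calling decode_ata_status 0x51 gives the same three flags the kernel printed.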

These commands create the same sort of lines in dmesg:

dragonfly ~ # smartctl -i /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar family
Device Model:     WDC WD1600BB-00FTA0
Serial Number:    WD-WMAES2091586
Firmware Version: 15.05R15
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Feb 26 08:49:00 2010 PST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

SMART Disabled. Use option -s with argument 'on' to enable it.
dragonfly ~ # smartctl -P show /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Drive found in smartmontools Database.  Drive identity strings:
MODEL:              WDC WD1600BB-00FTA0
FIRMWARE:           15.05R15
match smartmontools Drive Database entry:
MODEL REGEXP:       ^WDC WD(2|3|4|6|8|10|12|16|18|20|25)00BB-.*$
FIRMWARE REGEXP:    .*
MODEL FAMILY:       Western Digital Caviar family
ATTRIBUTE OPTIONS:  None preset; no -v options are required.
dragonfly ~ #

<SNIP>
>>
>> I've not tried the -T permissive options.
>
> I would :)  There is also a BIOS setting for SMART, but I think this does
> not matter here, and it's only for being able to report a failing drive
> before booting.

Tried -T permissive and -T verypermissive. Same result. More lines and
told it's not turning on.

Could this have ANYTHING to do with kernel configuration? Is there
anything required at the kernel level that I might not have turned on?

>
>> I've never used badblocks as it seems I should only do that off-line.
>> This might be a good time to boot with a CD and try it out.
>
> In read-only mode, you can use it when the system is running. Only the
> write test (option -n) refuses to run if partitions are mounted from the
> drive. So I'd do the 'badblocks -sv /dev/hda' right now, if you do not
> need the drive at full speed for a while. You can interrupt it at any
> point with Ctrl-Z and continue with the fg command.
>
OK, I've started that test and will report back later what it says.

Thanks!

- Mark


* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 16:53       ` Mark Knecht
@ 2010-02-26 17:27         ` Alex Schuster
  2010-02-26 17:51           ` Mark Knecht
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Schuster @ 2010-02-26 17:27 UTC
  To: gentoo-user

Mark Knecht writes:

> On Fri, Feb 26, 2010 at 8:01 AM, Alex Schuster <wonko@wonkology.org>
> wrote:

> > Okay, but it still states:
> >>          *    SMART error logging
> >>          *    SMART self-test
> > 
> > So maybe smartctl -t long /dev/hda still works? Just give it a try.
> 
> No, -t long fails the same way. Basically every time I try to use
> smartctl on the drive it seems to issue one of these 3-line reports
> about SectorIDNotFound in dmesg. My other machines don't do this. Not
> a good sign I think...
> 
> hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: task_no_data_intr: error=0x10 { SectorIdNotFound },
> LBAsect=16777008, sector=18446744073709551615
> hda: possibly failed opcode: 0xb0

Uh-oh. Okay, I guess it just won't work then.


> Could this have ANYTHING to do with kernel configuration? Is there
> anything required at the kernel level that I might not have turned on?

I'm pretty sure it has nothing to do with the kernel, but with your drive 
being incapable of the SMART commands.

But I guess using badblocks is not that different in the end. The SMART
self-test runs in the background inside the drive and does not create host
disk I/O, but I think it does not do anything very different from badblocks.

	Wonko




* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 17:27         ` Alex Schuster
@ 2010-02-26 17:51           ` Mark Knecht
  2010-02-26 17:59             ` Volker Armin Hemmann
  0 siblings, 1 reply; 16+ messages in thread
From: Mark Knecht @ 2010-02-26 17:51 UTC
  To: gentoo-user

On Fri, Feb 26, 2010 at 9:27 AM, Alex Schuster <wonko@wonkology.org> wrote:
> Mark Knecht writes:
>
>> On Fri, Feb 26, 2010 at 8:01 AM, Alex Schuster <wonko@wonkology.org>
>> wrote:
>
>> > Okay, but it still states:
>> >>          *    SMART error logging
>> >>          *    SMART self-test
>> >
>> > So maybe smartctl -t long /dev/hda still works? Just give it a try.
>>
>> No, -t long fails the same way. Basically every time I try to use
>> smartctl on the drive it seems to issue one of these 3-line reports
>> about SectorIDNotFound in dmesg. My other machines don't do this. Not
>> a good sign I think...
>>
>> hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
>> hda: task_no_data_intr: error=0x10 { SectorIdNotFound },
>> LBAsect=16777008, sector=18446744073709551615
>> hda: possibly failed opcode: 0xb0
>
> Uh-oh. Okay, I guess it just won't work then.
>
>
>> Could this have ANYTHING to do with kernel configuration? Is there
>> anything required at the kernel level that I might not have turned on?
>
> I'm pretty sure it has nothing to do with the kernel, but with your drive
> being incapable of the SMART commands.
>
> But I guess using badblocks is not that different in the end. The SMART
> selftest runs in the background and does not create disk I/O, but I think
> it does nothing so much different from badblocks.
>
>        Wonko
>
>

The machine _mostly_ crashed while running badblocks. I say mostly
because the mouse is still alive but I can no longer ssh in and cannot
open a terminal on my wife's desktop or get to the console.

I tried to Ctrl-C out of badblocks here (this is running shelled
in) before I figured out it was a total crash, which messed up the
terminal a bit, but you can see what it was reporting before the crash:

dragonfly ~ # badblocks -sv /dev/hda
Checking blocks 0 to 156290903
Checking for bad blocks (read-only test): 89360960done, 35:00 elapsed
89360961done, 35:09 elapsed
89360962
89360963
^C^C18% done, 35:27 elapsed

So, there seem to be problems, possibly with the drive, or maybe it's
some sort of overheating problem on the processor and this was just
the way the processor failed before the crash?
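
The numbers badblocks prints are in its own block units, not 512-byte sectors.
The range "0 to 156290903" is 156290904 blocks, and 156290904 x 1024 bytes is
160,041,885,696 bytes, the capacity smartctl -i reported, so the default block
size of 1024 bytes was in use here. A small sketch for converting a reported
block number to a sector number (the function name is made up):

```shell
# Hypothetical helper: convert a badblocks block number to the first
# 512-byte sector it covers. The block size defaults to 1024 bytes,
# which is the badblocks default; pass a second argument if -b was used.
block_to_sector() {
    block=$1
    block_size=${2:-1024}
    echo $(( block * block_size / 512 ))
}
```

The result can be compared against the start and end sectors shown by
fdisk -lu /dev/hda to see which partition a reported block falls in.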

I ran memtest86 night before last for 8 hours and had no memory
problems. I'll remove memory and PCI cards, reseat everything, and
then see what happens.

- Mark




* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 17:51           ` Mark Knecht
@ 2010-02-26 17:59             ` Volker Armin Hemmann
  2010-02-26 18:19               ` Paul Hartman
  2010-02-26 18:26               ` Mark Knecht
  0 siblings, 2 replies; 16+ messages in thread
From: Volker Armin Hemmann @ 2010-02-26 17:59 UTC
  To: gentoo-user

On Friday, 26 February 2010, Mark Knecht wrote:

> 
> The machine _mostly_ crashed while running badblocks. I say mostly
> because the mouse is still alive but I can no longer ssh in and cannot
> open a terminal on my wife's desktop or get to the console.

because it has not crashed but is waiting for the IDE timeouts.

> 
> I tried to Ctrl-C out of badblocks here (this is running shelled
> in) before I figured out it was a total crash which messed up the
> terminal a bit but you can see what it was reporting before the crash
> 
> dragonfly ~ # badblocks -sv /dev/hda
> Checking blocks 0 to 156290903
> Checking for bad blocks (read-only test): 89360960done, 35:00 elapsed
> 89360961done, 35:09 elapsed
> 89360962
> 89360963
> ^C^C18% done, 35:27 elapsed
> 
> So, there seem to be problems, possibly with the drive, or maybe it's
> some sort of overheating problem on the processor and this was just
> the way the processor failed before the crash?
> 
> I ran memtest86 night before last for 8 hours and had no memory
> problems. I'll remove memory and PCI cards, reseat everything, and
> then see what happens.

protip: if you are running badblocks (or ddrescue) on a probably damaged 
device - attach it with a USB adapter. That way your box is still usable.

/me hates linux kernel for making processes in D unkillable and sucking very 
much on diskio.




* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 17:59             ` Volker Armin Hemmann
@ 2010-02-26 18:19               ` Paul Hartman
  2010-02-26 18:26               ` Mark Knecht
  1 sibling, 0 replies; 16+ messages in thread
From: Paul Hartman @ 2010-02-26 18:19 UTC
  To: gentoo-user

On Fri, Feb 26, 2010 at 11:59 AM, Volker Armin Hemmann
<volkerarmin@googlemail.com> wrote:
> protip: if you are running badblocks (or ddrescue) on a probably damaged
> device - attach it with a USB adapter. That way your box is still usable.

+1, I had a bad drive and it's so much easier to unplug/replug the USB
instead of rebooting, etc.




* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 17:59             ` Volker Armin Hemmann
  2010-02-26 18:19               ` Paul Hartman
@ 2010-02-26 18:26               ` Mark Knecht
  2010-02-26 18:37                 ` Volker Armin Hemmann
  2010-02-26 18:48                 ` Mark Knecht
  1 sibling, 2 replies; 16+ messages in thread
From: Mark Knecht @ 2010-02-26 18:26 UTC
  To: gentoo-user

On Fri, Feb 26, 2010 at 9:59 AM, Volker Armin Hemmann
<volkerarmin@googlemail.com> wrote:
> On Friday, 26 February 2010, Mark Knecht wrote:
>
>>
>> The machine _mostly_ crashed while running badblocks. I say mostly
>> because the mouse is still alive but I can no longer ssh in and cannot
>> open a terminal on my wife's desktop or get to the console.
>
> because it is not crashed but waiting for the ide timeouts.

So if I let it continue running is it going to come back in the next
hour or two? I am assuming the IDE timeouts are because the drive is
having trouble, correct? That's the theory here? If so then unless the
software can mark them bad and somehow create good files out of bad
then I'm still left with a machine that is going to need serious work
done before it's a happy box again, correct?

On the other hand, I have reasonably good user backups (although no
real system backups), so if I bite the bullet and rebuild the machine
right now, then when my wife gets it back it should be more reliable,
shouldn't it?

I'm thinking that maybe I just copy a little stuff off the box - /etc
and the like - and then boot the machine with the Gentoo install CD or
System Rescue CD and see what the drive is doing?

That doesn't cost me anything to look around, but if SMART won't turn
on and badblocks is suggesting the drive is having trouble maybe
running something like badblocks and actually __marking__ blocks as
bad and then reloading Gentoo would work in the long run? (A lot of
work though.)

I'm really not interested in buying a new drive because the machine is
ATA100/133 and if it's not the drive then the money is wasted for a
new machine. The cheapest at NewEgg is about $40. Why spend the bucks
on an old Intel Centrino machine?

>
>>
>> I tried to Ctrl-C out of badblocks here (this is running shelled
>> in) before I figured out it was a total crash which messed up the
>> terminal a bit but you can see what it was reporting before the crash
>>
>> dragonfly ~ # badblocks -sv /dev/hda
>> Checking blocks 0 to 156290903
>> Checking for bad blocks (read-only test): 89360960done, 35:00 elapsed
>> 89360961done, 35:09 elapsed
>> 89360962
>> 89360963
>> ^C^C18% done, 35:27 elapsed
>>
>> So, there seem to be problems, possibly with the drive, or maybe it's
>> some sort of overheating problem on the processor and this was just
>> the way the processor failed before the crash?
>>
>> I ran memtest86 night before last for 8 hours and had no memory
>> problems. I'll remove memory and PCI cards, reseat everything, and
>> then see what happens.
>
> protip: if you are running badblocks (or ddrescue) on a probably damaged
> device - attach it with a USB adapter. That way your box is still usable.
>
> /me hates linux kernel for making processes in D unkillable and sucking very
> much on diskio.
>
>

Good inputs. Thanks!

Cheers,
Mark


* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 18:26               ` Mark Knecht
@ 2010-02-26 18:37                 ` Volker Armin Hemmann
  2010-02-26 18:48                 ` Mark Knecht
  1 sibling, 0 replies; 16+ messages in thread
From: Volker Armin Hemmann @ 2010-02-26 18:37 UTC
  To: gentoo-user

On Friday, 26 February 2010, Mark Knecht wrote:
> On Fri, Feb 26, 2010 at 9:59 AM, Volker Armin Hemmann
> 
> <volkerarmin@googlemail.com> wrote:
> > On Friday, 26 February 2010, Mark Knecht wrote:
> >> The machine _mostly_ crashed while running badblocks. I say mostly
> >> because the mouse is still alive but I can no longer ssh in and cannot
> >> open a terminal on my wife's desktop or get to the console.
> > 
> > because it is not crashed but waiting for the ide timeouts.
> 
> So if I let it continue running is it going to come back in the next
> hour or two? 

yes
> I am assuming the IDE timeouts are because the drive is
> having trouble, correct? That's the theory here? 

yes

> If so then unless the software can mark them bad and somehow create good
> files out of bad
> then I'm still left with a machine that is going to need serious work
> done before it's a happy box again, correct?

and with 'serious work' you mean 'replace the harddisk' ...

> 
> On the other hand, because I have reasonably good user backups
> (although no real system backups) right now if I bite the bullet and
> build the machine then when my wife gets it back it's hopefully going
> to be more reliable, wouldn't it?

yes

> 
> I'm thinking that maybe I just copy a little stuff off the box - /etc
> and the like - and then boot the machine with the Gentoo install CD or
> System Rescue CD and see what the drive is doing?

you could do that.

> 
> That doesn't cost me anything to look around, but if SMART won't turn
> on and badblocks is suggesting the drive is having trouble maybe
> running something like badblocks and actually __marking__ blocks as
> bad and then reloading Gentoo would work in the long run? (A lot of
> work though.)

you would need to save the bad blocks to a file, then feed that file to mkfs. And
you are still not safe, because when a drive starts to have bad blocks, the
chance that more will pop up soon is pretty high. So you might be lucky and
the drive is able to run for a long while (maybe even mapping out bad blocks
while testing them, so always run badblocks twice), but there is at least
as good a chance that the whole thing starts over in a couple of weeks.
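
A minimal sketch of that two-step procedure, assuming an ext3 target. The
partition name and list file in the usage line are placeholders, and the
function only prints the commands instead of running them. The block size
matters because mke2fs -l expects a list generated with the same block size
it will use itself. mke2fs -c is a simpler alternative that runs the scan on
its own.

```shell
# Sketch only: print (rather than run) the commands for scanning a
# partition and creating a filesystem that avoids the blocks found.
# The function name and all arguments are hypothetical placeholders.
plan_mkfs_with_badblocks() {
    part=$1            # partition to scan and format
    list=$2            # file that receives the bad-block list
    bs=${3:-4096}      # block size, identical for badblocks and mke2fs
    echo "badblocks -sv -b $bs -o $list $part"
    echo "mke2fs -j -b $bs -l $list $part"
}
```

Possible usage: plan_mkfs_with_badblocks /dev/hda6 /root/hda6.bad
Review the two printed lines before running them by hand, since mke2fs
destroys whatever is on the partition.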

> 
> I'm really not interested in buying new drive because the machine is
> ATA100/133 and if it's not the drive then the money is wasted for a
> new machine. The cheapest at NewEgg is about $40. Why spend the buck
> for an old Intel Centrino machine?

you could take the drive with you when you buy a new machine. Moving hard disks
is not that hard. Or put it in a USB enclosure when you don't need it
anymore. IDE-to-USB enclosures are cheap.




* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 18:26               ` Mark Knecht
  2010-02-26 18:37                 ` Volker Armin Hemmann
@ 2010-02-26 18:48                 ` Mark Knecht
  1 sibling, 0 replies; 16+ messages in thread
From: Mark Knecht @ 2010-02-26 18:48 UTC
  To: gentoo-user

On Fri, Feb 26, 2010 at 10:26 AM, Mark Knecht <markknecht@gmail.com> wrote:
<SNIP>
>
> On the other hand, because I have reasonably good user backups
> (although no real system backups) right now if I bite the bullet and
> build the machine then when my wife gets it back it's hopefully going
> to be more reliable, wouldn't it?
>
> I'm thinking that maybe I just copy a little stuff off the box - /etc
> and the like - and then boot the machine with the Gentoo install CD or
> System Rescue CD and see what the drive is doing?
>
<SNIP>

As a related idea I dug out an old copy of Spinrite which I'll run on
all the partitions just to see what it says. However if the problem is
currently 1 partition (/var) which is still mostly readable, could I
not just create a new var partition - the drive has space free - and
then copy important stuff from old var to new var, change fstab and
then basically just go on from there?
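
The copy step of that idea might look like the sketch below. It assumes the
old /var and the freshly created partition are both mounted somewhere under
/mnt from a rescue CD; the mount points in the usage line and the function
name are made up. Files sitting on unreadable blocks would still fail to
copy, so cp may report errors.

```shell
# Hypothetical helper: copy the contents of one directory tree into
# another, preserving ownership, permissions, timestamps and symlinks.
# The trailing "/." copies the contents of src, including dotfiles,
# rather than creating a nested src directory inside dst.
copy_tree() {
    src=$1
    dst=$2
    mkdir -p "$dst" && cp -a "$src"/. "$dst"/
}
```

Possible usage: copy_tree /mnt/oldvar /mnt/newvar
After that, the /var line in /etc/fstab would point at the new partition.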

Cheers,
Mark




* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26  3:33 [gentoo-user] recovery from /var corruption? Mark Knecht
  2010-02-26  9:09 ` Neil Bothwick
  2010-02-26  9:46 ` Alex Schuster
@ 2010-02-26 11:47 ` daid kahl
  2010-02-26 17:38 ` daid kahl
  3 siblings, 0 replies; 16+ messages in thread
From: daid kahl @ 2010-02-26 11:47 UTC
  To: gentoo-user

On 26 February 2010 12:33, Mark Knecht <markknecht@gmail.com> wrote:
> So I got my wife's machine booted today using a install disk and
> played a bit with e2fsck. The machine stopped being happy last night
> due to some sort of corruption on the /var partition. e2fsck
> complained about 3 or 4 files and then repaired the partition. The
> machine booted cleanly as far as I can tell.

Hey buddy!

This happened to me, too!  See below for my savage ranting for a good laugh.

My rule for this is to rsnapshot my present system as it is, grab a
disk image backup (taken less frequently), and then go to town with
Portage.

I emerged 620 packages today.  (Much more in fact if I count
rebuilding and stuff.)  Only OO.o update is remaining in world.

I don't think there's a good and safe way around it.  I find inode
corruption can be sneaky and hit other stuff.  Assuming your backups
all exist, you can use something like rsync with the update flag to
sync your personal files between the newest and safest backups.
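
To show what the update flag buys you, here is a minimal sketch in
Python of the same idea: copy a file only when the destination copy is
missing or older.  It is a rough analogue of rsync's --update
behaviour, not a replacement for it, and the function name
copy_if_newer is made up for illustration.

```python
import os
import shutil


def copy_if_newer(src_root, dst_root):
    """Copy files from src_root into dst_root unless the destination
    copy exists and is at least as new.  Returns the relative paths
    that were copied, sorted."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            if (os.path.exists(dst)
                    and os.path.getmtime(dst) >= os.path.getmtime(src)):
                continue  # destination is as new or newer: keep it
            shutil.copy2(src, dst)  # copy2 preserves the modification time
            copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)
```

The point of comparing timestamps is that a file you edited after the
older backup was taken is not overwritten by the stale copy.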

Rant:
Okay, so Mac OS is getting it to the face now, officially, and forever
in my world.  I've almost kind of said this before, and I can't
remember why I don't follow my own advice, but nothing can be worse
than twice-monthly 10% inode corruption.

Now check this out:
The e2fs program is told "do not mount sda3" and "if you ever do,
mount it ro."  Even though Mac OS is crazy enough not to use
/etc/fstab, it will still (supposedly) listen to rules in there.  I
found some very convoluted way of effectively referencing sda3 by
serial device, and I said, "do not mount this drive at boot, and if
you do, do it ro."  Then I went into a Disk Utility thing.  I told
that the same thing.  So that's three times I've said, "Never touch
this drive with a 10 foot pole, plz thx!"  Yeah, please explain to me
how an unmounted, ro-only drive can end up with 11.4% inode
corruption.

Others, please take this as a lesson (in some form or another).  I
think it's the badly coded e2fs program, but that thing is so bad that
if it is to blame, it happened after I tried to uninstall the program
too, so who knows.  So I'm going to put a tiny Tiger install on this
weekend so I can get a nice boot and a few firmware accesses (kill the
silly booting sound, and avoid an annoying 20 second boot delay in the
case there is no EFI partition...ugh).  And then I am going to never
look at its ugly face again.

System Rescue CD, partimage, and rsnapshot are my friends!

(I had so many packages because over the holidays I didn't do sync and
world updates, and then I decided to go back to the wonderful ~x86,
but since I was super busy and I don't like backing up a system that's
untested, I didn't have good backups of the updates.  Maybe a
poor choice, but in any case, that was not the reason I was trying to
kick myself in the face.)

Be bloody lucky,
or don't use broken softwarez---
daid

>
> So, something went bad and I managed to sneak around it for a while
> and now I'm sort of living with the machine wondering what to do.
>
> Do I just watch the logs looking for problems? I have no way of
> knowing right now whether this was a disk problem that's going to come
> back, a 1 time deal due to power, or something else entirely.
>
> As these cheap machines that don't use RAID what's the right way to
> go? emerge -e @world and then wait for the next event? Do nothing and
> wait?
>
> We've got decent personal data backups as well as basic /etc data.
>
> Thanks,
> Mark
>
>


* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26  3:33 [gentoo-user] recovery from /var corruption? Mark Knecht
                   ` (2 preceding siblings ...)
  2010-02-26 11:47 ` daid kahl
@ 2010-02-26 17:38 ` daid kahl
  2010-02-26 18:57   ` Mark Knecht
  3 siblings, 1 reply; 16+ messages in thread
From: daid kahl @ 2010-02-26 17:38 UTC
  To: gentoo-user

On 26 February 2010 12:33, Mark Knecht <markknecht@gmail.com> wrote:
> So I got my wife's machine booted today using a install disk and
> played a bit with e2fsck. The machine stopped being happy last night
> due to some sort of corruption on the /var partition. e2fsck
> complained about 3 or 4 files and then repaired the partition. The
> machine booted cleanly as far as I can tell.
>
> So, something went bad and I managed to sneak around it for a while
> and now I'm sort of living with the machine wondering what to do.
>
> Do I just watch the logs looking for problems? I have no way of
> knowing right now whether this was a disk problem that's going to come
> back, a 1 time deal due to power, or something else entirely.
>
> As these cheap machines that don't use RAID what's the right way to
> go? emerge -e @world and then wait for the next event? Do nothing and
> wait?
>
> We've got decent personal data backups as well as basic /etc data.
>
> Thanks,
> Mark
>

I reconsidered your problem, and I actually wonder if emerging world
is a valid notion in this case, as the world file is under /var and
this is reported as corrupt.

In this sense, it may be entirely non-trivial to regenerate (without
backup) the correct world-file for a system.

Am I out in the deep end, or is this, in fact, the critical point that
needs consideration here?

~daid




* Re: [gentoo-user] recovery from /var corruption?
  2010-02-26 17:38 ` daid kahl
@ 2010-02-26 18:57   ` Mark Knecht
  0 siblings, 0 replies; 16+ messages in thread
From: Mark Knecht @ 2010-02-26 18:57 UTC
  To: gentoo-user

On Fri, Feb 26, 2010 at 9:38 AM, daid kahl <daidxor@gmail.com> wrote:
> On 26 February 2010 12:33, Mark Knecht <markknecht@gmail.com> wrote:
>> So I got my wife's machine booted today using a install disk and
>> played a bit with e2fsck. The machine stopped being happy last night
>> due to some sort of corruption on the /var partition. e2fsck
>> complained about 3 or 4 files and then repaired the partition. The
>> machine booted cleanly as far as I can tell.
>>
>> So, something went bad and I managed to sneak around it for a while
>> and now I'm sort of living with the machine wondering what to do.
>>
>> Do I just watch the logs looking for problems? I have no way of
>> knowing right now whether this was a disk problem that's going to come
>> back, a 1 time deal due to power, or something else entirely.
>>
>> As these cheap machines that don't use RAID what's the right way to
>> go? emerge -e @world and then wait for the next event? Do nothing and
>> wait?
>>
>> We've got decent personal data backups as well as basic /etc data.
>>
>> Thanks,
>> Mark
>>
>
> I reconsidered your problem, and I actually wonder if emerging world
> is a valid notion in this case, as the world file is under /var and
> this is reported as corrupt.
>
> In this sense, it may be entirely non-trivial to regenerate (without
> backup) the correct world-file for a system.
>
> Am I out in the deep end, or is this, in fact, the critical point that
> needs consideration here?
>
> ~daid

Hi daid,
   In general you are correct. If I didn't have a copy of the world
file then it would be a bit hit and miss. In this case I do have it
saved elsewhere so it's actually quite easy.
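
   A minimal sketch in Python of how the saved copy could be checked
against whatever survived on /var, assuming the world file is a plain
list with one package atom per line.  Portage keeps it at
/var/lib/portage/world; the location of the saved copy and the
function names are made up for illustration.

```python
def read_world(path):
    """Return the set of non-empty lines (package atoms) in a world file."""
    # errors="replace" keeps the read going if corruption left bad bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_world(saved_path, current_path):
    """Return (missing, unexpected): atoms only in the saved copy, and
    atoms only in the current file."""
    saved = read_world(saved_path)
    current = read_world(current_path)
    return sorted(saved - current), sorted(current - saved)


if __name__ == "__main__":
    # "/root/world.saved" is a hypothetical backup location.
    missing, unexpected = compare_world("/root/world.saved",
                                        "/var/lib/portage/world")
    print("missing from /var copy:", missing)
    print("not in saved copy:", unexpected)
```

   If both lists come back empty, the file on /var matches the saved
one and nothing needs restoring.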

   This failure is more (it seems) a few bad blocks on one partition
and not a total drive failure.

   I'm leaning toward a new /var partition and just ignoring the
partition that has problems. It will sit on the disk but it's only
10GB out of 160GB so it's not the end of the world by any means.

   Thanks!

- Mark




Thread overview: 16+ messages
2010-02-26  3:33 [gentoo-user] recovery from /var corruption? Mark Knecht
2010-02-26  9:09 ` Neil Bothwick
2010-02-26  9:46 ` Alex Schuster
2010-02-26 15:17   ` Mark Knecht
2010-02-26 16:01     ` Alex Schuster
2010-02-26 16:53       ` Mark Knecht
2010-02-26 17:27         ` Alex Schuster
2010-02-26 17:51           ` Mark Knecht
2010-02-26 17:59             ` Volker Armin Hemmann
2010-02-26 18:19               ` Paul Hartman
2010-02-26 18:26               ` Mark Knecht
2010-02-26 18:37                 ` Volker Armin Hemmann
2010-02-26 18:48                 ` Mark Knecht
2010-02-26 11:47 ` daid kahl
2010-02-26 17:38 ` daid kahl
2010-02-26 18:57   ` Mark Knecht
