* [gentoo-user] [HELP] Intermittent software RAID failures
@ 2010-03-18 21:45 Carlos Hendson
2010-03-18 21:58 ` Mark Knecht
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Carlos Hendson @ 2010-03-18 21:45 UTC
To: gentoo-user
[-- Attachment #1: Type: text/plain, Size: 729 bytes --]
Hello,
I've got a Dell Inspiron 1720 laptop with dual 2.5" hard drives set up
as software RAID1. I've had this computer for about a year and a half
and all had been working well.
I've experienced intermittent software RAID errors like those found in
the "softraid-fail.txt" attachment.
Initially I suspected a kernel bug, because the problem started around
the time I upgraded the kernel (around 2.6.30), but subsequent kernel
upgrades haven't improved the situation.
I've run smartctl --all and badblocks on both disks, but nothing is
reported as faulty.
I don't understand what is causing the RAID to report these faults and
would like some ideas on how to diagnose the problem further.
Thanks in advance,
Carlos
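A PASSED overall health result from smartctl --all does not rule out trouble on the SATA link itself. Of the standard SMART attributes, 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector) track media problems, while 199 (UDMA_CRC_Error_Count) counts CRC errors on transfers between drive and controller, which usually points at cabling or connectors rather than the platters. The following is a minimal Python sketch, assuming the usual ten-column attribute table printed by smartctl -A. Not every drive reports all three attributes, and the sample table is a made-up example of the format, not output from these disks.

```python
# Minimal sketch: pull a few SMART attributes out of `smartctl -A` text.
# Assumes the usual ten-column table (ID#, ATTRIBUTE_NAME, FLAG, VALUE,
# WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE).

# 5 and 197 track media problems; 199 counts interface (link) CRC errors.
WATCHED = {5: "media", 197: "media", 199: "link"}

# Made-up example of the table format, NOT output from the disks in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       7
"""


def parse_attributes(text):
    """Map attribute ID -> (name, raw value string) for each table row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Data rows start with a numeric ID and have all ten columns;
        # the "ID# ..." header and any other lines are skipped.
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9].strip())
    return attrs


def nonzero_watched(attrs):
    """Return (id, name, kind, raw) for watched attributes with a nonzero raw count."""
    hits = []
    for attr_id, kind in sorted(WATCHED.items()):
        if attr_id in attrs:
            name, raw = attrs[attr_id]
            # Assumes the raw value starts with a plain integer, which is the
            # usual case for these three attributes; int() raises otherwise.
            if int(raw.split()[0]) != 0:
                hits.append((attr_id, name, kind, raw))
    return hits


if __name__ == "__main__":
    for hit in nonzero_watched(parse_attributes(SAMPLE)):
        print(hit)
```

On a live system the text would come from running smartctl -A against each disk (as root). A 199 count that keeps rising on either disk would point more toward the link than the media.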
[-- Attachment #2: softraid-fail.txt --]
[-- Type: text/plain, Size: 6799 bytes --]
Feb 28 15:14:16 pheonix kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Feb 28 15:14:16 pheonix kernel: ata3.00: irq_stat 0x00400000, PHY RDY changed
Feb 28 15:14:16 pheonix kernel: ata3: SError: { PHYRdyChg }
Feb 28 15:14:16 pheonix kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Feb 28 15:14:16 pheonix kernel: res 40/00:0c:97:74:25/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
Feb 28 15:14:16 pheonix kernel: ata3.00: status: { DRDY }
Feb 28 15:14:16 pheonix kernel: ata3: hard resetting link
Feb 28 15:14:19 pheonix kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Feb 28 15:14:19 pheonix kernel: ata3.00: configured for UDMA/133
Feb 28 15:14:19 pheonix kernel: ata3: EH complete
Feb 28 15:14:19 pheonix kernel: end_request: I/O error, dev sdb, sector 178062452
Feb 28 15:14:19 pheonix kernel: raid1: Disk failure on sdb1, disabling device.
Feb 28 15:14:19 pheonix kernel: raid1: Operation continuing on 1 devices.
Feb 28 15:14:19 pheonix kernel: md: recovery of RAID array md0
Feb 28 15:14:19 pheonix kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Feb 28 15:14:19 pheonix kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Feb 28 15:14:19 pheonix kernel: md: using 128k window, over a total of 178024192 blocks.
Feb 28 15:14:19 pheonix kernel: md: resuming recovery of md0 from checkpoint.
Feb 28 15:14:19 pheonix kernel: md: md0: recovery done.
Feb 28 15:14:19 pheonix kernel: RAID1 conf printout:
Feb 28 15:14:19 pheonix kernel: --- wd:1 rd:2
Feb 28 15:14:19 pheonix kernel: disk 0, wo:0, o:1, dev:sda8
Feb 28 15:14:19 pheonix kernel: disk 1, wo:1, o:0, dev:sdb1
Feb 28 15:14:19 pheonix kernel: md: recovery of RAID array md0
Feb 28 15:14:19 pheonix kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Feb 28 15:14:19 pheonix kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Feb 28 15:14:19 pheonix kernel: md: using 128k window, over a total of 178024192 blocks.
Feb 28 15:14:19 pheonix kernel: md: resuming recovery of md0 from checkpoint.
Feb 28 15:14:19 pheonix kernel: md: md0: recovery done.
Feb 28 15:14:20 pheonix kernel: RAID1 conf printout:
Feb 28 15:14:20 pheonix kernel: --- wd:1 rd:2
Feb 28 15:14:20 pheonix kernel: disk 0, wo:0, o:1, dev:sda8
Feb 28 15:14:20 pheonix kernel: disk 1, wo:1, o:0, dev:sdb1
Feb 28 15:14:20 pheonix kernel: RAID1 conf printout:
Feb 28 15:14:20 pheonix kernel: --- wd:1 rd:2
Feb 28 15:14:20 pheonix kernel: disk 0, wo:0, o:1, dev:sda8
Feb 28 15:14:20 pheonix kernel: disk 1, wo:1, o:0, dev:sdb1
Feb 28 15:14:20 pheonix kernel: RAID1 conf printout:
Feb 28 15:14:20 pheonix kernel: --- wd:1 rd:2
Feb 28 15:14:20 pheonix kernel: disk 0, wo:0, o:1, dev:sda8
Mar 12 19:38:06 pheonix kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Mar 12 19:38:06 pheonix kernel: ata1.00: irq_stat 0x00400000, PHY RDY changed
Mar 12 19:38:06 pheonix kernel: ata1: SError: { PHYRdyChg }
Mar 12 19:38:06 pheonix kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 12 19:38:06 pheonix kernel: res 40/00:24:b6:fa:df/00:00:17:00:00/40 Emask 0x10 (ATA bus error)
Mar 12 19:38:06 pheonix kernel: ata1.00: status: { DRDY }
Mar 12 19:38:06 pheonix kernel: ata1: hard resetting link
Mar 12 19:38:09 pheonix kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 12 19:38:09 pheonix kernel: ata1.00: configured for UDMA/133
Mar 12 19:38:09 pheonix kernel: ata1: EH complete
Mar 12 19:38:09 pheonix kernel: end_request: I/O error, dev sda, sector 305244964
Mar 12 19:38:09 pheonix kernel: raid1: Disk failure on sda8, disabling device.
Mar 12 19:38:09 pheonix kernel: raid1: Operation continuing on 1 devices.
Mar 12 19:38:09 pheonix kernel: md: recovery of RAID array md0
Mar 12 19:38:09 pheonix kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Mar 12 19:38:09 pheonix kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Mar 12 19:38:09 pheonix kernel: md: using 128k window, over a total of 178024192 blocks.
Mar 12 19:38:09 pheonix kernel: md: resuming recovery of md0 from checkpoint.
Mar 12 19:38:09 pheonix kernel: md: md0: recovery done.
Mar 12 19:38:09 pheonix kernel: RAID1 conf printout:
Mar 12 19:38:09 pheonix kernel: --- wd:1 rd:2
Mar 12 19:38:09 pheonix kernel: disk 0, wo:1, o:0, dev:sda8
Mar 12 19:38:09 pheonix kernel: disk 1, wo:0, o:1, dev:sdb1
Mar 12 19:38:09 pheonix kernel: md: recovery of RAID array md0
Mar 12 19:38:09 pheonix kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Mar 12 19:38:09 pheonix kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Mar 12 19:38:09 pheonix kernel: md: using 128k window, over a total of 178024192 blocks.
Mar 12 19:38:09 pheonix kernel: md: resuming recovery of md0 from checkpoint.
Mar 12 19:38:09 pheonix kernel: md: md0: recovery done.
Mar 12 19:38:09 pheonix kernel: RAID1 conf printout:
Mar 12 19:38:09 pheonix kernel: --- wd:1 rd:2
Mar 12 19:38:09 pheonix kernel: disk 0, wo:1, o:0, dev:sda8
Mar 12 19:38:09 pheonix kernel: disk 1, wo:0, o:1, dev:sdb1
Mar 12 19:38:09 pheonix kernel: md: recovery of RAID array md0
Mar 12 19:38:09 pheonix kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Mar 12 19:38:09 pheonix kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Mar 12 19:38:09 pheonix kernel: md: using 128k window, over a total of 178024192 blocks.
Mar 12 19:38:09 pheonix kernel: md: resuming recovery of md0 from checkpoint.
Mar 12 19:38:09 pheonix kernel: md: md0: recovery done.
Mar 12 19:38:10 pheonix kernel: RAID1 conf printout:
Mar 12 19:38:10 pheonix kernel: --- wd:1 rd:2
Mar 12 19:38:10 pheonix kernel: disk 0, wo:1, o:0, dev:sda8
Mar 12 19:38:10 pheonix kernel: disk 1, wo:0, o:1, dev:sdb1
Mar 12 19:38:10 pheonix kernel: md: recovery of RAID array md0
Mar 18 21:57:33 pheonix kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Mar 18 21:57:33 pheonix kernel: ata3.00: irq_stat 0x00400000, PHY RDY changed
Mar 18 21:57:33 pheonix kernel: ata3: SError: { PHYRdyChg }
Mar 18 21:57:33 pheonix kernel: ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Mar 18 21:57:33 pheonix kernel: res 40/00:24:bf:1c:1f/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
Mar 18 21:57:33 pheonix kernel: ata3.00: status: { DRDY }
Mar 18 21:57:33 pheonix kernel: ata3: hard resetting link
Mar 18 21:57:37 pheonix kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Mar 18 21:57:37 pheonix kernel: ata3.00: configured for UDMA/133
Mar 18 21:57:37 pheonix kernel: ata3: EH complete
Mar 18 21:57:37 pheonix kernel: end_request: I/O error, dev sdb, sector 178116972
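Every exception in the attachment carries SErr 0x10000, which the kernel spells out on the following line as SError: { PHYRdyChg }: the SATA PHY reported a change in its link-ready state, the link was hard reset, and the command in flight (ea, which the ATA command set defines as FLUSH CACHE EXT) was failed as an ATA bus error (Emask 0x10) rather than as an error reported by the drive. The sketch below decodes an SError value bit by bit. The bit-to-name table is assumed to mirror the names libata prints, so check it against the kernel source for your version.

```python
# SATA SError bits and the short names libata prints in "SError: { ... }".
# This table is assumed to mirror libata; check it against your kernel source.
SERROR_BITS = {
    0: "RecovData",    # recovered data integrity error
    1: "RecovComm",    # recovered communications error
    8: "UnrecovData",  # non-recovered data integrity error
    9: "Persist",      # persistent communication or data integrity error
    10: "Proto",       # protocol error
    11: "HostInt",     # internal error
    16: "PHYRdyChg",   # PhyRdy changed state (link dropped or came up)
    17: "PHYInt",      # PHY internal error
    18: "CommWake",    # COMWAKE detected
    19: "10B8B",       # 10b-to-8b decode error
    20: "Dispar",      # disparity error
    21: "BadCRC",      # CRC error
    22: "Handshk",     # handshake error
    23: "LinkSeq",     # link sequence error
    24: "TrStaTrns",   # transport state transition error
    25: "UnrecFIS",    # unrecognized FIS type
    26: "DevExch",     # device exchanged
}


def decode_serror(value):
    """Return the names of all SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # Every exception in softraid-fail.txt reports SErr 0x10000.
    print(decode_serror(0x10000))  # ['PHYRdyChg']
```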
* Re: [gentoo-user] [HELP] Intermittent software RAID failures
From: Mark Knecht @ 2010-03-18 21:58 UTC
To: gentoo-user
On Thu, Mar 18, 2010 at 2:45 PM, Carlos Hendson <skyclan@gmx.net> wrote:
> Initially I suspected a kernel bug because it started around the same
> time I'd upgraded the kernel (around the 2.6.30 upgrade) but subsequent
> kernel upgrades haven't improved the situation.
Kernel upgrades might not tell you much. Kernel downgrades might.
- Mark
* Re: [gentoo-user] [HELP] Intermittent software RAID failures
From: Keith Dart @ 2010-03-18 22:45 UTC
To: gentoo-user; +Cc: skyclan
=== On Thu, 03/18, Carlos Hendson wrote: ===
> I've experienced intermittent software RAID errors like those found in
> the "softraid-fail.txt" attachment.
===
That's most likely your disk starting to fail.
-- Keith Dart
--
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Keith Dart <keith@dartworks.biz>
public key: ID: 19017044
<http://www.dartworks.biz/>
=====================================================================
* Re: [gentoo-user] [HELP] Intermittent software RAID failures
From: Carlos @ 2010-03-19 8:11 UTC
To: gentoo-user
Keith Dart wrote:
> === On Thu, 03/18, Carlos Hendson wrote: ===
>> I've experienced intermittent software RAID errors like those found in
>> the "softraid-fail.txt" attachment.
>
> ===
>
> That's most likely your disk starting to fail.
>
How would I go about categorically proving such a thing? What are the
right tools for the job? I found it strange that both /dev/sda8 and
/dev/sdb1 have reported similar problems.
No disk errors are reported for the non-RAID partitions that reside on
the same physical disks, which is why I'm not 100% convinced it's a
disk failure.
Regards,
Carlos
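The observation that both sda8 and sdb1 have been dropped is easier to check with the incidents laid out side by side: which ATA port was reset, which block device returned the I/O error, and which md member was disabled. A minimal Python sketch, assuming the exact message wording in softraid-fail.txt (other kernel versions may phrase these lines differently); the sample lines are copied from the attachment.

```python
import re
from collections import Counter

# Regexes for the three kernel messages that make up one incident in
# softraid-fail.txt. They assume exactly the wording seen in that log.
EXCEPTION = re.compile(
    r"^(\w{3} +\d+ [\d:]+) \S+ kernel: (ata\d+)\.\d+: exception .*SErr (0x[0-9a-f]+)")
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
MD_FAIL = re.compile(r"raid1: Disk failure on (\w+), disabling device")


def summarize(lines):
    """Group log lines into incidents, one per ATA exception."""
    incidents = []
    for line in lines:
        if m := EXCEPTION.search(line):
            incidents.append({"time": m.group(1), "port": m.group(2),
                              "serr": m.group(3), "dev": None,
                              "sector": None, "member": None})
        elif incidents and (m := IO_ERROR.search(line)):
            incidents[-1]["dev"] = m.group(1)
            incidents[-1]["sector"] = int(m.group(2))
        elif incidents and (m := MD_FAIL.search(line)):
            incidents[-1]["member"] = m.group(1)
    return incidents


# Lines copied from softraid-fail.txt (other lines omitted).
SAMPLE = [
    "Feb 28 15:14:16 pheonix kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen",
    "Feb 28 15:14:16 pheonix kernel: ata3: hard resetting link",
    "Feb 28 15:14:19 pheonix kernel: end_request: I/O error, dev sdb, sector 178062452",
    "Feb 28 15:14:19 pheonix kernel: raid1: Disk failure on sdb1, disabling device.",
    "Mar 12 19:38:06 pheonix kernel: ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen",
    "Mar 12 19:38:09 pheonix kernel: end_request: I/O error, dev sda, sector 305244964",
    "Mar 12 19:38:09 pheonix kernel: raid1: Disk failure on sda8, disabling device.",
    "Mar 18 21:57:33 pheonix kernel: ata3.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen",
    "Mar 18 21:57:37 pheonix kernel: end_request: I/O error, dev sdb, sector 178116972",
]

if __name__ == "__main__":
    incidents = summarize(SAMPLE)
    for inc in incidents:
        print(inc)
    # Which port/device pairs were hit, and how often.
    print(Counter((inc["port"], inc["dev"]) for inc in incidents))
```

On these lines it reports ata3/sdb (sdb1 disabled), ata1/sda (sda8 disabled) and ata3/sdb again (the excerpt ends before md reports on that one), each opening with SErr 0x10000. Link resets on both ports fit the observation that both drives are affected, but by themselves they do not separate a drive fault from a controller, cable or connector fault.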
* Re: [gentoo-user] [HELP] Intermittent software RAID failures
From: Paul Hartman @ 2010-03-19 14:33 UTC
To: gentoo-user
On Thu, Mar 18, 2010 at 4:45 PM, Carlos Hendson <skyclan@gmx.net> wrote:
> I've run smartctl --all and badblocks on both disks, but nothing is
> reported as faulty.
>
> I don't understand what is causing RAID to report these faults and would
> like some ideas as to how I can further diagnose the problem.
I remember reading something recently (within the last year?) about
smartmontools causing disks to be taken offline unnecessarily in some
situations, I think due to a bug in smartmontools. I don't know whether
it's specific to a certain version of smartmontools or to a combination
of that and other things. Maybe you could try upgrading or downgrading
smartmontools, or temporarily disabling smartd, to see whether that
stops the disks from being taken offline.
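One way to check this suggestion before disabling anything is to look for smartd activity in syslog just before each hard reset of a link. The sketch below assumes smartd logs to the same syslog file with the usual smartd[PID]: tag, that each line starts with the classic 15-character "Mon DD HH:MM:SS" timestamp, and that all lines come from one year (2010 is assumed, since syslog omits it). The 60-second window is an arbitrary choice, and the smartd line in the sample is hypothetical.

```python
from datetime import datetime, timedelta


def stamp(line, year=2010):
    """Parse the 15-character syslog timestamp; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def smartd_before_resets(lines, window=timedelta(seconds=60)):
    """Pair each link reset with the smartd lines logged up to `window` before it."""
    lines = list(lines)
    smartd = [(stamp(l), l) for l in lines if " smartd[" in l]
    pairs = []
    for line in lines:
        if "hard resetting link" in line:
            t = stamp(line)
            near = [l for ts, l in smartd if timedelta(0) <= t - ts <= window]
            pairs.append((line, near))
    return pairs


# Two reset lines from softraid-fail.txt plus one HYPOTHETICAL smartd line.
SAMPLE = [
    "Feb 28 15:14:10 pheonix smartd[1234]: hypothetical smartd message",
    "Feb 28 15:14:16 pheonix kernel: ata3: hard resetting link",
    "Mar 12 19:38:06 pheonix kernel: ata1: hard resetting link",
]

if __name__ == "__main__":
    for reset, near in smartd_before_resets(SAMPLE):
        print(reset, "->", near or "no smartd activity in window")
```

If resets keep occurring with no smartd lines nearby, that would weaken the smartd theory; disabling smartd for a while, as suggested, remains the more direct test.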
* Re: [gentoo-user] [HELP] Intermittent software RAID failures
From: Volker Armin Hemmann @ 2010-03-19 14:37 UTC
To: gentoo-user
On Friday 19 March 2010, Carlos wrote:
> Keith Dart wrote:
> > === On Thu, 03/18, Carlos Hendson wrote: ===
> >
> >> I've experienced intermittent software RAID errors like those found in
> >> the "softraid-fail.txt" attachment.
> >
> > ===
> >
> > That's most likely your disk starting to fail.
>
> How would I go about categorically proving such a thing? What are the
> right tools for the job? I found it strange that both /dev/sda8 and
> /dev/sdb1 have reported similar problems.
>
> No disk errors are reported when using non-RAID partitions which reside
> on the same physical disk. This is why I'm not 100% convinced it's a
> disk failure.
>
> Regards,
> Carlos
Well, the error is located on the RAID partition...