public inbox for gentoo-user@lists.gentoo.org
* [gentoo-user] SSD or MoBo playing up?
From: Michael @ 2020-06-18 22:47 UTC
  To: gentoo-user

It started thus:

Jun 18 17:52:45 asus kernel: ata1.00: exception Emask 0x0 SAct 0x2 SErr 0x0 action 0x6 frozen
Jun 18 17:52:45 asus kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jun 18 17:52:45 asus kernel: ata1.00: cmd 61/20:08:60:f1:90/00:00:02:00:00/40 tag 1 ncq dma 16384 out\x0a         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 18 17:52:45 asus kernel: ata1.00: status: { DRDY }
Jun 18 17:52:45 asus kernel: ata1: hard resetting link
Jun 18 17:52:55 asus kernel: ata1: softreset failed (1st FIS failed)
Jun 18 17:52:55 asus kernel: ata1: hard resetting link
Jun 18 17:53:05 asus kernel: ata1: softreset failed (1st FIS failed)
Jun 18 17:53:05 asus kernel: ata1: hard resetting link
Jun 18 17:53:40 asus kernel: ata1: softreset failed (1st FIS failed)
Jun 18 17:53:40 asus kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jun 18 17:53:40 asus kernel: ata1: hard resetting link
Jun 18 17:53:45 asus kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Jun 18 17:53:45 asus kernel: ata1.00: link online but device misclassified
Jun 18 17:53:51 asus kernel: ata1.00: qc timeout (cmd 0xec)
Jun 18 17:53:51 asus kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 18 17:53:51 asus kernel: ata1.00: revalidation failed (errno=-5)
Jun 18 17:53:51 asus kernel: ata1: hard resetting link
Jun 18 17:54:01 asus kernel: ata1: softreset failed (1st FIS failed)
Jun 18 17:54:01 asus kernel: ata1: hard resetting link
Jun 18 17:54:11 asus kernel: ata1: softreset failed (1st FIS failed)
Jun 18 17:54:11 asus kernel: ata1: hard resetting link
Jun 18 17:54:46 asus kernel: ata1: softreset failed (1st FIS failed)
Jun 18 17:54:46 asus kernel: ata1: limiting SATA link speed to 1.5 Gbps
Jun 18 17:54:46 asus kernel: ata1: hard resetting link
Jun 18 17:54:51 asus kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jun 18 17:54:51 asus kernel: ata1.00: link online but device misclassified
Jun 18 17:55:01 asus kernel: ata1.00: qc timeout (cmd 0xec)
Jun 18 17:55:01 asus kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun 18 17:55:01 asus kernel: ata1.00: revalidation failed (errno=-5)
Jun 18 17:55:01 asus kernel: ata1: hard resetting link
Jun 18 17:55:11 asus kernel: ata1: softreset failed (1st FIS failed)
[snip ...] 

Jun 18 17:57:32 asus kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jun 18 17:57:32 asus kernel: ata1.00: link online but device misclassified
Jun 18 17:57:32 asus kernel: ata1: EH complete
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#15 CDB: Write(10) 2a 00 03 b5 5d 18 00 00 08 00
Jun 18 17:57:32 asus kernel: blk_update_request: I/O error, dev sda, sector 62217496 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
Jun 18 17:57:32 asus kernel: BTRFS error (device sda3): bdev /dev/sda3 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#16 CDB: Write(10) 2a 00 08 7b 76 68 00 00 08 00
Jun 18 17:57:32 asus kernel: blk_update_request: I/O error, dev sda, sector 142308968 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Jun 18 17:57:32 asus kernel: BTRFS error (device sda3): bdev /dev/sda3 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#17 CDB: Write(10) 2a 00 02 90 f1 60 00 00 20 00
Jun 18 17:57:32 asus kernel: blk_update_request: I/O error, dev sda, sector 43053408 op 0x1:(WRITE) flags 0x1800 phys_seg 4 prio class 0
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#24 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 00 e5 00
Jun 18 17:57:32 asus kernel: BTRFS error (device sda3): bdev /dev/sda3 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Jun 18 17:57:32 asus kernel: BTRFS: error (device sda3) in btrfs_commit_transaction:2280: errno=-5 IO failure (Error while writing out transaction)
Jun 18 17:57:32 asus kernel: BTRFS info (device sda3): forced readonly
Jun 18 17:57:32 asus kernel: BTRFS warning (device sda3): Skipping commit of aborted transaction.
Jun 18 17:57:32 asus kernel: BTRFS: error (device sda3) in cleanup_transaction:1832: errno=-5 IO failure
Jun 18 17:57:32 asus kernel: BTRFS info (device sda3): delayed_refs has NO entry
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#18 CDB: Read(10) 28 00 01 2b 3b c0 00 00 20 00
Jun 18 17:57:32 asus kernel: blk_update_request: I/O error, dev sda, sector 19610560 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
Jun 18 17:57:32 asus kernel: BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Jun 18 17:57:32 asus kernel: BTRFS info (device sda2): no csum found for inode 14036317 start 0
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#4 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 18 17:57:32 asus kernel: sd 0:0:0:0: [sda] tag#4 CDB: Read(10) 28 00 01 2b 3b c0 00 00 20 00
Jun 18 17:57:32 asus kernel: blk_update_request: I/O error, dev sda, sector 19610560 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 0
[snip ...]
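
The Write(10) and Read(10) CDB lines in the log above pair up with the blk_update_request sector numbers that follow them. As a minimal sketch (the function name decode_rw10 is made up for illustration and is not part of any tool mentioned in this thread), the 10-byte CDB can be decoded as: byte 0 is the opcode, bytes 2-5 are the big-endian LBA, and bytes 7-8 are the transfer length in blocks.

```python
def decode_rw10(cdb_hex):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    Hypothetical helper for illustration only.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("expected a 10-byte READ(10) or WRITE(10) CDB")
    return {
        "op": "WRITE" if cdb[0] == 0x2A else "READ",
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5: logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8: transfer length
    }


if __name__ == "__main__":
    # CDBs copied from the tag#15, tag#16 and tag#18 lines in the log.
    for cdb in ("2a 00 03 b5 5d 18 00 00 08 00",
                "2a 00 08 7b 76 68 00 00 08 00",
                "28 00 01 2b 3b c0 00 00 20 00"):
        print(cdb, "->", decode_rw10(cdb))
```

The decoded LBAs agree with the sectors the block layer reports for the same tags (62217496, 142308968 and 19610560), so these failures are scattered across the disk rather than clustered in one region.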

I shut it down, rebooted with LiveUSB, btrfsck'ed it and ... no errors.  :-/

I fstrim'med it, fsck'ed it once more and ran a smartctl short test.  Still no 
errors.

/dev/sda is an SSD, with /dev/sda2 being mounted as / and /dev/sda3 mounted as 
/home.

I took a fresh back up of /home, just in case, but I don't know what to make 
of the above.  Do you see something familiar in the syslog errors shown above?
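
One way to read the very first exception is to decode the "cmd" taskfile that libata prints for the timed-out WRITE FPDMA QUEUED. The sketch below assumes the usual libata field order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the NCQ convention that the feature registers carry the sector count while nsect carries the tag shifted left by 3. The function name is hypothetical, and a sector count of 0 (which means 65536 for NCQ) is not handled.

```python
def decode_ncq_taskfile(tf):
    """Decode the 'cmd' taskfile of an NCQ read/write as printed by libata.

    Hypothetical helper; the field layout is an assumption stated in the
    lead-in, cross-checked only against the log lines in this thread.
    """
    command, low, high, _device = tf.split("/")
    opcode = int(command, 16)
    if opcode not in (0x60, 0x61):
        raise ValueError("not READ/WRITE FPDMA QUEUED")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # Assemble the 48-bit LBA, most significant byte first.
    lba = 0
    for byte in (hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal):
        lba = (lba << 8) | byte
    return {
        "op": "WRITE" if opcode == 0x61 else "READ",
        "lba": lba,
        "sectors": (hob_feature << 8) | feature,  # NCQ: count lives in feature
        "tag": nsect >> 3,                        # NCQ: tag lives in nsect[7:3]
    }


if __name__ == "__main__":
    print(decode_ncq_taskfile("61/20:08:60:f1:90/00:00:02:00:00/40"))
```

The result is consistent with the rest of the log: 32 sectors of 512 bytes is the "dma 16384" in the same line, the tag is 1, and the LBA 43053408 is the same sector that later fails as the tag#17 Write(10).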

* Re: [gentoo-user] SSD or MoBo playing up?
From: Adam Carter @ 2020-06-19  0:28 UTC
  To: gentoo-user

On Fri, Jun 19, 2020 at 8:48 AM Michael <confabulate@kintzios.com> wrote:

> It started thus:
>
> Jun 18 17:52:45 asus kernel: ata1.00: exception Emask 0x0 SAct 0x2 SErr
> 0x0
> action 0x6 frozen
> Jun 18 17:52:45 asus kernel: ata1.00: failed command: WRITE FPDMA QUEUED
> Jun 18 17:52:45 asus kernel: ata1.00: cmd
> 61/20:08:60:f1:90/00:00:02:00:00/40
> tag 1 ncq dma 16384 out\x0a         res
> 40/00:01:00:4f:c2/00:00:00:00:00/00
> Emask 0x4 (timeout)
> <snip>
>

What does smartctl -a /dev/sda report?

Tried replacing the cable?

* Re: [gentoo-user] SSD or MoBo playing up?
From: mad.scientist.at.large @ 2020-06-19  0:59 UTC
  To: Gentoo User

You might also try a known-good power supply.

If you can't do that, you should definitely try the drive in another system, and/or try another drive with the current motherboard.  With the errors changing over time, the cause could easily be nearly any component in the system.  Symptoms are sometimes misleading, and troubleshooting is difficult for most people, sometimes even for those who are experienced and usually good at it.  Root causes are often obscured, and failures can propagate in unforeseen ways.  Testing suspect components in another system and/or substituting known-good components into the failing system are the most useful approaches, but you have to keep track of what you've tested and what each result suggested, and recheck when something doesn't make sense.

-- “The whole world is watching! The whole world is watching!”



Jun 18, 2020, 18:28 by adamcarter3@gmail.com:

> On Fri, Jun 19, 2020 at 8:48 AM Michael <confabulate@kintzios.com> wrote:
>
>> It started thus:
>>  
>>  Jun 18 17:52:45 asus kernel: ata1.00: exception Emask 0x0 SAct 0x2 SErr 0x0 
>>  action 0x6 frozen
>>  Jun 18 17:52:45 asus kernel: ata1.00: failed command: WRITE FPDMA QUEUED
>>  Jun 18 17:52:45 asus kernel: ata1.00: cmd 61/20:08:60:f1:90/00:00:02:00:00/40 
>>  tag 1 ncq dma 16384 out\x0a         res 40/00:01:00:4f:c2/00:00:00:00:00/00 
>>  Emask 0x4 (timeout)
>> <snip>
>>
>
> What does smartctl -a /dev/sda report?
>
> Tried replacing the cable?
>



* Re: [gentoo-user] SSD or MoBo playing up?
From: Michael @ 2020-06-19 10:26 UTC
  To: gentoo-user

On Friday, 19 June 2020 01:59:55 BST mad.scientist.at.large@tutanota.com 
wrote:
> You might also try a known-good power supply.
> 
> If you can't do that, you should definitely try the drive in another
> system, and/or try another drive with the current motherboard.  With the
> errors changing over time, the cause could easily be nearly any component
> in the system.  Symptoms are sometimes misleading, and troubleshooting is
> difficult for most people, sometimes even for those who are experienced
> and usually good at it.  Root causes are often obscured, and failures can
> propagate in unforeseen ways.  Testing suspect components in another
> system and/or substituting known-good components into the failing system
> are the most useful approaches, but you have to keep track of what you've
> tested and what each result suggested, and recheck when something doesn't
> make sense.
> 
> -- “The whole world is watching! The whole world is watching!”

Thank you both for your replies.

The smartctl result following the short test showed no error, but it doesn't 
present a log of tests either.

Following a reboot dmesg was happy too.
  
========================
# smartctl -a /dev/sda
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.4.38-gentoo] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Indilinx Barefoot 3 based SSDs
Device Model:     OCZ-ARC100
Serial Number:    A22L1061435000660
LU WWN Device Id: 5 e83a97 10000e885
Firmware Version: 1.00
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jun 19 11:12:44 2020 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x1d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.

recommended polling time:        (   0) minutes.
Extended self-test routine
recommended polling time:        (   0) minutes.

SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Runtime_Bad_Block       0x0000   000   000   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       12417
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       3029
171 Avail_OP_Block_Count    0x0000   100   100   000    Old_age   Offline      -       79543120
174 Pwr_Cycle_Ct_Unplanned  0x0000   100   100   000    Old_age   Offline      -       42
195 Total_Prog_Failures     0x0000   100   100   000    Old_age   Offline      -       0
196 Total_Erase_Failures    0x0000   100   100   000    Old_age   Offline      -       0
197 Total_Unc_Read_Failures 0x0000   100   100   000    Old_age   Offline      -       0
208 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       1547
210 SATA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       0
224 In_Warranty             0x0000   100   100   000    Old_age   Offline      -       1
233 Remaining_Lifetime_Perc 0x0000   049   049   000    Old_age   Offline      -       49
241 Host_Writes_GiB         0x0000   100   100   000    Old_age   Offline      -       19172
242 Host_Reads_GiB          0x0000   100   100   000    Old_age   Offline      -       13725
249 Total_NAND_Prog_Ct_GiB  0x0000   100   100   000    Old_age   Offline      -       719206231

SMART Error Log Version: 1
No Errors Logged

Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported
============================
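
For anyone wanting to pull individual counters out of an attribute table like the one above, here is a rough sketch that parses the rows into a name-to-raw-value mapping. It assumes each attribute sits on a single line with ten whitespace-separated fields and an integer raw value, which holds for this drive's output but not for every drive. The sample rows are copied from the report, and the function name is made up. Reading a SATA_CRC_Error_Count of 0 as "no link corruption seen by the drive" is a common rule of thumb, not something the thread establishes.

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Runtime_Bad_Block       0x0000   000   000   000    Old_age   Offline      -       0
174 Pwr_Cycle_Ct_Unplanned  0x0000   100   100   000    Old_age   Offline      -       42
210 SATA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       0
233 Remaining_Lifetime_Perc 0x0000   049   049   000    Old_age   Offline      -       49
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for rows of a smartctl -A style table.

    Hypothetical helper; assumes integer raw values on single-line rows.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID; the header row does not.
        if len(fields) == 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    for name in ("SATA_CRC_Error_Count", "Runtime_Bad_Block",
                 "Pwr_Cycle_Ct_Unplanned", "Remaining_Lifetime_Perc"):
        print(f"{name}: {attrs[name]}")
```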

I'm running a long test now to see if it turns up anything.  Reading the 
earlier errors in syslog, the SATA link failed, the kernel repeatedly reset it 
and downgraded the speed to 3.0 Gbps and then to 1.5 Gbps, but the link kept 
malfunctioning, which ultimately affected the fs.
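
The speed downgrade can also be read off the SStatus/SControl values in the "SATA link up" lines. A minimal sketch, assuming the kernel prints both registers in hex and that they follow the standard SATA layout (bits 3:0 DET, bits 7:4 SPD, bits 11:8 IPM); the function and table names are invented for illustration:

```python
LINK_SPEED = {0: "none", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_sstatus(hex_value):
    """Split an SStatus value (hex string as logged) into its fields."""
    v = int(hex_value, 16)
    return {
        "det": v & 0xF,  # 3 = device present, PHY communication established
        "speed": LINK_SPEED.get((v >> 4) & 0xF, "unknown"),  # negotiated speed
        "ipm": (v >> 8) & 0xF,  # 1 = interface active
    }


def decode_scontrol(hex_value):
    """Split an SControl value (hex string as logged) into its fields."""
    v = int(hex_value, 16)
    return {
        "det": v & 0xF,
        # In SControl the SPD field is the highest speed allowed; 0 = no limit.
        "speed_limit": LINK_SPEED.get((v >> 4) & 0xF, "unknown"),
        "ipm": (v >> 8) & 0xF,  # 3 = partial and slumber transitions disabled
    }


if __name__ == "__main__":
    # Value pairs taken from the two distinct "SATA link up" lines in the log.
    for sstatus, scontrol in (("123", "320"), ("113", "310")):
        print(sstatus, decode_sstatus(sstatus), scontrol, decode_scontrol(scontrol))
```

Under those assumptions, 123/320 is a link negotiated at 3.0 Gbps with the host limiting it to 3.0 Gbps, and 113/310 is the same at 1.5 Gbps, matching the two "limiting SATA link speed" messages.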

I've reseated the cables just for good measure.  I hope the PSU is still OK, 
because I'd rather spend the money on a new PC.
