Re: [gentoo-user] Another hard drive failure

From: Dale <rdalek1967@gmail.com>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] Another hard drive failure
Date: Mon, 9 Jun 2025 11:37:08 -0500
Message-ID: <f6b104ce-983b-2a4f-2c60-0e1449cb15f9@gmail.com>
In-Reply-To: <38882730.XM6RcZxFsP@rogueboard>

Michael wrote:
> On Sunday, 8 June 2025 16:09:19 British Summer Time Dale wrote:
>> Michael wrote:
>>> The 184 End-to-End-Error SMART attribute was developed by Hewlett Packard
>>> to check if corruption took place as data was transferred from/to the
>>> buffer of the drive.  Some drives report it and this one does.  I suppose
>>> if the RAM data buffer is a bit unreliable, this kind of error will come
>>> and go.  Bearing in mind this particular drive was in a laptop,
>>> overheating may have also contributed.
>> I've never seen that one before.  It doesn't show up on the drives I
>> have but another manufacturer may use that.  May even be helpful. 
> I can see it here on two different Seagate drives and the one I mentioned in 
> this thread reports it is failing according to the End-to-End-Error SMART 
> attribute.  Various Western Digital drives and a 1.8" Toshiba drive don't have 
> it.
>
>
>>> I just finished writing to it and no more errors showed up.  I'll format
>>> it, with a slow read-write test for bad-blocks and recheck the smartctl
>>> attributes to see if more errors show up.  Then I could play with it with
>>> some temporary media files which I don't mind if any are lost and see how
>>> it behaves.
>> The drive I mentioned had some small number, single digit number, of
>> those too.  It wasn't much but it did have to correct an error.  I'd do
>> some serious testing and see if it is stable for sure.  I guess
>> badblocks is a good way to go but I've been known to use shred. 
>> Basically, anything that writes to the whole drive should catch errors. 
>> If you can write to it twice and it stays the same, it might be OK.
> SMART may report read errors and/or write errors.  I have found the read 
> errors can be somewhat spurious.  Smart says can't read some sector, but 
> either dd or hdparm can read it and the SMART read error count does not 
> increase when you try it.
>
> Sometimes it is a hard error and fsck will fail to reallocate it.  In this 
> case I try dd or 'hdparm --write-sector' to force a reallocation and/or I try 
> formatting with -cc option.  If this fails too, then the drive is retired from 
> active duty on ill health grounds.  :p
>
>
>> Basically, test it well until you are comfy that it is stable and can be
>> at least fairly trusted.  I might also add, I stuck a post-it note on it
>> that it has bad sectors/blocks.  So I don't forget.  :/ 
>>
>> Post back with what you get.  Curious to see if it is stable or not.  If
>> so, maybe a drive with a small number of these errors and is stable is
>> safe to use.  Maybe.
>>
>> Dale
>>
>> :-)  :-) 
> As mentioned above I finished writing to it with dd - no errors.
>
> I also completed another extended smartctl test on it, still no errors.  The 
> End-to-End_Error attribute continues to show "FAILING_NOW".  I think this is 
> reasonable.  If the drive buffer failed a CRC test once, for whatever unknown 
> reason, it may do so again in the future.  I noticed the temperature of this 
> drive while spinning on the USB docking station is on the high side, compared 
> with other drives.  It may have overheated in a laptop and this is why the 
> error occurred - who knows?  Anyway, it seems stable enough for now to try 
> with real data.
>
> Interestingly, I've a WD 2.5" 1TB drive which a couple of years ago had 
> reported errors on a partition I was using for /var.  An emerge had failed and 
> smartctl confirmed at the time this drive was unable to reallocate the faulty 
> sectors in the partition.  I thought I should do something about it, while I 
> was busying myself on all these drive rescue activities recently.  The problem 
> with this drive was widespread across many sectors.  Overwriting a sector here 
> and another there did not allow me to complete an extended smartctl test 
> without more reallocation errors.  Thankfully, all the errors were taking 
> place on the same partition.  So I overwrote the partition with dd, then 
> deleted it along with two more partitions, repartitioned and eventually 
> reformatted it with -cc.  Following this operation, smartctl reported no more 
> errors.  I've reinstalled Gentoo and, to my surprise, it has been working
> with no problems so far.  Since this WD drive lives in a laptop, I also took
> the opportunity to configure fscrypt for /home, plus cryptsetup for swap
> while at it.  Time will tell how long it may survive.


This is why drive errors need to be looked at in context.  If a drive
has an error, marks that block as bad, and no new errors pop up, the
drive may be usable, but one should keep an eye on it.  The drive I have
has been stable for a few years now, so I would use it if I had to.  If,
however, more errors pop up on a drive and the count keeps increasing,
one might want to find a door that needs to be held open.

I like SMART and it is useful for sure.  The testing methods that drive
makers put in these drives are also useful.  Thing is, SMART can't
detect everything.  Even with what it can detect, it still takes a human
to judge what is acceptable or not.  On that drive, since it says it is
failing, I'd run badblocks, shred, dd or something until it either
failed or I was willing to use it because it is stable.  It's going to
do one or the other.  It could be a learning experience.  If you do,
post back and let us know what it did.  Did it fail, or after numerous
passes is it still going strong with no new errors?
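
For what it's worth, the "did the counts change between passes" check
could be sketched in Python along these lines.  This is only a rough
sketch, not something from the thread: it assumes the usual ATA
attribute table printed by smartctl -A, the sample readings are made up
for illustration, and the names parse_attributes, grown and WATCH are
hypothetical.

```python
# Attribute IDs to watch (an assumption; adjust to taste):
# 5 reallocated, 184 end-to-end, 197 pending, 198 offline uncorrectable.
WATCH = (5, 184, 197, 198)


def parse_attributes(text):
    """Parse `smartctl -A` table text into {id: (name, when_failed, raw)}."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have a hex FLAG column.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        if not parts[2].startswith("0x"):
            continue
        try:
            raw = int(parts[9])
        except ValueError:
            continue  # skip raw values that are not plain counts
        attrs[int(parts[0])] = (parts[1], parts[8], raw)
    return attrs


def grown(before, after, watch=WATCH):
    """Return {id: (name, old_raw, new_raw)} for watched counts that rose."""
    changes = {}
    for attr_id in watch:
        if attr_id in before and attr_id in after:
            name, _, new_raw = after[attr_id]
            old_raw = before[attr_id][2]
            if new_raw > old_raw:
                changes[attr_id] = (name, old_raw, new_raw)
    return changes


# Made-up sample readings, NOT output from any drive in this thread.
HEADER = ("ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH "
          "TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n")
PASS_1 = HEADER + (
    "  5 Reallocated_Sector_Ct   0x0033   100   100   036    "
    "Pre-fail  Always       -       8\n"
    "184 End-to-End_Error        0x0032   098   098   099    "
    "Old_age   Always   FAILING_NOW 2\n"
    "197 Current_Pending_Sector  0x0012   100   100   000    "
    "Old_age   Always       -       0\n"
)
PASS_2 = PASS_1  # a second pass with nothing changed
PASS_3 = HEADER + (
    "  5 Reallocated_Sector_Ct   0x0033   100   100   036    "
    "Pre-fail  Always       -       16\n"
    "184 End-to-End_Error        0x0032   098   098   099    "
    "Old_age   Always   FAILING_NOW 2\n"
    "197 Current_Pending_Sector  0x0012   100   100   000    "
    "Old_age   Always       -       3\n"
)

if __name__ == "__main__":
    first, second, third = (parse_attributes(p)
                            for p in (PASS_1, PASS_2, PASS_3))
    print("pass 1 -> 2:", grown(first, second) or "no new errors")
    print("pass 2 -> 3:", grown(second, third) or "no new errors")
```

Feeding it the saved output of smartctl -A from before and after each
badblocks, shred or dd pass would show whether the drive is holding
steady or getting worse.  A FAILING_NOW flag, like the one on attribute
184, stays visible in the parsed data, but deciding whether that is
acceptable is still a job for a human.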

Dale

:-)  :-)

Thread overview: 6+ messages
2025-06-07 17:41 [gentoo-user] Another hard drive failure Michael
2025-06-07 18:12 ` Dale
2025-06-07 22:42   ` Michael
2025-06-08 15:09     ` Dale
2025-06-09 12:19       ` Michael
2025-06-09 16:37         ` Dale [this message]
