[gentoo-user] Contradictionary behaviour of SMART on hds ?!?

public inbox for gentoo-user@lists.gentoo.org

* [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
@ 2014-07-27 10:12 meino.cramer
  2014-07-27 10:23 ` Dale
  2014-07-27 10:27 ` Neil Bothwick
  0 siblings, 2 replies; 18+ messages in thread
From: meino.cramer @ 2014-07-27 10:12 UTC (permalink / raw
  To: Gentoo

Hi,

after finding a bad sector on my hd (see previous thread) I read a lot
of stuff about SMART, smartctl and such to determine whether a bad
sector is a problem, how severe it is, and how to cope with it.

There is (at least ;) one thing I don't understand:

On the one hand, the surface test (extended offline and such) aborts
as soon as the first read failure happens.

On the other hand it is said: If the count of bad sectors increases
over time it is time to change the hd.

How can the second happen, if the first is true???

Best regards,
mcc

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 10:12 [gentoo-user] Contradictionary behaviour of SMART on hds ?!? meino.cramer
@ 2014-07-27 10:23 ` Dale
  2014-07-27 10:27 ` Neil Bothwick
  1 sibling, 0 replies; 18+ messages in thread
From: Dale @ 2014-07-27 10:23 UTC (permalink / raw
  To: gentoo-user

meino.cramer@gmx.de wrote:
> Hi,
>
> after finding a bad sector on my hd (see previous thread) I read a lot
> of stuff about SMART, smartctl and such to determine whether a bad
> sector is a problem, how severe it is, and how to cope with it.
>
> There is (at least ;) one thing I don't understand:
>
> On the one hand, the surface test (extended offline and such) aborts
> as soon as the first read failure happens.
>
> On the other hand it is said: If the count of bad sectors increases
> over time it is time to change the hd.
>
> How can the second happen, if the first is true???
>
> Best regards,
> mcc
>
>

I am by no means an expert on this but I'll try to provide some info,
which others may correct.  ;-) 

I had a situation similar to yours a few months ago.  I had a bad spot
that popped up.  After some effort, I got the drive to mark that as
bad.  Now, from my understanding, the drive sort of has some extra space
that it can use if needed.  However, at some point it will run out if
you have spots going bad one after another. 

Again, my understanding.  Let's say it has 200 extra spots.  You then
run the test or the drive detects on its own and finds a bad spot.  It
marks that spot as bad and then uses one of the extra spots.  So, if you
have 201 spots go bad, you're fresh out of luck.  Again, that's my
understanding of this. 
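
To make the picture above concrete, here is a toy model of a spare pool.
It is only a sketch: real firmware is far more involved, the class and
method names are made up for illustration, and 200 is just the number
used in the example above, not a real drive's spare count.

```python
# Toy model only: names are hypothetical and the pool size of 200 is
# simply the figure from the message, not a property of any real drive.
class ToyDrive:
    def __init__(self, spare_sectors=200):
        self.spares_left = spare_sectors
        self.remapped = {}  # bad LBA -> index of the spare that replaced it

    def remap(self, lba):
        """Replace a bad sector with a spare; return False if none are left."""
        if lba in self.remapped:
            return True  # already remapped earlier
        if self.spares_left == 0:
            return False  # pool exhausted, the sector stays bad
        self.remapped[lba] = len(self.remapped)
        self.spares_left -= 1
        return True


drive = ToyDrive(spare_sectors=200)
results = [drive.remap(lba) for lba in range(201)]
print(results.count(True), "remapped,", results.count(False), "could not be remapped")
```

The 201st bad spot is the one that finds the pool empty, which is why a
steadily growing count is the thing to watch.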

Now can someone else that knows even more than me explain this better?  
:-D 

Dale

:-)  :-)


* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 10:12 [gentoo-user] Contradictionary behaviour of SMART on hds ?!? meino.cramer
  2014-07-27 10:23 ` Dale
@ 2014-07-27 10:27 ` Neil Bothwick
  2014-07-27 10:41   ` meino.cramer
  1 sibling, 1 reply; 18+ messages in thread
From: Neil Bothwick @ 2014-07-27 10:27 UTC (permalink / raw
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 767 bytes --]

On Sun, 27 Jul 2014 12:12:47 +0200, meino.cramer@gmx.de wrote:

> On the one hand, the surface test (extended offline and such) aborts
> as soon as the first read failure happens.
> 
> On the other hand it is said: If the count of bad sectors increases
> over time it is time to change the hd.
> 
> How can the second happen, if the first is true???

My understanding is that the test only aborts if the error is severe
enough to force it to do so. A simple bad block can be skipped and the
rest of the drive tested.

I've had a couple of drives get to the stage where SMART tests abort at
an error and in both cases the manufacturer replaced them without
question.


-- 
Neil Bothwick

If at first you do succeed, try to hide your astonishment.



* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 10:27 ` Neil Bothwick
@ 2014-07-27 10:41   ` meino.cramer
  2014-07-27 11:11     ` Dale
  2014-07-27 18:57     ` Neil Bothwick
  0 siblings, 2 replies; 18+ messages in thread
From: meino.cramer @ 2014-07-27 10:41 UTC (permalink / raw
  To: gentoo-user

Neil Bothwick <neil@digimed.co.uk> [14-07-27 12:32]:
> On Sun, 27 Jul 2014 12:12:47 +0200, meino.cramer@gmx.de wrote:
> 
> > On the one hand, the surface test (extended offline and such) aborts
> > as soon as the first read failure happens.
> > 
> > On the other hand it is said: If the count of bad sectors increases
> > over time it is time to change the hd.
> > 
> > How can the second happen, if the first is true???
> 
> My understanding is that the test only aborts if the error is severe
> enough to force it to do so. A simple bad block can be skipped and the
> rest of the drive tested.
> 
> I've had a couple of drives get to the stage where SMART tests abort at
> an error and in both cases the manufacturer replaced them without
> question.
> 
> 
> -- 
> Neil Bothwick
> 
> If at first you do succeed, try to hide your astonishment.

Hi Dale, hi Neil,

thanks for the info.

But it is slightly off the point I tried to explain (I am not a native
English speaker...sorry...:)

Suppose - as in my case - I have not yet managed to get the hd to
map out the bad sector...

Now...all tests abort after scanning 10% of the disk. Disk health
status is reported as "PASSED"...because only one bad sector has been
found.

But 90% of the space of the disk has never been scanned.

Is this an implementation fault?
And if YES...is it the implementation of the firmware?
And: Is it my firmware or the one of the drive?
;)

Best regards,
mcc





* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 10:41   ` meino.cramer
@ 2014-07-27 11:11     ` Dale
  2014-07-27 11:29       ` meino.cramer
  2014-07-27 18:57     ` Neil Bothwick
  1 sibling, 1 reply; 18+ messages in thread
From: Dale @ 2014-07-27 11:11 UTC (permalink / raw
  To: gentoo-user

meino.cramer@gmx.de wrote:
> Neil Bothwick <neil@digimed.co.uk> [14-07-27 12:32]:
>> On Sun, 27 Jul 2014 12:12:47 +0200, meino.cramer@gmx.de wrote:
>>
>>> On the one hand, the surface test (extended offline and such) aborts
>>> as soon as the first read failure happens.
>>>
>>> On the other hand it is said: If the count of bad sectors increases
>>> over time it is time to change the hd.
>>>
>>> How can the second happen, if the first is true???
>> My understanding is that the test only aborts if the error is severe
>> enough to force it to do so. A simple bad block can be skipped and the
>> rest of the drive tested.
>>
>> I've had a couple of drives get to the stage where SMART tests abort at
>> an error and in both cases the manufacturer replaced them without
>> question.
>>
>>
>> -- 
>> Neil Bothwick
>>
>> If at first you do succeed, try to hide your astonishment.
> Hi Dale, hi Neil,
>
> thanks for the infos.
>
> But it is slightly off the point I tried to explain (I am no native
> english speaker...sorry...:)
>
> Suppose - as in my case - I have not yet managed to get the hd to
> map out the bad sector...
>
> Now...all tests abort after scanning 10% of the disk. Disk health
> status is reported as "PASSED"...cause only one bad sector has been
> found.
>
> But 90% of the space of the disk has never been scanned.
>
> Is this an implementation fault?
> And if YES...is it the implementation of the firmware?
> And: Is it my firmware or the one of the drive?
> ;)
>
> Best regards,
> mcc
>

Interesting.  I was able to get mine to do a full test and give me a
clean result.  If yours doesn't, well, I'd be diggin me out a box and
sending that puppy back to mommy.  It seems to need some help.  To me,
errors are one thing, errors that can't be corrected are a whole new
problem.  It should fix it and pass the test.

Even with my drive passing the test, I don't trust it yet.  If it was
still showing the error even after I did what I had done, I certainly
wouldn't trust it.  If yours can't finish the long self test, it may
need repairs that are above our pay grade. 

Maybe Neil or someone will have more ideas.  I hope.

Dale

:-)  :-) 



* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 11:11     ` Dale
@ 2014-07-27 11:29       ` meino.cramer
  2014-07-27 12:33         ` Dale
  2014-07-28  4:44         ` Jc García
  0 siblings, 2 replies; 18+ messages in thread
From: meino.cramer @ 2014-07-27 11:29 UTC (permalink / raw
  To: gentoo-user

Dale <rdalek1967@gmail.com> [14-07-27 13:12]:
> meino.cramer@gmx.de wrote:
> > Neil Bothwick <neil@digimed.co.uk> [14-07-27 12:32]:
> >> On Sun, 27 Jul 2014 12:12:47 +0200, meino.cramer@gmx.de wrote:
> >>
> >>> On the one hand, the surface test (extended offline and such) aborts
> >>> as soon as the first read failure happens.
> >>>
> >>> On the other hand it is said: If the count of bad sectors increases
> >>> over time it is time to change the hd.
> >>>
> >>> How can the second happen, if the first is true???
> >> My understanding is that the test only aborts if the error is severe
> >> enough to force it to do so. A simple bad block can be skipped and the
> >> rest of the drive tested.
> >>
> >> I've had a couple of drives get to the stage where SMART tests abort at
> >> an error and in both cases the manufacturer replaced them without
> >> question.
> >>
> >>
> >> -- 
> >> Neil Bothwick
> >>
> >> If at first you do succeed, try to hide your astonishment.
> > Hi Dale, hi Neil,
> >
> > thanks for the infos.
> >
> > But it is slightly off the point I tried to explain (I am no native
> > english speaker...sorry...:)
> >
> > Suppose - as in my case - I have not yet managed to get the hd to
> > map out the bad sector...
> >
> > Now...all tests abort after scanning 10% of the disk. Disk health
> > status is reported as "PASSED"...cause only one bad sector has been
> > found.
> >
> > But 90% of the space of the disk has never been scanned.
> >
> > Is this an implementation fault?
> > And if YES...is it the implementation of the firmware?
> > And: Is it my firmware or the one of the drive?
> > ;)
> >
> > Best regards,
> > mcc
> >
> 
> Interesting.  I was able to get mine to do a full test and give me a
> clean result.  If yours doesn't, well, I'd be diggin me out a box and
> sending that puppy back to mommy.  It seems to need some help.  To me,
> errors is one thing, errors that can't be corrected is a whole new
> problem.  It should fix it and pass the test.
> 
> Even with my drive passing the test, I don't trust it yet.  If it was
> still showing the error even after I did what I had done, I certainly
> wouldn't trust it.  If yours can't finish the long self test, it may
> need repairs that are above our pay grade. 
> 
> Maybe Neil or someone will have more ideas.  I hope.
> 
> Dale
> 
> :-)  :-) 

>

Back to the initial problem:

How can I offline test the rest of the disk if 
the first bad sector (10%) of the surface breaks
the test with an error?

Best regards,
mcc





* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 11:29       ` meino.cramer
@ 2014-07-27 12:33         ` Dale
  2014-07-27 13:19           ` meino.cramer
  2014-07-28  4:44         ` Jc García
  1 sibling, 1 reply; 18+ messages in thread
From: Dale @ 2014-07-27 12:33 UTC (permalink / raw
  To: gentoo-user

meino.cramer@gmx.de wrote:
> Back to the initial problem: How can I offline test the rest of the
> disk if the first bad sector (10%) of the surface breaks the test with
> an error? Best regards, mcc 

I never got mine to go past the first failure until I used dd to erase
the drive.  As mentioned before, I might have been able to do that without
moving my data, but that was too complicated and risky for me at the
time.  From my understanding though, until that data is moved off the bad
spot so that the drive knows it can do what it needs to, that spot is
still going to show up.  I don't know of a way to make it test beyond
the bad spot either.

If you have a drive that you can move that data over to so that you can
play with the bad drive, that's what I would do.  Once you get it moved,
then dd the whole drive, run the test and then see what results you
get.  I looked at a howto that someone posted or I found and doing it
with the data on there just made me nervous. 

I'm running out of info here.  Can anyone else provide more help than me?

Dale

:-)  :-) 


* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 12:33         ` Dale
@ 2014-07-27 13:19           ` meino.cramer
  2014-07-27 14:05             ` Dale
  0 siblings, 1 reply; 18+ messages in thread
From: meino.cramer @ 2014-07-27 13:19 UTC (permalink / raw
  To: gentoo-user

Dale <rdalek1967@gmail.com> [14-07-27 14:36]:
> meino.cramer@gmx.de wrote:
> > Back to the initial problem: How can I offline test the rest of the
> > disk if the first bad sector (10%) of the surface breaks the test with
> > an error? Best regards, mcc 
> 
> I never got mine to go past the first failure until I used dd to erase
> the drive.  As mentioned before, I may could have done that without
> moving my data but that was to complicated and risky for me at the
> time.  From my understanding tho, until that data is moved off the bad
> spot so that the drive knows it can do what it needs to, that spot is
> still going to show up.  I don't know of a way to make it test beyond
> the bad spot either.
> 
> If you have a drive that you can move that data over to so that you can
> play with the bad drive, that's what I would do.  Once you get it moved,
> then dd the whole drive, run the test and then see what results you
> get.  I looked at a howto that someone posted or I found and doing it
> with the data on there just made me nervous. 
> 
> I'm running out of info here.  Anyone else provide more help than me?
> 
> Dale
> 
> :-)  :-) 
> 

Hi Dale,

thanks for the info...

I already did this. PLEASE read my previous posting completely.

dd failed with an I/O error at that spot.





* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 13:19           ` meino.cramer
@ 2014-07-27 14:05             ` Dale
  2014-07-27 14:34               ` Mick
  0 siblings, 1 reply; 18+ messages in thread
From: Dale @ 2014-07-27 14:05 UTC (permalink / raw
  To: gentoo-user

meino.cramer@gmx.de wrote:
> Dale <rdalek1967@gmail.com> [14-07-27 14:36]:
>> meino.cramer@gmx.de wrote:
>>> Back to the initial problem: How can I offline test the rest of the
>>> disk if the first bad sector (10%) of the surface breaks the test with
>>> an error? Best regards, mcc 
>> I never got mine to go past the first failure until I used dd to erase
>> the drive.  As mentioned before, I may could have done that without
>> moving my data but that was to complicated and risky for me at the
>> time.  From my understanding tho, until that data is moved off the bad
>> spot so that the drive knows it can do what it needs to, that spot is
>> still going to show up.  I don't know of a way to make it test beyond
>> the bad spot either.
>>
>> If you have a drive that you can move that data over to so that you can
>> play with the bad drive, that's what I would do.  Once you get it moved,
>> then dd the whole drive, run the test and then see what results you
>> get.  I looked at a howto that someone posted or I found and doing it
>> with the data on there just made me nervous. 
>>
>> I'm running out of info here.  Anyone else provide more help than me?
>>
>> Dale
>>
>> :-)  :-) 
>>
> Hi Dale,
>
> thanks for the info...
>
> I already did this. PLEASE read my previous posting completly.
>
> dd failed with an I/O error at that spot.
>
>
>
>

Hmmmm.  I'd be getting my data off there or some sort of backup and then
try erasing the whole drive.  If that fails as well, then it seems like
you need a box and some shipping to get a replacement if it is under
warranty.  If the dd fails, that sounds like maybe it has an error it
can't correct for some reason.  I think dd does its thing on a basic
level and I have never had it give me an error except for running out of
space when it is done.  I'm sure if the command you used was wrong, Neil
would have picked up on it and said something.  So, I don't think you
are doing anything wrong, I just think your drive may have even more
serious issues than mine had. 

Unless someone else comes on with an idea on something else to try, I'd
be looking for somewhere to put my data and a different drive.  If after
that you can get it working, well, you got a spare.  If not, it was
broke anyway. 

I hope someone else has more ideas. 

Dale

:-)  :-)


* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 14:05             ` Dale
@ 2014-07-27 14:34               ` Mick
  2014-07-27 16:50                 ` meino.cramer
  0 siblings, 1 reply; 18+ messages in thread
From: Mick @ 2014-07-27 14:34 UTC (permalink / raw
  To: gentoo-user

[-- Attachment #1: Type: Text/Plain, Size: 2742 bytes --]

On Sunday 27 Jul 2014 15:05:53 Dale wrote:
> meino.cramer@gmx.de wrote:
> > Dale <rdalek1967@gmail.com> [14-07-27 14:36]:
> >> meino.cramer@gmx.de wrote:
> >>> Back to the initial problem: How can I offline test the rest of the
> >>> disk if the first bad sector (10%) of the surface breaks the test with
> >>> an error? Best regards, mcc
> >> 
> >> I never got mine to go past the first failure until I used dd to erase
> >> the drive.  As mentioned before, I may could have done that without
> >> moving my data but that was to complicated and risky for me at the
> >> time.  From my understanding tho, until that data is moved off the bad
> >> spot so that the drive knows it can do what it needs to, that spot is
> >> still going to show up.  I don't know of a way to make it test beyond
> >> the bad spot either.
> >> 
> >> If you have a drive that you can move that data over to so that you can
> >> play with the bad drive, that's what I would do.  Once you get it moved,
> >> then dd the whole drive, run the test and then see what results you
> >> get.  I looked at a howto that someone posted or I found and doing it
> >> with the data on there just made me nervous.
> >> 
> >> I'm running out of info here.  Anyone else provide more help than me?
> >> 
> >> Dale
> >> 
> >> :-)  :-)
> > 
> > Hi Dale,
> > 
> > thanks for the info...
> > 
> > I already did this. PLEASE read my previous posting completly.
> > 
> > dd failed with an I/O error at that spot.
> 
> Hmmmm.  I'd be getting my data off there or some sort of backup and then
> try erasing the whole drive.  If that fails as well, then it seems like
> you need a box and some shipping to get a replacement if it is under
> warranty.  If the dd fails, that sounds like maybe it has a error it
> can't correct for some reason.  I think dd does its thing on a basic
> level and I have never had it give me a error except for running out of
> space when it is done.  I'm sure if the command you used was wrong, Neil
> would have picked up on it and said something.  So, I don't think you
> are doing anything wrong, I just think your drive may have even more
> serious issues than mine had.
> 
> Unless someone else comes on with a idea on something else to try, I'd
> be looking for somewhere to put my data and a different drive.  If after
> that you can get it working, well, you got a spare.  If not, it was
> broke anyway.
> 
> I hope someone else has more ideas.

Does it still error out if you run the commands in this sequence?

mkswap -L swap -f -c /dev/sda2
dd if=/dev/zero of=/dev/sda2 bs=512 conv=notrunc

Also, did you try the 'hdparm --write-sector' option that Volker mentioned?

-- 
Regards,
Mick



* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 14:34               ` Mick
@ 2014-07-27 16:50                 ` meino.cramer
  2014-07-28  8:22                   ` Helmut Jarausch
  2014-07-29 10:27                   ` Mick
  0 siblings, 2 replies; 18+ messages in thread
From: meino.cramer @ 2014-07-27 16:50 UTC (permalink / raw
  To: gentoo-user

Mick <michaelkintzios@gmail.com> [14-07-27 16:36]:
> On Sunday 27 Jul 2014 15:05:53 Dale wrote:
> > meino.cramer@gmx.de wrote:
> > > Dale <rdalek1967@gmail.com> [14-07-27 14:36]:
> > >> meino.cramer@gmx.de wrote:
> > >>> Back to the initial problem: How can I offline test the rest of the
> > >>> disk if the first bad sector (10%) of the surface breaks the test with
> > >>> an error? Best regards, mcc
> > >> 
> > >> I never got mine to go past the first failure until I used dd to erase
> > >> the drive.  As mentioned before, I may could have done that without
> > >> moving my data but that was to complicated and risky for me at the
> > >> time.  From my understanding tho, until that data is moved off the bad
> > >> spot so that the drive knows it can do what it needs to, that spot is
> > >> still going to show up.  I don't know of a way to make it test beyond
> > >> the bad spot either.
> > >> 
> > >> If you have a drive that you can move that data over to so that you can
> > >> play with the bad drive, that's what I would do.  Once you get it moved,
> > >> then dd the whole drive, run the test and then see what results you
> > >> get.  I looked at a howto that someone posted or I found and doing it
> > >> with the data on there just made me nervous.
> > >> 
> > >> I'm running out of info here.  Anyone else provide more help than me?
> > >> 
> > >> Dale
> > >> 
> > >> :-)  :-)
> > > 
> > > Hi Dale,
> > > 
> > > thanks for the info...
> > > 
> > > I already did this. PLEASE read my previous posting completly.
> > > 
> > > dd failed with an I/O error at that spot.
> > 
> > Hmmmm.  I'd be getting my data off there or some sort of backup and then
> > try erasing the whole drive.  If that fails as well, then it seems like
> > you need a box and some shipping to get a replacement if it is under
> > warranty.  If the dd fails, that sounds like maybe it has a error it
> > can't correct for some reason.  I think dd does its thing on a basic
> > level and I have never had it give me a error except for running out of
> > space when it is done.  I'm sure if the command you used was wrong, Neil
> > would have picked up on it and said something.  So, I don't think you
> > are doing anything wrong, I just think your drive may have even more
> > serious issues than mine had.
> > 
> > Unless someone else comes on with a idea on something else to try, I'd
> > be looking for somewhere to put my data and a different drive.  If after
> > that you can get it working, well, you got a spare.  If not, it was
> > broke anyway.
> > 
> > I hope someone else has more ideas.
> 
> Does it still error out if you run the commands in this sequence?
> 
> mkswap -L swap -f -c /dev/sda2
> dd if=/dev/zero of=/dev/sda2 bs=512 conv=notrunc
> 
> Also, did you try the 'hdparm --write-sector' option that Volker mentioned?
> 
> -- 
> Regards,
> Mick

Hi Mick,

thanks for your reply on the topic.

I executed the mkswap/dd combo several times today. Since I have
no logs I repeated it again. Here are the results:

solfire:/home/user>mkswap -L swap -f -c /dev/sda2
1 bad page
mkswap: /dev/sda2: warning: wiping old swap signature.
Setting up swapspace version 1, size = 6291448 KiB
LABEL=swap, UUID=e742c0a6-862c-41e9-be4b-698b33c5a236
solfire:/home/user>dd if=/dev/zero of=/dev/sda2 bs=512 conv=notrunc
dd: error writing ‘/dev/sda2’: Input/output error
1669369+0 records in
1669368+0 records out
854716416 bytes (855 MB) copied, 28.4799 s, 30.0 MB/s
[1]    24047 exit 1     dd if=/dev/zero of=/dev/sda2 bs=512 conv=notrunc
solfire:/home/user>


I am a little anxious about the hdparm command...
For me it is unclear what sector is meant:

smartctl says:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed: read failure       90%     14500         4288352511

From a previous posting I learned that "LBA" in this case is the byte
counter.

The sector is therefore 4288352511/512=8375688

However as a result of the dd command above I found this in the dmesg log:

[48588.471905] end_request: I/O error, dev sda, sector 1773816

Now...what sector count fits what sector count ... ?

I will not fire zeroes towards my hd this way before I know exactly
what I am shooting at... ;)

Any light in all this shadow is heartily appreciated...

Best regards,
mcc











* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 16:50                 ` meino.cramer
@ 2014-07-28  8:22                   ` Helmut Jarausch
  2014-07-29 10:27                   ` Mick
  1 sibling, 0 replies; 18+ messages in thread
From: Helmut Jarausch @ 2014-07-28  8:22 UTC (permalink / raw
  To: gentoo-user

On 07/27/2014 06:50:49 PM, meino.cramer@gmx.de wrote:
> Hi Mick,
> 
> thanks for your reply on the topic.
> 
> I executed the mkswap/dd combo a several times today. Since I have
> no logs I repeated again. Here are the results:
> 
> solfire:/home/user>mkswap -L swap -f -c /dev/sda2
> 1 bad page
> mkswap: /dev/sda2: warning: wiping old swap signature.
> Setting up swapspace version 1, size = 6291448 KiB
> LABEL=swap, UUID=e742c0a6-862c-41e9-be4b-698b33c5a236
> solfire:/home/user>dd if=/dev/zero of=/dev/sda2 bs=512 conv=notrunc
> dd: error writing ‘/dev/sda2’: Input/output error
> 1669369+0 records in
> 1669368+0 records out
> 854716416 bytes (855 MB) copied, 28.4799 s, 30.0 MB/s
> [1]    24047 exit 1     dd if=/dev/zero of=/dev/sda2 bs=512 conv=notrunc
> solfire:/home/user>
> 
> 
> I am a little anxious about the hdparm command...
> For me it is unclear what sector is meant:
> 
> smartctl says:
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Selective offline   Completed: read failure       90%     14500         4288352511
> 
> From a previous posting I learned that "LBA" in this case is the byte
> counter.
> 
> The sector is therefore 4288352511/512=8375688
> 
> However as a result of the dd command above I found this in the dmesg log:
> 
> [48588.471905] end_request: I/O error, dev sda, sector 1773816
> 
> Now...what sector count fits what sector count ... ?
> 
> I will not fire zeroes towards my hd this way before I know exactly
> to what I am shooting at... ;)
> 
> Any light in all this shadow is heartly appreciated...
> 
> Best regards,
> mcc
> 
Here are a few observations: First, smartctl starts counting at the very
first sector of the drive, while dd starts counting at the first sector
of the partition. So, find out where the partition starts by using fdisk
and add the partition offset to the number given by dd.
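
As a rough sketch of that conversion (the function name and the start
sector 2048 are hypothetical placeholders; the real start sector has to
be read from fdisk, and 512-byte sectors are assumed because the dd
command used bs=512):

```python
SECTOR_SIZE = 512  # bytes; assumed, matches bs=512 in the dd command


def absolute_sector(partition_start, sector_in_partition):
    """Convert a partition-relative sector into a whole-disk sector."""
    return partition_start + sector_in_partition


# Hypothetical start sector; read the real one from `fdisk -l`.
partition_start = 2048
# dd reported 1669368 records written before the error, so the record
# that failed is number 1669368 when counting from zero.
failed_in_partition = 1669368

whole_disk = absolute_sector(partition_start, failed_in_partition)
print("whole-disk sector:", whole_disk)
print("byte offset on disk:", whole_disk * SECTOR_SIZE)
```

The result is the number to compare against what smartctl and dmesg
report, since both of those count from the start of the drive.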

Second, if your file system is ext{2,3,4}, try using debugfs as described in
file:///home/jarausch/GenToo/Hints/Smartmontools_badblockhowto.html

Third, as far as I understand, smartctl's '-t select' option lets you test
specific ranges of the disk. You could try to start the test after the
defective sector.
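
A small sketch of how such a command line could be put together. The
helper name is made up, and the LBA values in the example call are
hypothetical; only the 'select,N-M' form of the option is assumed here,
so the last LBA of the drive has to be supplied by hand.

```python
def selective_test_command(device, bad_lba, last_lba):
    """Build a smartctl call that tests from just past bad_lba to last_lba.

    Uses the 'select,N-M' span form of the -t option. Both LBAs count
    512-byte sectors from the start of the whole drive, not the partition.
    """
    first = bad_lba + 1
    if first > last_lba:
        raise ValueError("nothing left to test after the bad sector")
    return ["smartctl", "-t", f"select,{first}-{last_lba}", device]


# Hypothetical numbers, for illustration only.
cmd = selective_test_command("/dev/sda", bad_lba=1000, last_lba=5000)
print(" ".join(cmd))
```

The list form can be handed to subprocess.run() as-is; printing it first
makes it easy to double check the range before anything is started.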

Helmut




* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 16:50                 ` meino.cramer
  2014-07-28  8:22                   ` Helmut Jarausch
@ 2014-07-29 10:27                   ` Mick
  1 sibling, 0 replies; 18+ messages in thread
From: Mick @ 2014-07-29 10:27 UTC (permalink / raw
  To: gentoo-user; +Cc: meino.cramer

[-- Attachment #1: Type: Text/Plain, Size: 2290 bytes --]

On Sunday 27 Jul 2014 17:50:49 meino.cramer@gmx.de wrote:

> I executed the mkswap/dd combo a several times today. Since I have
> no logs I repeated again. Here are the results:
> 
> solfire:/home/user>mkswap -L swap -f -c /dev/sda2
> 1 bad page
> mkswap: /dev/sda2: warning: wiping old swap signature.
> Setting up swapspace version 1, size = 6291448 KiB
> LABEL=swap, UUID=e742c0a6-862c-41e9-be4b-698b33c5a236
> solfire:/home/user>dd if=/dev/zero of=/dev/sda2 bs=512 conv=notrunc
> dd: error writing ‘/dev/sda2’: Input/output error
> 1669369+0 records in
> 1669368+0 records out
> 854716416 bytes (855 MB) copied, 28.4799 s, 30.0 MB/s
> [1]    24047 exit 1     dd if=/dev/zero of=/dev/sda2 bs=512 conv=notrunc
> solfire:/home/user>

Ahh!  This is a different result to what you showed previously, unless you 
chopped off the output last time.  It now shows that it was able to write:

 1669368 x 512 = 854,716,416 bytes 

out of a total sda2 partition size of about 6.0 GiB (6291448 KiB).
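
For anyone who wants to redo the arithmetic, a short sketch using only
the numbers from the dd and mkswap output quoted above (the variable
names are mine):

```python
records_out = 1669368    # "records out" from the dd output
block_size = 512         # bs=512 in the dd command
partition_kib = 6291448  # size in KiB reported by mkswap

bytes_written = records_out * block_size
partition_bytes = partition_kib * 1024
fraction = bytes_written / partition_bytes

print("bytes written: ", bytes_written)
print(f"partition size: {partition_bytes / 2**30:.2f} GiB")
print(f"share written:  {fraction:.1%}")
```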

> I am a little anxious about the hdparm command...
> For me it is unclear what sector is meant:
> 
> smartctl says:
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Selective offline   Completed: read failure       90%     14500         4288352511
> 
> From a previous posting I learned that "LBA" in this case is the byte
> counter.
> 
> The sector is therefore 4288352511/512=8375688

OK, let's not confuse ourselves:  LBA counting starts from zero.

Therefore, if you subtract the start of the sda2 partition it becomes:

4288352511 - 104448 = 4288248063 / 512 = 8,375,484.5 bytes, within the sda2 
partition.

Unless my maths failed me above, I can't say I understand why dd bailed out 
after writing 854,716,416 bytes, but smartctl failed earlier than that reading 
just 8,375,484.5 bytes.  Perhaps dd can write further than what smartctl can 
read without error, because these are two different mechanisms - but I am 
guessing wildly.  :-)
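
One cross-check that can be done with the numbers already in this
thread, as a sketch: it assumes that 104448 is the start of sda2
counted in 512-byte sectors, and that the dmesg line counts sectors
from the start of sda. It says nothing about how the smartctl LBA
relates to either figure.

```python
SDA2_START = 104448       # assumed: start of sda2 in 512-byte sectors
dd_records_out = 1669368  # records dd wrote before the I/O error
dmesg_sector = 1773816    # sector from the end_request line in dmesg

# dd wrote records 0 .. 1669367, so the record that failed is number
# 1669368 within the partition.
first_failed_on_disk = SDA2_START + dd_records_out

print("dd failure as whole-disk sector:", first_failed_on_disk)
print("matches dmesg:", first_failed_on_disk == dmesg_sector)
```

If the two agree, then dd and the kernel are at least pointing at the
same place on the disk.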

Someone more knowledgeable should chime in here.

PS. I'm copying you in just in case this is lost in your Inbox - I had to 
resend it because I used the wrong From address by mistake and I suspect it 
never made it to the list.
-- 
Regards,
Mick



* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 11:29       ` meino.cramer
  2014-07-27 12:33         ` Dale
@ 2014-07-28  4:44         ` Jc García
  1 sibling, 0 replies; 18+ messages in thread
From: Jc García @ 2014-07-28  4:44 UTC (permalink / raw
  To: gentoo-user

2014-07-27 5:29 GMT-06:00  <meino.cramer@gmx.de>:
>
> Back to the initial problem:
>
> How can I offline test the rest of the disk if
> the first bad sector (10%) of the surface breaks
> the test with an error?
>
> Best regards,
> mcc

I've only read this thread, not the previous one, and I can only give some
feedback on this part. If I understood correctly, dd reported failing at
4288352511 bytes into your drive. Make a loopback device with an offset
past that sector so that dd can keep going, or run dd with seek (skip,
when reading) set to start past that sector.
# losetup -o  428835252 /dev/loop1 /dev/your_hdd
This is just an example; calculate an appropriate byte count for one
sector past the bad one, and then use dd on /dev/loop1
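
A sketch of that calculation, assuming 512-byte sectors and taking the
whole-disk sector 1773816 from the dmesg line posted earlier in the
thread as the bad one (the helper name is made up, and /dev/your_hdd is
the same placeholder as above):

```python
SECTOR_SIZE = 512  # bytes; assumed


def offset_past(bad_sector, sector_size=SECTOR_SIZE):
    """Byte offset of the first sector after bad_sector."""
    return (bad_sector + 1) * sector_size


bad_sector = 1773816  # whole-disk sector from the dmesg line
offset = offset_past(bad_sector)

# Only prints the values; nothing is written to any device here.
print(f"losetup -o {offset} /dev/loop1 /dev/your_hdd")
print(f"equivalent dd start: bs={SECTOR_SIZE} seek={bad_sector + 1}")
```

Keeping the offset a whole multiple of the sector size means the loop
device starts exactly on a sector boundary.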


* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 10:41   ` meino.cramer
  2014-07-27 11:11     ` Dale
@ 2014-07-27 18:57     ` Neil Bothwick
  2014-07-27 19:20       ` Dale
  1 sibling, 1 reply; 18+ messages in thread
From: Neil Bothwick @ 2014-07-27 18:57 UTC (permalink / raw
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 1192 bytes --]

On Sun, 27 Jul 2014 12:41:15 +0200, meino.cramer@gmx.de wrote:

> > My understanding is that the test only aborts if the error is severe
> > enough to force it to do so. A simple bad block can be skipped and the
> > rest of the drive tested.

> But it is slightly off the point I tried to explain (I am no native
> english speaker...sorry...:)
> 
> Suppose - as in my case - I have not yet managed to urge the hd to 
> map the bad sector off...
> 
> Now...all tests abort after scanning 10% of the disk. Disk health
> status is reported as "PASSED"...cause only one bad sector has been
> found.
> 
> But 90% of the space of the disk has never been scanned.

Read the smartctl message again, it's not reporting a bad sector, it's
reporting a read failure. Bad sectors are detected and mapped out in the
background, you have something more serious, something that prevents the
drive scanning past this point. If it's less than two years old, send it
back. Most drive manufacturers have a form on their web site where you
can input the serial number and see the warranty status. If you can
return it, do so ASAP.


-- 
Neil Bothwick

Press every key to continue.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 18:57     ` Neil Bothwick
@ 2014-07-27 19:20       ` Dale
  2014-07-27 21:21         ` Neil Bothwick
  0 siblings, 1 reply; 18+ messages in thread
From: Dale @ 2014-07-27 19:20 UTC (permalink / raw
  To: gentoo-user

Neil Bothwick wrote:
> On Sun, 27 Jul 2014 12:41:15 +0200, meino.cramer@gmx.de wrote:
>
>>> My understanding is that the test only aborts if the error is severe
>>> enough to force it to do so. A simple bad block can be skipped and the
>>> rest of the drive tested.
>> But it is slightly off the point I tried to explain (I am no native
>> english speaker...sorry...:)
>>
>> Suppose - as in my case - I have not yet managed to urge the hd to 
>> map the bad sector off...
>>
>> Now...all tests abort after scanning 10% of the disk. Disk health
>> status is reported as "PASSED"...cause only one bad sector has been
>> found.
>>
>> But 90% of the space of the disk has never been scanned.
> Read the smartctl message again, it's not reporting a bad sector, it's
> reporting a read failure. Bad sectors are detected and mapped out in the
> background, you have something more serious, something that prevents the
> drive scanning past this point. If it's less than two years old, send it
> back. Most drive manufacturers have a form on their web site where you
> can input the serial number and see the warranty status. If you can
> return it, do so ASAP.
>
>

Glad you noticed something I didn't.  I just wish it was better news for
the OP.

Question.  Does that mean that the heads can't move past that point?  If
yes, does that mean the OP can't get any data that is further out than
that point?  I'm asking hoping I will learn something.  I have taken
drives apart so I know how the arm moves the heads across the platter. 
If I get what you are saying, it's like the heads get to a certain
point, about 10%, and then stop.

Thanks.

Dale

:-)  :-)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 19:20       ` Dale
@ 2014-07-27 21:21         ` Neil Bothwick
  2014-07-28  3:11           ` Dale
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Bothwick @ 2014-07-27 21:21 UTC (permalink / raw
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 787 bytes --]

On Sun, 27 Jul 2014 14:20:23 -0500, Dale wrote:

> Question.  Does that mean that the heads can't move past that point?  If
> yes, does that mean the OP can't get any data that is further out than
> that point?  I'm asking hoping I will learn something.  I have taken
> drives apart so I know how the arm moves the heads across the platter. 
> If I get what you are saying, it's like the heads get to a certain
> point, about 10%, and then stop.

I don't think so, as I've seen this sort of thing on a drive but still
been able to access ~all my data. It seems that the SMART tests are a
little stupid in this respect and give up when they decide a drive is
broken, as opposed to failing.
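
One way to get the rest of the surface tested when the long test gives up
early is a selective self-test that starts just past the failing LBA. The
sketch below only builds and prints the command. It assumes the drive supports
selective self-tests and that the failing LBA has been read from the
LBA_of_first_error column of `smartctl -l selftest`. The helper name, the LBA
123456 and /dev/sdX are all made-up placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print a smartctl selective self-test command that
# covers everything from the LBA after the reported failure ($1) to the
# end of the device ($2).
selective_test_cmd() {
    failed_lba=$1
    device=$2
    echo "smartctl -t select,$(( failed_lba + 1 ))-max $device"
}

# Placeholder values; substitute the LBA and device from your own log.
selective_test_cmd 123456 /dev/sdX
```

If that test also aborts, the same step can be repeated from the next
reported LBA, which gives a rough count of how many unreadable spots there are.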


-- 
Neil Bothwick

Irritable? Who the bloody hell are you calling irritable?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [gentoo-user] Contradictionary behaviour of SMART on hds ?!?
  2014-07-27 21:21         ` Neil Bothwick
@ 2014-07-28  3:11           ` Dale
  0 siblings, 0 replies; 18+ messages in thread
From: Dale @ 2014-07-28  3:11 UTC (permalink / raw
  To: gentoo-user

Neil Bothwick wrote:
> On Sun, 27 Jul 2014 14:20:23 -0500, Dale wrote:
>
>> Question.  Does that mean that the heads can't move past that point?  If
>> yes, does that mean the OP can't get any data that is further out than
>> that point?  I'm asking hoping I will learn something.  I have taken
>> drives apart so I know how the arm moves the heads across the platter. 
>> If I get what you are saying, it's like the heads get to a certain
>> point, about 10%, and then stop.
> I don't think so, as I've seen this sort of thing on a drive but still
> been able to access ~all my data. It seems that the SMART tests are a
> little stupid in this respect and give up when they decide a drive is
> broken, as opposed to failing.
>
>

So, it isn't likely a mechanical failure but *maybe* some sort of
firmware, software, or other type of failure?  Interesting.  The reason I
was asking is because it seems the OP is using the drive, even booting
from it I think, which makes me think it is still able to access the
data but yet the SMART test can't get to the same area.  It was a bit
confusing since it wasn't "logical".  One type of access is working
while another isn't.  Odd.  I realize that short of some techy person
taking the drive apart, we won't likely really know why it failed the
test but just curious as to what options were there as to the failure.

Well, you run into something interesting every day.  I hope the OP can get
his data off there before this gets worse.  I guess if nothing else,
SMART showed that something isn't right, is likely failing and needs
attention.  If SMART is correct.  ;-)

Dale

:-)  :-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-07-29 10:28 UTC | newest]

Thread overview: 18+ messages
2014-07-27 10:12 [gentoo-user] Contradictionary behaviour of SMART on hds ?!? meino.cramer
2014-07-27 10:23 ` Dale
2014-07-27 10:27 ` Neil Bothwick
2014-07-27 10:41   ` meino.cramer
2014-07-27 11:11     ` Dale
2014-07-27 11:29       ` meino.cramer
2014-07-27 12:33         ` Dale
2014-07-27 13:19           ` meino.cramer
2014-07-27 14:05             ` Dale
2014-07-27 14:34               ` Mick
2014-07-27 16:50                 ` meino.cramer
2014-07-28  8:22                   ` Helmut Jarausch
2014-07-29 10:27                   ` Mick
2014-07-28  4:44         ` Jc García
2014-07-27 18:57     ` Neil Bothwick
2014-07-27 19:20       ` Dale
2014-07-27 21:21         ` Neil Bothwick
2014-07-28  3:11           ` Dale
