Re: [gentoo-amd64] Re: RAID1 boot - no bootable media found

From: Mark Knecht <markknecht@gmail.com>
To: gentoo-amd64@lists.gentoo.org
Subject: Re: [gentoo-amd64] Re: RAID1 boot - no bootable media found
Date: Thu, 1 Apr 2010 11:57:47 -0700
Message-ID: <o2q5bdc1c8b1004011157if9fb419ey3a777f4fd3743c46@mail.gmail.com>
In-Reply-To: <pan.2010.03.31.06.56.05@cox.net>

A bit long in response. Sorry.

On Tue, Mar 30, 2010 at 11:56 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Mark Knecht posted on Tue, 30 Mar 2010 13:26:59 -0700 as excerpted:
>
>> I've set up a duplicate boot partition on sdb and it boots. However one
>> thing I'm unsure of: when I change the hard drive boot, does the old sdb
>> become the new sda because it's what got booted? Or is the order still
>> as it was? The answer determines what I do in grub.conf as to which
>> drive I'm trying to use. I can figure this out later by putting
>> something different on each drive and looking. Might be system/BIOS
>> dependent.
>
> That depends on your BIOS.  My current system (the workstation, now 6+
> years old but still going strong as it was a $400+ server grade mobo) will
> boot from whatever disk I tell it to, but keeps the same BIOS disk order
> regardless -- unless I physically turn one or more of them off, of
> course.  My previous system would always switch the chosen boot drive to
> be the first one.  (I suppose it could be IDE vs. SATA as well, as the
> switcher was IDE, the stable one is SATA-1.)
>
> So that's something I guess you figure out for yourself.  But it sounds
> like you're already well on your way...
>

The mapping seems to be constant, which means (I guess) that I need to
change the drive specs in grub.conf on the second drive so that it
actually uses the second drive.

I made the titles for booting different for each grub.conf file to
ensure I was really getting grub from the second drive. My sda grub
boot menu says "2.6.33-gentoo booting from sda" on the first drive,
sdb on the second drive, etc.

<SNIP>
>
> The point being... it /is/ actually possible to verify that they're
> working well before you fdisk/mkfs and load data.  Tho it does take
> awhile... days... on drives of modern size.
>

I'm trying badblocks right now on sdc, using

badblocks -v /dev/sdc

Maybe I need to do something more strenuous? It looks like it will be
done in an hour or two. (It's an i7-920 with SATA drives so it should
be fast, as long as I'm not just reading the buffers or something like
that.)

Roughly speaking, 1TB read at 100MB/s should take 10,000 seconds, or
about 2.8 hours. I'm at 18% after 28 minutes so that seems about right.
(With no errors so far, assuming I'm using the right command.)
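
A quick sanity check of that estimate, as a rough sketch. It assumes
1TB means 10**12 bytes and that the drive sustains 100MB/s across the
whole platter, which is optimistic for the inner tracks:

```python
# Rough sanity check of the read-only badblocks scan time.
# Assumptions: 1 TB = 10**12 bytes, sustained 100 MB/s = 10**8 bytes/s.
capacity_bytes = 10**12
throughput_bytes_per_s = 100 * 10**6

est_seconds = capacity_bytes / throughput_bytes_per_s
est_hours = est_seconds / 3600

# Independent projection from observed progress: 18% done after 28 minutes.
projected_minutes = 28 / 0.18
projected_hours = projected_minutes / 60

print(f"throughput estimate: {est_seconds:.0f} s = {est_hours:.2f} h")
print(f"progress projection: {projected_minutes:.0f} min = {projected_hours:.2f} h")
```

The two figures land within about ten minutes of each other, which
suggests the scan really is reading the disk and not a cache.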

>>> 3) suspend the disks after a period of inactivity
>>
>> This could be part of what's going on, but I don't think it's the whole
>> story. My drives (WD Green 1TB drives) apparently park the heads after 8
>> seconds (yes 8 seconds!) of inactivity to save power. Each time it parks
>> it increments the Load_Cycle_Count SMART parameter. I've been tracking
>> this on the three drives in the system. The one I'm currently using is
>> incrementing while the 2 that sit unused until I get RAID going again
>> are not. Possibly there is something about how these drives come out of
>> park that creates large delays once in awhile.
>
> You may wish to take a second look at that, for an entirely /different/
> reason.  If those are the ones I just googled on the WD site, they're
> rated 300K load/unload cycles.  Take a look at your BIOS spin-down
> settings, and use hdparm to get a look at the disk's powersaving and
> spindown settings.  You may wish to set the disks to something more
> reasonable, as with 8 second timeouts, that 300k cycles isn't going to
> last so long...

Very true. Here is the same drive model I put in a new machine for my
dad. It's been powered up and running Gentoo as a typical desktop
machine for about 50 days. He doesn't use it more than about an hour a
day on average. It's already hit 31K load/unload cycles. At about 10%
of 300K in 50 days, that's under 1.5 years of total life before it hits
that spec. I've watched his system a bit and it seems to add 1 to the
count almost exactly every 2 minutes on average. Is that a common cron
job maybe?
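
The projection can be sketched like this. It uses only the rounded
figures from the paragraph above (31K cycles, 50 days, 300K rating) and
assumes the load/unload rate stays constant, which is a guess:

```python
# Linear projection of when Load_Cycle_Count reaches the rated limit.
# Assumption: the rate observed so far stays constant.
rated_cycles = 300_000     # WD spec quoted in the thread
observed_cycles = 31_000   # "31K" so far
observed_days = 50         # "about 50 days" powered up

cycles_per_day = observed_cycles / observed_days
days_to_rating = rated_cycles / cycles_per_day
days_remaining = days_to_rating - observed_days

print(f"{cycles_per_day:.0f} cycles/day")
print(f"rating reached after {days_to_rating:.0f} days total "
      f"({days_to_rating / 365:.1f} years), {days_remaining:.0f} days from now")
```

The rated count is not a hard failure point, so this only says when the
drive leaves its specified range, not when it dies.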

I looked up the spec on all three WD lines - Green, Blue and Black.
All three were 300K cycles. This issue has come up on the RAID list.
It seems that some other people are seeing this and aren't exactly
sure what Linux is doing to cause this.

I'll study hdparm and BIOS when I can reboot.

My dad's current data:

gandalf ~ # smartctl -A /dev/sda
smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   129   128   021    Pre-fail  Always       -       6525
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1183
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   190   190   000    Old_age   Always       -       31240
194 Temperature_Celsius     0x0022   121   116   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

gandalf ~ #
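
To check the "every 2 minutes" impression against the listing above,
here is a small sketch that pulls the raw values out of two attribute
rows and divides. The rows are pasted in by hand with the mail wrapping
undone; the helper name raw_values is made up for this example, and it
assumes the raw value is a plain integer in the tenth column:

```python
# Two rows copied from the smartctl -A listing (wrapped lines rejoined).
SMART_ROWS = """\
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1183
193 Load_Cycle_Count        0x0032   190   190   000    Old_age   Always       -       31240
"""

def raw_values(text):
    """Map ATTRIBUTE_NAME -> integer RAW_VALUE for each attribute row."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = int(fields[9])
    return values

attrs = raw_values(SMART_ROWS)
cycles = attrs["Load_Cycle_Count"]
hours = attrs["Power_On_Hours"]
minutes_per_cycle = hours * 60 / cycles

print(f"{cycles} load cycles in {hours} h: one every {minutes_per_cycle:.2f} min")
```

Averaged over all power-on hours that comes out a little over 2 minutes
per cycle, so the head parking seems to happen around the clock, not
just while the machine is in use.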


>
> You may recall a couple years ago when Ubuntu accidentally shipped with
> laptop mode (or something, IDR the details) turned on by default, and
> people were watching their drives wear out before their eyes.  That's
> effectively what you're doing, with an 8-second idle timeout.  Most laptop
> drives (2.5" and 1.8") are designed for it.  Most 3.5" desktop/server
> drives are NOT designed for that tight an idle timeout spec, and in fact,
> may well last longer spinning at idle overnight, as opposed to shutting
> down every day even.
>
> I'd at least look into it, as there's no use wearing the things out
> unnecessarily.  Maybe you'll decide to let them run that way and save the
> power, but you'll know about the available choices then, at least.
>

Yeah, that's important. Thanks. If I can solve all these RAID problems
then maybe I'll look at adding RAID to his box with better drives or
something.

Note that only on my system am I seeing real problems in
/var/log/messages, non-RAID, like 1000's of these:

Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 45276264 on sda3
Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 46309336 on sda3
Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 46567488 on sda3
Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 46567680 on sda3

or

Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555752 on sda3
Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555760 on sda3
Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555768 on sda3
Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555776 on sda3
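
With thousands of these lines, a tally per process and direction is
easier to read than the raw log. A minimal sketch, run here on just the
eight sample lines above; the regular expression is my guess at the
line layout based on those samples only:

```python
import re
from collections import Counter

# The eight sample lines quoted above.
LOG_LINES = """\
Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 45276264 on sda3
Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 46309336 on sda3
Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 46567488 on sda3
Mar 29 14:06:33 keeper kernel: rsync(3368): READ block 46567680 on sda3
Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555752 on sda3
Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555760 on sda3
Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555768 on sda3
Mar 29 14:07:36 keeper kernel: flush-8:0(3365): WRITE block 33555776 on sda3
""".splitlines()

# process(pid): READ|WRITE block N on device
PATTERN = re.compile(
    r"kernel: (?P<proc>[^\s(]+)\((?P<pid>\d+)\): "
    r"(?P<op>READ|WRITE) block (?P<block>\d+) on (?P<dev>\S+)"
)

tally = Counter()
for line in LOG_LINES:
    m = PATTERN.search(line)
    if m:
        tally[(m["proc"], m["op"], m["dev"])] += 1

for (proc, op, dev), count in tally.most_common():
    print(f"{count:6d}  {op:5s} {dev}  {proc}")
```

To run it on the real log, LOG_LINES would be replaced by the lines
read from the file.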


However I see NONE of that on my dad's machine using the same drive
but different chipset.

The above problems seem to result in this sort of problem when I try
going with RAID, as I tried again this morning:

INFO: task kjournald:5064 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald     D ffff880028351580     0  5064      2 0x00000000
 ffff8801ac91a190 0000000000000046 0000000000000000 ffffffff81067110
 000000000000dcf8 ffff880180863fd8 0000000000011580 0000000000011580
 ffff88014165ba20 ffff8801ac89a834 ffff8801af920150 ffff8801ac91a418
Call Trace:
 [<ffffffff81067110>] ? __alloc_pages_nodemask+0xfa/0x58c
 [<ffffffff8129174a>] ? md_make_request+0xde/0x119
 [<ffffffff810a9576>] ? sync_buffer+0x0/0x40
 [<ffffffff81334305>] ? io_schedule+0x3e/0x54
 [<ffffffff810a95b1>] ? sync_buffer+0x3b/0x40
 [<ffffffff81334789>] ? __wait_on_bit+0x41/0x70
 [<ffffffff810a9576>] ? sync_buffer+0x0/0x40
 [<ffffffff81334823>] ? out_of_line_wait_on_bit+0x6b/0x77
 [<ffffffff81040a66>] ? wake_bit_function+0x0/0x23
 [<ffffffff8111f400>] ? journal_commit_transaction+0xb56/0x1112
 [<ffffffff81334280>] ? schedule+0x8f4/0x93b
 [<ffffffff81335e3d>] ? _raw_spin_lock_irqsave+0x18/0x34
 [<ffffffff81040a38>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff81335bcc>] ? _raw_spin_unlock_irqrestore+0x12/0x2c
 [<ffffffff8112278c>] ? kjournald+0xe2/0x20a
 [<ffffffff81040a38>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff811226aa>] ? kjournald+0x0/0x20a
 [<ffffffff81040665>] ? kthread+0x79/0x81
 [<ffffffff81002c94>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff810405ec>] ? kthread+0x0/0x81
 [<ffffffff81002c90>] ? kernel_thread_helper+0x0/0x10

Thanks,
Mark
