Subject: Re: [gentoo-user] OT: Fighting bit rot
From: Pandu Poluan
To: gentoo-user@lists.gentoo.org
Date: Wed, 9 Jan 2013 09:55:55 +0700
In-Reply-To: <50EC6D59.1090809@binarywings.net>

On Jan 9, 2013 2:06 AM, "Florian Philipp" <lists@binarywings.net> wrote:
>
> Am 08.01.2013 18:41, schrieb Pandu Poluan:
> >
> > On Jan 8, 2013 11:20 PM, "Florian Philipp" <lists@binarywings.net> wrote:
> >>
> >
> > -- snip --
> >
> [...]
> >>
> >> When you have completely static content, md5sum, rsync and friends are
> >> sufficient. But if you have content that changes from time to time, the
> >> number of false-positives would be too high. In this case, I think you
> >> could easily distinguish by comparing both file content and time stamps.
> >>
> [...]
> >
> > IMO, we're all barking up the wrong tree here...
> >
> > Before a file's content can change without user involvement, bit rot
> > must first get through the checksum (CRC?) of the hard disk itself.
> > There will be no 'gradual degradation of data', just 'catastrophic data
> > loss'.
> >
>
> Unfortunately, that's only partly true. Latent disk errors are a well
> researched topic [1-3]. CRCs are not perfectly reliable. The trick is to
> detect and correct errors while you still have valid backups or other
> types of redundancy.
>
> The only way to do this is regular scrubbing. That's why professional
> archival solutions offer some kind of self-healing which is usually just
> the same as what I proposed (plus whatever on-access integrity checks
> the platform supports) [4].
>
> > I would rather focus my efforts on ensuring that my backups are always
> > restorable, at least until the most recent time of archival.
> >
>
> That's the point:
> a) You have to detect when you have to restore from backup.
> b) You have to verify that the backup itself is still valid.
> c) You have to avoid situations where undetected errors creep into the
> backup.
>
> I'm not talking about a purely theoretical possibility. I have
> experienced just that: Some data that I have kept lying around for years
> was corrupted.
>
> [1] Schwarz et al.: Disk Scrubbing in Large, Archival Storage Systems
> http://www.cse.scu.edu/~tschwarz/Papers/mascots04.pdf
>
> [2] Baker et al.: A fresh look at the reliability of long-term digital
> storage
> http://arxiv.org/pdf/cs/0508130
>
> [3] Bairavasundaram et al.: An Analysis of Latent Sector Errors in Disk
> Drives
> http://bnrg.eecs.berkeley.edu/~randy/Courses/CS294.F07/11.1.pdf
>
> [4]
> http://uk.emc.com/collateral/analyst-reports/kci-evaluation-of-emc-centera.pdf
>
> Regards,
> Florian Philipp
>

Interesting reads... thanks for the links!

Hmm... if I were in your position, I think this is what I'd do:

1. Make a set of MD5 checksums, one per file, for ease of update.
2. Before opening a file, compare its checksum against the actual content. On mismatch, notify.
3. When the file handle is closed, recalculate the checksum.
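A minimal sketch of those three steps with plain coreutils (hedged: the file name `file.dat` and the one-checksum-file-per-data-file layout are just made-up examples for illustration):

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
echo "important data" > file.dat

# 1. One MD5 checksum file per data file, so an update touches only one entry.
md5sum file.dat > file.dat.md5

# 2. Before opening the file, verify content against the stored checksum;
#    notify on mismatch (--quiet prints only failures).
md5sum -c --quiet file.dat.md5 || echo "WARNING: file.dat does not match its checksum"

# 3. After the file handle is closed (here: after an append), recalculate.
echo "more data" >> file.dat
md5sum file.dat > file.dat.md5
md5sum -c --quiet file.dat.md5
```

The per-file layout is the point of step 1: when one file legitimately changes, you regenerate one small checksum file instead of re-hashing the whole tree.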

Periodically protect the set of MD5 checksums using par2.

Also protect your backups using par2, for that matter (that's what I always do when I archive something to optical media).
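The par2 cycle on an archive looks roughly like this (hedged sketch: `archive.tar` is a made-up name, the 10% redundancy level is an arbitrary example, and exact flags may vary between par2cmdline versions):

```shell
#!/bin/sh
set -e
# Skip gracefully on machines without par2cmdline installed.
command -v par2 >/dev/null 2>&1 || { echo "par2 not installed"; exit 0; }
cd "$(mktemp -d)"
echo "example archive contents" > archive.tar   # stand-in for a real backup

# Create recovery blocks with ~10% redundancy next to the archive.
par2 create -r10 archive.tar.par2 archive.tar

# Later, as part of a scrub: check whether the archive has rotted.
par2 verify archive.tar.par2

# Only if verify reports damage: reconstruct from the recovery blocks.
# par2 repair archive.tar.par2
```

The same `create`/`verify` pair works on the checksum set itself, which is what makes the scheme self-checking end to end.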

Of course, you could use par2 outright to protect and error-correct the data itself, but the time needed to regenerate the .par2 files *every time* a file changes would be too much, methinks...

Rgds,
--
