public inbox for gentoo-user@lists.gentoo.org
* [gentoo-user] md5sum for directories?
@ 2008-02-24 11:06 Stroller
  2008-02-24 11:46 ` Etaoin Shrdlu
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Stroller @ 2008-02-24 11:06 UTC
  To: gentoo-user

Hi there,

I'm in the habit of backing up customer data by booting from knoppix,  
connecting a portable hard-drive and copying with `cp -rvf`.

When this has finished I connect the portable hard-drive to my  
desktop machine, copy the directory of data from it to my homedir,  
and make a zip file of the directory.

I've done this loads in the past, and never been aware of any file  
corruption, but I guess I'm just paranoid today. Perhaps I shouldn't  
use the -v flags during my copy - it's reassuring to see the files  
being copied, but what if I overlooked a bunch of errors in the  
middle of all those thousands of "copied successfully" confirmations?  
What if something has gone wrong during one of the two copies?

So my question is:

Is there any way to check the integrity of copied directories, to be  
sure that none of the files or sub-directories in them have become  
damaged during transfer? I'm thinking of something like md5sum for  
directories.


It occurred to me that one could run `find . -type f -exec md5sum \{}  
\; > file.txt` on both machines and diff the outputs, but some of  
these directories contain many thousands of files, and I'd imagine  
that md5summing all of these could take some time.
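
A minimal sketch of that idea, assuming GNU find, md5sum and mktemp, and assuming both trees are reachable from one machine (otherwise run `manifest` on each side and diff the two saved files). The function names `manifest` and `compare_trees` are made up for illustration. The sort step matters because find returns files in no guaranteed order, so unsorted listings of identical trees can still diff as different:

```shell
#!/bin/sh
# manifest DIR: print one "checksum  ./relative/path" line per regular file,
# sorted by path so that two manifests can be compared line by line.
manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# compare_trees SRC DEST: diff the two manifests.
# Returns 0 if they match, 1 if they differ, 2 if a temp file cannot be made.
compare_trees() {
    src_list=$(mktemp) && dst_list=$(mktemp) || return 2
    manifest "$1" > "$src_list"
    manifest "$2" > "$dst_list"
    diff "$src_list" "$dst_list"
    status=$?
    rm -f "$src_list" "$dst_list"
    return "$status"
}
```

Every byte still has to be read once on each side, so the run time is roughly that of reading both copies in full.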

Does anyone have any suggestions, please?

Stroller.
  
   
-- 
gentoo-user@lists.gentoo.org mailing list




* Re: [gentoo-user] md5sum for directories?
  2008-02-24 11:06 [gentoo-user] md5sum for directories? Stroller
@ 2008-02-24 11:46 ` Etaoin Shrdlu
  2008-02-27  0:38   ` Stroller
  2008-02-24 14:29 ` Neil Bothwick
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Etaoin Shrdlu @ 2008-02-24 11:46 UTC
  To: gentoo-user

On Sunday 24 February 2008, Stroller wrote:

> I've done this loads in the past, and never been aware of any file
> corruption, but I guess I'm just paranoid today. Perhaps I shouldn't
> use the -v flags during my copy - it's reassuring to see the files
> being copied, but what if I overlooked a bunch of errors in the
> middle of all those thousands of "copied successfully" confirmations?
> What if something has gone wrong during one of the two copies?

Well, in that case cp will have a nonzero exit status. Look:

$ ls -l
total 12
-rw-r--r-- 1 kermit users    4 2008-02-24 12:30 a
-rw-r--r-- 1 kermit users   12 2008-02-24 12:30 b
drwxr-xr-x 2 kermit users 4096 2008-02-24 12:30 destdir
$ ls -l destdir
total 0
$ chmod 000 b
$ ls -l
total 12
-rw-r--r-- 1 kermit users    4 2008-02-24 12:30 a
---------- 1 kermit users   12 2008-02-24 12:30 b
drwxr-xr-x 2 kermit users 4096 2008-02-24 12:30 destdir
$ cp a b destdir
cp: cannot open `b' for reading: Permission denied
$ echo $?
1
$ ls -l destdir
total 4
-rw-r--r-- 1 kermit users 4 2008-02-24 12:31 a

I think this should hold for the majority of cases/errors cp might 
encounter during the copy.
Of course, this does not detect a successful, but somehow corrupted, copy 
(which should be exceptionally rare, anyway).
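
The exit-status check shown in the session above could be scripted along these lines. This is only a sketch; the function name `copy_checked` is hypothetical:

```shell
#!/bin/sh
# copy_checked SRC DEST: recursive copy that reports failure loudly
# instead of relying on someone spotting an error in the -v output.
copy_checked() {
    if cp -r "$1" "$2"; then
        echo "copy finished without errors"
    else
        status=$?    # in the else branch, $? is still cp's exit status
        echo "cp reported errors (exit status $status)" >&2
        return "$status"
    fi
}
```

Something like `copy_checked /mnt/customer /mnt/portable/customer || echo "check the copy"` would then flag a failed run even if the error message itself scrolled past unnoticed (both paths are placeholders).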

> So my question is:
>
> Is there any way to check the integrity of copied directories, to be
> sure that none of the files or sub-directories in them have become
> damaged during transfer? I'm thinking of something like md5sum for
> directories.

I'm not aware of any such tool (which might exist nonetheless, of 
course). However, on the filesystem, the objects that we 
call "directories" are just index files holding filenames and pointers 
to inodes. Running a checksum on the directories themselves would not 
guarantee against corruption of any of the contained files, since file 
data is not contained in the directory. Thus, to be accurate, such a 
tool would have to scan the directory, find each file, and perform a 
checksum on it, which would result in something not much different from 
the find command you suggested, in terms of resource usage.

* Re: [gentoo-user] md5sum for directories?
  2008-02-24 11:06 [gentoo-user] md5sum for directories? Stroller
  2008-02-24 11:46 ` Etaoin Shrdlu
@ 2008-02-24 14:29 ` Neil Bothwick
  2008-02-24 16:39   ` cabbage
  2008-02-24 19:46 ` Christopher Copeland
  2008-02-24 21:15 ` [gentoo-user] " »Q«
  3 siblings, 1 reply; 13+ messages in thread
From: Neil Bothwick @ 2008-02-24 14:29 UTC
  To: gentoo-user

On Sun, 24 Feb 2008 11:06:10 +0000, Stroller wrote:

> Is there any way to check the integrity of copied directories, to be  
> sure that none of the files or sub-directories in them have become  
> damaged during transfer? I'm thinking of something like md5sum for  
> directories.

Diff?

diff -r /source /dest
will return no output if the two copies are identical.
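
As a sketch of how this could be used in a script, assuming a diff that supports -q (GNU and BSD diff do); the function name `trees_identical` is made up for illustration:

```shell
#!/bin/sh
# trees_identical SRC DEST: quiet recursive comparison.
# -r recurses into subdirectories; -q only names the files that differ
# rather than printing the differences themselves.
# diff exits 0 when nothing differs, 1 when something differs,
# and 2 on trouble (for example an unreadable file).
trees_identical() {
    diff -rq "$1" "$2"
}
```

For example, `trees_identical /source /dest && echo "copies match"` prints the message only when diff found no differences.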


-- 
Neil Bothwick

Never underestimate the bandwidth of a station wagon full of tapes!


* Re: [gentoo-user] md5sum for directories?
  2008-02-24 14:29 ` Neil Bothwick
@ 2008-02-24 16:39   ` cabbage
  2008-02-24 16:46     ` Dirk Heinrichs
  2008-02-24 16:49     ` Andrew Gaydenko
  0 siblings, 2 replies; 13+ messages in thread
From: cabbage @ 2008-02-24 16:39 UTC
  To: gentoo-user

Can diff be used for binary files?

On Sun, Feb 24, 2008 at 10:29 PM, Neil Bothwick <neil@digimed.co.uk> wrote:

> On Sun, 24 Feb 2008 11:06:10 +0000, Stroller wrote:
>
> > Is there any way to check the integrity of copied directories, to be
> > sure that none of the files or sub-directories in them have become
> > damaged during transfer? I'm thinking of something like md5sum for
> > directories.
>
> Diff?
>
> diff -r /source /dest
> will return no output if the two copies are identical.
>
>
> --
> Neil Bothwick
>
> Never underestimate the bandwidth of a station wagon full of tapes!
>


* Re: [gentoo-user] md5sum for directories?
  2008-02-24 16:39   ` cabbage
@ 2008-02-24 16:46     ` Dirk Heinrichs
  2008-02-24 16:49     ` Andrew Gaydenko
  1 sibling, 0 replies; 13+ messages in thread
From: Dirk Heinrichs @ 2008-02-24 16:46 UTC
  To: gentoo-user

Am Sonntag, 24. Februar 2008 schrieb cabbage:

> Can diff be used for binary files?

If you just want to know "different or not", sure.

Bye...

	Dirk


* Re: [gentoo-user] md5sum for directories?
  2008-02-24 16:39   ` cabbage
  2008-02-24 16:46     ` Dirk Heinrichs
@ 2008-02-24 16:49     ` Andrew Gaydenko
  1 sibling, 0 replies; 13+ messages in thread
From: Andrew Gaydenko @ 2008-02-24 16:49 UTC
  To: gentoo-user

Hi!
======= On Sunday 24 February 2008, you wrote: =======
...
> > On Sun, 24 Feb 2008 11:06:10 +0000, Stroller wrote:
> > > Is there any way to check the integrity of copied directories, to
> > > be sure that none of the files or sub-directories in them have
> > > become damaged during transfer? I'm thinking of something like
> > > md5sum for directories.

I use this script to check that DVD data were written correctly:

nice -n 15 find "$1"/* -type f -print0 | sort -z | xargs -0 cat | md5sum -b

Don't ask me how it works - I forgot :-) But it works.
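
Read stage by stage, the pipeline above produces a single MD5 sum over the concatenated contents of every file in the tree. A commented sketch of the same technique follows, assuming GNU find, sort and xargs; the function name `tree_md5` is made up, and running find from inside the directory is a small change so that both trees yield the same relative paths:

```shell
#!/bin/sh
# tree_md5 DIR: one MD5 sum for the concatenated contents of all files.
tree_md5() {
    ( cd "$1" &&
        find . -type f -print0 |   # list regular files, NUL-terminated
        LC_ALL=C sort -z |         # fix the order so both runs agree
        xargs -0 cat |             # stream every file's bytes in that order
        md5sum -b )                # hash the combined stream
}
```

Running `tree_md5` on the source and on the copy and comparing the two sums shows whether the contents match. Only file contents are hashed, not names, so a renamed file that keeps its place in the sort order would go unnoticed, and a mismatch does not say which file differs.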


* Re: [gentoo-user] md5sum for directories?
  2008-02-24 11:06 [gentoo-user] md5sum for directories? Stroller
  2008-02-24 11:46 ` Etaoin Shrdlu
  2008-02-24 14:29 ` Neil Bothwick
@ 2008-02-24 19:46 ` Christopher Copeland
  2008-02-27  0:51   ` Stroller
  2008-02-24 21:15 ` [gentoo-user] " »Q«
  3 siblings, 1 reply; 13+ messages in thread
From: Christopher Copeland @ 2008-02-24 19:46 UTC
  To: gentoo-user


On 24 Feb 2008, at 06:06, Stroller wrote:

> So my question is:
>
> Is there any way to check the integrity of copied directories, to be  
> sure that none of the files or sub-directories in them have become  
> damaged during transfer? I'm thinking of something like md5sum for  
> directories.

I use rsync for this and would suggest you look into it. You can tell  
it to compare files based on checksum (which is slower) and the real  
beauty is that if there is a file that is corrupt or otherwise not the  
same as the source it will copy just that single file to your backup  
disk. Test it by deleting a random file somewhere in the backup tree..  
rerun your rsync command and the file is copied back.
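
A sketch of a check-only run, which reports files that differ without copying anything. The function name `verify_with_rsync` is hypothetical; the flags are standard rsync options:

```shell
#!/bin/sh
# verify_with_rsync SRC DEST: list files whose contents differ, changing nothing.
#   -r  recurse into directories
#   -c  compare by checksum instead of size and modification time
#   -n  dry run: report only, copy nothing
#   -i  itemize: print one line per file that would be transferred
verify_with_rsync() {
    rsync -rcni "$1"/ "$2"/
}
```

Dropping the -n turns the same command into the repair step described above: only the files that are missing or differ get copied again.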

man rsync
--
Christopher

* [gentoo-user]  Re: md5sum for directories?
  2008-02-24 11:06 [gentoo-user] md5sum for directories? Stroller
                   ` (2 preceding siblings ...)
  2008-02-24 19:46 ` Christopher Copeland
@ 2008-02-24 21:15 ` »Q«
  2008-02-26 19:59   ` Mick
  3 siblings, 1 reply; 13+ messages in thread
From: »Q« @ 2008-02-24 21:15 UTC
  To: gentoo-user

Stroller <stroller@stellar.eclipse.co.uk> wrote:

> I'm thinking of something like md5sum for directories.

I think you may have gotten better solutions for your situation, but
md5deep (in portage) is like md5sum but with directory recursion.


* Re: [gentoo-user]  Re: md5sum for directories?
  2008-02-24 21:15 ` [gentoo-user] " »Q«
@ 2008-02-26 19:59   ` Mick
  0 siblings, 0 replies; 13+ messages in thread
From: Mick @ 2008-02-26 19:59 UTC
  To: gentoo-user

On Sunday 24 February 2008, »Q« wrote:
> Stroller <stroller@stellar.eclipse.co.uk> wrote:
> > I'm thinking of something like md5sum for directories.
>
> I think you may have gotten better solutions for your situation, but
> md5deep (in portage) is like md5sum but with directory recursion.

I'm probably not suggesting anything you don't already know, but just in case:

Notwithstanding that rsync is a superior tool just made for the job, I more 
often use tar instead of either rsync or cp.  This is because when I back up 
a complete fs I use whichever LiveCD I have at hand (usually Knoppix) which 
doesn't always have rsync on it.  Anyway, the tar command has the option -d 
which diffs the contents of the archive and the original fs, if you want to 
see what happened after the archive was written, or want to decide if it is 
time/worth making a fresher back up.  Alternatively and more appropriately 
if you run this as part of a back up process, there is the -W option.  From 
the man page:

-W, --verify
              attempt to verify the archive after writing it
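
A sketch of the -d approach, assuming GNU tar and an uncompressed archive stored outside the directory being backed up; the function name `archive_and_compare` is made up for illustration:

```shell
#!/bin/sh
# archive_and_compare SRC ARCHIVE: write an uncompressed archive of SRC,
# then compare the archive against the files still on disk.
archive_and_compare() {
    tar -cf "$2" -C "$1" . &&   # create the archive from inside SRC
        tar -df "$2" -C "$1"    # -d (--diff): report any member that differs
}
```

The second tar prints nothing and exits with status 0 when every archived file still matches the filesystem. Running `tar -df` again later against the same directory shows what has changed since the archive was written.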

HTH.
-- 
Regards,
Mick


* Re: [gentoo-user] md5sum for directories?
  2008-02-24 11:46 ` Etaoin Shrdlu
@ 2008-02-27  0:38   ` Stroller
  2008-02-27  9:40     ` Etaoin Shrdlu
  0 siblings, 1 reply; 13+ messages in thread
From: Stroller @ 2008-02-27  0:38 UTC
  To: gentoo-user


On 24 Feb 2008, at 11:46, Etaoin Shrdlu wrote:

> On Sunday 24 February 2008, Stroller wrote:
>
>> I've done this loads in the past, and never been aware of any file
>> corruption, but I guess I'm just paranoid today. Perhaps I shouldn't
>> use the -v flags during my copy - it's reassuring to see the files
>> being copied, but what if I overlooked a bunch of errors in the
>> middle of all those thousands of "copied successfully" confirmations?
>> What if something has gone wrong during one of the two copies?
>
> Well, in that case cp will have a nonzero exit status. Look:
>
> ...
> $ cp a b destdir
> cp: cannot open `b' for reading: Permission denied
> $ echo $?
> 1
> ...
> I think this should hold for the majority of cases/errors cp might
> encounter during the copy.

Good point. I should have checked this when I first made the copy  
using cp, and will do so in the future.

> Of course, this does not detect a successful, but somehow corrupted,  
> copy
> (which should be exceptionally rare, anyway).

Well perhaps I'm just being paranoid today.
But how do I know that a successful, but somehow corrupted, copy has  
not occurred?

What makes you confident that these are rare? I don't ask this to be  
antagonistic, just to increase my own confidence in the `cp` command.

>> Is there any way to check the integrity of copied directories, to be
>> sure that none of the files or sub-directories in them have become
>> damaged during transfer? I'm thinking of something like md5sum for
>> directories.
>
> I'm not aware of any such tool (which might exist nonetheless, of
> course). However, on the filesystem, the objects that we
> call "directories" are just index files holding filenames and pointers
> to inodes. Running a checksum on the directories themselves would not
> guarantee against corruption of any of the contained files, since file
> data is not contained in the directory.

Naturally.

Perhaps I should have phrased my question differently: "Is there any  
way to recursively check the integrity of copied directories of  
files?" However, I had hoped the words "to be sure that none of the  
files or sub-directories in them have become damaged during  
transfer" would convey that.

> Thus, to be accurate, such a
> tool would have to scan the directory, find each file, and perform a
> checksum on it, which would result in something not much different  
> from
> the find command you suggested, in terms of resource usage.


I have to admit that I haven't run this command and I don't have any  
idea what its actual resource usage would be. I guess I'd be happy  
with a lower-grade of checksumming, if it would reduce the runtime to  
acceptable levels. With md5sum one can be - barring certain malicious  
external attacks - quite certain that a copied file is identical to  
the original. I would be happy with a "the file's there and it looks  
ok" level of confidence.

Stroller.


* Re: [gentoo-user] md5sum for directories?
  2008-02-24 19:46 ` Christopher Copeland
@ 2008-02-27  0:51   ` Stroller
  2008-02-27  2:36     ` Christopher Copeland
  0 siblings, 1 reply; 13+ messages in thread
From: Stroller @ 2008-02-27  0:51 UTC
  To: gentoo-user


On 24 Feb 2008, at 19:46, Christopher Copeland wrote:
> On 24 Feb 2008, at 06:06, Stroller wrote:
>
>> So my question is:
>>
>> Is there any way to check the integrity of copied directories, to  
>> be sure that none of the files or sub-directories in them have  
>> become damaged during transfer? I'm thinking of something like  
>> md5sum for directories.
>
> I use rsync for this and would suggest you look into it. You can  
> tell it to compare files based on checksum (which is slower) and  
> the real beauty is that if there is a file that is corrupt or  
> otherwise not the same as the source it will copy just that single  
> file to your backup disk. Test it by deleting a random file  
> somewhere in the backup tree.. rerun your rsync command and the  
> file is copied back.
>
> man rsync

Thanks. I think this has been suggested before for my backups - IIRC  
it has a useful --ignore-path or --exclude-path option which can  
ensure you get all the users' Documents & Settings, without the useless  
temp & "Temporary Internet Files".

I've just tried `rsync -vrchi` on a pair of subdirectories ("My  
Documents") of the backup I made last week and on those it seems to run  
in acceptable time. I got little output, however, so have deleted a  
couple of files from the destination (I should perhaps write some  
random data to another) and am running it again in anticipation of  
some "copying /a/b/c/file /x/y/z/file" output.

I appreciate your help,

Stroller. 

* Re: [gentoo-user] md5sum for directories?
  2008-02-27  0:51   ` Stroller
@ 2008-02-27  2:36     ` Christopher Copeland
  0 siblings, 0 replies; 13+ messages in thread
From: Christopher Copeland @ 2008-02-27  2:36 UTC
  To: gentoo-user


On 26 Feb 2008, at 19:51, Stroller wrote:

> Thanks. I think this has been suggested before for my backups - IIRC  
> it has a useful --ignore-path or --exclude-path option which can  
> ensure you get all the users' Documents & Settings, without the useless  
> temp & "Temporary Internet Files".
>

rsync has excellent control over what is copied via the include and  
exclude options.

> I've just tried `rsync -vrchi` on a pair of subdirectories ("My  
> Documents") of the backup I made last week and on those it seems to run  
> in acceptable time. I got little output, however, so have deleted a  
> couple of files from the destination (I should perhaps write some  
> random data to another) and am running it again in anticipation of  
> some "copying /a/b/c/file /x/y/z/file" output.
>

When I run rsync interactively I usually add --stats and --progress to  
the command. Those will give you more feedback.

> I appreciate your help,

Least I could do, and if I hadn't mentioned it I am sure someone else  
would have. This is a gentoo list ;-)
--
Christopher

* Re: [gentoo-user] md5sum for directories?
  2008-02-27  0:38   ` Stroller
@ 2008-02-27  9:40     ` Etaoin Shrdlu
  0 siblings, 0 replies; 13+ messages in thread
From: Etaoin Shrdlu @ 2008-02-27  9:40 UTC
  To: gentoo-user

On Wednesday 27 February 2008, Stroller wrote:

> > Of course, this does not detect a successful, but somehow corrupted,
> > copy
> > (which should be exceptionally rare, anyway).
>
> Well perhaps I'm just being paranoid today.
> But how do I know that a successful, but somehow corrupted, copy has
> not occurred?
>
> What makes you confident that these are rare? I don't ask this to be
> antagonistic, just to increase my own confidence in the `cp` command.

Ah well, I have no statistics here. But I can say that such a thing has 
never occurred to me in the past (or at least if it occurred, I did not 
notice that). Not a definitive proof, I know; rather, just my 
experience. You are of course free to not trust me and, if you're truly 
paranoid, you probably should do so :-)

> I have to admit that I haven't run this command and I don't have any
> idea what its actual resource usage would be. I guess I'd be happy
> with a lower-grade of checksumming, if it would reduce the runtime to
> acceptable levels. With md5sum one can be - barring certain malicious
> external attacks - quite certain that a copied file is identical to
> the original. I would be happy with a "the file's there and it looks
> ok" level of confidence.

Well, md5deep has already been suggested. If you are content with a 
lower-grade checksumming, you could write your own script that compares 
file lengths and calculates checksums only on the first n and last m 
bytes of each file, for some reasonable values of n and m (bigger is 
better, as you guess). This is what backuppc (an excellent backup 
software) does when it has to decide whether a file has changed (and 
thus has to be backed up) compared with the copy stored in the backup 
pool.
Read this for more info:

http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues

"The hashing function" paragraph. Do note that (of course) that method is 
not 100% accurate and might report false negatives if the corruption is 
in the middle of the file and file length did not change.
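
A sketch of such a script, assuming head and tail accept -c (GNU, BSD and busybox versions do). The function name `quick_fingerprint` and the 4096-byte defaults are arbitrary choices for illustration; this shows the general idea only and is not BackupPC's actual hashing function:

```shell
#!/bin/sh
# quick_fingerprint FILE [N] [M]: print the file's length plus an MD5 sum
# of its first N and last M bytes (both default to 4096).
quick_fingerprint() {
    file=$1 n=${2:-4096} m=${3:-4096}
    size=$(wc -c < "$file") || return 1
    sum=$( { head -c "$n" "$file"; tail -c "$m" "$file"; } | md5sum | cut -d ' ' -f 1 )
    echo "$((size)) $sum"    # $((...)) strips any padding wc may add
}
```

Two copies whose fingerprints differ are certainly different. Matching fingerprints only show that the lengths and both ends agree, which is the "the file's there and it looks ok" level of confidence.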
