public inbox for gentoo-user@lists.gentoo.org
 help / color / mirror / Atom feed
* [gentoo-user] File synchronisation utility (searching for/about to program it)
@ 2009-07-22 15:09 Simon
  2009-07-25  2:43 ` Alan E. Davis
  0 siblings, 1 reply; 7+ messages in thread
From: Simon @ 2009-07-22 15:09 UTC (permalink / raw
  To: gentoo-user

Hi there!  I was about to jump into programming my own sync utility
when I thought: maybe I should ask if it already exists first!  Also,
this is not really Gentoo-related: it doesn't deal with the OS or
portage...  But I'm asking the venerable community at large; excuse me
if you find this post inappropriate (but can you suggest a more
appropriate audience?).

There are lots of sync utilities out there, but my search hasn't found
the one utility that has all the features I require.  Most lack some
of these features, and some have undesirable limitations...  I'm
currently using unison for all my sync needs; it's the best I've found
so far, but it is very limited in some respects and it's a bit painful
on my setup.  To be clear, I refuse to even consider network
filesystems, because I need each computer to be fully independent of
the others: I sync my important files so that I have a working backup
on all my PCs (my laptop breaks? fine, I just start my desktop and
continue working transparently, well, with the last synced files).  A
network filesystem could be considered for doing the file transfers,
but I don't think any of them can compete with rsync, so they're out
of the question.

Now, I know some of you will have the reflex to say: try such-and-such
tool, it supports 4 out of your 5 requirements.  Or: try this tool, it
supports them all, but you'll have to bend things a bit to make it
work the way you want...  I'm looking for the perfect solution, and if
it doesn't exist, well, I'm about to code it in C or C++; I have the
design ready, and the concept is very simple yet provides all my
features.  I wish to publish the result as open-source software
(probably under a license like BSD or maybe LGPL, maybe but hopefully
not GPL).  What I'm about to code will be compatible with Linux and
Mac OS X for sure; a port to Windows will only require some dumb
extensions (such as Windows-path to Unix-path conversion, and file
transfer support), and it will have very few dependencies.  My project
intends to use rsync for the transfer, so it will basically extend
rsync with all my required features.  Rsync does the transfer; I can't
compete with how good rsync is at transferring (it works through ssh,
rsh or its own daemon, does differential transfers, transfers
attributes/ownership...), but my project will be better at finding
what needs to be transferred and what needs to be deleted, across as
many computers as you want and in one shot.
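
Just to illustrate the kind of rsync call my tool would wrap (the host
and paths below are only placeholders, and the flags are plain rsync
options, nothing of my own yet):

  # mirror a tree to another host over ssh, preserving hard links,
  # ACLs and extended attributes
  rsync -aHAX --numeric-ids --delete -e ssh \
      /home/me/work/ otherhost:/home/me/work/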

Here are the features that I seek/require (and that I will be
programming if no utility provides them all; the list is actually
longer, but I can live without the items not written here):

  -Little space requirements:  I could use rsync to make an
incremental backup using hardlinks (the approach is sketched just
after this list) and basically just copy whatever is "new" onto each
replica, but this takes way too much space and still doesn't deal with
deletes properly (e.g. a file is on A and B, gets deleted on A, and
gets deleted on B then recreated on B.  In reality we have a new file
on B, but rsync might want to delete this new file on B, thinking it's
the file that got deleted on A.  Unison works admirably here: it finds
that the first file effectively got deleted on both, so there is
nothing to do, and a new file appeared on B which needs to be
transferred to A...  The space unison uses to cache its data is about
100 MB now, and I haven't cleaned it since I started using it; I
believe more than half of it could be removed, but even 100 MB still
represents about 1% of what is synced).

  -Server-less:  I don't want to maintain a server on even a single
computer.  I like unison since it executes its server through ssh only
when used; it's never listening and it's never started at boot time.
This is excellent behavior and simplifies maintenance.

  -Bidirectional pair-wise sync:  Meaning I can start the sync from
host A or from host B; the process should be the same, it should take
the same amount of time, and the result should be the same.  I should
never have to care where the sync is initiated.  (Unison doesn't
support this, but it's OK to sync from both directions; it's just not
optimised.)

  -Star topology:  Or any topology that allows syncing multiple
computers at once...  I'm tired of doing several pairwise syncs: to do
a full sync of my 3 computers (called A, B and C), I first have to
sync A->B and A->C, at which point A contains all the diffs and is
synced, but I have to do it once more, A->B and A->C, to sync the
others (i.e. so that B gets C's modifications).

  -Anarchic mode:  Hehe, whatever you call it; using the same 3 hosts,
I'd like to be able to do a pairwise sync between A->B, A->C and also
B->C, so the sync process is decentralised...  This is possible with
unison, but of course I have to ssh to the remote host that I want to
sync with another remote host.

  -Intelligent conflict resolution:  Let's face it, the sync utility
wasn't gifted with artificial intelligence, so why bother?  It should
depend on the user's intelligence, but it should depend on it
intelligently.  Meaning, it should remember (if the user wants it) the
resolution of a given conflict so as to always resolve it that way.
This could effectively help in having some files mirrored from A->B,
some others mirrored from B->A, some others backed up before being
overwritten, and some always requiring user interaction (like my
current project's files)...  This is a matter of preference, and any
utility that doesn't understand this works against me.  No tool I've
encountered supports this; unison could do some of it, but I would
have to break the syncing process into multiple smaller syncs, and
most tools will just shoot out a list of all conflicts and ask whether
to keep local, keep remote, ignore or cancel, and this for each and
every conflict (the list is long, the cancel option is tempting!).

  -Friendly config/maintenance:  I have the friendly user in mind
(me), meaning the tool should be user-friendly!  User-friendly doesn't
mean a graphical interface with lots of eye candy (that makes people
fat; it's hostile to me, not friendly at all!).  Rather, I like to
have only one config file to edit for all my needs, or a directory
containing one level of files, a few files, each logically separated
(think of /etc/portage) and most of all documented and intuitive.
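
(For the record, the hardlink approach I dismissed in the first point
looks roughly like this; the dates and paths are only placeholders:)

  # snapshot-style backup: unchanged files become hardlinks into the
  # previous snapshot, only changed files take new space, but deletes
  # followed by recreations are not tracked, and the snapshots pile up
  rsync -a --delete --link-dest=/backups/2009-07-21 \
      /data/ /backups/2009-07-22/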

These are the features I need most.  I am tired of working around
limitations or missing features.  I am tired of having to do multiple
syncs to get my whole house up to date.

And finally, thanks to those who were interested enough in my post to
read this far (unless you jumped straight here, but thank you still
for taking the time!).  I'm desperate to create a project that will be
useful to me and hopefully to others too.  I'm a very good
C/C++/PHP/JS programmer, but I could only rarely find work in that
field since I have no diploma (a high-school diploma from 10 years
ago, that's all).  Due to some illness I've lived a terribly unstable
life, and I've had an exploratory tendency in development, meaning
I've started about 10K projects but finished none.  I have published
nothing so far...  In other words, I am nobody, and for companies I am
a risk; even if I ask half the usual salary, it's still a division by
zero: salary divided by zero credibility (i.e. no diploma and no work
experience).  If I can build this project (on my own for the start)
and publish it, I think it would help me a lot professionally.  Also,
once the first version is out, I'll clearly welcome patches from the
community, and having a team working on it will help even more.
Also, very important to note: I am currently unemployed, collecting
unemployment insurance as income, and I still have about 2 months of
free time left to get my professional situation back on track; these
2 months of my expertise are more than enough to get a good, stable
beta version of this project.  But I need to get it started, and I
must be convinced this is the right choice.

Thanks for reading, hopeful to be reading your answers!
  Simon



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [gentoo-user] File synchronisation utility (searching for/about  to program it)
  2009-07-22 15:09 [gentoo-user] File synchronisation utility (searching for/about to program it) Simon
@ 2009-07-25  2:43 ` Alan E. Davis
  2009-07-25 14:56   ` [gentoo-user] " walt
  2009-07-25 17:10   ` [gentoo-user] " Simon
  0 siblings, 2 replies; 7+ messages in thread
From: Alan E. Davis @ 2009-07-25  2:43 UTC (permalink / raw
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 1443 bytes --]

Hello, Simon:

I'm the last person you would want advice from on this question, but
even though I am not a programmer, I have been using git to sync three
different systems.  I am using a flash drive as a cache, so to speak.
I followed some tips from the Emacs org-mode mailing list to get this
going.  It wasn't simple for me to recover when some files got out of
sync on one of the machines, but it was simple enough that even I
could figure it out.  I use a bare repo on the flash drive and push
from each machine to it, a very simple procedure that can be automated
through cron, and pull to each machine from the bare repository as
well.  I am not syncing a programming project, but my various work.
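
Roughly, the recipe looks like this (the mount point and branch name
are just placeholders for whatever you actually use):

  # one-time setup: a bare repository on the flash drive
  git init --bare /mnt/usb/work.git

  # then, on each machine, inside the working copy
  git remote add usb /mnt/usb/work.git
  git push usb master    # publish local commits to the stick (cron-able)
  git pull usb master    # merge whatever the other machines pushed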

Again, I am the least clueful person you will find on this list, but
if you wish me to tell you the steps I followed, that is possible.
One of the mailing-list threads that got me up to speed relatively
quickly is at this link.  (Hope it's OK to link to another mailing
list from this one.)

http://www.mail-archive.com/emacs-orgmode@gnu.org/msg11647.html

I apologize if the existence of a bare repo as an intermediary is a problem.
This can be done on a server as well.

Alan Davis

You can know the name of a bird in all the languages of the world,  but when
you're finished, you'll know absolutely nothing whatever about the bird...
So let's look at the bird and see what it's doing---that's what counts.

   ----Richard Feynman



[-- Attachment #2: Type: text/html, Size: 1821 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [gentoo-user]  Re: File synchronisation utility (searching for/about    to program it)
  2009-07-25  2:43 ` Alan E. Davis
@ 2009-07-25 14:56   ` walt
  2009-07-25 17:16     ` Simon
  2009-07-26  8:15     ` Alan McKinnon
  2009-07-25 17:10   ` [gentoo-user] " Simon
  1 sibling, 2 replies; 7+ messages in thread
From: walt @ 2009-07-25 14:56 UTC (permalink / raw
  To: gentoo-user

On 07/24/2009 07:43 PM, Alan E. Davis wrote:
> ...I am using a flash drive as a cache, so to speak...

I recently learned that flash drives wear out after about
10,000 write operations, which came as an unpleasant surprise.

Just be aware that you are drastically shortening the life of
a flash drive by writing to it frequently. (or so I've read)




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [gentoo-user] File synchronisation utility (searching for/about  to program it)
  2009-07-25  2:43 ` Alan E. Davis
  2009-07-25 14:56   ` [gentoo-user] " walt
@ 2009-07-25 17:10   ` Simon
  2009-08-01 12:50     ` Mike Kazantsev
  1 sibling, 1 reply; 7+ messages in thread
From: Simon @ 2009-07-25 17:10 UTC (permalink / raw
  To: gentoo-user

> I'm the last you would want to give advice about this question, but even
> though I am not a programmer, I have been using git to sync on three
> different systems.  I am using a flash drive as a cache, so to speak.  I
> followed some tips from the Emacs org-mode mailing list to get this going.
> It wasn't simple for me to recover when some files got out of sync on one of
> the machines, but it was simple enough that even I could figure it out.  I
> used a bare repo on the flash drive and push from each machine to this, a
> very simple procedure that can be automated through cron, and pull to each
> machine also from the bare repository.  I am not syncing a programming
> project, but my various work.

Your reply is more than welcome!

I have tried using git in the past and found that it doesn't work in
my 'space constrained' scenario.  The need for a repository is a
problem.  The use of the USB key, however, is nice since it allows git
to work without having each computer maintain its own repository...
But still, I don't currently have a USB key that's large enough to
hold all my data, and even if I could compress it I doubt it would
fit.

Another thing is, I wonder if it retains the attributes of each file
(creation date, modification date, owner/group, permissions)?  This
can be important for some aspects of my synchronisation needs.

Still, git is a very good solution that works incrementally in a
differential manner (it makes patches from previous versions).  But
when I tried it, I found that suiting my needs would require
programming a big wrapper around git to make some quick daily actions
simpler than a chain of git commands.

> Again, I am the least clueful you will find on this list, but if you wish
> for me to tell you the steps I followed, that is possible.  One of the
> mailing list threads that got me up to speed relatively quickly was at this
> link.  (Hope it's ok to link another mailing list from this one.)
>
> http://www.mail-archive.com/emacs-orgmode@gnu.org/msg11647.html

I'll check it out...  Since I have my own solution fully thought out
and designed, I'll be able to compare it and re-evaluate git from a
new angle.  As far as I can tell, there is no rule against links, but
I think there might be one against advertising (i.e. if the link were
to your business product that fulfills my need).

> I apologize if the existence of a bare repo as an intermediary is a problem.
> This can be done on a server as well.

It is...  It makes all my computers dependent on that repo...
Syncing computers at home can be done all right, but it will still
require walking around plugging/unplugging.  It makes this practically
impossible to do over the network (or to sync my host on the internet;
not all my PCs are connected to the internet, so the repo can't just
be on the server, and I would have to maintain several repositories to
work this out...).  It may be possible to adapt it to my scenario, but
I think it will require a lot of design in advance...  But I'll check
it out...  At worst it will convince me I should program my own; a bit
better, it will give me some good ideas or reinforce some of my own;
and at best it will be the thing I've been looking for!

Thanks again!

  Simon



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [gentoo-user] Re: File synchronisation utility (searching  for/about to program it)
  2009-07-25 14:56   ` [gentoo-user] " walt
@ 2009-07-25 17:16     ` Simon
  2009-07-26  8:15     ` Alan McKinnon
  1 sibling, 0 replies; 7+ messages in thread
From: Simon @ 2009-07-25 17:16 UTC (permalink / raw
  To: gentoo-user

>> ...I am using a flash drive as a cache, so to speak...
>
> I recently learned that flash drives wear out after about
> 10,000 write operations, which came as an unpleasant surprise.
>
> Just be aware that you are drastically shortening the life of
> a flash drive by writing to it frequently. (or so I've read)

I think the problem is actually the number of erases (but you erase in
order to write, so)...

This number really is a rough one; some manufacturers achieve a higher
number of writes, and I think this figure represents a worst-case
scenario.

Also, it's actually 10k writes per sector...  git, as far as I
understand, appends new versions to the repository, so it writes to
other sectors and only once.  Only if you remove, prune or clean up
the repository would it erase...  And if you were to format the key
every day and write to it once a day (say one new write per sector, on
every sector, once per day), I've quickly calculated that the key
would last around 27 years: 10,000 erase cycles at 365 per year is
about 27.4 years.  I'll be glad to buy a petabyte USB key for US$10 in
27 years, man! lol

Simon



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [gentoo-user]  Re: File synchronisation utility (searching for/about    to program it)
  2009-07-25 14:56   ` [gentoo-user] " walt
  2009-07-25 17:16     ` Simon
@ 2009-07-26  8:15     ` Alan McKinnon
  1 sibling, 0 replies; 7+ messages in thread
From: Alan McKinnon @ 2009-07-26  8:15 UTC (permalink / raw
  To: gentoo-user

On Saturday 25 July 2009 16:56:57 walt wrote:
> On 07/24/2009 07:43 PM, Alan E. Davis wrote:
> > ...I am using a flash drive as a cache, so to speak...
>
> I recently learned that flash drives wear out after about
> 10,000 write operations, which came as an unpleasant surprise.

That's a gross over-simplification of reality.

Individual elements of a flash drive will eventually wear out - they are not 
infinitely over-writable.

The ultra-super-duper-cheapie crap ones average out at about 10,000 writes per 
cell, meaning that's the point where the manufacturer won't guarantee much. 
You may well get more on such a device in practice.

Decent drives go up to 100,000 writes before cell failures become
statistically significant.

> Just be aware that you are drastically shortening the life of
> a flash drive by writing to it frequently. (or so I've read)

Which is why you should use wear-levelling.


-- 
alan dot mckinnon at gmail dot com



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [gentoo-user] File synchronisation utility (searching for/about to program it)
  2009-07-25 17:10   ` [gentoo-user] " Simon
@ 2009-08-01 12:50     ` Mike Kazantsev
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Kazantsev @ 2009-08-01 12:50 UTC (permalink / raw
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 5412 bytes --]

On Sat, 25 Jul 2009 13:10:41 -0400
Simon <turner25@gmail.com> wrote:

> I have tried using git in the past and found that it doesnt work in my
> 'space constrained' scenario.  The need for a repository is a problem.
>  The use of the usbkey however is nice since it allows git to work
> without having each computer maintain its own repository... but
> still... i dont currently have a usbkey that's large enough to hold
> all my data, even if i could compress it i doubt it would fit.
> 
> Another thing is, i wonder if it retains the attributes of the file
> (creation date, mod date, owner/group, permissions)?  As this can be
> important on some aspects of my synchronisation needs.

Vanilla git doesn't, apart from the executable bit.

Due to the highly modular structure of git, one can easily implement
it as a wrapper or a replacement binary at some level, storing the
metadata in some form (a plain list, a mirror tree, or just alongside
each file) when pushing changes to the repo and applying it on each
pull.
Then there are also git hooks, which in theory should be a better way
than a wrapper, but I found them much harder to use in practice.

> Still, git is a very good solution that works incrementally in a
> differential manner (makes patches from previous versions).  But when
> i tried it, i found to suit my needs it would require the programming
> of a big wrapper that would interface git to make some daily quick
> actions simpler than a few git commands.

That's another advantage of a wrapper, but note that git commands
themselves are quite extensible via aliases, configurable in gitconfig
at any level (repository, home, system-wide).

  [alias]
    ci = commit -a
    co = checkout
    st = status -a
    br = branch
    ru = remote update
    ui = update-index --refresh
    cp = cherry-pick

Still, things such as "git ui && git cp X" are quite common, so a
wrapper, or at least a set of shell aliases, is quite handy.
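
For example (the names are arbitrary, and the last one assumes your
remote is called origin and your branch is master):

  # shell-level shortcuts on top of the gitconfig aliases above
  alias gui='git update-index --refresh'
  alias gcp='git cherry-pick'
  alias gsync='git remote update && git rebase origin/master'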


>> I apologize if the existence of a bare repo as an intermediary is a problem.
>> This can be done on a server as well.  
>
> It is...  it makes all my computer dependant on that repo...  sync'ing
> computers at home can be done alright, but will still require walking
> around pluging/unpluging.  Makes this practically impossible to do
> over the network (or to sync my host on the internet, not all my pc
> are connected to the internet so the repo cant be just on the server,
> i would have to maintain several repositories to work this out...).
> It may be possible to adapt it to my scenario, but i think it will
> require a lot of design in advance...  but i'll check it out...  at
> worst it will convince me i should program my own, better it will give
> me some good ideas or fortify some of my own good ideas and at best it
> will be the thing i've been looking for!

Why keep a bare repo at all?  That's certainly not a prerequisite with
a distributed VCS like git.

You can fetch / merge / rebase / cherry-pick commits with git via ssh
just as easily as with rsync, using some intermediate medium only if
the machines aren't connected at all, but then there's no way around
that anyway.
And even there, knowing the approximate date of the last sync, you can
use commands like git-bundle to create a single pack of the new
objects, which the remote(s) can easily import, transferring it via
any applicable method or protocol between / to any number of hosts.
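
Something along these lines (the file name, date and branch are only
an example):

  # on the machine that has the new commits
  git bundle create /mnt/usb/new.bundle --since=2009-07-20 master

  # on a machine that needs them
  git bundle verify /mnt/usb/new.bundle
  git pull /mnt/usb/new.bundle master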


As you've noted already, git is quite efficient when it comes to
storage, keeping only the changes.
When that does become a problem due to a long history of long-obsolete
changes, you can drop them all, effectively 'squashing' all the
commits in one of the repos and rebasing the rest against it.
So that should cover requirement one.
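
A minimal way to do that squash, assuming you want the whole current
tree as the single surviving commit:

  # create one commit holding the current tree, with no history behind
  # it, and point the current branch at it
  flat=$(echo "flattened history" | git commit-tree 'HEAD^{tree}')
  git reset --hard "$flat"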

Cherry-picking commits, or checking out individual files / dirs on top
of any base from any other repo/revision, is pretty much what is asked
for in the next three requirements.
One gotcha here is that you should be used to making individual
commits consistent and atomic, so that each set of changes serves one
purpose and you won't end up in a situation where you need "half a
commit" anywhere.

Conflict resolution is what you get with merge / rebase (just look at
the fine "git-merge" man page), but due to the absence of an "ultimate
AI" these are better used repeatedly against the same tree.

About the last point of the original post... I don't think git is
"intuitive" until you understand exactly how it works; that's when it
becomes intuitive, with all the high-level and intermediate interfaces
having a great man page and a single, clear purpose.


That said, I don't think git is the best way to sync everything.
I don't mix binary files with configuration, because on Gentoo the
latter alone suffices: you have a git-synced portage tree (emerge can
sync via a VCS out of the box), a git-controlled overlay on top of it,
and you pull the world/sets/flags/etc. changes... then just run emerge
and you're set, without having to worry about architectural
incompatibilities of binaries or missing misc libs against which
they're linked here and there.  That's what portage is made for, after
all.
Just think of the tremendous space efficiency here: no binaries are
backed up at all, and all you need to do to restore a 3G root from a
2M pack is "git clone (or receive-pack) && emerge -uDN @world" ;)


-- 
Mike Kazantsev // fraggod.net


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2009-08-01 12:50 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-07-22 15:09 [gentoo-user] File synchronisation utility (searching for/about to program it) Simon
2009-07-25  2:43 ` Alan E. Davis
2009-07-25 14:56   ` [gentoo-user] " walt
2009-07-25 17:16     ` Simon
2009-07-26  8:15     ` Alan McKinnon
2009-07-25 17:10   ` [gentoo-user] " Simon
2009-08-01 12:50     ` Mike Kazantsev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox