From: "caveman رَجُلُ الْكَهْفِ 穴居人" <toraboracaveman@protonmail.com>
To: Gentoo <gentoo-user@lists.gentoo.org>
Subject: [gentoo-user] an efficient idea for an alternative portage synchronisation
Date: Fri, 18 Jun 2021 12:10:28 +0000
Message-ID: <W39UY79gTTnkYBA-829kjiWYRPxelVDQq1r9_DiK-R3zu7I4RbJ7s3l-freaWBbsSk6JnFf5cRFv2L0cc7kaSxJezUyQG3iY4-2i0dNpKpc=@protonmail.com>
tl;dr - i'm suggesting a new file syncing protocol
for portage syncing. the details are in
section 2.
1. background
-------------
rsync needs to walk the whole tree and check every
file (at least its size and mtime) in order to
compare them. this is too expensive and doesn't
scale as portage's tree grows in size.
on the other hand, git gets away with this, by
maintaining a history of edits. so git doesn't
need to compare all files, instead it walks
through the history.
but git has another issue: the history getting
too big. this causes:
- `git clone` to needlessly take too long, as
many old histories become irrelevant once they
get fully overwritten by newer ones.
- this also causes `git pull` to be slower
than needed, as the history is not ideally
compressed.
- plus, the disk space that's wasted for
histories.
2. new protocol
---------------
to solve issues above, i think the ideal solution
is this protocol:
- each history is identified by a number
representing a logical clock. 1st history is
0, 2nd is 1, etc.
- the server maintains a list of the past N
histories of the portage tree.
- when a client requests to update its portage
tree, it tells the server its current
history. e.g. say client is currently
located in logical time 1234567.
- the server is maintaining only the past N
histories:
- if 1234567 is older than those maintained N
ones, then the server sends a full
portage tree from scratch.
- if 1234567 is within those maintained N
ones, then the server has two options:
(1) either send all changes since
1234567, as they happened
historically. this is a bad idea.
no good reason for it.
(2) better: the server can send the
compressed histories. compressed
histories are done once, and
cached, in a scalable way. the
cache itself is incremental, so
updating the cache is cheap
(details in section 2.2).
e.g. if there are 5000 histories
that the client lacks since time
1234567, then there is a chance
that many of the changes are just
a waste of time. e.g. add a file,
then delete the same file, then
add a different file again. so
why not just lie about the
history, and send the last file,
skipping the ones in the middle?
the same can be thought about
diffs to code blocks.
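a minimal sketch of the server's choice and of
option (2), in python. nothing here is an existing
tool: the names `squash` and `choose_reply`, and
the idea of storing one history as a dict from
path to new content (`None` meaning "deleted"),
are assumptions made only for illustration. it
also assumes that history k is the change that
takes the tree from time k-1 to time k.

```python
from typing import Dict, List, Optional

# hypothetical representation: one history maps each touched path to
# its new content, or to None if the path was deleted.
History = Dict[str, Optional[bytes]]


def squash(histories: List[History]) -> History:
    """merge histories (oldest first) into one; the last write per path wins."""
    merged: History = {}
    for h in histories:
        merged.update(h)
    return merged


def choose_reply(client_clock: int, latest: int, n: int) -> str:
    """pick what the server sends to a client sitting at client_clock."""
    if client_clock == latest:
        return "up-to-date"
    if client_clock > latest or client_clock < latest - n:
        return "full"    # outside the N kept histories: send the whole tree
    return "delta"       # inside: send one squashed history


if __name__ == "__main__":
    steps = [
        {"a.ebuild": b"v1"},   # add a file
        {"a.ebuild": None},    # delete the same file
        {"b.ebuild": b"v2"},   # add a different file
    ]
    print(squash(steps))
    print(choose_reply(1234567, latest=1239567, n=10000))
```

in this sketch the squashed result still carries a
"deleted" marker for the file that was added and
then removed. that is harmless for a client that
never had the file, and needed for one that did.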
2.1. properties of this new protocol
------------------------------------
so this new protocol has these properties:
- unlike rsync, it doesn't need to compare all files
individually.
- unlike git, the history doesn't grow on the
client. history remains only a single
number representing a logical clock.
- the history on the server is limited to N
past entries. no devs will cry, because
this is not a code collaboration app, but
simply a file synchronisation app to replace
rsync. so the admins are free to set N as
small as they please, without worrying about
harming collaborating devs.
- server has the option to compress the
histories it sends to clients, and these
compressed histories are cacheable for more
performance.
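to make the client side of these properties
concrete, here is a rough python sketch. the
client keeps only its tree and one integer. the
function name `apply_delta` and the in-memory dict
standing in for files on disk are assumptions for
illustration, not part of any existing tool.

```python
from typing import Dict, Optional, Tuple

Tree = Dict[str, bytes]                 # path -> file content
Delta = Dict[str, Optional[bytes]]      # path -> new content, None = deleted


def apply_delta(tree: Tree, delta: Delta, new_clock: int) -> Tuple[Tree, int]:
    """return the updated tree and the client's new logical clock."""
    updated = dict(tree)
    for path, content in delta.items():
        if content is None:
            # deletion; fine if the path never existed locally
            updated.pop(path, None)
        else:
            updated[path] = content
    return updated, new_clock


if __name__ == "__main__":
    tree = {"x.ebuild": b"old", "y.ebuild": b"keep"}
    delta = {"x.ebuild": None, "z.ebuild": b"new"}
    tree, clock = apply_delta(tree, delta, 1239567)
    print(sorted(tree), clock)
```

the only state that survives between syncs is the
tree itself and `clock`. no per-commit history is
stored on the client.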
2.2. how it will feel to admins/devs
------------------------------------
- the devs simply commit their changes to the
portage tree via git.
- the git server will have hooks to execute an
external command for this new protocol, that
will calculate all diffs necessary in order
to build a new history.
e.g. if current history is 30000, and a dev
makes a new commit via git, then the git
hooks will execute the external command to
calculate the diff for the affected files by
the git commit, such that history 30001 is
created.
the hooked external command will also see if
it can compress the histories for the past
M entries up to 30001.
so that clients that live in time 30001-M,
who ask for 30001, can get the compressed
history instead of the raw actual histories
from 30001-M to 30001.
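one possible way to keep that cache incremental,
sketched in python. this is only an assumption
about how the hooked command could work: the class
`HistoryStore`, its method names, and the dict
form of a history (path to new content, `None` for
a deletion) are all hypothetical. on each commit
the new changes are folded into every cached
squashed history, so nothing is recomputed from
scratch.

```python
from typing import Dict, Optional

History = Dict[str, Optional[bytes]]   # path -> new content, None = deleted


class HistoryStore:
    def __init__(self, m: int, clock: int = 0) -> None:
        self.m = m            # how many past starting points to keep
        self.clock = clock    # latest logical time
        # squashed[c] = one merged history taking a client from c to self.clock
        self.squashed: Dict[int, History] = {}

    def on_commit(self, changes: History) -> int:
        """called from the (hypothetical) git hook with the commit's changes."""
        # fold the new changes into every cached history; newer writes win
        for delta in self.squashed.values():
            delta.update(changes)
        # clients at the previous clock need exactly this commit
        self.squashed[self.clock] = dict(changes)
        self.clock += 1
        # forget the starting point that fell out of the window of M
        self.squashed.pop(self.clock - self.m - 1, None)
        return self.clock

    def delta_for(self, client_clock: int) -> Optional[History]:
        """squashed history for a client, or None if a full tree is needed."""
        if client_clock == self.clock:
            return {}
        return self.squashed.get(client_clock)


if __name__ == "__main__":
    store = HistoryStore(m=2, clock=30000)
    store.on_commit({"a.ebuild": b"v1"})
    store.on_commit({"a.ebuild": None})
    store.on_commit({"b.ebuild": b"v2"})
    print(store.clock, store.delta_for(30001), store.delta_for(30000))
```

each commit costs about M dict updates of the size
of that commit. a client older than the window
gets `None` here, which stands for "send the full
tree".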
ty,
cm.