From: Michael Jones <gentoo@jonesmz.com>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] an efficient idea for an alternative portage synchronisation
Date: Fri, 18 Jun 2021 09:16:48 -0500
Message-ID: <CABfmKS+kPauptZRObeYiLDYJRMbdRyCYDUmZ5a42v_-r2OnTJA@mail.gmail.com>
In-Reply-To: <W39UY79gTTnkYBA-829kjiWYRPxelVDQq1r9_DiK-R3zu7I4RbJ7s3l-freaWBbsSk6JnFf5cRFv2L0cc7kaSxJezUyQG3iY4-2i0dNpKpc=@protonmail.com>
On Fri, Jun 18, 2021, 07:10 caveman رَجُلُ الْكَهْفِ 穴居人 <toraboracaveman@protonmail.com> wrote:
> tl;dr - i'm suggesting a new file syncing protocol
> for portage syncing. details are in
> section 2.
>
>
> 1. background
> -------------
> rsync needs to read all files in order to compare
> them. this is too expensive and doesn't scale as
> portage's tree grows in size.
>
> on the other hand, git gets away with this, by
> maintaining a history of edits. so git doesn't
> need to compare all files, instead it walks
> through the history.
>
> but git has another issue: the history getting
> too big. this causes:
> - `git clone` to needlessly take too long, as
>   many old histories become irrelevant as they
>   get fully overwritten by newer ones.
> - this also causes `git pull` to be slower
>   than needed, as the history is not ideally
>   compressed.
> - plus, the disk space that's wasted for
>   histories.
>
>
> 2. new protocol
> ---------------
> to solve the issues above, i think the ideal solution
> is this protocol:
> - each history is a number representing a
>   logical clock. 1st history is 0, 2nd is 1,
>   etc.
> - the server maintains a list of the past N
>   histories of the portage tree.
> - when a client requests to update its portage
>   tree, it tells the server its current
>   history. e.g. say the client is currently
>   located in logical time 1234567.
> - the server is maintaining only the past N
>   histories:
>     - if 1234567 is behind those maintained N
>       ones, then the server sends a full
>       portage tree from scratch.
>     - if 1234567 is within those maintained N
>       ones, then the server has two options:
>       (1) either send all changes since
>           1234567, as they happened
>           historically. this is a bad idea.
>           no good reason for it.
>
>       (2) better: the server can send the
>           compressed histories. compressed
>           histories are done once, and
>           cached, in a scalable way. the
>           cache itself is incremental, so
>           updating the cache is cheap
>           (details in section 2.2).
>
>           e.g. if there are 5000 histories
>           that the client lacks since time
>           1234567, then there is a chance
>           that many of the changes are just
>           a waste of time. e.g. add a file,
>           then delete the same file, then
>           add a different file again. so
>           why not just lie about the
>           history, and send the last file,
>           skipping the ones in the middle? the
>           same can be thought about diffs to
>           code blocks.
>
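The decision and the "skip the ones in the middle" idea from section 2 can be sketched roughly as below. This is only an illustration under assumptions that are not in the proposal: a history is modelled as a list of (path, content) changes where content None means deletion, history k is taken to be the change set that moves the tree from time k-1 to time k, and the names Change, squash, Server and plan_sync are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumed model: one change is (path, content); content None means "delete".
Change = Tuple[str, Optional[bytes]]


def squash(histories: List[List[Change]]) -> Dict[str, Optional[bytes]]:
    """Collapse consecutive histories, keeping only the last state per path."""
    net: Dict[str, Optional[bytes]] = {}
    for history in histories:
        for path, content in history:
            net[path] = content  # a later change overwrites an earlier one
    return net


@dataclass
class Server:
    head: int                   # logical clock of the newest history
    window: List[List[Change]]  # the past N histories, oldest first
    tree: Dict[str, bytes]      # the full current tree

    def plan_sync(self, client_time: int):
        """Return (kind, payload) for a client located at client_time."""
        oldest = self.head - len(self.window) + 1  # clock of window[0]
        if client_time >= self.head:
            return ("up-to-date", {})
        if client_time < oldest - 1:
            # The client is behind the maintained histories: send everything.
            return ("full", dict(self.tree))
        # The client needs histories client_time+1 .. head, squashed into one.
        missing = self.window[client_time + 1 - oldest:]
        return ("delta", squash(missing))
```

With a window holding histories 3, 4 and 5, a client at time 2 would get a single squashed delta, while a client at time 1 would get the full tree. A squashed delta may contain a deletion for a path the client never had (added and deleted in between); a client would have to treat that as a no-op.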
> 2.1. properties of this new protocol
> ------------------------------------
> so this new protocol has these properties:
> - unlike rsync, it doesn't need to compare all files
>   individually.
> - unlike git, the history doesn't grow on the
>   client. history remains only a single
>   number representing a logical clock.
> - the history on the server is limited to N
>   past entries. no devs will cry, because
>   this is not a code collaboration app, but
>   simply a file synchronisation app to replace
>   rsync. so the admins are free to set N as
>   small as they please, without worrying about
>   harming collaborating devs.
> - server has the option to compress histories
>   to clients, and these histories are
>   cacheable for more performance.
>
>
> 2.2. how it will feel to admins/devs
> ------------------------------------
> - the devs simply commit their changes to the
>   portage tree via git.
> - the git server will have hooks to execute an
>   external command for this new protocol, that
>   will calculate all diffs necessary in order
>   to build a new history.
>
>   e.g. if current history is 30000, and a dev
>   makes a new commit via git, then the git
>   hooks will execute the external command to
>   calculate the diff for the files affected by
>   the git commit, such that history 30001 is
>   created.
>
>   the hooked external command will also see if
>   it can compress the histories, for the past
>   M entries up to 30001.
>
>   so that clients that live in time 30001-M,
>   who ask for 30001, can get the compressed
>   history instead of the raw actual histories from
>   30001-M to 30001.
>
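One way the hooked command could keep the compressed histories incremental is sketched below. It is an assumption-laden illustration, not something the proposal specifies: DeltaCache, on_commit and delta_for are hypothetical names, changes are modelled as (path, content) pairs with None meaning deletion, and the cache stores one net delta per clock value for the last M values.

```python
from typing import Dict, List, Optional, Tuple

# Assumed model: one change is (path, content); content None means "delete".
Change = Tuple[str, Optional[bytes]]


class DeltaCache:
    """For each of the last M clock values t, keep the net delta from t to head."""

    def __init__(self, m: int, head: int = 0) -> None:
        self.m = m
        self.head = head
        self.deltas: Dict[int, Dict[str, Optional[bytes]]] = {}

    def on_commit(self, changes: List[Change]) -> int:
        """Hypothetical hook entry point: record history head+1."""
        new = dict(changes)  # if a path repeats, its last change wins
        # Fold the new history into every cached delta; this touches at most
        # M dictionaries and only the paths changed by this commit.
        for delta in self.deltas.values():
            delta.update(new)
        self.deltas[self.head] = new  # clients at the old head need only this
        self.head += 1
        # Forget clients that are now more than M histories behind.
        self.deltas.pop(self.head - self.m - 1, None)
        return self.head

    def delta_for(self, client_time: int) -> Optional[Dict[str, Optional[bytes]]]:
        """Net delta for a client, or None if it needs a full tree."""
        if client_time == self.head:
            return {}
        return self.deltas.get(client_time)
```

Starting from DeltaCache(m=2, head=30000), three commits would leave cached deltas for clients at 30001 and 30002 only; a client still at 30000 would get None and fall back to a full tree.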
> ty,
> cm
>
It seems like you are almost asking for the clone-depth and sync-depth
settings that Portage offers for git-synced repositories (set in repos.conf).
It's not an exact match for your proposal, but it's very close.
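For reference, Portage exposes shallow git syncing through the clone-depth and sync-depth keys of a repository section in repos.conf. A minimal sketch follows; the location and sync-uri values are assumptions for illustration and should be adjusted to the local setup.

```ini
# /etc/portage/repos.conf/gentoo.conf (path and values are illustrative)
[gentoo]
location = /var/db/repos/gentoo
sync-type = git
sync-uri = https://github.com/gentoo-mirror/gentoo.git
# Keep only the newest commit on the initial clone and on later syncs.
clone-depth = 1
sync-depth = 1
```

With both depths set to 1 the client keeps no history beyond the newest commit, which is close to the "single logical clock" property in the proposal, although git still computes what to transfer from commit objects rather than from a cached, compressed delta.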