public inbox for gentoo-user@lists.gentoo.org
From: Michael Jones <gentoo@jonesmz.com>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] an efficient idea for an alternative portage synchronisation
Date: Fri, 18 Jun 2021 09:16:48 -0500
Message-ID: <CABfmKS+kPauptZRObeYiLDYJRMbdRyCYDUmZ5a42v_-r2OnTJA@mail.gmail.com>
In-Reply-To: <W39UY79gTTnkYBA-829kjiWYRPxelVDQq1r9_DiK-R3zu7I4RbJ7s3l-freaWBbsSk6JnFf5cRFv2L0cc7kaSxJezUyQG3iY4-2i0dNpKpc=@protonmail.com>

On Fri, Jun 18, 2021, 07:10 caveman رَجُلُ الْكَهْفِ 穴居人 <
toraboracaveman@protonmail.com> wrote:

> tl;dr - i'm suggesting a new file syncing protocol
> for portage syncing.  details are in section 2.
>
>
> 1. background
> -------------
> rsync needs to read all files in order to compare
> them.  this is too expensive and doesn't scale as
> portage's tree grows in size.
>
> on the other hand, git gets away with this, by
> maintaining a history of edits.  so git doesn't
> need to compare all files, instead it walks
> through the history.
>
> but git has another issue:  the history getting
> too big.  this causes:
>     - `git clone` to needlessly take too long, as
>       many old histories become irrelevant as they
>       get fully overwritten by newer ones.
>     - this also causes `git pull` to be slower
>       than needed, as the history is not ideally
>       compressed.
>     - plus, the disk space that's wasted for
>       histories.
>
>
> 2. new protocol
> ---------------
> to solve issues above, i think the ideal solution
> is this protocol:
>     - each history is a number representing a
>       logical clock.  1st history is 0, 2nd is 1,
>       etc.
>     - the server maintains a list of the past N
>       histories of the portage tree.
>     - when a client requests to update its portage
>       tree, it tells the server its current
>       history.  e.g. say client is currently
>       located in logical time 1234567.
>     - the server is maintaining only the past N
>       histories:
>         - if 1234567 is behind those maintained N
>           ones, then the server sends a full
>           portage tree from scratch.
>         - if 1234567 is within those maintained N
>           ones, then the server has two options:
>             (1) either send all changes since
>                 1234567, as they happened
>                 historically.  this is a bad idea.
>                 no good reason for it.
>
>             (2) better: the server can send the
>                 compressed histories.  compressed
>                 histories are done once, and
>                 cached, in a scalable way.  the
>                 cache itself is incremental, so
>                 updating the cache is cheap
>                 (details in section 2.2).
>
>                 e.g. if there are 5000 histories
>                 that the client lacks since time
>                 1234567, then there is a chance
>                 that many of the changes are just
>                 a waste of time.  e.g. add a file,
>                 then delete the same file, then
>                 add a different file again.  so
>                 why not just lie about the
>                 history, and send the last file,
>                 skipping the ones in the middle?  same
>                 can be thought about diffs to code
>                 blocks.
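
A rough sketch of option (2), the "lie about the history" idea, in Python.
The (path, content) change format and the squash() name are assumptions
made up for illustration; they are not part of the proposal or of any
existing tool.

```python
from typing import Dict, List, Optional, Tuple

# One change is (path, new_content); new_content=None means "delete".
# This representation is an assumption made for illustration only.
Change = Tuple[str, Optional[str]]


def squash(histories: List[List[Change]]) -> Dict[str, Optional[str]]:
    """Collapse consecutive histories into one net change per path."""
    net: Dict[str, Optional[str]] = {}
    for history in histories:        # oldest first
        for path, content in history:
            net[path] = content      # a later change overrides earlier ones
    return net


histories = [
    [("app-misc/foo/foo-1.ebuild", "v1")],  # add a file
    [("app-misc/foo/foo-1.ebuild", None)],  # delete the same file
    [("app-misc/foo/foo-2.ebuild", "v2")],  # add a different file
]
delta = squash(histories)
```

Here the intermediate "v1" content is never sent. The sketch still sends
the deletion of foo-1, because it does not know whether the client had
that file at its own logical time; a smarter version could drop it.
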
>
> 2.1. properties of this new protocol
> ------------------------------------
> so this new protocol has these properties:
>     - unlike rsync, it doesn't need to compare all files
>       individually.
>     - unlike git, the history doesn't grow on the
>       client.  history remains only a single
>       number representing a logical clock.
>     - the history on the server is limited to N
>       past entries.  no devs will cry, because
>       this is not a code collaboration app, but
>       simply a file synchronisation app to replace
>       rsync.  so the admins are free to set N as
>       small as they please, without worrying about
>       harming collaborating devs.
>     - server has the option to compress histories
>       to clients, and these histories are
>       cacheable for more performance.
>
>
> 2.2. how it will feel to admins/devs
> ------------------------------------
>     - the devs simply commit their changes to the
>       portage tree via git.
>     - the git server will have hooks to execute an
>       external command for this new protocol, that
>       will calculate all diffs necessary in order
>       to build a new history.
>
>       e.g. if current history is 30000, and a dev
>       makes a new commit via git, then the git
>       hooks will execute the external command to
>       calculate the diff for the affected files by
>       the git commit, such that history 30001 is
>       created.
>
>       the hooked external command will also see if
>       it can compress the histories, for the past
>       M many entries since 30001.
>
>       so that clients that live in time 30001-M,
>       who ask for 30001, can get the compressed
>       history instead of raw actual histories from
>       30001-M to 30001.
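
The server side described in sections 2 and 2.2 could look roughly like
this, again in Python. HistoryServer, commit() and update_for() are
hypothetical names; commit() stands in for whatever the git hook would
run. For simplicity the clock starts at 0 for an empty tree, and the
caching of compressed histories is left out.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

Change = Tuple[str, Optional[str]]  # (path, content); None means deleted


class HistoryServer:
    """Hypothetical server that keeps only the last n histories."""

    def __init__(self, n: int) -> None:
        self.clock = 0                  # logical time of the newest history
        self.tree: Dict[str, str] = {}  # current full tree
        self.recent: Deque[Tuple[int, List[Change]]] = deque(maxlen=n)

    def commit(self, changes: List[Change]) -> None:
        """Record one new history, as a git hook might do after a push."""
        self.clock += 1
        for path, content in changes:
            if content is None:
                self.tree.pop(path, None)
            else:
                self.tree[path] = content
        self.recent.append((self.clock, changes))  # oldest entry falls off

    def update_for(self, client_clock: int):
        """Return ('full', tree) or ('delta', net_changes) for a client."""
        oldest = self.recent[0][0] if self.recent else self.clock + 1
        if client_clock < oldest - 1:
            # The client is behind the retained window: send everything.
            return "full", dict(self.tree)
        net: Dict[str, Optional[str]] = {}
        for clock, changes in self.recent:
            if clock > client_clock:
                for path, content in changes:
                    net[path] = content  # later changes win
        return "delta", net


server = HistoryServer(n=2)
server.commit([("a", "1")])   # history 1
server.commit([("a", None)])  # history 2
server.commit([("b", "2")])   # history 3
kind_old, payload_old = server.update_for(0)  # behind the window
kind_new, payload_new = server.update_for(1)  # inside the window
```

With n=2 only histories 2 and 3 are retained, so a client at time 0 gets
the full tree, while a client at time 1 gets a single squashed delta.
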
>
> ty,
> cm
>


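It seems like you are almost asking for git shallow clones, which Portage
exposes through the clone-depth and sync-depth settings in repos.conf
(the underlying git option is --depth).

It's not an exact match for your proposal, but it's very close.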
