* [gentoo-user] an efficient idea for an alternative portage synchronisation
@ 2021-06-18 12:10 caveman رَجُلُ الْكَهْفِ 穴居人
  2021-06-18 14:16 ` Michael Jones
  0 siblings, 1 reply; 2+ messages in thread
From: caveman رَجُلُ الْكَهْفِ 穴居人 @ 2021-06-18 12:10 UTC
  To: Gentoo

tl;dr - i'm suggesting a new file syncing protocol
for portage syncing.  details are in section 2.


1. background
-------------
rsync needs to examine every file in the tree (at
least its size and mtime) in order to compare both
sides.  this is too expensive and doesn't scale as
portage's tree grows in size.

on the other hand, git avoids this by maintaining
a history of edits.  so git doesn't need to
compare all files; instead it walks through the
history.

but git has another issue:  the history gets
too big.  this causes:
    - `git clone` to needlessly take too long, as
      many old histories become irrelevant once
      they are fully overridden by newer ones.
    - `git pull` to be slower than needed, as the
      history is not ideally compressed.
    - disk space to be wasted on histories.


2. new protocol
---------------
to solve issues above, i think the ideal solution
is this protocol:
    - each history is identified by a number
      representing a logical clock.  1st history
      is 0, 2nd is 1, etc.
    - the server maintains a list of the past N
      histories of the portage tree.
    - when a client requests to update its portage
      tree, it tells the server its current
      history.  e.g. say client is currently
      located in logical time 1234567.
    - the server is maintaining only the past N
      histories:
        - if 1234567 is behind those maintained N
          ones, then the server sends a full
          portage tree from scratch.
        - if 1234567 is within those maintained N
          ones, then the server has two options:
            (1) either send all changes since
                1234567, as they happened
                historically.  this is a bad idea.
                no good reason for it.

            (2) better: the server can send the
                compressed histories.  compressed
                histories are computed once and
                cached, in a scalable way.  the
                cache itself is incremental, so
                updating the cache is cheap
                (details in section 2.2).

                e.g. if there are 5000 histories
                that the client lacks since time
                1234567, then there is a chance
                that many of the changes are just
                a waste of time.  e.g. add a file,
                then delete the same file, then
                add a different file.  so why not
                just lie about the history, and
                send only the last file, skipping
                the ones in the middle?  the same
                reasoning applies to diffs of code
                blocks.
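
the server-side decision and the "lie about the
history" idea can be sketched roughly as below.
this is only an illustration under my own
assumptions: the names `Change`, `squash` and
`plan_update` are made up, changes are whole-file
replacements (no block-level diffs), and `log[t]`
is assumed to hold the changes that move the tree
from clock t-1 to clock t.

```python
from typing import Dict, List, Optional, Tuple

# One change: (path, new_content); new_content is None for a deletion.
# Hypothetical representation, whole files only.
Change = Tuple[str, Optional[bytes]]
Delta = Dict[str, Optional[bytes]]


def squash(changes: List[Change]) -> Delta:
    """Collapse an ordered list of changes so that only the final
    state of each path survives."""
    final: Delta = {}
    for path, content in changes:
        final[path] = content  # later entries overwrite earlier ones
    return final


def plan_update(client_clock: int,
                log: Dict[int, List[Change]]) -> Tuple[str, Delta]:
    """Decide what to send to a client sitting at client_clock.

    Assumes log is non-empty and its keys are contiguous clocks.
    """
    oldest, latest = min(log), max(log)
    if client_clock >= latest:
        return ("up-to-date", {})
    if client_clock + 1 < oldest:
        # The client is behind the retained window: full tree needed.
        return ("full", {})
    missing: List[Change] = []
    for t in range(client_clock + 1, latest + 1):
        missing.extend(log[t])
    return ("delta", squash(missing))


if __name__ == "__main__":
    log = {
        3: [("a.ebuild", b"1")],
        4: [("a.ebuild", None), ("b.ebuild", b"2")],
        5: [("b.ebuild", b"3")],
    }
    print(plan_update(2, log))
```

note that a file which was added and then deleted
inside the window still shows up as a deletion in
the squashed result.  that is harmless for a
client that never had the file, it just has
nothing to delete.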

2.1. properties of this new protocol
------------------------------------
so this new protocol has these properties:
    - unlike rsync, it doesn't need to compare all files
      individually.
    - unlike git, the history doesn't grow on the
      client.  history remains only a single
      number representing a logical clock.
    - the history on the server is limited to N
      past entries.  no devs will cry, because
      this is not a code collaboration app, but
      simply a file synchronisation app to replace
      rsync.  so the admins are free to set N as
      small as they please, without worrying about
      harming collaborating devs.
    - server has the option to send compressed
      histories to clients, and these histories
      are cacheable for more performance.
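
to illustrate the client side of these properties,
here is a minimal sketch in which the only state
the client keeps besides the tree is one integer.
the `Client` class and its methods are hypothetical
names, and a dict stands in for the files on disk.

```python
from typing import Dict, Optional

# path -> new content; None means "delete this path" (assumed format).
Delta = Dict[str, Optional[bytes]]


class Client:
    def __init__(self) -> None:
        self.clock: Optional[int] = None  # None = never synced
        self.tree: Dict[str, bytes] = {}  # stand-in for the on-disk tree

    def apply_full(self, snapshot: Dict[str, bytes], clock: int) -> None:
        """Replace the whole tree, e.g. when the server no longer
        keeps the history the client was at."""
        self.tree = dict(snapshot)
        self.clock = clock

    def apply_delta(self, delta: Delta, clock: int) -> None:
        """Apply a compressed history and jump to the new clock."""
        for path, content in delta.items():
            if content is None:
                self.tree.pop(path, None)  # ignore files we never had
            else:
                self.tree[path] = content
        self.clock = clock


if __name__ == "__main__":
    c = Client()
    c.apply_full({"a.ebuild": b"1"}, 10)
    c.apply_delta({"a.ebuild": None, "b.ebuild": b"2"}, 12)
    print(c.clock, sorted(c.tree))
```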


2.2. how it will feel to admins/devs
------------------------------------
    - the devs simply commit their changes to the
      portage tree via git.
    - the git server will have hooks to execute an
      external command for this new protocol, that
      will calculate all diffs necessary in order
      to build a new history.

      e.g. if current history is 30000, and a dev
      makes a new commit via git, then the git
      hooks will execute the external command to
      calculate the diff for the files affected by
      the git commit, such that history 30001 is
      created.

      the hooked external command will also see if
      it can compress the histories for the past
      M entries up to 30001.

      so that clients that live in time 30001-M,
      who ask for 30001, can get the compressed
      history instead of the raw histories from
      30001-M to 30001.
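
one possible shape for the hooked external command,
as a sketch only.  it assumes the hook (git's
post-receive would be one candidate) hands over the
list of changed files, and that the cache stores,
for each of the past M clocks, the already
compressed delta from that clock to the latest one.
`HistoryStore`, `on_commit` and `delta_for` are
made-up names.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, Optional[bytes]]  # (path, content); None = deleted
Delta = Dict[str, Optional[bytes]]


class HistoryStore:
    def __init__(self, keep: int, latest: int = 0) -> None:
        self.keep = keep      # M: how many past clocks stay cached
        self.latest = latest  # current logical clock
        # cache[t] = compressed delta that moves clock t to self.latest
        self.cache: Dict[int, Delta] = {}

    def on_commit(self, changes: List[Change]) -> None:
        """Called once per new commit with the files it touched."""
        new: Delta = dict(changes)  # last write per path wins
        # Incremental update: fold the new commit into every cached
        # delta instead of recomputing them from raw history.
        for delta in self.cache.values():
            delta.update(new)
        self.cache[self.latest] = dict(new)
        self.latest += 1
        # Forget clocks that fell out of the window of M entries.
        for t in [t for t in self.cache if t < self.latest - self.keep]:
            del self.cache[t]

    def delta_for(self, client_clock: int) -> Optional[Delta]:
        """Return the cached delta, or None if a full tree is needed."""
        if client_clock == self.latest:
            return {}
        return self.cache.get(client_clock)


if __name__ == "__main__":
    store = HistoryStore(keep=2, latest=30000)
    store.on_commit([("a.ebuild", b"1")])  # creates history 30001
    print(store.latest, store.delta_for(30000))
```

with this layout each commit costs one dict update
per cached clock, so the work per commit grows with
M and with the size of the commit, not with the
size of the whole tree.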

ty,
cm.




* Re: [gentoo-user] an efficient idea for an alternative portage synchronisation
  2021-06-18 12:10 [gentoo-user] an efficient idea for an alternative portage synchronisation caveman رَجُلُ الْكَهْفِ 穴居人
@ 2021-06-18 14:16 ` Michael Jones
  0 siblings, 0 replies; 2+ messages in thread
From: Michael Jones @ 2021-06-18 14:16 UTC
  To: gentoo-user

It seems like you are almost asking for portage's clone-depth and
sync-depth settings for git repositories (git's own flag is --depth).

It's not an exact match for your proposal but it's very close.
