From: Michael Jones
Date: Fri, 18 Jun 2021 09:16:48 -0500
Subject: Re: [gentoo-user] an efficient idea for an alternative portage synchronisation
To: gentoo-user@lists.gentoo.org
Reply-to: gentoo-user@lists.gentoo.org


On Fri, Jun 18, 2021, 07:10 caveman رَجُلُ الْكَهْفِ 穴居人 <toraboracaveman@protonmail.com> wrote:
tl;dr - i'm suggesting a new file syncing protocol
for portage syncing.  details are in
section 2.


1. background
-------------
rsync needs to examine every file in order to compare
them.  this is too expensive and doesn't scale as
portage's tree grows in size.

on the other hand, git avoids this by
maintaining a history of edits.  so git doesn't
need to compare all files; instead it walks
through the history.

but git has another issue:  the history getting
too big.  this causes:
    - `git clone` to take needlessly long, as
      many old histories become irrelevant once
      they are fully overwritten by newer ones.
    - `git pull` to be slower than needed, as
      the history is not ideally compressed.
    - disk space to be wasted on histories.


2. new protocol
---------------
to solve the issues above, i think the ideal solution
is this protocol:
    - each history is a number representing a
      logical clock.  1st history is 0, 2nd is 1,
      etc.
    - the server maintains a list of the past N
      histories of the portage tree.
    - when a client requests to update its portage
      tree, it tells the server its current
      history.  e.g. say the client is currently
      located at logical time 1234567.
    - the server is maintaining only the past N
      histories:
        - if 1234567 is older than those N
          maintained ones, then the server sends a
          full portage tree from scratch.
        - if 1234567 is within those N maintained
          ones, then the server has two options:
            (1) either send all changes since
                1234567, as they happened
                historically.  this is a bad idea.
                no good reason for it.

            (2) better: the server can send the
                compressed histories.  compressed
                histories are done once, and
                cached, in a scalable way.  the
                cache itself is incremental, so
                updating the cache is cheap
                (details in section 2.2).

                e.g. if there are 5000 histories
                that the client lacks since time
                1234567, then there is a chance
                that many of the changes are just
                a waste of time.  e.g. add a file,
                then delete the same file, then
                add a different file again.  so
                why not just lie about the
                history, and send the last file,
                skipping the ones in the middle?
                the same applies to diffs of code
                blocks.
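
The decision described above (full tree versus compressed delta) can be sketched roughly as follows. This is only an illustration under assumptions that are not in the proposal: a history entry is modelled as a dict from path to new content (None meaning deletion), and the names `squash` and `answer` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# One history entry: path -> new content, or None if the path was deleted.
Change = Dict[str, Optional[bytes]]


def squash(entries: List[Change]) -> Change:
    """Collapse consecutive history entries into one net change set.

    Later entries win, so add -> delete -> add of one path keeps only
    the final content. A path that was added and then deleted inside
    the window ends up as a deletion, which is a no-op on a client
    that never had it.
    """
    net: Change = {}
    for entry in entries:
        net.update(entry)
    return net


def answer(client_clock: int, oldest_kept: int, log: List[Change],
           full_tree: Dict[str, bytes]) -> Tuple[str, dict]:
    """Decide what to send to a client.

    log[i] is the change that moves the tree from logical time
    oldest_kept + i to oldest_kept + i + 1.
    """
    latest = oldest_kept + len(log)
    if client_clock < oldest_kept or client_clock > latest:
        # Outside the maintained window: send everything from scratch.
        return ("full", full_tree)
    return ("delta", squash(log[client_clock - oldest_kept:]))
```

A client that is already at the latest logical time receives an empty delta, and a client older than the window falls back to the full tree.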

2.1. properties of this new protocol
------------------------------------
so this new protocol has these properties:
    - unlike rsync, it doesn't need to compare all files
      individually.
    - unlike git, the history doesn't grow on the
      client.  history remains only a single
      number representing a logical clock.
    - the history on the server is limited to N
      past entries.  no devs will cry, because
      this is not a code collaboration app, but
      simply a file synchronisation app to replace
      rsync.  so the admins are free to set N as
      small as they please, without worrying about
      harming collaborating devs.
    - the server has the option to send compressed
      histories to clients, and these histories are
      cacheable for better performance.
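
To illustrate the second property, here is a minimal client-side sketch in which the only history kept on disk is one integer. The `.clock` file name, the delta format (path -> bytes, or None for a deletion) and both function names are assumptions made for this example, and a real client would also have to validate paths received from the server.

```python
from pathlib import Path
from typing import Dict, Optional


def read_clock(tree: Path) -> int:
    """Return the client's logical time, or -1 if it has never synced."""
    marker = tree / ".clock"
    return int(marker.read_text()) if marker.exists() else -1


def apply_delta(tree: Path, delta: Dict[str, Optional[bytes]],
                new_clock: int) -> None:
    """Apply a net change set, then record the new logical time."""
    for rel_path, content in delta.items():
        target = tree / rel_path
        if content is None:
            # Deletion; ignore files the client never had.
            target.unlink(missing_ok=True)
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(content)
    # The whole client-side "history" is this single number.
    (tree / ".clock").write_text(str(new_clock))
```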


2.2. how it will feel to admins/devs
------------------------------------
    - the devs simply commit their changes to the
      portage tree via git.
    - the git server will have hooks to execute an
      external command for this new protocol, that
      will calculate all diffs necessary in order
      to build a new history.

      e.g. if current history is 30000, and a dev
      makes a new commit via git, then the git
      hooks will execute the external command to
      calculate the diff for the files affected by
      the git commit, such that history 30001 is
      created.

      the hooked external command will also see if
      it can compress the histories for the past
      M entries up to 30001.

      so that clients that live in time 30001-M,
      who ask for 30001, can get the compressed
      history instead of the raw actual histories
      from 30001-M to 30001.
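
One way the hooked command could keep the compressed histories incrementally up to date is sketched below. It is a sketch under assumptions, not the proposed implementation: the class `HistoryStore` and its methods are hypothetical, N and M are treated as the same window for simplicity, and how the per-commit change set is obtained from git is left out.

```python
from typing import Dict, Optional

Change = Dict[str, Optional[bytes]]  # path -> content, None = deleted


class HistoryStore:
    """Keeps, for each recent logical time, the net delta up to 'now'."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries      # window size (N, here also M)
        self.clock = 0                      # latest logical time
        self.cache: Dict[int, Change] = {}  # start time -> net delta

    def commit(self, change: Change) -> None:
        """Called once per git commit by the (hypothetical) hook."""
        # Extend every cached delta with the new change; later writes win.
        for delta in self.cache.values():
            delta.update(change)
        # Clients sitting at the previous latest time need only this change.
        self.cache[self.clock] = dict(change)
        self.clock += 1
        # Forget the entry that just fell out of the window.
        self.cache.pop(self.clock - self.max_entries - 1, None)

    def delta_since(self, client_clock: int) -> Optional[Change]:
        """Net delta for a client, or None if it needs the full tree."""
        if client_clock == self.clock:
            return {}
        return self.cache.get(client_clock)
```

Each commit touches at most N cached deltas and only the paths in the new change, so the cost per commit does not depend on the size of the tree.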

ty,
cm


It seems like you are almost asking for the clone-depth and sync-depth settings that Portage offers for git-synced repositories (built on git's shallow clones).

It's not an exact match for your proposal, but it's very close.
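
For reference, clone-depth and sync-depth are, as far as I know, per-repository settings in Portage's repos.conf rather than git command-line flags. A hedged sketch of such an entry follows; the file path, repository location and sync-uri are assumptions for illustration, and portage(5) should be checked for the exact semantics and defaults in your Portage version.

```
# /etc/portage/repos.conf/gentoo.conf (path and values are illustrative)
[gentoo]
location = /var/db/repos/gentoo
sync-type = git
sync-uri = https://github.com/gentoo-mirror/gentoo.git
# keep only the newest commit when cloning and when syncing
clone-depth = 1
sync-depth = 1
```

With a depth of 1 the client keeps almost no history locally, which is close to the "single logical clock" idea in the proposal, although the server side still works on ordinary git history.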