Date: Fri, 18 Jun 2021 12:10:28 +0000
From: caveman رَجُلُ الْكَهْفِ 穴居人
To: Gentoo
Subject: [gentoo-user] an efficient idea for an alternative portage synchronisation

tl;dr - i'm suggesting a new file syncing protocol for portage
syncing. details are in section 2.

1. background
-------------

rsync needs to examine every file in the tree in order to compare
them. this is too expensive and doesn't scale as portage's tree grows
in size.

on the other hand, git avoids this by maintaining a history of edits.
so git doesn't need to compare all files; instead it walks through
the history.

but git has another issue: the history gets too big. this causes:

  - `git clone` to take needlessly long, as many old histories become
    irrelevant once they are fully overwritten by newer ones.
  - `git pull` to be slower than needed, as the history is not
    ideally compressed.
  - disk space to be wasted on histories.

2. new protocol
---------------

to solve the issues above, i think the ideal solution is this
protocol:

  - each history is identified by a number representing a logical
    clock. 1st history is 0, 2nd is 1, etc.
  - the server maintains a list of the past N histories of the
    portage tree.
  - when a client requests to update its portage tree, it tells the
    server its current history. e.g. say the client is currently at
    logical time 1234567.
  - the server is maintaining only the past N histories:
      - if 1234567 is older than those N maintained ones, then the
        server sends a full portage tree from scratch.
      - if 1234567 is within those N maintained ones, then the
        server has two options:
          (1) send all changes since 1234567, as they happened
              historically. this is a bad idea. no good reason for
              it.
          (2) better: send the compressed histories. compressed
              histories are computed once and cached, in a scalable
              way. the cache itself is incremental, so updating the
              cache is cheap (details in section 2.2). e.g.
              if there are 5000 histories that the client lacks
              since time 1234567, then there is a good chance that
              many of the changes are just a waste of time. e.g. add
              a file, then delete the same file, then add a
              different file again. so why not just lie about the
              history and send only the last file, skipping the ones
              in the middle? the same idea applies to diffs of code
              blocks.

2.1. properties of this new protocol
------------------------------------

so this new protocol has these properties:

  - unlike rsync, it doesn't need to compare all files individually.
  - unlike git, the history doesn't grow on the client. the client's
    history remains only a single number representing a logical
    clock.
  - the history on the server is limited to the past N entries. no
    devs will cry, because this is not a code collaboration app, but
    simply a file synchronisation app to replace rsync. so the
    admins are free to set N as small as they please, without
    worrying about harming collaborating devs.
  - the server has the option to send compressed histories to
    clients, and these compressed histories are cacheable for more
    performance.

2.2. how it will feel to admins/devs
------------------------------------

  - the devs simply commit their changes to the portage tree via
    git.
  - the git server will have hooks that execute an external command
    for this new protocol, which will calculate all the diffs
    necessary to build a new history. e.g. if the current history is
    30000, and a dev makes a new commit via git, then the git hooks
    will execute the external command to calculate the diff for the
    files affected by the commit, such that history 30001 is
    created. the hooked external command will also see if it can
    compress the histories for the past M entries before 30001, so
    that clients living in time 30001-M, who ask for 30001, can get
    the compressed history instead of the raw actual histories from
    30001-M to 30001.

ty, cm.
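
p.s. a rough sketch of section 2 in python, only to make the idea
concrete. everything in it is an assumption made up for illustration:
the names `Server`, `commit`, `sync`, `compress` and `apply_reply`,
keeping whole file contents instead of diffs, and keeping everything
in memory. the cache from section 2.2 is left out, so `compress` is
recomputed on every request here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Mapping, Optional, Tuple

# one history = {path: new content}; None means "file deleted".
Change = Mapping[str, Optional[bytes]]
Reply = Tuple[str, int, Change]  # ("full" | "delta", server clock, payload)


def apply_change(tree: Dict[str, bytes], change: Change) -> None:
    """apply one history (or one squashed history) to a tree in place."""
    for path, content in change.items():
        if content is None:
            tree.pop(path, None)  # deleting a file we never had is fine
        else:
            tree[path] = content


def compress(histories: List[Change]) -> Change:
    """squash histories in order; only the final state of each path survives."""
    squashed: Dict[str, Optional[bytes]] = {}
    for change in histories:
        squashed.update(change)
    return squashed


@dataclass
class Server:
    max_histories: int  # "N" from section 2
    clock: int = 0      # logical time of the newest history
    tree: Dict[str, bytes] = field(default_factory=dict)
    log: Dict[int, Change] = field(default_factory=dict)

    def commit(self, change: Change) -> None:
        """record a new history and forget the ones older than the past N."""
        self.clock += 1
        apply_change(self.tree, change)
        self.log[self.clock] = change
        self.log.pop(self.clock - self.max_histories, None)

    def sync(self, client_clock: int) -> Reply:
        """answer a client that says it is at logical time client_clock."""
        needed = range(client_clock + 1, self.clock + 1)
        if client_clock > self.clock or any(t not in self.log for t in needed):
            # client is outside the maintained window: send everything
            return ("full", self.clock, dict(self.tree))
        return ("delta", self.clock, compress([self.log[t] for t in needed]))


def apply_reply(tree: Dict[str, bytes], reply: Reply) -> int:
    """update the client's tree and return its new logical clock."""
    kind, clock, payload = reply
    if kind == "full":
        tree.clear()
    apply_change(tree, payload)
    return clock


if __name__ == "__main__":
    server = Server(max_histories=3)
    server.commit({"a.ebuild": b"v1"})
    server.commit({"tmp": b"x"})                     # added ...
    server.commit({"tmp": None, "b.ebuild": b"v1"})  # ... and deleted again
    client_tree: Dict[str, bytes] = {}
    client_clock = apply_reply(client_tree, server.sync(0))
    print(client_clock, sorted(client_tree))
```

the client only ever stores one number. the content of "tmp" is never
sent, only its final (deleted) state, which is the "lie about the
history" from option (2).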