From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from lists.gentoo.org (pigeon.gentoo.org [208.92.234.80])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits))
 (No client certificate requested)
 by finch.gentoo.org (Postfix) with ESMTPS id 09B9215808B
 for ; Mon, 14 Mar 2022 01:06:39 +0000 (UTC)
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
 by pigeon.gentoo.org (Postfix) with SMTP id 06508E086A;
 Mon, 14 Mar 2022 01:06:38 +0000 (UTC)
Received: from smtp.gentoo.org (dev.gentoo.org [IPv6:2001:470:ea4a:1:5054:ff:fec7:86e4])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
 (No client certificate requested)
 by pigeon.gentoo.org (Postfix) with ESMTPS id 8309FE086A
 for ; Mon, 14 Mar 2022 01:06:37 +0000 (UTC)
Received: by mail-ed1-f54.google.com with SMTP id g3so17759471edu.1
 for ; Sun, 13 Mar 2022 18:06:36 -0700 (PDT)
X-Gm-Message-State: AOAM530MIK09D02MoOuof2zFGd6hGMPKvRoDnd4cCiXh4eHsxwHCqin0
 t8EsdzycniLIKpYeItECPrbC+o6XsVylLoFct8Q=
X-Google-Smtp-Source: ABdhPJzQn/xb5Tn3gIPTeQ/Zz/GMo/6N8+GhycYA+4IBcaIpVbRclyaj8UCJIR/nnn+oNBBPUWRr5af1yJGJQDJtync=
X-Received: by 2002:a05:6402:1341:b0:407:cece:49f8 with SMTP id
 y1-20020a056402134100b00407cece49f8mr18620362edw.152.1647219993557;
 Sun, 13 Mar 2022 18:06:33 -0700 (PDT)
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
List-Id: Gentoo Linux mail
X-BeenThere: gentoo-portage-dev@lists.gentoo.org
Reply-to: gentoo-portage-dev@lists.gentoo.org
X-Auto-Response-Suppress: DR, RN, NRN, OOF, AutoReply
MIME-Version: 1.0
From: Matt Turner
Date: Sun, 13 Mar 2022 18:06:21 -0700
X-Gmail-Original-Message-ID:
Message-ID:
Subject: [gentoo-portage-dev] Changing the VDB format
To: gentoo-portage-dev@lists.gentoo.org
Cc: Tim Harder
Content-Type: multipart/mixed; boundary="000000000000954fd305da234ad4"
X-Archives-Salt:
 4feaafbe-6d0d-4ab6-ab25-b24189758023
X-Archives-Hash: 891b99bc55239b475fd8d71659dc60ec

--000000000000954fd305da234ad4
Content-Type: text/plain; charset="UTF-8"

The VDB uses a one-file-per-variable format. This is inefficient on
many file systems. For example, the 'EAPI' file, which contains a
single character, consumes a 4K block on disk.

$ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
$ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
418517
$ du -sh --apparent-size .
413K    .
$ du -sh .
556K    .

During normal operations, portage has to read each of these 35+
files per package individually.

I suggest that we change the VDB format to a commonly used format that
can be quickly read by portage and any other tools. Combining these
35+ files into a single file with a commonly used format should:

- speed up vdb access
- improve disk usage
- allow external tools to access VDB data more easily

I've attached a program that prints the VDB contents of a specified
package in different formats: json, toml, and yaml (and also Python
PrettyPrinter, just because). I think it's important to keep the VDB
format as plain text for ease of manipulation, so I have not
considered anything like sqlite.

I expected to prefer toml, but I actually find it to be rather gross
looking.

$ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
444663
$ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
385112
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
273428

toml and yaml are formatted in a human-readable manner, but json is
not. Pipe the json output to app-misc/jq to get a better sense of its
structure:

$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq
...
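
To give a rough sense of that structure without running the script,
here is a minimal sketch, in the spirit of the attached vdb.py, of how
CONTENTS entries could be folded into a nested mapping so that each
path component is stored only once. The function name nest_contents,
the sample entries, and the "md5"/"mtime" key names are illustrative
assumptions, not part of Portage or of the attachment.

```python
import json


def nest_contents(lines):
    """Fold CONTENTS-style lines into a dict nested by path component."""
    root = {}
    for line in lines:
        kind, rest = line.split(" ", 1)
        if kind == "dir":
            path, meta = rest, None
        elif kind == "obj":
            # obj <path> <md5> <mtime>; rsplit keeps spaces in the path intact
            path, md5, mtime = rest.rsplit(" ", 2)
            meta = {"md5": md5, "mtime": mtime}
        elif kind == "sym":
            # sym <path> -> <target> <mtime>
            path, tail = rest.split(" -> ", 1)
            target, mtime = tail.rsplit(" ", 1)
            meta = {"target": target, "mtime": mtime}
        else:
            raise ValueError(f"unknown CONTENTS entry type: {kind!r}")

        # Walk (and create as needed) one dict level per directory component.
        node = root
        parts = path.strip("/").split("/")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        if meta is None:
            node.setdefault(parts[-1], {})
        else:
            node[parts[-1]] = meta
    return root


# Hypothetical sample entries, made up for illustration only.
sample = [
    "dir /usr",
    "dir /usr/lib",
    "dir /usr/lib/portage",
    "obj /usr/lib/portage/a.py 0123456789abcdef0123456789abcdef 1647219993",
    "obj /usr/lib/portage/b.py fedcba9876543210fedcba9876543210 1647219993",
    "sym /usr/lib/portage/c.py -> a.py 1647219993",
]

flat = "\n".join(sample)
nested = json.dumps(nest_contents(sample), separators=(",", ":"))
print(json.dumps(nest_contents(sample), indent=2))
```

In the flat form the prefix /usr/lib/portage is repeated on every
line, while in the nested form each directory name appears once as a
key. On a sample this small the byte counts say little; any saving
depends on how deep and how populated the real tree is.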
Compare the sizes above with the raw contents of the files:

$ ls -lh --block-size=1 | grep -v '\(environment.bz2\|repository\|\.ebuild\)' | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
378658

Yes, the json is actually smaller because it does not contain large
amounts of duplicated path strings in CONTENTS (which is 375320 bytes
by itself, or 89% of the total size).

I recommend json and think it is the best choice because:

- json provides the smallest on-disk footprint of the three formats
- json is part of Python's standard library (yaml is not; a read-only
  toml parser, tomllib, arrives in Python 3.11)
- Nearly every programming language has multiple json parsers -- lots
  of effort has been spent making them extremely fast.

I think we would need a significant transition period. I would add
support for the new format to Portage and ship a tool with Portage to
convert between the old and new formats on disk. Maybe after a year,
drop the old-format support from Portage?

Thoughts?

--000000000000954fd305da234ad4
Content-Type: text/x-python; charset="US-ASCII"; name="vdb.py"
Content-Disposition: attachment; filename="vdb.py"
Content-Transfer-Encoding: base64
Content-ID:
X-Attachment-Id: f_l0pyl5h70

IyEvdXNyL2Jpbi9lbnYgcHl0aG9uCgppbXBvcnQgYXJncGFyc2UKaW1wb3J0IGpzb24KaW1wb3J0 IHBwcmludAppbXBvcnQgc3lzCmltcG9ydCB0b21sCmltcG9ydCB5YW1sCgpmcm9tIHBhdGhsaWIg aW1wb3J0IFBhdGgKCgpkZWYgbWFpbihhcmd2KToKICAgIHBwID0gcHByaW50LlByZXR0eVByaW50 ZXIoaW5kZW50PTIpCgogICAgcGFyc2VyID0gYXJncGFyc2UuQXJndW1lbnRQYXJzZXIoKQogICAg Z3JvdXAgPSBwYXJzZXIuYWRkX211dHVhbGx5X2V4Y2x1c2l2ZV9ncm91cChyZXF1aXJlZD1UcnVl KQogICAgZ3JvdXAuYWRkX2FyZ3VtZW50KCctLWpzb24nLCBhY3Rpb249J3N0b3JlX3RydWUnKQog ICAgZ3JvdXAuYWRkX2FyZ3VtZW50KCctLXRvbWwnLCBhY3Rpb249J3N0b3JlX3RydWUnKQogICAg Z3JvdXAuYWRkX2FyZ3VtZW50KCctLXlhbWwnLCBhY3Rpb249J3N0b3JlX3RydWUnKQogICAgZ3Jv dXAuYWRkX2FyZ3VtZW50KCctLXBwcmludCcsIGFjdGlvbj0nc3RvcmVfdHJ1ZScpCiAgICBwYXJz ZXIuYWRkX2FyZ3VtZW50KCd2ZGJkaXInLCB0eXBlPXN0cikKCiAgICBvcHRzID0gcGFyc2VyLnBh
cnNlX2FyZ3MoYXJndlsxOl0pCgogICAgdmRiID0gUGF0aChvcHRzLnZkYmRpcikKICAgIGlmIG5v dCB2ZGIuaXNfZGlyKCk6CiAgICAgICAgcHJpbnQoZid7dmRifSBpcyBub3QgYSBkaXJlY3Rvcnkn KQogICAgICAgIHN5cy5leGl0KC0xKQoKICAgIGQgPSB7fQoKICAgIGZvciBmaWxlIGluICh4IGZv ciB4IGluIHZkYi5pdGVyZGlyKCkpOgogICAgICAgIGlmIG5vdCBmaWxlLm5hbWUuaXN1cHBlcigp OgogICAgICAgICAgICAjIHByaW50KGYiSWdub3JpbmcgZmlsZSB7ZmlsZS5uYW1lfSIpCiAgICAg ICAgICAgIGNvbnRpbnVlCgogICAgICAgIHZhbHVlID0gZmlsZS5yZWFkX3RleHQoKS5yc3RyaXAo J1xuJykKCiAgICAgICAgaWYgZmlsZS5uYW1lID09ICJDT05URU5UUyI6CiAgICAgICAgICAgIGNv bnRlbnRzID0ge30KCiAgICAgICAgICAgIGZvciBsaW5lIGluIHZhbHVlLnNwbGl0bGluZXMoa2Vl cGVuZHM9RmFsc2UpOgogICAgICAgICAgICAgICAgKHR5cGUsICpyZXN0KSA9IGxpbmUuc3BsaXQo c2VwPScgJykKICAgICAgICAgICAgICAgIHBhcnRzID0gcmVzdFswXS5zcGxpdChzZXA9Jy8nKQog ICAgICAgICAgICAgICAgcCA9IGNvbnRlbnRzCgogICAgICAgICAgICAgICAgaWYgdHlwZSA9PSAn ZGlyJzoKICAgICAgICAgICAgICAgICAgICBhc3NlcnQobGVuKHJlc3QpID09IDEpCgogICAgICAg ICAgICAgICAgICAgIGZvciBwYXJ0IGluIHBhcnRzWzE6XToKICAgICAgICAgICAgICAgICAgICAg ICAgcCA9IHAuc2V0ZGVmYXVsdChwYXJ0LCB7fSkKICAgICAgICAgICAgICAgIGVsc2U6CiAgICAg ICAgICAgICAgICAgICAgZm9yIHBhcnQgaW4gcGFydHNbMTotMV06CiAgICAgICAgICAgICAgICAg ICAgICAgIHAgPSBwLmdldChwYXJ0KQoKICAgICAgICAgICAgICAgIGlmIHR5cGUgPT0gJ29iaic6 CiAgICAgICAgICAgICAgICAgICAgYXNzZXJ0KGxlbihyZXN0KSA9PSAzKQogICAgICAgICAgICAg ICAgICAgIHBbcGFydHNbLTFdXSA9IHsnaGFzaCc6IHJlc3RbMV0sICdzaXplJzogcmVzdFsyXX0K ICAgICAgICAgICAgICAgIGVsaWYgdHlwZSA9PSAnc3ltJzoKICAgICAgICAgICAgICAgICAgICBh c3NlcnQobGVuKHJlc3QpID09IDQpCiAgICAgICAgICAgICAgICAgICAgcFtwYXJ0c1stMV1dID0g eyd0YXJnZXQnOiByZXN0WzJdLCAnc2l6ZSc6IHJlc3RbM119CgogICAgICAgICAgICBkW2ZpbGUu bmFtZV0gPSBjb250ZW50cwoKICAgICAgICBlbGlmIGZpbGUubmFtZSBpbiAoJ0RFRklORURfUEhB U0VTJywgJ0ZFQVRVUkVTJywgJ0hPTUVQQUdFJywKICAgICAgICAgICAgICAgICAgICAgICAgICAg J0lOSEVSSVRFRCcsICdJVVNFJywgJ0lVU0VfRUZGRUNUSVZFJywgJ0xJQ0VOU0UnLAogICAgICAg ICAgICAgICAgICAgICAgICAgICAnS0VZV09SRFMnLCAnUEtHVVNFJywgJ1JFU1RSSUNUJywgJ1VT 
RScpOgogICAgICAgICAgICBkW2ZpbGUubmFtZV0gPSB2YWx1ZS5zcGxpdCgnICcpCiAgICAgICAg ZWxzZToKICAgICAgICAgICAgZFtmaWxlLm5hbWVdID0gdmFsdWUKCiAgICBpZiBvcHRzLmpzb246 CiAgICAgICAganNvbi5kdW1wKGQsIHN5cy5zdGRvdXQpCiAgICBpZiBvcHRzLnRvbWw6CiAgICAg ICAgdG9tbC5kdW1wKGQsIHN5cy5zdGRvdXQpCiAgICBpZiBvcHRzLnlhbWw6CiAgICAgICAgeWFt bC5kdW1wKGQsIHN5cy5zdGRvdXQpCiAgICBpZiBvcHRzLnBwcmludDoKICAgICAgICBwcC5wcHJp bnQoZCkKCgppZiBfX25hbWVfXyA9PSAnX19tYWluX18nOgogICAgbWFpbihzeXMuYXJndikK --000000000000954fd305da234ad4--
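
The message above proposes shipping a tool that switches a package's
VDB entry back and forth between the old and new on-disk formats. A
minimal sketch of such a round trip follows, assuming every upper-case
file is plain text and is stored as a single string. The helper names
and the combined file name vdb.json are hypothetical, and this is not
Portage code.

```python
import json
import tempfile
from pathlib import Path


def read_vdb_dir(pkgdir):
    """Collect the upper-case one-file-per-variable entries into a dict."""
    data = {}
    for entry in sorted(Path(pkgdir).iterdir()):
        # Like the attached vdb.py, skip names that are not all upper case
        # (for example environment.bz2 or the saved ebuild).
        if entry.is_file() and entry.name.isupper():
            data[entry.name] = entry.read_text().rstrip("\n")
    return data


def write_vdb_dir(data, pkgdir):
    """Write each key back out as its own file, one trailing newline each."""
    pkgdir = Path(pkgdir)
    pkgdir.mkdir(parents=True, exist_ok=True)
    for name, value in data.items():
        (pkgdir / name).write_text(value + "\n")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical old-format entry with made-up values.
    old = Path(tmp) / "old"
    write_vdb_dir({"EAPI": "8", "SLOT": "0", "USE": "ipc native-extensions"}, old)

    # Old format -> single json file (the file name is an assumption).
    combined = Path(tmp) / "vdb.json"
    combined.write_text(json.dumps(read_vdb_dir(old), sort_keys=True))

    # Single json file -> old format again.
    new = Path(tmp) / "new"
    write_vdb_dir(json.loads(combined.read_text()), new)

    round_trip_ok = read_vdb_dir(old) == read_vdb_dir(new)
    print("round trip lossless:", round_trip_ok)
```

Keeping every value as a plain string makes the conversion trivially
reversible. Splitting list-like variables or nesting CONTENTS, as the
attached script does, would need a matching inverse step, and stripping
trailing newlines would lose information for any file that ends in more
than one.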