The VDB uses a one-file-per-variable format. This is inefficient on many file systems. For example, the 'EAPI' file, which contains a single character, consumes a 4K block on disk.

$ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
$ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
418517
$ du -sh --apparent-size .
413K    .
$ du -sh .
556K    .

During normal operations, portage has to read each of these 35+ files per package individually.

I suggest that we change the VDB format to a commonly used format that can be quickly read by portage and any other tools. Combining these 35+ files into a single file with a commonly used format should:

- speed up vdb access
- improve disk usage
- allow external tools to access VDB data more easily

I've attached a program that prints the VDB contents of a specified package in different formats: json, toml, and yaml (and also Python PrettyPrinter, just because). I think it's important to keep the VDB format as plain-text for ease of manipulation, so I have not considered anything like sqlite.

I expected to prefer toml, but I actually find it to be rather gross looking.

$ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
444663
$ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
385112
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
273428

toml and yaml are formatted in a human-readable manner, but json is not. Pipe the json output to app-misc/jq to get a better sense of its structure:

$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq ...

Compare with the raw contents of the files:

$ ls -lh --block-size=1 | grep -v '\(environment.bz2\|repository\|\.ebuild\)' | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
378658

Yes, the json is actually smaller, because it does not contain large amounts of duplicated path strings in CONTENTS (which is 375320 bytes by itself, or 89% of the total size).
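
To make the combining step concrete, here is a minimal sketch (not the attached vdb.py) that reads every plain-text file in a package's VDB directory into one dict and dumps it as JSON. The names read_vdb_entry and dump_json, and the choice to skip *.bz2 and *.ebuild files, are assumptions made for illustration. CONTENTS is kept as a raw string here, so this sketch does not deduplicate path strings.

```python
import json
import os

# Assumption for this sketch: skip compressed and ebuild files.
SKIP_SUFFIXES = (".bz2", ".ebuild")


def read_vdb_entry(pkg_dir):
    """Read each plain-text file in pkg_dir into a {filename: contents} dict."""
    entry = {}
    for name in sorted(os.listdir(pkg_dir)):
        path = os.path.join(pkg_dir, name)
        if not os.path.isfile(path) or name.endswith(SKIP_SUFFIXES):
            continue
        # Undecodable bytes are replaced rather than raising an error.
        with open(path, encoding="utf-8", errors="replace") as f:
            entry[name] = f.read()
    return entry


def dump_json(pkg_dir):
    """Serialize one package's metadata as compact, single-line JSON."""
    return json.dumps(read_vdb_entry(pkg_dir), separators=(",", ":"))
```

For example, print(dump_json("/var/db/pkg/sys-apps/portage-3.0.30-r1/")) would emit one JSON object with a key per VDB file.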

I recommend json and think it is the best choice because:

- json provides the smallest on-disk footprint
- json is part of Python's standard library (yaml is not, and needs a third-party module such as PyYAML; a toml parser will be in the standard library as of Python 3.11)
- Every programming language has multiple json parsers -- lots of effort has been spent making them extremely fast.

I think we would have a significant time period for the transition. I think I would include support for the new format in Portage, and ship a tool with portage to switch back and forth between old and new formats on-disk. Maybe after a year, drop the code from Portage to support the old format?

Thoughts?
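
As a rough idea of what the back-and-forth conversion tool could look like, here is a sketch under stated assumptions, not a proposed implementation. It assumes the new format is a flat JSON object mapping each old file name to its text contents. The function names and the handling of *.bz2 files (skipped) are hypothetical.

```python
import json
import os


def files_to_json(pkg_dir, json_path):
    """Old format -> new: pack one-file-per-variable into a single JSON file."""
    entry = {}
    for name in sorted(os.listdir(pkg_dir)):
        path = os.path.join(pkg_dir, name)
        # Assumption: compressed files are left out of the JSON.
        if not os.path.isfile(path) or name.endswith(".bz2"):
            continue
        with open(path, encoding="utf-8", newline="") as f:
            entry[name] = f.read()
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(entry, f)


def json_to_files(json_path, out_dir):
    """New format -> old: expand the JSON object back into one file per key."""
    with open(json_path, encoding="utf-8") as f:
        entry = json.load(f)
    os.makedirs(out_dir, exist_ok=True)
    for name, value in entry.items():
        # Refuse keys that could write outside out_dir.
        if os.path.basename(name) != name or name in ("", ".", ".."):
            raise ValueError(f"unsafe key: {name!r}")
        with open(os.path.join(out_dir, name), "w", encoding="utf-8", newline="") as f:
            f.write(value)
```

A round trip through both functions should reproduce the text files byte for byte, which would be a useful check during the transition period.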