The VDB uses a one-file-per-variable format. This is inefficient on
many file systems. For example, the 'EAPI' file, which contains a
single character, will consume a 4K block on disk.

$ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
$ ls -lh --block-size=1 |
    awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
418517
$ du -sh --apparent-size .
413K    .
$ du -sh .
556K    .
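
The same comparison can be made per directory from Python. This is
only a minimal sketch: the function name vdb_entry_sizes is made up
for illustration, and it relies on st_blocks being reported in
512-byte units, which is the case on Linux.

```python
import os


def vdb_entry_sizes(pkg_dir):
    """Return (apparent_bytes, allocated_bytes) for the regular files in pkg_dir."""
    apparent = 0
    allocated = 0
    with os.scandir(pkg_dir) as entries:
        for entry in entries:
            # Only count regular files; skip subdirectories and symlinks.
            if not entry.is_file(follow_symlinks=False):
                continue
            st = entry.stat(follow_symlinks=False)
            apparent += st.st_size
            # Assumption: st_blocks counts 512-byte units (true on Linux),
            # independent of the file system's own block size.
            allocated += st.st_blocks * 512
    return apparent, allocated
```

Calling vdb_entry_sizes("/var/db/pkg/sys-apps/portage-3.0.30-r1") should
give figures close to the two du invocations, minus the directory
entry itself.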

During normal operations, portage has to read each of these 35+
files per package individually.

I suggest that we change the VDB to a single file per package, in a
commonly used format that portage and any other tools can read
quickly. Combining these 35+ files into one should:

- speed up vdb access
- improve disk usage
- allow external tools to access VDB data more easily

I've attached a program that prints the VDB contents of a specified
package in different formats: json, toml, and yaml (and also Python
PrettyPrinter, just because). I think it's important to keep the VDB
format as plain-text for ease of manipulation, so I have not
considered anything like sqlite.
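
To give an idea of the conversion without opening the attachment,
here is a minimal sketch. It is not the attached vdb.py; the names
read_vdb_entry and dump_json are made up for illustration, and
skipping files that are not valid UTF-8 (such as environment.bz2) is
my assumption about how binary files would be handled.

```python
import json
import os


def read_vdb_entry(pkg_dir):
    """Map each text file name in pkg_dir to its contents, verbatim."""
    data = {}
    for name in sorted(os.listdir(pkg_dir)):
        path = os.path.join(pkg_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, encoding="utf-8") as f:
                data[name] = f.read()
        except UnicodeDecodeError:
            # Assumption: files that are not valid UTF-8 text are left
            # out of the combined file.
            continue
    return data


def dump_json(pkg_dir):
    """Serialize one VDB entry as compact JSON (no extra whitespace)."""
    return json.dumps(read_vdb_entry(pkg_dir), separators=(",", ":"), sort_keys=True)
```

Values are kept verbatim, including trailing newlines, so the original
files could be recreated from the JSON.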

I expected to prefer toml, but I actually find it to be rather gross looking.

$ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
444663
$ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
385112
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
273428

toml and yaml are formatted in a human-readable manner, but json is
not. Pipe the json output to app-misc/jq to get a better sense of its
structure:

$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq
...

Compare with the raw contents of the files:

$ ls -lh --block-size=1 |
    grep -v '\(environment.bz2\|repository\|\.ebuild\)' |
    awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
378658

Yes, the json is actually smaller because it does not contain large
amounts of duplicated path strings in CONTENTS (which is 375320 bytes
by itself, or 89% of the total size).
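
One layout that avoids repeating path prefixes is a nested mapping
keyed by path component. The sketch below is only an illustration of
that idea, not necessarily what vdb.py does; the function name
nest_contents and the leaf layout are made up, it handles only the
dir, obj and sym entry types, and it assumes no entry lives below a
symlink.

```python
def nest_contents(text):
    """Convert CONTENTS lines into a nested dict keyed by path component."""
    tree = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, rest = line.split(" ", 1)
        if kind == "dir":
            # dir <path>
            path, leaf = rest, None
        elif kind == "obj":
            # obj <path> <md5> <mtime>; split from the right so that
            # paths containing spaces stay intact.
            path, md5, mtime = rest.rsplit(" ", 2)
            leaf = ["obj", md5, int(mtime)]
        elif kind == "sym":
            # sym <path> -> <target> <mtime>
            link, mtime = rest.rsplit(" ", 1)
            path, target = link.split(" -> ", 1)
            leaf = ["sym", target, int(mtime)]
        else:
            raise ValueError(f"unhandled CONTENTS entry type: {kind!r}")
        parts = path.strip("/").split("/")
        node = tree
        # Walk (and create) the parent directories once per component.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        if leaf is None:
            node.setdefault(parts[-1], {})
        else:
            node[parts[-1]] = leaf
    return tree
```

Each directory name is then stored once, however many files sit
underneath it.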

I recommend json because:

- json provides the smallest on-disk footprint
- json is part of Python's standard library (yaml is not; it needs a
third-party module such as PyYAML. toml parsing will be in Python 3.11)
- Nearly every programming language has multiple json parsers, and
lots of effort has been spent making them extremely fast.

I think we would need a significant transition period. I would
include support for the new format in Portage, and ship a tool with
portage to convert between the old and new formats on-disk. Maybe
after a year, drop the code from Portage that supports the old
format?
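
During the transition, a reader could prefer the new file and fall
back to the old layout. A minimal sketch, where the file name
vdb.json and the function read_var are hypothetical and not part of
this proposal:

```python
import json
import os

JSON_NAME = "vdb.json"  # hypothetical name for the combined file


def read_var(pkg_dir, var):
    """Return one VDB variable, preferring the new single-file format."""
    json_path = os.path.join(pkg_dir, JSON_NAME)
    if os.path.exists(json_path):
        # New format: one JSON object holding every variable.
        with open(json_path, encoding="utf-8") as f:
            return json.load(f).get(var)
    try:
        # Old format: one file per variable.
        with open(os.path.join(pkg_dir, var), encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return None
```

Dropping old-format support later would then amount to removing the
fallback branch.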

Thoughts?