* [gentoo-dev] orphaned files on system?
@ 2003-04-17 11:41 leon j. breedt
2003-04-20 14:46 ` Daniel Armyr
2003-04-21 7:44 ` Evan Powers
0 siblings, 2 replies; 4+ messages in thread
From: leon j. breedt @ 2003-04-17 11:41 UTC
To: gentoo-dev
[-- Attachment #1.1: Type: text/plain, Size: 1185 bytes --]
hi,
i use the attached script to scan for unpackaged files on my filesystem,
and found quite a few in /etc, /usr/lib, /usr/X11R6 as well as the
expected places. most of them were symlinks, the intention being fairly
obvious (like the NVIDIA OpenGL stuff).
but i was hoping someone could explain why files like /etc/make.conf, /etc/csh.env,
/etc/env.d/05gcc and /usr/include/awk/acconfig.h didn't belong to any package.
i run the script with:
$ ./gtfilelint -v -C gtfilelint.conf -o orphans.list
use -h to see available params. multiple -v increases verbosity.
if you have a system with lots of packages, it's going to take some time,
as it caches all the /var/db/pkg/**/CONTENTS entries in a Berkeley hashdb
for quick lookups, then runs /usr/bin/find on /, and compares results. exclusions
to the find output are made by adding python re module regexes to
gtfilelint.conf.
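A minimal sketch of that flow in present-day Python 3, not the attached script itself: a plain set stands in for the Berkeley hashdb, and the name `find_orphans` and the sample paths are hypothetical.

```python
import re

def find_orphans(all_paths, packaged_paths, exclusion_patterns):
    """Return paths that are neither packaged nor excluded, in input order."""
    packaged = set(packaged_paths)  # in-memory stand-in for the hash db
    exclusions = [re.compile(p) for p in exclusion_patterns]
    orphans = []
    for path in all_paths:
        # Skip anything matched by an exclusion regex from the config file.
        if any(rx.match(path) for rx in exclusions):
            continue
        if path not in packaged:
            orphans.append(path)
    return orphans

# Hypothetical sample data standing in for find output and CONTENTS entries.
all_paths = ["/etc/make.conf", "/bin/ls", "/tmp/scratch", "/usr/bin/env"]
packaged = ["/bin/ls", "/usr/bin/env"]
print(find_orphans(all_paths, packaged, [r"^/tmp/.*"]))
```

The set lookup plays the role of `cachedb.has_key(path)` in the script; everything else is the same filter-then-compare loop.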
if you run it as user, you may get some error output from find about permissions.
you will want to specify a config file, otherwise you'll get a lot of stuff
you probably don't care about.
hope someone finds this useful
leon
--
in the beginning, was the code.
[-- Attachment #1.2: gtfilelint --]
[-- Type: text/plain, Size: 5980 bytes --]
#!/usr/bin/env python
#
# Finds files on Gentoo Linux systems that do not belong
# to any installed package.
#
# Released under the GNU GPL.
#
# (C) Copyright 2003 Leon J. Breedt
#
# $Id$

import dbhash
import getopt
import os
import os.path
import re
import string
import sys

TRUE = 1
FALSE = 0
version = '0.1.1'
configfile = '/etc/gtfilelint.conf'
dbdir = '/var/db/pkg'
cachefile = '/tmp/gtfilelint.db'
outputfile = None
warnmissing = FALSE
findcmd = 'find / -print'
exclusions = []
verbosity = 0
cachedb = None

def verb(msg, level=1):
    if verbosity >= level:
        sys.stderr.write('-- %s\n' % msg)

def vverb(msg):
    verb(msg, 2)

def info(msg):
    sys.stderr.write('>> %s\n' % msg)

def error(msg):
    sys.stderr.write('error: %s\n' % msg)
    sys.exit(1)

def warn(msg):
    sys.stderr.write('warning: %s\n' % msg)

def usage():
    print 'usage: %s [options]' % sys.argv[0]
    print 'options:'
    print '-h|--help display this message'
    print '-V|--version print program version and exit'
    print '-v|--verbose print verbose messages about what is being done'
    print '-d|--dbdir directory containing package database (default: %s)' % dbdir
    print '-c|--cachefile file to place temporary cache in (default: %s)' % cachefile
    print '-C|--configfile configuration file (default: %s)' % configfile
    print '-o|--outputfile file to print orphan list to (default: stdout)'
    print '--warnmissing warn if files declared in CONTENTS don\'t exist'

def parse_cmdline():
    global verbosity, dbdir, cachefile, configfile, outputfile, warnmissing
    opts, args = getopt.getopt(sys.argv[1:], "hVvd:c:C:o:", ["help", "version", "verbose", "dbdir=", "cachefile=", "configfile=", "outputfile=", "warnmissing"])
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit(0)
        if opt in ("-V", "--version"):
            print version
            sys.exit(0)
        if opt in ("-v", "--verbose"):
            verbosity = verbosity + 1
        if opt in ("-d", "--dbdir"):
            dbdir = arg
        if opt in ("-c", "--cachefile"):
            cachefile = arg
        if opt in ("-C", "--configfile"):
            configfile = arg
        if opt in ("-o", "--outputfile"):
            outputfile = arg
        if opt == "--warnmissing":
            warnmissing = TRUE

def parse_config():
    if not os.path.exists(configfile) or not os.access(configfile, os.R_OK):
        warn('missing configfile "%s"' % configfile)
        return
    fp = open(configfile, 'r')
    for line in fp.readlines():
        line = string.strip(line)
        if len(line) == 0:
            continue
        if line[0] == '#':
            continue
        exclusions.append(re.compile(line))
        verb('adding "%s" to list of exclusion regular expressions' % line)
    fp.close()

def cache_package_files(package, packagepath):
    verb('caching contents of "%s"' % package)
    fp = open(packagepath + '/CONTENTS')
    lineno = 0
    for line in fp.readlines():
        lineno = lineno + 1
        line = string.strip(line)
        if len(line) == 0:
            continue
        key = None
        m = re.match(r"^dir (\S.*)$", line)
        if m:
            key = m.group(1)
            m = None
        else:
            m = re.match(r"^obj (\S.*) (\S+) (\d+)\s*$", line)
            if m:
                key = m.group(1)
                m = None
            m = re.match(r"^sym (\S.*) -> .*$", line)
            if m:
                key = m.group(1)
        if key != None:
            if not os.path.exists(key) and warnmissing:
                warn('%s: "%s" does not exist on filesystem, ignoring' % (package, key))
            vverb('caching "%s"' % key)
            cachedb[key] = ''
        else:
            vverb('key is None for "%s" CONTENTS line %d' % (package, lineno))
    fp.close()

def scan_group_packages(group, grouppath):
    packages = os.listdir(grouppath)
    packages.sort()
    verb('found %d packages in group "%s"' % (len(packages), group))
    for package in packages:
        packagepath = grouppath + '/' + package
        cache_package_files(package, packagepath)

def create_system_filelist():
    info('scanning all files on system')
    sout = os.popen(findcmd, 'r')
    verb('reading paths from "%s"' % findcmd)
    paths = sout.readlines()
    orphans = 0
    rptfp = None
    for path in paths:
        path = string.strip(path)
        if len(path) == 0:
            continue
        if path[0] != '/':
            warn('ignoring relative path "%s"' % path)
            continue
        matched = FALSE
        for exre in exclusions:
            if exre.match(path):
                matched = TRUE
                break
        if matched:
            vverb('"%s" matched exclusion regex, ignoring' % path)
            continue
        if not cachedb.has_key(path):
            if orphans == 0:
                if outputfile:
                    info('writing orphaned file list [%s]' % outputfile)
                    rptfp = open(outputfile, 'w+')
                else:
                    info('orphaned files:')
                    rptfp = sys.stdout
                    rptfp.flush()
            orphans = orphans + 1
            rptfp.write('%s\n' % path)
            rptfp.flush()
    if rptfp:
        rptfp.close()
    sout.close()
    if orphans > 0:
        info('%d orphaned file(s) found' % orphans)
    else:
        info('no orphaned files on system')

# Main
try:
    parse_cmdline()
    parse_config()
    info('creating packaged files cache [%s]' % cachefile)
    cachedb = dbhash.open(cachefile, 'n')
    try:
        groups = os.listdir(dbdir)
        groups.sort()
        for group in groups:
            grouppath = dbdir + '/' + group
            scan_group_packages(group, grouppath)
        create_system_filelist()
    finally:
        if cachedb:
            cachedb.close()
            os.unlink(cachefile)
except KeyboardInterrupt:
    pass
except:
    raise
[-- Attachment #1.3: gtfilelint.conf --]
[-- Type: text/plain, Size: 1099 bytes --]
# we don't really care about these dynamic paths
^/var/log/.*
^/var/db/.*
^/var/spool/.*
^/var/tmp/.*
^/var/lib/.*
^/var/cache/.*
^/var/run/.*
# / is not owned by any package
^/$
# don't care about root's config files
^/root/.*
# /usr/local is typically just user compiled stuff,
# don't care about it -- this list from baselayout
^/usr/local/bin/.*
^/usr/local/doc$
^/usr/local/lib/.*
^/usr/local/man$
^/usr/local/src/.*
^/usr/local/sbin/.*
^/usr/local/games/.*
^/usr/local/share/doc/.*
^/usr/local/share/man/.*
^/usr/local/share/.*
# what is /lib/dev-state? dunno...but the dir is in
# baselayout, even if the files arent
^/lib/dev-state/.*
# devices aren't that important to us...the packaged
# files will not be visible anyway due to devfs
^/dev/.*
# anyone packaging anything into /tmp should be shot
^/tmp/.*
# portage tree we don't care about either
^/usr/portage$
^/usr/portage/.*
# mountpoints shouldn't have package files installed in them
^/mnt/.*
# system filesystems should be ignored
^/proc/.*
^/sys/.*
# USER CUSTOMIZATIONS
^/data
^/data/.*
^/cdrom.*
^/windata
^/windata/.*
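How these lines behave under the script's `exre.match(path)` test can be checked with a small sketch (Python 3; `is_excluded` and the sample paths are hypothetical, chosen only for illustration). Because `re.match` anchors at the start of the string but not at the end, a pattern such as `^/data` acts as a prefix match, so it would also exclude a path like `/database`.

```python
import re

# A few patterns copied from the config above.
patterns = [r"^/data", r"^/data/.*", r"^/cdrom.*", r"^/usr/local/doc$"]
compiled = [re.compile(p) for p in patterns]

def is_excluded(path):
    """True if any pattern matches at the start of path, as re.match does."""
    return any(rx.match(path) for rx in compiled)

for sample in ["/data", "/data/music", "/database", "/usr/local/doc/README"]:
    print(sample, is_excluded(sample))

# A stricter alternative (an assumption, not part of the original config):
# match the directory itself or anything below it, but not siblings.
strict = re.compile(r"^/data(/.*)?$")
print(bool(strict.match("/data/music")), bool(strict.match("/database")))
```

If only the directory and its contents are meant, the single stricter pattern can replace the `^/data` and `^/data/.*` pair.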
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
* Re: [gentoo-dev] orphaned files on system?
From: Daniel Armyr @ 2003-04-20 14:46 UTC
To: leon j. breedt; +Cc: gentoo-dev
Hm, gave it a run. The code looks pretty neat. I haven't studied it
enough to know exactly what it does, but it seems to have some really
nifty features. Unfortunately, it seems awfully slow and heavy on the
system. My first run crashed X, I think. The second hung while caching
vanilla-sources. The third time worked, though, and I like the brevity of
the list it produces. A check of the processor time used says it is about 4
times slower than a simple sh script that was presented a while back,
which does no filtering of its output; it simply prints all files on the
system not mentioned in /var/db/pkg. Either way, it looks like a good
program, but I think it would be worth trying to find ways to
make it faster.
//Daniel Armyr
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] orphaned files on system?
From: Evan Powers @ 2003-04-21 7:44 UTC
To: gentoo-dev
On Thursday 17 April 2003 07:41 am, leon j. breedt wrote:
> i use the attached script to scan for unpackaged files on my filesystem,
There was a thread about such scripts here in early March, if you didn't know.
You could find it in the archives; might give you some ideas. Subject was
"Cruft detecting script".
> if you have a system with lots of packages, its going to take some time,
> as it caches all the /var/db/pkg/**/CONTENTS entries in a Berkeley hashdb
> for quick lookups, then runs /usr/bin/find on /, and compares results.
Hmm.... Have you timed your script yet? What sorts of run times are you
getting with that implementation?
Back in the aforementioned March thread, I posted a pretty naive script
(included at bottom for reference) to do the same thing; on my system it ran
in about 46 seconds. Is that faster or slower than your script? Granted yours
almost certainly does more than mine, but I would think the run time would be
dominated by the generation and comparison of the two file manifests, so the
numbers should be comparable.
The reason I even mention it is that I'm wondering if the hash table is a good
data structure in this situation. My gut tells me it isn't, but I can't argue
with timing numbers that say otherwise.
I'm thinking:
1) creating the hash table and creating two sorted lists (find output and the
CONTENTS) are tasks of roughly equivalent complexity (the hash probably wins
by a modest margin)
2) comparing two sorted lists has complexity of O(n), but using the hash table
is going to have a complexity maybe of O(C*n), where C is some large constant
dependent on the hash bucket size or something
3) the sorted list method probably uses less memory, which is probably
important given the dataset size; also, it probably has considerably better
locality of reference
Like I said, I can't argue with numbers though. Have any thoughts on this
analysis?
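The sorted-list comparison from point 2 can be sketched as a single merge pass, set beside the hash-based equivalent for reference. This is a Python 3 illustration under stated assumptions: `only_in_first` is a hypothetical name, both inputs are assumed to be sorted by plain string comparison, and nothing here measures which approach is faster. A hash lookup is O(1) on average, so both comparisons are linear in the number of paths once their inputs are prepared (the sort itself is O(n log n)); any real difference would come from constant factors and memory behaviour, which only timing can settle.

```python
def only_in_first(sorted_a, sorted_b):
    """Yield items of sorted_a that are absent from sorted_b.

    Both inputs must be sorted in the same order; this mirrors what
    `comm -2 -3` does with two sorted files.
    """
    it_b = iter(sorted_b)
    b = next(it_b, None)
    for a in sorted_a:
        # Advance through sorted_b until b is no longer smaller than a.
        while b is not None and b < a:
            b = next(it_b, None)
        if b is None or a != b:
            yield a

all_files = sorted(["/etc/make.conf", "/bin/ls", "/usr/bin/env", "/etc/csh.env"])
portage_files = sorted(["/usr/bin/env", "/bin/ls"])

merged = list(only_in_first(all_files, portage_files))
hashed = sorted(set(all_files) - set(portage_files))  # hash-based equivalent
print(merged)
print(merged == hashed)
```

The merge pass holds only one element of each list at a time, which is the memory and locality advantage point 3 refers to.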
Evan
---script-cruft.sh---
#!/bin/sh
find / '(' -path /proc \
-or -path /dev \
-or -path /boot \
-or -path /mnt \
-or -path /tmp \
-or -path /var/tmp \
-or -path /root \
-or -path /home \
-or -path /lib/dev-state \
-or -path /lib/modules \
-or -path /usr/portage \
-or -path /var/cache/edb \
-or -path /var/db/pkg \
')' -prune -or -print \
| sort >/tmp/allfiles
qpkg -nc -l \
| sed -n -e 's/ -> .*//' -e '1,2 d' -e '/^$/,+2! p' \
| sort \
| uniq >/tmp/portagefiles
comm -2 -3 /tmp/allfiles /tmp/portagefiles
---script-cruft.sh---
* Re: [gentoo-dev] orphaned files on system?
From: leon j. breedt @ 2003-04-21 20:10 UTC
To: gentoo-dev
[-- Attachment #1.1: Type: text/plain, Size: 1908 bytes --]
On Mon, Apr 21, 2003 at 03:44:21AM -0400, Evan Powers wrote:
> Hmm.... Have you timed your script yet? What sorts of run times are you
> getting with that implementation?
Comparing the execution time of my implementation to yours...
Let's say mine has some issues. It runs find on the entire
filesystem tree, reads the complete output of find into memory,
and performs regex-based filtering after the fact instead of
as command-line parameters to find like you do.
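One way to address both issues from inside Python is to walk the tree lazily and prune excluded directories before descending into them. The following is a Python 3 sketch under stated assumptions, not a patch to gtfilelint: `walk_pruned` is a hypothetical name, and the demo builds a throwaway directory tree rather than touching the real filesystem.

```python
import os
import tempfile

def walk_pruned(root, pruned_dirs):
    """Lazily yield paths below root, skipping the listed directories entirely."""
    pruned = {os.path.normpath(p) for p in pruned_dirs}
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into the
        # removed entries, roughly what `find -path X -prune` achieves.
        dirnames[:] = [d for d in dirnames
                       if os.path.join(dirpath, d) not in pruned]
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)

# Demo on a temporary tree with one directory to keep and one to prune.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "etc"))
    os.makedirs(os.path.join(root, "proc", "1"))
    open(os.path.join(root, "etc", "make.conf"), "w").close()
    found = sorted(os.path.relpath(p, root)
                   for p in walk_pruned(root, [os.path.join(root, "proc")]))
print(found)
```

Because the function is a generator, paths are produced one at a time instead of being read into a list first, and pruned subtrees are never visited at all.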
To be honest, I'm ashamed that creating two sorted manifests and
comparing didn't occur to me. Your solution is annoyingly compact
and simple :)
I tweaked your script to not use qpkg -nc -l but inline Perl
parsing, and got some amazing results (roughly eight times faster on
this system).
The numbers...
Your original script:
$ time ./script-cruft.sh
./script-cruft.sh 14.56s user 0.92s system 100% cpu 15.473 total
My tweaked version:
$ time ./script-cruft-fast.sh
./script-cruft-fast.sh > fast 1.33s user 0.46s system 99% cpu 1.792 total
My Python script (oh the shame):
$ time ./gtfilelint -C gtfilelint.conf -o output.list
./gtfilelint <...> 15.86s user 7.15s system 99% cpu 23.105 total
These times are after Linux caching has kicked in. Executed
the scripts multiple times, reported only the final times. Before
caching, my script took about 160s, then 50, then 40, then 32, and
finally 23.
I guess I'll be using the tweaked version of your script from now on
(attached) :)
Interestingly, delving into the innards of epm and qpkg revealed a
bug in their CONTENTS parsing code: they can't handle filenames with
spaces in them. They truncate the filename at the first space.
Other than that, the output generated by my tweaked version and your
original should be identical for the same set of paths to exclude.
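A sketch of space-tolerant CONTENTS parsing in Python 3, assuming the line layouts implied by the regexes in gtfilelint (`dir <path>`, `obj <path> <checksum> <mtime>`, `sym <path> -> <target> ...`). The function name `contents_path` and the sample lines are hypothetical; this is not the code from epm, qpkg, or the attached script.

```python
def contents_path(line):
    """Return the path recorded on one CONTENTS line, or None if unrecognised."""
    line = line.rstrip("\n")
    kind, _, rest = line.partition(" ")
    if kind == "dir":
        return rest or None
    if kind == "obj":
        # The last two fields are checksum and mtime; the path is everything
        # before them, so spaces inside the path survive.
        parts = rest.rsplit(" ", 2)
        return parts[0] if len(parts) == 3 else None
    if kind == "sym":
        # The link name is everything before the first " -> ".
        name, sep, _ = rest.partition(" -> ")
        return name if sep else None
    return None

sample = "obj /usr/share/My File.txt d41d8cd98f00b204e9800998ecf8427e 1050579660"
print(contents_path(sample))   # full path, space included
print(sample.split()[1])       # naive whitespace split truncates at the space
```

Splitting from the right for `obj` lines is the key design choice: the trailing fields never contain spaces, so the path can. A link name that itself contains ` -> ` would still be ambiguous with this approach.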
Leon
--
in the beginning, was the code.
[-- Attachment #1.2: script-cruft-fast.sh --]
[-- Type: application/x-sh, Size: 943 bytes --]
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]