Date: Sun, 25 Apr 2010 04:45:20 -0700
From: Brian Harring
To: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] [RFC][NEW] Utility to find orphaned files
Message-ID: <20100425114519.GD16877@hrair>
In-Reply-To: <4BD42501.9070505@gentoo.org>

On Sun, Apr 25, 2010 at 01:18:25PM +0200, Angelo Arrifano wrote:
> Hello developers developers and developers,
>
> Ever wondered how much crap is left in your X-years old Gentoo box?
>
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
>
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
>
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
>
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

You're going to want to do realpathing here... also you'll need to
handle symlinks, and spaces are allowed in paths. I'd personally suggest
using one of the package manager (PM) APIs for this.
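To make the realpath, symlink and spaces-in-paths points concrete, here is a minimal sketch of what a hand-rolled parser would have to get right. It is not from the original utility or from any PM API. The helper names `parse_contents_line` and `normalize` are hypothetical, and the line layout in the docstring is an assumption about how vdb CONTENTS entries look.

```python
import os


def parse_contents_line(line):
    """Split one CONTENTS-style line into (kind, path).

    Assumed layout (an assumption, not taken from the thread):
      dir <path>
      obj <path> <md5> <mtime>
      sym <path> -> <target> <mtime>
    Paths may contain spaces, so fixed fields are peeled off the right.
    """
    kind, rest = line.rstrip("\n").split(" ", 1)
    if kind == "obj":
        path, _md5, _mtime = rest.rsplit(" ", 2)
    elif kind == "sym":
        link_and_target, _mtime = rest.rsplit(" ", 1)
        path, _target = link_and_target.split(" -> ", 1)
    else:
        # dir and anything unrecognised: the remainder is the path
        path = rest
    return kind, path


def normalize(path):
    """Resolve symlinks in the parent directory only.

    The final component is kept as-is, so a symlink entry is compared
    as itself rather than as the file it points to.
    """
    parent, name = os.path.split(path)
    return os.path.join(os.path.realpath(parent), name)
```

Both the recorded paths and the paths found on disk would need to go through the same normalization before being compared, otherwise a file reached through a symlinked directory shows up as orphaned.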
Part of the reason I advise poking at the PM APIs is that they cover up
some of the nastier details with contents and others with parsing; simple
example,

python -c "
import sys
from pkgcore.config import load_config
from pkgcore.fs import contents, livefs
# Collect the contents of every installed package in the vdb.
contents = contents.contentsSet()
for pkg in load_config().get_default('domain').named_repos['vdb']:
  contents.update(pkg.contents)
# Anything found on disk that no package claims is an orphan.
stream = (x for x in livefs.iter_scan(sys.argv[1]) if x not in contents)
print('\n'.join(map(str, sorted(stream))))
" desired-path

Note also that's a *very* quick write-up. I'd personally look at
serializing the sorted lists to disk for both streams (what contents
says is on disk vs what is on disk), and then lockstep walking the
lists; via that you can keep the memory usage down.

~harring
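The lockstep walk over two sorted lists can be sketched roughly as follows. This is an illustration only, not code from pkgcore or from the original utility; the names `orphans` and `read_lines` are hypothetical, and it assumes both lists were sorted with the same ordering and hold one path per line.

```python
def read_lines(filename):
    """Yield the lines of a file without their trailing newline."""
    with open(filename) as handle:
        for line in handle:
            yield line.rstrip("\n")


def orphans(owned_sorted, on_disk_sorted):
    """Yield entries of on_disk_sorted that are absent from owned_sorted.

    Both inputs are iterables of strings in ascending order; only one
    item from each stream is held in memory at a time.
    """
    owned = iter(owned_sorted)
    sentinel = object()
    current = next(owned, sentinel)
    for path in on_disk_sorted:
        # Advance the owned stream until it catches up with `path`.
        while current is not sentinel and current < path:
            current = next(owned, sentinel)
        if current is sentinel or current != path:
            yield path
```

With both lists serialized to disk, something like `orphans(read_lines("owned.txt"), read_lines("on_disk.txt"))` would stream the result; the two file names are placeholders. Paths containing newlines would need a different record separator.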