public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] [RFC][NEW] Utility to find orphaned files
@ 2010-04-25 11:18 Angelo Arrifano
  2010-04-25 11:45 ` Brian Harring
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Angelo Arrifano @ 2010-04-25 11:18 UTC
  To: List/Gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 850 bytes --]

Hello developers developers and developers,

Ever wondered how much crap is left in your X-year-old Gentoo box?

I just developed a python utility to efficiently find orphaned files in
the system. By orphaned files I mean the files that are present on
system directories and don't belong to any installed package.

The package builds a virtual filesystem (cache) in RAM using Python
hash tables. It then uses the cache to look up the ownership of files
inside user-specified directories.
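
A minimal sketch of the idea, not the code from the attached tarball. It assumes that Portage records each installed package's files in a CONTENTS file under /var/db/pkg, with entries of the form "dir PATH", "obj PATH MD5 MTIME" and "sym PATH -> TARGET MTIME". All function names are hypothetical.

```python
import os


def parse_contents_line(line):
    """Return the path named by one CONTENTS entry, or None if unrecognized."""
    kind, _, rest = line.rstrip("\n").partition(" ")
    if kind == "dir":
        return rest
    if kind == "obj":
        # The path may contain spaces; md5 and mtime are the last two fields.
        parts = rest.rsplit(" ", 2)
        return parts[0] if len(parts) == 3 else None
    if kind == "sym":
        # Assumed layout: "PATH -> TARGET MTIME"
        path, sep, _ = rest.partition(" -> ")
        return path if sep else None
    return None


def build_owned_set(vdb_root="/var/db/pkg"):
    """Collect every path claimed by any CONTENTS file below vdb_root."""
    owned = set()
    for dirpath, _dirnames, filenames in os.walk(vdb_root):
        if "CONTENTS" not in filenames:
            continue
        contents_file = os.path.join(dirpath, "CONTENTS")
        with open(contents_file, errors="surrogateescape") as fh:
            for line in fh:
                path = parse_contents_line(line)
                if path:
                    owned.add(path)
    return owned


def find_orphans(top, owned):
    """Yield every path below top that no package claims."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if path not in owned:
                yield path
```

A set gives constant-time membership tests on average, which is what makes the ownership check cheap once the cache has been built.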

Building the cache takes less than 10 seconds here on a system with 1366
installed packages.

This is not intended to be a finished program yet; I'm looking forward
to your constructive comments.

[Attached]

Regards,
-- 
Angelo Arrifano AKA MiKNiX
Gentoo Embedded/OMAP850 Developer
Linwizard Developer
http://www.gentoo.org/~miknix
http://miknix.homelinux.com

[-- Attachment #2: find-orphaned-0.01.tar.bz2 --]
[-- Type: application/x-bzip, Size: 7939 bytes --]


* Re: [gentoo-dev] [RFC][NEW] Utility to find orphaned files
  2010-04-25 11:18 [gentoo-dev] [RFC][NEW] Utility to find orphaned files Angelo Arrifano
@ 2010-04-25 11:45 ` Brian Harring
  2010-04-25 13:43 ` Daniel Pielmeier
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Brian Harring @ 2010-04-25 11:45 UTC
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1793 bytes --]

On Sun, Apr 25, 2010 at 01:18:25PM +0200, Angelo Arrifano wrote:
> Hello developers developers and developers,
> 
> Ever wondered how much crap is left in your X-years old Gentoo box?
> 
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
> 
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
> 
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
> 
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

You're going to want to resolve paths with realpath here. You'll also
need to handle symlinks, and keep in mind that spaces are allowed in
paths.  I'd personally suggest using one of the package manager (PM)
APIs for this.
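
To illustrate the realpath point: a file recorded as /lib/libfoo.so and a file found on disk as /lib64/libfoo.so are the same file when /lib is a symlink to /lib64, so both sides need to be brought into one canonical form before comparing. The sketch below is one possible approach, with a hypothetical helper name. It resolves symlinks in the parent directories only, so a symlink that is itself a packaged entry keeps its own name.

```python
import os


def canonical(path):
    """Resolve symlinks in the parent directories, but not in the last component."""
    path = path.rstrip("/") or "/"
    parent, name = os.path.split(path)
    return os.path.join(os.path.realpath(parent), name)
```

Applying the same function to the recorded paths and to the paths found on disk makes the two sets directly comparable.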

Part of the reason I advise poking at the PM APIs is that they hide
some of the nastier details of contents handling, and others of
parsing. A simple example:

python -c "
import sys
from pkgcore.config import load_config
from pkgcore.fs import contents, livefs

# Collect everything the installed packages (the vdb) claim to own.
owned = contents.contentsSet()
for pkg in load_config().get_default('domain').named_repos['vdb']:
    owned.update(pkg.contents)

# Anything found under the given path that is absent from the
# recorded contents is an orphan.
orphans = (x for x in livefs.iter_scan(sys.argv[1]) if x not in owned)
print('\n'.join(map(str, sorted(orphans))))
" desired-path

Note also that this was written *very* quickly.  I'd personally look at
serializing the sorted lists to disk for both streams (what contents
says is on disk vs. what is actually on disk), and then walking the two
lists in lockstep; that way you can keep the memory usage down.
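
The lockstep walk could look roughly like the sketch below. It is independent of pkgcore, the function name is hypothetical, and it assumes both inputs are iterables sorted in the same order (for example, lines read from two files sorted with the same collation).

```python
def only_on_disk(recorded, on_disk):
    """Yield items of on_disk that do not appear in recorded.

    Both iterables must be sorted ascending; only one item from each
    stream is held in memory at a time.
    """
    recorded = iter(recorded)
    sentinel = object()
    current = next(recorded, sentinel)
    for path in on_disk:
        # Advance the recorded stream until it catches up with path.
        while current is not sentinel and current < path:
            current = next(recorded, sentinel)
        if current is sentinel or current != path:
            yield path
```

Because neither stream is ever fully loaded, memory use stays flat regardless of how many files are installed.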

~harring

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]


* Re: [gentoo-dev] [RFC][NEW] Utility to find orphaned files
  2010-04-25 11:18 [gentoo-dev] [RFC][NEW] Utility to find orphaned files Angelo Arrifano
  2010-04-25 11:45 ` Brian Harring
@ 2010-04-25 13:43 ` Daniel Pielmeier
  2010-04-30 16:24   ` Enrico Weigelt
  2010-04-25 15:34 ` [gentoo-dev] " Yuri Vasilevski
  2010-04-25 17:43 ` Benedikt Böhm
  3 siblings, 1 reply; 9+ messages in thread
From: Daniel Pielmeier @ 2010-04-25 13:43 UTC
  To: gentoo-dev


[-- Attachment #1.1: Type: text/plain, Size: 2013 bytes --]

Angelo Arrifano wrote on 25.04.2010 13:18:
> Hello developers developers and developers,
> 
> Ever wondered how much crap is left in your X-years old Gentoo box?
> 
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
> 
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
> 
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
> 
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

What about searching the complete file system, but with an exclude file
listing the directories and files that should not be searched? It is tedious to
specify every path on the command line. Also, if you specify /lib, for instance,
it will also search under /lib/modules, and I am sure you do not consider all
contents there unneeded.
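
A rough sketch of how such an exclude file could be applied while walking the tree. The file format is an assumption (one shell-style pattern per line, with blank lines and "#" comments ignored), and the function names are hypothetical; the attached scripts may work differently.

```python
import fnmatch
import os


def load_excludes(path):
    """Read one pattern per line, skipping blank lines and '#' comments."""
    with open(path) as fh:
        lines = (ln.strip() for ln in fh)
        return [ln for ln in lines if ln and not ln.startswith("#")]


def is_excluded(path, patterns):
    return any(fnmatch.fnmatch(path, pat) for pat in patterns)


def walk_with_excludes(top, patterns):
    """Yield paths below top, never descending into excluded directories."""
    for dirpath, dirnames, filenames in os.walk(top):
        # Prune in place so os.walk skips excluded directories entirely.
        dirnames[:] = [
            d for d in dirnames
            if not is_excluded(os.path.join(dirpath, d), patterns)
        ]
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not is_excluded(path, patterns):
                yield path
```

With a pattern such as /lib/modules in the exclude file, a scan of /lib would still report everything else in /lib but never look inside the module trees.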

You also need to consider that your tool will return other false positives, such
as byte-compiled Python modules and Perl header files. In general, this applies to
anything an ebuild adds to the file system in phases whose files are not recorded
in CONTENTS (pkg_{pre,post}inst). At this point the files are needed but
not known to the package manager. If the ebuild does not take care of these
files when removing the package (pkg_{pre,post}rm), they will remain on the
file system and are then unneeded.
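
For the byte-compiled Python case, one possible heuristic is to treat a .pyc or .pyo file as owned whenever the .py file it was compiled from is owned. The sketch below uses hypothetical function names and covers both the classic layout (foo.pyc next to foo.py) and the __pycache__ layout introduced by PEP 3147. Perl headers and other generated files would need rules of their own.

```python
import os


def bytecode_source(path):
    """Return the .py path a byte-compiled file belongs to, or None."""
    directory, name = os.path.split(path)
    base, ext = os.path.splitext(name)
    if ext not in (".pyc", ".pyo"):
        return None
    if os.path.basename(directory) == "__pycache__":
        # PEP 3147: pkg/__pycache__/mod.cpython-311.pyc -> pkg/mod.py
        directory = os.path.dirname(directory)
        base = base.split(".", 1)[0]
    return os.path.join(directory, base + ".py")


def drop_bytecode_false_positives(orphans, owned):
    """Filter out byte-compiled files whose source file is owned."""
    for path in orphans:
        source = bytecode_source(path)
        if source is not None and source in owned:
            continue
        yield path
```

A byte-compiled file whose source is gone still shows up in the output, which matches the case described above where the package was removed but its generated files were left behind.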

I have written something in Perl, which I recently tried to reimplement in Python
(the Python version does not yet have the same functionality as the Perl one). I am
not a good Perl or Python programmer, but it fits my needs, especially the Perl
version, as I know a bit more Perl than Python.

I attach both versions and a sample exclude file. Maybe they will be of help.

-- 
Daniel Pielmeier

[-- Attachment #1.2: cruft.tar.bz2 --]
[-- Type: application/x-bzip2, Size: 5687 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]


* Re: [gentoo-dev] [RFC][NEW] Utility to find orphaned files
  2010-04-25 11:18 [gentoo-dev] [RFC][NEW] Utility to find orphaned files Angelo Arrifano
  2010-04-25 11:45 ` Brian Harring
  2010-04-25 13:43 ` Daniel Pielmeier
@ 2010-04-25 15:34 ` Yuri Vasilevski
  2010-04-25 17:10   ` Angelo Arrifano
  2010-04-25 17:43 ` Benedikt Böhm
  3 siblings, 1 reply; 9+ messages in thread
From: Yuri Vasilevski @ 2010-04-25 15:34 UTC
  To: gentoo-dev

Hello,

On Sun, 25 Apr 2010 13:18:25 +0200
Angelo Arrifano <miknix@gentoo.org> wrote:

> Hello developers developers and developers,
> 
> Ever wondered how much crap is left in your X-years old Gentoo box?
> 
> I just developed a python utility to efficiently find orphaned files
> in the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
> 
> The package builds a virtual filesystem (cache) on the RAM using
> python hash tables. Then it uses the cache to find the ownership of
> files inside user-specified dirs.
> 
> Building the cache takes less than 10 seconds here in a system with
> 1366 installed packages.
> 
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

There is a tool that does that, qfile from app-portage/portage-utils.
Check the "-o, --orphans        * List orphan files" option.

It's not as straightforward as it could be, as it only checks
files specified as arguments or read from a file.

But you can trivially use it like:
# find /dir/you/want/to/check/for/orphans | qfile -o -f -

Best,
Yuri.




* Re: [gentoo-dev] [RFC][NEW] Utility to find orphaned files
  2010-04-25 15:34 ` [gentoo-dev] " Yuri Vasilevski
@ 2010-04-25 17:10   ` Angelo Arrifano
  0 siblings, 0 replies; 9+ messages in thread
From: Angelo Arrifano @ 2010-04-25 17:10 UTC
  To: gentoo-dev

On 25-04-2010 17:34, Yuri Vasilevski wrote:
> Hello,
> 
> On Sun, 25 Apr 2010 13:18:25 +0200
> Angelo Arrifano <miknix@gentoo.org> wrote:
> 
>> Hello developers developers and developers,
>>
>> Ever wondered how much crap is left in your X-years old Gentoo box?
>>
>> I just developed a python utility to efficiently find orphaned files
>> in the system. By orphaned files I mean the files that are present on
>> system directories and don't belong to any installed package.
>>
>> The package builds a virtual filesystem (cache) on the RAM using
>> python hash tables. Then it uses the cache to find the ownership of
>> files inside user-specified dirs.
>>
>> Building the cache takes less than 10 seconds here in a system with
>> 1366 installed packages.
>>
>> This is not intended to be a finished program yet, I'm looking forward
>> for your constructive commentaries.
> 
> There is a tool that does that, qfile from app-portage/portage-utils.
> Check the "-o, --orphans        * List orphan files" option.
> 
> It's not as straight forward as it could be, as it checks only for
> files specified as arguments or read from file.
> 
> But you can trivially use it like:
> # find /dir/you/want/to/check/for/orphans | qfile -o -f -
> 
> Best,
> Yuri.
> 

Based on the comments so far, I'll try to make my PoC a better tool.
My primary objective is to make this some kind of disk cleanup utility
for Gentoo boxes. I don't expect Gentoo systems to be *that* polluted,
but sometimes we all have to do ugly things to fix broken systems really
fast, if you know what I mean.

Other things came to mind as well, like using the stored hashes to
check the integrity of system files (as in security).
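
A sketch of what such an integrity check might look like, assuming the package database stores an MD5 digest for each regular file (as the "obj" entries in CONTENTS are assumed to). The function names and the shape of the input mapping are hypothetical. Note that MD5 is adequate for spotting accidental changes, but not against an attacker who can also rewrite the recorded digests.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def modified_files(recorded):
    """recorded maps path -> expected MD5 hex digest.

    Yields (path, reason) for every file that is unreadable or whose
    current digest differs from the recorded one.
    """
    for path, expected in sorted(recorded.items()):
        try:
            actual = md5_of(path)
        except OSError as exc:
            yield path, "unreadable: %s" % exc.strerror
            continue
        if actual != expected.lower():
            yield path, "checksum mismatch"
```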

My next steps regarding this utility will be:
* Follow harring's suggestion and use an available PM API.
* Make the application handle symlinks, so we start getting more
informative output.
* Store the generated cache on disk and only regenerate it when needed.
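
For the last item, one way to decide whether the on-disk cache is stale is to store a cheap fingerprint of the package database next to the cached data and rebuild when it changes. The sketch below assumes a category/package directory layout like that of /var/db/pkg; the fingerprint (number of package directories plus newest mtime) and all names are hypothetical.

```python
import os
import pickle


def vdb_stamp(vdb_root):
    """Cheap fingerprint: package directory count and newest mtime seen."""
    count = 0
    newest = os.stat(vdb_root).st_mtime_ns
    for category in os.listdir(vdb_root):
        cat_path = os.path.join(vdb_root, category)
        if not os.path.isdir(cat_path):
            continue
        newest = max(newest, os.stat(cat_path).st_mtime_ns)
        for package in os.listdir(cat_path):
            pkg_path = os.path.join(cat_path, package)
            if os.path.isdir(pkg_path):
                count += 1
                newest = max(newest, os.stat(pkg_path).st_mtime_ns)
    return count, newest


def load_or_build(cache_file, vdb_root, build):
    """Return cached data if the fingerprint matches, else call build(vdb_root)."""
    stamp = vdb_stamp(vdb_root)
    try:
        with open(cache_file, "rb") as fh:
            saved_stamp, data = pickle.load(fh)
        if saved_stamp == stamp:
            return data
    except (OSError, EOFError, pickle.UnpicklingError, TypeError, ValueError):
        pass  # Missing or damaged cache: fall through and rebuild.
    data = build(vdb_root)
    with open(cache_file, "wb") as fh:
        pickle.dump((stamp, data), fh)
    return data
```

Computing the fingerprint only needs a stat call per package directory, so it is much cheaper than reparsing every CONTENTS file.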

Regards,
- Angelo




* Re: [gentoo-dev] [RFC][NEW] Utility to find orphaned files
  2010-04-25 11:18 [gentoo-dev] [RFC][NEW] Utility to find orphaned files Angelo Arrifano
                   ` (2 preceding siblings ...)
  2010-04-25 15:34 ` [gentoo-dev] " Yuri Vasilevski
@ 2010-04-25 17:43 ` Benedikt Böhm
  3 siblings, 0 replies; 9+ messages in thread
From: Benedikt Böhm @ 2010-04-25 17:43 UTC
  To: gentoo-dev

On Sun, Apr 25, 2010 at 1:18 PM, Angelo Arrifano <miknix@gentoo.org> wrote:
> Hello developers developers and developers,
>
> Ever wondered how much crap is left in your X-years old Gentoo box?
>
> I just developed a python utility to efficiently find orphaned files in
> the system. By orphaned files I mean the files that are present on
> system directories and don't belong to any installed package.
>
> The package builds a virtual filesystem (cache) on the RAM using python
> hash tables. Then it uses the cache to find the ownership of files
> inside user-specified dirs.
>
> Building the cache takes less than 10 seconds here in a system with 1366
> installed packages.
>
> This is not intended to be a finished program yet, I'm looking forward
> for your constructive commentaries.

I refactored findcruft (search the forums) two years ago (see
http://git.xnull.de/cgit/findcruft2/); maybe you can take a look at
it, especially the handling of false positives.

HTH,
Bene




* Re: [gentoo-dev] [RFC][NEW] Utility to find orphaned files
  2010-04-25 13:43 ` Daniel Pielmeier
@ 2010-04-30 16:24   ` Enrico Weigelt
  2010-05-03 13:34     ` [gentoo-dev] " Peter Hjalmarsson
  0 siblings, 1 reply; 9+ messages in thread
From: Enrico Weigelt @ 2010-04-30 16:24 UTC
  To: gentoo-dev

* Daniel Pielmeier <billie@gentoo.org> wrote:

> What about searching the complete file system but using an exclude file where
> you can put directories and files which should not be searched. It is tedious to
> tell every path on the command-line. Also for instance if you specify /lib it
> will also search under /lib/modules and I am sure you do not consider all
> contents there as unneeded.

Hmm, perhaps there's some way to assign these files to some package?
 
> You also need to consider that your tool will return other false positives like
> byte compiled python modules and perl header files. In general everything an
> ebuild does in phases where it adds files to file-system but files are not
> stored to CONTENTS (pkg_{pre,post}inst). At this point the files are needed but
> not recognized by the package manager. If the ebuild does not take care of this
> files when removing (pkg_{pre,post}rm) the package they will remain on the
> file-system and are now unneeded.

Assuming these files are not optional/temporary (that is, files that can be
regenerated on the fly), I see a generic design problem here: everything belonging
to some package (excluding content data and configs, of course) should be assigned
to the package.

The big question: how can we achieve this?


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
 	http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for a lot dozens of packages in dozens of versions:
	http://patches.metux.de/
---------------------------------------------------------------------




* [gentoo-dev] Re: [RFC][NEW] Utility to find orphaned files
  2010-04-30 16:24   ` Enrico Weigelt
@ 2010-05-03 13:34     ` Peter Hjalmarsson
  2010-05-11 13:08       ` Angelo Arrifano
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Hjalmarsson @ 2010-05-03 13:34 UTC
  To: gentoo-dev

On Fri, 2010-04-30 at 18:24 +0200, Enrico Weigelt wrote:
> * Daniel Pielmeier <billie@gentoo.org> wrote:
> 
> > What about searching the complete file system but using an exclude file where
> > you can put directories and files which should not be searched. It is tedious to
> > tell every path on the command-line. Also for instance if you specify /lib it
> > will also search under /lib/modules and I am sure you do not consider all
> > contents there as unneeded.
> 
> hmm, perhaps there's some way to assign these files to some package ?
>  

Eh, no, and they should not be, since the files in that directory are kernel
modules, and most of the files there are created by "cd /usr/src/linux &&
make" or genkernel or something similar, and it is supposed to be that way.
Looking at the contents of that directory, it is pretty easy to see whether a
directory there should be left alone or removed (as there is just one
directory per kernel: no longer running a kernel? Remove
the corresponding dir).
It is better to have the script not touch that directory at all, or at
most point out "the directory contains directories for more kernels than
the currently running one (i.e. there is more than one dir) and it is
THIS big in total. You may want to check whether you have files from
older kernels that you no longer need."
That would leave it up to the user to figure out which kernel modules to
keep and which kernel to point to. Or are you suggesting autocleaning of /boot
and /usr/src/linux-* as well? Dangerous!
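
A small sketch of the "report, don't touch" behaviour: it lists the module trees that do not belong to the running kernel together with their sizes, and deletes nothing. The function names are hypothetical, and the running kernel release is taken from platform.release() unless given explicitly.

```python
import os
import platform


def tree_size(top):
    """Total size in bytes of all files below top (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # File vanished or is unreadable; skip it.
    return total


def report_extra_kernels(modules_dir="/lib/modules", running=None):
    """Return (release, size) for each module tree except the running kernel's."""
    running = running or platform.release()
    extra = []
    for release in sorted(os.listdir(modules_dir)):
        path = os.path.join(modules_dir, release)
        if release != running and os.path.isdir(path):
            extra.append((release, tree_size(path)))
    return extra
```

The result is only a hint for the user, who may well want to keep an older kernel around as a fallback.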






* Re: [gentoo-dev] Re: [RFC][NEW] Utility to find orphaned files
  2010-05-03 13:34     ` [gentoo-dev] " Peter Hjalmarsson
@ 2010-05-11 13:08       ` Angelo Arrifano
  0 siblings, 0 replies; 9+ messages in thread
From: Angelo Arrifano @ 2010-05-11 13:08 UTC
  To: gentoo-dev

On 03-05-2010 15:34, Peter Hjalmarsson wrote:
> On Fri, 2010-04-30 at 18:24 +0200, Enrico Weigelt wrote:
>> * Daniel Pielmeier <billie@gentoo.org> wrote:
>>
>>> What about searching the complete file system but using an exclude file where
>>> you can put directories and files which should not be searched. It is tedious to
>>> tell every path on the command-line. Also for instance if you specify /lib it
>>> will also search under /lib/modules and I am sure you do not consider all
>>> contents there as unneeded.
>>
>> hmm, perhaps there's some way to assign these files to some package ?
>>  
> 
> Eh, no and it should not be since files in that directory is kernel
> modules, and most of the files there is created by "cd /usr/src/linux &&
> make" or genkernel or something alike and it is supposed to be that way.

Indeed. /lib/firmware is another candidate.

> Looking at the contents of that directory is pretty easy to see if a
> directory there should be left alone or removed (as there is just one
> directory per kernel. not any longer running a kernel anymore? remove
> the corresponding dir).

That is dangerous. For example, I always keep the previous 2 kernels
just in case I detect some problem with the latest one and need to quickly
go back.

> It is better to have the script not tuch that directory at all or at
> most point out "the directory contains directories for more kernels then
> the currently running (i.e. there is more then one dir) and it is
> totally THIS big.

Sounds like a plan.

> You may want to take a look if you have files from
> older kernels that you do not longer need."
> That would leave up to the user to figure out what kernel modules to
> keep and what kernel to pount. Or you suggest autocleaning of /boot
> and /usr/src/linux-* as well? Dangerous!
> 
> 
> 

I'm seeing that there is enough interest (including mine) in such a utility.
Since it is difficult to please everyone from the start, I'll first open a
project page on sf.net and develop a more powerful PoC that matches my
ideas. There were a lot of good ideas and observations here, so keep them
coming; I'll certainly read them.

When, and only if, the thing grows to a more mature state, I'll try to
open a Gentoo project by the appropriate means.

I don't have much free time lately, so I can't promise anything. But
as long as my interest in it doesn't die, I'll slowly keep working on it.

Regards,
- Angelo



