[gentoo-dev] [RFC] Ideas for gentoostats implementation

public inbox for gentoo-dev@lists.gentoo.org

* [gentoo-dev] [RFC] Ideas for gentoostats implementation
@ 2020-04-26  8:08 Michał Górny
  2020-04-26  8:42 ` Alarig Le Lay
                   ` (9 more replies)
  0 siblings, 10 replies; 31+ messages in thread
From: Michał Górny @ 2020-04-26  8:08 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 6331 bytes --]

Hi,

The topic of rebooting gentoostats comes up here from time to time.  Unless
I'm mistaken, all the efforts so far were superficial, lacking a clear
plan and unwilling to research the problems.  I'd like to start
a serious discussion focused on the issues we need to solve, and propose
some ideas on how we could solve them.

I can't promise I'll find time to implement it.  However, I'd like to
get a clear plan on how it should be done if someone actually does it.

The big questions
=================
The way I see it, the primary goal of the project would be to gather
statistics on popularity of packages, in order to help us prioritize our
attention and make decisions on what to keep and what to remove.  Unlike
Debian's popcon, I don't think we really want to try to investigate
which files are actually used but focus on what's installed.

There are a few important questions that need to be answered first:

1. Which data do we need to collect?

   a. list of installed packages?
   b. versions (or slots?) of installed packages?
   c. USE flags on installed packages?
   d. world and world_sets files?
   e. system profile?
   f. enabled repositories? (possibly filtered to official list)
   g. distribution?

I think d. is most important as it gives us information on what users
really want.  a. alone is kinda redundant if we have d.  c. might have
some value when deciding whether to mask a particular flag (and implies
a.).

e. would be valuable if we wanted to determine the future of particular
profiles, as well as e.g. estimate the transition to new versions.

f. would be valuable to determine which repositories are used but we
need to filter private repos from the output for privacy reasons.

g. could be valuable in correlation with other data but not sure if
there's much direct value alone.
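
A minimal sketch of gathering item d., assuming the usual Portage
locations /var/lib/portage/world and /var/lib/portage/world_sets.  The
function names and the layout of the returned dictionary are
hypothetical; nothing here is an agreed report format.

```python
from pathlib import Path

# Usual Portage locations; adjust if your installation differs.
WORLD = Path("/var/lib/portage/world")
WORLD_SETS = Path("/var/lib/portage/world_sets")


def parse_entries(text):
    """Return the non-empty, non-comment lines of a world-style file, sorted."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.add(line)
    return sorted(entries)


def read_entries(path):
    """Read a world-style file, treating a missing file as empty."""
    try:
        return parse_entries(path.read_text())
    except FileNotFoundError:
        return []


def collect_world():
    """Hypothetical report fragment covering item d. only."""
    return {"world": read_entries(WORLD), "world_sets": read_entries(WORLD_SETS)}
```

Sorting and de-duplicating the entries keeps the report independent of
the order in which packages were added to the files.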

2. How to handle Gentoo derivatives?  Some of them could provide
meaningful data but some could provide false data (e.g. when derivatives
override Gentoo packages).  One possible option would be to filter a.-e. 
to stuff coming from ::gentoo.

3. How to keep the data up-to-date?  After all, if we just stack a lot
of old data, we will soon stop getting meaningful results.  I suppose
we'll need to timestamp all data and remove old entries.

4. How to avoid duplication?  If some users submit their results more
often than others, they would bias the results.  3. might be related.

5. How to handle clusters?  Things are simple if we can assume that
people will submit data for a few distinct systems.  But what about
companies that run 50 Gentoo machines with the same or similar setup? 
What about clusters of 1000 almost identical containers?  Big entities
could easily bias the results but we should also make it possible for
them to participate somehow.

6. Security.  We don't want to expose information that could be
correlated to specific systems, as it could disclose their
vulnerabilities.

7. Privacy.  Besides the above, our sysadmins would appreciate if
the data they submitted couldn't be easily correlated to them.  If we
don't respect privacy of our users, we won't get them to submit data.

8. Spam protection.  Finally, the service needs to be resilient to being
spammed with fake data.  Both to users who want to make their packages
look more important, and to script kiddies that want to prove a point.

My (partial) implementation idea
================================
I think our approach should be oriented on privacy/security first,
and attempt to make the best of the data we can get while respecting
this principle.  This means no correlation and no tracking.

Once the tool is installed, the user needs to opt-in to using it.  This
involves accepting a privacy policy and setting up a cronjob.  The tool
would suggest a (random?) time for submission to take place periodically
(say, every week).
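
To illustrate the "random time" part, here is a minimal sketch that
produces a weekly crontab line with a randomly chosen minute, hour and
weekday.  The function name and the `gentoostats-submit` command used
in the usage note are made up for illustration.

```python
import random


def suggest_cron_line(command, rng=None):
    """Return a crontab line that runs `command` once a week at a random time."""
    rng = rng or random.Random()
    minute = rng.randrange(60)
    hour = rng.randrange(24)
    weekday = rng.randrange(7)  # 0 = Sunday in the usual crontab convention
    return f"{minute} {hour} * * {weekday} {command}"
```

Calling `suggest_cron_line("gentoostats-submit")` yields a line the user
can review before adding it to their crontab; spreading submissions
over the week this way should also avoid load spikes on the server.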

The submission would contain only raw data, without any identification
information.  It would be encrypted using our public key.  Once
uploaded, it would be put into our input queue as-is.

Periodically the input queue would be processed in bulk.  The individual
statistics would be updated and the input would be discarded.  This
should prevent people trying to correlate changes in statistics with
individual uploads.

Each counted item would have a timestamp associated, and we'd discard
old items per resubmission period.  This should ensure that we keep
fresh data and people can update their earlier submissions without
storing identification data.

For example, N users submit their data containing a list of packages
every week.  This data is used in bulk to update counts of individual
packages (technically, to append timestamps to list corresponding to
these packages).  Data older than one week is discarded, so we have
rough counts of package use during the last week.
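
A minimal in-memory sketch of that bookkeeping, assuming each
submission has already been decrypted into a plain list of package
names.  The class and method names are hypothetical, and a real
service would need persistent storage.

```python
import time
from collections import defaultdict

WEEK = 7 * 24 * 3600  # resubmission period in seconds


class PackageCounts:
    """Per-package lists of submission timestamps, with no submitter identity."""

    def __init__(self, period=WEEK):
        self.period = period
        self.seen = defaultdict(list)

    def process_queue(self, submissions, now=None):
        """Fold a batch of package lists into the counts, then expire old entries."""
        now = time.time() if now is None else now
        for packages in submissions:
            for package in set(packages):
                self.seen[package].append(now)
        self.expire(now)

    def expire(self, now):
        """Drop timestamps older than one resubmission period."""
        cutoff = now - self.period
        for package in list(self.seen):
            fresh = [t for t in self.seen[package] if t > cutoff]
            if fresh:
                self.seen[package] = fresh
            else:
                del self.seen[package]

    def count(self, package):
        return len(self.seen.get(package, ()))
```

Note that the sketch stamps every entry with the time the batch was
processed rather than the upload time, which is one way to make
individual uploads harder to tell apart.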

I think this addresses problems 3./6./7.

The other major problem is spam protection.  The best semi-anonymous way
I see is to use submitter's IPv4 addresses (can we support IPv6 then?). 
We could set a limit of, say, 10 submissions per IPv4 address per week. 
If some address would exceed that limit, we could require CAPTCHA
authorization.

I think this would make spamming a bit harder while keeping submissions
easy for most, and a little harder but possible for those of us
behind ISP NATs.
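
A minimal sketch of such a limiter, assuming the address is available
to the receiving service as a string.  The names are hypothetical, and
keeping this table apart from the submitted data is an assumption made
here to stay in line with the no-correlation goal.

```python
from collections import defaultdict

WEEK = 7 * 24 * 3600
LIMIT = 10  # submissions per address per week, as suggested above


class SubmissionLimiter:
    def __init__(self, limit=LIMIT, window=WEEK):
        self.limit = limit
        self.window = window
        self.history = defaultdict(list)  # address -> recent submission times

    def check(self, address, now):
        """Return "accept" or "captcha" for a submission from `address`."""
        recent = [t for t in self.history[address] if t > now - self.window]
        if len(recent) >= self.limit:
            self.history[address] = recent
            return "captcha"
        recent.append(now)
        self.history[address] = recent
        return "accept"
```

Submissions that come back as "captcha" are not counted against the
limit, so a user who solves the CAPTCHA is not penalized further.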

This should address problems 4./8. and maybe 5. to some degree.

A proper solution to cluster problem would probably involve some way to
internally collect and combine data before submission.  If you have
large clusters of similar systems, I think you'd want to have all
packages used on different systems reported as one entry.

I think we should collect data from users running all Gentoo
derivatives, as long as they are using Gentoo packages.  The simplest
solution I can think of would be to filter the results on packages (or
profiles) installed from ::gentoo.  This will work only for distros that
expose ::gentoo explicitly (vs copying our ebuilds to their
repositories) though.
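
A minimal sketch of that filter, assuming the client already has its
installed packages as (package, repository) pairs, for instance from
the repository name Portage records for each installed package.  The
function name is hypothetical.

```python
def filter_gentoo(installed, repo="gentoo"):
    """Keep only packages whose recorded source repository is `repo`."""
    return [cpv for cpv, source in installed if source == repo]
```

For example, `filter_gentoo([("app-editors/vim-8.2", "gentoo"),
("dev-foo/bar-1", "local")])` keeps only the vim entry, so nothing
about private repositories leaves the machine.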

What do you think?  Do you foresee other problems?  Do you have other
needs?  Can you think of better solutions?

-- 
Best regards,
Michał Górny

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
@ 2020-04-26  8:42 ` Alarig Le Lay
  2020-04-26  8:43 ` Toralf Förster
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Alarig Le Lay @ 2020-04-26  8:42 UTC (permalink / raw
  To: gentoo-dev

Hi,

On Sun 26 Apr 2020 10:08:32 GMT, Michał Górny wrote:
> The other major problem is spam protection.  The best semi-anonymous way
> I see is to use submitter's IPv4 addresses (can we support IPv6 then?). 
> We could set a limit of, say, 10 submissions per IPv4 address per week. 
> If some address would exceed that limit, we could require CAPTCHA
> authorization.
> 
> I think this would make spamming a bit harder while keeping submissions
> easy for the most, and a little harder but possible for those of us
> behind ISP NATs.

I think that the IPv6 support shouldn’t be a question. I have several
points for it:

1. All the Gentoo infrastructure is IPv6-capable (at least the
   public-facing part, as far as I’m aware), so an IPv4-only service
   would be a special case.
2. As you mention NAT ISPs, most of those provide IPv6 as well
   (because NAT isn’t free).  Also, applying the IPv4 rate limit to
   each /64 IPv6 prefix will reduce the need for a CAPTCHA.
3. Users don’t necessarily have IPv4 access.
4. About a third of Internet traffic is IPv6, so supporting it is not
   optional, in my humble opinion.
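
Point 2 could look roughly like this: a minimal sketch using Python's
standard ipaddress module, assuming the rate limiter keys on a string.
The function name is made up for illustration.

```python
import ipaddress


def rate_limit_key(address):
    """Map a client address to the key used for rate limiting."""
    ip = ipaddress.ip_address(address)
    if ip.version == 6:
        # Count the whole /64 as one submitter.
        return str(ipaddress.ip_network(f"{ip}/64", strict=False))
    return str(ip)
```

Two hosts in the same /64 then share one quota, just as hosts behind
one IPv4 NAT address would.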

Regards,
-- 
Alarig


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
  2020-04-26  8:42 ` Alarig Le Lay
@ 2020-04-26  8:43 ` Toralf Förster
  2020-04-26  8:52   ` Michał Górny
  2020-04-26  9:09 ` Ulrich Mueller
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Toralf Förster @ 2020-04-26  8:43 UTC (permalink / raw
  To: gentoo-dev


[-- Attachment #1.1: Type: text/plain, Size: 549 bytes --]

On 4/26/20 10:08 AM, Michał Górny wrote:
> .  This
> involves accepting a privacy policy and setting up a cronjob.  The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).

Well, something like "@weekly" should be preferred over e.g. "42 23 * * *" because the latter might be too late for desktop users.


> We could set a limit of, say, 10 submissions per IPv4 address per week.

If the output does not differ (too much) then the limit isn't needed, is it?

-- 
Toralf
PGP 23217DA7 9B888F45


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:43 ` Toralf Förster
@ 2020-04-26  8:52   ` Michał Górny
  2020-04-26 10:15     ` Toralf Förster
  2020-04-26 14:11     ` Kent Fredric
  0 siblings, 2 replies; 31+ messages in thread
From: Michał Górny @ 2020-04-26  8:52 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 688 bytes --]

On Sun, 2020-04-26 at 10:43 +0200, Toralf Förster wrote:
> On 4/26/20 10:08 AM, Michał Górny wrote:
> > .  This
> > involves accepting a privacy policy and setting up a cronjob.  The tool
> > would suggest a (random?) time for submission to take place periodically
> > (say, every week).
> 
> Well, something like "@weekly" should be preferred over e.g. "42 23 * * *" because the latter might be too late for desktop users.
> 
> 
> > We could set a limit of, say, 10 submissions per IPv4 address per week.
> 
> If the output does not differ (too much) then the limit isn't needed, is it?

Do you have any other idea for spam protection then?

-- 
Best regards,
Michał Górny


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
  2020-04-26  8:42 ` Alarig Le Lay
  2020-04-26  8:43 ` Toralf Förster
@ 2020-04-26  9:09 ` Ulrich Mueller
  2020-04-26  9:32   ` Toralf Förster
                     ` (2 more replies)
  2020-04-26 12:38 ` Thomas Deutschmann
                   ` (6 subsequent siblings)
  9 siblings, 3 replies; 31+ messages in thread
From: Ulrich Mueller @ 2020-04-26  9:09 UTC (permalink / raw
  To: Michał Górny; +Cc: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 550 bytes --]

>>>>> On Sun, 26 Apr 2020, Michał Górny wrote:

> The other major problem is spam protection.  The best semi-anonymous way
> I see is to use submitter's IPv4 addresses (can we support IPv6 then?). 
> We could set a limit of, say, 10 submissions per IPv4 address per week. 
> If some address would exceed that limit, we could require CAPTCHA
> authorization.

Instead of using the IP address, you could generate a UUID when
installing the tool. This would also take care of clusters with machines
that are clones of each other.

Ulrich

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 507 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  9:09 ` Ulrich Mueller
@ 2020-04-26  9:32   ` Toralf Förster
  2020-04-26 10:39     ` Brian Dolbec
  2020-04-26  9:42   ` Michał Górny
  2020-04-26 21:47   ` Andreas K. Hüttel
  2 siblings, 1 reply; 31+ messages in thread
From: Toralf Förster @ 2020-04-26  9:32 UTC (permalink / raw
  To: gentoo-dev


[-- Attachment #1.1: Type: text/plain, Size: 203 bytes --]

On 4/26/20 11:09 AM, Ulrich Mueller wrote:
> Instead of using the IP address, you could generate a UUID when
> installing the tool. 

like the pfl tool did ?

-- 
Toralf
PGP 23217DA7 9B888F45


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  9:09 ` Ulrich Mueller
  2020-04-26  9:32   ` Toralf Förster
@ 2020-04-26  9:42   ` Michał Górny
  2020-04-26 21:47   ` Andreas K. Hüttel
  2 siblings, 0 replies; 31+ messages in thread
From: Michał Górny @ 2020-04-26  9:42 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 705 bytes --]

On Sun, 2020-04-26 at 11:09 +0200, Ulrich Mueller wrote:
> > > > > > On Sun, 26 Apr 2020, Michał Górny wrote:
> > The other major problem is spam protection.  The best semi-anonymous way
> > I see is to use submitter's IPv4 addresses (can we support IPv6 then?). 
> > We could set a limit of, say, 10 submissions per IPv4 address per week. 
> > If some address would exceed that limit, we could require CAPTCHA
> > authorization.
> 
> Instead of using the IP address, you could generate a UUID when
> installing the tool. This would also take care of clusters with machines
> that are clones of each other.
> 

That wouldn't help with abuse at all.

-- 
Best regards,
Michał Górny


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:52   ` Michał Górny
@ 2020-04-26 10:15     ` Toralf Förster
  2020-04-26 10:25       ` Michał Górny
  2020-04-26 14:11     ` Kent Fredric
  1 sibling, 1 reply; 31+ messages in thread
From: Toralf Förster @ 2020-04-26 10:15 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1.1: Type: text/plain, Size: 655 bytes --]

On 4/26/20 10:52 AM, Michał Górny wrote:
> Do you have any other idea for spam protection then?

IMO there're 2 types of spam:

1. made by accident (e.g. "* * * * *" instead of "@weekly" in crontab)
2. made intentionally

The 1st can be handled by UUID: just drop any older dataset with the same UUID from the inbox when a new one arrives.
For the 2nd: what about accepting only datasets from "valid" UUIDs, meaning those from which, on average, just 1 dataset/week/IPv4 (maybe per /16 block) arrived over the last few weeks/months?

Well, other than that, maybe the spamassassin or Tor people have more theory and generic approaches?
:-)

-- 
Toralf
PGP 23217DA7 9B888F45

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26 10:15     ` Toralf Förster
@ 2020-04-26 10:25       ` Michał Górny
  2020-04-26 10:54         ` Toralf Förster
  0 siblings, 1 reply; 31+ messages in thread
From: Michał Górny @ 2020-04-26 10:25 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 729 bytes --]

On Sun, 2020-04-26 at 12:15 +0200, Toralf Förster wrote:
> On 4/26/20 10:52 AM, Michał Górny wrote:
> > Do you have any other idea for spam protection then?
> 
> IMO there're 2 types of spam:
> 
> 1. made by accident (e.g. "* * * * *" instead of "@weekly" in crontab)
> 2. made intentionally
> 
> The 1st can be handled by UUID - just drop any old related dataset from inbox when a new one arrives
> For the 2nd: what about accepting only datasets from "valid" UUIDs, meaning where just 1 dataset/week/IPv4 (maybe /16 block) on average arrived in the last few weeks/months?
> 

I'm sorry but could you rephrase that in more sentences?  I don't
understand what you mean.

-- 
Best regards,
Michał Górny


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  9:32   ` Toralf Förster
@ 2020-04-26 10:39     ` Brian Dolbec
  2020-04-26 13:52       ` Kent Fredric
  0 siblings, 1 reply; 31+ messages in thread
From: Brian Dolbec @ 2020-04-26 10:39 UTC (permalink / raw
  To: gentoo-dev

On Sun, 26 Apr 2020 11:32:06 +0200
Toralf Förster <toralf@gentoo.org> wrote:

> On 4/26/20 11:09 AM, Ulrich Mueller wrote:
> > Instead of using the IP address, you could generate a UUID when
> > installing the tool.   
> 
> like the pfl tool did ?
> 

Like the last gentoostats gsoc project did.

As for enterprise/school/multiple clone deployments: those are
generated by one person/team, then deployed.  We would need that
person/team to enable gentoostats only on their test system and disable
it for deployments.  Repeated failure to do that could result in that
UUID being blacklisted.  Part of the initial profile details for that
VM/image would be the approximate number of deployments (yes, subject
to change, but useful to know whether it is 10-15 or 100-500) and the
type of deployment, i.e. vm/docker/kubernetes/desktop/server...


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26 10:25       ` Michał Górny
@ 2020-04-26 10:54         ` Toralf Förster
  0 siblings, 0 replies; 31+ messages in thread
From: Toralf Förster @ 2020-04-26 10:54 UTC (permalink / raw
  To: gentoo-dev


[-- Attachment #1.1: Type: text/plain, Size: 1575 bytes --]

On 4/26/20 12:25 PM, Michał Górny wrote:
> On Sun, 2020-04-26 at 12:15 +0200, Toralf Förster wrote:
>> On 4/26/20 10:52 AM, Michał Górny wrote:
>>> Do you have any other idea for spam protection then?
>>
>> IMO there're 2 types of spam:
>>
>> 1. made by accident (e.g. "* * * * *" instead of "@weekly" in crontab)
>> 2. made intentionally
>>
>> The 1st can be handled by UUID - just drop any old related dataset from inbox when a new one arrives
>> For the 2nd: what about accepting only datasets from "valid" UUIDs, meaning where just 1 dataset/week/IPv4 (maybe /16 block) on average arrived in the last few weeks/months?
>>
> 
> I'm sorry but could you rephrase that in more sentences?  I don't
> understand what you mean.
> 

Well, inspired by what Tor people do with Tor bridge stats:

- Create a UUID (never published, known only to the client and to the gentoo stats server)
- Calculate a hash of it. The hash is allowed to be published. The hash may be associated with contact information. The contact data may or may not be published. The hash is used for contacting people in case of questions.

The stats sent by the client contain the UUID.
Stats are sent to a stats server, into an area where they live for a while (days).
When a new stats file arrives, the stats server deletes all older stats files of that UUID in the stats area.

Stats are trusted if they meet the conditions already mentioned by Brian Dolbec.

IMO do not care about detecting spam, just try to detect valid UUIDs.
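
A minimal sketch of the scheme, assuming SHA-256 as the hash (the
choice of hash is my assumption, as are all names below) and an
in-memory stand-in for the stats area:

```python
import hashlib
import uuid


def new_client_id():
    """Generate the private UUID once, when the tool is set up."""
    return str(uuid.uuid4())


def public_hash(client_id):
    """Publishable identifier derived from the private UUID."""
    return hashlib.sha256(client_id.encode("ascii")).hexdigest()


class StatsArea:
    """Holds at most one stats file per UUID: the most recent one."""

    def __init__(self):
        self.latest = {}  # private UUID -> (received_at, stats)

    def receive(self, client_id, stats, received_at):
        """Store `stats`, replacing any older file from the same UUID."""
        current = self.latest.get(client_id)
        if current is None or received_at >= current[0]:
            self.latest[client_id] = (received_at, stats)

    def datasets(self):
        return [stats for _, stats in self.latest.values()]
```

A crontab that fires every minute by accident then still leaves only
one dataset per UUID in the area.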

-- 
Toralf
PGP 23217DA7 9B888F45


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
                   ` (2 preceding siblings ...)
  2020-04-26  9:09 ` Ulrich Mueller
@ 2020-04-26 12:38 ` Thomas Deutschmann
  2020-04-26 13:46   ` Kent Fredric
  2020-04-26 12:56 ` Kent Fredric
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 31+ messages in thread
From: Thomas Deutschmann @ 2020-04-26 12:38 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1.1: Type: text/plain, Size: 2573 bytes --]

On 2020-04-26 10:08, Michał Górny wrote:
> What do you think?  Do you foresee other problems?  Do you have
> other needs?  Can you think of better solutions?

While I would really like to have data, I think it's impossible to get
correct data and therefore we shouldn't collect any data at all because
the invalid data we would collect would be misused/misinterpreted.

Let's start with your first example already,

> the primary goal of the project would be to gather statistics on
> popularity of packages, in order to help us prioritize our attention
> and make decisions on what to keep and what to remove

Let's assume we will get reports that app-misc/foo is only installed 20
times. If you are going to judge based on this data, "Obviously, nobody
is using that package, it's stuck on <whatever>... safe to remove" your
view is biased:

Because reporting will never be mandatory, we don't know whether
app-misc/foo is just unlucky because most of its users haven't opted in
to reporting (you can assume something like this for people with
tor-related programs, for example).

Now think about large installations which are probably not allowed to
"phone home", using their private local mirror and are even using build
hosts. I am aware of *multiple* large Gentoo deployments -- for servers.
You will never get data from these installations. Instead, stats will be
drowned by several home users which are more likely to submit data.
Not to mention the new containerized world...

It's the problem you all should know from Mozilla, Google, Microsoft
*duck*: They all do 'data-driven development'. The problem: *We* are
power users. We are using several features most normal users don't even
know. However, most of us are also privacy-conscious and are disabling
stats. The result: These companies are killing popular power user
features just because their data indicates that nobody is using that
feature.

Please don't create pressure on users to opt-in to gentoostats to
prevent something like this for Gentoo.

My point is: I'll strongly object to *any* decision based on this
project because the data will be *always wrong*. Therefore the data is
useless and I wouldn't even consider collecting it in the first place.
Where there is a trough the pigs gather... and at one point people will
start to ignore that the data is useless just to underline *their* point
in their current situation. :/

-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 618 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
                   ` (3 preceding siblings ...)
  2020-04-26 12:38 ` Thomas Deutschmann
@ 2020-04-26 12:56 ` Kent Fredric
  2020-04-26 17:24 ` Samuel Bernardo
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Kent Fredric @ 2020-04-26 12:56 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 2154 bytes --]

On Sun, 26 Apr 2020 10:08:32 +0200
Michał Górny <mgorny@gentoo.org> wrote:

> A proper solution to cluster problem would probably involve some way to
> internally collect and combine data before submission.  If you have
> large clusters of similar systems, I think you'd want to have all
> packages used on different systems reported as one entry.

For this, I'd suggest the ability to have an overrideable
"STATS_SERVER" (or something) ENV var URI that tells the submission
clients where to send their reports to.

Then have some server shipped in gentoo people can deploy, and submit
aggregated as a cron job, or potentially hand review the aggregated
submission data before submission, and potentially have tools to
whittle data out you don't want to share at the org level.
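
A minimal sketch of both halves, assuming reports are plain lists of
package names.  `STATS_SERVER` is the variable name suggested above;
the default URL is a placeholder and every function name is
hypothetical.

```python
import os

DEFAULT_SERVER = "https://stats.example.org/submit"  # placeholder, not a real endpoint


def stats_server(environ=os.environ):
    """Where a client sends its report; an org points this at its own relay."""
    return environ.get("STATS_SERVER", DEFAULT_SERVER)


def aggregate(node_reports, withheld=()):
    """Merge per-node package lists into one org-wide entry.

    Packages listed in `withheld` are dropped before anything leaves the org.
    """
    merged = set()
    for packages in node_reports:
        merged.update(packages)
    return sorted(merged - set(withheld))
```

The relay would collect node reports over the week and forward only
the result of `aggregate()`, optionally after a human has reviewed it.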

Such a tool is potentially useful to an organisation even without its
"submit to gentoo" capacity, as being able to internally analyse what
your organisation is using seems to be useful.

(eg: provide an admin a single point of information showing what
packages they need to audit, if all the nodes in the org are not
entirely controlled at the top level)

Though I think the overall design of anonymity by design is useful, I
can see usecases, especially in the organisation model, where being
able to voluntarily self-identify a node could be useful without
inherently being a privacy concern.

And you'd configure your relay to suppress these node identities in the
submitted data, or map them to a different org-wide identity. 

Example:
  I need to find somebody who is using <x> so I can ask them if <y>
  works, or if <z> is important to this package.

Example:
  Data indicates somebody within my org is using <x>, and I need to ask
  them not to use <x>, as its licensing terms are not compatible with
  our org.

Though for cases of voluntary identification, you'd need an interface
on the server node somewhere that allows you to generate unique ident
tokens, and associate data with them, possibly with a list of flags
dictating what records associated with this identity may be used for
(eg: Contact [y/n] )

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26 12:38 ` Thomas Deutschmann
@ 2020-04-26 13:46   ` Kent Fredric
  2020-05-05  0:47     ` Thomas Deutschmann
  0 siblings, 1 reply; 31+ messages in thread
From: Kent Fredric @ 2020-04-26 13:46 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1072 bytes --]

On Sun, 26 Apr 2020 14:38:54 +0200
Thomas Deutschmann <whissi@gentoo.org> wrote:

> Let's assume we will get reports that app-misc/foo is only installed 20
> times. If you are going to judge based on this data, "Obviously, nobody
> is using that package, it's stuck on <whatever>... safe to remove" your
> view is biased:

I see this as more like what bloom filters get you, but in reverse:

- You still have to factor for "what you don't know"

- But now, instead of having "we don't know if anybody uses this", you
  *can* have a "we know for sure somebody uses this".

The anonymization and uncorrelatable aspects are of course very useful
to encourage people who would otherwise be averse to participating,
but it's certainly not a sure thing.

It would certainly be an improvement over what currently happens "No
reverse dependencies, thus, nobody is using it".

Bad things will still happen, but the absence of this tool won't stop
the bad things happening, because presently, the existence of users is
entirely conjecture.

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26 10:39     ` Brian Dolbec
@ 2020-04-26 13:52       ` Kent Fredric
  0 siblings, 0 replies; 31+ messages in thread
From: Kent Fredric @ 2020-04-26 13:52 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1282 bytes --]

On Sun, 26 Apr 2020 03:39:24 -0700
Brian Dolbec <dolsen@gentoo.org> wrote:

> We would need that
> person/team to only enable their test system for gentoostats/disabled
> for deployments. Repeated failure to do that could result in that uuid
> being blacklisted.   Part of the initial profile details for that
> vm/image would be some details about approx numbers of deployments
> (yes, subject to change, but useful to know whether it is 10-15 or
> 100-500) and the type of deployment, i.e. vm/docker/kubernetes/desktop/server...

If the UUID generation was how I proposed in my other reply: On a
voluntary basis, with ability for UUID's to have metadata about what
the information associated with them may be used for, one could also
have a metadata field indicating what /kind/ of user the UUID was
associated with.

Then people simply installing things for testing, and reporting results
from their test rig could have a "tester" flag associated with a UUID
used only for testing, and then we can exclude that data from the main
reports, while still using it as evidence that a thing may work for
some audience.

The submission rate for UUID's with the "tester" flag could be allowed
to be higher, because it no longer contributes to the overall
statistics.
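
A minimal sketch of that split, assuming each submission is a
dictionary carrying an optional "flags" list next to its data.  That
layout and the function name are hypothetical.

```python
def split_by_tester_flag(submissions):
    """Separate main statistics from reports sent by UUIDs flagged as "tester"."""
    main, testing = [], []
    for submission in submissions:
        if "tester" in submission.get("flags", ()):
            testing.append(submission)
        else:
            main.append(submission)
    return main, testing
```

Only the first list would feed the public counts; the second stays
available as evidence that a package works for somebody.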

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:52   ` Michał Górny
  2020-04-26 10:15     ` Toralf Förster
@ 2020-04-26 14:11     ` Kent Fredric
  1 sibling, 0 replies; 31+ messages in thread
From: Kent Fredric @ 2020-04-26 14:11 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 2568 bytes --]

On Sun, 26 Apr 2020 10:52:27 +0200
Michał Górny <mgorny@gentoo.org> wrote:

> Do you have any other idea for spam protection then?

What is the realistic risk here for spamming?

If the record is well formed, and pertains to known packages, the worst
I currently imagine is astroturfing: A single individual attempting to
make a package seem more popular than it is.

Just generally IME, spamming aims to make a buck somehow, but if
there are no fields in the data set that can be used for this, and
attempts to stuff existing fields with spam prose get filtered because
they don't correlate to any known possible values, then the entire
record is simply invalid, and can be removed on that basis.

Conceptually, you could have a report with
"dev-foo/plz-sir-halp-me-I-have-money-and-an-a-nigerian-prince::nigeria-prince",
but for anybody to see that they'd have to be querying data about the
::nigeria-prince overlay, and that's assuming we even show data about
overlays we can't locate.

Trolling ::gentoo with packages that don't exist seems easy to eliminate.
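
A minimal sketch of that check, assuming the server has a set of
package names known to exist in ::gentoo.  How that set is obtained is
left open, and the function name is hypothetical.

```python
def is_valid_report(packages, known_packages):
    """Accept a report only if every entry names a known package."""
    return bool(packages) and all(p in known_packages for p in packages)
```

A report containing even one unknown name is rejected as a whole, so
free-form text never reaches the statistics.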

I don't like that astroturfing could be a thing ... but like, I also
don't really care about that happening.

For instance, crates.io has per-crate and per-crate-version download
statistics.

That's super easy to rig, you get lots of spiky noise in infrequently
used packages simply due to various automated services fetching things.

But at scale, the data still turns out to be quasi-useful, as it allows
you to chart adoption and migration... because as soon as a new version
gets shipped, if people are using it, then you'll start to see an
uptick in reports from the new version.

The "change" and "change response" information is very useful, and a
very odd target for astroturfing.

I for one would be greatly interested in "new perl version shipped,
explosion of results due to people upgrading", because then I can gauge
roughly how many people managed to upgrade perl without having to join
#gentoo and cry about it being broken.

(We could also designate a certain UUID flag for use by Gentoo infra,
possibly even a UUID-per-host, the results of which were invisible in
the public data, but still visible to people with approved perms,
because we really do value the ability to know which packages we have
to be careful about causing problems in, and where infra is at with
upgrading various things before we remove the versions infra is using,
whereas currently, working out what infra are currently running
requires lots of direct communication)
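
A minimal sketch of how such reserved UUIDs could be hidden from public
views while staying visible to approved people. The UUID value, the
record layout and the function name are all hypothetical.

```python
# Hypothetical sketch: the reserved UUID below is a placeholder.
INFRA_UUIDS = {"00000000-0000-4000-8000-000000000001"}


def visible_records(records, viewer_is_approved):
    """Hide infra-tagged records unless the viewer has approved perms."""
    if viewer_is_approved:
        return list(records)
    return [r for r in records if r["uuid"] not in INFRA_UUIDS]


records = [
    {"uuid": "00000000-0000-4000-8000-000000000001", "pkg": "dev-lang/perl"},
    {"uuid": "3f2b8c1e-0000-4000-8000-00000000abcd", "pkg": "app-editors/vim"},
]
print(len(visible_records(records, False)), len(visible_records(records, True)))
```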

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
                   ` (4 preceding siblings ...)
  2020-04-26 12:56 ` Kent Fredric
@ 2020-04-26 17:24 ` Samuel Bernardo
  2020-05-04 22:57 ` Andrey Utkin
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 31+ messages in thread
From: Samuel Bernardo @ 2020-04-26 17:24 UTC (permalink / raw
  To: gentoo-dev


[-- Attachment #1.1: Type: text/plain, Size: 7657 bytes --]

Hi everyone,

gentoostats is new to me, and I'm not aware of previous discussions or
implementations. But from what I could understand from the comments and
Michał Górny's explanation, I'd like to draw your attention to the
octoverse[1] initiative.

Maybe the statistics could be collected through a platform that also
gathers additional metadata from user contributions. What I mean is a
broker that collects all statistics from an organization internally and
then publishes them at the end. Such a solution would add value for
enterprise statistics and also contribute back to Gentoo.

Each broker could then use git authentication to publish the results
with a merge request that would run the necessary hooks on the Gentoo
side. All we would need is a specification document for parsing the
data.

Sorry if my comment is completely out of context, but such an octoverse
for Gentoo would be very interesting in my perspective.

Best,

Samuel

[1] https://octoverse.github.com/

On 4/26/20 9:08 AM, Michał Górny wrote:
> Hi,
>
> The topic of rebooting gentoostats comes here from time to time.  Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems.  I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it.  However, I'd like to
> get a clear plan on how it should be done if someone actually does it.
>
>
> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove.  Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
>    a. list of installed packages?
>    b. versions (or slots?) of installed packages?
>    c. USE flags on installed packages?
>    d. world and world_sets files
>    e. system profile?
>    f. enabled repositories? (possibly filtered to official list)
>    g. distribution?
>
> I think d. is most important as it gives us information on what users
> really want.  a. alone is kinda redundant if we have d.  c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
>
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
>
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.
>
> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.
>
>
> 2. How to handle Gentoo derivatives?  Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages).  One possible option would be to filter a.-e. 
> to stuff coming from ::gentoo.
>
>
> 3. How to keep the data up-to-date?  After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results.  I suppose
> we'll need to timestamp all data and remove old entries.
>
>
> 4. How to avoid duplication?  If some users submit their results more
> often than others, they would bias the results.  3. might be related.
>
>
> 5. How to handle clusters?  Things are simple if we can assume that
> people will submit data for a few distinct systems.  But what about
> companies that run 50 Gentoo machines with the same or similar setup? 
> What about clusters of 1000 almost identical containers?  Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.
>
>
> 6. Security.  We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.
>
>
> 7. Privacy.  Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them.  If we
> don't respect privacy of our users, we won't get them to submit data.
>
>
> 8. Spam protection.  Finally, the service needs to be resilient to being
> spammed with fake data.  Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.
>
>
> My (partial) implementation idea
> ================================
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle.  This means no correlation and no tracking.
>
> Once the tool is installed, the user needs to opt-in to using it.  This
> involves accepting a privacy policy and setting up a cronjob.  The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).
>
> The submission would contain only raw data, without any identification
> information.  It would be encrypted using our public key.  Once
> uploaded, it would be put into our input queue as-is.
>
> Periodically the input queue would be processed in bulk.  The individual
> statistics would be updated and the input would be discarded.  This
> should prevent people trying to correlate changes in statistics with
> individual uploads.
>
> Each counted item would have a timestamp associated, and we'd discard
> old items per resubmission period.  This should ensure that we keep
> fresh data and people can update their earlier submissions without
> storing identification data.
>
> For example, N users submit their data containing a list of packages
> every week.  This data is used in bulk to update counts of individual
> packages (technically, to append timestamps to list corresponding to
> these packages).  Data older than one week is discarded, so we have
> rough counts of package use during the last week.
>
> I think this addresses problems 3./6./7.
>
>
> The other major problem is spam protection.  The best semi-anonymous way
> I see is to use submitter's IPv4 addresses (can we support IPv6 then?). 
> We could set a limit of, say, 10 submissions per IPv4 address per week. 
> If some address would exceed that limit, we could require CAPTCHA
> authorization.
>
> I think this would make spamming a bit harder while keeping submissions
> easy for the most, and a little harder but possible for those of us
> behind ISP NATs.
>
> This should address problems 4./8. and maybe 5. to some degree.
>
>
> A proper solution to cluster problem would probably involve some way to
> internally collect and combine data before submission.  If you have
> large clusters of similar systems, I think you'd want to have all
> packages used on different systems reported as one entry.
>
>
> I think we should collect data from users running all Gentoo
> derivatives, as long as they are using Gentoo packages.  The simplest
> solution I can think of would be to filter the results on packages (or
> profiles) installed from ::gentoo.  This will work only for distros that
> expose ::gentoo explicitly (vs copying our ebuilds to their
> repositories) though.
>
>
> What do you think?  Do you foresee other problems?  Do you have other
> needs?  Can you think of better solutions?
>


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  9:09 ` Ulrich Mueller
  2020-04-26  9:32   ` Toralf Förster
  2020-04-26  9:42   ` Michał Górny
@ 2020-04-26 21:47   ` Andreas K. Hüttel
  2 siblings, 0 replies; 31+ messages in thread
From: Andreas K. Hüttel @ 2020-04-26 21:47 UTC (permalink / raw
  To: Michał Górny, gentoo-dev; +Cc: gentoo-dev, Ulrich Mueller

[-- Attachment #1: Type: text/plain, Size: 1213 bytes --]

Am Sonntag, 26. April 2020, 12:09:59 EEST schrieb Ulrich Mueller:
> >>>>> On Sun, 26 Apr 2020, Michał Górny wrote:
> > The other major problem is spam protection.  The best semi-anonymous way
> > I see is to use submitter's IPv4 addresses (can we support IPv6 then?).
> > We could set a limit of, say, 10 submissions per IPv4 address per week.
> > If some address would exceed that limit, we could require CAPTCHA
> > authorization.
> 
> Instead of using the IP address, you could generate a UUID when
> installing the tool. This would also take care of clusters with machines
> that are clones of each other.
> 

TBH, for clusters I would insert a sentence like
"If you are administering a cluster of many identical Gentoo machines, please 
see $WIKIPAGE before enabling submission"

and there then have a few more instructions (like how to enable only for one 
machine, and additionally provide us with the cluster size). I guess in this 
case we can add this further step, since whoever is doing that will be both 
invested in Gentoo and able to read docs.
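
One way the "report from one machine, plus the cluster size" idea could
be counted on the server side, as a minimal sketch. The "cluster_size"
field and the function name are assumptions made up here, and whether a
cluster should count in full, be capped, or count as a single entry is
a policy question this sketch does not settle.

```python
from collections import Counter


def weighted_counts(submissions):
    """Count packages, weighting each submission by its cluster size."""
    counts = Counter()
    for sub in submissions:
        # Hypothetical field; single machines default to a weight of 1.
        weight = sub.get("cluster_size", 1)
        for pkg in sub["packages"]:
            counts[pkg] += weight
    return counts


submissions = [
    {"packages": ["dev-lang/perl"]},
    {"packages": ["dev-lang/perl", "www-servers/nginx"], "cluster_size": 50},
]
print(weighted_counts(submissions))
```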

-- 
Andreas K. Hüttel
dilfridge@gentoo.org
Gentoo Linux developer 
(council, qa, toolchain, base-system, perl, libreoffice)

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
                   ` (5 preceding siblings ...)
  2020-04-26 17:24 ` Samuel Bernardo
@ 2020-05-04 22:57 ` Andrey Utkin
  2020-05-05  0:24   ` Thomas Deutschmann
  2020-05-25  2:47   ` Robin H. Johnson
  2020-05-05 14:34 ` Nils Freydank
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 31+ messages in thread
From: Andrey Utkin @ 2020-05-04 22:57 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 835 bytes --]

Since it is going to be opt-in and optional anyway, we seem to be fine with
having just partial data.

I assume we have logs of distfiles downloads from Gentoo infrastructure, and
can negotiate access to relevant logs of our mirrors. That constitutes partial
data correlated with users' installation activity, as good as it gets.

If we do have some such data, are we using it in any way for the discussed
purposes?

If we don't, but could get it, would we be able to use that data for these
purposes? If no, why?

If we can't get the data, why?


As an aside, I think the best known way to ensure the availability of important
things, from user perspective, is to pay for these important things. Of course
I see how this won't fit culturally very well here and that we're not going to
switch to commercial model just for this reason.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 1014 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-05-04 22:57 ` Andrey Utkin
@ 2020-05-05  0:24   ` Thomas Deutschmann
  2020-05-25  2:47   ` Robin H. Johnson
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Deutschmann @ 2020-05-05  0:24 UTC (permalink / raw
  To: gentoo-dev


[-- Attachment #1.1: Type: text/plain, Size: 637 bytes --]

On 2020-05-05 00:57, Andrey Utkin wrote:
> I assume we have logs of distfiles downloads from Gentoo infrastructure, and
> can negotiate access to relevant logs of our mirrors. That constitutes partial
> data correlated with users' installation activity, as good as it gets.

Even if we had data for distfiles.gentoo.org, this wouldn't help us.
See how Gentoo works: if you follow the handbook, you will pick a
local/regional mirror. Now all these users are suddenly 'disconnected'
from the download stats...


-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
fpr: C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 618 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26 13:46   ` Kent Fredric
@ 2020-05-05  0:47     ` Thomas Deutschmann
  2020-05-05  5:14       ` Matt Turner
                         ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Thomas Deutschmann @ 2020-05-05  0:47 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1.1: Type: text/plain, Size: 1661 bytes --]

On 2020-04-26 15:46, Kent Fredric wrote:
> On Sun, 26 Apr 2020 14:38:54 +0200
> Thomas Deutschmann <whissi@gentoo.org> wrote:
> 
>> Let's assume we will get reports that app-misc/foo is only installed 20
>> times. If you are going to judge based on this data, "Obviously, nobody
>> is using that package, it's stuck on <whatever>... safe to remove" your
>> view is biased:
> 
> I see this as more like what bloom filters get you, but in reverse:
> 
> [...]
>
> - But now, instead of having "we don't know if anybody uses this", you
>   *can* have a "we know for sure somebody uses this".

But how does that information really help us to decide anything in the end?

Case A, stats are showing 0 users:

As I said, we can't know if this is true or if this package is only used
in setups where people don't report stats.

Case B, stats are showing x users:

Now what? The package from case A could have a similar number of users
-- we just don't know. Assume firefox has 1.000 users, chromium has 500
users and vivaldi doesn't show up in stats. How does that help us? Would
this allow us to skip publishing GLSAs for vivaldi because we assume
nobody in Gentoo is using vivaldi? Does it allow the Python project to
go forward with pushing a mask for removal in case vivaldi depends on a
Python version that the Python project wants to get rid of? Would this
allow Gentoo PR to make a public statement like "Firefox is the most
popular browser in Gentoo, with twice as many users as chromium"?

Yes, it would be a signal, but a useless signal, no?

-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
fpr: C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 618 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-05-05  0:47     ` Thomas Deutschmann
@ 2020-05-05  5:14       ` Matt Turner
  2020-05-05  6:19         ` Alec Warner
  2020-05-05  7:10       ` Michał Górny
  2020-05-05 19:38       ` Kent Fredric
  2 siblings, 1 reply; 31+ messages in thread
From: Matt Turner @ 2020-05-05  5:14 UTC (permalink / raw
  To: gentoo development

On Mon, May 4, 2020 at 5:48 PM Thomas Deutschmann <whissi@gentoo.org> wrote:
>
> On 2020-04-26 15:46, Kent Fredric wrote:
> > On Sun, 26 Apr 2020 14:38:54 +0200
> > Thomas Deutschmann <whissi@gentoo.org> wrote:
> >
> >> Let's assume we will get reports that app-misc/foo is only installed 20
> >> times. If you are going to judge based on this data, "Obviously, nobody
> >> is using that package, it's stuck on <whatever>... safe to remove" your
> >> view is biased:
> >
> > I see this as more like what bloom filters get you, but in reverse:
> >
> > [...]
> >
> > - But now, instead of having "we don't know if anybody uses this", you
> >   *can* have a "we know for sure somebody uses this".
>
> But how does that information really help us to decide anything in the end?
>
> Case A, stats are showing 0 users:
>
> Like said, we can't know if this is true or if this package is only used
> in setups where people don't report stats.
>
>
> Case B, stats are showing x users:
>
> Now what? Package from case A could have similar users -- we just don't
> know. Assume firefox has 1.000 users, chromium has 500 users and vivaldi
> doesn't show up in stats. How does that help us? Would this allow us to
> skip publishing GLSAs for vivaldi because we assume nobody in Gentoo is
> using vivaldi? Does it allow Python project to go forward pushing a mask
> for removal in case vivaldi would depend on Python version, Python
> project want to get rid of? Would this allow Gentoo PR to make a public
> statement like "Firefox is the most popular browser in Gentoo, twice as
> users as chromium"?

I hate the saying "the perfect is the enemy of the good" but I think
it applies here.

You're of course correct that we would not have perfect information.
But the thing about statistics is that you can still know some things
based on a sampling of that perfect information.

I would personally like to have data on whether users of my packages
have certain USE flags enabled. Knowing that would allow me to decide
whether it's worth the maintenance burden of supporting features that I
*think* are very rarely used. If instead the data showed me that 50%
of users had IUSE=xyz enabled, I probably wouldn't consider removing
it.
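
The kind of number described above could be derived roughly like this.
This is only a sketch: the report layout (one dict per system, mapping
package to its set of enabled USE flags) and the package and flag names
are assumptions for illustration.

```python
def use_flag_share(reports, package, flag):
    """Fraction of reporting installs of `package` with `flag` enabled.

    Returns None when no report contains the package at all, which is
    different from a share of 0.0.
    """
    installs = [r[package] for r in reports if package in r]
    if not installs:
        return None
    return sum(flag in flags for flags in installs) / len(installs)


# Made-up sample: each dict is one system's report.
reports = [
    {"x11-libs/foo": {"xyz", "doc"}},
    {"x11-libs/foo": set()},
    {"app-misc/bar": {"xyz"}},
]
print(use_flag_share(reports, "x11-libs/foo", "xyz"))
```

Keeping "no reports" (None) distinct from "nobody enables it" (0.0)
matters here, since the sample only covers people who opted in.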

I think your example of potential misuse of data is a bit overdramatic.



* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-05-05  5:14       ` Matt Turner
@ 2020-05-05  6:19         ` Alec Warner
  0 siblings, 0 replies; 31+ messages in thread
From: Alec Warner @ 2020-05-05  6:19 UTC (permalink / raw
  To: Gentoo Dev

[-- Attachment #1: Type: text/plain, Size: 4390 bytes --]

On Mon, May 4, 2020 at 10:14 PM Matt Turner <mattst88@gentoo.org> wrote:

> On Mon, May 4, 2020 at 5:48 PM Thomas Deutschmann <whissi@gentoo.org>
> wrote:
> >
> > On 2020-04-26 15:46, Kent Fredric wrote:
> > > On Sun, 26 Apr 2020 14:38:54 +0200
> > > Thomas Deutschmann <whissi@gentoo.org> wrote:
> > >
> > >> Let's assume we will get reports that app-misc/foo is only installed
> 20
> > >> times. If you are going to judge based on this data, "Obviously,
> nobody
> > >> is using that package, it's stuck on <whatever>... safe to remove"
> your
> > >> view is biased:
> > >
> > > I see this as more like what bloom filters get you, but in reverse:
> > >
> > > [...]
> > >
> > > - But now, instead of having "we don't know if anybody uses this", you
> > >   *can* have a "we know for sure somebody uses this".
> >
> > But how does that information really help us to decide anything in the
> end?
> >
> > Case A, stats are showing 0 users:
> >
> > Like said, we can't know if this is true or if this package is only used
> > in setups where people don't report stats.
> >
> >
> > Case B, stats are showing x users:
> >
> > Now what? Package from case A could have similar users -- we just don't
> > know. Assume firefox has 1.000 users, chromium has 500 users and vivaldi
> > doesn't show up in stats. How does that help us? Would this allow us to
> > skip publishing GLSAs for vivaldi because we assume nobody in Gentoo is
> > using vivaldi? Does it allow Python project to go forward pushing a mask
> > for removal in case vivaldi would depend on Python version, Python
> > project want to get rid of? Would this allow Gentoo PR to make a public
> > statement like "Firefox is the most popular browser in Gentoo, twice as
> > users as chromium"?
>
> I hate the saying "the perfect is the enemy of the good" but I think
> it applies here.
>
> You're of course correct that we would not have perfect information.
> But the thing about statistics is that you can still know some things
> based on a sampling of that perfect information.
>
> I would personally like to have data on whether users of my packages
> have certain USE flags enabled. Knowing that would allow me to decide
> whether its worth the maintenance burden of supporting features that I
> *think* are very rarely used. If instead the data showed me that 50%
> of users had IUSE=xyz enabled, I probably wouldn't consider removing
> it.
>
> I think your example of potential misuse of data is a bit over dramatic.
>

Let me present the same point another way.

Today we have no data, so we make an arbitrary decision. It might be right
or wrong; and we may not know until after we decide.
This is traditionally the "break them and they will come" type of
process: "Mask it; if they complain, I'll unmask it."

In the future, we could have this package data. It may influence decision
making. However I'm not sure from a decision-making standpoint that it is
strictly worse than no data.
The danger (which is what I think Whissi's concern is) is that it could
artificially increase decision certainty.

For example, say I have to decide whether to keep a package, or a flag, or
whatever. I might make an arbitrary decision. I'm aware it's arbitrary, it
might be wrong, and so I'm not super attached to such a decision. I'm not
*certain* about it; but I have to decide one way or the other[0]. Then I
move to a world with package data. Now I'm no longer making an arbitrary
decision; I'm making a decision based on *data*. The *data* tells me my
decision is correct, resulting in a more *certain* decision outcome. I
think this is the fallacy we want to avoid. The data can be informative but
there are significant biases in it that should result in very *little*
certainty added to decision making.

Making decisions based on incomplete data is just life though, so I'm
fairly skeptical of a "we shouldn't collect any data" type of mindset. I'd
be curious to see if we can instill a *culture* component around the use of
data in our development workflows.

-A

[0] There are a bunch of other cultural components here, like different
decision types (1 vs 2) and the ability to make a mistake in public and not
feel bad about it; so I'm aware reality does not reflect this trivial
example. But those are hallmarks of cultural markets I'd like to aim for in
Gentoo, so I would prefer to discuss a world where they exist ;)

[-- Attachment #2: Type: text/html, Size: 5449 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-05-05  0:47     ` Thomas Deutschmann
  2020-05-05  5:14       ` Matt Turner
@ 2020-05-05  7:10       ` Michał Górny
  2020-05-05 19:38       ` Kent Fredric
  2 siblings, 0 replies; 31+ messages in thread
From: Michał Górny @ 2020-05-05  7:10 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1766 bytes --]

On Tue, 2020-05-05 at 02:47 +0200, Thomas Deutschmann wrote:
> Yes it would be a signal but a useless signal, not?
> 

You seem to aim for arbitrarily blocking developers from making
decisions by preventing them from having data.  This won't work.
Firstly, because *we have* to make decisions, and the worse the data we
have, the more arbitrary the decisions will be.  Secondly, because we
will always have some data; it will just probably be worse than what's
being proposed here.

Generally, having more data means making better informed decisions.
Of course, there's always the potential of having too much data (though
I honestly don't think we're anywhere near that).  There's also
the potential of being lazy and just taking the easiest available data. 
There's no way around that but then, you can also be lazy and make
decisions ignoring any data.

For example, one kind of data we have right now are bugs.  So a package
fails for me in an obvious way yet there's no bug open.  Does that mean
that the package has zero users?  Otherwise someone would have reported
the problem, right?  So here go last rites.

Gentoostats could tell me 'hey, this package has a bunch of users still'.
This questions my first assessment -- 'oh, they probably haven't had to
rebuild it since ...'

If we have no data, we have to rely on 'gut feelings'.  I have a gut
feeling that this package looks useless, why bother.  Is that more
worthwhile than having *some* number to look at?  Even if the data is
biased towards specific kind of users, it would probably work better
than guessing.  And if it looks unreasonable, nobody stops you from
guessing.  I guess that an informed guess is better than a random guess.

-- 
Best regards,
Michał Górny

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
                   ` (6 preceding siblings ...)
  2020-05-04 22:57 ` Andrey Utkin
@ 2020-05-05 14:34 ` Nils Freydank
  2020-05-05 18:04 ` Jaco Kroon
  2020-05-05 19:31 ` Toralf Förster
  9 siblings, 0 replies; 31+ messages in thread
From: Nils Freydank @ 2020-05-05 14:34 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1079 bytes --]

Hi all,

I find the idea of having data great, but agree that it can lead to a false
sense of having a correct data base. Therefore, two thoughts:

First, I'd like to propose that you introduce gentoostats as a
*strictly timed experiment*, evaluate whether it actually changed anything
in your decisions, and afterwards either drop it or let it run permanently.

I have no proper solution for the parameters though, maybe something like
"I choose to keep X use flags based on g.s.", but this would ask every dev to
log plenty of decisions manually (read: I don't think this will happen).

Second, I'm a bit frightened by Whissi's thought of dropping anything
security related based on non-input via g.s. -- I'd like to ask you *not* to
use the information from g.s. for security related decisions, but rather for
"harmless" ones like the one Matt mentioned: Should I really support feature
X while literally every one of 200 users uses feature Y instead and I have no
real testing ground for feature X? (Matt, yell at me if I got you wrong!)

Kind regards,
Nils (holgersson on Freenode)

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 963 bytes --]


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
                   ` (7 preceding siblings ...)
  2020-05-05 14:34 ` Nils Freydank
@ 2020-05-05 18:04 ` Jaco Kroon
  2020-05-05 19:31 ` Toralf Förster
  9 siblings, 0 replies; 31+ messages in thread
From: Jaco Kroon @ 2020-05-05 18:04 UTC (permalink / raw
  To: gentoo-dev, Michał Górny

Hi Michał, and the rest of the Gentoo devs,

I've been patiently sitting and watching this discussion.

I raised some ideas with another developer (Not Michał) just days before
he raised this thread to the ML.

I believe all points raised so far are valid; I'll try to summarise:

1.  This must be completely *opt in*.
2.  Anonymity was discussed by various parties (privacy).
3.  "spam" protection (ie, preventing bogus data from entering).
4.  Trustworthiness of data.
5.  Acceptance of some form of privacy policy.

In my opinion, points 2 and 3 work against each other: if registration
is compulsory for submitting stats, then we can control the spam more
easily (not foolproof), but requiring registration also raises the entry
barrier.  I'd be completely willing to provide at least an email address
as part of a submission.

All of the replies seem to have focused purely on yes/no, do it or
don't.  Not many have addressed the benefits to end users/system
administrators.  The focus seems to be on what we as developers can get
out of this.

Regarding the above points:

1.  I fully agree.  This should not be forced on anyone.
2.  Happy to concede that some people may wish to submit anonymously. 
Let them.
3.  I'll address this below.
4.  A lot of the discussion has been around the usefulness of the data,
and I concede to Thomas that this may (or may not) generate "decision
blind spots" or as per "artificially increase decision certainty".  I
don't see how this is worse than what we've got now.
5.  We have the infrastructure for this already by way of licenses.  So
we ship with "GPLv2/3/whatever + GentooPrivacy", and users have to first
take explicit action to accept GentooPrivacy.

I have some other ideas around this, which will tread even further on
privacy, but again, all of this should be a kind of opt-in, and building
on the ideas by Kent where he suggested a form of submission proxy
(STATS_SERVER), we could potentially give the full benefit of the code
to such entities, but then still allow them to submit "upstream" in a
more filtered manner.

Bottom line, in my opinion:  Any data is better than no data!

Whilst we can't say "no one is using xyz", we will at least be able to
say "hey, some people are using xyz", and whilst this may generate some
blind spots, it at least enables us to test known use cases during
test-builds, eg, we know for a fact a thousand users are using package X
with USE flags "-* a b c", so we should definitely run that as a compile
test.  Your build breaks frequently?  Would you mind submitting stats?
Great, thank you.  You're not willing to do that?  Then my stance becomes
one of "ok, I'll help where I can, but really, please consider helping us
to help you: if you submit stats we can pre-emptively at least include
build tests for your specific USE flags." - and again, this means we can
actually have our tooling use these stats to generate build tests for
the "known popular" configs.
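
A rough sketch of how tooling could pick the "known popular" configs to
build-test, assuming each report maps a package to its set of enabled
USE flags. That layout, the function name and the sample data are made
up for this example.

```python
from collections import Counter


def popular_use_configs(reports, package, top=3):
    """Return the most frequently reported USE flag sets for a package."""
    combos = Counter(frozenset(r[package]) for r in reports if package in r)
    return [(sorted(flags), n) for flags, n in combos.most_common(top)]


# Made-up sample: each dict is one system's report.
reports = [
    {"app-misc/x": {"a", "b", "c"}},
    {"app-misc/x": {"a", "b", "c"}},
    {"app-misc/x": {"a"}},
]
for flags, count in popular_use_configs(reports, "app-misc/x"):
    print(count, " ".join(flags))
```

Each returned flag set could then be fed to whatever compile-test setup
is already in use.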

I point you to RHEL - why are people willing to pay for RHEL?  What
do they get for that buck?  Because I promise you, the support I get
from fellow Gentoo'ers FAR outweighs the support I have ever gotten from
(paid for) RHEL.  Most of the time.

I myself used to run 500+ Gentoo hosts more than 15 years back.  It was
fun.  I was also a student back then so had much more time on my hands
than I do now.  It was challenging, and fun to try and get things to
work exactly the way we envisioned it should.  I promise you, if what
Michał proposes was available for me back then to firstly keep track of
my own internal assets, and to submit stats upstream to help improve
Gentoo I would not have hesitated for 10 seconds.

And there I touch on a point I'm trying to make - this should be
something that not only helps devs, but brings benefit to users.  I'll
say more on this at the end of the email (possibly force users to run
some of their own infra for this at least, but these stats form the
framework for a multi-system management system too, potentially).  First
I'd like to pay more attention to the individual points raised by Michał.

On 2020/04/26 10:08, Michał Górny wrote:

> Hi,
>
> The topic of rebooting gentoostats comes here from time to time.  Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems.  I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it.  However, I'd like to
> get a clear plan on how it should be done if someone actually does it.

My time is also limited, but I would love to be involved in some way or
another.

> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove.  Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
>    a. list of installed packages?
>    b. versions (or slots?) of installed packages?
>    c. USE flags on installed packages?
>    d. world and world_sets files
>    e. system profile?
>    f. enabled repositories? (possibly filtered to official list)

All of the above.  Including exact versions and USE flags for each
package.  Also, I'm sure there are others, but I sometimes have systems
that fall behind on certain packages, either by no longer being included
from world or for other reasons (eg, a specific SLOT that no longer
updates for some reason, although this situation has improved).

>    g. distribution?

/etc/gentoo-release?

Yes, I think so, that partially deals with your "derivative distributions".

h.  date+time of last successful emerge --sync (probably individually
for each repository).
i.  /var/log/emerge.log
j.  hardware data, eg, amount of RAM, CPU clock speed/cores, disks.
k.  hostname + other network info (IP address).

i - build failures might be helpful.  It might also be useful to get
exact merge times, assuming that users want some extra features for user
benefit, not Gentoo dev benefit.
j,k - definitely not of use to devs, but possibly to users as a form of
"hardware inventory".

Much of this is definitely not data that we want/need, but if the data
gets proxied, then we and our users can use this as a form of inventory
management system too.

> I think d. is most important as it gives us information on what users
> really want.  a. alone is kinda redundant is we have d.  c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
>
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
>
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.
I agree with all of this.
> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.
Don't think so, but see your own point 2.
>
>
> 2. How to handle Gentoo derivatives?  Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages).  One possible option would be to filter a.-e. 
> to stuff coming from ::gentoo.

It may be of benefit to know which ::gentoo packages they are using, and
if we make the code available to those distributions as a form of
proxy/peer, then any hosts that submit directly to Gentoo we could
dispatch to that distributions' infra, or if we're really nice, just
keep it and strip out the packages we don't maintain (ie, not ::gentoo
or official repositories).
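
A minimal sketch of the "strip out the packages we don't maintain" step,
in Python.  The record layout (a dict with "atom" and "repo" keys), the
repository names and the function name are all made up for illustration;
nothing like this exists yet.

```python
# Hypothetical: repositories whose packages Gentoo would keep stats for.
OFFICIAL_REPOS = frozenset({"gentoo"})


def split_by_origin(packages, official=OFFICIAL_REPOS):
    """Split a submitted package list into (kept, stripped).

    Each package is assumed to be a dict such as
    {"atom": "cat/pkg-1.0", "repo": "gentoo"}.
    """
    kept, stripped = [], []
    for pkg in packages:
        # Entries without a known official repository are stripped.
        (kept if pkg.get("repo") in official else stripped).append(pkg)
    return kept, stripped


# Made-up example submission from a derivative distribution.
submission = [
    {"atom": "cat/pkg-1.0", "repo": "gentoo"},
    {"atom": "cat/fork-2.0", "repo": "derivative-overlay"},
]
kept, stripped = split_by_origin(submission)
```

The "stripped" half is what could be dispatched to the derivative's own
infra instead of being thrown away.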

>
>
> 3. How to keep the data up-to-date?  After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results.  I suppose
> we'll need to timestamp all data and remove old entries.

My opinion on this: an automated cron job that dispatches daily.  At
least weekly.  Daily provides better granularity for some other ideas
aimed at system administrators.  Eg, when did what change?  I shove /etc
into git for this reason alone, with a nightly cron to commit everything
and push it to a remote server; this also serves as a form of
configuration backup.

>
>
> 4. How to avoid duplication?  If some users submit their results more
> often than others, they would bias the results.  3. might be related.

I think this directly relates to spam.  So I fully agree with the UUID
per installation concept.  But then systems get cloned (our labs used to
be updated on a single machine, then we utilized udpcast to update the
rest of the systems, so they would all end up with the same UUID).  So
the primary purpose of the UUID is to find the origin of the
installation, and it can be trivially manipulated, either by
force-generating a new UUID or by copying one from another machine.

I think we need to add a secondary, hardware based identifier.

Digium (now Sangoma) checks the MAC addresses of all ethX interfaces,
starting from 0 until the ioctl gets a failure; if eth0 fails, it
basically does "ip ad sh" and ends up including the same MAC multiple
times, and in arbitrary order, since the NICs aren't guaranteed to be
detected in the same order on every boot.  This (or a related) method
could work: generate some unique hardware-based identifier, then hash it
using say SHA-256 or BLAKE2 to generate something which can't be
trivially reversed back to the original identifier.  Why ... well,
anonymity :).  We could even include the configured or DHCP-obtained
hostname into this.
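
A rough sketch of such a hardware-based identifier in Python, assuming
we read MACs from /sys/class/net on Linux.  The function names are made
up, and sorting plus de-duplicating is my own addition to get around the
ordering and repetition problem described above.  Note that the MAC
address space is small, so a plain hash can still be brute-forced; this
only stops trivial reversal.

```python
import hashlib
from pathlib import Path


def read_macs(sysfs="/sys/class/net"):
    """Collect MAC addresses of all interfaces (Linux sysfs layout)."""
    macs = []
    for addr_file in Path(sysfs).glob("*/address"):
        try:
            mac = addr_file.read_text().strip()
        except OSError:
            continue  # interface vanished or is unreadable
        if mac and mac != "00:00:00:00:00:00":  # skip loopback-style entries
            macs.append(mac)
    return macs


def hardware_ident(macs, hostname=""):
    """Hash MACs (and optionally the hostname) into a hex identifier."""
    # Normalise case, drop duplicates and sort, so that detection order
    # and repeated entries do not change the result.
    unique = sorted({m.strip().lower() for m in macs if m.strip()})
    h = hashlib.blake2b(digest_size=32)
    for mac in unique:
        h.update(mac.encode("ascii"))
        h.update(b"\0")  # separator between entries
    h.update(hostname.encode("utf-8"))
    return h.hexdigest()


# Made-up MACs: same set, different order and case, one repeated.
a = hardware_ident(["AA:BB:CC:00:11:22", "aa:bb:cc:00:11:23"])
b = hardware_ident(["aa:bb:cc:00:11:23", "aa:bb:cc:00:11:22",
                    "AA:BB:CC:00:11:22"])
```

On a real host one would call hardware_ident(read_macs()).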

> 5. How to handle clusters?  Things are simple if we can assume that
> people will submit data for a few distinct systems.  But what about
> companies that run 50 Gentoo machines with the same or similar setup? 
> What about clusters of 1000 almost identical containers?  Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.
Assuming they do what we did ... they'd probably (hopefully) all end up
with the same (installation time?) UUID but different hardware
identifiers.  So we'd be able to identify them ... and, as an enterprise
idea, report back to those admins (assuming they registered these
systems to their profile) that their clusters have discrepancies.
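
To illustrate the discrepancy report, a small Python sketch.  It assumes
each submission is a (uuid, hw_hash, packages) tuple, which is a made-up
layout, and it simply flags whatever a host has beyond what all hosts
sharing the same UUID have in common.

```python
from collections import defaultdict


def cluster_discrepancies(submissions):
    """Return {uuid: {hw_hash: packages not shared by the whole cluster}}."""
    clusters = defaultdict(dict)
    for uuid, hw_hash, packages in submissions:
        clusters[uuid][hw_hash] = set(packages)

    report = {}
    for uuid, hosts in clusters.items():
        if len(hosts) < 2:
            continue  # a single host is not a cluster
        common = set.intersection(*hosts.values())
        odd = {hw: pkgs - common for hw, pkgs in hosts.items()
               if pkgs - common}
        if odd:
            report[uuid] = odd
    return report


# Made-up data: two cloned hosts, one of which has drifted.
report = cluster_discrepancies([
    ("uuid-1", "hw-a", {"cat/pkg-1.0"}),
    ("uuid-1", "hw-b", {"cat/pkg-1.0", "cat/extra-2.0"}),
    ("uuid-2", "hw-c", {"cat/other-3.0"}),
])
```
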
>
>
> 6. Security.  We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.

Agreed.  But some of this may have particular benefit for system
administrators, so perhaps a secondary level of opt-in for providing
data that is "potentially sensitive" if the Gentoo infra gets
compromised.  We could perhaps store a raw blob for these users that
only gets decrypted by some key that only they should have/possess.

Or, we could proxy the data, let the sensitive stuff travel to the
proxy/aggregator, and strip that from going higher up.  And they simply
generate those reports locally on their proxy/aggregator.

>
>
> 7. Privacy.  Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them.  If we
> don't respect privacy of our users, we won't get them to submit data.

I'm happy with blind UUID + HW-related-hash submission, without any
further data, but would really appreciate it if users were willing to
register.  This would have the following benefits IMHO:

They could subscribe for news items that affect them.
They could subscribe for receiving GLSAs for packages that affect their
systems.
They could get a view of all their systems from a central "management"
interface.

I have a need to be able to ask the asterisk users on Gentoo what they
need/want.  As it stands, I'm suffering from "user blindness".  Again, I
have my own needs, and I scratch those, but helping others to get their
needs scratched is a good thing.  If you don't want to participate,
that's fine, but if you do, you get to reap the benefit.  Towards this
end, and perhaps enabling some users to provide some feedback a further
future step may be to enable users to anonymously submit requests via
the system.  Or we could get anonymous feedback from users from whom
we'd normally not get any.  So if the core infra on this has email
addresses for all users, it could send out the email on-behalf-of the
package maintainer, and feedback could then be submitted via some
anonymous mechanism (eg, link in email that takes the user to a
submissions page, and we explicitly don't encode per-recipient
cookie-style data into the link).  An idea.

>
>
> 8. Spam protection.  Finally, the service needs to be resilient to being
> spammed with fake data.  Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.

Data only gets included after being kept up to date for a period of at
least X days.  Based on generated UUID + HW-Hash.  UUID is (optionally
but ideally) linked to a user profile.  HW-Hash is just to identify
unique systems.

Data that doesn't get kept up to date could be filtered out after Y
days, where Y <= X.  That way a spammer would at least need to take the
effort of keeping his spamming effort going for X number of days with X
number of unique (trivially spoofable) identifiers.  So we don't deny
that it can be done, I'm just not sure we care?
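
The X/Y rule could look roughly like this in Python.  This is only a
sketch: I'm assuming "kept up to date" means no gap longer than Y days
between consecutive submissions from the same UUID + HW-Hash pair, and
the function and parameter names are invented.

```python
from datetime import date


def counts_towards_stats(dates, today, x_days, y_days):
    """Decide whether one UUID + HW-Hash pair is included in the stats.

    dates: submission dates seen for that pair.
    Included only if the latest submission is at most y_days old and the
    unbroken run of submissions leading up to it spans >= x_days.
    """
    if not dates:
        return False
    ordered = sorted(dates)
    last = ordered[-1]
    if (today - last).days > y_days:
        return False  # stale: filtered out after Y days

    # Walk backwards until a gap larger than Y days breaks the run.
    run_start = last
    for d in reversed(ordered[:-1]):
        if (run_start - d).days > y_days:
            break
        run_start = d
    return (last - run_start).days >= x_days


# Made-up example: weekly submissions over four weeks, X=28, Y=7.
weekly = [date(2020, 4, 7), date(2020, 4, 14), date(2020, 4, 21),
          date(2020, 4, 28), date(2020, 5, 5)]
included = counts_towards_stats(weekly, date(2020, 5, 5), 28, 7)
```

A spammer then has to keep every fake identifier alive for the whole
X-day window before it counts at all.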

Other than me, who would benefit from spoofing stats for asterisk, for
example?  Perhaps someone with a grudge?  But they have my email address
anyway ... so can do far worse than generate a few spoofed submissions.

> My (partial) implementation idea
> ================================
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle.  This means no correlation and no tracking.

I both agree and disagree.  The most basic premise should be no
tracking/correlation unless the user specifically requests it towards
specific functionality (eg, emailing of affecting GLSAs/news items,
single-platform for viewing my hosts and what their status is).

> Once the tool is installed, the user needs to opt-in to using it.  This
> involves accepting a privacy policy and setting up a cronjob.  The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).

As above, I'd do this as part of accepting a license that states by
accepting this license you accept the most basic submission of stats in
an anonymous manner including only the most basic of identifier
information to identify unique systems.

> The submission would contain only raw data, without any identification
> information.  It would be encrypted using our public key.  Once
> uploaded, it would be put into our input queue as-is.

Correct.  Explicit action required to register UUID to user profile.  If
that is an option.

Eg, gentoo-stat --link-to jaco@iewc.co.za

Then prompt for my password, which I then need to enter in order to link
the UUID of the current system to my registered profile.

So completely anonymous, with minimum data, unless specifically
configured otherwise.

>
> Periodically the input queue would be processed in bulk.  The individual
> statistics would be updated and the input would be discarded.  This
> should prevent people trying to correlate changes in statistics with
> individual uploads.

Ok.  This makes sense.  As a sysadmin I'd like that data to be
available for say 30 to 60 or even 90 days, or at least "what changed
from submission X to X+1 spanning the period", because then if something
breaks, I can ask "when did it break?" and then I can ask the stats
system "what changed on the related systems around that time?".  At
Gentoo core infra level, we can potentially discard as soon as
processed, but depending on the algorithm we may need to keep at least
the latest submitted copy for Y number of days (as defined above).

Ok, yes, I can do that by working through /var/log/emerge.log as well,
or genlop -l, but I need to do that system by system.  If I have an
environment of 500 hosts this gets tedious.  Or what if I'd like to find
what differs between a set of hosts where a feature X works, and others
where it doesn't?
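
The "what changed from submission X to X+1" question is cheap to answer
if the two snapshots are kept.  A Python sketch, assuming each snapshot
is a dict mapping package name to installed version (a made-up,
simplified layout that ignores slots and USE flags):

```python
def diff_snapshots(old, new):
    """Return (added, removed, changed) between two package snapshots."""
    added = {p: v for p, v in new.items() if p not in old}
    removed = {p: v for p, v in old.items() if p not in new}
    changed = {p: (old[p], new[p])
               for p in old.keys() & new.keys() if old[p] != new[p]}
    return added, removed, changed


# Made-up snapshots from two consecutive submissions of one host.
monday = {"cat/pkg": "1.0", "cat/gone": "0.9", "cat/same": "2.0"}
tuesday = {"cat/pkg": "1.1", "cat/new": "0.1", "cat/same": "2.0"}
added, removed, changed = diff_snapshots(monday, tuesday)
```

The same function works for comparing a host where feature X works with
one where it does not.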

>
> What do you think?  Do you foresee other problems?  Do you have other
> needs?  Can you think of better solutions?
>
I think we should build a hierarchy.  So Gentoo-infra at the top.  End
users may submit only certain types of data there, all other data we as
devs don't care about gets discarded, and if we allow users to register
there directly we limit the functionality thereof in order to maintain
the requirements of the developers here first and foremost.

As such, the submitted package should be based on "data sets" in my
opinion, where the most basic sets could be:

core:
  a) package list including versions and use flags
  b) world and world_sets
  c) uuid
  d) hash(hardware ident)

hardware:
  a) RAM
  b) ...

network:
  a) ...

At the Gentoo-infra layer we can then have a policy that we ONLY accept
"core" sets.  It should be easy to define your own sets at the
proxy/aggregator level, and to provide mechanisms to obtain the data (or
as plugins on the hosts themselves, eg, USE="hardware network"
gentoo-stats-plugins style, with the main package only containing what
the devs need).  Just ideas.

Further down the hierarchy additional sets could be defined, and
proxy/aggregator hosts could define what information they allow higher
up the hierarchy.
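
As a sketch of that per-level policy in Python: a submission is assumed
to be a dict keyed by set name, and each level forwards only the sets it
allows.  The set names follow the list above; the payload values and the
function name are placeholders I made up.

```python
def filter_sets(submission, allowed_sets):
    """Keep only the data sets this level is willing to pass upwards."""
    return {name: data for name, data in submission.items()
            if name in allowed_sets}


# Made-up submission as it might leave a host.
submission = {
    "core": {"uuid": "uuid-1", "hw_hash": "abc123",
             "world": ["cat/pkg"]},
    "hardware": {"ram_mb": 16384},
    "network": {"hostname": "host1"},
}

# A site proxy/aggregator might keep hardware data but drop network data;
# Gentoo infra at the top accepts only "core".
at_proxy = filter_sets(submission, {"core", "hardware"})
at_gentoo = filter_sets(at_proxy, {"core"})
```
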

If we receive information for a Gentoo derivative we redirect it to that
distribution.  Although for such a case we really should provide a way
for derivatives to specify their own "default" infra.

Other projects can then build on top of, or as plug-ins of the core
stats project to then provide the more enterprise-like features.  One
could potentially even go as far as automated updating driven from a
central control server in a networked environment where the
proxy/aggregator is able to connect back to the individual hosts to
execute commands on them.

I sincerely hope my ramblings haven't been completely off point.  I
believe the above shows that this can be of benefit to users and
developers alike, and hopefully in a way that does not infringe on
users' rights or privacy.

One thing could be for aggregators to submit aggregated stats instead of
individual systems, again, same X and Y stuff would apply, however, I
think for aggregated submissions the data skew risk becomes even
larger.  So perhaps we should provide two sets of stats "excluding
aggregated stats" and including, or possibly we can mark some
aggregators as trusted.  I dunno.

Kind Regards,
Jaco

* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
                   ` (8 preceding siblings ...)
  2020-05-05 18:04 ` Jaco Kroon
@ 2020-05-05 19:31 ` Toralf Förster
  2020-05-05 20:26   ` Daniel Pielmeier
  9 siblings, 1 reply; 31+ messages in thread
From: Toralf Förster @ 2020-05-05 19:31 UTC (permalink / raw
  To: gentoo-dev


On 4/26/20 10:08 AM, Michał Górny wrote:
> I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
Hi,

I do wonder if the http://www.portagefilelist.de/site/start (package app-portage/pfl) would be part of that or not?
The maintainer of the pfl stopped the import of new data last year due to lack of time to maintain that project and is looking for a successor.

-- 
Toralf
PGP 23217DA7 9B888F45



* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-05-05  0:47     ` Thomas Deutschmann
  2020-05-05  5:14       ` Matt Turner
  2020-05-05  7:10       ` Michał Górny
@ 2020-05-05 19:38       ` Kent Fredric
  2 siblings, 0 replies; 31+ messages in thread
From: Kent Fredric @ 2020-05-05 19:38 UTC (permalink / raw
  To: gentoo-dev

On Tue, 5 May 2020 02:47:48 +0200
Thomas Deutschmann <whissi@gentoo.org> wrote:

> Yes it would be a signal but a useless signal, not?

"There are no users reported using this dist, so we can nuke it" is
still far far superior to "there are no reverse dependencies, so we can
nuke it"

*Even* when the former is false information.

As presently, the "no reverse dependencies, therefore nuke" essentially
asserts there *are* no users to consider.

So the *worst* case scenario for decisions made with these statistics
is our *current* case.

Even if *nobody* uses the service and *all* results indicate "nobody
uses anything", then we'll just be reverting to what we currently do:
Remove things entirely on conjecture that they're not useful.


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-05-05 19:31 ` Toralf Förster
@ 2020-05-05 20:26   ` Daniel Pielmeier
  2020-05-05 21:40     ` Toralf Förster
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel Pielmeier @ 2020-05-05 20:26 UTC (permalink / raw
  To: gentoo-dev, Toralf Förster

Am May 5, 2020 7:31:34 PM UTC schrieb "Toralf Förster" <toralf@gentoo.org>:
>On 4/26/20 10:08 AM, Michał Górny wrote:
>> I don't think we really want to try to investigate
>> which files are actually used but focus on what's installed.
>Hi,
>
>I do wonder if the http://www.portagefilelist.de/site/start (package
>app-portage/pfl) would be part of that or not?
>The maintainer of the pfl stopped the import of new data last year due
>to lack of time to maintain that project and is looking for a
>successor.

Actually the maintainer decided to continue the project.
The code is now hosted at Github [1].
The site moved to a new server and the upload is working again.

[1] https://github.com/portagefilelist

-- 
Best regards
Daniel


* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-05-05 20:26   ` Daniel Pielmeier
@ 2020-05-05 21:40     ` Toralf Förster
  0 siblings, 0 replies; 31+ messages in thread
From: Toralf Förster @ 2020-05-05 21:40 UTC (permalink / raw
  To: gentoo-dev


On 5/5/20 10:26 PM, Daniel Pielmeier wrote:
> Actually the maintainer decided to continue the project.
> The code is now hosted at Github [1].
> The site moved to a new server and the upload is working again.
> 
> [1] https://github.com/portagefilelist
> 
> -- 
> Best regards
> Daniel

Indeed - I'm reactivating the pfl logic in the tinderbox script.

-- 
Toralf
PGP 23217DA7 9B888F45



* Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
  2020-05-04 22:57 ` Andrey Utkin
  2020-05-05  0:24   ` Thomas Deutschmann
@ 2020-05-25  2:47   ` Robin H. Johnson
  1 sibling, 0 replies; 31+ messages in thread
From: Robin H. Johnson @ 2020-05-25  2:47 UTC (permalink / raw
  To: gentoo-dev

On Mon, May 04, 2020 at 11:57:03PM +0100, Andrey Utkin wrote:
> Since it is going to be opt-in and optional anyway, we seem to be fine with
> having just partial data.
> 
> I assume we have logs of distfiles downloads from Gentoo infrastructure, and
> can negotiate access to relevant logs of our mirrors. That constitutes partial
> data correlated with users' installation activity, as good as it gets.
This assumption is wrong at the root.

> If we do have some such data, are we using it in any way for the discussed
> purposes?
> 
> If we don't, but could get it, would we be able to use that data for these
> purposes? If no, why?
> 
> If we can't get the data, why?
Simply put: Gentoo does not run the last-mile edge of distfile distribution.

$ dig @ns1.gentoo.org +noall +answer distfiles.gentoo.org IN A
distfiles.gentoo.org.	7200	IN	A	64.50.233.100
distfiles.gentoo.org.	7200	IN	A	140.211.166.134
distfiles.gentoo.org.	7200	IN	A	64.50.236.52
$ echo 140.211.166.134 64.50.233.100 64.50.236.52 |fmt -1 |xargs -n1 dig +short -x 
ftp-osl.osuosl.org.
ftp-nyc.osuosl.org.
ftp-chi.osuosl.org.

And historically also TDS & another provider.

Plus all of the regional mirrors that don't even have .gentoo.org
hostnames.

I would like to replace the legacy http://distfiles.gentoo.org/
functionality with a redirection service, at which point you could have
partial data, but it answers a very different question than Goose.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


end of thread, other threads:[~2020-05-25  2:48 UTC | newest]

Thread overview: 31+ messages
2020-04-26  8:08 [gentoo-dev] [RFC] Ideas for gentoostats implementation Michał Górny
2020-04-26  8:42 ` Alarig Le Lay
2020-04-26  8:43 ` Toralf Förster
2020-04-26  8:52   ` Michał Górny
2020-04-26 10:15     ` Toralf Förster
2020-04-26 10:25       ` Michał Górny
2020-04-26 10:54         ` Toralf Förster
2020-04-26 14:11     ` Kent Fredric
2020-04-26  9:09 ` Ulrich Mueller
2020-04-26  9:32   ` Toralf Förster
2020-04-26 10:39     ` Brian Dolbec
2020-04-26 13:52       ` Kent Fredric
2020-04-26  9:42   ` Michał Górny
2020-04-26 21:47   ` Andreas K. Hüttel
2020-04-26 12:38 ` Thomas Deutschmann
2020-04-26 13:46   ` Kent Fredric
2020-05-05  0:47     ` Thomas Deutschmann
2020-05-05  5:14       ` Matt Turner
2020-05-05  6:19         ` Alec Warner
2020-05-05  7:10       ` Michał Górny
2020-05-05 19:38       ` Kent Fredric
2020-04-26 12:56 ` Kent Fredric
2020-04-26 17:24 ` Samuel Bernardo
2020-05-04 22:57 ` Andrey Utkin
2020-05-05  0:24   ` Thomas Deutschmann
2020-05-25  2:47   ` Robin H. Johnson
2020-05-05 14:34 ` Nils Freydank
2020-05-05 18:04 ` Jaco Kroon
2020-05-05 19:31 ` Toralf Förster
2020-05-05 20:26   ` Daniel Pielmeier
2020-05-05 21:40     ` Toralf Förster
