[gentoo-dev] [RFC] Anti-spam for goose

public inbox for gentoo-dev@lists.gentoo.org

* [gentoo-dev] [RFC] Anti-spam for goose
@ 2020-05-21  8:47 Michał Górny
  2020-05-21  9:17 ` Toralf Förster
                   ` (7 more replies)
  0 siblings, 8 replies; 44+ messages in thread
From: Michał Górny @ 2020-05-21  8:47 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 4579 bytes --]

Hi,

TL;DR: I'm looking for opinions on how to protect goose from spam,
i.e. mass fake submissions.

Problem
=======
Goose currently lacks proper limiting of submitted data.  The only
limiter currently in place is based on unique submitter id that is
randomly generated at setup time and in full control of the submitter. 
This only protects against accidental duplicates but it can't protect
against deliberate action.

An attacker could easily submit thousands (millions?) of fake entries by
issuing a lot of requests with different ids.  Creating them is
as trivial as using successive numbers.  The potential damage includes:

- distorting the metrics to the point of them being useless (even though
some people consider them useless by design).

- submitting lots of arbitrary data to cause DoS via growing
the database until no disk space is left.

- blocking a large range of valid user ids, making collisions with
legitimate users more likely.

I don't think it's worthwhile to discuss the motivation for doing so:
whether it would be someone wishing harm to Gentoo, disagreeing with
the project or merely wanting to try and see if it would work.  The case
of SKS keyservers teaches us that you can't leave holes like this open
for a long time because someone will eventually abuse them.

Option 1: IP-based limiting
===========================
The original idea was to set a hard limit of submissions per week based
on IP address of the submitter.  This has (at least as far as IPv4 is
concerned) the advantages that:

- the submitter has limited control of his IP address (i.e. he can't
just submit stuff using arbitrary data)

- IP address range is naturally limited

- IP addresses have non-zero cost

This method could strongly reduce the number of fake submissions one
attacker could devise.  However, it has a few problems too:

- a low limit would harm legitimate submitters sharing an IP address
(e.g. behind NAT)

- it actively favors people with access to a large number of IP addresses

- it doesn't map cleanly to IPv6 (where some people may have just one IP
address, and others may have whole /64 or /48 ranges)

- it may cause problems for anonymizing network users (and we want to
encourage Tor usage for privacy)

All this considered, IP address limiting can't be used as the primary
method of preventing fake submissions.  However, I suppose it could work
as an additional DoS prevention, limiting the number of submissions from
a single address over short periods of time.

Example: if we limit to 10 requests an hour, then a single IP can be
used to manufacture at most 240 submissions a day.  This might be
sufficient to render the statistics unusable but should keep the
database reasonably safe.
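
A minimal sketch of such a short-window per-IP limit, assuming an
in-memory sliding window.  The class name IpRateLimiter and its
parameters are hypothetical and not part of goose:

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Sliding window: at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit=10, window=3600.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget accepted requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# One address retrying every minute for a full day at 10 requests/hour.
limiter = IpRateLimiter(limit=10, window=3600.0)
accepted_per_day = sum(
    limiter.allow("192.0.2.1", now=float(t)) for t in range(0, 86400, 60)
)
```

The per-address table itself grows with the number of distinct
addresses seen, so a real deployment would also need to expire idle
entries.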

Option 2: proof-of-work
=======================
An alternative of using a proof-of-work algorithm was suggested to me
yesterday.  The idea is that every submission has to be accompanied by
the result of some cumbersome calculation that can't be trivially run
in parallel or offloaded to dedicated hardware.

On the plus side, it would rely more on actual physical hardware than IP
addresses provided by ISPs.  While it would be a waste of CPU time
and memory, doing it just once a week wouldn't do that much harm.

On the minus side, it would penalize people with weak hardware.

For example, 'time hashcash -m -b 28 -r test' gives:

- 34 s (-s estimated 38 s) on Ryzen 5 3600

- 3 minutes (estimated 92 s) on some old 32-bit Celeron M
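
To illustrate why minting is expensive while checking is cheap, here is
a simplified hashcash-style sketch.  It is an assumption-laden
imitation: it uses SHA-256 and a made-up "resource:counter" stamp, not
the real hashcash stamp format, and the function names are hypothetical:

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource: str, bits: int) -> int:
    """Search for a counter whose stamp hash has `bits` leading zero bits."""
    for counter in count():
        digest = hashlib.sha256(f"{resource}:{counter}".encode()).digest()
        if leading_zero_bits(digest) >= bits:
            return counter


def verify(resource: str, counter: int, bits: int) -> bool:
    """Verification costs a single hash, whatever the difficulty."""
    digest = hashlib.sha256(f"{resource}:{counter}".encode()).digest()
    return leading_zero_bits(digest) >= bits


# 12 bits keeps the demo fast (about 2**12 hashes expected);
# the measurements above use 28 bits.
nonce = mint("test", 12)
```

Each additional bit doubles the expected number of hashes, so the cost
is easy to tune, but it scales with the submitter's hardware rather than
with anything the server controls.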

At the same time, it would still permit a lot of fake submissions.  For
example, randomx [1] claims to require 2G of memory in fast mode.  This
would still allow me to use 7 threads.  If we adjusted the algorithm to
take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
submissions a day.
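
The estimate can be reproduced with a few lines; the function name is
hypothetical and the inputs are the ones quoted above (7 threads, about
30 seconds per proof):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def submissions_per_day(threads: int, seconds_per_proof: float) -> int:
    """Upper bound on proofs one machine can produce in a day."""
    return int(threads * SECONDS_PER_DAY / seconds_per_proof)


per_day = submissions_per_day(threads=7, seconds_per_proof=30)  # 20160
```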

So in the end, while this is interesting, it doesn't seem like
a workable anti-spam measure.

Option 3: explicit CAPTCHA
==========================
A traditional way of dealing with spam -- require every new system
identifier to be confirmed by solving a CAPTCHA (or a few identifiers
for one CAPTCHA).

The advantage of this method is that it requires real human work to be
performed, effectively limiting the ability to submit spam.
The disadvantage is that it is cumbersome to users, so many of them will
just give up on participating.

Other ideas
===========
Do you have any other ideas on how we could resolve this?

[1] https://github.com/tevador/RandomX

-- 
Best regards,
Michał Górny

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  8:47 [gentoo-dev] [RFC] Anti-spam for goose Michał Górny
@ 2020-05-21  9:17 ` Toralf Förster
  2020-05-21  9:43   ` Michał Górny
  2020-05-21  9:48 ` Tomas Mozes
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 44+ messages in thread
From: Toralf Förster @ 2020-05-21  9:17 UTC (permalink / raw
  To: gentoo-dev


[-- Attachment #1.1: Type: text/plain, Size: 279 bytes --]

On 5/21/20 10:47 AM, Michał Górny wrote:
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
> 

I'd combine IP-limits with proof-of-work.
CAPTCHA should be the very last option IMO.

-- 
Toralf
PGP 23217DA7 9B888F45



* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  9:17 ` Toralf Förster
@ 2020-05-21  9:43   ` Michał Górny
  2020-05-21 20:07     ` Toralf Förster
  0 siblings, 1 reply; 44+ messages in thread
From: Michał Górny @ 2020-05-21  9:43 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 436 bytes --]

On Thu, 2020-05-21 at 11:17 +0200, Toralf Förster wrote:
> On 5/21/20 10:47 AM, Michał Górny wrote:
> > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > i.e. mass fake submissions.
> > 
> 
> I'd combine IP-limits with proof-of-work.
> CAPTCHA should be the very last option IMO.
> 

To be honest, I don't see the point for proof-of-work if we have IP
limits.

-- 
Best regards,
Michał Górny



* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  8:47 [gentoo-dev] [RFC] Anti-spam for goose Michał Górny
  2020-05-21  9:17 ` Toralf Förster
@ 2020-05-21  9:48 ` Tomas Mozes
  2020-05-21 10:10   ` Michał Górny
  2020-05-21 10:45   ` Jaco Kroon
  2020-05-21 11:03 ` Fabian Groffen
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 44+ messages in thread
From: Tomas Mozes @ 2020-05-21  9:48 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 5655 bytes --]

On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgorny@gentoo.org> wrote:

> Hi,
>
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.



Sadly, the problem with IP addresses is (in this case) that they are
anonymous. One can easily start an attack with thousands of IPs (all around
the world).

One solution would be to introduce user accounts:
- one needs to register with an email
- you can rate limit based on the client (not the IP)

For example, say I have 200 servers: I'd create one account, verify my email
(maybe captcha too) and deploy a config with my token on all servers. Then
I'd set up a cron job on every server to submit stats. A token can have some
lifetime and you could create a new one when the old one is about to expire.

If you discover I'm doing false reports, you'd block all my submissions. I
can still do fake submissions, but you'd need a per-host verification to
avoid that.
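
A minimal sketch of such an expiring per-account token, assuming an
HMAC-signed "account:expiry:signature" string.  The format, SERVER_KEY
and both function names are hypothetical, chosen only for illustration:

```python
import hashlib
import hmac
import time
from typing import Optional, Set

SERVER_KEY = b"example-secret"  # hypothetical; a real key would be random and private


def _sign(payload: str) -> str:
    return hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(account: str, lifetime: int, now: Optional[float] = None) -> str:
    """Issue a token for an already verified account, valid for `lifetime` seconds."""
    expiry = int((time.time() if now is None else now) + lifetime)
    payload = f"{account}:{expiry}"
    return f"{payload}:{_sign(payload)}"


def check_token(token: str, blocked: Set[str], now: Optional[float] = None) -> Optional[str]:
    """Return the account if the token is authentic, unexpired and not blocked."""
    try:
        account, expiry, sig = token.rsplit(":", 2)
    except ValueError:
        return None
    expected = _sign(f"{account}:{expiry}")
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    if (time.time() if now is None else now) >= int(expiry):
        return None
    if account in blocked:
        return None
    return account
```

Blocking an account then rejects every host that shares its token, which
is the behaviour described above; the server only needs to keep the
block list, not the issued tokens.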

Tomas


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  9:48 ` Tomas Mozes
@ 2020-05-21 10:10   ` Michał Górny
  2020-05-21 10:37     ` Tomas Mozes
  2020-05-21 10:45   ` Jaco Kroon
  1 sibling, 1 reply; 44+ messages in thread
From: Michał Górny @ 2020-05-21 10:10 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 6788 bytes --]

On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
> On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgorny@gentoo.org> wrote:
> 
> > Hi,
> > 
> > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > i.e. mass fake submissions.
> > 
> 
> 
> Sadly, the problem with IP addresses is (in this case), that there are
> anonymous. One can easily start an attack with thousands of IPs (all around
> the world).
> 
> One solution would be to introduce user accounts:
> - one needs to register with an email

Problem 1: you can trivially mass-create email addresses.

> - you can rate limit based on the client (not the IP)
> 
> For example I've 200 servers, I'd create one account, verify my email
> (maybe captcha too) and deploy a config with my token on all servers. Then
> I'd setup a cron job on every server to submit stats. A token can have some
> lifetime and you could create a new one when the old is about to expire.
> 
> If you discover I'm doing false reports, you'd block all my submissions. I
> can still do fake submissions, but you'd need a per-host verification to
> avoid that.
> 

Problem 2: we can't really discover this because the goal is to protect
users' privacy.  The best we can do is to discover that someone is
submitting a lot from a single account (but are they legitimate?).
But then, we can just block them.

But in the end, this has the same problem as CAPTCHA -- or maybe it's
even worse.  It requires additional effort from the users, effectively
making it less likely for them to participate.  Furthermore, it requires
them to submit e-mail addresses which they may consider PII.  Even if we
don't store them permanently but just use them for initial verification,
they still could choose not to participate.

-- 
Best regards,
Michał Górny



* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 10:10   ` Michał Górny
@ 2020-05-21 10:37     ` Tomas Mozes
  0 siblings, 0 replies; 44+ messages in thread
From: Tomas Mozes @ 2020-05-21 10:37 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 7830 bytes --]

On Thu, May 21, 2020 at 12:10 PM Michał Górny <mgorny@gentoo.org> wrote:

> On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
> > On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgorny@gentoo.org> wrote:
> >
> > > Hi,
> > >
> > > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > > i.e. mass fake submissions.
> > >
> >
> >
> > Sadly, the problem with IP addresses is (in this case), that there are
> > anonymous. One can easily start an attack with thousands of IPs (all
> > around the world).
> >
> > One solution would be to introduce user accounts:
> > - one needs to register with an email
>
> Problem 1: you can trivially mass-create email addresses.
>
>
IP verification:
- get enough IPs (botnet) and send your payload

User verification:
- get an email, verify account, solve a captcha

I know if someone wants to, he'll try to bypass the user verification, but
it's just more work to do. We can also enforce IP restrictions and use a
combination of both.

> > - you can rate limit based on the client (not the IP)
> >
> > For example I've 200 servers, I'd create one account, verify my email
> > (maybe captcha too) and deploy a config with my token on all servers.
> > Then I'd setup a cron job on every server to submit stats. A token can
> > have some lifetime and you could create a new one when the old is about
> > to expire.
> >
> > If you discover I'm doing false reports, you'd block all my submissions.
> > I can still do fake submissions, but you'd need a per-host verification
> > to avoid that.
> >
>
> Problem 2: we can't really discover this because the goal is to protect
> users' privacy.  The best we can do is to discover that someone is
> submitting a lot from a single account (but are them legitimate?).
> But then, we can just block them.
>
> But in the end, this has the same problem as CAPTCHA -- or maybe it's
> even worse.  It requires additional effort from the users, effectively
> making it less likely for them to participate.  Furthermore, it requires
> them to submit e-mail addresses which they may consider PII.  Even if we
> don't store them permanently but just use for initial verification, they
> still could choose not to participate.
>

I think if someone wants to participate and believes in the cause, he will.
Many of the users are on Bugzilla anyway, so the email is already on the
Gentoo side. Contributors have their emails in each Gentoo commit.

If spamming is a serious problem, you can turn it into an invite-only system,
creating a chain of trust.


> --
> Best regards,
> Michał Górny
>
>


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  9:48 ` Tomas Mozes
  2020-05-21 10:10   ` Michał Górny
@ 2020-05-21 10:45   ` Jaco Kroon
  2020-05-21 11:02     ` Michał Górny
  1 sibling, 1 reply; 44+ messages in thread
From: Jaco Kroon @ 2020-05-21 10:45 UTC (permalink / raw
  To: gentoo-dev, Tomas Mozes

[-- Attachment #1: Type: text/plain, Size: 8337 bytes --]

Hi,

On 2020/05/21 11:48, Tomas Mozes wrote:
>
>
> On Thu, May 21, 2020 at 10:47 AM Michał Górny <mgorny@gentoo.org
> <mailto:mgorny@gentoo.org>> wrote:
>
>     Hi,
>
>     TL;DR: I'm looking for opinions on how to protect goose from spam,
>     i.e. mass fake submissions.
>     Option 1: IP-based limiting
>     ===========================
>     The original idea was to set a hard limit of submissions per week
>     based
>     on IP address of the submitter.  This has (at least as far as IPv4 is
>     concerned) the advantages that:
>
>     - submitted has limited control of his IP address (i.e. he can't just
>     submit stuff using arbitrary data)
>
>     - IP address range is naturally limited
>
>     - IP addresses have non-zero cost
>
>     This method could strongly reduce the number of fake submissions one
>     attacker could devise.  However, it has a few problems too:
>
>     - a low limit would harm legitimate submitters sharing IP address
>     (i.e. behind NAT)
>
>     - it actively favors people with access to large number of IP
>     addresses
>
>     - it doesn't map cleanly to IPv6 (where some people may have just
>     one IP
>     address, and others may have whole /64 or /48 ranges)
>
So this gets tricky.  A single host could as you say either have a /128
or possibly a whole /64.  ISPs are "encouraged" to use a single /64 per
connecting user on the access layer (can be link-local technically, but
it seems to be frowned upon).  Generally then you're encouraged to
delegate a /56 to the router, but at the very least a /60.  Some
recommendations even state to delegate a /48 at this point.  That's
outright crazy seeing that a /48 essentially boils down to 65536
individual LANs behind the router, /56 is 256 LANs which frankly I
reckon is adequate.  The only advantage of /48 is cleaner boundary
mapping onto : separators.  This is OPINION.  I also use "encouraged"
since these are

Short version:  If you're willing to rate limit on larger blocks it
could work.  /64s are probably OK, but most hosts will typically have a
/128, so you'll be limiting LANs, and switching IPs is trivial as you'd
have access to at least a /64 (or ~18.45 * 10^18).

You could have multiple layers ... ie:

each /128 gets 1 or 2 submissions per day
each /64 gets 200/day
each /56 gets 400/day
each /48 gets 600/day

But now you need to keep bucket loads of data ... so DoS on the rate
limiting mechanism itself becomes possible unless you're happy to limit
the size of the tables and somehow discard entries that are at low risk
of exceeding their limit.
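
A minimal sketch of the layered limits, using Python's ipaddress module
to map one address onto its enclosing prefixes.  The class name
PrefixLimiter is hypothetical, the /128 limit is taken as 2 from the
"1 or 2" above, and the daily reset and table pruning are left out:

```python
import ipaddress
from collections import Counter

# Daily limits per prefix length, taken from the list above.
LIMITS = {128: 2, 64: 200, 56: 400, 48: 600}


class PrefixLimiter:
    def __init__(self, limits=LIMITS):
        self.limits = limits
        self.counts = Counter()  # network -> submissions accepted today

    def _buckets(self, address):
        addr = ipaddress.IPv6Address(address)
        # strict=False masks the host bits, giving the enclosing network.
        return [
            ipaddress.IPv6Network((addr, plen), strict=False)
            for plen in self.limits
        ]

    def allow(self, address):
        buckets = self._buckets(address)
        # Reject if any enclosing prefix has used up its allowance.
        if any(self.counts[net] >= self.limits[net.prefixlen] for net in buckets):
            return False
        for net in buckets:
            self.counts[net] += 1
        return True
```

Every accepted submission adds up to four table entries, which is
exactly the storage growth the paragraph above warns about.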

Even for v4, as an attacker ... well, as I'm sitting here right now I've
got direct access to almost a /20 (4096 addresses).  I know a number of
people with larger scopes than that.  Use bot-nets and the scope goes up
even more.

>
>
>     Option 2: proof-of-work
>     =======================
>     An alternative of using a proof-of-work algorithm was suggested to me
>     yesterday.  The idea is that every submission has to be
>     accompanied with
>     the result of some cumbersome calculation that can't be trivially run
>     in parallel or optimized out to dedicated hardware.
>
>     On the plus side, it would rely more on actual physical hardware
>     than IP
>     addresses provided by ISPs.  While it would be a waste of CPU time
>     and memory, doing it just once a week wouldn't be that much harm.
>
>     On the minus side, it would penalize people with weak hardware.
>
>     For example, 'time hashcash -m -b 28 -r test' gives:
>
>     - 34 s (-s estimated 38 s) on Ryzen 5 3600
>
>     - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
>
>     At the same time, it would still permit a lot of fake
>     submissions.  For
>     example, randomx [1] claims to require 2G of memory in fast mode. 
>     This
>     would still allow me to use 7 threads.  If we adjusted the
>     algorithm to
>     take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
>     submissions a day.
>
>     So in the end, while this is interesting, it doesn't seem like
>     a workable anti-spam measure.
>
Indeed.  This was considered for email SPAM protection as well about two
decades back.  Amongst other proposals.

Perhaps some crazy proof-of-work for registration of a token, but given
how cheap it is to lease CPU cycles you'd need to balance the effects. 
And given bot nets ... using other people's hardware for proof-of-work
doesn't seem inconceivable (bitcoin miners embedded on web pages being
an example of the stuff that people pull).

>
>
>     Option 3: explicit CAPTCHA
>     ==========================
>     A traditional way of dealing with spam -- require every new system
>     identifier to be confirmed by solving a CAPTCHA (or a few identifiers
>     for one CAPTCHA).
>
>     The advantage of this method is that it requires a real human work
>     to be
>     performed, effectively limiting the ability to submit spam.
>
Yea.  One would think.  CAPTCHAs are massively intrusive and in my
opinion more effort than they're worth.

This may be beneficial to *generate* a token.  In other words - when
generating a token, that token needs to be registered by way of CAPTCHA.

>     The disadvantage is that it is cumbersome to users, so many of
>     them will
>     just resign from participating.
>
Agreed.

>
>
>     Other ideas
>     ===========
>     Do you have any other ideas on how we could resolve this?
>
Generated token + hardware based hash.  Rate limit the combination to 1/day.

Don't include results until they have been kept up to date for a minimum
period.  Say updated at least 20 times in 30 days.

I note currently you can submit once in 7 days; I'd change this approach
to something like:

* Update the results as often as you wish but at most every 23 hours
(basically aim at submitting daily).
* Expire all results that haven't been updated in X number of days (I'd
use a 7 here out of hand).
* Expire the token after 30 days of not being kept up to date and
require going through the initial work again.

The downside here is that many machines are not powered up at least once
a day to be able to perform that initial submission sequence.  So
perhaps it's a bit stringent.

So a single token can submit for multiple hosts (cloned machines).

Both the token and hardware hash can of course be tainted and are under
"attacker control".

>
> Sadly, the problem with IP addresses is (in this case), that there are
> anonymous. One can easily start an attack with thousands of IPs (all
> around the world).
>
> One solution would be to introduce user accounts:
> - one needs to register with an email
> - you can rate limit based on the client (not the IP)
>
> For example I've 200 servers, I'd create one account, verify my email
> (maybe captcha too) and deploy a config with my token on all servers.
> Then I'd setup a cron job on every server to submit stats. A token can
> have some lifetime and you could create a new one when the old is
> about to expire.

I, sadly so, agree with this.  I'm quite happy to register an account,
and on each machine during gander --init enter a username + password to
link my email-based token to the host.

If a machine gets cloned, the hardware hash takes care of conflicts.

Else, during gander --init some proof-of-work may be OK to generate an
anonymous token.

Rate limit *token generation* against IP address for anonymous tokens.

For anonymous tokens, only a single hardware hash is allowed, if the
hardware hash changes, re-require proof-of-work, discard data.

Final summary:

# gander --init --account jaco@uls.co.za [--mayclone]
Password: ?????
... generate HW hash if not --mayclone
... submit credentials (via https please).
... get token.
#

Or:

# gander --init --anonymous
... contact server, get work
... do work
... generate HW hash
... submit HW hash + proof of work
... get token

Now each (token+hash) can submit at most once in 23 hours; data is
discarded after 7 days if not kept up to date.

Anonymous tokens are linked to a HW hash.  User accounts get to issue
tokens as needed, and each one has a flag that sets whether or not the
HW hash may change.  This is more for the benefit of those of us who
do make use of clones, and the explicit requirement is there to
prevent accidental error.
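
The server side of that could look roughly like the sketch below.  It
is only an illustration under assumptions: the dict last_seen stands in
for whatever storage goose uses, the function names are invented, and
unknown (token, hash) pairs are accepted without checking that the
token was ever registered.

```python
import time

MIN_INTERVAL = 23 * 3600      # at most one submission per 23 hours
DATA_TTL = 7 * 24 * 3600      # results expire after 7 days without update
TOKEN_TTL = 30 * 24 * 3600    # token expires after 30 days of silence

last_seen = {}  # hypothetical store: (token, hw_hash) -> last accepted time


def submit(token, hw_hash, now=None):
    """Return "accepted", "too-soon" or "token-expired" for one submission."""
    now = time.time() if now is None else now
    key = (token, hw_hash)
    prev = last_seen.get(key)
    if prev is not None:
        if now - prev > TOKEN_TTL:
            del last_seen[key]
            return "token-expired"  # client has to redo gander --init
        if now - prev < MIN_INTERVAL:
            return "too-soon"
    last_seen[key] = now
    return "accepted"


def live_results(now=None):
    """Keys whose data is still fresh enough to be counted."""
    now = time.time() if now is None else now
    return [key for key, seen in last_seen.items() if now - seen <= DATA_TTL]
```

Note that this keeps a timestamp per (token, hash) pair, so it does
imply some correlation of submissions over time.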

Kind Regards,
Jaco

[-- Attachment #2: Type: text/html, Size: 13175 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 10:45   ` Jaco Kroon
@ 2020-05-21 11:02     ` Michał Górny
  2020-05-21 14:27       ` Jaco Kroon
  0 siblings, 1 reply; 44+ messages in thread
From: Michał Górny @ 2020-05-21 11:02 UTC (permalink / raw
  To: gentoo-dev, Tomas Mozes

[-- Attachment #1: Type: text/plain, Size: 2530 bytes --]

On Thu, 2020-05-21 at 12:45 +0200, Jaco Kroon wrote:
> Even for v4, as an attacker ... well, as I'm sitting here right now I've
> got direct access to almost a /20 (4096 addresses).  I know a number of
> people with larger scopes than that.  Use bot-nets and the scope goes up
> even more.

See how unfair the world is!  You are filling your bathtub with IP
addresses, and my ISP has taken mine only recently.

> >     Option 3: explicit CAPTCHA
> >     ==========================
> >     A traditional way of dealing with spam -- require every new system
> >     identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> >     for one CAPTCHA).
> > 
> >     The advantage of this method is that it requires a real human work
> >     to be
> >     performed, effectively limiting the ability to submit spam.
> > 
> Yea.  One would think.  CAPTCHAs are massively intrusive and in my
> opinion more effort than they're worth.
> 
> This may be beneficial to *generate* a token.  In other words - when
> generating a token, that token needs to be registered by way of capthca.
> 
> > 
> >     Other ideas
> >     ===========
> >     Do you have any other ideas on how we could resolve this?
> > 
> Generated token + hardware based hash.

How are you going to verify that the hardware-based hash is real,
and not just a random value created to circumvent the protection?

>   Rate limit the combination to 1/day.
> 
> Don't use included results until it's been kept up to date for a minimum
> period.  Say updated at least 20 times 30 days.

For privacy reasons, we don't correlate the results.  So this is
impossible to implement.

> The downside here is that many machines are not powered up at least once
> a day to be able to perform that initial submission sequence.  So
> perhaps it's a bit stringent.

Exactly.  Even once a week is a bit risky but once a day is too narrow
a period.

To some degree, we could decide we don't care about exact numbers
as much as some degree of weighted proportions.  This would mean that,
say, people who submit daily get a count of 7, at the loss of people
who don't run their machines that much.  It would effectively put more
emphasis on more active users.  It's debatable whether this is desirable
or not.

> 
> Both the token and hardware hash can of course be tainted and is under
> "attacker control".

Exactly.  So it really looks like an exercise for the sake of exercise.


-- 
Best regards,
Michał Górny


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  8:47 [gentoo-dev] [RFC] Anti-spam for goose Michał Górny
  2020-05-21  9:17 ` Toralf Förster
  2020-05-21  9:48 ` Tomas Mozes
@ 2020-05-21 11:03 ` Fabian Groffen
  2020-05-21 11:33 ` Robert Bridge
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 44+ messages in thread
From: Fabian Groffen @ 2020-05-21 11:03 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 6764 bytes --]

Hi,

On 21-05-2020 10:47:07 +0200, Michał Górny wrote:
> Hi,
> 
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
> 
> 
> Problem
> =======
> Goose currently lacks proper limiting of submitted data.  The only
> limiter currently in place is based on unique submitter id that is
> randomly generated at setup time and in full control of the submitter. 
> This only protects against accidental duplicates but it can't protect
> against deliberate action.
> 
> An attacker could easily submit thousands (millions?) of fake entries by
> issuing a lot of requests with different ids.  Creating them is
> as trivial as using successive numbers.  The potential damage includes:

Perhaps you could consider something like a reputation system.  I'm
thinking of things like only publishing results after X hours when an id
is new (graylisting?), and gradually building up "trust" there.  In the
X hours you could then determine whether something is potentially fraud
if you see a new user id, loads of submissions from the same IP, etc.;
what you describe below, I think.

The reputation logic could further build on if it appears to follow a
norm, e.g. compilation times which fall in the average given the
cpu/arch configuration.
Another way would be to see submissions for packages that are actually
bumped/stabilised in the tree, to score an id as more likely to be
genuine.

I think it will be a tad complicated, but static limiting might be
circumvented as easily as you put it in place, as has been pointed out
already.

Perhaps it is fruitful to think of the reverse: when is something
obviously bad?  When a single (obscure?) package is suddenly reported
many times by new ids?  When a single id generates hundreds or thousands
of package submissions (is it a cluster being misconfigured, many
identical packages, or what seems to be an a to z scan)?
Thing is, would a single "fake" submission (that IMO is unlikely ever
to be noticed) screw up the overall state of things?  I think the
fuzziness of the system as a whole should cover for these.  It is pure
poisoning that should be able to be mitigated, and I agree with you
that preferably most of it is blocked by default.  Fact probably is
that it will happen nevertheless.

That brings me to the thought: are there things that can be done to make
sure a fraudulent action can be easily undone or negated somehow?  E.g.
should a log be kept, or some action to rollback and replay.  Sorry to
have no concrete examples here.
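
To make the graylisting idea a bit more tangible, here is a rough
sketch.  Everything in it is an assumption for illustration: the
48-hour window stands in for "X hours", the burst threshold is
arbitrary, and "source" is whatever key the server can attach to a
request (it is not something goose records today as far as I know).

```python
GRAYLIST_HOURS = 48          # hypothetical value for "X hours"
MAX_NEW_IDS_PER_SOURCE = 5   # hypothetical burst threshold

first_seen = {}  # submitter id -> (hour first seen, source key)


def record(submitter_id, source, now_h):
    """Remember when and from where an id was first seen."""
    first_seen.setdefault(submitter_id, (now_h, source))


def publishable(submitter_id, now_h):
    """True once an id has aged out of the graylist and was not part of a burst."""
    seen_h, source = first_seen[submitter_id]
    if now_h - seen_h < GRAYLIST_HOURS:
        return False
    burst = sum(
        1
        for other_h, other_src in first_seen.values()
        if other_src == source and abs(other_h - seen_h) < GRAYLIST_HOURS
    )
    return burst <= MAX_NEW_IDS_PER_SOURCE
```

Since nothing is published during the window, dropping a suspicious
batch is just a matter of never releasing it, which also covers part
of the "undo" question.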

Fabian

> 
> - distorting the metrics to the point of it being useless (even though
> some people consider it useless by design).
> 
> - submitting lots of arbitrary data to cause DoS via growing
> the database until no disk space is left.
> 
> - blocking large range of valid user ids, causing collisions with
> legitimate users more likely.
> 
> I don't think it worthwhile to discuss the motivation for doing so:
> whether it would be someone wishing harm to Gentoo, disagreeing with
> the project or merely wanting to try and see if it would work.  The case
> of SKS keyservers teaches us a lesson that you can't leave holes like
> this open a long time because someone eventually will abuse them.
> 
> 
> Option 1: IP-based limiting
> ===========================
> The original idea was to set a hard limit of submissions per week based
> on IP address of the submitter.  This has (at least as far as IPv4 is
> concerned) the advantages that:
> 
> - submitted has limited control of his IP address (i.e. he can't just
> submit stuff using arbitrary data)
> 
> - IP address range is naturally limited
> 
> - IP addresses have non-zero cost
> 
> This method could strongly reduce the number of fake submissions one
> attacker could devise.  However, it has a few problems too:
> 
> - a low limit would harm legitimate submitters sharing IP address
> (i.e. behind NAT)
> 
> - it actively favors people with access to large number of IP addresses
> 
> - it doesn't map cleanly to IPv6 (where some people may have just one IP
> address, and others may have whole /64 or /48 ranges)
> 
> - it may cause problems for anonymizing network users (and we want to
> encourage Tor usage for privacy)
> 
> All this considered, IP address limiting can't be used the primary
> method of preventing fake submissions.  However, I suppose it could work
> as an additional DoS prevention, limiting the number of submissions from
> a single address over short periods of time.
> 
> Example: if we limit to 10 requests an hour, then a single IP can be
> used ot manufacture at most 240 submissions a day.  This might be
> sufficient to render them unusable but should keep the database
> reasonably safe.
> 
> 
> Option 2: proof-of-work
> =======================
> An alternative of using a proof-of-work algorithm was suggested to me
> yesterday.  The idea is that every submission has to be accompanied with
> the result of some cumbersome calculation that can't be trivially run
> in parallel or optimized out to dedicated hardware.
> 
> On the plus side, it would rely more on actual physical hardware than IP
> addresses provided by ISPs.  While it would be a waste of CPU time
> and memory, doing it just once a week wouldn't be that much harm.
> 
> On the minus side, it would penalize people with weak hardware.
> 
> For example, 'time hashcash -m -b 28 -r test' gives:
> 
> - 34 s (-s estimated 38 s) on Ryzen 5 3600
> 
> - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
> 
> At the same time, it would still permit a lot of fake submissions.  For
> example, randomx [1] claims to require 2G of memory in fast mode.  This
> would still allow me to use 7 threads.  If we adjusted the algorithm to
> take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
> submissions a day.
> 
> So in the end, while this is interesting, it doesn't seem like
> a workable anti-spam measure.
> 
> 
> Option 3: explicit CAPTCHA
> ==========================
> A traditional way of dealing with spam -- require every new system
> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> for one CAPTCHA).
> 
> The advantage of this method is that it requires a real human work to be
> performed, effectively limiting the ability to submit spam.
> The disadvantage is that it is cumbersome to users, so many of them will
> just resign from participating.
> 
> 
> Other ideas
> ===========
> Do you have any other ideas on how we could resolve this?
> 
> 
> [1] https://github.com/tevador/RandomX
> 
> 
> -- 
> Best regards,
> Michał Górny
> 



-- 
Fabian Groffen
Gentoo on a different level

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  8:47 [gentoo-dev] [RFC] Anti-spam for goose Michał Górny
                   ` (2 preceding siblings ...)
  2020-05-21 11:03 ` Fabian Groffen
@ 2020-05-21 11:33 ` Robert Bridge
  2020-05-21 11:56   ` Michał Górny
  2020-05-21 11:57   ` Ulrich Mueller
  2020-05-21 13:22 ` Gordon Pettey
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 44+ messages in thread
From: Robert Bridge @ 2020-05-21 11:33 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 759 bytes --]

On Thu, 21 May 2020 at 09:47, Michał Górny <mgorny@gentoo.org> wrote:

>
> Option 1: IP-based limiting
> ===========================
>

Preface this with IANAL, check with your own legal counsel...

While IP address based methods might be attractive technically, do
remember that an IP address is considered Personally Identifiable in
European Data Protection law.

The fact that submissions require an action by the user will probably
be sufficient as explicit consent, but any system storing these details
should allow the user to revoke their consent: if you collect anything
personally identifiable, you will need to provide a mechanism for users
to request the removal of all their submissions.

Tread carefully with this project. :)

[-- Attachment #2: Type: text/html, Size: 1106 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 11:33 ` Robert Bridge
@ 2020-05-21 11:56   ` Michał Górny
  2020-05-21 11:57   ` Ulrich Mueller
  1 sibling, 0 replies; 44+ messages in thread
From: Michał Górny @ 2020-05-21 11:56 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1032 bytes --]

On Thu, 2020-05-21 at 12:33 +0100, Robert Bridge wrote:
> On Thu, 21 May 2020 at 09:47, Michał Górny <mgorny@gentoo.org> wrote:
> 
> > Option 1: IP-based limiting
> > ===========================
> > 
> 
> Preface this with IANAL, check with your own legal counsel...
> 
> While IP address based methods might be attractive  technically, do
> remember that an IP address is considered Personally Identifiable in
> European Data Protection law.
> 
> The fact submissions require an action by the user will probably be
> sufficient to be explicit consent, any system storing these details should
> allow for the use to revoke their consent: If you collect anything
> personally identifiable, you will need to provide a mechanism for users to
> request the removal of all their submissions.
> 
> Tread carefully with this project. :)

All the data collected is set to expire in 7 days.  The 'privacy-first'
statement in the project description is there for a reason ;-).

-- 
Best regards,
Michał Górny


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 11:33 ` Robert Bridge
  2020-05-21 11:56   ` Michał Górny
@ 2020-05-21 11:57   ` Ulrich Mueller
  2020-05-21 12:08     ` Michał Górny
  1 sibling, 1 reply; 44+ messages in thread
From: Ulrich Mueller @ 2020-05-21 11:57 UTC (permalink / raw
  To: Robert Bridge; +Cc: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 967 bytes --]

>>>>> On Thu, 21 May 2020, Robert Bridge wrote:

> On Thu, 21 May 2020 at 09:47, Michał Górny <mgorny@gentoo.org> wrote:
>> 
>> Option 1: IP-based limiting
>> ===========================
>> 

> Preface this with IANAL, check with your own legal counsel...

> While IP address based methods might be attractive  technically, do
> remember that an IP address is considered Personally Identifiable in
> European Data Protection law.

> The fact submissions require an action by the user will probably be
> sufficient to be explicit consent, any system storing these details should
> allow for the use to revoke their consent: If you collect anything
> personally identifiable, you will need to provide a mechanism for users to
> request the removal of all their submissions.

> Tread carefully with this project. :)

You don't have to store any IP addresses, you can store a cryptographic
hash like their b2sum (salted if necessary).

Ulrich

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 507 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 11:57   ` Ulrich Mueller
@ 2020-05-21 12:08     ` Michał Górny
  2020-05-21 12:15       ` Robert Bridge
  0 siblings, 1 reply; 44+ messages in thread
From: Michał Górny @ 2020-05-21 12:08 UTC (permalink / raw
  To: gentoo-dev, Robert Bridge

[-- Attachment #1: Type: text/plain, Size: 1152 bytes --]

On Thu, 2020-05-21 at 13:57 +0200, Ulrich Mueller wrote:
> > > > > > On Thu, 21 May 2020, Robert Bridge wrote:
> > On Thu, 21 May 2020 at 09:47, Michał Górny <mgorny@gentoo.org> wrote:
> > > Option 1: IP-based limiting
> > > ===========================
> > > 
> > Preface this with IANAL, check with your own legal counsel...
> > While IP address based methods might be attractive  technically, do
> > remember that an IP address is considered Personally Identifiable in
> > European Data Protection law.
> > The fact submissions require an action by the user will probably be
> > sufficient to be explicit consent, any system storing these details should
> > allow for the use to revoke their consent: If you collect anything
> > personally identifiable, you will need to provide a mechanism for users to
> > request the removal of all their submissions.
> > Tread carefully with this project. :)
> 
> You don't have to store any IP addresses, you can store a cryptographic
> hash like their b2sum (salted if necessary).
> 

Yes, this is as great as storing hashes of phone numbers ;-).

-- 
Best regards,
Michał Górny


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 12:08     ` Michał Górny
@ 2020-05-21 12:15       ` Robert Bridge
  2020-05-21 12:25         ` Ulrich Mueller
  0 siblings, 1 reply; 44+ messages in thread
From: Robert Bridge @ 2020-05-21 12:15 UTC (permalink / raw
  To: Michał Górny; +Cc: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1369 bytes --]

There are only 4 billion to reverse, not that hard really with a rainbow
table...

[-- Attachment #2: Type: text/html, Size: 1886 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 12:15       ` Robert Bridge
@ 2020-05-21 12:25         ` Ulrich Mueller
  2020-05-21 13:09           ` Kent Fredric
  0 siblings, 1 reply; 44+ messages in thread
From: Ulrich Mueller @ 2020-05-21 12:25 UTC (permalink / raw
  To: Robert Bridge; +Cc: Michał Górny, gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 167 bytes --]

>>>>> On Thu, 21 May 2020, Robert Bridge wrote:

> There are only 4 billion to reverse, not that hard really with a
> rainbow table...

That's why I said salted hash.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 507 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 12:25         ` Ulrich Mueller
@ 2020-05-21 13:09           ` Kent Fredric
  2020-05-21 13:16             ` Michał Górny
  2020-05-21 13:53             ` Michał Górny
  0 siblings, 2 replies; 44+ messages in thread
From: Kent Fredric @ 2020-05-21 13:09 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 367 bytes --]

On Thu, 21 May 2020 14:25:00 +0200
Ulrich Mueller <ulm@gentoo.org> wrote:

> That's why I said salted hash.

Even a salted hash becomes a trivial joke when the input data you're
hashing has a *total* entropy of only 32 bits.

At the very least you need a unique salt per hash, or you only have to
expose the salt to create a rainbow table for the whole dataset.
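
A small sketch of why the salt doesn't save you, using BLAKE2b (the
hash behind b2sum) from Python's hashlib.  The function names and the
documentation-range address are made up for the example, and the
search is restricted to a /24 so it finishes instantly; the full IPv4
space is just 2^32 (about 4.3 billion) candidates handled the same way.

```python
import hashlib
import ipaddress
import os


def hash_ip(addr, salt):
    """Salted BLAKE2b digest of the packed IPv4 address."""
    packed = ipaddress.IPv4Address(addr).packed
    return hashlib.blake2b(packed, salt=salt, digest_size=32).hexdigest()


def recover(digest, salt, network):
    """Brute-force the address by hashing every candidate in network."""
    for ip in ipaddress.IPv4Network(network):
        if hash_ip(str(ip), salt) == digest:
            return str(ip)
    return None


salt = os.urandom(16)                    # BLAKE2b salts are at most 16 bytes
stored = hash_ip("198.51.100.37", salt)  # what the database would hold
print(recover(stored, salt, "198.51.100.0/24"))
```

With one salt per record the attacker has to repeat the search for
each record instead of building one table, but each search is still
bounded by 2^32 hashes.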

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 13:09           ` Kent Fredric
@ 2020-05-21 13:16             ` Michał Górny
  2020-05-21 13:41               ` Kent Fredric
  2020-05-21 13:53             ` Michał Górny
  1 sibling, 1 reply; 44+ messages in thread
From: Michał Górny @ 2020-05-21 13:16 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 591 bytes --]

On Fri, 2020-05-22 at 01:09 +1200, Kent Fredric wrote:
> On Thu, 21 May 2020 14:25:00 +0200
> Ulrich Mueller <ulm@gentoo.org> wrote:
> 
> > That's why I said salted hash.
> 
> Even a salted hash becomes a trivial joke when the input data you're
> hashing has a *total* entropy of only 32bits.
> 
> You at very least need a unique salt per hash, or you only have to
> expose the salt to create a rainbow table for the whole dataset.

Isn't the whole point of salted hash to use unique salts?

Nevertheless, it's still a near-trivial task.

-- 
Best regards,
Michał Górny


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  8:47 [gentoo-dev] [RFC] Anti-spam for goose Michał Górny
                   ` (3 preceding siblings ...)
  2020-05-21 11:33 ` Robert Bridge
@ 2020-05-21 13:22 ` Gordon Pettey
  2020-05-21 13:38 ` Kent Fredric
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 44+ messages in thread
From: Gordon Pettey @ 2020-05-21 13:22 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 5621 bytes --]

Require browser-based interaction to use the service. Do something funky
with AJAX so the page can't be properly used with curl or similar, so
that manual effort is required to obtain the UUID to submit as. Only
allow registered UUIDs, and only allow one submission per day per UUID.
Sure, somebody can go to Mechanical Turk and pay a few cents to generate
fake submission IDs, but at least you have that tiny deterrent of "I've got
to pay 3 cents per spam account :(".

Maybe also add some minor tracking to the database, if it isn't already
there, to count submissions over time per UUID, and make the default cron
script weekly. If you see some UUID that is submitting at the maximum
daily rate, you may lean towards suspecting spam.

[-- Attachment #2: Type: text/html, Size: 6370 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  8:47 [gentoo-dev] [RFC] Anti-spam for goose Michał Górny
                   ` (4 preceding siblings ...)
  2020-05-21 13:22 ` Gordon Pettey
@ 2020-05-21 13:38 ` Kent Fredric
  2020-05-21 13:49   ` Kent Fredric
  2020-05-22 19:20 ` Kent Fredric
  2020-05-22 22:13 ` Peter Stuge
  7 siblings, 1 reply; 44+ messages in thread
From: Kent Fredric @ 2020-05-21 13:38 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1360 bytes --]

On Thu, 21 May 2020 10:47:07 +0200
Michał Górny <mgorny@gentoo.org> wrote:

> An alternative of using a proof-of-work algorithm was suggested to me
> yesterday.  The idea is that every submission has to be accompanied with
> the result of some cumbersome calculation that can't be trivially run
> in parallel or optimized out to dedicated hardware.

If the proof-of-work mechanism were restricted to ID generation, then
the amortized cost would be acceptable.

So instead of the ID being generated locally, you'd send a request
asking for an ID, it would send you the challenge math, you'd send the
answer, and then you'd get your ID.

And the ID would be an encoded copy of the input vectors and
responses, a random chunk, and a chunk representing the signature of
IV/RESPONSE/RAND.

Or something like that.

But the gist is it would be impossible to use IDs not generated by the
server.

Then the spam factor to monitor wouldn't be submission rates, it would
be "New ID request" rates, as these should never need to be
generated in large volumes.

_And_ taking 5 minutes for ID generation wouldn't be a terrible thing.

( We could possibly collect anonymous stats on ID generation rates, and
average times to generate a response to a challenge, and use that to
determine what our challenge complexity should be )
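
A rough sketch of the issue-an-ID flow described above, in Python.  The
key, the difficulty, the field sizes and all function names are
assumptions made for illustration; nothing here is taken from goose
itself.  The server hands out a challenge, the client burns CPU on it,
and the returned ID carries an HMAC so that IDs not issued by the server
can be rejected.

```python
import hashlib
import hmac
import secrets
from typing import Optional

SERVER_KEY = secrets.token_bytes(32)  # hypothetical server-side secret
DIFFICULTY_BITS = 12                  # kept tiny so the sketch runs quickly


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def new_challenge() -> bytes:
    """Server side: hand out a random 16-byte challenge."""
    return secrets.token_bytes(16)


def solve(challenge: bytes, bits: int) -> int:
    """Client side: search for a nonce giving `bits` leading zero bits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce
        nonce += 1


def issue_id(challenge: bytes, nonce: int, bits: int) -> Optional[str]:
    """Server side: verify the answer, then return a signed ID.

    A real server would also have to check that it issued this challenge
    and that the challenge has not been used before.
    """
    answer = nonce.to_bytes(8, "big")
    if leading_zero_bits(hashlib.sha256(challenge + answer).digest()) < bits:
        return None
    payload = challenge + answer + secrets.token_bytes(8)  # IV/RESPONSE/RAND
    signature = hmac.new(SERVER_KEY, payload, hashlib.sha256).digest()
    return (payload + signature).hex()


def is_server_issued(system_id: str) -> bool:
    """Server side: accept only IDs that carry a valid signature."""
    try:
        raw = bytes.fromhex(system_id)
    except ValueError:
        return False
    if len(raw) != 16 + 8 + 8 + 32:
        return False
    payload, signature = raw[:-32], raw[-32:]
    expected = hmac.new(SERVER_KEY, payload, hashlib.sha256).digest()
    return hmac.compare_digest(signature, expected)
```

With a scheme like this, the per-submission check is a single HMAC
verification, and the expensive part happens only once per ID.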


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 13:16             ` Michał Górny
@ 2020-05-21 13:41               ` Kent Fredric
  0 siblings, 0 replies; 44+ messages in thread
From: Kent Fredric @ 2020-05-21 13:41 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 377 bytes --]

On Thu, 21 May 2020 15:16:12 +0200
Michał Górny <mgorny@gentoo.org> wrote:

> Isn't the whole point of salted hash to use unique salts?

You'd think so, but I've seen too many pieces of code where the salt
was a hardcoded string right there in the hash generation.

md5sum( "SeKrIt\0" + pass  )

So I've learned to never assume that salts were unique per entry.
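
To make the difference concrete, here is a small Python sketch that
contrasts the hardcoded-salt pattern with a salt generated per entry.
The function names are made up for this example, and SHA-256 is used
only to keep it short; it is not a recommendation for hashing
low-entropy data.

```python
import hashlib
import hmac
import os
from typing import Tuple


def hash_with_fixed_salt(value: bytes) -> bytes:
    """Anti-pattern: every entry shares the same hardcoded salt, so equal
    inputs give equal hashes and one precomputed table covers all entries."""
    return hashlib.sha256(b"SeKrIt\0" + value).digest()


def hash_with_unique_salt(value: bytes) -> Tuple[bytes, bytes]:
    """Generate a fresh random salt per entry and store it with the hash."""
    salt = os.urandom(16)
    return salt, hashlib.sha256(salt + value).digest()


def verify(value: bytes, salt: bytes, digest: bytes) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    candidate = hashlib.sha256(salt + value).digest()
    return hmac.compare_digest(candidate, digest)
```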



* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 13:38 ` Kent Fredric
@ 2020-05-21 13:49   ` Kent Fredric
  0 siblings, 0 replies; 44+ messages in thread
From: Kent Fredric @ 2020-05-21 13:49 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 936 bytes --]

On Fri, 22 May 2020 01:38:02 +1200
Kent Fredric <kentnl@gentoo.org> wrote:

> So instead of the ID being generated locally, you'd send a request
> asking for an ID, it would send you the challenge math, you'd send the
> answer, and then you'd get your ID.

Additionally, you could even allow the client to pass a number that
stipulates a desired level of trust, in exchange for a more expensive
computation.

If there was an ID generation option that allowed me to, once, request
a challenge that takes an hour to complete, in exchange for getting a
higher "trust" vector, I'd do that.

Then you could present reports and whittle the results down by minimum
trust level.

( And then after the fact, one can adjust the minimum trust level a
UID key needs in order to submit, so if UID keys below a certain trust
level become problematic, you can easily start rejecting them, and
demand they re-key with a higher trust level )
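
As a sketch of how whittling results down by minimum trust level could
work, assuming each report carries the difficulty (in bits) of the
challenge solved when its ID was issued.  The Report type and its field
names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Report:
    system_id: str
    trust_bits: int             # difficulty of the challenge behind this ID
    packages: Tuple[str, ...]   # category/package names reported


def package_counts(reports: Iterable[Report], minimum_bits: int) -> Counter:
    """Count how many sufficiently trusted systems report each package."""
    counts: Counter = Counter()
    for report in reports:
        if report.trust_bits >= minimum_bits:
            # set() so that a package listed twice in one report counts once
            counts.update(set(report.packages))
    return counts
```

Raising minimum_bits after the fact drops the cheap IDs from the
statistics without touching the stored data.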


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 13:09           ` Kent Fredric
  2020-05-21 13:16             ` Michał Górny
@ 2020-05-21 13:53             ` Michał Górny
  1 sibling, 0 replies; 44+ messages in thread
From: Michał Górny @ 2020-05-21 13:53 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 746 bytes --]

On Fri, 2020-05-22 at 01:09 +1200, Kent Fredric wrote:
> On Thu, 21 May 2020 14:25:00 +0200
> Ulrich Mueller <ulm@gentoo.org> wrote:
> 
> > That's why I said salted hash.
> 
> Even a salted hash becomes a trivial joke when the input data you're
> hashing has a *total* entropy of only 32bits.
> 

If anyone cares about the numbers, I've been able to crack my own IP
address (85.*) in 10 minutes using john with a trivial IP address
wordlist generator and a plain SHA-512 hash.  I suppose you could assume
that having salted hashes would mean up to 30 minutes per IP address but
that's still not much.  I suppose you could use Argon2 or some other
crazy hash but... where is this going, really?
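
For illustration only, this is roughly what such a search looks like in
Python rather than john.  The full IPv4 space is 2^32 (about 4.3
billion) candidates; the sketch walks a single network to stay fast and
assumes the salt is known to the attacker, as it would be if stored next
to the hash.  The function names and the salt-then-address layout are
assumptions.

```python
import hashlib
import ipaddress
from typing import Optional


def hash_ip(ip: str, salt: bytes = b"") -> str:
    """Hash an IP address the way a naive 'anonymising' scheme might."""
    return hashlib.sha512(salt + ip.encode("ascii")).hexdigest()


def crack(target: str, network: str, salt: bytes = b"") -> Optional[str]:
    """Try every address in `network` until one hashes to `target`."""
    for address in ipaddress.ip_network(network):
        candidate = str(address)
        if hash_ip(candidate, salt) == target:
            return candidate
    return None
```

A slower hash raises the cost per guess but does not add entropy to the
input, which is the point made above.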

-- 
Best regards,
Michał Górny


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 11:02     ` Michał Górny
@ 2020-05-21 14:27       ` Jaco Kroon
  2020-05-21 20:13         ` Viktar Patotski
  0 siblings, 1 reply; 44+ messages in thread
From: Jaco Kroon @ 2020-05-21 14:27 UTC (permalink / raw
  To: gentoo-dev, Michał Górny, Tomas Mozes

Hi Michał,

On 2020/05/21 13:02, Michał Górny wrote:
> On Thu, 2020-05-21 at 12:45 +0200, Jaco Kroon wrote:
>> Even for v4, as an attacker ... well, as I'm sitting here right now I've
>> got direct access to almost a /20 (4096 addresses).  I know a number of
>> people with larger scopes than that.  Use bot-nets and the scope goes up
>> even more.
> See how unfair the world is!  You are filling your bathtub with IP
> addresses, and my ISP has taken mine only recently.
I must admit, I work for an ISP :$
>>>     Option 3: explicit CAPTCHA
>>>     ==========================
>>>     A traditional way of dealing with spam -- require every new system
>>>     identifier to be confirmed by solving a CAPTCHA (or a few identifiers
>>>     for one CAPTCHA).
>>>
>>>     The advantage of this method is that it requires a real human work
>>>     to be
>>>     performed, effectively limiting the ability to submit spam.
>>>
>> Yea.  One would think.  CAPTCHAs are massively intrusive and in my
>> opinion more effort than they're worth.
>>
>> This may be beneficial to *generate* a token.  In other words - when
>> generating a token, that token needs to be registered by way of capthca.
>>
>>>     Other ideas
>>>     ===========
>>>     Do you have any other ideas on how we could resolve this?
>>>
>> Generated token + hardware based hash.
> How are you going to verify that the hardware-based hash is real,
> and not just a random value created to circumvent the protection?

So the generation of the hash is more to validate that it's still on the
same installation (i.e., not a cloned token).  Sorry if that wasn't clear;
I was trying to solve two possible problems in one go.

>
>>   Rate limit the combination to 1/day.
>>
>> Don't use included results until it's been kept up to date for a minimum
>> period.  Say updated at least 20 times 30 days.
> For privacy reasons, we don't correlate the results.  So this is
> impossible to implement.

Ok, but a token cannot (unless we issue it based on an email based
account) be linked back to a specific user, so does it matter if we
associate uploads with a token?

>> The downside here is that many machines are not powered up at least once
>> a day to be able to perform that initial submission sequence.  So
>> perhaps it's a bit stringent.
> Exactly.  Even once a week is a bit risky but once a day is too narrow
> a period.
>
> To some degree, we could decide we don't care about exact numbers
> as much as some degree of weighed proportions.  This would mean that,
> say, people who submit daily get the count of 7, at the loss of people
> who don't run their machines that much.  It would effectively put more
> emphasis on more active users.  It's debatable whether this is desirable
> or not.
Decaying averages.  Simple to implement, don't need all historic data.
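
A minimal sketch of such a decaying count in Python, assuming an
exponential decay with a configurable half-life.  The class name and the
seven-day default are illustrative choices.  Only the current value and
the time of the last update are stored, so no per-submission history is
needed.

```python
from typing import Optional


class DecayingCount:
    """Exponentially decaying counter; days must be non-decreasing."""

    def __init__(self, half_life_days: float = 7.0) -> None:
        self.half_life_days = half_life_days
        self.value = 0.0
        self.last_day: Optional[float] = None

    def read(self, day: float) -> float:
        """Return the value decayed to `day` without modifying the state."""
        if self.last_day is None:
            return 0.0
        elapsed = day - self.last_day
        return self.value * 0.5 ** (elapsed / self.half_life_days)

    def add(self, day: float, weight: float = 1.0) -> None:
        """Decay the stored value to `day`, then add a new submission."""
        self.value = self.read(day) + weight
        self.last_day = day
```

A machine that submits daily and one that submits weekly then converge
to different steady values, but neither needs its old submissions kept.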
>
> Both the token and hardware hash can of course be tainted and is under
>> "attacker control".
> Exactly.  So it really looks like exercise for the sake of exercise.

Unless tokens are *issued* as per the rest of my email you snipped
away, wherein I proposed issuing both anonymous and non-anonymous
tokens.

Kind Regards,
Jaco




* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  9:43   ` Michał Górny
@ 2020-05-21 20:07     ` Toralf Förster
  2020-05-22  4:39       ` Michał Górny
  0 siblings, 1 reply; 44+ messages in thread
From: Toralf Förster @ 2020-05-21 20:07 UTC (permalink / raw
  To: gentoo-dev


[-- Attachment #1.1: Type: text/plain, Size: 671 bytes --]

On 5/21/20 11:43 AM, Michał Górny wrote:
> On Thu, 2020-05-21 at 11:17 +0200, Toralf Förster wrote:
>> On 5/21/20 10:47 AM, Michał Górny wrote:
>>> TL;DR: I'm looking for opinions on how to protect goose from spam,
>>> i.e. mass fake submissions.
>>>
>>
>> I'd combine IP-limits with proof-of-work.
>> CAPTCHA should be the very last option IMO.
>>
> 
> To be honest, I don't see the point for proof-of-work if we have IP
> limits.
> 

The PoW has to be computed for every submission and should (somehow) include the IP address.
So you have two barriers. Neither of them is perfect, but their combination is expensive.
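
One way the IP address could be included, sketched in Python: the
material being hashed contains the submitter's address and the date, so
a stamp minted for one address or day does not carry over to another.
The stamp layout, the names and the 12-bit difficulty used below are
assumptions for the example, not an existing protocol.

```python
import hashlib


def _leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def check_stamp(ip: str, day: str, nonce: int, bits: int) -> bool:
    """Server side: one hash decides whether the stamp is good enough."""
    material = f"{ip}|{day}|{nonce}".encode("ascii")
    digest = hashlib.sha256(material).digest()
    return _leading_zero_bits(digest) >= bits


def mint_stamp(ip: str, day: str, bits: int) -> int:
    """Client side: brute-force a nonce for this address and day."""
    nonce = 0
    while not check_stamp(ip, day, nonce, bits):
        nonce += 1
    return nonce
```

The server would compare the address inside the stamp with the address
the request actually came from before applying its per-IP limit.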

-- 
Toralf
PGP 23217DA7 9B888F45



* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 14:27       ` Jaco Kroon
@ 2020-05-21 20:13         ` Viktar Patotski
  2020-05-22  0:38           ` Alec Warner
                             ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Viktar Patotski @ 2020-05-21 20:13 UTC (permalink / raw
  To: gentoo-dev; +Cc: Michał Górny, Tomas Mozes

[-- Attachment #1: Type: text/plain, Size: 4095 bytes --]

Hi all,

I believe that we have all forgotten about Donald Knuth: premature
optimisation is the root of all evil.

We don't have "spam" yet, but we are already trying to protect against it.
There might be cases when some systems will be posting stats more often
than we want, but that probably will not harm us. Or this will be done by
our main users who run 1kk Gentoo installations, and this "spam" will
actually be valuable. Moreover, nobody forces us to treat info from
'goose' as first priority, so we are still able to choose which packages
to work on. In short: this topic is not so important yet, I think.

Viktar




* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 20:13         ` Viktar Patotski
@ 2020-05-22  0:38           ` Alec Warner
  2020-05-22  4:42           ` Michał Górny
  2020-05-22  6:17           ` waebbl
  2 siblings, 0 replies; 44+ messages in thread
From: Alec Warner @ 2020-05-22  0:38 UTC (permalink / raw
  To: Gentoo Dev; +Cc: Michał Górny, Tomas Mozes

[-- Attachment #1: Type: text/plain, Size: 5648 bytes --]

On Thu, May 21, 2020 at 1:13 PM Viktar Patotski <xp.vit.blr@gmail.com>
wrote:

> Hi all,
>
> I believe that we are all have forgotten about Donald Knuth: Premature
> optimisation is the root of all evill.
>
> We don't have "spam" yet, but we are already trying to protect. There
> might be cases when some systems will be posting stats more often than we
> want, but probably that will not harm us. Or this will be done by our main
> users who runs 1kk of gentoo installations and this "spam"  will be
> actually valuable. Moreover, nobody forces us to treat info from 'goose' as
> first priority, so we are still able to select on which packages to work.
> In short: this topic is not so important yet, I think.
>

I raised a similar question on IRC and the conclusion was that 'it is good
to have ideas' and I don't necessarily disagree there[0]. We cannot build a
foolproof system, but some protections are feasible in some scenarios[1].

[0] Gentoo offers numerous no-login-required services; most of these are
read-only but they typically don't suffer from attacks; or at least, not
attacks that we need to respond to. The most obvious one of these is our
gentoo.org mail service which accepts unauthenticated email to gentoo.org.
Our anti-email-spam countermeasures are what I would call complex, but we
still employ broad measures when needed and the tradeoffs are similar to
the options for goose; e.g. if we are too broad we can block email from
large swaths of the internet.
[1] Bugzilla *has* recently been the target of spam attacks. It *does*
require logins (e.g. to create / modify bugs), and that has not stopped the
spammers from creating accounts. We have discussed different protections
for bugzilla, as it has different parameters. A basic bugzilla account
can't do all that much (you can't modify the bugs of others easily) and
spam posts are easily identified. This differs from goose, where
the powers of each token are the same (submit report) and it may be
difficult to tell an abusive report from a real report.




* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 20:07     ` Toralf Förster
@ 2020-05-22  4:39       ` Michał Górny
  0 siblings, 0 replies; 44+ messages in thread
From: Michał Górny @ 2020-05-22  4:39 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 968 bytes --]

On Thu, 2020-05-21 at 22:07 +0200, Toralf Förster wrote:
> On 5/21/20 11:43 AM, Michał Górny wrote:
> > On Thu, 2020-05-21 at 11:17 +0200, Toralf Förster wrote:
> > > On 5/21/20 10:47 AM, Michał Górny wrote:
> > > > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > > > i.e. mass fake submissions.
> > > > 
> > > 
> > > I'd combine IP-limits with proof-of-work.
> > > CAPTCHA should be the very last option IMO.
> > > 
> > 
> > To be honest, I don't see the point for proof-of-work if we have IP
> > limits.
> > 
> 
> The POW has to be made for every submission and should (somehow) include the IP-address.
> So you have 2 barriers. None of both is perfect but their combination is expensive.

No, one of them is expensive while the other is completely covered by
it.  I can't imagine requiring a PoW so expensive that it would limit
requests more than reasonable IP limiting.

-- 
Best regards,
Michał Górny



* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 20:13         ` Viktar Patotski
  2020-05-22  0:38           ` Alec Warner
@ 2020-05-22  4:42           ` Michał Górny
  2020-05-22  6:03             ` Michał Górny
  2020-05-22  6:17           ` waebbl
  2 siblings, 1 reply; 44+ messages in thread
From: Michał Górny @ 2020-05-22  4:42 UTC (permalink / raw
  To: gentoo-dev; +Cc: Tomas Mozes

[-- Attachment #1: Type: text/plain, Size: 748 bytes --]

On Thu, 2020-05-21 at 22:13 +0200, Viktar Patotski wrote:
> We don't have "spam" yet, but we are already trying to protect. There might
> be cases when some systems will be posting stats more often than we want,
> but probably that will not harm us. Or this will be done by our main users
> who runs 1kk of gentoo installations and this "spam"  will be actually
> valuable. Moreover, nobody forces us to treat info from 'goose' as first
> priority, so we are still able to select on which packages to work. In
> short: this topic is not so important yet, I think.
> 

Tell that to SKS keyserver admins.  Well, on the plus side if it
happens, it probably won't affect user systems in the process.

-- 
Best regards,
Michał Górny



* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22  4:42           ` Michał Górny
@ 2020-05-22  6:03             ` Michał Górny
  0 siblings, 0 replies; 44+ messages in thread
From: Michał Górny @ 2020-05-22  6:03 UTC (permalink / raw
  To: gentoo-dev; +Cc: Tomas Mozes

[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]

On Fri, 2020-05-22 at 06:42 +0200, Michał Górny wrote:
> On Thu, 2020-05-21 at 22:13 +0200, Viktar Patotski wrote:
> > We don't have "spam" yet, but we are already trying to protect. There might
> > be cases when some systems will be posting stats more often than we want,
> > but probably that will not harm us. Or this will be done by our main users
> > who runs 1kk of gentoo installations and this "spam"  will be actually
> > valuable. Moreover, nobody forces us to treat info from 'goose' as first
> > priority, so we are still able to select on which packages to work. In
> > short: this topic is not so important yet, I think.
> > 
> 
> Tell that to SKS keyserver admins.  Well, on the plus side if it
> happens, it probably won't affect user systems in the process.

Well, I didn't make my point very clear, so please let me explain.

Right now the project is in an experimental phase.  If we make major
changes right now, the harm is minimal.

If spamming happens one year from now, two years from now... we'd have
many users submitting data.  Suddenly, we would have to invent something
new, and it would probably be impossible within the framework used right
now.  This would most likely mean we'd have to literally kick all users
from the system and start over.

-- 
Best regards,
Michał Górny



* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21 20:13         ` Viktar Patotski
  2020-05-22  0:38           ` Alec Warner
  2020-05-22  4:42           ` Michał Górny
@ 2020-05-22  6:17           ` waebbl
  2020-05-22 13:39             ` Gordon Pettey
  2 siblings, 1 reply; 44+ messages in thread
From: waebbl @ 2020-05-22  6:17 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 439 bytes --]

Am Do., 21. Mai 2020 um 22:14 Uhr schrieb Viktar Patotski <
xp.vit.blr@gmail.com>:

> I believe that we are all have forgotten about Donald Knuth: Premature
> optimisation is the root of all evill.
>

I wouldn't consider spam protection to be an optimisation. Instead, the
occurrence of spam is IMO a proper use case from a developer's PoV. Therefore,
thinking about how to handle it is a necessary task.

--
With Regards
Bernd <waebbl@gmail.com>


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22  6:17           ` waebbl
@ 2020-05-22 13:39             ` Gordon Pettey
  2020-05-22 15:19               ` waebbl
  0 siblings, 1 reply; 44+ messages in thread
From: Gordon Pettey @ 2020-05-22 13:39 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 575 bytes --]

On Fri, May 22, 2020 at 1:18 AM waebbl <waebbl@gmail.com> wrote:

> Am Do., 21. Mai 2020 um 22:14 Uhr schrieb Viktar Patotski <
> xp.vit.blr@gmail.com>:
>
>> I believe that we are all have forgotten about Donald Knuth: Premature
>> optimisation is the root of all evill.
>>
> I won't consider spam protection to be a optimisation. Instead, the
> occurence of spam is IMO a proper use-case from a developers PoV. Therefore
> thinking about how to handle it, is a necessary task.
>
Abusing Knuth's words as an excuse to avoid any and all good practice is
the root of all evil.


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22 13:39             ` Gordon Pettey
@ 2020-05-22 15:19               ` waebbl
  0 siblings, 0 replies; 44+ messages in thread
From: waebbl @ 2020-05-22 15:19 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 740 bytes --]

Am Fr., 22. Mai 2020 um 15:40 Uhr schrieb Gordon Pettey <
petteyg359@gmail.com>:

> On Fri, May 22, 2020 at 1:18 AM waebbl <waebbl@gmail.com> wrote:
>
>> Am Do., 21. Mai 2020 um 22:14 Uhr schrieb Viktar Patotski <
>> xp.vit.blr@gmail.com>:
>>
>>> I believe that we are all have forgotten about Donald Knuth: Premature
>>> optimisation is the root of all evill.
>>>
>> I won't consider spam protection to be a optimisation. Instead, the
>> occurence of spam is IMO a proper use-case from a developers PoV. Therefore
>> thinking about how to handle it, is a necessary task.
>>
> Abusing Knuth's words as an excuse to avoid any and all good practice is
> the root of all evil.
>

Would you consider not even thinking about it a good practice?


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  8:47 [gentoo-dev] [RFC] Anti-spam for goose Michał Górny
                   ` (5 preceding siblings ...)
  2020-05-21 13:38 ` Kent Fredric
@ 2020-05-22 19:20 ` Kent Fredric
  2020-05-22 19:53   ` Brian Dolbec
  2020-05-22 19:58   ` Michał Górny
  2020-05-22 22:13 ` Peter Stuge
  7 siblings, 2 replies; 44+ messages in thread
From: Kent Fredric @ 2020-05-22 19:20 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1133 bytes --]

On Thu, 21 May 2020 10:47:07 +0200
Michał Górny <mgorny@gentoo.org> wrote:

> Other ideas
> ===========
> Do you have any other ideas on how we could resolve this?

And a question I'd like to revisit, because nobody responded to it:

- What incentives does a would-be spammer have to spam this service?

Services that see spam *typically* have a definable objective.

*Typically* it revolves around the ability to submit /arbitrary text/,
which allows them to hawk something, and this becomes a profit motive.

If we implement data validation so that there's no way for them to
profit off what they spam, it seems likely they'll be less motivated to
develop the necessary circumvention tools. ( as in, we shouldn't accept
arbitrary CAT/PN pairs as being valid until something can confirm those
pairs exist in reality )

There may be people trying to jack the data up, but ... it seems a less
worthy target.

So it seems the largest risk isn't so much "spam", but "denial of
service", or "data pollution".

Of course, we should still mitigate, but /how/ we mitigate seems to
pivot around this somewhat.
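
A sketch of what such validation might look like in Python.  The regular
expressions are a simplified approximation of Gentoo's category and
package naming rules, not the exact specification, and the set of known
pairs is assumed to come from somewhere trustworthy (for example a
listing of the main repository).  Well-formed but unknown pairs are kept
apart rather than dropped, since they may come from overlays.

```python
import re
from typing import Dict, Iterable, List, Set

# Simplified approximation of the allowed characters, for illustration.
CATEGORY_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9+_.-]*")
PACKAGE_RE = re.compile(r"[A-Za-z0-9_][A-Za-z0-9+_-]*")


def is_well_formed(pair: str) -> bool:
    """True if `pair` looks like exactly one 'category/package' name."""
    category, sep, name = pair.partition("/")
    return bool(
        sep
        and CATEGORY_RE.fullmatch(category)
        and PACKAGE_RE.fullmatch(name)
    )


def classify(pairs: Iterable[str], known: Set[str]) -> Dict[str, List[str]]:
    """Split submitted pairs into confirmed, unconfirmed and rejected."""
    result: Dict[str, List[str]] = {
        "confirmed": [],    # well-formed and present in `known`
        "unconfirmed": [],  # well-formed but unknown, e.g. from an overlay
        "rejected": [],     # arbitrary text that is not a CAT/PN pair
    }
    for pair in pairs:
        if not is_well_formed(pair):
            result["rejected"].append(pair)
        elif pair in known:
            result["confirmed"].append(pair)
        else:
            result["unconfirmed"].append(pair)
    return result
```

Only the rejected bucket is discarded outright; whether unconfirmed
pairs are ever shown publicly is a separate policy decision.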


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22 19:20 ` Kent Fredric
@ 2020-05-22 19:53   ` Brian Dolbec
  2020-05-22 20:01     ` John Helmert III
  2020-05-23  9:40     ` Kent Fredric
  2020-05-22 19:58   ` Michał Górny
  1 sibling, 2 replies; 44+ messages in thread
From: Brian Dolbec @ 2020-05-22 19:53 UTC (permalink / raw
  To: gentoo-dev

On Sat, 23 May 2020 07:20:22 +1200
Kent Fredric <kentnl@gentoo.org> wrote:

> On Thu, 21 May 2020 10:47:07 +0200
> Michał Górny <mgorny@gentoo.org> wrote:
> 
> > Other ideas
> > ===========
> > Do you have any other ideas on how we could resolve this?  
> 
> And a question I'd like to revisit, because nobody responded to it:
> 
> - What are the incentives a would-be spammer has to spam this service.
> 
> Services that see spam *typically* have a definable objective.
> 
> *Typically* it revolves around the ability to submit /arbitrary text/,
> which allows them to hawk something, and this becomes a profit motive.
> 
> If we implement data validation so that there's no way for them to
> profit off what they spam, seems likely they'll be less motivated to
> develop the necessary circumvention tools. ( as in, we shouldn't
> accept arbitrary CAT/PN pairs as being valid until something can
> confirm those pairs exist in reality )
> 
> There may be people trying to jack the data up, but ... it seems a
> less worthy target.
> 
> So it seems the largest risk isn't so much "spam", but "denial of
> service", or "data pollution".
> 
> Of course, we should still mitigate, but /how/ we mitigate seems to
> pivot around this somewhat.

We cannot exclude overlays, which will have cat/pkg not in the main
gentoo repo.  So, we should not exclude a submission that includes a few
of these.  They would just become irrelevant outliers in our
processing of the data.  In fact, some of these outlier pkgs could be
relevant to our including that pkg in the main repo.

But, like you, I agree that purely spam submissions would be few, if any.


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22 19:20 ` Kent Fredric
  2020-05-22 19:53   ` Brian Dolbec
@ 2020-05-22 19:58   ` Michał Górny
  2020-05-23  7:54     ` Fabian Groffen
  2020-05-23 10:00     ` Kent Fredric
  1 sibling, 2 replies; 44+ messages in thread
From: Michał Górny @ 2020-05-22 19:58 UTC (permalink / raw
  To: gentoo-dev

On Sat, 2020-05-23 at 07:20 +1200, Kent Fredric wrote:
> On Thu, 21 May 2020 10:47:07 +0200
> Michał Górny <mgorny@gentoo.org> wrote:
> 
> > Other ideas
> > ===========
> > Do you have any other ideas on how we could resolve this?
> 
> And a question I'd like to revisit, because nobody responded to it:
> 
> - What are the incentives a would-be spammer has to spam this service.
> 
> Services that see spam *typically* have a definable objective.
> 
> *Typically* it revolves around the ability to submit /arbitrary text/,
> which allows them to hawk something, and this becomes a profit motive.
> 
> If we implement data validation so that there's no way for them to
> profit off what they spam, seems likely they'll be less motivated to
> develop the necessary circumvention tools. ( as in, we shouldn't accept
> arbitrary CAT/PN pairs as being valid until something can confirm those
> pairs exist in reality )
> 
> There may be people trying to jack the data up, but ... it seems a less
> worthy target.
> 
> So it seems the largest risk isn't so much "spam", but "denial of
> service", or "data pollution".

I meant 'spam' as 'undesired submissions'.  You seem to have used
a very narrow definition of 'spam' only to arrive at the same
problem under a different name.

Let's put it like this.  This thing starts working.  Package X is
broken, and we see that almost nobody is using it.  We remove that
package.  Now somebody is angry.  He submits a lot of fake data to
render the service useless so that we don't make any future decisions
based on it.

-- 
Best regards,
Michał Górny


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22 19:53   ` Brian Dolbec
@ 2020-05-22 20:01     ` John Helmert III
  2020-05-23  9:40     ` Kent Fredric
  1 sibling, 0 replies; 44+ messages in thread
From: John Helmert III @ 2020-05-22 20:01 UTC (permalink / raw
  To: gentoo-dev

On Fri, May 22, 2020 at 12:53:03PM -0700, Brian Dolbec wrote:
> We cannot exclude overlays which will have cat/pkg not in the main
> gentoo repo.  So, we should not exclude a submission that includes a few
> of these.

To avoid this problem, even if imperfectly, it should be possible to
track what repository a given package is installed from and then check
its validity based on a list of valid packages for a given overlay.
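
A minimal sketch of that check, assuming each submitted entry carries
the repository name next to the cat/pn.  The KNOWN_REPOS table, its
contents and the function names are hypothetical; a real service would
have to build the table from the actual repository list.

```python
# Hypothetical table: repository name -> set of cat/pn it is known to ship.
KNOWN_REPOS = {
    "gentoo": {"dev-lang/python", "sys-apps/portage"},
    "guru": {"app-misc/example-tool"},  # made-up package, for illustration
}


def check_entry(repo, cat_pn):
    """Classify one (repo, cat/pn) pair from a submission."""
    packages = KNOWN_REPOS.get(repo)
    if packages is None:
        return "unknown-repo"      # cannot be checked at all
    if cat_pn in packages:
        return "valid"
    return "unknown-package"       # repo is known, package is not in it


def check_submission(entries):
    """Map every (repo, cat/pn) pair of a submission to its classification."""
    return {entry: check_entry(*entry) for entry in entries}
```

Entries that come back as "unknown-repo" or "unknown-package" need not
be rejected outright; they could simply be kept out of the main
statistics.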

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-21  8:47 [gentoo-dev] [RFC] Anti-spam for goose Michał Górny
                   ` (6 preceding siblings ...)
  2020-05-22 19:20 ` Kent Fredric
@ 2020-05-22 22:13 ` Peter Stuge
  2020-05-23  9:49   ` Kent Fredric
  7 siblings, 1 reply; 44+ messages in thread
From: Peter Stuge @ 2020-05-22 22:13 UTC (permalink / raw
  To: gentoo-dev

Stop motivated attackers or keep low barrier to entry; pick any one. :)

Michał Górny wrote:
> CAPTCHA
> ==========================
> A traditional way of dealing with spam -- require every new system
> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> for one CAPTCHA).
> 
> The advantage of this method is that it requires a real human work to be
> performed, effectively limiting the ability to submit spam.
> The disadvantage is that it is cumbersome to users, so many of them will
> just resign from participating.

While services such as reCAPTCHA are (as said) massively intrusive, there
are other, much less intrusive and even terminal-compatible ways to construct
a CAPTCHA. Hello game developers, you have 80x23 "pixels" to render a puzzle
for a human above the response input line - that's not so bad.

Attacking something like a server-generated maths challenge rendered in a
randomly chosen and maybe distorted font would require OCR and/or ML, which
is fairly annoying. The only real problem then would be with OCR packages. ;)
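
A rough sketch of such a terminal challenge, assuming a simple two-number
addition and a tiny hand-made 3x5 bitmap font with a little random noise
as the "distortion".  The font, the noise characters and all function
names are made up for illustration; this is nowhere near a hardened
CAPTCHA.

```python
import random

# Tiny 3x5 bitmap font; '#' marks a lit cell.
FONT = {
    "0": ["###", "# #", "# #", "# #", "###"],
    "1": [" # ", "## ", " # ", " # ", "###"],
    "2": ["###", "  #", "###", "#  ", "###"],
    "3": ["###", "  #", "###", "  #", "###"],
    "4": ["# #", "# #", "###", "  #", "  #"],
    "5": ["###", "#  ", "###", "  #", "###"],
    "6": ["###", "#  ", "###", "# #", "###"],
    "7": ["###", "  #", "  #", "  #", "  #"],
    "8": ["###", "# #", "###", "# #", "###"],
    "9": ["###", "# #", "###", "  #", "###"],
    "+": ["   ", " # ", "###", " # ", "   "],
}


def render(text, rng, noise=0.05):
    """Render text in the bitmap font, sprinkling noise into blank cells."""
    rows = []
    for y in range(5):
        row = "  ".join(FONT[ch][y] for ch in text)
        row = "".join(
            rng.choice(".,'") if c == " " and rng.random() < noise else c
            for c in row
        )
        rows.append(row)
    return "\n".join(rows)


def make_challenge(rng=None):
    """Return (picture, expected_answer); the server keeps the answer."""
    rng = rng or random.SystemRandom()
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return render(f"{a}+{b}", rng), a + b


def check_answer(expected, response):
    """Compare the user's typed response with the expected sum."""
    try:
        return int(response.strip()) == expected
    except ValueError:
        return False
```

The picture is 23 columns by 5 rows, so it fits easily above an input
line on an 80x23 terminal.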

Combine with a rate limit that is increased manually as the service grows
more popular. It can be a soft limit which doesn't report failure but results
in queueing+maybe vetting of reports, to allow some elasticity for peaks.
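
The soft limit could look roughly like the sketch below.  The class
name, the fixed-window scheme and the in-memory queue are assumptions
made for illustration; the "limit" attribute is the knob that would be
raised by hand.

```python
from collections import deque


class SoftRateLimiter:
    """Accept up to `limit` reports per window; queue the rest silently."""

    def __init__(self, limit_per_window):
        self.limit = limit_per_window      # raised manually as usage grows
        self.accepted_in_window = 0
        self.pending = deque()             # reports awaiting a later window

    def submit(self, report):
        """Never fails from the client's point of view."""
        if self.accepted_in_window < self.limit:
            self.accepted_in_window += 1
            return "accepted"
        self.pending.append(report)        # could also be vetted by a human
        return "queued"

    def start_new_window(self):
        """Reset the counter and drain queued reports up to the limit."""
        self.accepted_in_window = 0
        drained = []
        while self.pending and self.accepted_in_window < self.limit:
            drained.append(self.pending.popleft())
            self.accepted_in_window += 1
        return drained
```

A short peak just spills over into the following windows, while a
sustained flood only makes the queue grow, and the queue can be
inspected or dropped.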

//Peter

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22 19:58   ` Michał Górny
@ 2020-05-23  7:54     ` Fabian Groffen
  2020-05-23  8:15       ` Michał Górny
  2020-05-23 10:00     ` Kent Fredric
  1 sibling, 1 reply; 44+ messages in thread
From: Fabian Groffen @ 2020-05-23  7:54 UTC (permalink / raw
  To: gentoo-dev

On 22-05-2020 21:58:54 +0200, Michał Górny wrote:
> Let's put it like this.  This thing starts working.  Package X is
> broken, and we see that almost nobody is using it.  We remove that
> package.  Now somebody is angry.  He submits a lot of fake data to
> render the service useless so that we don't make any future decisions
> based on it.

I'm afraid that has a heroic flair to me.  The service should never be
used for decisions like that, because it's a biased sample at best.
Doing stuff like this simply destroys the soul of the distribution.

I hope this isn't one of your genuine objectives with the service.  If
it is, I can see why you fear spam so much.

Fabian

-- 
Fabian Groffen
Gentoo on a different level

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-23  7:54     ` Fabian Groffen
@ 2020-05-23  8:15       ` Michał Górny
  0 siblings, 0 replies; 44+ messages in thread
From: Michał Górny @ 2020-05-23  8:15 UTC (permalink / raw
  To: gentoo-dev

On Sat, 2020-05-23 at 09:54 +0200, Fabian Groffen wrote:
> On 22-05-2020 21:58:54 +0200, Michał Górny wrote:
> > Let's put it like this.  This thing starts working.  Package X is
> > broken, and we see that almost nobody is using it.  We remove that
> > package.  Now somebody is angry.  He submits a lot of fake data to
> > render the service useless so that we don't make any future decisions
> > based on it.
> 
> I'm afraid that has a heroic flair to me.  The service should never be
> used for decisions like that, because it's a biased sample at best.
> Doing stuff like this simply destroys the soul of the distribution.
> 
> I hope this isn't one of your genuine objectives with the service.  If
> it is, I can see why you fear spam so much.
> 

What it is is one thing, what an angry user perceives it to be is
another.

-- 
Best regards,
Michał Górny


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22 19:53   ` Brian Dolbec
  2020-05-22 20:01     ` John Helmert III
@ 2020-05-23  9:40     ` Kent Fredric
  1 sibling, 0 replies; 44+ messages in thread
From: Kent Fredric @ 2020-05-23  9:40 UTC (permalink / raw
  To: gentoo-dev

On Fri, 22 May 2020 12:53:03 -0700
Brian Dolbec <dolsen@gentoo.org> wrote:

> We cannot exclude overlays which will have cat/pkg not in the main
> gentoo repo.  So, we should not exclude a submission that includes a few
> of these.  They would just become irrelevant outliers to our
> processing of the data.  In fact some of these outlier pkgs could be
> relevant to our including that pkg into the main repo.

We *can* still validate them against entries in known overlays.

And even if we *can't* validate everything, we can de-weight and hide
from *default* reports items that can't be found in known overlays.

This would move the difficulty goal from "submit a spam record" to:

- write an overlay
- get it published somewhere
- get it included in the database of known overlays
- then publish a spam record relating to it

Which sounds like a slow and painful process if the risk of being
blacklisted burns down that whole stack.
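
A small sketch of the de-weighting idea, assuming each submission is
just a list of cat/pn strings.  The KNOWN_PACKAGES set, the 0.1 weight
and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical: every cat/pn found in the main repo or a known overlay.
KNOWN_PACKAGES = {"dev-lang/python", "sys-apps/portage"}
UNVERIFIED_WEIGHT = 0.1  # arbitrary illustrative weight


def tally(submissions):
    """Count packages, keeping unverifiable ones in a separate, de-weighted tally."""
    verified, unverified = Counter(), Counter()
    for packages in submissions:
        for cat_pn in set(packages):       # ignore duplicates within one submission
            if cat_pn in KNOWN_PACKAGES:
                verified[cat_pn] += 1
            else:
                unverified[cat_pn] += UNVERIFIED_WEIGHT
    return verified, unverified


def default_report(submissions):
    """The default report shows verified packages only."""
    verified, _ = tally(submissions)
    return verified.most_common()
```

The unverified tally is still kept, so packages that show up often in
unknown overlays remain visible to anyone who asks for them explicitly.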

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22 22:13 ` Peter Stuge
@ 2020-05-23  9:49   ` Kent Fredric
  2020-05-24 13:05     ` Peter Stuge
  0 siblings, 1 reply; 44+ messages in thread
From: Kent Fredric @ 2020-05-23  9:49 UTC (permalink / raw
  To: gentoo-dev

On Fri, 22 May 2020 22:13:11 +0000
Peter Stuge <peter@stuge.se> wrote:

> While services such as reCAPTCHA are (as said) massively intrusive, there
> are other, much less intrusive and even terminal-compatible ways to construct
> a CAPTCHA. Hello game developers, you have 80x23 "pixels" to render a puzzle
> for a human above the response input line - that's not so bad.

Well, they kinda have to be; the state of AI is advancing so quickly
that current captcha systems undoubtedly also develop their own
adversarial AI to try to beat their own captcha.

I don't think we have the sort of power to develop this.

And the inherently low entropy of an 80x23 grid, with so few bits per
"pixel" (compared to full RGB), gives any would-be AI a substantial
leg up.

Using text distortion is amateur hour these days.

(and there's always mechanical-turk anyway)

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-22 19:58   ` Michał Górny
  2020-05-23  7:54     ` Fabian Groffen
@ 2020-05-23 10:00     ` Kent Fredric
  1 sibling, 0 replies; 44+ messages in thread
From: Kent Fredric @ 2020-05-23 10:00 UTC (permalink / raw
  To: gentoo-dev

On Fri, 22 May 2020 21:58:54 +0200
Michał Górny <mgorny@gentoo.org> wrote:

> Let's put it like this.  This thing starts working.  Package X is
> broken, and we see that almost nobody is using it.  We remove that
> package.  Now somebody is angry.  He submits a lot of fake data to
> render the service useless so that we don't make any future decisions
> based on it.

Sure, and I agree that's a risk. But it's not the "random users from the
internet fill your inbox with shallow promises of free money" sort of
risk that's typically implied by "spam" ;).

The set of potential attackers seems much smaller in our case, and they
are especially likely to be actual consumers of Gentoo.

This type of attacker seems to be the sort that is mitigated well by:

- Make it so that end users can't forge custom IDs; IDs can only be
  handed out by the server (but the ID doesn't actually add any
  tracking, it's just a chunk of randomness with a signature that
  verifies its legitimacy)

- Make ID generation expensive.

- Limit submissions per ID the same way we do now.

That way it doesn't harm typical users beyond their --setup, but hurts
would-be attackers.
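
One way to make ID generation expensive is a hashcash-style proof of
work.  That is an assumption on my part, nothing above prescribes it;
the function names and the SHA-256 / leading-zero-bits rule are purely
illustrative.

```python
import hashlib
import itertools


def leading_zero_bits(digest):
    """Number of leading zero bits in a bytes digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def solve(challenge, difficulty):
    """Client side: brute-force a nonce; cost roughly doubles per extra bit."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge, difficulty, nonce):
    """Server side: a single hash is enough to check the work."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty
```

The server would hand out a random challenge, the client would run
solve() once during --setup, and the server would only issue an ID
after verify() succeeds.  Raising the difficulty is then the knob to
turn when under attack.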

If we get under attack, we can just suspend ID generation services, or
rate limit ID generation.

(And we can encode data in the ID about when it was generated, and the
strength of the challenge of the generation, and then block submissions
based on criteria when problems occur)

This means we don't need to keep track of which IDs are "valid" server
side; the crypto bits do all the leg work.

Even if the private key doing the signing gets compromised, we can
change it, which forces all users to re-id, and flush the old data.
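
A sketch of such self-validating IDs.  To stay within the Python
standard library it uses an HMAC with a server-side secret instead of a
real public-key signature, which is a simplification of what is
described above; the token layout, SERVER_KEY and the function names
are all hypothetical.

```python
import hashlib
import hmac
import os
import struct
import time

# Hypothetical server secret; rotating it invalidates every issued ID.
SERVER_KEY = os.urandom(32)


def issue_id(difficulty, now=None, key=None):
    """Return random bytes + issue time + challenge strength + MAC, hex-encoded."""
    key = SERVER_KEY if key is None else key
    issued = int(time.time() if now is None else now)
    # 16 random bytes, 8-byte timestamp, 1-byte difficulty (0..255)
    payload = os.urandom(16) + struct.pack(">QB", issued, difficulty)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return (payload + tag).hex()


def verify_id(token, min_issued=0, min_difficulty=0, key=None):
    """Check the MAC, then apply the current acceptance criteria."""
    key = SERVER_KEY if key is None else key
    try:
        raw = bytes.fromhex(token)
    except ValueError:
        return False
    if len(raw) != 16 + 9 + 32:
        return False
    payload, tag = raw[:25], raw[25:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False                       # forged or issued under an old key
    issued, difficulty = struct.unpack(">QB", payload[16:])
    return issued >= min_issued and difficulty >= min_difficulty
```

No table of issued IDs is needed: the server only keeps the key, and
min_issued / min_difficulty can be tightened later to reject IDs that
were generated during an attack or with too weak a challenge.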

* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-23  9:49   ` Kent Fredric
@ 2020-05-24 13:05     ` Peter Stuge
  2020-05-24 15:21       ` Kent Fredric
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Stuge @ 2020-05-24 13:05 UTC (permalink / raw
  To: gentoo-dev

Kent Fredric wrote:
> > While services such as reCAPTCHA are (as said) massively intrusive, there
> > are other, much less intrusive and even terminal-compatible ways to construct
> > a CAPTCHA. Hello game developers, you have 80x23 "pixels" to render a puzzle
> > for a human above the response input line - that's not so bad.
> 
> Well, they kinda have to be,

I disagree with that, especially for this service; that was the point I
wanted to make. :)


> the state of AI is advancing so quickly that current captcha systems
> undoubtedly also develop their own adversarial AI to try to beat their
> own captcha.
> 
> I don't think we have the sort of power to develop this.

In any case I don't think that's required.


> And the inherently low entropy of only having 80x23 with so few
> (compared to full RGB) bits per pixel,

A character doesn't compare too badly to RGB. See aalib or, if you are
willing to risk excluding color-vision-impaired humans, libcaca.


> this gives any would-be AI a substantial leg up.
> 
> Using text distortion is amateur hour these days.
> 
> (and there's always mechanical-turk anyway)

Except this isn't for some web-scale disruptive startup, it's a
statistics/reputation system for an advanced, super-nerdy Linux distribution.

Please think more about the threat model, and remember the rate limit knob.

The bar only needs to be raised high enough.


//Peter


* Re: [gentoo-dev] [RFC] Anti-spam for goose
  2020-05-24 13:05     ` Peter Stuge
@ 2020-05-24 15:21       ` Kent Fredric
  0 siblings, 0 replies; 44+ messages in thread
From: Kent Fredric @ 2020-05-24 15:21 UTC (permalink / raw
  To: gentoo-dev

On Sun, 24 May 2020 13:05:35 +0000
Peter Stuge <peter@stuge.se> wrote:

> The bar only needs to be raised high enough.

Sure. A lot of this is just "think about what could happen in the worst
case imaginable".

It's very unlikely our worst cases will happen.

But we should at least have the ability to easily add mitigations in
future if things do get worse.

