[gentoo-dev] [RFC] Anti-spam for goose

From: "Michał Górny" <mgorny@gentoo.org>
To: gentoo-dev <gentoo-dev@lists.gentoo.org>
Subject: [gentoo-dev] [RFC] Anti-spam for goose
Date: Thu, 21 May 2020 10:47:07 +0200
Message-ID: <496f9d713dc1d890d8af717c77429faac20912e1.camel@gentoo.org>

Hi,

TL;DR: I'm looking for opinions on how to protect goose from spam,
i.e. mass fake submissions.

Problem
=======
Goose currently lacks proper limiting of submitted data.  The only
limiter in place is based on a unique submitter id that is randomly
generated at setup time and fully controlled by the submitter.  This
protects only against accidental duplicates; it can't protect against
deliberate abuse.

An attacker could easily submit thousands (millions?) of fake entries by
issuing a lot of requests with different ids.  Creating them is
as trivial as using successive numbers.  The potential damage includes:

- distorting the metrics to the point of them being useless (even though
some people consider them useless by design).

- submitting lots of arbitrary data to cause DoS via growing
the database until no disk space is left.

- blocking a large range of valid user ids, making collisions with
legitimate users more likely.
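
To make the weakness concrete, here is a minimal local sketch of an
id-keyed store.  It is not goose's code: the class name, the payload
shape and the "overwrite on duplicate id" behaviour are all assumptions
made for illustration.

```python
class SubmissionStore:
    """Toy stand-in for an id-keyed backend (hypothetical, not goose's code)."""

    def __init__(self):
        self.entries = {}  # submitter id -> last submitted payload

    def submit(self, submitter_id, payload):
        """Store the payload; return True if this id was not seen before."""
        is_new = submitter_id not in self.entries
        self.entries[submitter_id] = payload
        return is_new


store = SubmissionStore()

# An honest client that resubmits with the same id only overwrites its entry.
store.submit("a3f1", {"profile": "desktop"})
store.submit("a3f1", {"profile": "desktop"})
honest_count = len(store.entries)

# A malicious client just counts upwards; every id looks like a new system.
for i in range(1000):
    store.submit(str(i), {"profile": "fake"})
total_count = len(store.entries)
```

The id check stops the accidental duplicate but accepts every one of
the sequential ids, since nothing ties an id to a real system.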

I don't think it's worthwhile to discuss the motivation for doing so:
whether it would be someone wishing harm to Gentoo, disagreeing with
the project or merely wanting to try and see if it would work.  The case
of SKS keyservers teaches us that you can't leave holes like this open
for a long time, because someone will eventually abuse them.

Option 1: IP-based limiting
===========================
The original idea was to set a hard limit of submissions per week based
on the IP address of the submitter.  This has (at least as far as IPv4
is concerned) the advantages that:

- the submitter has limited control over their IP address (i.e. they
can't just submit stuff using arbitrary data)

- the IP address range is naturally limited

- IP addresses have non-zero cost

This method could strongly reduce the number of fake submissions one
attacker could devise.  However, it has a few problems too:

- a low limit would harm legitimate submitters sharing an IP address
(e.g. behind NAT)

- it actively favors people with access to a large number of IP
addresses

- it doesn't map cleanly to IPv6 (where some people may have just one IP
address, and others may have a whole /64 or /48 range)

- it may cause problems for anonymizing network users (and we want to
encourage Tor usage for privacy)

All this considered, IP address limiting can't be used as the primary
method of preventing fake submissions.  However, I suppose it could work
as an additional DoS prevention measure, limiting the number of
submissions from a single address over short periods of time.
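
One common way to soften the IPv6 mismatch is to count a whole prefix
as one "address".  The sketch below assumes a /64 bucket by default;
the function name and the prefix length are illustrative choices, not
something settled in this proposal.

```python
import ipaddress


def rate_limit_key(addr, v6_prefix=64):
    """Return the bucket an address is counted under.

    IPv4 addresses are counted individually; IPv6 addresses are grouped
    by prefix (assumed /64 here), since one user may control a whole range.
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)
    # strict=False masks off the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{ip}/{v6_prefix}", strict=False))
```

Passing v6_prefix=48 groups more aggressively.  Either choice is
a guess about how ISPs hand out ranges, so it inherits the NAT-style
false positives listed above.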

Example: if we limit to 10 requests an hour, then a single IP can be
used to manufacture at most 240 submissions a day.  This might still be
enough to render the metrics unusable but should keep the database
reasonably safe.
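
A limit like this could be sketched as a sliding-window counter per
address.  This is only an illustration under assumed names (the class,
its parameters and the in-memory storage are hypothetical); a real
deployment would more likely do this in the web server or a shared
store.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within any `window` seconds."""

    def __init__(self, limit=10, window=3600.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested
        self.hits = defaultdict(deque)  # key -> timestamps of accepted hits

    def allow(self, key):
        now = self.clock()
        recent = self.hits[key]
        # Forget hits that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True
```

With limit=10 and window=3600, one key gets at most 10 accepted
submissions per hour, hence the 24 * 10 = 240 per day above.  Note that
the per-key queues themselves consume memory, so stale keys would need
to be expired as well.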

Option 2: proof-of-work
=======================
Using a proof-of-work algorithm instead was suggested to me yesterday.
The idea is that every submission has to be accompanied by the result
of some cumbersome calculation that can't be trivially run in parallel
or offloaded to dedicated hardware.

On the plus side, it would rely more on actual physical hardware than on
IP addresses provided by ISPs.  While it would be a waste of CPU time
and memory, doing it just once a week wouldn't do much harm.

On the minus side, it would penalize people with weak hardware.

For example, 'time hashcash -m -b 28 -r test' gives:

- 34 s (-s estimated 38 s) on Ryzen 5 3600

- 3 minutes (estimated 92 s) on some old 32-bit Celeron M
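
For readers unfamiliar with the scheme, here is a simplified
hashcash-style sketch.  It is not the real hashcash stamp format (which
uses SHA-1 and includes a date and random salt); the "resource:nonce"
layout, the use of SHA-256 and the function names are assumptions made
to keep the example short.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count how many of the most significant bits of a digest are zero."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def mint(resource, bits):
    """Search for a stamp whose SHA-256 has at least `bits` leading zero bits."""
    for nonce in count():
        stamp = f"{resource}:{nonce}"
        if leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp, resource, bits):
    """Checking a stamp costs one hash, however long minting took."""
    if not stamp.startswith(resource + ":"):
        return False
    return leading_zero_bits(hashlib.sha256(stamp.encode()).digest()) >= bits


# 12 bits needs about 2**12 attempts on average and finishes quickly.
stamp = mint("test", 12)
```

Each extra bit doubles the expected work, so 28 bits means roughly
2**28 (about 268 million) hash attempts on average.  The search is luck
based, which is why measured and estimated times can differ so much.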

At the same time, it would still permit a lot of fake submissions.  For
example, randomx [1] claims to require 2G of memory in fast mode.  This
would still allow me to use 7 threads.  If we adjusted the algorithm to
take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
submissions a day.
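
The arithmetic behind that figure, as a small sketch (the function name
is made up for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def submissions_per_day(parallel_workers, seconds_per_proof):
    """Upper bound on proofs one machine can produce in a day."""
    return int(parallel_workers * SECONDS_PER_DAY / seconds_per_proof)


# 7 workers, each finishing one proof every 30 seconds.
per_day = submissions_per_day(7, 30)
```

That comes to 20160 a day from a single machine, compared to the 240
a day per address in the rate limiting example under Option 1.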

So in the end, while this is interesting, it doesn't seem like
a workable anti-spam measure.

Option 3: explicit CAPTCHA
==========================
A traditional way of dealing with spam -- require every new system
identifier to be confirmed by solving a CAPTCHA (or a few identifiers
for one CAPTCHA).

The advantage of this method is that it requires real human work to be
performed, effectively limiting the ability to submit spam.
The disadvantage is that it is cumbersome for users, so many of them
will just give up on participating.
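
The "a few identifiers for one CAPTCHA" variant could be tracked with
a ticket that carries a small quota.  This is a rough sketch only: the
class, the quota of 3 and the in-memory bookkeeping are all
hypothetical, and checking the CAPTCHA answer itself is left out.

```python
import secrets


class CaptchaTickets:
    """Let one solved CAPTCHA confirm a limited number of system ids."""

    def __init__(self, ids_per_captcha=3):
        self.ids_per_captcha = ids_per_captcha
        self.remaining = {}   # ticket -> confirmations left
        self.confirmed = set()  # system ids that have been confirmed

    def issue(self):
        """Call only after a CAPTCHA was solved (verification not shown)."""
        ticket = secrets.token_urlsafe(16)
        self.remaining[ticket] = self.ids_per_captcha
        return ticket

    def register(self, ticket, submitter_id):
        """Confirm an id against a ticket; return True if it is confirmed."""
        if submitter_id in self.confirmed:
            return True  # already confirmed, costs nothing
        if self.remaining.get(ticket, 0) <= 0:
            return False  # unknown or exhausted ticket
        self.remaining[ticket] -= 1
        self.confirmed.add(submitter_id)
        return True
```

Submissions would then be accepted only for ids in the confirmed set,
so the number of fake ids is bounded by the number of CAPTCHAs someone
is willing to solve.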

Other ideas
===========
Do you have any other ideas on how we could resolve this?

[1] https://github.com/tevador/RandomX

-- 
Best regards,
Michał Górny
