On Thu, May 21, 2020 at 12:10 PM Michał Górny wrote:
> On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
> > On Thu, May 21, 2020 at 10:47 AM Michał Górny wrote:
> > > Hi,
> > >
> > > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > > i.e. mass fake submissions.
> > >
> > > Problem
> > > =======
> > > Goose currently lacks proper limiting of submitted data. The only
> > > limiter currently in place is based on a unique submitter id that
> > > is randomly generated at setup time and is fully under the
> > > submitter's control. This only protects against accidental
> > > duplicates; it can't protect against deliberate action.
> > >
> > > An attacker could easily submit thousands (millions?) of fake
> > > entries by issuing a lot of requests with different ids. Creating
> > > them is as trivial as using successive numbers. The potential
> > > damage includes:
> > >
> > > - distorting the metrics to the point of being useless (even
> > >   though some people consider them useless by design),
> > >
> > > - submitting lots of arbitrary data to cause DoS by growing
> > >   the database until no disk space is left,
> > >
> > > - blocking a large range of valid user ids, making collisions
> > >   with legitimate users more likely.
> > >
> > > I don't think it's worthwhile to discuss the motivation for doing
> > > so: whether it would be someone wishing harm to Gentoo,
> > > disagreeing with the project or merely wanting to try and see if
> > > it would work. The case of the SKS keyservers teaches us that you
> > > can't leave holes like this open for long, because someone will
> > > eventually abuse them.
> > >
> > > Option 1: IP-based limiting
> > > ===========================
> > > The original idea was to set a hard limit on submissions per
> > > week, based on the IP address of the submitter. This has (at
> > > least as far as IPv4 is concerned) the advantages that:
> > >
> > > - the submitter has limited control over his IP address (i.e. he
> > >   can't just submit stuff using arbitrary addresses),
> > >
> > > - the IP address range is naturally limited,
> > >
> > > - IP addresses have non-zero cost.
> > >
> > > This method could strongly reduce the number of fake submissions
> > > a single attacker could devise. However, it has a few problems
> > > too:
> > >
> > > - a low limit would harm legitimate submitters sharing an IP
> > >   address (i.e. behind NAT),
> > >
> > > - it actively favors people with access to a large number of IP
> > >   addresses,
> > >
> > > - it doesn't map cleanly to IPv6 (where some people may have just
> > >   one IP address, and others may have whole /64 or /48 ranges),
> > >
> > > - it may cause problems for users of anonymizing networks (and we
> > >   want to encourage Tor usage for privacy).
> > >
> > > All this considered, IP address limiting can't be used as the
> > > primary method of preventing fake submissions. However, I suppose
> > > it could work as an additional DoS prevention, limiting the
> > > number of submissions from a single address over short periods of
> > > time.
> > >
> > > Example: if we limit to 10 requests an hour, then a single IP can
> > > be used to manufacture at most 240 submissions a day. This might
> > > be sufficient to render the metrics unusable but should keep the
> > > database reasonably safe.
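
For what it's worth, the short-period variant is easy to sketch.
Below is a minimal per-IP sliding-window limiter in Python (an
illustration only: it assumes a single server process with in-memory
state, and all names are made up; a real deployment would keep the
counters somewhere shared, such as Redis):

    import time
    from collections import defaultdict, deque

    WINDOW = 3600   # seconds -- the "10 requests an hour" example above
    LIMIT = 10      # max submissions per IP per window

    _hits = defaultdict(deque)   # ip -> timestamps of recent submissions

    def allow_submission(ip):
        """Return True if this IP may submit now, False if rate-limited."""
        now = time.monotonic()
        q = _hits[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while q and now - q[0] > WINDOW:
            q.popleft()
        if len(q) >= LIMIT:
            return False
        q.append(now)
        return True

With these settings a single address is capped at the 240 submissions
a day computed above.
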
> > > Option 2: proof-of-work
> > > =======================
> > > An alternative, using a proof-of-work algorithm, was suggested to
> > > me yesterday. The idea is that every submission has to be
> > > accompanied by the result of some cumbersome calculation that
> > > can't be trivially run in parallel or offloaded to dedicated
> > > hardware.
> > >
> > > On the plus side, it would rely more on actual physical hardware
> > > than on IP addresses provided by ISPs. While it would be a waste
> > > of CPU time and memory, doing it just once a week wouldn't do
> > > that much harm.
> > >
> > > On the minus side, it would penalize people with weak hardware.
> > > For example, 'time hashcash -m -b 28 -r test' gives:
> > >
> > > - 34 s (estimated by -s: 38 s) on a Ryzen 5 3600,
> > >
> > > - 3 minutes (estimated: 92 s) on some old 32-bit Celeron M.
> > >
> > > At the same time, it would still permit a lot of fake
> > > submissions. For example, RandomX [1] claims to require 2G of
> > > memory in fast mode. That would still allow me to use 7 threads.
> > > If we adjusted the algorithm to take ~30 seconds, that means
> > > 7 submissions every 30 s, i.e. 20k submissions a day.
> > >
> > > So in the end, while this is interesting, it doesn't seem like
> > > a workable anti-spam measure.
> > >
> > > Option 3: explicit CAPTCHA
> > > ==========================
> > > A traditional way of dealing with spam -- require every new
> > > system identifier to be confirmed by solving a CAPTCHA (or a few
> > > identifiers for one CAPTCHA).
> > >
> > > The advantage of this method is that it requires real human work,
> > > effectively limiting the ability to submit spam. The disadvantage
> > > is that it is cumbersome to users, so many of them will simply
> > > give up on participating.
> > >
> > > Other ideas
> > > ===========
> > > Do you have any other ideas on how we could resolve this?
> > >
> > > [1] https://github.com/tevador/RandomX
> > >
> > > --
> > > Best regards,
> > > Michał Górny
> >
> > Sadly, the problem with IP addresses (in this case) is that they
> > are anonymous. One can easily start an attack with thousands of
> > IPs (all around the world).
> >
> > One solution would be to introduce user accounts:
> > - one needs to register with an email
>
> Problem 1: you can trivially mass-create email addresses.

IP verification:
- get enough IPs (botnet) and send your payload

User verification:
- get an email, verify the account, solve a CAPTCHA

I know that if someone wants to, he'll try to bypass the user
verification, but it's just more work to do. We can also enforce IP
restrictions and use a combination of both.

> > - you can rate limit based on the client (not the IP)
> >
> > For example, I have 200 servers. I'd create one account, verify my
> > email (maybe a CAPTCHA too) and deploy a config with my token on
> > all servers. Then I'd set up a cron job on every server to submit
> > stats. A token can have some lifetime and you could create a new
> > one when the old one is about to expire.
> >
> > If you discover I'm doing false reports, you'd block all my
> > submissions. I can still do fake submissions, but you'd need
> > a per-host verification to avoid that.
>
> Problem 2: we can't really discover this, because the goal is to
> protect users' privacy. The best we can do is to discover that
> someone is submitting a lot from a single account (but are they
> legitimate?). But then, we can just block them.
>
> But in the end, this has the same problem as CAPTCHA -- or maybe
> it's even worse. It requires additional effort from the users,
> effectively making it less likely for them to participate.
> Furthermore, it requires them to submit e-mail addresses, which they
> may consider PII. Even if we don't store them permanently but just
> use them for initial verification, they could still choose not to
> participate.
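
To make the token scheme quoted above a bit more concrete, here is
a rough sketch of expiring, signed tokens in Python (hypothetical,
not anything goose implements; the server signs "account:expiry" with
an HMAC, clients attach the token to every submission, and rate
limits are applied per account instead of per IP):

    import hashlib
    import hmac
    import time

    SECRET = b"server-side secret"    # assumption: known only to the server
    TOKEN_LIFETIME = 30 * 24 * 3600   # assumption: tokens live ~30 days

    def issue_token(account_id):
        """Hand a signed, expiring token to a verified account."""
        expiry = int(time.time()) + TOKEN_LIFETIME
        payload = "%s:%d" % (account_id, expiry)
        sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        return "%s:%s" % (payload, sig)

    def verify_token(token):
        """Return the account id if the token is valid and unexpired."""
        try:
            account_id, expiry, sig = token.rsplit(":", 2)
        except ValueError:
            return None
        payload = "%s:%s" % (account_id, expiry)
        good = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, good):
            return None
        if int(expiry) < time.time():
            return None
        return account_id

Blocking an abusive account then just means refusing to renew its
tokens; as noted above, what this can't give you is per-host
verification.
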
I think if someone wants to participate and believes in the cause, he
will. Many of the users are on Bugzilla anyway, so their email is
already on the Gentoo side. Contributors have their emails in each
Gentoo commit.

If spamming turns out to be a serious problem, you can turn it into
an invite-only system, creating a chain of trust (see the sketch at
the end of this mail).

> --
> Best regards,
> Michał Górny
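
Regarding the invite-only idea, a sketch of what such a chain of
trust could look like (again hypothetical and simplified: each
account may mint a limited number of invite codes, every new account
records its inviter, and an abusive subtree can be cut off in one
step):

    import secrets

    MAX_INVITES = 3             # assumption: invite codes per account

    accounts = {"root": None}   # account -> who invited it (None = seed)
    invites = {}                # outstanding invite code -> inviter
    issued = {}                 # inviter -> number of codes minted

    def mint_invite(inviter):
        """Let an existing account create one invite code, up to a cap."""
        if inviter not in accounts or issued.get(inviter, 0) >= MAX_INVITES:
            return None
        code = secrets.token_urlsafe(16)
        invites[code] = inviter
        issued[inviter] = issued.get(inviter, 0) + 1
        return code

    def redeem(code, new_account):
        """Create a new account from an invite code, recording the chain."""
        inviter = invites.pop(code, None)
        if inviter is None or new_account in accounts:
            return False
        accounts[new_account] = inviter
        return True

    def ban_subtree(account):
        """Ban an account and everyone it (transitively) invited."""
        banned = {account}
        changed = True
        while changed:
            changed = False
            for acct, inviter in accounts.items():
                if inviter in banned and acct not in banned:
                    banned.add(acct)
                    changed = True
        return banned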