From: Sam James
To: Matt Jolly
Cc: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo
Date: Fri, 01 Mar 2024 07:06:14 +0000
Message-ID: <875xy6sa6x.fsf@gentoo.org>
In-Reply-To: (Matt Jolly's message of "Wed, 28 Feb 2024 21:06:51 +1000")
References: <076b8ede8f3e2d6d49571d516132a96a08b4d591.camel@gentoo.org>

Matt Jolly writes:

>> But where do we draw the line? Are translation tools like DeepL
>> allowed? I don't see much of a copyright issue for these.
>
> I'd also like to jump in and play devil's advocate. There's a fair
> chance that this is because I just got back from a
> supercomputing/research conf where LLMs were the hot topic in every
> keynote.
>
> As mentioned by Sam, this RFC is performative. Any users that are
> going to abuse LLMs are going to do it _anyway_, regardless of the
> rules. We already rely on common sense to filter these out; we're
> always going to have BS/spam PRs and bugs, and I don't think content
> generated by an LLM is any worse.
>
> This doesn't mean that I think we should blanket-allow poor-quality
> LLM contributions. It's especially important that we take into
> account the potential for bias, factual errors, and outright
> plagiarism when these tools are used incorrectly. We already have
> methods for weeding out low-quality contributions and bad-faith
> contributors - let's trust in them and see what we can do to
> strengthen those tools and processes.
>
> A bit closer to home for me, what about using an LLM as an assistive
> technology / to reduce boilerplate? I'm recovering from RSI - I don't
> know when (if...) I'll be able to type like I used to again. If a
> model is able to infer some mostly salvageable boilerplate from its
> context window, I'm going to use it and spend the effort I would have
> spent writing it on fixing something else; an outright ban on LLM use
> will reduce my _ability_ to contribute to the project.
Another person approached me after this RFC and asked whether tooling
restricted to the current repo would be okay. For me, that'd be mostly
acceptable, given that it won't make suggestions based on copyrighted
code.

I also don't have a problem with LLMs being used to help refine commit
messages, as long as someone is being sensible about it (e.g. if, as in
your situation, you know what you want to say but you can't type much).

Off the top of my head, though, I don't know how to phrase a policy
which allows those two things but not the rest.

> What about using an LLM for code documentation? Some models can do a
> passable job of writing decent-quality function documentation and, in
> production, I _have_ caught real issues in my logic this way. Why
> should I type that out (and write what I think the code does rather
> than what it actually does) if an LLM can get 'close enough' and I
> only need to do light editing?

I suppose in that sense, it's the same as blindly listening to any
linting tool or warning without understanding what it's flagging and
whether it's correct.

> [...]
> As a final not-so-hypothetical, what about an LLM trained on Gentoo
> docs and repos, or, more likely, trained exclusively on open-source
> contributions and fine-tuned on Gentoo specifics? I'm in the process
> of spinning up several models at work to get a handle on the tech /
> turn more electricity into heat - this is a real possibility (if I
> can ever find the time).

I think that'd be interesting. It also works well as a rhetorical point
wrt the policy being a bit too blanket here. See
https://www.softwareheritage.org/2023/10/19/swh-statement-on-llm-for-code/
too.

> The cat is out of the bag when it comes to LLMs. In my real-world job
> I talk to scientists and engineers using these things (for their
> strengths) to quickly iterate on designs, to summarise experimental
> results, and even to generate testable hypotheses. We're only going
> to see increasing use of this technology going forward.
>
> TL;DR: I think this is a bad idea. We already have effective
> mechanisms for dealing with spam and bad-faith contributions. Banning
> LLM use by Gentoo contributors at this point is just throwing the
> baby out with the bathwater.

The problem is that in FOSS, a lot of people are getting flooded with
AI spam and therefore have little regard for any possibly-good parts of
it. I count myself as part of that group - it's very much sludge, and I
feel tired just seeing it talked about at the moment.

Is that super rational? No, but we're also volunteers, and it's not
unreasonable for said volunteers to then say "well, I don't want any
more of that". I think this colours a lot of the responses here; it
doesn't invalidate them, but it does explain why nobody is really
interested in being open to this for now. Who can blame them (me
included)?

> As an alternative, I'd be very happy with some guidelines for the use
> of LLMs and other assistive technologies, like "Don't use LLM code
> snippets unless you understand them", "Don't blindly copy and paste
> LLM output", or, my personal favourite, "Don't be a jerk to our poor
> bug wranglers".
>
> A blanket "No completely AI/LLM-generated works" might be fine, too.
>
> Let's see how the legal issues shake out before we start pre-emptively
> banning useful tools. There's a lot of ongoing action in this space -
> at the very least, I'd like to see some thorough discussion of the
> legal issues separately if we're making a case for banning an entire
> class of technology.
I'm sympathetic to the arguments you've made here, and I don't want to
act like this sinks your whole argument (it doesn't), but that's
typically not how legal issues are approached. People act
conservatively if there's risk to them, not the other way around ;)

> [...]

Thanks for making me think a bit more about it and for considering some
use cases I hadn't really thought about. I still don't really want
ebuilds generated by LLMs, but I could live with:

a) LLMs being used to refine commit messages;
b) LLMs being used if restricted to suggestions from a FOSS-licensed
   codebase.

> Matt

thanks,
sam