Date: Tue, 5 Mar 2024 06:12:06 +0000
From: "Robin H. Johnson"
To: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] RFC: banning "AI"-backed (LLM/GPT/whatever) contributions to Gentoo

(Full disclosure: I presently work for a non-FAANG cloud company whose
primary business focus is providing GPU access, for AI & other workloads;
I don't feel that is a conflict of interest, but I understand that others
might not feel the same way.)

Yes, we need to formally address the concerns.

However, I don't come to the same conclusion about an outright ban.

I think we need to:
1. Short-term: clearly point out why much of the present output would
   violate existing policies, esp. the low-grade garbage output.
2. Short & medium-term: adopt a time-limited policy saying "no AI-backed
   works temporarily, while waiting for legal precedent", with clear
   guidelines about what the blocking issues are.
3. Longer-term: produce a policy that shows how AI generation can be used
   for good, in a safe way**.
4. Keep the human in the loop; no garbage reinforcing garbage.

Further points inline.

On Tue, Feb 27, 2024 at 03:45:17PM +0100, Michał Górny wrote:
> Hello,
>
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns.
> In my opinion, at this point the only reasonable course of action
> would be to safely ban "AI"-backed contribution entirely.  In other
> words, explicitly forbid people from using ChatGPT, Bard, GitHub
> Copilot, and so on, to create ebuilds, code, documentation, messages,
> bug reports and so on for use in Gentoo.

Are there footholds where you see AI tooling as acceptable today?
AI summarization of inputs, if correct & free of hallucinations, is
likely to be of immediate value. I see this coming up in analyzing code
backtraces as well as in better license-analysis tooling.
The best tools here include citations explaining why the system thinks
the outcome is correct, and those citations should be verified: buyer
beware if you don't verify them.

> Just to be clear, I'm talking about our "original" content.  We can't do
> much about upstream projects using it.
>
> Rationale:
>
> 1. Copyright concerns.  At this point, the copyright situation around
> generated content is still unclear.  What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.

The Gentoo Foundation (and SPI) are both US legal entities. That means at
least abiding by US copyright law...
As of writing this, the US Copyright Office says AI-generated works are
NOT eligible for their *own* copyright registration. The outputs are
either uncopyrightable or, if they are sufficiently similar to existing
works, the original copyright stands (with license and authorship
markings required).

That's going to be a problem if the EU, UK & other major WIPO members
come to a different conclusion, but for now, as a US-based organization,
Gentoo has the rules it must follow.
The fact that it *might* be uncopyrightable, and NOT tagged as such,
gives me equal concern to the missing attribution & license statements.
Enough untagged uncopyrightable material present MAY invalidate larger
copyrights.

Clearer definitions of the distinction between public domain and
uncopyrightable are also required in our Gentoo documentation (at a high
level: ineligible vs not copyrighted vs expired vs
laws/acts-of-government vs works-of-government, but there is nuance).

>
> 2. Quality concerns.  LLMs are really great at generating plausibly
> looking bullshit.  I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.

100% agree; the quality of output is the largest concern *right now*.
The consistency of output is strongly related: given similar inputs
(including best practices not changing over time), it should give similar
outputs.

How good must the output be to negate this concern?
Current state-of-the-art can probably write ebuilds with fewer QA
violations than most contributors, esp. given automated QA checking tools
for a positive reinforcement loop.

Besides the actual output being low-quality, the larger problem is that
users submitting it don't realize that it's low-quality (or in a few
cases don't care).

Gentoo's existing policies may only need tweaks & re-iteration here.
- GLEP76 does not set out clear guidelines for uncopyrightable works.
- GLEP76 should have a clarification that asserting GCO/DCO over
  AI-generated works at this time is not acceptable.

> 3. Ethical concerns.  As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people.  The AI
> bubble is causing huge energy waste.  It is giving a great excuse for
> layoffs and increasing exploitation of IT workers.  It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.

Is an ethical AI entity possible?
Your argument here is really an extension of a much older maxim: "There's
no ethical consumption under capitalism". This can encompass most tech
corporations, AI or not. It's just much more readily exposed with AI than
with other "big tech" movements, because AI and the name of AI are being
used to do immoral & unethical things far more frequently than before.

A truly ethical AI entity should also not be the outcome of rent-seeking
behaviors (maybe profit-seeking, but that returns to the perils of
capitalism).

The energy waste argument is also one that needs to be made carefully:
the training & fine-tuning phases today are energy wastes only when
compared to the lifetime energy usage of a human to learn the same
things. When that gets more efficient, the human may be the energy waste
;-) [1].

The generation/inference phases may be able to generate correct output
MUCH more efficiently than a human. If I think of how many times I run
"ebuild ... test" and "pkgcheck scan" on some packaging, trying to get it
correct: the AI will be able to do a better job than most developers in a
reasonable course of time...

Gentoo's purpose as an organization is not to be an arbiter of ethics,
but we can stand against unethical actions. Where is that middle ground?

At the top, I noted that it will be possible in future for AI generation
to be used in a good, safe way, and we should provide some signals to the
researchers behind the AI industry on this matter.

What should it have?
- The output has correct license & copyright attributions for portions
  that are copyrightable.
- The output explicitly disclaims copyright for uncopyrightable portions
  (yes, this is a higher bar than we set for humans today).
- The output is provably correct (QA checks, actually running tests, etc.).
- The output is free of non-functional/nonsense garbage.
- The output is free of hallucinations (i.e. it doesn't invent
  dependencies that don't exist).
Can you please contribute other requirements that you feel "good" AI
output should have?

[1] Citation needed; the best estimate I have says:
https://www.eia.gov/tools/faqs/faq.php?id=85&t=1
76 MMBtu/person/year
https://www.wolframalpha.com/input?i=+76+MMBtu+to+MWh
=> 22.27 MWh/person/year
vs
Facebook claims entire model development energy consumption on all 4
sizes of LLaMA was 2,638 MWh
https://kaspergroesludvigsen.medium.com/facebook-disclose-the-carbon-footprint-of-their-new-llama-models-9629a3c5c28b
2638 / 22.27 => 118.45 people

So development energy was the same as 118 average people doing average
things for a year (not CompSci students compiling their code many times).

The outcome here: don't use AI where a human would be much more
efficient, unless you have strong reasons why it would be better to use
the AI than a human. We haven't crossed that threshold YET, but the day
is coming, esp. with amortized costs, given that training is a rare event
compared to inference.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
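
The arithmetic in footnote [1] can be reproduced with a few lines of Python. This is only a rough sketch: the two input figures (76 MMBtu/person/year and 2,638 MWh) are the ones quoted in the footnote, while the Btu-to-joule factor (International Table Btu) and all variable and function names are assumptions added for illustration.

```python
# Assumed conversion constants (not from the original message).
JOULES_PER_BTU = 1055.05585262   # International Table Btu, an assumption
JOULES_PER_MWH = 3.6e9           # 1 MWh = 3.6e9 J

# Figures quoted in footnote [1].
PER_PERSON_MMBTU_PER_YEAR = 76.0
LLAMA_DEVELOPMENT_MWH = 2638.0


def mmbtu_to_mwh(mmbtu: float) -> float:
    """Convert millions of Btu to megawatt-hours."""
    return mmbtu * 1e6 * JOULES_PER_BTU / JOULES_PER_MWH


per_person_mwh = mmbtu_to_mwh(PER_PERSON_MMBTU_PER_YEAR)
person_years = LLAMA_DEVELOPMENT_MWH / per_person_mwh

print(f"{per_person_mwh:.2f} MWh/person/year")
print(f"{person_years:.1f} person-years of average energy use")
```

The result lands at roughly 22.27 MWh/person/year and about 118 person-years, in line with the footnote; the small difference from 118.45 comes from dividing by the unrounded conversion rather than by 22.27.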