From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from lists.gentoo.org (pigeon.gentoo.org [208.92.234.80])
    (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
    key-exchange X25519 server-signature RSA-PSS (2048 bits))
    (No client certificate requested)
    by finch.gentoo.org (Postfix) with ESMTPS id E94FF159C96
    for ; Thu, 25 Jul 2024 22:20:46 +0000 (UTC)
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
    by pigeon.gentoo.org (Postfix) with SMTP id 1B1F6E2A3A;
    Thu, 25 Jul 2024 22:20:40 +0000 (UTC)
Received: from mail.marc.info (marc1.marc.info [205.134.191.172])
    (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
    key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
    (No client certificate requested)
    by pigeon.gentoo.org (Postfix) with ESMTPS id A72F0E2A33
    for ; Thu, 25 Jul 2024 22:20:39 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=marc.info; s=mail;
    t=1721945989; bh=mrwr2ivixB/T5FS2oYWxmVjuV4qYm2Id+t+GceCaJEc=;
    h=Resent-From:Resent-Date:Resent-To:Date:From:To:Subject:Reply-To:
    In-Reply-To;
    b=kM7Jvs1shlunWtxKrE6jex9ATK+YsJYwLVnELzLxmD+EgSYNV40dxXoFtuD9fzrde
    AM7UOgcnrMUZiNagwGoZ9FT37K4Q1gkP8bkUOmPLCTTicpLKtLs6GW5SI4JyqWaxPJ
    CkkcIg6Rc+JwJJAh2tIvTnHhvC/xG65JhWzoXySU=
X-Virus-Status: Clean
X-Virus-Scanned: clamav-milter 0.103.11 at mail.marc.info
Resent-From: hlein@marc.info
Resent-Date: Thu, 25 Jul 2024 16:20:36 -0600
Resent-Message-ID:
Resent-To: gentoo-user@lists.gentoo.org
Date: Tue, 23 Jul 2024 17:52:32 -0600
From: Hank Leininger
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] Emails are no indexable
Message-ID:
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
List-Id: Gentoo Linux mail
X-BeenThere: gentoo-user@lists.gentoo.org
Reply-to: gentoo-user@lists.gentoo.org
X-Auto-Response-Suppress: DR, RN, NRN, OOF, AutoReply
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature";
    boundary="vyHOQj6bkK264DCr"
Content-Disposition: inline
In-Reply-To: <72c521ff-f60a-4bef-971d-9bf70ea8df29@ya.ru>
X-Archives-Salt: 11f18fef-8fdf-4792-b6d8-344582e3f0b0
X-Archives-Hash: 89cf82a299aef6d98794ac2b5f959aa6

--vyHOQj6bkK264DCr
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

[ Originally sent on 2024-07-09 but it never made it to the list,
  probably because I am not subscribed. ]

On 2024-07-09, Vitaly Zdanevich wrote:

> In https://marc.info/robots.txt I see
>=20
> User-agent: *
> Disallow: /
>=20
> It looks bad.

You had to scroll down quite a bit to get there. The very top of the
file is:

  User-agent: Googlebot
  Allow: /
  Disallow: /?*s=3D*
  Disallow: /?*a=3D*

[ Followed by similar stanzas for some specifically enumerated bots. ]

Meaning: Google can index everything on MARC, except for searches
(because of both load and transient nature of the results, even though
we do those as GETs so the _browser_ feels free to cache them and for
link color goodness, etc.) and lists of messages by author (because we
want some kind of throttling on MARC's value for OSINT and spam
harvesting).

The problem is, Google _won't_ index everything. It thinks the number of
unique pages on MARC is unreasonable (over 100 million messages, to say
nothing of links to individual MIME attachments, list-by-date views,
messages-in-thread, etc.). Google has only crawled a small percentage of
that, and only indexed a portion of the pages it has crawled. There's no
explanation of why, and you're actively discouraged from resubmitting
"crawled but not indexed" pages (not practical for millions of URLs
anyway).

I used to generate sitemap XML files and feed them to googlebot so that
it would be encouraged to come and get it. But it would/could never keep
up with the volume of new data (~300k messages/month?), meanwhile the
existing data it did have would get evicted from indexes with no
explanation.
It probably wouldn't hurt to try uploading fresh ones (other than the
time it would cost me) but I don't have any confidence it would help,
either.

_Maybe_ it would help convince Google to keep Gentoo content in MARC
indexed if each list we archive was individually linked in their entries
at https://www.gentoo.org/get-involved/mailing-lists/all-lists.html ,
but I have no actual evidence or indication that that's the case (nor am
I indirectly asking for such a change to be made, because again I don't
know that it would do any good).

Also:

>> On Monday, 8 July 2024 16:07:59 BST Vitaly Zdanevich wrote:
>>> list - and nothing found. For example this mirroring
>>> https://marc.info/?l=3Dgentoo-user&m=3D171984189706185&w=3D2 - and
>>> nothing in Google. Is it excluded from search? This is bad, because
>>> people google problems that are already solved in these emails :(

Any given page might, in fact, be excluded by Google on purpose, and I'm
not supposed to be able to find out if it is[1].

Google seems to be quick to act on GDPR requests and the like, which is
nice overall. They do so by excluding certain contested search results
when the search comes from a covered country. So if someone in the EU
comments in a public email thread and later decides they want their name
to disappear, they can cause their message and any that quote them to be
suppressed - when searches originate from EU (simplified, I am not a
lawyer, etc.).

Google used to report which URLs were being removed from searches, but
determined that that was itself an information leak they could not
abide, so for years now when they send a webmaster a "Notice of European
data protection law removal from Google Search" it says "to comply with
developments in European law, which seek to prevent the identification
of the requester, we are no longer disclosing the affected URLs". I see
the rationales and don't object to them, but the result still kind of
sucks.
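
For readers who want to check a given URL against the Googlebot stanza quoted near the top of this message, here is a minimal sketch in Python. It is not MARC's or Google's code. It assumes the commonly documented wildcard semantics (`*` matches any run of characters, a trailing `$` anchors the end of the URL, the longest matching pattern wins, and Allow wins a tie), and it writes the rules with the quoted-printable `=3D` decoded to `=`. The search and author URLs in the usage note are hypothetical; only the message URL comes from the thread. The names `GOOGLEBOT_RULES`, `pattern_to_regex` and `is_allowed` are made up for illustration.

```python
import re
from urllib.parse import urlsplit

# Googlebot stanza as quoted in the message, with "=3D" decoded to "=".
GOOGLEBOT_RULES = [
    ("allow", "/"),
    ("disallow", "/?*s=*"),
    ("disallow", "/?*a=*"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end;
    everything else is literal. Matching starts at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(url, rules=GOOGLEBOT_RULES):
    """Return True if the URL may be crawled under the given rules."""
    parts = urlsplit(url)
    # Rules are matched against the path plus the query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    best = None  # (pattern length, is_allow); the larger tuple wins
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(target):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the URL is allowed by default.
    return True if best is None else best[1]
```

Under these assumptions, `is_allowed("https://marc.info/?l=gentoo-user&m=171984189706185&w=2")` is true, because only `Allow: /` matches, while a hypothetical search URL such as `https://marc.info/?l=gentoo-user&s=robots` or a hypothetical author listing such as `https://marc.info/?a=123&r=1` is refused by the longer Disallow patterns. Python's standard `urllib.robotparser` is not used here because, as far as I know, it does plain prefix matching and does not expand `*` inside a path.
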

[1] Of course it should be possible to, say, use VPNs to evaluate the
    results of searches coming from different sources, but I'm not
    gonna.

[ No comment on the other message in the thread by Michael/confabulate@
  other than, yes, 100% all of that. ]

Thanks,

--=20
Hank Leininger
CDFC 40DD 6B1D E176 8E84 A243 8FC6 9C04 40FD 2D11

--vyHOQj6bkK264DCr
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEzfxA3Wsd4XaOhKJDj8acBED9LREFAmagQjQACgkQj8acBED9
LRG3mRAAgGSnNaJ+PpWyu5wq5Ny3JJDpnB561RgkQk6aq+ONAF7qmYdAEWtkYNF8
VFm6ejyaYwEO7ciq/6FahpimqJ+iEz/iROkg4c2cb6mv2vNFTiTFIK1QoydZI14z
twx767vQnAHrZWn0s9t1Ht8DOg4yHj9/1qoOhzmnshA3fiJXP62D4C3kQDRHGua1
ttMZRXxZKdgNT/xCKMnf7gHLt1lQx+Yw90JQuKNXhD/TuVLI2nMGCxGCdTIdZ7Qs
tOmH65AVgUf+jvlthYQ7GUs5x8Vi04EmpClTfKbsACGo1kS0OJBe8mVgWAJoC9pg
YPjtbdVxoN1atFlUD00zIqNk3YGPYqNkFfJ19M6gmht5RefuG3UWl24yPGO+igZf
2n9KAJ5Qr+laCfEZVYtt29kE+K5dnR7bH5RKn24KHq9tWUyNTe/K7sXGBRhSVxz1
tEMi54YCTxIAe5erZz+om5Yp+ijCqKsasTd/liigqhAacj+7bqh7ZAfouK/gkWKh
6zQbKNFVnmJSOzj+abAXxDAaNKvj2CLrDkSuBzKYm4KbkyUoYRoYivISxSmdzXgt
EbuJqvxKFW3nCSy0MmqK8XOFUJbo/vhlSWoGtFwqnXt0zSfQDye4tJFFwO9euDvE
VbnX55EVL1gexIribbGcI9X77JzLB3qiPg5E7zvfF8d6/ab6Aww=
=ims0
-----END PGP SIGNATURE-----

--vyHOQj6bkK264DCr--