public inbox for gentoo-user@lists.gentoo.org
* [gentoo-user] Emails are no indexable
@ 2024-07-08 15:07 Vitaly Zdanevich
  2024-07-08 15:10 ` Vitaly Zdanevich
  2024-07-08 17:41 ` Michael
  0 siblings, 2 replies; 5+ messages in thread
From: Vitaly Zdanevich @ 2024-07-08 15:07 UTC
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 345 bytes --]

Hi, I tried an exact-match Google search for a few sentences from this 
mailing list and found nothing. For example, this mirrored message, 
https://marc.info/?l=gentoo-user&m=171984189706185&w=2, does not show up in 
Google. Is it excluded from search? This is bad, because people google 
problems that are already solved in these emails :(

Is it a known issue?


* Re: [gentoo-user] Emails are no indexable
  2024-07-08 15:07 [gentoo-user] Emails are no indexable Vitaly Zdanevich
@ 2024-07-08 15:10 ` Vitaly Zdanevich
  2024-07-08 17:41 ` Michael
  1 sibling, 0 replies; 5+ messages in thread
From: Vitaly Zdanevich @ 2024-07-08 15:10 UTC
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 473 bytes --]

I also tried to ask my question on the Forum and got "Error in posting" :(


[-- Attachment #2.2: EZ1c4pJEEDy6eRz0.png --]
[-- Type: image/png, Size: 115410 bytes --]


* Re: [gentoo-user] Emails are no indexable
  2024-07-08 15:07 [gentoo-user] Emails are no indexable Vitaly Zdanevich
  2024-07-08 15:10 ` Vitaly Zdanevich
@ 2024-07-08 17:41 ` Michael
  2024-07-09 10:03   ` Vitaly Zdanevich
  1 sibling, 1 reply; 5+ messages in thread
From: Michael @ 2024-07-08 17:41 UTC
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 1085 bytes --]

On Monday, 8 July 2024 16:07:59 BST Vitaly Zdanevich wrote:
> Hi, I tried to google in "exact match" a few sentences from this email
> list - and nothing found. For example this mirroring
> https://marc.info/?l=gentoo-user&m=171984189706185&w=2 - and nothing in
> Google. Is it excluded from search? This is bad, because people google
> problems that are already solved in these emails :(
> 
> Is it a known issue?

It depends on what Google and other web crawlers have decided to include in 
their search results and what to exclude.  You should be able to restrict a 
search to a single website, but only for pages Google has already 
ranked/listed.  Say you want to find posts about blurred fonts in a Gentoo 
M/L, but not a Debian one, archived on marc.info.  You can search Google like so:

"blurred fonts" +gentoo -debian site:marc.info

This should return Gentoo M/L posts about blurred fonts found in the 
marc.info domain and exclude any Debian-related results.  Unfortunately, it 
relies on Google having ranked those pages in its results in the first 
place; otherwise you will not see them.

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [gentoo-user] Emails are no indexable
  2024-07-08 17:41 ` Michael
@ 2024-07-09 10:03   ` Vitaly Zdanevich
  2024-07-23 23:52     ` Hank Leininger
  0 siblings, 1 reply; 5+ messages in thread
From: Vitaly Zdanevich @ 2024-07-09 10:03 UTC
  To: gentoo-user, Michael

[-- Attachment #1: Type: text/plain, Size: 1219 bytes --]

In https://marc.info/robots.txt I see

User-agent: *
Disallow: /

It looks bad.



* Re: [gentoo-user] Emails are no indexable
  2024-07-09 10:03   ` Vitaly Zdanevich
@ 2024-07-23 23:52     ` Hank Leininger
  0 siblings, 0 replies; 5+ messages in thread
From: Hank Leininger @ 2024-07-23 23:52 UTC
  To: gentoo-user

[-- Attachment #1: Type: text/plain, Size: 4012 bytes --]

[ Originally sent on 2024-07-09 but it never made it to the list,
  probably because I am not subscribed. ]

On 2024-07-09, Vitaly Zdanevich wrote:
> In https://marc.info/robots.txt I see
> 
> User-agent: *
> Disallow: /
> 
> It looks bad.

You had to scroll down quite a bit to get there.  The very top of the
file is:

User-agent: Googlebot
Allow: /
Disallow: /?*s=*
Disallow: /?*a=*

[ Followed by similar stanzas for some specifically enumerated bots. ]

Meaning: Google can index everything on MARC, except for searches
(because of both load and transient nature of the results, even though
we do those as GETs so the _browser_ feels free to cache them and for
link color goodness, etc.) and lists of messages by author (because we
want some kind of throttling on MARC's value for OSINT and spam
harvesting).
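
To see why the "User-agent: *" stanza does not apply to Google, here is a
minimal sketch using Python's standard urllib.robotparser. It rests on a few
assumptions: the robots.txt text is only the two stanzas quoted in this thread
(the real file has more), and "ExampleBot" is a made-up crawler name. The
standard-library parser also does plain prefix matching rather than Google's
"*" wildcard matching, so the sketch only checks an ordinary message URL, not
the search or author URLs.

```python
from urllib.robotparser import RobotFileParser

# Only the two stanzas quoted in this thread; the real file contains more.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /
Disallow: /?*s=*
Disallow: /?*a=*

User-agent: *
Disallow: /
"""

# Message URL taken from the original post in this thread.
MESSAGE_URL = "https://marc.info/?l=gentoo-user&m=171984189706185&w=2"


def may_fetch(user_agent: str, url: str = MESSAGE_URL) -> bool:
    """Return True if the quoted rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    # A crawler uses the stanza naming it; "*" is only the fallback.
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "ExampleBot"):  # ExampleBot is hypothetical
        print(agent, may_fetch(agent))
```

A crawler that finds a stanza naming it uses that stanza and ignores the
catch-all one, which is why the "Disallow: /" further down the file does not
block Googlebot.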

The problem is, Google _won't_ index everything. It thinks the number of
unique pages on MARC is unreasonable (over 100 million messages, to say
nothing of links to individual MIME attachments, list-by-date views,
messages-in-thread, etc.). Google has only crawled a small percentage of
that, and only indexed a portion of the pages it has crawled. There's no
explanation of why, and you're actively discouraged from resubmitting
"crawled but not indexed" pages (not practical for millions of URLs
anyway).

I used to generate sitemap XML files and feed them to googlebot so that
it would be encouraged to come and get it. But it would/could never keep
up with the volume of new data (~300k messages/month?), meanwhile the
existing data it did have would get evicted from indexes with no
explanation. It probably wouldn't hurt to try uploading fresh ones
(other than the time it would cost me) but I don't have any confidence
it would help, either.
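
For a sense of scale: the sitemaps.org protocol caps a single sitemap file at
50,000 URLs, so roughly 300k new messages a month means about six new sitemap
files monthly. The sketch below is not MARC's actual generator. It is a
minimal Python illustration of what producing such files involves, and the
helpers message_url, build_sitemap, and chunked are hypothetical names. The
URL shape is copied from the link in the original post.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit from the sitemaps.org protocol


def message_url(list_name, msg_id):
    """Build a message URL in the shape seen in this thread (assumed stable)."""
    return f"https://marc.info/?l={list_name}&m={msg_id}&w=2"


def build_sitemap(urls):
    """Serialize one sitemap file (as UTF-8 bytes) for the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        # ElementTree escapes "&" in the URL as "&amp;" when serializing.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def chunked(items, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of at most `size` items, one per sitemap file."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    urls = [message_url("gentoo-user", "171984189706185")]
    print(build_sitemap(urls).decode("utf-8"))
```

Generating the files is the easy part; as described above, the bottleneck is
whether the crawler fetches and keeps what they list.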

_Maybe_ it would help convince Google to keep Gentoo content in MARC
indexed if each list we archive was individually linked in their entries
at https://www.gentoo.org/get-involved/mailing-lists/all-lists.html ,
but I have no actual evidence or indication that that's the case (nor am
I indirectly asking for such a change to be made, because again I don't
know that it would do any good).

Also:

>> On Monday, 8 July 2024 16:07:59 BST Vitaly Zdanevich wrote:
>>> list - and nothing found. For example this mirroring
>>> https://marc.info/?l=gentoo-user&m=171984189706185&w=2  - and
>>> nothing in Google. Is it excluded from search? This is bad, because
>>> people google problems that are already solved in these emails :(

Any given page might, in fact, be excluded by Google on purpose, and I'm
not supposed to be able to find out if it is[1].

Google seems to be quick to act on GDPR requests and the like, which is
nice overall. They do so by excluding certain contested search results
when the search comes from a covered country. So if someone in the EU
comments in a public email thread and later decides they want their name
to disappear, they can cause their message and any that quote them to be
suppressed - when searches originate from the EU (simplified, I am not a
lawyer, etc.).

Google used to report which URLs were being removed from searches, but
determined that that was itself an information leak they could not
abide, so for years now when they send a webmaster a "Notice of European
data protection law removal from Google Search" it says "to comply with
developments in European law, which seek to prevent the identification
of the requester, we are no longer disclosing the affected URLs".

I see the rationales and don't object to them, but the result still kind
of sucks.

[1] Of course it should be possible to, say, use VPNs to evaluate the
    results of searches coming from different sources, but I'm not gonna.

[ No comment on the other message in the thread by Michael/confabulate@
  other than, yes, 100% all of that. ]

Thanks,

-- 

Hank Leininger <hlein@marc.info>
CDFC 40DD 6B1D E176 8E84  A243 8FC6 9C04 40FD 2D11

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
