public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] sources.gentoo.org instability
@ 2011-12-05  7:10 Alec Warner
  2011-12-05 11:48 ` Andreas K. Huettel
  0 siblings, 1 reply; 5+ messages in thread
From: Alec Warner @ 2011-12-05  7:10 UTC
  To: Gentoo Dev

Hello,

For a while sources.gentoo.org has been puttering along and its health
has slowly declined. We migrated it to newer, shinier hardware in an
attempt to mitigate the problem, but that did not pan out. 90% (or
more) of sources.gentoo.org traffic is crawler bots and not actual
humans. That being said, if we cannot serve requests to the bots
within our timeouts, we serve 500s instead, which is never really what
we want (particularly when we spent 20s of CPU to calculate 80% of the
response only to see the client time out :/).

The majority of the expensive requests are package.mask and
use.local.desc queries by crawlers, such as crawling the entire
13000-revision history of package.mask (or similar).

While it is likely we will monkey-patch viewvc to be less wasteful, in
the meantime I have removed use.local.desc from sources.gentoo.org
(and also anoncvs, because they share the same repo). I hope this is a
short-term (order of weeks) hack.

-A




* Re: [gentoo-dev] sources.gentoo.org instability
  2011-12-05  7:10 [gentoo-dev] sources.gentoo.org instability Alec Warner
@ 2011-12-05 11:48 ` Andreas K. Huettel
  2011-12-05 16:27   ` Alec Warner
  0 siblings, 1 reply; 5+ messages in thread
From: Andreas K. Huettel @ 2011-12-05 11:48 UTC
  To: gentoo-dev


Seriously, what do we gain from crawlers accessing sources.gentoo.org?  I can't
really remember seeing it once in a Google query result...

Possibly it would not even be required to deny all requests, but just deny 
everything related to ancient history...


-- 
Andreas K. Huettel
Gentoo Linux developer
kde, sci, arm, tex, printing





* Re: [gentoo-dev] sources.gentoo.org instability
  2011-12-05 11:48 ` Andreas K. Huettel
@ 2011-12-05 16:27   ` Alec Warner
  2011-12-05 22:20     ` Chí-Thanh Christopher Nguyễn
  0 siblings, 1 reply; 5+ messages in thread
From: Alec Warner @ 2011-12-05 16:27 UTC
  To: gentoo-dev

On Mon, Dec 5, 2011 at 3:48 AM, Andreas K. Huettel <dilfridge@gentoo.org> wrote:
>
> Seriously, what do we gain from crawlers accessing sources.gentoo.org?  I can't
> really remember seeing it once in a Google query result...

We want the site searchable.

>
> Possibly it would not even be required to deny all requests, but just deny
> everything related to ancient history...




* Re: [gentoo-dev] sources.gentoo.org instability
  2011-12-05 16:27   ` Alec Warner
@ 2011-12-05 22:20     ` Chí-Thanh Christopher Nguyễn
  2011-12-09  5:30       ` Alec Warner
  0 siblings, 1 reply; 5+ messages in thread
From: Chí-Thanh Christopher Nguyễn @ 2011-12-05 22:20 UTC
  To: gentoo-dev

Alec Warner wrote:
>> Seriously, what do we gain from crawlers accessing sources.gentoo.org?  I can't
>> really remember seeing it once in a Google query result...
> 
> We want the site searchable.

>>> The majority of the expensive requests are related to package.mask and
>>> use.local.desc queries by crawlers. Like crawling the entire 13000 rev
>>> history for package.mask (or similar.)

Would it be feasible to use mod_rewrite to direct the most expensive
requests to a static copy, which is re-generated every
${REASONABLE_TIMEFRAME}?
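
To make the idea concrete, here is a minimal Python sketch of the
"static copy, regenerated every so often" behaviour. It is not a
mod_rewrite configuration and not viewvc code; the class name
StaticCopyCache, the render callback and the max_age parameter are all
made up for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class StaticCopyCache:
    """Serve a stored copy of an expensive page and re-render it
    at most once per max_age seconds (hypothetical helper)."""

    def __init__(self, render: Callable[[str], str], max_age: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._render = render      # the expensive page generator
        self._max_age = max_age    # the "reasonable timeframe", in seconds
        self._clock = clock        # injectable so the logic can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._store.get(path)
        # Reuse the stored copy while it is still fresh.
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]
        # Otherwise pay the rendering cost once and remember the result.
        body = self._render(path)
        self._store[path] = (now, body)
        return body
```

With something like this in front of the history pages, a crawler
walking the same log view repeatedly would trigger one expensive render
per timeframe instead of one per request.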


Best regards,
Chí-Thanh Christopher Nguyễn




* Re: [gentoo-dev] sources.gentoo.org instability
  2011-12-05 22:20     ` Chí-Thanh Christopher Nguyễn
@ 2011-12-09  5:30       ` Alec Warner
  0 siblings, 0 replies; 5+ messages in thread
From: Alec Warner @ 2011-12-09  5:30 UTC
  To: gentoo-dev

2011/12/5 Chí-Thanh Christopher Nguyễn <chithanh@gentoo.org>:
> Alec Warner wrote:
>>> Seriously, what do we gain from crawlers accessing sources.gentoo.org?  I can't
>>> really remember seeing it once in a Google query result...
>>
>> We want the site searchable.
>
>>>> The majority of the expensive requests are related to package.mask and
>>>> use.local.desc queries by crawlers. Like crawling the entire 13000 rev
>>>> history for package.mask (or similar.)
>
> Would it be feasible to use mod_rewrite to direct the most expensive
> requests to a static copy, which is re-generated every
> ${REASONABLE_TIMEFRAME}?

For now, user agents that look like bots get sent to
sources2.gentoo.org (via HTTP 302, not a permanent redirect) and humans
stay on sources.gentoo.org. Assuming the crawlers and indexing systems
follow the spec, hopefully our search results do not all get rewritten
to sources2.gentoo.org (that would surprise me greatly... wait, no it
wouldn't ;p).
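
As a rough illustration of that routing rule (not the actual server
configuration), a Python sketch could look like the following. The
BOT_PATTERN regex, the route_request function and the http:// scheme
are assumptions made for the example; the real user-agent matching is
not shown in this thread.

```python
import re
from typing import Optional, Tuple

# Assumed heuristic: treat common crawler keywords as "looks like a bot".
BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp", re.IGNORECASE)
BOT_HOST = "sources2.gentoo.org"


def route_request(user_agent: Optional[str], path: str) -> Tuple[int, Optional[str]]:
    """Return (status, location) for a request.

    Bots get a temporary redirect (302) to the secondary host, so
    indexers should keep the original URL; everyone else is served
    locally (200, no redirect target).
    """
    if user_agent and BOT_PATTERN.search(user_agent):
        return 302, f"http://{BOT_HOST}{path}"
    return 200, None
```

The point of 302 over 301 is that a well-behaved indexer treats the
redirect as temporary and keeps listing the sources.gentoo.org URL.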

Robin added a caching layer for some segments of the application; I am
looking at cProfile dumps and discussing pain points with upstream.
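
For anyone who has not used it, this is roughly what collecting and
reading a cProfile report looks like with the Python standard library.
slow_log_view is a hypothetical stand-in for an expensive history page,
not viewvc code, and the revision count is only an example input.

```python
import cProfile
import io
import pstats


def slow_log_view(revisions: int) -> int:
    """Hypothetical stand-in for rendering a long revision history."""
    total = 0
    for rev in range(revisions):
        total += len(str(rev))
    return total


def profile_summary(revisions: int, top: int = 5) -> str:
    """Profile one call and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    slow_log_view(revisions)
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sorting by cumulative time surfaces the call paths that dominate.
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

Calling print(profile_summary(13000)) prints a table with one row per
function; a profile saved with dump_stats() can be loaded the same way
by passing the file name to pstats.Stats.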

-A

>
>
> Best regards,
> Chí-Thanh Christopher Nguyễn
>



