[gentoo-user] [OT] Differences between wget and browser file retrieval?

* [gentoo-user] [OT] Differences between wget and browser file retrieval?
@ 2021-01-14 20:49 Walter Dnes
  2021-01-14 21:10 ` Jack
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Walter Dnes @ 2021-01-14 20:49 UTC
  To: Gentoo Users List

  I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
sub-forum, on the Covid-19 case counts for Ontario, using provincial
data.  I download 2 files daily as source data.  One of them is a PDF
file, which is run through "pdftotext" and then parsed by a bash script
(don't ask).  Today, the command...

  wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

...returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
PDF file just fine.  Is "wget" being blocked?  With the browser, I have
to take extra steps to save the PDF file to the standard work area where
my script expects it, so it can work its magic and parse out the daily
breakdown by PHU (Public Health Unit).
BTW, today's posts requiring the PDF file are...
https://www.dslreports.com/forum/r33002718-
https://www.dslreports.com/forum/r33002752-

  I've tried setting --user-agent= with my browser's string as shown by
https://www.whatismybrowser.com/detect/what-is-my-user-agent  but no
luck.  Is there some way to get around this?  I have not updated my
system this past week, so I don't think the problem is at my end.
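
Since the parsing script depends on the download, one option is to check
the file before handing it to "pdftotext". A minimal sketch, assuming a
shell with "head -c" available (GNU coreutils has it); the function name
looks_like_pdf is made up for illustration, and the check only confirms
that the file is non-empty and begins with the "%PDF" signature:

```shell
# Hypothetical helper: succeed only if the file is non-empty and starts
# with the "%PDF" signature that PDF files begin with.
looks_like_pdf() {
    [ -s "$1" ] || return 1                  # -s: file exists and size > 0
    [ "$(head -c 4 "$1")" = "%PDF" ]
}
```

It could be used as
"looks_like_pdf moh-covid-19-report-en-2021-01-14.pdf || echo 'bad download' >&2"
so that a zero-byte file is reported instead of being parsed.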

-- 
Walter Dnes <waltdnes@waltdnes.org>
I don't run "desktop environments"; I run useful applications


* Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
  2021-01-14 20:49 [gentoo-user] [OT] Differences between wget and browser file retrieval? Walter Dnes
@ 2021-01-14 21:10 ` Jack
  2021-01-14 21:36   ` Andreas Fink
  2021-01-14 22:00 ` David Haller
  2021-01-15 16:28 ` [gentoo-user] Re: [OT SOLVED] " Walter Dnes
  2 siblings, 1 reply; 8+ messages in thread
From: Jack @ 2021-01-14 21:10 UTC
  To: gentoo-user

On 2021.01.14 15:49, Walter Dnes wrote:
>   I'm bored, so I do a regular daily report at the DSL Reports  
> "CanChat"
> sub-forum, on the Covid-19 case counts for Ontario, using provincial
> data.  I download 2 files daily as source data.  One of them is a PDF
> file, which is run through "pdftotext" and then parsed by a bash  
> script
> (don't ask).  Today, the command...
> 
>   wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> 
> ...returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
> of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up  
> the
> PDF file just fine.  Is "wget" being blocked?  I have to do extra  
> steps
> to get from the browser-invoked PDF to get the PDF file saved to the
> standard work area where my script expects it to be, so it can work  
> its
> magic and parse out the daily breakdown by PHU (Public Health Unit).
> BTW, today's posts requiring the PDF file are...
> https://www.dslreports.com/forum/r33002718-
> https://www.dslreports.com/forum/r33002752-
> 
>   I've tried setting --user-agent= with my browser's string as shown  
> by
> https://www.whatismybrowser.com/detect/what-is-my-user-agent  but no
> luck.  Is there some way to get around this?  I have not updated this
> past week, so I don't think the problem is at my end.

I just copy/pasted that wget command into my terminal, and it got me a  
1.7M PDF doc.  I'm in the US, but I have no idea if location/IP is an  
issue or not.

Jack



* Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
  2021-01-14 21:10 ` Jack
@ 2021-01-14 21:36   ` Andreas Fink
  0 siblings, 0 replies; 8+ messages in thread
From: Andreas Fink @ 2021-01-14 21:36 UTC
  To: gentoo-user

On Thu, 14 Jan 2021 16:10:09 -0500
Jack <ostroffjh@users.sourceforge.net> wrote:

> On 2021.01.14 15:49, Walter Dnes wrote:
> >   I'm bored, so I do a regular daily report at the DSL Reports
> > "CanChat"
> > sub-forum, on the Covid-19 case counts for Ontario, using provincial
> > data.  I download 2 files daily as source data.  One of them is a PDF
> > file, which is run through "pdftotext" and then parsed by a bash
> > script
> > (don't ask).  Today, the command...
> >
> >   wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> >
> > ...returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
> > of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up
> > the
> > PDF file just fine.  Is "wget" being blocked?  I have to do extra
> > steps
> > to get from the browser-invoked PDF to get the PDF file saved to the
> > standard work area where my script expects it to be, so it can work
> > its
> > magic and parse out the daily breakdown by PHU (Public Health Unit).
> > BTW, today's posts requiring the PDF file are...
> > https://www.dslreports.com/forum/r33002718-
> > https://www.dslreports.com/forum/r33002752-
> >
> >   I've tried setting --user-agent= with my browser's string as shown
> > by
> > https://www.whatismybrowser.com/detect/what-is-my-user-agent  but no
> > luck.  Is there some way to get around this?  I have not updated this
> > past week, so I don't think the problem is at my end.
>
> I just copy/pasted that wget command into my terminal, and it got me a
> 1.7M PDF doc.  I'm in the US, but I have no idea if location/IP is an
> issue or not.
>
> Jack
>

I could also download the file with the wget command that you posted.
If you still have trouble, you could try using curl and pretend to be
Firefox:
curl 'https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  -H 'Accept-Language: en,de;q=0.7,en-US;q=0.3' \
  --compressed \
  -H 'DNT: 1' \
  -H 'Connection: keep-alive' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  > moh-covid-19-report-en-2021-01-14.pdf

Andreas



* Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
  2021-01-14 20:49 [gentoo-user] [OT] Differences between wget and browser file retrieval? Walter Dnes
  2021-01-14 21:10 ` Jack
@ 2021-01-14 22:00 ` David Haller
  2021-01-15  7:40   ` Philip Webb
  2021-01-15  8:24   ` Walter Dnes
  2021-01-15 16:28 ` [gentoo-user] Re: [OT SOLVED] " Walter Dnes
  2 siblings, 2 replies; 8+ messages in thread
From: David Haller @ 2021-01-14 22:00 UTC
  To: Gentoo Users List

Hello,

On Thu, 14 Jan 2021, Walter Dnes wrote:
>  I'm bored, so I do a regular daily report at the DSL Reports "CanChat"
>sub-forum, on the Covid-19 case counts for Ontario, using provincial
>data.  I download 2 files daily as source data.  One of them is a PDF
>file, which is run through "pdftotext" and then parsed by a bash script
>(don't ask).  Today, the command...
>
>  wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>
>...returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
>of Pale Moon and Google Chrome (and I assume Firefox/etc) brings up the
>PDF file just fine.  Is "wget" being blocked?
[..]
>  I've tried setting --user-agent= with my browser's string as shown by
>https://www.whatismybrowser.com/detect/what-is-my-user-agent  but no
>luck.  Is there some way to get around this?  I have not updated this
>past week, so I don't think the problem is at my end.

I could download that file just fine just now[1]. Try running 'wget'
with the '-S' option. Oh and:

[..]
WARNING: cannot verify files.ontario.ca's certificate, issued by
[..]

If you sent stderr to /dev/null, you would have missed that warning.

So, try:

    wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
        https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
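
wget writes its log, including the headers printed by '-S', to stderr,
so a script can capture that output and look at the reported length. A
minimal sketch, assuming awk is available; the function name
content_length is made up for illustration:

```shell
# Hypothetical helper: read "wget -S" output on stdin and print the value
# of the last Content-Length header seen (the last one is the relevant
# one if the request was redirected).
content_length() {
    awk 'tolower($1) == "content-length:" { len = $2 }
         END { if (len != "") print len }'
}
```

Something like 'wget -S -O out.pdf "$url" 2>&1 | content_length' would
then print the length the server announced, or nothing if the header is
absent.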

BTW: you know that you can let date format that URL? e.g.:

    wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
      "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

The only caveat is that the format string may not contain an unescaped
'%' apart from date's own format sequences. So if a URL contains one,
you need to escape it with another '%', as in e.g.
    $(date '+foo%%20bar-%Y-%m-%d.pdf')
                ^^ this fella

In your case, the URL is clean ;)
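
To see the effect without touching the network, the two format strings
can be run on their own. This sketch assumes GNU date, since '-d' is a
GNU extension; the fixed date is only there to make the output
reproducible:

```shell
# Drop "-d 2021-01-14" to use today's date instead of a fixed one.
date -d 2021-01-14 '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf'
# prints https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

# A literal "%" in the format must be doubled, so "%%20" comes out as "%20".
date -d 2021-01-14 '+foo%%20bar-%Y-%m-%d.pdf'
# prints foo%20bar-2021-01-14.pdf
```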

HTH,
-dnh

[1] $ TZ=America/Toronto date
    Thu Jan 14 16:50:15 EST 2021

-- 
"Airplane travel is nature's way of making you look like your passport
photo."                                                     -- Al Gore



* Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
  2021-01-14 22:00 ` David Haller
@ 2021-01-15  7:40   ` Philip Webb
  2021-01-15 15:09     ` Walter Dnes
  2021-01-15  8:24   ` Walter Dnes
  1 sibling, 1 reply; 8+ messages in thread
From: Philip Webb @ 2021-01-15  7:40 UTC
  To: gentoo-user

210114 David Haller wrote:
> On Thu, 14 Jan 2021, Walter Dnes wrote:
>> I download daily a PDF.  Today, the command ...
>>  wget https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
>> returns a zero-byte file.  *BUT*, sticking the URL into the URL bar
>> of Pale Moon and Google Chrome brings up the PDF file just fine.
>> Is "wget" being blocked ?
> I could download that file just fine just now[1].
> Try running 'wget' with the '-S' option.
> Oh and :
>> WARNING: cannot verify files.ontario.ca's certificate, issued by
> So, try:
>   wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
>    https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
> BTW: you know that you can let date format that URL? e.g.:
>   wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
>    "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"
 
Here in Toronto, I get the same result as Walter via his URL
& similar results from the  2  longer versions above,
except that the escaped version gives "ERROR 403: Forbidden".

When I drop Walter's URL into the address bar of Firefox, no problem :
a  1,75 MB  PDF which appears to have all the info.

It looks as if the site is refusing 'wget' requests from Ontario,
but allowing them from eg Germany (!).

What Walter is doing is well worthwhile.  Press reports are very shallow
& the Ontario government doesn't appear to have any clear idea
just where & how the virus is being spread between humans.  HTH.

-- 
========================,,============================================
SUPPORT     ___________//___,   Philip Webb
ELECTRIC   /] [] [] [] [] []|   Cities Centre, University of Toronto
TRANSIT    `-O----------O---'   purslowatcadotinterdotnet




* Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
  2021-01-14 22:00 ` David Haller
  2021-01-15  7:40   ` Philip Webb
@ 2021-01-15  8:24   ` Walter Dnes
  1 sibling, 0 replies; 8+ messages in thread
From: Walter Dnes @ 2021-01-15  8:24 UTC
  To: gentoo-user

On Thu, Jan 14, 2021 at 11:00:38PM +0100, David Haller wrote

> So, try:
> 
>     wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
>         https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf

  No luck.  For DNS, I use my ISP's servers (Teksavvy) with fallback to
Google 8.8.8.8.

########################################################################
[i3][waltdnes][/dev/shm]  wget -S --no-check-certificate -U 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0' https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
--2021-01-15 02:15:30--  https://files.ontario.ca/moh-covid-19-report-en-2021-01-14.pdf
Resolving files.ontario.ca... 13.33.160.117, 13.33.160.123, 13.33.160.45, ...
Connecting to files.ontario.ca|13.33.160.117|:443... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Content-Type: application/pdf
  Content-Length: 0
  Connection: keep-alive
  Date: Thu, 14 Jan 2021 15:15:50 GMT
  Last-Modified: Thu, 14 Jan 2021 15:15:50 GMT
  ETag: "d41d8cd98f00b204e9800998ecf8427e"
  x-amz-meta-ctime: 1610637349
  x-amz-meta-mode: 33188
  x-amz-meta-gid: 500
  x-amz-meta-uid: 500
  x-amz-meta-mtime: 1610637349
  Accept-Ranges: bytes
  Server: AmazonS3
  X-Cache: Hit from cloudfront
  Via: 1.1 47dbad48e25df8c5ccf2822e46c2aaa6.cloudfront.net (CloudFront)
  X-Amz-Cf-Pop: YTO50-C3
  X-Amz-Cf-Id: ARgHfF6QMVfUtkxqkr0AL5ljxIfE7Yd5xPmA4eDMx46NdPXOwIftnQ==
  Age: 57573
Length: 0 [application/pdf]
Saving to: 'moh-covid-19-report-en-2021-01-14.pdf'

moh-covid-19-report     [ <=>                ]       0  --.-KB/s    in 0s      

2021-01-15 02:15:30 (0.00 B/s) - 'moh-covid-19-report-en-2021-01-14.pdf' saved [0/0]
########################################################################
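
One possible reading of the headers above, offered as a guess rather
than a diagnosis: the numbers can be checked locally. The sketch below
assumes GNU date and md5sum, and uses only values copied from the
response (the ETag, x-amz-meta-mtime, Date and Age headers):

```shell
# MD5 of empty input; compare with the ETag in the response.
printf '' | md5sum | cut -d' ' -f1
# prints d41d8cd98f00b204e9800998ecf8427e

# x-amz-meta-mtime as a UTC timestamp.
date -u -d @1610637349 '+%Y-%m-%d %H:%M:%S'
# prints 2021-01-14 15:15:49

# Date header plus Age (seconds): roughly when this cached copy was served.
cached=$(date -u -d '2021-01-14 15:15:50' +%s)
date -u -d "@$((cached + 57573))" '+%Y-%m-%d %H:%M:%S'
# prints 2021-01-15 07:15:23
```

The ETag matches the MD5 of an empty body, and 07:15:23 UTC is 02:15:23
EST, a few seconds before the request logged at 02:15:30 (presumably
local time). Together with "X-Cache: Hit from cloudfront", that would be
consistent with an empty object having been stored around 15:15:49 UTC
on the 14th and a CloudFront edge still serving its cached copy some 16
hours later, though the log alone cannot confirm it.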


> BTW: you know that you can let date format that URL? e.g.:
> 
>     wget -S --no-check-certificate -U 'Mozilla/5.0 ...' \
>       "$(date '+https://files.ontario.ca/moh-covid-19-report-en-%Y-%m-%d.pdf')"

  Nice, but civil servants get stat holidays off.  I downloaded Dec 25th
and 26th PDFs on the 26th.  Monday Dec 28th was a lieu day for Boxing
day, so I downloaded the 28th and 29th PDFs on the 29th.  And of course
Jan 1st and 2nd PDFs on Jan 2nd.  That's why I can't automate the date.
I have a script "getone"...

[i3][waltdnes][~/covid] cat getone 
#!/bin/bash
# $1 is the two-digit day of the month, e.g. "getone 14"
wget "https://files.ontario.ca/moh-covid-19-report-en-2021-01-${1}.pdf"

  On the 14th it was invoked as "../getone 14" (called from the working
directory, one level below the main "covid" directory).  I tweak the
script once a month to match year+month.  In a worst-case scenario, I
can go to
https://covid-19.ontario.ca/covid-19-epidemiologic-summaries-public-health-ontario#daily
to manually retrieve a daily PDF.  Note that on this page, they list
the date that the report is up to.  The report issued 10:15 AM on the
14th shows up in the listing as "COVID-19 in Ontario: January 13, 2021".
That's because it contains data up to the 13th.
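
The monthly tweak could be avoided by letting the script fill in the
year and month while the day stays a manual argument, which still copes
with holidays. A minimal sketch in POSIX shell; the function name
report_url and its optional second argument are made up for
illustration and are not part of the "getone" script:

```shell
# Hypothetical helper: build the report URL for a given day.
# Usage: report_url DAY [YYYY-MM]   (month defaults to the current one)
report_url() {
    day=$1
    month=${2:-$(date +%Y-%m)}
    case $day in [0-9]) day="0$day" ;; esac   # pad single digits: 5 -> 05
    printf 'https://files.ontario.ca/moh-covid-19-report-en-%s-%s.pdf\n' \
        "$month" "$day"
}
```

The script body would then be 'wget "$(report_url "$1")"', and a file
from an earlier month could still be fetched with, say,
'wget "$(report_url 29 2020-12)"'.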

-- 
Walter Dnes <waltdnes@waltdnes.org>
I don't run "desktop environments"; I run useful applications



* Re: [gentoo-user] [OT] Differences between wget and browser file retrieval?
  2021-01-15  7:40   ` Philip Webb
@ 2021-01-15 15:09     ` Walter Dnes
  0 siblings, 0 replies; 8+ messages in thread
From: Walter Dnes @ 2021-01-15 15:09 UTC
  To: gentoo-user

On Fri, Jan 15, 2021 at 02:40:51AM -0500, Philip Webb wrote
>  
> Here in Toronto, I get the same result as Walter via his URL
> & similar results from the  2  longer versions above,
> except that the escaped version gives "ERROR 403: Forbidden".

  I get "ERROR 403: Forbidden" when downloading a nonexistent file,
e.g. when I make a typo, or when the government site is late updating
and they haven't posted the file by the time I request it.
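
For the "late updating" case, a script could simply try again after a
pause. wget exits non-zero on a 403, but exits zero for a "200 OK" with
an empty body, so the sketch below also checks that the file is
non-empty. It is plain POSIX shell; the function name fetch_with_retry
and its argument order are made up for illustration:

```shell
# Hypothetical helper: run a fetch command until it succeeds AND leaves a
# non-empty output file, giving up after TRIES attempts.
# Usage: fetch_with_retry TRIES DELAY_SECONDS OUTFILE COMMAND [ARGS...]
fetch_with_retry() {
    tries=$1 delay=$2 out=$3
    shift 3
    n=1
    while [ "$n" -le "$tries" ]; do
        if "$@" && [ -s "$out" ]; then
            return 0
        fi
        n=$((n + 1))
        # Pause only if another attempt is coming.
        if [ "$n" -le "$tries" ]; then sleep "$delay"; fi
    done
    return 1
}
```

For example, 'fetch_with_retry 6 600 "$f" wget -q -O "$f" "$url"' would
try up to six times, ten minutes apart, with "$f" and "$url" set by the
caller.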

-- 
Walter Dnes <waltdnes@waltdnes.org>
I don't run "desktop environments"; I run useful applications



* [gentoo-user] Re: [OT SOLVED] Differences between wget and browser file retrieval?
  2021-01-14 20:49 [gentoo-user] [OT] Differences between wget and browser file retrieval? Walter Dnes
  2021-01-14 21:10 ` Jack
  2021-01-14 22:00 ` David Haller
@ 2021-01-15 16:28 ` Walter Dnes
  2 siblings, 0 replies; 8+ messages in thread
From: Walter Dnes @ 2021-01-15 16:28 UTC
  To: gentoo-user

  It looks like a temporary server hiccup yesterday. wget correctly
pulled down the PDF file for the 15th today.  I checked and it also
pulled down the file for the 14th.

-- 
Walter Dnes <waltdnes@waltdnes.org>
I don't run "desktop environments"; I run useful applications

