public inbox for gentoo-user@lists.gentoo.org
* [gentoo-user] creating local copies of web pages
@ 2005-12-02  1:41 Robert Persson
  2005-12-02  5:25 ` Shawn Singh
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Robert Persson @ 2005-12-02  1:41 UTC
  To: gentoo-user

I have been trying all afternoon to make local copies of web pages from a
Netscape bookmark file. I have been wrestling with httrack (through
khttrack), pavuk and wget, but none of them work. httrack and pavuk seem to
claim they can do the job, but they can't, or at least not in any way an 
ordinary mortal could be expected to work out. They do things like pretending 
to download hundreds of files without actually saving them to disk, crashing 
suddenly and frequently, and popping up messages saying that I haven't 
contributed enough code to their project to expect the thing to work 
properly. I don't want to do anything hideously complicated. I just want to 
make local copies of some bookmarked pages. What tools should I be using?

I would be happy to use a Windows tool in Wine if it worked. I would be happy
to reboot into Windows if I could get this job done.

One option would be to feed wget a list of URLs. The trouble is I don't know
how to turn an HTML bookmark file into a simple list of URLs. I imagine I
could do it in sed if I spent enough time learning sed, but my afternoon has
gone now and I don't have the time.

Many thanks
Robert
-- 
Robert Persson

"Don't use nuclear weapons to troubleshoot faults."
(US Air Force Instruction 91-111, 1 Oct 1997)

-- 
gentoo-user@gentoo.org mailing list




* Re: [gentoo-user] creating local copies of web pages
  2005-12-02  1:41 [gentoo-user] creating local copies of web pages Robert Persson
@ 2005-12-02  5:25 ` Shawn Singh
  2005-12-02  9:37   ` Martins Steinbergs
  2005-12-02  9:05 ` Neil Bothwick
  2005-12-02 15:42 ` Billy Holmes
  2 siblings, 1 reply; 17+ messages in thread
From: Shawn Singh @ 2005-12-02  5:25 UTC
  To: gentoo-user

I guess I'm not exactly sure what you're trying to do, but when I want to
get a local copy of a website I do this:

nohup wget -m http://www.someUrL.org &

Shawn



--
Shawn Singh


* Re: [gentoo-user] creating local copies of web pages
  2005-12-02  1:41 [gentoo-user] creating local copies of web pages Robert Persson
  2005-12-02  5:25 ` Shawn Singh
@ 2005-12-02  9:05 ` Neil Bothwick
  2005-12-02 13:39   ` Robert Persson
  2005-12-02 15:42 ` Billy Holmes
  2 siblings, 1 reply; 17+ messages in thread
From: Neil Bothwick @ 2005-12-02  9:05 UTC
  To: gentoo-user

On Thu, 1 Dec 2005 17:41:36 -0800, Robert Persson wrote:

> One option would be to feed wget a list of urls. The trouble is I don't
> know how to turn an html bookmark file into a simple list of urls. I
> imagine I could do it in sed if I spent enough time to learn sed, but
> my afternoon has gone now and I don't have the time.

wget will accept most files containing URLs; it doesn't have to be a
straight list. Try feeding it your bookmark file as is.
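
One detail that may matter here: by default `wget -i` expects a plain list of URLs, one per line, and only parses its input as HTML when given --force-html. A minimal sketch, assuming the bookmark export is saved as bookmarks.html (a made-up name, as is the function name and the output directory) and that GNU wget is installed:

```shell
# Sketch only: "bookmarks.html" and "./bookmark-copies" are made-up names.
fetch_bookmarks() {
    bookmarks=$1
    dest=${2:-./bookmark-copies}
    # -i FILE        read the URLs to fetch from FILE
    # --force-html   treat FILE as HTML and fetch the links found in it
    # -p             also fetch the images and stylesheets each page needs
    # -k             rewrite links in the saved pages so they work offline
    # -P DIR         save everything under DIR
    wget --force-html -i "$bookmarks" -p -k -P "$dest"
}

# Usage (not run here):
#   fetch_bookmarks bookmarks.html
```

Without -r this fetches only the bookmarked pages and what they need to display, not the rest of each site.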


-- 
Neil Bothwick

Excuse for the day: daemons did it


* Re: [gentoo-user] creating local copies of web pages
  2005-12-02  5:25 ` Shawn Singh
@ 2005-12-02  9:37   ` Martins Steinbergs
  2005-12-02 13:42     ` Robert Persson
  0 siblings, 1 reply; 17+ messages in thread
From: Martins Steinbergs @ 2005-12-02  9:37 UTC
  To: gentoo-user

On Friday 02 December 2005 07:25, Shawn Singh wrote:
> I guess I'm not exactly sure what you're trying to do, but when I want to
> get a local copy of a website I do this:
>
> nohup wget -m http://www.someUrL.org &
>
> Shawn

I use the Linux and Windows versions of httrack, generally without problems.
It sometimes fails to parse websites with dynamic content, but man httrack
describes plenty of options. In a previous job (Windows only) I ran a daily
task with httrack to fetch fresh rar files with database updates.
If there really are no files or directories created in the ~/websites folder,
check the write permissions and whether there is any space left.
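
Both checks can be done from a shell. A minimal sketch, assuming the default ~/websites location; the function name is made up for illustration:

```shell
# Sketch only: check_dest is a hypothetical helper, not part of httrack.
check_dest() {
    dir=${1:-$HOME/websites}
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "writable: $dir"
    else
        echo "not writable or missing: $dir"
        return 1
    fi
    # df -Pk prints one POSIX-format line per filesystem;
    # column 4 is the available space in 1024-byte blocks.
    df -Pk "$dir" | awk 'NR == 2 { print "free KiB: " $4 }'
}

# Usage:
#   check_dest ~/websites
```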


-- 
Linux 2.6.15-rc2 AMD Athlon(tm) 64 Processor 3200+
 11:18:28 up 45 min,  7 users,  load average: 0.00, 0.00, 0.00


* Re: [gentoo-user] creating local copies of web pages
  2005-12-02  9:05 ` Neil Bothwick
@ 2005-12-02 13:39   ` Robert Persson
  0 siblings, 0 replies; 17+ messages in thread
From: Robert Persson @ 2005-12-02 13:39 UTC
  To: gentoo-user

On December 2, 2005 01:05 am Neil Bothwick was like:
> wget will accept most files containing URLs, it doesn't have to be a
> straight list. Try feeding it your bookmark file as is.

Tried that. It borked.  :-(

* Re: [gentoo-user] creating local copies of web pages
  2005-12-02  9:37   ` Martins Steinbergs
@ 2005-12-02 13:42     ` Robert Persson
  2005-12-02 14:40       ` Martins Steinbergs
  0 siblings, 1 reply; 17+ messages in thread
From: Robert Persson @ 2005-12-02 13:42 UTC
  To: gentoo-user

On December 2, 2005 01:37 am Martins Steinbergs was like:
> If there really are no files or directories created in the ~/websites
> folder, check the write permissions and whether there is any space left.

Permissions are fine and there is quite a bit of space on the disk. httrack
creates directories in ~/websites, but no other files, despite the fact that
it claims to be downloading bucketloads of them.

* Re: [gentoo-user] creating local copies of web pages
  2005-12-02 13:42     ` Robert Persson
@ 2005-12-02 14:40       ` Martins Steinbergs
  2005-12-03  7:04         ` Robert Persson
  0 siblings, 1 reply; 17+ messages in thread
From: Martins Steinbergs @ 2005-12-02 14:40 UTC
  To: gentoo-user

On Friday 02 December 2005 15:42, Robert Persson wrote:
>
> Permissions are fine and there is quite a bit of space on the disk. httrack
> creates directories  in ~/websites, but no other files, despite the fact
> that it claims to be downloading bucketloads of them.

If httrack is running as root, everything goes to /root/websites/. Have you
looked there?
-- 
Linux 2.6.15-rc2 AMD Athlon(tm) 64 Processor 3200+
 16:38:31 up  6:05,  2 users,  load average: 0.08, 0.49, 0.82


* Re: [gentoo-user] creating local copies of web pages
  2005-12-02  1:41 [gentoo-user] creating local copies of web pages Robert Persson
  2005-12-02  5:25 ` Shawn Singh
  2005-12-02  9:05 ` Neil Bothwick
@ 2005-12-02 15:42 ` Billy Holmes
  2005-12-03  6:56   ` Robert Persson
  2 siblings, 1 reply; 17+ messages in thread
From: Billy Holmes @ 2005-12-02 15:42 UTC
  To: gentoo-user

Robert Persson wrote:
> I have been trying all afternoon to make local copies of web pages from a 
> netscape bookmark file. I have been wrestling with httrack (through 

wget -r http://$site/

have you tried that, yet?

* Re: [gentoo-user] creating local copies of web pages
  2005-12-02 15:42 ` Billy Holmes
@ 2005-12-03  6:56   ` Robert Persson
  2005-12-03 15:40     ` Matthew Cline
                       ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Robert Persson @ 2005-12-03  6:56 UTC
  To: gentoo-user

On December 2, 2005 07:42 am Billy Holmes was like:
> Robert Persson wrote:
> > I have been trying all afternoon to make local copies of web pages from a
> > netscape bookmark file. I have been wrestling with httrack (through
>
> wget -r http://$site/
>
> have you tried that, yet?

The trouble is that I have a bookmark file with several hundred entries. wget 
is supposed to be fairly good at extracting urls from text files, but it 
couldn't handle this particular file.

Robert


* Re: [gentoo-user] creating local copies of web pages
  2005-12-02 14:40       ` Martins Steinbergs
@ 2005-12-03  7:04         ` Robert Persson
  2005-12-03 13:40           ` Martins Steinbergs
  0 siblings, 1 reply; 17+ messages in thread
From: Robert Persson @ 2005-12-03  7:04 UTC
  To: gentoo-user

On December 2, 2005 06:40 am Martins Steinbergs was like:
> If httrack is running as root, everything goes to /root/websites/. Have you
> looked there?

I wasn't running it as root. The strange thing is that httrack did start 
creating a directory structure in ~/websites consisting of a couple of dozen 
directories or so (e.g. 
~/websites/politics/www.fromthewilderness.com/free/ww3/), but it didn't 
actually store any html or other site content, despite the fact that it was 
taking a very long time to do this and was claiming to have downloaded 
hundreds of files.

* Re: [gentoo-user] creating local copies of web pages
  2005-12-03  7:04         ` Robert Persson
@ 2005-12-03 13:40           ` Martins Steinbergs
  2005-12-03 18:09             ` Robert Persson
  0 siblings, 1 reply; 17+ messages in thread
From: Martins Steinbergs @ 2005-12-03 13:40 UTC
  To: gentoo-user

On Saturday 03 December 2005 09:04, Robert Persson wrote:
> I wasn't running it as root. The strange thing is that httrack did start
> creating a directory structure in ~/websites consisting of a couple of
> dozen directories or so (e.g.
> ~/websites/politics/www.fromthewilderness.com/free/ww3/), but it didn't
> actually store any html or other site content, despite the fact that it was
> taking a very long time to do this and was claiming to have downloaded
> hundreds of files.

If there aren't any files or folders under ~/websites, then it isn't a problem
with httrack. Even if mirroring goes wrong, there should at least be a project
folder containing an hts-cache folder and the hts-log.txt and index.html
files. Sorry, not much help from here.
martins

-- 
Linux 2.6.15-rc2 AMD Athlon(tm) 64 Processor 3200+
 15:20:24 up  1:03,  3 users,  load average: 0.27, 0.13, 0.08


* Re: [gentoo-user] creating local copies of web pages
  2005-12-03  6:56   ` Robert Persson
@ 2005-12-03 15:40     ` Matthew Cline
  2005-12-03 18:00     ` [gentoo-user] " Harry Putnam
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Matthew Cline @ 2005-12-03 15:40 UTC
  To: gentoo-user

On 12/3/05, Robert Persson <ireneshusband@yahoo.co.uk> wrote:
>
> The trouble is that I have a bookmark file with several hundred entries. wget
> is supposed to be fairly good at extracting urls from text files, but it
> couldn't handle this particular file.
>

I don't know what the exact format of your particular text file is,
but why don't you just use sed and friends to convert the text file into a
format that wget can use?
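
For a Netscape-style bookmark file, something along these lines might be enough. It is only a sketch: it assumes each bookmark carries an HREF="..." attribute in double quotes, it relies on GNU grep's -o option, it does not decode HTML entities such as &amp;, and the function and file names are made up:

```shell
# Sketch only: extract_urls is a hypothetical helper name.
extract_urls() {
    # 1. pull out each href="..." attribute (case-insensitive)
    # 2. strip everything up to the opening quote, then the closing quote
    # 3. keep only http/https URLs (drops javascript:, file: and the like)
    # 4. remove duplicates while keeping the original order
    grep -o -i 'href="[^"]*"' "$1" |
        sed -e 's/^[^"]*"//' -e 's/"$//' |
        grep -E -i '^https?://' |
        awk '!seen[$0]++'
}

# Usage (file names are examples):
#   extract_urls bookmarks.html > list.txt
```

If ftp:// bookmarks should be kept as well, the pattern in step 3 would need widening.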


HTH,

Matt


* [gentoo-user]  Re: creating local copies of web pages
  2005-12-03  6:56   ` Robert Persson
  2005-12-03 15:40     ` Matthew Cline
@ 2005-12-03 18:00     ` Harry Putnam
  2005-12-05 17:32     ` [gentoo-user] " Billy Holmes
  2005-12-05 17:33     ` Billy Holmes
  3 siblings, 0 replies; 17+ messages in thread
From: Harry Putnam @ 2005-12-03 18:00 UTC
  To: gentoo-user

Robert Persson <ireneshusband@yahoo.co.uk> writes:

> The trouble is that I have a bookmark file with several hundred entries. wget 
> is supposed to be fairly good at extracting urls from text files, but it 
> couldn't handle this particular file.

There is a program in portage called `urlview'.  I haven't used it for
a couple of yrs but it used to be able to create clickable urls from
ones found in text files.  Maybe it can output something wget can use.



* Re: [gentoo-user] creating local copies of web pages
  2005-12-03 13:40           ` Martins Steinbergs
@ 2005-12-03 18:09             ` Robert Persson
  2005-12-30  5:24               ` Robert Persson
  0 siblings, 1 reply; 17+ messages in thread
From: Robert Persson @ 2005-12-03 18:09 UTC
  To: gentoo-user

On December 3, 2005 05:40 am Martins Steinbergs was like:
> If there aren't any files or folders under ~/websites, then it isn't a
> problem with httrack. Even if mirroring goes wrong, there should at least
> be a project folder containing an hts-cache folder and the hts-log.txt and
> index.html files. Sorry, not much help from here.
> martins

But that's not what I've been saying, Martins. httrack +does+ create 
directories in ~/websites, including hts-cache. It also creates hts-log.txt, 
index.html, a lock file and a couple of gifs. However hts-cache is the only 
one of those directories with anything in it (aside from subdirectories and 
sub-subdirectories), and index.html is an empty file. What there is in 
hts-cache is a file called new.dat which contains a lot of the html that 
ought to have been put into the folders, all rolled into one huge file.


* Re: [gentoo-user] creating local copies of web pages
  2005-12-03  6:56   ` Robert Persson
  2005-12-03 15:40     ` Matthew Cline
  2005-12-03 18:00     ` [gentoo-user] " Harry Putnam
@ 2005-12-05 17:32     ` Billy Holmes
  2005-12-05 17:33     ` Billy Holmes
  3 siblings, 0 replies; 17+ messages in thread
From: Billy Holmes @ 2005-12-05 17:32 UTC
  To: gentoo-user

Robert Persson wrote:
> The trouble is that I have a bookmark file with several hundred entries. wget 
> is supposed to be fairly good at extracting urls from text files, but it 
> couldn't handle this particular file.

Try this:

emerge HTML-Tree

then as a normal user, run this script like so (where $file is your 
bookmark file)

$ perl listhref.pl $file > list.txt

[snip]
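#!/usr/bin/perl
# listhref.pl: print the href of every link in the HTML file given as argument
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse_file(shift);
# HTML::TreeBuilder stores tag names in lower case, so look for "a", not "A"
print "$_\n" for grep { defined($_) && length($_) }
    map { $_->attr('href') } $tree->look_down(_tag => 'a');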
[snip]

Then you can process your urls like so:

xargs wget -m < list.txt
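
Note that -m mirrors each bookmarked site in full, which can be far more than a copy of the bookmarked pages. If only the pages themselves are wanted, a loop like the following might do. It is a sketch: the function name, the output directory and failed.txt are made-up names, and it assumes list.txt holds one URL per line:

```shell
# Sketch only: fetch_list is a hypothetical helper name.
fetch_list() {
    list=$1
    dest=${2:-./bookmark-copies}
    failed=${3:-failed.txt}
    : > "$failed"                      # start with an empty failure log
    while IFS= read -r url; do
        [ -n "$url" ] || continue      # skip blank lines
        # -q quiet, -p page requisites, -k convert links, -P target directory.
        # A non-zero status means wget reported some error for this URL,
        # which may be a missing image rather than the page itself.
        wget -q -p -k -P "$dest" "$url" || printf '%s\n' "$url" >> "$failed"
    done < "$list"
}

# Usage (not run here):
#   fetch_list list.txt
```

With several hundred bookmarks some are bound to be dead, so the URLs left in failed.txt can be checked by hand afterwards.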

* Re: [gentoo-user] creating local copies of web pages
  2005-12-03  6:56   ` Robert Persson
                       ` (2 preceding siblings ...)
  2005-12-05 17:32     ` [gentoo-user] " Billy Holmes
@ 2005-12-05 17:33     ` Billy Holmes
  3 siblings, 0 replies; 17+ messages in thread
From: Billy Holmes @ 2005-12-05 17:33 UTC
  To: gentoo-user

Robert Persson wrote:
> The trouble is that I have a bookmark file with several hundred entries. wget 
> is supposed to be fairly good at extracting urls from text files, but it 
> couldn't handle this particular file.

My previous message assumes that your bookmark file really is an HTML
file.

* Re: [gentoo-user] creating local copies of web pages
  2005-12-03 18:09             ` Robert Persson
@ 2005-12-30  5:24               ` Robert Persson
  0 siblings, 0 replies; 17+ messages in thread
From: Robert Persson @ 2005-12-30  5:24 UTC
  To: gentoo-user

On December 3, 2005 10:09 am Robert Persson was like:
> On December 3, 2005 05:40 am Martins Steinbergs was like:
> > If there aren't any files or folders under ~/websites, then it isn't a
> > problem with httrack. Even if mirroring goes wrong, there should at least
> > be a project folder containing an hts-cache folder and the hts-log.txt
> > and index.html files. Sorry, not much help from here.
> > martins
>
> But that's not what I've been saying, Martins. httrack +does+ create
> directories in ~/websites, including hts-cache. It also creates
> hts-log.txt, index.html, a lock file and a couple of gifs. However
> hts-cache is the only one of those directories with anything in it (aside
> from subdirectories and sub-subdirectories), and index.html is an empty
> file. What there is in hts-cache is a file called new.dat which contains a
> lot of the html that ought to have been put into the folders, all rolled
> into one huge file.

It seems that the problem has been simply that httrack is incredibly slow at 
downloading pages, and that it also caches a lot of them before committing 
them to wherever they are supposed to end up. This is why it looked like 
nothing was happening whenever I tried to use it.
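
For anyone hitting the same thing, one way to tell a slow run from a stalled one is to count the non-empty files outside hts-cache from time to time. A minimal sketch, assuming the ~/websites project directory used in this thread; the function name is made up:

```shell
# Sketch only: count_saved is a hypothetical helper, not part of httrack.
count_saved() {
    # -type f                    regular files only
    # -size +0                   skip empty files (such as a blank index.html)
    # ! -path '*/hts-cache/*'    ignore httrack's own cache directory
    find "${1:-$HOME/websites}" -type f -size +0 ! -path '*/hts-cache/*' |
        wc -l | tr -d ' '
}

# Usage: run it every few minutes and see whether the number grows.
#   count_saved ~/websites
```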

Many thanks to everyone who has helped me here.

Robert
