Web Site Download Tool [Archive] - Glock Talk



Team Greenbaum
12-04-2004, 19:28
Does anyone know a good web site download tool? One that downloads media and archive files? I've tried a few, most recently "7 Download Service". It worked great on html and image files but I never could get it to download .mpg, .mp3 or .zip files...

chevrofreak
12-04-2004, 19:57
I like Net Transport from Xi

Team Greenbaum
12-05-2004, 06:15
Is there a way to make it download the whole web site? I could only get it to download 1 file at a time. I'm looking for something that will crawl a site and download the whole thing.

Sinister Angel
12-05-2004, 10:01
Do they have a port of wget for Windows?

Team Greenbaum
12-05-2004, 11:00
Yep! WGET for Windows (http://www.interlog.com/~tcharron/wgetwin.html)
I'll check it out.
Thanks!

Sinister Angel
12-05-2004, 11:35
Originally posted by Scottbert
Yep! WGET for Windows (http://www.interlog.com/~tcharron/wgetwin.html)
I'll check it out.
Thanks!

Glad to help! I know it works wonders on my linux box.

lomfs24
12-05-2004, 23:35
Originally posted by Sinister Angel
Glad to help! I know it works wonders on my linux box.

How do you make wget pull an entire website? I have worked with it a little but mostly as a single file transport tool.

grantglock
12-06-2004, 08:02
Originally posted by lomfs24
How do you make wget pull an entire website? I have worked with it a little but mostly as a single file transport tool.


wget -r http://glocktalk.com
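
For anyone landing here later: plain -r recurses five levels deep by default and fetches as fast as it can, so a few extra options are usually worth adding. Below is a minimal sketch, not a definitive recipe. The function name mirror_site, the WGET override variable (handy for a dry run with WGET=echo), and the example.com URL are made up for illustration; the wget options themselves are standard ones from the manual.

```shell
#!/bin/sh
# Sketch: mirror one site with a few safety options.
# mirror_site and WGET are illustrative names, not part of wget.

mirror_site() {
  # $1 = start URL, $2 = directory to save the copy into
  # --level=5 is wget's default depth; --level=inf removes the limit.
  # --no-parent keeps wget from climbing above the starting directory.
  # --page-requisites also fetches images and stylesheets the pages need.
  # --convert-links rewrites links so the copy can be browsed locally.
  # --wait=1 pauses one second between requests.
  ${WGET:-wget} \
    --recursive \
    --level=5 \
    --no-parent \
    --page-requisites \
    --convert-links \
    --wait=1 \
    --directory-prefix="$2" \
    "$1"
}

# Example (placeholder URL):
#   mirror_site "http://example.com/" ./mirror
```

Running it with WGET=echo set first prints the options that would be passed instead of downloading anything, which is a cheap way to check the call before pointing it at a real site.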

lomfs24
12-06-2004, 09:20
Originally posted by grantglock
wget -r http://glocktalk.com
kewl. Thanks.

Team Greenbaum
12-14-2004, 07:55
Wget rocks! Almost.

It worked great on the first site I tried it on: it downloaded everything, including media files, and changed all URLs to local links. However, on the second site I tried, it immediately gets a 302 redirect to a completely different site. It's as if the web server is recognizing that I'm using wget instead of a browser and responding with the 302 redirect. Any ideas on what I can do to fix this?

Here are the options I'm using:
wget --output-file="wget.log" --recursive --level=inf --timestamping --convert-links --wait=1 --random-wait https://user:password@www.website.com
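
One way to test the theory that the server treats wget differently is to request the same URL twice, with two different User-Agent strings, and compare the first response. A rough sketch follows, assuming a wget new enough to have --max-redirect. The names probe_redirect and WGET, the example.com URL, and the agent strings are hypothetical, chosen only for the example.

```shell
#!/bin/sh
# Sketch: show the first HTTP response for a URL without following redirects.
# probe_redirect and WGET are illustrative names, not part of wget.

probe_redirect() {
  # $1 = URL to request, $2 = User-Agent string to send
  # --server-response prints the response headers (on stderr, hence 2>&1).
  # --max-redirect=0 stops wget at the first redirect instead of following it.
  # --output-document=/dev/null discards any body that does come back.
  ${WGET:-wget} --server-response --max-redirect=0 \
    --output-document=/dev/null --user-agent="$2" "$1" 2>&1
}

# Example (placeholder URL); compare the status and Location lines:
#   probe_redirect "https://www.example.com/" "Wget/1.9.1" | grep -i -E 'HTTP/|Location:'
#   probe_redirect "https://www.example.com/" "Mozilla/5.0 (X11; Linux x86_64)" | grep -i -E 'HTTP/|Location:'
```

If the 302 shows up only with the Wget string, the server is probably keying on User-Agent. If it shows up with both, something else (cookies, the Referer header, the login) is the more likely cause.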

HerrGlock
12-14-2004, 08:42
Originally posted by Scottbert
It's as if the web server is recognizing that I'm using wget instead of a browser and responding with the 302 redirect. Any ideas on what I can do to fix this?

You are right: the web site is seeing that you are using a getter instead of a browser.

There are things you can do, but most of the people who really don't want you to slurp their site already know them and have something to counter that too.

There is, however, a plugin for Firefox/Mozilla that would work, since it's a browser doing the slurping.

Just something to think about.
DanH

Sinister Angel
12-14-2004, 09:17
Actually, you can have wget send a forged User-Agent header, or maybe it's the Referer header you need as well.

--referer=url
Include `Referer: url' header in HTTP request. Useful for retrieving documents with server-side processing that assume they are always being retrieved by interactive web browsers and only come out properly when Referer is set to one of the pages that point to them.


and


-U agent-string
--user-agent=agent-string
Identify as agent-string to the HTTP server.

The HTTP protocol allows the clients to identify themselves using a "User-Agent" header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as Wget/version, version being the current version number of Wget.

However, some sites have been known to impose the policy of tailoring the output according to the "User-Agent"-supplied information. While conceptually this is not such a bad idea, it has been abused by servers denying information to clients other than "Mozilla" or Microsoft "Internet Explorer". This option allows you to change the "User-Agent" line issued by Wget. Use of this option is discouraged, unless you really know what you are doing.
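
Putting the two options together with the command posted earlier in the thread might look like the sketch below. The names mirror_as_browser and WGET, the example.com URLs, and the agent string are placeholders invented for the example, and there is no guarantee a given site will accept them.

```shell
#!/bin/sh
# Sketch: the earlier recursive command plus a custom User-Agent and Referer.
# mirror_as_browser and WGET are illustrative names, not part of wget.

mirror_as_browser() {
  # $1 = start URL, $2 = User-Agent string, $3 = Referer URL
  ${WGET:-wget} --output-file=wget.log \
    --recursive --level=inf --timestamping --convert-links \
    --wait=1 --random-wait \
    --user-agent="$2" --referer="$3" \
    "$1"
}

# Example (placeholder values):
#   mirror_as_browser "https://www.example.com/" \
#     "Mozilla/5.0 (X11; Linux x86_64)" "https://www.example.com/"
```

Setting WGET=echo beforehand prints the argument list instead of running the download, so the quoting of the agent string can be checked first.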

Hope this helps!