Ultimate Command Line Downloader (WGET)

There are a few powerful graphical download managers available for Linux and UNIX-like operating systems:

d4x: Downloader for X is a user-friendly Linux/Unix program with a nice X interface for downloading files from the Internet. It supports both FTP and HTTP protocols, and supports resuming.

kget: KGet is a versatile and user-friendly download manager for the KDE desktop.

gwget / gwget2: Gwget is a download manager for the GNOME desktop.

However, when it comes to the command line, wget, the non-interactive downloader, rules. It supports the HTTP, FTP, and HTTPS protocols, along with authentication and tons of other options. Here are some tips to get the most out of it.
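
For example, if a server requires a username and password, wget can pass them on the command line. Here is a minimal sketch, assuming an FTP server with basic authentication; the host, username, and password are placeholders:


$ wget --ftp-user=myname --ftp-password='secret' ftp://ftp.example.com/pub/file.iso
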

Downloading Single and multiple files with Wget

Single file


$ wget http://ubuntu.com/10.10/ubuntu-10.10-desktop-i386.iso


Multiple files


$ wget http://ubuntu-10.10-i386.iso ftp://ftp.redhat.com/1rc-i386.rpm


How to read the URLs from a file?


$ vi /tmp/download.txt


Insert a list of URLs into the text file, one per line.
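
For example, /tmp/download.txt might look like this; the URLs below are placeholders:


http://www.example.com/file1.iso
ftp://ftp.example.com/pub/file2.tar.gz
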

Type the wget command as follows:


$ wget -i /tmp/download.txt


Continue the Incomplete Download Using wget -c

You can also tell wget to continue a partially-downloaded file. This is useful when you want to finish a download started by a previous instance of wget, or by another program:


$ wget -c http://www.cyberciti.biz/download/lsst.tar.gz


Please note that the -c option only works with FTP / HTTP servers that support the “range” header.
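
For HTTP downloads, if you are not sure whether a server supports resuming, you can ask wget to print the server response headers without downloading anything and look for an Accept-Ranges header; the URL below is a placeholder:


$ wget -S --spider http://www.example.com/download/lsst.tar.gz
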

How Do I Limit the Download Speed?
You can limit the download speed to a given number of bytes per second. The amount may be expressed in bytes, kilobytes with the k suffix, or megabytes with the m suffix. For example, --limit-rate=100k will limit the retrieval rate to 100KB/s. This is useful when, for whatever reason, you don't want wget to consume all of the available bandwidth, such as when downloading a large file like an ISO image:


$ wget -c -o /tmp/susedvd.log --limit-rate=50k ftp://ftp.novell.co/pub/dvd1.iso


Use the m suffix for megabytes (--limit-rate=1m). The above command will limit the retrieval rate to 50KB/s. It is also possible to specify a disk quota for automatic retrievals to avoid filling up the disk. The following command will abort once the 100MB quota is exceeded:


$ wget -cb -o /tmp/download.log -i /tmp/download.txt --quota=100m


What to do if only certain file types are needed?

Use the -A option

To download only PDF and JPG files, use:


$ wget -r -A pdf,jpg http://www.site.com


Now suppose you need to follow external links; wget does not do this by default, but the -H option enables it:


$ wget -r -H -A pdf,jpg http://www.site.com


This is a little dangerous, as it could end up downloading many more files than needed, so we can limit which sites to follow using the -D option:


$ wget -r -H -A pdf,jpg -Dfiles.site.com http://www.site.com


By default wget will follow 5 levels of links when using the -r option; we can change this behaviour with the -l option:


$ wget -r -l 2 http://www.site.com


Some sites refuse requests coming from wget's default user agent. Wget has a very handy -U option for sites like this. Use -U My-browser to tell the site you are using a commonly accepted browser:

$ wget -r -p -U Mozilla http://www.stupidsite.com/ht.html


The most important command line options here are --limit-rate= and --wait=. You should add --wait=20 to pause 20 seconds between retrievals; this helps make sure you are not added to a blacklist. --limit-rate defaults to bytes; add K to set KB/s. Example:


$ wget --wait=20 --limit-rate=20K -r -p -U Mozilla http://www.stupid.com


A web-site owner will probably get upset if you attempt to download his entire site using a simple wget http://foo.bar command. However, the web-site owner will not even notice you if you limit the download transfer rate and pause between fetching files.

Use --no-parent

--no-parent is a very handy option that guarantees wget will not download anything from directories above the one you want to acquire. Use this to make sure wget does not fetch more than it needs to if you just want the files in a single directory tree.
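
For example, to recursively grab only what lives under one directory, something like the following should work; the URL is a placeholder:


$ wget -r -np -nd http://www.example.com/docs/manual/
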

Download and Store With a Different File name Using wget -O

By default wget will take the filename from the part of the URL after the last forward slash, which may not always be appropriate.

Wrong: the following example will download and store the file with the name download_script.php?src_id=7701:


$ wget http://www.vim.org/scripts/download_script.php?src_id=7701


Even though the downloaded file is in ZIP format, it will be stored under that name, as shown below:

$ ls
download_script.php?src_id=7701

Correct: to fix this, we can specify the output filename using the -O option:


$ wget -O taglist.zip http://www.vim.org/scripts/download_script.php?src_id=7701


Advanced wget Techniques


$ wget -r -l1 -H -t1 -nd -N -np -A.ppt -e robots=off -i ~/download.txt


And here’s what this all means:

-r -H -l1 -np: these options tell wget to download recursively. That means it goes to a URL, downloads the page there, then follows every link it finds. The -H tells the app to span domains, meaning it should follow links that point away from the original site. And the -l1 (a lowercase L with the numeral one) means to go only one level deep; that is, don't follow links on the linked site. In other words, these options work together to ensure that you don't send wget off to download the entire Web, or at least as much as will fit on your hard drive. Rather, it will take each link from your list of URLs and download it. The -np switch stands for "no parent", which instructs wget to never follow a link up to a parent directory.

We don't, however, want all the links, just those that point to files we haven't already seen. Including -A.ppt tells wget to only download files that end with the .ppt extension. And -N turns on timestamping, which means wget won't re-download something with the same name unless it's newer.

To keep things clean, we'll add -nd, which makes the app save everything it finds in one directory, rather than mirroring the directory structure of linked sites. And -e robots=off tells wget to ignore the standard robots.txt files. Normally this would be a terrible idea, since we'd want to honor the wishes of the site owner; however, since we're only grabbing one file per site, we can safely skip these and keep our directory much cleaner. Also, along the lines of good net citizenship, we'll add -w5 to wait 5 seconds between each request so as not to pound the poor servers, as shown in the example below.
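
Putting it together, the command above with the 5-second wait added would look like this; the input file is the same placeholder list of URLs as before:


$ wget -r -l1 -H -t1 -nd -N -np -A.ppt -e robots=off -w5 -i ~/download.txt
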

-r makes it recursive
-l2 makes it 2 levels
-nd is no directories
-Nc only downloads files you have not already downloaded
-A.ppt means all ppt files on page
