Ultimate Command Line Downloader (WGET)
There are a few powerful graphical download managers available for Linux and UNIX-like operating systems:
d4x: Downloader for X is a user-friendly Linux/Unix program with a nice X interface for downloading files from the Internet. It supports both the FTP and HTTP protocols, and supports resuming.
kget: KGet is a versatile and user-friendly download manager for the KDE desktop.
gwget / gwget2: Gwget is a download manager for the GNOME desktop.
However, when it comes to the command line, wget, the non-interactive downloader, rules. It supports the HTTP, FTP, and HTTPS protocols along with authentication, and tons of other options. Here are some tips to get the most out of it.
Downloading Single and Multiple Files with wget
How Do I Read URLs From a File?
Put a list of URLs in a text file, one URL per line.
Type the wget command as follows:
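For example (a minimal sketch; urls.txt and the URLs in it are placeholders):

```shell
# Put the URLs in a text file, one per line (urls.txt is a name chosen here)
cat > urls.txt <<'EOF'
http://www.example.com/
http://www.example.org/
EOF

# -i (--input-file) tells wget to read its download list from that file
wget -i urls.txt
```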
Continue an Incomplete Download Using wget -c
You can also force wget to get a partially-downloaded file. This is useful when you want to finish up a download started by a previous instance of wget, or by another program:
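For example (the URL is a placeholder):

```shell
# -c (--continue) resumes a partial download instead of starting over
wget -c http://www.example.com/index.html
```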
Please note that the -c option only works with FTP / HTTP servers that support the “Range” header.
How Do I Limit the Download Speed?
You can limit the download speed to a given number of bytes per second. The amount may be expressed in bytes, kilobytes with the k suffix, or megabytes with the m suffix. For example, --limit-rate=100k will limit the retrieval rate to 100KB/s. This is useful when, for whatever reason, you don’t want wget to consume the entire available bandwidth, for example when downloading a large file such as an ISO image:
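A sketch, with example.com standing in for the real download mirror:

```shell
# Cap the transfer rate at 100 KB/s; in practice you would point this
# at a large file such as a distribution ISO
wget --limit-rate=100k http://www.example.com/
```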
Use the m suffix for megabytes (--limit-rate=1m). It is also possible to specify a disk quota for automatic retrievals, to avoid exhausting the disk. The following command will abort once the quota of 100MB is exceeded:
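A sketch of the quota option (the URL list is a placeholder; note that -Q only applies to recursive retrievals and URL lists, not to a single explicitly named file):

```shell
# Build a placeholder URL list for the quota to act on
printf 'http://www.example.com/\n' > quota-urls.txt

# -Q (--quota) aborts the run once roughly 100 MB have been retrieved
wget -Q100m -i quota-urls.txt
```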
What to do if only certain file types are needed?
Use the -A option.
To download only pdf and jpg files, use:
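For example (example.com is a stand-in for the real site):

```shell
# -A (--accept) keeps only files matching the listed suffixes;
# other pages are fetched for link extraction and then deleted
wget -r -A pdf,jpg http://www.example.com/
```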
Now suppose you need to follow external links; by default wget does not do this, but the -H option enables it.
This is a little dangerous, as it could end up downloading many more files than the ones needed, so we can limit the sites to follow with the -D option.
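A sketch combining the two options (the domains and URL are placeholders):

```shell
# -H spans hosts; -D (--domains) whitelists which hosts may be followed
wget -r -l1 -H -Dexample.com,example.org http://www.example.com/
```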
By default wget will follow links 5 levels deep when using the -r option; we can change this behaviour with the -l option.
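For example (the URL is a placeholder):

```shell
# -l (--level) limits recursion depth to 2 instead of the default 5
wget -r -l2 http://www.example.com/
```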
Some sites refuse requests from wget’s default user agent. Wget has a very handy -U option for sites like this. Use -U My-browser to tell the site you are using a commonly accepted browser:
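For example (the user-agent string and URL are illustrative):

```shell
# -U (--user-agent) replaces wget's default identification string
wget -U "Mozilla/5.0 (X11; Linux x86_64)" http://www.example.com/
```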
The most important command line options are --limit-rate= and --wait=. Add --wait=20 to pause 20 seconds between retrievals; this helps make sure you are not added to a blacklist. --limit-rate defaults to bytes; add the k suffix to set KB/s. Example:
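A polite recursive fetch might look like this (the URL is a placeholder):

```shell
# Wait 20 seconds between retrievals and cap the rate at 50 KB/s
wget --wait=20 --limit-rate=50k -r -l1 http://www.example.com/
```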
A website owner will probably get upset if you attempt to download their entire site using a simple wget http://foo.bar command. However, the owner will probably not even notice you if you limit the download transfer rate and pause between fetching files.
--no-parent is a very handy option that guarantees wget will not download anything above the folder you want to acquire. Use this to make sure wget does not fetch more than it needs if you just want to download the files in one folder.
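For example (the URL is a placeholder; in practice point it at the subfolder you want):

```shell
# --no-parent stops wget from ascending above the starting directory
wget -r --no-parent http://www.example.com/
```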
Download and Store With a Different Filename Using wget -O
By default wget picks the filename from the last word after the last forward slash, which may not always be appropriate.
Wrong: the following example will download the file and store it with the name download_script.php?src_id=7701.
Even though the downloaded file is in zip format, it will be stored under that misleading name.
Correct: to fix this, we can specify the output filename using the -O option:
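For example (the URL and output name are placeholders; in the original example the payload is a zip archive):

```shell
# -O (--output-document) stores the download under the name you choose
wget -O file.zip "http://www.example.com/?src_id=7701"
```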
Advanced wget Techniques
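The paragraphs below unpack a single command; assembled from the options they describe, it looks roughly like this (blogs.txt and its contents are placeholders for your own list of starting URLs):

```shell
# Placeholder list of starting URLs, one per line
printf 'http://www.example.com/\n' > blogs.txt

# Recursive, host-spanning, one level deep, no parent dirs, .ppt only,
# timestamped, flat directory, ignoring robots.txt, 5s between requests
wget -r -H -l1 -np -A.ppt -N -nd -e robots=off -w5 -i blogs.txt
```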
And here’s what this all means:
-r -H -l1 -np These options tell wget to download recursively. That means it goes to a URL, downloads the page there, then follows every link it finds. The -H tells the app to span domains, meaning it should follow links that point away from the blog. And the -l1 (a lowercase L with a numeral one) means to only go one level deep; that is, don’t follow links on the linked site. In other words, these options work together to ensure that you don’t send wget off to download the entire Web, or at least as much as will fit on your hard drive. Rather, it will take each link from your list of blogs, and download it. The -np switch stands for “no parent”, which instructs wget to never follow a link up to a parent directory.
We don’t, however, want all the links, just those that point to files we haven’t yet seen. Including -A.ppt tells wget to only download files that end with the .ppt extension. And -N turns on timestamping, which means wget won’t download something with the same name unless it’s newer.
To keep things clean, we’ll add -nd, which makes the app save everything it finds in one directory, rather than mirroring the directory structure of linked sites. And -e robots=off tells wget to ignore the standard robots.txt files. Normally, this would be a terrible idea, since we’d want to honor the wishes of the site owner. However, since we’re only grabbing one file per site, we can safely skip these and keep our directory much cleaner. Also, along the lines of good net citizenship, we’ll add -w5 to wait 5 seconds between each request, so as not to pound the poor blogs.
-r makes it recursive
-l2 makes it 2 levels
-nd is no directories
-Nc only downloads files you have not already downloaded
-A.ppt means only .ppt files on the page