How To Download A Website With Wget The Right Way
Overview
To download an entire website from Linux it is often recommended to
use wget. However, it must be done with the right parameters or the
downloaded website won’t be similar to the original one, probably
having broken relative links. This tutorial explores the right
combination of options to download a website:
- converting relative links to full paths so they can be browsed offline.
- avoiding requesting too many web pages too fast, which would overload the server and could get you blocked from requesting more.
- avoiding overwriting or duplicating already downloaded files.
Alternatives
The download can be made using a recursive traversal approach or by visiting each URL of the website’s sitemap.
1. Recursive traversal
For this we use the well-known wget command.
GNU Wget is a free utility for non-interactive download of files from the Web
Needed wget parameters
The wget
command is very popular in Linux and present in most
distributions.
To download an entire website we use the following Wget download options:
--wait=2
Wait the specified number of seconds between retrievals; in this case, 2 seconds.
--limit-rate=20K
Limit the download speed; in this case, to 20 kilobytes per second.
--recursive
Turn on recursive retrieving. The default maximum depth is 5. If the website has more than 5 levels, you can specify the depth with --level=depth.
--page-requisites
Download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
--user-agent=Mozilla
Identify as Mozilla to the HTTP server.
--no-parent
Do not ever ascend to the parent directory when retrieving recursively.
--convert-links
After the download is complete, convert the links in the documents to make them suitable for local viewing.
--adjust-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename.
--no-clobber
When running Wget with -r, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
-e robots=off
Turn off the robot exclusion, i.e. ignore the site’s robots.txt.
--level=depth
Specify recursion maximum depth. Use inf as the value for infinite depth.
Summary
Summarizing, these are the needed parameters:
wget --wait=2 \
--level=inf \
--limit-rate=20K \
--recursive \
--page-requisites \
--user-agent=Mozilla \
--no-parent \
--convert-links \
--adjust-extension \
--no-clobber \
-e robots=off \
https://example.com
Or in one line:
wget --wait=2 --level=inf --limit-rate=20K --recursive --page-requisites --user-agent=Mozilla --no-parent --convert-links --adjust-extension --no-clobber -e robots=off https://example.com
Example
Let’s try to download the https://example.com website (a single page)
to see how verbose wget is and how it behaves.
$ wget --wait=2 --limit-rate=20K --recursive --page-requisites --user-agent=Mozilla --no-parent --convert-links --adjust-extension --no-clobber https://example.com
--2017-06-30 19:48:46--  https://example.com/
Resolving example.com (example.com)... 93.184.216.34
Connecting to example.com (example.com)|93.184.216.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1,2K) [text/html]
Saving to: ‘example.com/index.html’

example.com/index.html 100%[===========================================================>]   1,24K  --.-KB/s    in 0,003s

2017-06-30 19:48:46 (371 KB/s) - ‘example.com/index.html’ saved [1270/1270]

FINISHED --2017-06-30 19:48:46--
Total wall clock time: 0,6s
Downloaded: 1 files, 1,2K in 0,003s (371 KB/s)
Converting links in example.com/index.html... nothing to do.
Converted links in 1 files in 0 seconds.

$ tree example.com/
example.com/
└── index.html

0 directories, 1 file
Wget mirror
Wget already comes with a handy --mirror parameter that is the same
as using -r -l inf -N (see the sketch after this list). That is:
- recursive download
- with infinite depth
- turn on time-stamping.
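For instance, a minimal sketch of a mirror run that keeps the politeness and offline-browsing options from above (https://example.com is just a placeholder). Since --mirror turns on time-stamping (-N), which cannot be combined with --no-clobber, that flag is left out here:
wget --mirror \
--wait=2 \
--limit-rate=20K \
--page-requisites \
--user-agent=Mozilla \
--no-parent \
--convert-links \
--adjust-extension \
-e robots=off \
https://example.com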
2. Using the website’s sitemap
Another approach is to avoid doing a recursive traversal of the
website and instead download all the URLs present in the website’s
sitemap.xml.
Filtering URLs from the sitemap
A sitemap file typically has the form:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://marcanuy.com/en/projects/conversions</loc>
<lastmod>2014-09-15T00:00:00-03:00</lastmod>
</url>
<url>
<loc>https://marcanuy.com/en/projects/games-for-kids</loc>
<lastmod>2014-09-15T00:00:00-03:00</lastmod>
</url>
</urlset>
We need to get all the URLs present in sitemap.xml, using grep:
grep "<loc>" sitemap.xml
Removing loc tags
Now to remove the superfluous tags:
sed -e 's/<[^>]*>//g'
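With the sample sitemap above, combining both commands in a pipe:
grep "<loc>" sitemap.xml | sed -e 's/<[^>]*>//g'
prints one URL per line:
https://marcanuy.com/en/projects/conversions
https://marcanuy.com/en/projects/games-for-kids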
Putting it all together
After the previous two commands we have a list of URLs, and that is
the parameter read by wget -i:
wget -i <(grep "<loc>" sitemap.xml | sed -e 's/<[^>]*>//g')
And wget will start downloading them sequentially.
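As with the recursive approach, it is probably wise to keep the politeness and offline-browsing options when feeding the sitemap URLs to wget; a minimal sketch, assuming the sitemap was saved locally as sitemap.xml:
wget --wait=2 \
--limit-rate=20K \
--page-requisites \
--user-agent=Mozilla \
--convert-links \
--adjust-extension \
-e robots=off \
-i <(grep "<loc>" sitemap.xml | sed -e 's/<[^>]*>//g')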
Conclusion
wget is a fantastic command line tool; it has everything you will
ever need without having to use any other GUI tool, just be sure to
browse its manual for the right parameters for your use case.
The above combination of parameters will give you a browsable copy of the website locally.
You should be careful to check that .html extensions work for your
case: sometimes you want wget to generate them based on the
Content-Type, but sometimes you should avoid wget generating them,
as is the case when using pretty URLs.
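For instance, for a site with pretty URLs, a sketch of the same command simply dropping --adjust-extension so the saved files keep the original paths (https://example.com is again a placeholder):
wget --wait=2 --level=inf --limit-rate=20K --recursive --page-requisites --user-agent=Mozilla --no-parent --convert-links --no-clobber -e robots=off https://example.com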
Reference
wget manual: https://www.gnu.org/software/wget/