
1. Introduction

The Wayback Machine is the web interface of the Internet Archive. The archived sites are "snapshots" collected by the Internet Archive's indexing software. Downloading these website snapshots is usually challenging.

We can retrieve only the static content, partial or total, that the Wayback Machine makes accessible for a given date. For dynamic sites, any logic that existed on the server side is not recoverable.

This tutorial is focused on Bash scripting with the help of wget, perl, find, and possibly other commands. It’s a step-by-step approach that ensures we completely understand what we’re doing.

Any recent Linux distribution is fine for the described commands.

2. wget and Bash Scripting

First, there is no universal method or script that always works for retrieving websites from the Wayback Machine. Archived websites may require targeted actions, and their pages may be incomplete or reference external scripts.

In addition, we may have different recovery needs regarding the directory tree or the file extensions.

That said, we can follow a few basic guidelines:

  1. Save all the website files we want to recover with wget, using options helpful for the Wayback Machine.
  2. Manually check all HTML pages saved by wget.
  3. Remove by regex the code added by the Wayback Machine from all pages.
  4. Analyze HTML code of all the saved pages to check which corrections are appropriate (with particular attention to all the links).
  5. Create and apply all regex necessary to do what we decided in the previous step.
  6. Check the correctness of all internal and external links.
  7. Repeat steps 4, 5, and 6 if necessary.
  8. Make various modifications at our discretion, according to what we will do with the saved website.

We will try to retrieve a website, step by step, whose Internet Archive URL is: https://web.archive.org/web/20210426061752/https://sites.google.com/site/archiviodigiulioripa/. Unfortunately, this not-for-profit, lightweight, and copyright-free (CC BY 4.0 license) website disappeared from the Internet in 2021.

The snapshot URL contains the date (20210426) and the original, no longer existing URL (https://sites.google.com/site/archiviodigiulioripa/). The "advanced URL locator hints and tips" page explains the structure of these URLs.
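We can sketch this URL structure in a few lines of Bash; the variable names here are our own, and the /web/<timestamp>/<original-url> layout is the one just described:

```shell
#!/bin/bash
# Compose a Wayback Machine snapshot URL from its two parts
# (variable names are illustrative, not part of any tool)
timestamp="20210426061752"
original="https://sites.google.com/site/archiviodigiulioripa/"
snapshot="https://web.archive.org/web/${timestamp}/${original}"
echo "$snapshot"
```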

2.1. wget or wget2?

GNU Wget2 is the successor of GNU Wget, a file and recursive website downloader designed and written from scratch. Our tests showed that Wget2 is much faster than Wget at mirroring sites, but unfortunately, with the Internet Archive the results are not as desired. In a nutshell, at the moment, Wget2 is not suitable for retrieving sites from the Internet Archive, so we will use the classic, slower Wget.

With a stable and fast Internet connection, mirroring a lightweight website could take over an hour. In contrast, heavier websites could take tens of hours.

This slowness is due both to the Internet Archive being constantly overloaded and to wget downloading duplicate files multiple times. For example, these eight different links represent the same file, saved by the Internet Archive on various dates: link1, link2, link3, link4, link5, link6, link7, and link8.

In this example, while mirroring, wget will download the same file eight times and save only the most recent, comparing the timestamps provided by the server.
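Since these snapshots differ only in the 14-digit date embedded in the URL, we can extract that date with sed to see at a glance which copy is newest; this is just an illustrative sketch, and the file URL in it is hypothetical:

```shell
# Extract the 14-digit snapshot date that follows /web/ in a Wayback Machine URL
# (the example URL is hypothetical)
url="https://web.archive.org/web/20210426061752/https://example.com/logo.png"
date_part=$(echo "$url" | sed -E 's|.*/web/([0-9]{14}).*|\1|')
echo "$date_part"
```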

Usually, the version of wget that comes with our Linux distribution is fine. In case of bugs, we can compile and use the latest available version. For any technical assistance, the “Issues” page in the Wget GitLab project acts as a bug tracker.

3. Mirroring Internet Archive Websites With wget

Suppose we want to recover the previously chosen website and put it back in operation on an Apache server with a Linux file system. In that case, the most appropriate command is:

$ mkdir archiviodigiulioripa
$ cd archiviodigiulioripa/
$ wget -e robots=off -r -nH -nd --page-requisites --content-disposition --convert-links --adjust-extension \
--user-agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0" \
--timestamping \
--accept-regex='archiviodigiulioripa|ssl\.gstatic\.com|www\.google\.com/images|88x31\.png|bundle-playback\.js|wombat\.js|banner-styles\.css|iconochive\.css|standard-css-ember-ltr-ltr\.css|jot_min_view__it\.js|tree_ltr\.gif|apple-touch-icon\.png|sites-16\.ico|filecabinet\.css|record\.css' \
--reject-regex='accounts\.google\.com|reportAbuse\.html|showPrintDialog|docs\.google\.com|youtube\.com|\.mp4' \
https://web.archive.org/web/20210426061752/https://sites.google.com/site/archiviodigiulioripa/

We saved the complete wget log and the mirror thus obtained for reference.

Complete documentation for wget, more extensive than that provided by the man page, can be found in the GNU Wget Manual. We can distinguish between two types of wget parameters: those valid in most cases and those we must customize every time.

3.1. wget Parameters Valid in Most Cases

Ignore robots.txt since honoring it doesn’t make much sense when downloading a snapshot from the Internet Archive:

-e robots=off

Enable recursive downloading, with a default depth of 5, changeable via --level=depth:

-r

Disable generation of host-prefixed directories. This choice makes sense because we decided to run the recovered website again on Apache, so there’s no need for additional directories:

-nH

For websites without files referenced on external hosts, the original directory structure can be recovered with --cut-dirs=3 instead of -nd, since the Internet Archive prepends three directories that need to be removed.

However, in this case, we do not create a hierarchy of directories when retrieving recursively to keep things simple:

-nd

Cause wget to download all the necessary files to correctly display a given HTML page:

--page-requisites

It's helpful when Content-Disposition headers contain file names, though it may not be necessary with the Internet Archive:

--content-disposition

When it’s strictly necessary to get a working mirror, we can ask wget to convert all links to make them suitable for local viewing:

--convert-links

Add the .html extension in the case of HTML content without an extension, and correct the wrong extensions if possible:

--adjust-extension

Hide wget's identity by pretending to be a recent version of Firefox, circumventing possible blocks:

--user-agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"

Save only the most recent version of each file, ignoring duplicates with an older timestamp:

--timestamping

3.2. wget Site-Dependent Parameters

We need to customize two options for each website to decide what we want to save and what we don’t. Unfortunately, customization can take a lot of time and several attempts before we get what we want, or at least a mirror that comes as close as possible to our expectations:

--accept-regex='archiviodigiulioripa|ssl\.gstatic\.com|www\.google\.com/images|88x31\.png|bundle-playback\.js|wombat\.js|banner-styles\.css|iconochive\.css|standard-css-ember-ltr-ltr\.css|jot_min_view__it\.js|tree_ltr\.gif|apple-touch-icon\.png|sites-16\.ico|filecabinet\.css|record\.css'
--reject-regex='accounts\.google\.com|reportAbuse\.html|showPrintDialog|docs\.google\.com|youtube\.com|\.mp4'

Wrong options risk making wget download only a small part of the website, or making it download too much, following unnecessary external links.

--accept-regex and --reject-regex specify a POSIX regular expression to accept or reject complete URLs.

We followed the method of looking at a few pages of the website and checking the network flow with Firefox developer tools, monitoring the various loaded URLs. We also checked the visible links and made decisions about them.
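Before launching a long wget run, we can dry-run a candidate URL against our patterns with grep -E, whose extended syntax is close enough to the POSIX regexes wget uses for this kind of check. This sketch uses a deliberately shortened accept pattern:

```shell
# Check whether a candidate URL would pass a (shortened) accept regex
# before committing to a full wget mirror run
accept='archiviodigiulioripa|ssl\.gstatic\.com'
url="https://web.archive.org/web/20210426061752/https://sites.google.com/site/archiviodigiulioripa/home"
if echo "$url" | grep -qE "$accept"; then verdict="accepted"; else verdict="rejected"; fi
echo "$verdict"
```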

3.3. Initial Check of What We Downloaded

Opening the index.html file with the Firefox developer tools open and browsing the local mirror, we notice that:

  • It looks good, like the original website, and without the Internet Archive toolbar.
  • It’s slow because some files are still loaded from the Internet Archive instead of their local copies.
  • Some trackers are highlighted in red by the uBlock Origin extension.
  • Almost all content is locally available, except for the links we planned to remove.
  • Having a look at the source code, we notice that all pages contain some unnecessary information added by the Internet Archive after the closing </html> tag.

So, let’s proceed to do some cleanup.

3.4. Code Cleaning

In summary, we need to manually inspect HTML files to figure out which regexes to apply. The two documents “Regular Expressions” and “Perl Command Switches” allow us to look deeper. When working with HTML files, perl is helpful because the -0777 option lets us easily apply regexes to the whole file.

Code cleanup consists of several steps that are generally valid for most websites recovered from the Wayback Machine and still other steps specific to this website.

This Bash script does what it takes to turn the previous mirror downloaded with wget into a clean site ready to be revived and put online again. Some choices are arbitrary, but overall, it indicates a possible path to follow:

#!/bin/bash

cd archiviodigiulioripa

# STEP 1a: removes useless files, which were part of the Google Sites login
rm account*
rm ServiceLogin*
rm *kids*
rm kid*
rm device*
rm phone*
rm security*
rm shield*
rm signin*
rm usb*
rm web_and*
rm who_will*
rm you_tube*
# STEP 1b: removes duplicated and useless files, maybe accidentally added by wget
rm *.gif.html
rm *.png.html
rm *.svg.html

for file in *.html; do
    # STEP 2: removes the HTML code added by the Wayback Machine
    perl -0777 -i -pe 's/<script.*Rewrite JS Include -->//igs' "$file"
    perl -0777 -i -pe 's/<\/html>.*/<\/html>/igs' "$file"

    # STEP 3: converts all absolute URLs pointing to the Wayback Machine to local URLs relative to the current path
    perl -0777 -i -pe 's/https:\/\/web\.archive\.org\/[^\s"]*\///igs' "$file"

    # STEP 4: deletes all queries from file names (and makes the suffix consistent for jpeg images)
    perl -0777 -i -pe 's/\.gif?[^\s"<]*/\.gif/igs' "$file"
    perl -0777 -i -pe 's/\.png?[^\s"<]*/\.png/igs' "$file"
    perl -0777 -i -pe 's/\.jpg?[^\s"<]*/\.jpg/igs' "$file"
    perl -0777 -i -pe 's/\.jpeg?[^\s"<]*/\.jpg/igs' "$file"
    perl -0777 -i -pe 's/\.css?[^\s"<]*/\.css/igs' "$file"

    # STEP 5: removes the nofollow signal to search engines added by the Wayback Machine
    perl -0777 -i -pe 's/rel="nofollow"//igs' "$file"

    # STEP 6: removes the Google Sites login footer
    perl -i -pe 's/<div.*sites-adminfooter.*<\/div>//igs' "$file"

    # STEP 7: removes invalid links and fixes link to the Creative Commons license
    perl -i -pe 's/<a href=.*>Visualizza<\/a>//igs' "$file"
    perl -i -pe 's/<a href="viewer.*">/<a id="invalidLink">/igs' "$file"
    perl -i -pe 's/<a href="deed\.it"/<a href="https:\/\/creativecommons\.org\/licenses\/by\/4.0\/deed.it"/igs' "$file"
    perl -i -pe 's/<a href="goog_60630410">/<a id="invalidLink">/igs' "$file"

    echo "$file code cleaned"
done

# STEP 8: renames all files with queries (compare with STEP 4)
for file in {*.gif,*.png,*.jpg,*.jpeg,*.css}\?*; do
    withoutQueries=$(echo "$file" | cut -d? -f1)
    mv "$file" "$withoutQueries"
    echo "Removed query strings from $withoutQueries"
done

# STEP 9: makes the file extension consistent for jpeg images (compare with STEP 4)
find . -type f -name '*.jpeg' -print0 | xargs -0 rename 's/\.jpeg$/\.jpg/'

The final result is downloadable here.

3.5. Deploying on a Web Server and Final Check

Finally, we put the site thus recovered at the address: https://archiviodigiulioripa.sytes.net/.

It looks as expected. To check the links, we used the W3C Link Checker. It confirmed that all links are valid.
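Alongside an external checker, a quick grep over the mirror can confirm that no link still points at web.archive.org; here is a minimal sketch on a made-up one-page mirror:

```shell
# Build a tiny fake mirror, then verify no archive links remain in it
# (the /tmp/mirror-check path and sample page are illustrative)
mkdir -p /tmp/mirror-check
printf '<a href="video-prodotti.html">ok</a>\n' > /tmp/mirror-check/index.html
if grep -rq 'web\.archive\.org' /tmp/mirror-check; then
    echo "archive links remain"
else
    echo "clean"
fi
```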

As for the external links in the video-prodotti.html file, we manually restored them one by one. For minor, file-specific fixes like this, it's easier and faster to edit the HTML manually than to spend a lot of time crafting the right regexes.

4. Conclusion

This article has shown how to download a static copy of a website archived in the Wayback Machine using wget, followed by code cleaning with Bash scripting.

Manual steps give us awareness and control of what we’re doing.

In general, the recovery of a website is a reasonably lengthy operation. Therefore, we must evaluate it on a case-by-case basis.
