1. Overview

When working in a shell environment, there are times when we need to verify whether a URL is valid before we proceed to use it for other operations.

In this tutorial, we’ll learn some simple and effective ways to check if a URL exists, directly from our shell.

2. Core URL Verification Methods

Within the shell, we have two primary tools at our disposal to check URL existence. These tools are curl and wget.

2.1. Using curl

curl is a command line tool that we can use to transfer data to or from servers using various protocols (including HTTP and HTTPS). Among its many features, we can use curl to check if a URL points to an actual, accessible resource.

Now, let’s explore a few different ways to use curl.

First, let’s see a simple script to check if a URL exists using curl:

#!/bin/bash

if curl --head --silent --fail http://www.baeldung.com/ > /dev/null 2>&1; then
    echo "URL exists"
else
    echo "URL doesn't exist or isn't reachable"
fi

Let’s break down the key part of this script:

  • curl: is the command line tool we’re using to interact with the website
  • --head: tells curl to fetch only the “header” information from the website, rather than downloading the entire webpage
  • --silent: keeps things tidy by hiding curl’s usual progress and status output
  • --fail: makes curl exit with a non-zero code (22) when the server answers with an HTTP status of 400 or above; without it, curl exits with 0 even for a 404 response
  • > /dev/null 2>&1: sends both standard output (stdout) and standard error (stderr) to /dev/null, effectively discarding any output curl might produce

In addition, the if statement checks the exit code of the curl command. If it’s 0 (success), the server responded without an HTTP error, so the URL exists and is reachable. But if the exit code is non-zero (error), the URL may not exist or there may be a connection issue.
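To see why a check failed, we can translate curl’s exit code into a message. The following is a minimal sketch rather than a complete mapping: describe_curl_exit is a hypothetical helper name of our own, it covers only a handful of curl’s documented exit codes, and code 22 only shows up when curl runs with --fail.

```shell
#!/bin/bash

# describe_curl_exit is a hypothetical helper (not part of curl).
# It translates a few of curl's documented exit codes into messages.
describe_curl_exit() {
    case "$1" in
        0)  echo "URL exists" ;;
        6)  echo "Couldn't resolve host" ;;
        7)  echo "Failed to connect to host" ;;
        22) echo "Server returned an HTTP error (400 or above)" ;;
        28) echo "Operation timed out" ;;
        *)  echo "curl failed with exit code $1" ;;
    esac
}

# Assumed usage (requires network access):
#   curl --head --silent --fail --output /dev/null "http://www.baeldung.com/"
#   describe_curl_exit $?
```

Keeping the mapping in a function lets us reuse it after any curl call in the script.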

Next, let’s see how we can store the HTTP response code in a variable for further error handling:

#!/bin/bash

result=$(curl --head --silent --write-out "%{http_code}" --output /dev/null https://www.google.com/)
if [[ $result -eq 200 ]]; then
    echo "URL exists"
else
    echo "URL doesn't exist or is not reachable"
fi

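In the code block above, the key addition is --write-out "%{http_code}". This tells curl to print the HTTP status code of the response once the transfer completes. Since --output /dev/null discards the headers themselves, the status code is the only output left, and we capture it in the result variable.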

In addition, the if statement checks the value of the result variable. If it’s 200, it means the URL exists. However, if result is not 200, there might be a problem or the URL may not exist.
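Comparing against 200 alone treats other successful responses, such as 204 (No Content), as failures. As a hedged variation, the sketch below accepts any 2xx code. Both is_success_code and url_exists are hypothetical helper names introduced here for illustration, and whether every 2xx code should count as "exists" is an assumption that depends on the use case.

```shell
#!/bin/bash

# is_success_code (hypothetical helper): succeeds for any three-digit 2xx code.
is_success_code() {
    case "$1" in
        2[0-9][0-9]) return 0 ;;
        *)           return 1 ;;
    esac
}

# url_exists (hypothetical helper): wraps the curl call from the script above
# and reports success only when the status code is in the 2xx range.
url_exists() {
    local code
    code=$(curl --head --silent --write-out "%{http_code}" --output /dev/null "$1")
    is_success_code "$code"
}

# Assumed usage (requires network access):
#   if url_exists "https://www.google.com/"; then echo "URL exists"; fi
```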

2.2. Using wget

wget is a tool that we can use to download files from the web. It also provides a convenient way to verify URL existence.

Now, let’s see how wget works:

#!/bin/bash

if wget --spider https://www.facebook.com/ > /dev/null 2>&1; then
    echo "URL exists"
else
    echo "URL does not exist or is not reachable"
fi

In the above script:

  • wget: tries to download files from the website
  • --spider: tells wget to check if a file or resource exists, without actually downloading it
  • > /dev/null 2>&1: redirects all output to /dev/null to discard it

As with curl, wget uses exit codes to communicate the results of an operation. Generally, a 0 exit code means success (the URL exists), while other codes indicate an error condition or non-existence of the URL.
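The non-zero codes are more specific than a plain failure. As a rough sketch, the hypothetical helper describe_wget_exit below (the name is ours, not part of wget) maps the exit statuses listed in the GNU Wget manual to short messages.

```shell
#!/bin/bash

# describe_wget_exit is a hypothetical helper (not part of wget).
# The meanings follow the "Exit Status" section of the GNU Wget manual.
describe_wget_exit() {
    case "$1" in
        0) echo "URL exists" ;;
        1) echo "Generic error" ;;
        2) echo "Parse error (bad options or configuration)" ;;
        3) echo "File I/O error" ;;
        4) echo "Network failure" ;;
        5) echo "SSL verification failure" ;;
        6) echo "Authentication failure" ;;
        7) echo "Protocol error" ;;
        8) echo "Server issued an error response" ;;
        *) echo "Unknown wget exit code: $1" ;;
    esac
}

# Assumed usage (requires network access):
#   wget --spider https://www.facebook.com/ > /dev/null 2>&1
#   describe_wget_exit $?
```

For a missing page, we’d typically expect code 8, while DNS or connection problems show up as code 4.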

Also, we can completely mute the output from wget with a built-in option instead of a redirection, resulting in a cleaner script. Let’s find out how this option works:

#!/bin/bash

if wget --spider -q https://www.google.com; then
    echo "URL exists"
else
    echo "URL does not exist or is not reachable"
fi

In the above example, -q instructs wget to run in quiet mode without printing any output to the console. The rest of the script works the same way as the last example.

Therefore, while wget is primarily for downloading files, its --spider mode is a quick and easy way to confirm if a website or resource exists.
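Since not every system ships with both tools, a script can pick whichever one is installed. The sketch below is one possible way to do that: url_exists_any is a hypothetical function name, the return value 127 for "no tool found" is our own convention, and the curl branch assumes the --fail option so that HTTP errors produce a non-zero exit code.

```shell
#!/bin/bash

# url_exists_any (hypothetical helper): checks a URL with curl if it is
# available, falls back to wget, and returns 127 if neither is installed.
url_exists_any() {
    if command -v curl > /dev/null 2>&1; then
        curl --head --silent --fail --output /dev/null "$1"
    elif command -v wget > /dev/null 2>&1; then
        wget --spider -q "$1"
    else
        echo "Neither curl nor wget is installed" >&2
        return 127
    fi
}

# Assumed usage (requires network access):
#   if url_exists_any "https://www.google.com"; then echo "URL exists"; fi
```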

3. Advanced Considerations

Simply knowing whether a URL exists is great. But sometimes, we may need our script to act differently depending on the reason a URL check fails. We may also want our scripts to handle situations where websites are simply slow to respond.

Now, let’s learn about ways to make our scripts more adaptable to these situations.

3.1. Handling Common HTTP Status Codes in Script Responses

Web servers report the outcome of each request using HTTP status codes. So, by understanding these codes, we can tweak our scripts to make informed decisions about how to proceed.

Let’s learn about a few commonly encountered HTTP status codes:

  • 200 (OK): is the success code! It means the page or resource we requested exists and the server sent it back correctly
  • 404 (Not found): indicates that the specific webpage or resource we requested doesn’t exist on the server. The source of the error could be a typo in the URL, or the page may have been removed
  • 403 (Forbidden): means the server understood the request but refuses to fulfill it, typically because we lack the permissions needed to access the resource
  • 301 (Moved permanently): signals that the resource we requested has been relocated to a new address. We need to update our script to use the new URL

In addition to these standard codes, we might encounter a 000 status code. This code isn’t part of the HTTP standard; tools like curl use it to indicate that no HTTP response was received.

A 000 status code can occur when there’s a network timeout, a DNS issue, or a connection drop before the server could respond.

For example, let’s create a script that handles some common HTTP codes and the 000 code:

#!/bin/bash

url="https://www.google.com"

status_code=$(curl --head --silent --output /dev/null --write-out '%{http_code}' "$url")

case $status_code in
    200)
        echo "URL exists"
        ;;
    404)
        echo "Error 404: Not found."
        ;;
    403)
        echo "Error 403: Forbidden."
        ;;
    301)
        echo "Redirect 301: Moved permanently."
        ;;
    000)
        echo "No response received."
        ;;
    *)
        echo "Unexpected status code: $status_code. Further troubleshooting needed"
        ;;
esac

Now, let’s consider the above code snippet:

  • url: is a variable that stores the URL we want to check
  • status_code: uses curl to get the HTTP status code of the URL we want to check
  • case $status_code in: is a case statement to check the value of status_code:
    • 200): means the URL exists
    • 404): is an indication that curl could communicate with the server, but the server couldn’t find what was requested
    • 403): says we don’t have permission to access the URL
    • 301): suggests that the page has been moved to a new location permanently
    • 000): indicates that no response was received
    • *): is a “catch-all” for any other codes, indicating that we may have to do additional troubleshooting
  • esac: is a keyword indicating the end of the case statement

Furthermore, by handling HTTP errors correctly, we can turn simple URL existence checkers into smart scripts that act differently depending on the situation.
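Listing every code individually doesn’t scale well, so a script can also group codes by class using case patterns. This is a minimal sketch under our own assumptions: classify_status is a hypothetical function name, and the labels it prints are illustrative rather than standard terminology from curl.

```shell
#!/bin/bash

# classify_status (hypothetical helper): maps a three-digit status code,
# as printed by curl's %{http_code}, to a coarse category.
classify_status() {
    case "$1" in
        000)         echo "no response" ;;
        2[0-9][0-9]) echo "success" ;;
        3[0-9][0-9]) echo "redirect" ;;
        4[0-9][0-9]) echo "client error" ;;
        5[0-9][0-9]) echo "server error" ;;
        *)           echo "unexpected" ;;
    esac
}

# Assumed usage (requires network access):
#   status_code=$(curl --head --silent --output /dev/null --write-out '%{http_code}' "https://www.google.com")
#   classify_status "$status_code"
```

With this grouping, codes that the original case statement sends to the catch-all branch, such as 500 or 503, are reported as server errors.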

3.2. Timeout Settings

When working with URLs, we may come across slow or unresponsive websites. However, by setting timeouts in our scripts, we can ensure the scripts don’t hang indefinitely.

Let’s see a simple example:

#!/bin/bash

url="https://www.google.com"

response=$(curl --connect-timeout 10 --max-time 15 --silent --head --write-out "%{http_code}" --output /dev/null "$url")
echo "HTTP status code: $response"

In the above example:

  • --connect-timeout 10: tells curl to spend at most 10 seconds establishing the connection. If it can’t connect within that time, curl gives up and exits with code 28 (operation timed out), and the captured HTTP status code is 000
  • --max-time 15: sets a maximum of 15 seconds for the whole request (including connecting and getting a response)

Hence, timeouts keep our scripts from getting stuck on slow or unresponsive websites. By using curl’s --connect-timeout and --max-time options, we put an upper bound on how long each check can take.
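To tell a timeout apart from other failures, we can inspect curl’s exit code alongside the HTTP status code. The following sketch assumes the same timeout values as the example above; check_url_with_timeout is a hypothetical function name, and the messages it prints are our own.

```shell
#!/bin/bash

# check_url_with_timeout (hypothetical helper): runs the timed curl check
# and reports a timeout (curl exit code 28) separately from other results.
check_url_with_timeout() {
    local url="$1" code rc
    code=$(curl --connect-timeout 10 --max-time 15 --silent --head \
        --write-out "%{http_code}" --output /dev/null "$url")
    rc=$?
    if [ "$rc" -eq 28 ]; then
        echo "Timed out while checking $url"
    else
        echo "HTTP status code: $code"
    fi
    # Pass curl's exit code on to the caller.
    return "$rc"
}

# Assumed usage (requires network access):
#   check_url_with_timeout "https://www.google.com"
```

Returning curl’s exit code lets the calling script decide whether to retry, skip the URL, or abort.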

4. Conclusion

In this article, we explored how to verify if URLs exist from the shell environment. Using tools like curl and wget, we created scripts that check website availability.

Also, we learned about HTTP status codes, which enable us to build even smarter scripts that respond differently to conditions like missing pages (404) or access restrictions (403).

Finally, by setting timeouts, we ensured our script didn’t freeze up when websites were slow or unresponsive.

These skills are essential for automating tasks, building monitoring tools, or creating more efficient scripts to interact with web resources.
