1. Overview

Often, we need to perform text processing in a Linux shell environment. One such task is extracting a complete URL from a string, an operation that underpins applications such as link validation, Web scraping, and data analysis.

In this tutorial, we’ll explore different methods to extract a complete URL from a string.

2. Problem Statement

Unlike structured data where we neatly define URLs, text data such as strings often contain URLs in diverse forms and at different locations. Considering this, our task is to find effective methods for identifying and extracting a complete URL from a string.

To demonstrate further, let’s use a sample string:

Welcome to our website https://www.baeldung.com or visit https://example.com

The above string contains two URLs, but we might want to extract only the first one, https://www.baeldung.com.

Now, let’s look at another example:

Check out these URLs: https://www.baeldung.com?q=1, https://example.com/path,
  https://sub.example.org/page1?, http://www.example.net/page2?q=2,
  http://sub.example.org/page3

In this case, the string contains several URLs, which we want to extract and display on separate lines. Critically, the character that terminates a URL varies according to the provided string: it can be a space, a comma, or the end of a line.

Notably, our examples only involve HTTP and HTTPS URLs which we’ll extract via several commands.

3. Using the grep Command

grep is primarily used to search for a given pattern within data streams or text files. We can also use it to extract a complete URL from a string in the shell.

3.1. Extract a URL With grep

Now, let’s extract a URL with grep. Let’s consider an example:

$ echo Welcome to our website https://www.baeldung.com or visit https://example.com |
  grep -o 'http[s]\?://[^ ]\+' | head -1
https://www.baeldung.com

In the above script:

  • pipe (|) operator redirects the output of echo to grep
  • grep extracts the URL from a specified string using the http[s]\?://[^ ]\+ pattern
  • with grep, the -o option only displays the matched parts of a string

Furthermore, http[s]\?://[^ ]\+ represents a regex pattern to recognize URLs:

  • http[s]\? matches the HTTP or HTTPS protocol and \? makes the s character optional
  • :// matches the colon and the double slashes that are common in URLs
  • [^ ] matches any character that isn’t a space, the chosen terminator in this case
  • \+ matches one or more (of the non-space) characters

At the end, the head command limits the output to the first URL that grep finds and displays it on the terminal.
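As a side note, the \? and \+ operators are GNU extensions to basic regular expressions, so the pattern above may not behave the same with other grep implementations. A minimal sketch of the same extraction with extended regular expressions (-E) follows. The first_url function name is hypothetical and only wraps the pipeline for reuse, and we assume a grep that supports the -o option:

```shell
# Hypothetical helper (not part of the original commands):
# print the first http(s) URL found in the arguments, using a space as terminator.
first_url() {
  # -o prints only the matched text, -E enables extended regex (?, + without backslashes)
  printf '%s\n' "$*" | grep -oE 'https?://[^ ]+' | head -n 1
}

first_url "Welcome to our website https://www.baeldung.com or visit https://example.com"
# prints: https://www.baeldung.com
```

Here, https? plays the same role as http[s]\? in the original pattern, so the result should be equivalent for our sample string.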

3.2. Extract Different URLs With grep

In this particular case, we’ll use grep to extract different URLs from a string and separate them on different lines:

$ echo "Check out these URLs: https://www.baeldung.com?q=1, https://example.com/path,
  https://sub.example.org/page1?, http://www.example.net/page2?q=2,
  http://sub.example.org/page3" |
  grep -o -E 'https?://[^,]+' | tr ',' '\n'
https://www.baeldung.com?q=1
https://example.com/path
https://sub.example.org/page1?
http://www.example.net/page2?q=2
http://sub.example.org/page3

Here, the -E option with grep enables extended regular expressions for pattern matching. Notably, the terminator of each URL here is either a comma or the end of the line, since grep processes its input line by line. In case of a different terminator, we should adjust the [^,] construct accordingly.

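The pipe (|) operator redirects grep output to tr. Because grep -o already prints every match on its own line, and the [^,]+ construct stops before a comma, the matches contain no commas. Hence, the tr command, which replaces any comma with a \n newline character, acts only as a safeguard here. As a result, we can see the extracted URLs on separate lines.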
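To illustrate adjusting the terminator, here is a sketch that treats any whitespace, a comma, or a closing parenthesis as the end of a URL. The all_urls function name and the sample text are assumptions for illustration only:

```shell
# Hypothetical helper: print every http(s) URL found in the arguments, one per line.
# A URL ends at whitespace, a comma, or a closing parenthesis.
all_urls() {
  printf '%s\n' "$*" | grep -oE 'https?://[^[:space:],)]+'
}

all_urls "See (https://example.com/path), https://sub.example.org/page1? and http://www.example.net/page2?q=2"
# prints:
# https://example.com/path
# https://sub.example.org/page1?
# http://www.example.net/page2?q=2
```

Since each terminator is excluded from the match itself, no post-processing step is needed in this variant.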

4. Using the awk Command

In addition to grep, we can also use awk to define patterns and operations and extract information from text data.

4.1. Extract a URL With awk

In this scenario, we’ll use awk for extracting a complete URL from a string:

$ echo Welcome to our website https://www.baeldung.com or visit https://example.com |
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^http[s]?:\/\/[^\ ]+$/) {
        print $i
        break
      }
    }
  }'
https://www.baeldung.com

In the script above, the awk command processes the sample string word by word with the help of a for loop. Then, an if statement checks whether the current word $i matches the URL pattern. Lastly, the print statement shows the first matched URL from the string.
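The word-by-word check relies on the URL being a whole field of its own. As an alternative sketch, not the approach used above, we can let the match() function of awk locate the URL anywhere in the line and cut it out with substr(). The first_url_awk name and the sample text, in which the URL directly follows a colon, are assumptions:

```shell
# Hypothetical helper: print the first http(s) URL in the arguments,
# even when it is glued to preceding characters such as "website:".
first_url_awk() {
  printf '%s\n' "$*" | awk '
    # match() sets RSTART and RLENGTH to the position and length of the leftmost match
    match($0, /https?:\/\/[^ ]+/) {
      print substr($0, RSTART, RLENGTH)
      exit
    }'
}

first_url_awk "Welcome to our website:https://www.baeldung.com or visit https://example.com"
# prints: https://www.baeldung.com
```

With the field-based loop, the word website:https://www.baeldung.com wouldn’t pass the ^ anchor, whereas match() finds the URL inside it.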

Same as before, the space character is the terminator, because awk splits each line into fields on whitespace by default.

4.2. Extract Different URLs With awk

Additionally, the awk command can also extract different URLs from a string:

$ echo "Check out these URLs: https://www.baeldung.com?q=1, https://example.com/path,
  https://sub.example.org/page1?, http://www.example.net/page2?q=2,
  http://sub.example.org/page3" |
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^https?:\/\/[^,]+/) {
        gsub(/,$/, "", $i);
        print $i
      }
    }
  }'
https://www.baeldung.com?q=1
https://example.com/path
https://sub.example.org/page1?
http://www.example.net/page2?q=2
http://sub.example.org/page3

Here, the regex pattern /^https?:\/\/[^,]+/ matches fields that start with http:// or https:// followed by one or more non-comma characters. Since awk splits the line on whitespace, a field may still carry a trailing comma, so the gsub() function removes it from each matched field.

As a result, the print statement displays each extracted URL on a separate line.
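The field-based loop assumes that whitespace separates the URLs. When they’re separated by commas alone, a possible alternative is to call match() repeatedly and consume the line piece by piece. The following is only a sketch under that assumption, and the all_urls_awk name and the sample input are hypothetical:

```shell
# Hypothetical helper: print every http(s) URL in the arguments, one per line.
# A URL ends at a comma or a space, so no whitespace between URLs is required.
all_urls_awk() {
  printf '%s\n' "$*" | awk '{
    rest = $0
    # Keep extracting the leftmost URL and continue with the text after it
    while (match(rest, /https?:\/\/[^, ]+/)) {
      print substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
  }'
}

all_urls_awk "https://www.baeldung.com?q=1,https://example.com/path,http://sub.example.org/page3"
# prints:
# https://www.baeldung.com?q=1
# https://example.com/path
# http://sub.example.org/page3
```

The loop ends once match() returns 0, that is, when the remaining text contains no further URL.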

5. Using the sed Command

In Unix-like operating systems, the sed command is used for processing and transforming text.

5.1. Extract a URL With sed

More specifically, we’ll use sed for extracting a complete URL from a string:

$ echo Welcome to our website https://www.baeldung.com |
  sed -n 's/.*\(http[s]\?:\/\/[^ ]\+\).*/\1/p'
https://www.baeldung.com

Now, we can discuss the inner workings of the script:

  • sed operates on the input string and searches for the specified URL pattern
  • with sed, the -n suppresses automatic printing
  • s/.*\(http[s]\?:\/\/[^ ]\+\).*/\1/p defines a substitution operation within sed to locate and extract the complete URL

So let’s break down the pattern:

  • s/ indicates the start of a substitution command
  • first .* matches any characters at the start of a string
  • http[s]\?:\/\/ matches the HTTP or HTTPS protocol and the common :// part of a URL
  • [^ ]\+ captures one or more non-space characters
  • final .* continues matching any remaining characters in the string
  • \1 refers to the first capture group, which is the complete URL
  • p instructs sed to print the extracted URL on the terminal

Consequently, we get an equivalent result via sed as well.
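Notably, the sample string here contains a single URL. Because the leading .* is greedy, the same command applied to a string with two URLs would capture the last one rather than the first. One possible workaround, sketched below for GNU sed, marks the first URL with a newline and then discards everything before the marker. The first_url_sed name is hypothetical, and the use of \n in the replacement and inside a bracket expression is assumed to be available as in GNU sed:

```shell
# Hypothetical helper (GNU sed assumed): print the first http(s) URL in the arguments.
first_url_sed() {
  printf '%s\n' "$*" |
    # 1st s: insert a newline marker before the leftmost URL scheme
    # 2nd s: drop everything up to the marker, keep the non-space run after it
    sed -n -E 's#https?://#\n&#; s#^[^\n]*\n([^ ]+).*#\1#p'
}

first_url_sed "Welcome to our website https://www.baeldung.com or visit https://example.com"
# prints: https://www.baeldung.com
```

Without the g flag, the first substitution only touches the leftmost match, which is what makes the marker land in front of the first URL.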

5.2. Extract Different URLs With sed

As we explore further, let’s try to extract different URLs from a string with sed and separate them on new lines:

$ echo "Check out these URLs:
  https://www.baeldung.com?q=1, 
  https://example.com/path,
  https://sub.example.org/page1?,
  http://www.example.net/page2?q=2,
  http://sub.example.org/page3" |
  sed -n 's/.*\(http[s]\?:\/\/[^,]\+\).*/\1/p'
https://www.baeldung.com?q=1
https://example.com/path
https://sub.example.org/page1?
http://www.example.net/page2?q=2
http://sub.example.org/page3

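Since sed processes its input line by line and each URL in this string sits on its own line, the p flag prints each matched URL on a separate line. Notably, because the leading .* is greedy, a line with several URLs would yield only the last of them.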
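When several URLs share one line, a possible approach is to first turn each run of separators into a newline and then keep only the lines that start with a URL scheme. This is a sketch for GNU sed, where \n in the replacement produces a newline. The all_urls_sed name and the single-line sample input are assumptions:

```shell
# Hypothetical helper (GNU sed assumed): print every http(s) URL in the arguments, one per line.
all_urls_sed() {
  printf '%s\n' "$*" |
    sed -E 's/[, ]+/\n/g' |          # split the text at runs of commas and spaces
    sed -n -E '/^https?:\/\//p'      # keep only the pieces that begin with http:// or https://
}

all_urls_sed "Check out these URLs: https://www.baeldung.com?q=1, https://example.com/path, http://sub.example.org/page3"
# prints:
# https://www.baeldung.com?q=1
# https://example.com/path
# http://sub.example.org/page3
```

Splitting first sidesteps the greedy matching issue, since each URL ends up on a line of its own before the filter runs.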

6. Conclusion

In this article, we’ve learned methods to extract a complete URL from a string. These methods included the grep, awk, and sed commands.

Naturally, we can use the grep command for basic URL extraction tasks. Meanwhile, the awk command is a good choice in more advanced scenarios, when the extraction is part of more complex text processing. Finally, we can use the sed command when we want to extract a URL through a substitution on each line.

Ultimately, we can use any of these effective methods based on the specific requirement of the URL extraction task.
