1. Overview

In this tutorial, we’ll explain how to use a regular expression to check if an English string is a valid URL.

2. Problem Description

Let s be an English string. Our regular expression should return true if s is a valid URL and false otherwise.

For example, it should output true for http://artifical-tech.com and false for http://artificial-tech. An example of a use case is checking if an image’s source attribute is a valid URL.

3. URL Regex

Let’s start with an example elucidating the high-level structure of a URL:

[Figure: URL Regex, the high-level structure of a URL]

A URL is made up of several components: scheme, authority, path, query, and fragment. We’ll show a regular expression for each part.

3.1. Regex for the Scheme

A URL starts with the scheme name. It’s a mandatory component we can match with:

SCHEME = [a-zA-Z][a-zA-Z0-9+.-]*

The SCHEME regular expression matches a letter followed by zero or more letters, digits, plus signs, periods, or hyphens.
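As a minimal sketch, the scheme pattern can be tried out in Python. The helper name is_valid_scheme is our own and only serves as an illustration:

```python
import re

# SCHEME as given above: a letter, then letters, digits, '+', '.', or '-'.
SCHEME = r"[a-zA-Z][a-zA-Z0-9+.-]*"

def is_valid_scheme(s: str) -> bool:
    # fullmatch requires the whole string to match, not just a prefix.
    return re.fullmatch(SCHEME, s) is not None

print(is_valid_scheme("http"))     # True
print(is_valid_scheme("git+ssh"))  # True
print(is_valid_scheme("1ttp"))     # False: starts with a digit
print(is_valid_scheme("htt@p"))    # False: '@' isn't allowed
```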

3.2. Regex for the Authority and Path

Both the authority and the path components are optional. The path has to start with the slash character if the authority is present in the URL.

Another rule is that the path can’t start with two slashes if the authority isn’t there.

The HIERPART regular expression represents these rules:

HIERPART = // AUTHORITY PATH

It starts with a double slash. After it, we have the AUTHORITY and PATH sub-expressions.

3.3. Regex for the Authority

The authority can be described as a registered name or server address. It’s preceded by a double slash and ends at the next slash, question mark, number sign, or the end of the URL. It contains three sub-components: the user info, host, and port:

AUTHORITY = (USERINFO@)?HOST(:PORT)?

The @ character can follow the user info component to separate it from the host. We separate the port from the host with a colon.

The user info component contains the username and optional scheme-specific information:

USERINFO = ([-a-zA-Z0-9._~!$&'()*+,;=:]|PCTENCODED)*

PCTENCODED stands for a percent-encoded octet. We use it to represent a data octet when its corresponding character isn’t in the allowed set or is used as a delimiter. The corresponding regular expression is:

PCTENCODED = % HEXDIG HEXDIG

where HEXDIG is a hexadecimal digit.
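A short Python sketch of this rule, assuming HEXDIG accepts both uppercase and lowercase hexadecimal digits. The helper is_pct_encoded is a name we introduce for illustration:

```python
import re

HEXDIG = r"[0-9a-fA-F]"            # assumption: both letter cases are accepted
PCTENCODED = "%" + HEXDIG + HEXDIG  # a percent sign and exactly two hex digits

def is_pct_encoded(s: str) -> bool:
    return re.fullmatch(PCTENCODED, s) is not None

print(is_pct_encoded("%20"))  # True: the encoding of a space
print(is_pct_encoded("%2G"))  # False: 'G' isn't a hexadecimal digit
print(is_pct_encoded("%2"))   # False: only one digit
```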

The host component can be an IPv6 literal address enclosed in square brackets, an IPv4 address in its dotted decimal form, or a registered name. We’ll focus on the case with a registered name:

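HOST = ([-a-zA-Z0-9._~!$&'()*+,;=]|PCTENCODED)*

A registered name is a sequence of strings, the domain labels, separated by dots. Each domain label begins with a letter and can contain dashes, although this regular expression doesn’t enforce the label structure. The registered name’s regular expression is the same as for the user info, except that it doesn’t allow the colon, which separates the host from the port.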

Finally, the port is an optional subcomponent of the authority. It’s a decimal number:

PORT = [0-9]*

For example, the default port for the HTTP scheme is 80.
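The authority rules above could be sketched in Python as follows. This is an illustration under two assumptions: the percent-encoded triplet is written as an alternative next to the character class (a class matches single characters only), and the colon is left out of the host so that the port can be captured separately. The function split_authority and the sample inputs are hypothetical:

```python
import re

PCTENCODED = r"%[0-9a-fA-F]{2}"
# Unreserved characters and sub-delimiters; the hyphen comes first so it's literal.
NAME_CHARS = r"-a-zA-Z0-9._~!$&'()*+,;="

USERINFO = rf"(?:[{NAME_CHARS}:]|{PCTENCODED})*"
HOST = rf"(?:[{NAME_CHARS}]|{PCTENCODED})*"  # registered names only
PORT = r"[0-9]*"
AUTHORITY = (
    rf"(?:(?P<userinfo>{USERINFO})@)?"
    rf"(?P<host>{HOST})"
    rf"(?::(?P<port>{PORT}))?"
)

def split_authority(text: str):
    """Return the sub-components as a dict, or None if there's no match."""
    m = re.fullmatch(AUTHORITY, text)
    return m.groupdict() if m else None

print(split_authority("user:pw@example.com:8080"))
# {'userinfo': 'user:pw', 'host': 'example.com', 'port': '8080'}
print(split_authority("example.com"))
# {'userinfo': None, 'host': 'example.com', 'port': None}
print(split_authority("exa mple.com"))
# None
```

IPv4 addresses in dotted decimal form happen to match the registered-name pattern too, while bracketed IPv6 literals aren’t covered by this sketch.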

3.4. Regex for the Path

The path is most commonly organized in a hierarchical form. We use it to locate a resource. It has five options for matching, the last of which, PATH-EMPTY, matches only the empty string:

PATH = PATH-ABEMPTY | PATH-ABSOLUTE | PATH-NOSCHEME | PATH-ROOTLESS | PATH-EMPTY

The PATH-ABEMPTY is a regular expression that represents a path that begins with / or is empty:

PATH-ABEMPTY = (\/SEGMENT)*

It has zero or more segments, each preceded by a slash:

SEGMENT = ([-a-zA-Z0-9._~!$&'()*+,;=:@]|PCTENCODED)*

PATH-ABSOLUTE represents a path that begins with / but not with //:

PATH-ABSOLUTE = \/(SEGMENT-NZ (\/ SEGMENT)*)?

Its first segment can’t be an empty string:

SEGMENT-NZ = ([-a-zA-Z0-9._~!$&'()*+,;=:@]|PCTENCODED)+

PATH-NOSCHEME matches a path that begins with a non-empty segment containing no colon:

PATH-NOSCHEME = SEGMENT-NZ-NC (\/ SEGMENT)*

The SEGMENT-NZ-NC regular expression is the same as SEGMENT-NZ, only without the colon character.

Finally, PATH-ROOTLESS stands for a path that begins with a segment that can contain a colon:

PATH-ROOTLESS = SEGMENT-NZ (\/ SEGMENT)*
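To see how the path variants differ, here’s a small Python sketch. It assumes the percent-encoded triplet is an alternative next to the character class, and the helper matches is our own name:

```python
import re

PCT = r"%[0-9a-fA-F]{2}"
PCHARS = r"-a-zA-Z0-9._~!$&'()*+,;=@"  # segment characters except the colon

SEGMENT = rf"(?:[{PCHARS}:]|{PCT})*"
SEGMENT_NZ = rf"(?:[{PCHARS}:]|{PCT})+"
SEGMENT_NZ_NC = rf"(?:[{PCHARS}]|{PCT})+"  # non-empty, no colon

PATH_ABEMPTY = rf"(?:/{SEGMENT})*"
PATH_ABSOLUTE = rf"/(?:{SEGMENT_NZ}(?:/{SEGMENT})*)?"
PATH_NOSCHEME = rf"{SEGMENT_NZ_NC}(?:/{SEGMENT})*"
PATH_ROOTLESS = rf"{SEGMENT_NZ}(?:/{SEGMENT})*"

def matches(pattern: str, path: str) -> bool:
    return re.fullmatch(pattern, path) is not None

print(matches(PATH_ABEMPTY, ""))        # True: the empty path
print(matches(PATH_ABEMPTY, "/a/b"))    # True
print(matches(PATH_ABSOLUTE, "/a/b"))   # True
print(matches(PATH_ABSOLUTE, "//a"))    # False: the first segment is empty
print(matches(PATH_NOSCHEME, "a:b/c"))  # False: colon in the first segment
print(matches(PATH_ROOTLESS, "a:b/c"))  # True
```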

3.5. Regex for the Query

The query contains non-hierarchical data and is used with the path to locate a resource. The data is usually a sequence of attribute-value pairs separated by a delimiter.

The corresponding regular expression is:

QUERY = ([-a-zA-Z0-9._~!$&'()*+,;=:@\/?]|PCTENCODED)*

The query is preceded by a question mark. It ends at the number sign (#) that introduces the fragment or at the end of the URL.
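As an illustration, we can validate a query string with this pattern and then split it into attribute-value pairs with parse_qsl from Python’s standard library. The sketch assumes & is the pair delimiter and writes the percent-encoded triplet as an alternative next to the character class. The function parse_query is a hypothetical helper:

```python
import re
from urllib.parse import parse_qsl

QUERY = r"(?:[-a-zA-Z0-9._~!$&'()*+,;=:@/?]|%[0-9a-fA-F]{2})*"

def parse_query(query: str):
    """Return the decoded (attribute, value) pairs, or None if the query is invalid."""
    if re.fullmatch(QUERY, query) is None:
        return None
    # parse_qsl splits on '&' and decodes '+' and percent-encoded octets.
    return parse_qsl(query)

print(parse_query("q=what+is+a+domain+name&lang=en"))
# [('q', 'what is a domain name'), ('lang', 'en')]
print(parse_query("q=a b"))
# None: a raw space isn't allowed
```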

3.6. Regex for the Fragment

The fragment contains data to identify a secondary resource:

FRAGMENT = ([-a-zA-Z0-9._~!$&'()*+,;=:@\/?]|PCTENCODED)*

3.7. The Complete Regular Expression for a URL

The complete regular expression for the URL combines all the previous regular expressions:

URL = SCHEME:HIERPART(\?QUERY)?(#FRAGMENT)?
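One way to assemble the pieces in Python is sketched below. It isn’t a definitive implementation: it assumes the host is a registered name, writes the percent-encoded triplet as an alternative next to each character class, and uses the PATH-ABEMPTY form for the path since an authority is always present. The helpers chars and is_valid_url, the group names, and the example.com URL are hypothetical:

```python
import re

PCT = r"%[0-9a-fA-F]{2}"
BASE = r"-a-zA-Z0-9._~!$&'()*+,;="  # unreserved characters and sub-delimiters

def chars(extra: str = "") -> str:
    """Zero or more allowed characters or percent-encoded triplets."""
    return rf"(?:[{BASE}{extra}]|{PCT})*"

SCHEME = r"[a-zA-Z][a-zA-Z0-9+.-]*"
USERINFO = chars(":")
HOST = chars()
PORT = r"[0-9]*"
AUTHORITY = rf"(?:{USERINFO}@)?{HOST}(?::{PORT})?"
SEGMENT = chars(":@")
PATH = rf"(?:/{SEGMENT})*"  # PATH-ABEMPTY
QUERY = chars(":@/?")
FRAGMENT = chars(":@/?")

URL = re.compile(
    rf"(?P<scheme>{SCHEME}):"
    rf"//(?P<authority>{AUTHORITY})(?P<path>{PATH})"
    rf"(?:\?(?P<query>{QUERY}))?"
    rf"(?:#(?P<fragment>{FRAGMENT}))?"
)

def is_valid_url(s: str) -> bool:
    return URL.fullmatch(s) is not None

m = URL.fullmatch("https://user@example.com:8080/docs/index.html?lang=en#intro")
print(m.groupdict())
# {'scheme': 'https', 'authority': 'user@example.com:8080',
#  'path': '/docs/index.html', 'query': 'lang=en', 'fragment': 'intro'}
print(is_valid_url("1ttp://example.com/"))  # False: the scheme starts with a digit
```

Using fullmatch rather than search matters here, since otherwise a valid prefix of an invalid string would be accepted.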

4. Examples

For example, all these are valid URLs, and our regex would recognize them as such:

  • http://google.com/
  • http://google.com
  • http://google.com/search?q=what+is+a+domain+name/

On the other hand, the following strings don’t match the regular expression because they aren’t valid URLs:

  • 1ttp://google.com/ – This example fails since the scheme must start with a letter.
  • http://google.com/search?q=what is a domain name/ – No match since the space character isn’t in the allowed set and would have to be percent-encoded as %20.
  • htt@p://google.com/ – This example fails since the scheme can only contain letters, digits, and the +, ., and - characters.

5. Complexity

Since the quantifiers in the regular expression (* and +) are unbounded, it can match input strings of any length.

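The length of the regex is constant with respect to the input string’s length n. So, when the regex is compiled to a deterministic finite automaton, each input character is read only once, and the complexity of matching is O(n). Backtracking regex engines don’t guarantee this bound in general.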

6. Conclusion

In this article, we showed a regular expression for checking whether a string is a valid URL. The URL consists of a scheme, followed by optional authority, path, query, and fragment components.
