1. Introduction

Extracting the filename from a path, Uniform Resource Locator (URL), or other Uniform Resource Identifier (URI) is a common task when handling files. As a result, there are several ways to do it, and some of them use the AWK programming language as implemented by the awk interpreter.

In this tutorial, we’ll use awk to get the filename from a given path. First, we’ll explain how the standard tool for this purpose works. After that, we’ll review a more universal but less accurate way of doing what we need. Finally, we’ll explore several solutions that AWK provides for extracting the filename from a path.

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments unless otherwise specified.

2. How basename Works

The basename command is perhaps the most standard way to extract the last part of a path:

$ basename /dir/subdir/file.ext
file.ext
$ basename ./file.ext
file.ext

Notably, the command works for both absolute and relative paths.

Because basename treats its argument as a plain string and never checks the filesystem, it can even handle a URL as long as it doesn’t have query parameters:

$ basename https://gerganov.com/dir/subdir/file.ext
file.ext
$ basename 'https://gerganov.com/dir/subdir/file.ext?param=value'
file.ext?param=value

Even then, we can usually leverage other standard commands like cut, using the ? question mark as the --delimiter (-d) and selecting the first of the --fields (-f) to extract only the part before its first occurrence:

$ basename 'https://gerganov.com/dir/subdir/file.ext?param=value' | cut --delimiter='?' --fields=1
file.ext

Still, this may present a problem, since all characters except the / forward slash and NUL are legal in filenames, so a ? question mark can be part of the actual name.
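
To illustrate the concern, here is a minimal sketch that uses a hypothetical path whose filename contains a question mark. The path is made up for the example and doesn't need to exist, since basename and cut only work on the string:

```shell
# Hypothetical path whose filename legitimately contains a question mark
path='/dir/subdir/what?.txt'

# basename alone keeps the whole last component
full=$(basename "$path")

# Cutting at the first ? truncates the real filename
cut_name=$(basename "$path" | cut -d '?' -f 1)

printf '%s\n' "$full" "$cut_name"
```

The first line of output is what?.txt, while the second is just what, so the cut step is only safe when we know the input is a URL with a query string.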

Moreover, basename can even remove a suffix like the extension of a file:

$ basename --suffix='.ext' /dir/subdir/file.ext
file
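
The long --suffix option comes from GNU coreutils. As a more portable sketch, POSIX specifies the suffix as an optional second operand instead. The second path below is a made-up example that shows one corner case: the suffix isn't removed when it's identical to the whole remaining name:

```shell
# POSIX form: basename string [suffix]
name=$(basename /dir/subdir/file.ext .ext)

# The suffix is kept when it equals the entire last component
same=$(basename /dir/.ext .ext)

printf '%s\n' "$name" "$same"
```

This should print file followed by .ext.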

Still, we can perform the main functionality in other ways as well.

3. basename with Pattern Substitution and Prefix Removal

Since most of what basename does comes down to returning what comes after the last / forward slash of a path, we can emulate its behavior using Bash pattern matching within parameter expansion.

First, let’s store some paths in variables:

$ ABSPATH='/dir/subdir/file.ext'
$ RELPATH='./subdir/file.ext'

Now, we can leverage pattern substitution to extract their last parts:

$ echo ${ABSPATH//*\//}
file.ext
$ echo ${RELPATH//*\//}
file.ext

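In this case, we use the ${VAR//PATTERN/REP} Bash pattern substitution syntax to replace every match of *\/ (PATTERN), i.e., everything up to and including the last / forward slash, with an empty string (REP) in each variable (VAR). Notably, the pattern is a shell glob rather than a true regular expression: * matches any string, and Bash always picks the longest match.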

Alternatively, we can employ prefix removal, another form of parameter expansion:

$ echo ${ABSPATH##*/}
file.ext
$ echo ${RELPATH##*/}
file.ext

Here, ${VAR##PAT} removes the longest match of a pattern (PAT) from the beginning of a variable (VAR).
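
The number of # characters matters. As a short sketch, we can compare the single and double forms on the same sample path:

```shell
path='/dir/subdir/file.ext'

# Single #: remove the shortest prefix matching */ (only the leading slash)
shortest=${path#*/}

# Double ##: remove the longest prefix matching */ (everything up to the last slash)
longest=${path##*/}

printf '%s\n' "$shortest" "$longest"
```

The first result is dir/subdir/file.ext and the second is file.ext, which is why the basename emulation needs the double form.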

Although such methods are available in Bash and other programming languages, they exhibit pitfalls for edge cases:

+---------------------------------------------------+
| $PATHVAR     | basename | ${PATHVAR##*/} | Match? |
|              | $PATHVAR |                |        |
|--------------+----------+----------------+--------|
| /dir/subdir/ | subdir   |                | no     |
|--------------+----------+----------------+--------|
| /            | /        |                | no     |
|--------------+----------+----------------+--------|
| /dir         | dir      | dir            | yes    |
|--------------+----------+----------------+--------|
| dir          | dir      | dir            | yes    |
|--------------+----------+----------------+--------|
| dir/         | dir      |                | no     |
|--------------+----------+----------------+--------|
| dir/file     | file     | file           | yes    |
|--------------+----------+----------------+--------|
| dir//        | dir      |                | no     |
|--------------+----------+----------------+--------|
| ..           | ..       | ..             | yes    |
|--------------+----------+----------------+--------|
| .            | .        | .              | yes    |
+---------------------------------------------------+
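
The comparison in the table can be reproduced with a small loop. This is only a sketch: the variable names are arbitrary, and it assumes a basename that follows the POSIX rules for trailing slashes:

```shell
mismatches=''
for p in /dir/subdir/ / /dir dir dir/ dir/file dir// .. .; do
    b=$(basename "$p")   # what the standard tool returns
    e=${p##*/}           # what the parameter expansion returns
    if [ "$b" = "$e" ]; then
        m=yes
    else
        m=no
        mismatches="${mismatches:+$mismatches }$p"
    fi
    printf '%-13s %-9s %-9s %s\n' "$p" "$b" "$e" "$m"
done
echo "differs for: $mismatches"
```

Every path that ends in a slash shows up in the final list, since the expansion leaves an empty string there.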

In fact, similar pitfalls appear when we emulate dirname with ${PATHVAR%/*}:

+--------------------------------------------------+
| $PATHVAR     | dirname  | ${PATHVAR%/*} | Match? |
|              | $PATHVAR |               |        |
|--------------+----------+---------------+--------|
| /dir/subdir/ | /dir     | /dir/subdir   | no     |
|--------------+----------+---------------+--------|
| /            | /        |               | no     |
|--------------+----------+---------------+--------|
| /dir         | /        |               | no     |
|--------------+----------+---------------+--------|
| dir          | .        | dir           | no     |
|--------------+----------+---------------+--------|
| dir/         | .        | dir           | no     |
|--------------+----------+---------------+--------|
| dir/file     | dir      | dir           | yes    |
|--------------+----------+---------------+--------|
| dir//        | .        | dir/          | no     |
|--------------+----------+---------------+--------|
| ..           | .        | ..            | no     |
|--------------+----------+---------------+--------|
| .            | .        | .             | yes    |
+--------------------------------------------------+

Thus, it’s usually best to stick to basename and dirname, which include logic that avoids problems with specific paths.

4. Using AWK to Extract Filename

Let’s see some AWK solutions for our task that cover all pitfalls.
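
As a baseline, the most primitive AWK approach is to split each line on the / forward slash and print the last field. The sketch below isn't one of the robust solutions. It only shows what they need to improve on:

```shell
# Split on / and print the last field
plain=$(echo '/dir/subdir/file.ext' | awk -F/ '{ print $NF }')

# A trailing slash leaves an empty last field, unlike basename
trailing=$(echo '/dir/subdir/' | awk -F/ '{ print $NF }')

printf '[%s]\n' "$plain" "$trailing"
```

The output is [file.ext] followed by [], so a path with a trailing slash already breaks this approach.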

4.1. system()

As discussed above, any primitive solution with AWK might miss some edge cases. So, to avoid these problems, we can use basename if it’s available on the system:

$ awk -v PATHVAR='/dir/subdir/file.ext' 'BEGIN{system("basename "PATHVAR)}'
file.ext

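In this case, we use the AWK system() function to call the basename command with the preset PATHVAR variable as its argument. Since the path is concatenated into the command line unquoted, the shell interprets any spaces or metacharacters it contains. While we could pass the path as an input file and read it from the built-in AWK FILENAME variable, the awk command errors out if that file doesn’t exist.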
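
When we need the result inside AWK rather than just printed, a command pipe into getline is one option. The following is a hedged sketch: it wraps the path in single quotes so that spaces survive, which assumes the path itself contains no single quotes, and the toupper() call only demonstrates that the value is available for further processing:

```shell
name=$(awk -v PATHVAR='/dir/sub dir/file.ext' 'BEGIN {
    # \047 is a single quote; assumes PATHVAR itself contains none
    cmd = "basename -- \047" PATHVAR "\047"
    # read the first line of the command output into an AWK variable
    if ((cmd | getline result) > 0) {
        print toupper(result)
    }
    close(cmd)
}')
echo "$name"
```

Here, the output should be FILE.EXT.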

Since basename is defined by POSIX, most Linux distributions and other Unix-like systems include it. Still, we might need an AWK-only solution.

4.2. AWK Native basename

Even if we omit the added options, basename is actually a little more complex than it might seem.

Let’s port the minimalistic OpenBSD basename to the basename.awk AWK script:

function basename(path,    ret) {
    # copy the argument into ret; the extra parameter keeps ret local
    ret = path;

    # check for no or empty argument and return current directory
    if (ret == "" || ret == "\0") {
        return ".";
    }

    # remove trailing slashes
    gsub(/\/+$/, "", ret);

    # if we end up with an empty string, return / forward slash
    if (ret == "") {
        return "/";
    }

    # get basename
    gsub(/.*\//, "", ret);

    return ret;
}

BEGIN {
    print(basename(PATHVAR));
}

In this script, we define a basename() function that takes a single path argument. Inside the function, we assign path to the ret variable. First, we check whether ret is empty or NUL, in which case we return a . period. When the argument does contain non-NUL characters, we remove all trailing slashes. If it turns out the argument only contained slashes and ret is now empty, we return a / forward slash. Otherwise, we remove everything up to the last slash with a greedy gsub() match and return the result.

To invoke the script, we use the awk interpreter:

$ awk -v PATHVAR='/dir/subdir/file.ext' -f basename.awk
file.ext

Naturally, this is a relatively extreme way to perform the operation, but it uses AWK alone.
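
To check the port against the edge cases from the earlier table, we can sketch a variation that reads one path per line from standard input instead of PATHVAR. This is an adaptation for testing with a condensed copy of the function, not part of the OpenBSD code:

```shell
out=$(printf '%s\n' '/dir/subdir/' '/' 'dir//' 'dir/file' '.' | awk '
function basename(path,    ret) {
    ret = path
    # empty input means the current directory
    if (ret == "") return "."
    # drop trailing slashes; a path of only slashes is the root
    gsub(/\/+$/, "", ret)
    if (ret == "") return "/"
    # remove everything up to and including the last slash
    gsub(/.*\//, "", ret)
    return ret
}
{ print basename($0) }
')
echo "$out"
```

The five results are subdir, /, dir, file, and a single period, which matches the basename column for those paths.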

5. Summary

In this article, we talked about ways to extract the filename from a path via AWK.

In conclusion, even though the awk interpreter doesn’t provide a native basename function, we can implement one from scratch.
