Finding Duplicate Files With the Same Name

1. Overview

In Linux, unique filenames help us to identify data efficiently without looking at its content.

In this tutorial, we’ll learn how to find duplicate files with the same name in different letter cases inside a directory.

2. Scenario Setup

Let’s start by using the exa command to look at the directory structure for the scenario:

$ exa --tree my_dir
my_dir
├── abc.jpeg
├── aBc.jpeg
├── ABC.jpeg
├── def.jpeg
├── sub_dir1
│  └── abc.jpeg
└── sub_dir2
   ├── abc.jpeg
   └── uvw.jpeg

From the output, we notice multiple files having some variation of abc.jpeg as the filename, such as abc.jpeg, aBc.jpeg, and ABC.jpeg. Our immediate goal is to recursively find all such files inside the my_dir directory.
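To follow along, we can recreate this tree with a few commands. The snippet below is only a minimal sketch: the make_demo_tree function name is our own and not part of the scenario, and it assumes a case-sensitive filesystem, so that abc.jpeg, aBc.jpeg, and ABC.jpeg can coexist in one directory.

```shell
#!/bin/bash
# Hypothetical helper (not one of the article's scripts): recreates the sample
# tree under the given parent directory. Assumes a case-sensitive filesystem.
make_demo_tree() {
    local parent="${1:-.}"
    mkdir -p "$parent/my_dir/sub_dir1" "$parent/my_dir/sub_dir2"
    touch "$parent/my_dir/abc.jpeg" "$parent/my_dir/aBc.jpeg" \
          "$parent/my_dir/ABC.jpeg" "$parent/my_dir/def.jpeg" \
          "$parent/my_dir/sub_dir1/abc.jpeg" \
          "$parent/my_dir/sub_dir2/abc.jpeg" \
          "$parent/my_dir/sub_dir2/uvw.jpeg"
}
```

Calling make_demo_tree without arguments creates my_dir in the current directory.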

3. Using find With -iname Option

Using the -iname option available with the find command, we can do a case-insensitive search inside the my_dir directory. Let’s go ahead and see this in action:

$ find my_dir/ -iname 'abc.jpeg'
my_dir/abc.jpeg
my_dir/aBc.jpeg
my_dir/sub_dir2/abc.jpeg
my_dir/ABC.jpeg
my_dir/sub_dir1/abc.jpeg

Great! We’ve got all the file paths referring to files with the name abc.jpeg in different letter cases.
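We should keep in mind that -iname matches any kind of entry, so a directory named abc.jpeg would also show up, and that its argument is a shell pattern in which characters such as * and ? act as wildcards. As a small sketch, we could wrap the search in a function that only reports regular files; the find_name_ci name is hypothetical:

```shell
#!/bin/bash
# Hypothetical wrapper: case-insensitive name search limited to regular files
find_name_ci() {
    local dir="$1" name="$2"
    # -type f skips directories that happen to share the name
    find "$dir" -type f -iname "$name"
}
```

For our scenario, find_name_ci my_dir abc.jpeg would list the same paths as the command above.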

4. Using find With grep

Alternatively, we can use the find command for file search and delegate the responsibility of the -iname option to the grep command with the -i (--ignore-case) option.

Let’s use the find and grep commands to find files with duplicate names:

$ find my_dir/ | grep -i -E '/abc\.jpeg$'
my_dir/abc.jpeg
my_dir/aBc.jpeg
my_dir/sub_dir2/abc.jpeg
my_dir/ABC.jpeg
my_dir/sub_dir1/abc.jpeg

Perfect! We got the desired result as expected.
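When we filter paths with grep, an unescaped dot matches any character, and a pattern without a trailing $ also matches longer names such as abc.jpeg.bak. The sketch below shows one way to guard against both. The grep_name_ci function is our own illustration, and it escapes only dots, so a name containing other regex metacharacters would need more care:

```shell
#!/bin/bash
# Hypothetical helper: reads paths on stdin and keeps those whose last
# component equals the given name, ignoring case
grep_name_ci() {
    local name="$1" escaped
    # Turn every "." into "\." so that it matches a literal dot only
    escaped=$(printf '%s' "$name" | sed 's/\./\\./g')
    # The leading "/" and trailing "$" pin the match to the last path component
    grep -i -E "/${escaped}\$"
}
```

We could then run find my_dir/ | grep_name_ci abc.jpeg to get the matching paths.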

5. Using find With awk

Awk is a robust programming language that we can use for data extraction. In this section, we’ll use awk to do a case-insensitive search to solve our use case.

Let’s write the find_duplicates.awk script that takes the output of the find command and extracts the duplicate filenames:

$ cat find_duplicates.awk
{
    split($0, current_array, "/");
    size=length(current_array);
    if(tolower(current_array[size])==tolower(SEARCH_FILE)) {
        print $0;
    }
}

Firstly, we split the file path strings based on the “/” delimiter and stored the result in the current_array variable. Next, we used the tolower() function for a case-insensitive comparison between SEARCH_FILE and the filename available in the current line. Lastly, we should note that we’ll pass the SEARCH_FILE variable to the find_duplicates.awk script using the -v option of the awk command.

Moving on, let’s conclude by validating the functionality of the find_duplicates.awk script:

$ awk -v SEARCH_FILE="abc.jpeg" -f find_duplicates.awk <(find my_dir/)
my_dir/abc.jpeg
my_dir/aBc.jpeg
my_dir/sub_dir2/abc.jpeg
my_dir/ABC.jpeg
my_dir/sub_dir1/abc.jpeg

It looks like we’ve got this right!
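For a quick check, the same comparison also fits in a one-liner. The following is a sketch rather than a replacement for the script: it lets awk split each path on “/” through the -F option, so the filename is available as $NF. The awk_name_ci function name is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: same comparison as find_duplicates.awk, written inline
awk_name_ci() {
    local dir="$1" name="$2"
    # -F/ splits each path on "/", so $NF holds the last component
    find "$dir" | awk -F/ -v name="$name" 'tolower($NF) == tolower(name)'
}
```

Since the awk program has a pattern and no action, awk prints every line for which the comparison is true.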

6. Using find With sed

Like awk, sed is an impressive utility for doing text operations. In this section, we’ll use sed commands to solve our use case.

6.1. sed Script

Let’s start by identifying the logic and relevant sed commands we want to add to our find_duplicates_abc_jpeg.sed script.

First, we must keep track of each file’s path and name. So let’s see how we can make use of the group substitution feature in sed to achieve this:

$ find my_dir | sed -E -n -e 's#(.*)/(.*)#\2@&#p'
abc.jpeg@my_dir/abc.jpeg
aBc.jpeg@my_dir/aBc.jpeg
sub_dir2@my_dir/sub_dir2
abc.jpeg@my_dir/sub_dir2/abc.jpeg
uvw.jpeg@my_dir/sub_dir2/uvw.jpeg
ABC.jpeg@my_dir/ABC.jpeg
sub_dir1@my_dir/sub_dir1
abc.jpeg@my_dir/sub_dir1/abc.jpeg
def.jpeg@my_dir/def.jpeg

We created two groups represented by \1 and \2, wherein the second group contains the filename. Later, we concatenated the filename (\2) and the original string (&), which is the file’s path as reported by the find command.
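To see the substitution in isolation, we can feed it a single path. This is a small sketch with a hypothetical tag_with_name function:

```shell
#!/bin/bash
# Hypothetical helper: prefixes a path with its last component and "@"
tag_with_name() {
    # (.*) is greedy, so \1 takes everything up to the last "/" and \2 the rest;
    # & stands for the whole matched text
    printf '%s\n' "$1" | sed -E -n -e 's#(.*)/(.*)#\2@&#p'
}

tag_with_name my_dir/sub_dir2/abc.jpeg   # prints abc.jpeg@my_dir/sub_dir2/abc.jpeg
tag_with_name my_dir                     # prints nothing, as there's no "/" to match
```

The second call explains why the my_dir entry itself is absent from the output above.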

Now, we can add our first sed substitution command to the script, followed by the hold (h) command to save the substitution output in the hold space:

s#(.*)/(.*)#\2@&#
h

Next, we can write the logic to initiate filename matching:

s/^abc.jpeg@.*$//i
t match_begin
b next

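Over here, we use the case-insensitive substitution flag (i) to check if the path ends with abc.jpeg as the filename, emptying the pattern space if it does. Further, we use the t command to do a conditional branching to the match_begin label and the b command for an unconditional branching to the next label. Since the t command branches after any successful substitution on the current line, including our first one, the actual match check happens in the match_begin block.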

Continuing this approach, let’s define the match_begin, match_success, and the next labels:

:match_begin
s/^$//
t match_success
b next

:match_success
g
s/.+@(.*)/\1/i
p
b next

:next

Let’s break this down to understand the nitty-gritty of the logic. In the match_begin block, we check if the pattern space is empty and branch to the match_success block for the positive matches. Further, in the match_success block, we retrieve the original string from the hold space and derive the file’s path. Lastly, the next block is a no-op to resume processing the next entry.
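The extra emptiness check in match_begin matters because the t command reacts to any successful substitution made since the current line was read, not only to the most recent one. The sketch below illustrates this with a hypothetical t_branches function and a hit label of our own: it prints branched whenever t takes the branch, even when the filename doesn’t match.

```shell
#!/bin/bash
# Hypothetical demo: shows that t branches once the first substitution has
# succeeded, whether or not the filename substitution matched
t_branches() {
    printf '%s\n' "$1" | sed -n -E '
        s#(.*)/(.*)#\2@&#
        s/^abc\.jpeg@.*$//i
        t hit
        b
        :hit
        s/.*/branched/p
    '
}

t_branches my_dir/def.jpeg   # prints branched, although def.jpeg is no match
t_branches my_dir            # prints nothing, as no substitution succeeded
```

Taking the branch resets the flag, which is why the s/^$// check that follows in match_begin gives a reliable answer.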

6.2. sed Script in Action

Before executing our find_duplicates_abc_jpeg.sed script, let’s see its entire code:

$ cat find_duplicates_abc_jpeg.sed
s#(.*)/(.*)#\2@&#
h
s/^abc.jpeg@.*$//i
t match_begin
b next

:match_begin
s/^$//
t match_success
b next

:match_success
g
s/.+@(.*)/\1/i
p
b next

:next

Next, let’s validate our logic by running the script:

$ find my_dir/ | sed -n -E -f find_duplicates_abc_jpeg.sed
my_dir/abc.jpeg
my_dir/aBc.jpeg
my_dir/sub_dir2/abc.jpeg
my_dir/ABC.jpeg
my_dir/sub_dir1/abc.jpeg

Great! We’ve nailed this!
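As a point of comparison, the same filter can be written far more compactly. The following sketch assumes GNU sed, whose address syntax accepts the I modifier for case-insensitive matching; the sed_name_ci function name is hypothetical:

```shell
#!/bin/bash
# Hypothetical shortcut, assuming GNU sed: prints the lines on stdin whose
# last path component is abc.jpeg in any letter case
sed_name_ci() {
    # \#...# uses "#" as the address delimiter; the I modifier ignores case
    sed -n '\#/abc\.jpeg$#Ip'
}
```

We could use it as find my_dir/ | sed_name_ci. The longer script remains a useful exercise in working with the hold space and branching.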

6.3. sed Script Generator

In this section, we’ll write a Bash script to generate the sed script dynamically, so we can later reuse it for any filename.

Let’s go ahead and write the sed_script_generator.sh script, which uses the tee command to create the sed script:

$ cat sed_script_generator.sh
#!/bin/bash
SED_SCRIPT_NAME="find_duplicates_${1/\./_}.sed"
# Overwrite rather than append, so rerunning doesn't duplicate the script body
tee "${SED_SCRIPT_NAME}" 1>/dev/null <<END_SCRIPT
# same as find_duplicates_abc_jpeg.sed, except using $1 in place of abc.jpeg
END_SCRIPT
echo "${SED_SCRIPT_NAME}"

We must note that the script reads the filename from the first positional argument ($1), so we can reuse it for different filenames.

Finally, let’s use it to generate the sed script by passing abc.jpeg as the first positional argument:

$ ./sed_script_generator.sh abc.jpeg
find_duplicates_abc_jpeg.sed
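Since the here-document body is abbreviated in the listing, here’s one possible way to spell it out. This is a sketch under our own assumptions: the generate_sed_script function name is hypothetical, and the filename is inserted into the regex as-is, so a dot in it still matches any character.

```shell
#!/bin/bash
# Hypothetical sketch of a full generator: writes a sed script for the given
# filename into the current directory and prints the script's name
generate_sed_script() {
    local name="$1"
    local out="find_duplicates_${name/./_}.sed"
    # Unquoted here-document: ${name} expands, \$ becomes a literal $,
    # while \1, \2 and & pass through unchanged
    cat > "$out" <<END_SCRIPT
s#(.*)/(.*)#\2@&#
h
s/^${name}@.*\$//i
t match_begin
b next

:match_begin
s/^\$//
t match_success
b next

:match_success
g
s/.+@(.*)/\1/i
p
b next

:next
END_SCRIPT
    echo "$out"
}
```

We could then run find my_dir/ | sed -n -E -f "$(generate_sed_script abc.jpeg)" to generate and apply the script in one step.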

7. Using find in a Bash Script

In this section, we’ll learn to write a Bash script that uses the find command and array iteration to determine the files with duplicate names.

Let’s go ahead and see the find_duplicates.sh Bash script in its entirety:

$ cat find_duplicates.sh
#!/bin/bash
set -e
SEARCH_FILE="$1"
SRC_DIR_ABS_PATH="$(realpath my_dir)"

# Read one path per line into the files array
mapfile -t files < <(find "$SRC_DIR_ABS_PATH" -type f)

# Make the == comparison below case-insensitive
shopt -s nocasematch
for file in "${files[@]}"
do
    filename=$(basename "$file")
    if [[ $filename == "${SEARCH_FILE}" ]]
    then
        echo "$file"
    fi
done

shopt -u nocasematch

exit 0

Next, let’s understand the logical flow of the script.

First, we define the SEARCH_FILE variable to store the filename passed using the first positional argument ($1). Next, we use the find command to initialize the files array with the absolute file paths. Lastly, we iterate over the files array for performing case-insensitive comparisons supported by the nocasematch option of the shopt built-in command.

Great! We’re now ready to run our script and verify the results:

$ ./find_duplicates.sh abc.jpeg
/my_dir/abc.jpeg
/my_dir/aBc.jpeg
/my_dir/sub_dir2/abc.jpeg
/my_dir/ABC.jpeg
/my_dir/sub_dir1/abc.jpeg
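If the directory may contain paths with spaces or newlines, reading them NUL-separated is a safer way to iterate. The following is a sketch of that variant, not the script above: the find_ci function name is hypothetical, and it uses the ${path##*/} expansion in place of the basename command.

```shell
#!/bin/bash
# Hypothetical variant of the loop that tolerates spaces and newlines in paths
find_ci() {
    local dir="$1" name="$2" path
    shopt -s nocasematch
    # -print0 and read -d '' use the NUL character to separate paths
    while IFS= read -r -d '' path
    do
        # ${path##*/} strips everything up to the last "/", like basename
        if [[ "${path##*/}" == "$name" ]]
        then
            printf '%s\n' "$path"
        fi
    done < <(find "$dir" -type f -print0)
    shopt -u nocasematch
}
```

For example, find_ci my_dir abc.jpeg would list the matching files relative to the current directory.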

8. Using ls in a Bash Script

We can also use the ls command to get the file paths and later process the results to determine the files with duplicate names.

The foremost thing to understand in this approach is the output format of the ls command. Let’s run the ls command to analyze the output format:

$ ls -lRo --quoting-style=c my_dir
"my_dir":
total 8
-rw-r--r-- 1 root    0 Mar 22 05:57 "ABC.jpeg"
-rw-r--r-- 1 root    0 Mar 22 05:57 "aBc.jpeg"
-rw-r--r-- 1 root    0 Mar 22 05:57 "abc.jpeg"
-rw-r--r-- 1 root    0 Mar 22 05:57 "def.jpeg"
drwxr-xr-x 2 root 4096 Mar 22 05:57 "sub_dir1"
drwxr-xr-x 2 root 4096 Mar 22 05:57 "sub_dir2"

"my_dir/sub_dir1":
total 0
-rw-r--r-- 1 root 0 Mar 22 05:57 "abc.jpeg"

"my_dir/sub_dir2":
total 0
-rw-r--r-- 1 root 0 Mar 22 05:57 "abc.jpeg"
-rw-r--r-- 1 root 0 Mar 22 05:57 "uvw.jpeg"

We must note that we used the --quoting-style option to surround the filenames with double quotes so that it’s easy to parse them later. Further, the files are grouped under directories, and we don’t get the full file path in the same line. As a result, we’ll need to look up the path prefix to get the complete file paths after processing.

Now, let’s go ahead and look at the ls_duplicates.sh script in its entirety:

$ cat ls_duplicates.sh
#!/bin/bash
SEARCH_FILE="$1"

# Matching file entries and directory headers, each prefixed with its line number
FILE_ENTRIES=($(ls -lRo --quoting-style=c my_dir | grep -ino -E "\"${SEARCH_FILE}\"$"))
PREFIX_ENTRIES=($(ls -lRo my_dir | grep -ni "^.*:$"))

dir_prefix=
for file_entry in "${FILE_ENTRIES[@]}"
do
    file_entry_line=$(echo "$file_entry" | cut -d ":" -f 1)
    # xargs strips the surrounding double quotes from the filename
    file_entry_filename=$(echo "$file_entry" | cut -d ":" -f 2 | xargs -n 1)
    for index in "${!PREFIX_ENTRIES[@]}"
    do
        prefix_entry_line=$(echo "${PREFIX_ENTRIES[$index]}" | cut -d ":" -f 1)
        prefix_entry_dir=$(echo "${PREFIX_ENTRIES[$index]}" | cut -d ":" -f 2)

        # Keep the last directory header that appears before the file entry
        if [[ ${prefix_entry_line} -lt ${file_entry_line} ]]
        then
            dir_prefix=$prefix_entry_dir
        fi
    done
    realpath "${dir_prefix:-.}/${file_entry_filename}"
    dir_prefix=
done

exit 0

Next, let’s pause a bit to follow the script’s logic. First, we read the filename to search for from the first positional argument ($1) and initialized the FILE_ENTRIES and PREFIX_ENTRIES arrays with the matching file entries and directory prefixes. While doing so, we also included the line number from the output using the -n option of the grep command. Next, we looked for the correct directory prefix for each file entry by using the last directory prefix that appeared before it in the output of the ls command. Lastly, we concatenated the directory prefix with the filename for each entry.

Finally, let’s see our ls_duplicates.sh script in action:

$ ./ls_duplicates.sh abc.jpeg
/my_dir/ABC.jpeg
/my_dir/aBc.jpeg
/my_dir/abc.jpeg
/my_dir/sub_dir1/abc.jpeg
/my_dir/sub_dir2/abc.jpeg

We got the correct results. Nevertheless, we should realize that using the ls command instead of the find command adds the unnecessary overhead of looking up the directory prefixes.
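The prefix lookup can also happen in a single pass by remembering the most recent directory header while reading the listing. The sketch below illustrates the idea under a few assumptions: the ls_name_ci function name is hypothetical, it relies on ls printing a header such as my_dir: before each directory’s entries, as in the output above, and it would misread a filename that itself ends with a colon.

```shell
#!/bin/bash
# Hypothetical single-pass variant: tracks the current directory header while
# scanning the recursive listing, one entry per line
ls_name_ci() {
    local dir="$1" name="$2"
    ls -1R "$dir" | awk -v name="$name" '
        # A line ending in ":" is a directory header; remember it without the colon
        /:$/ { prefix = substr($0, 1, length($0) - 1); next }
        # Any other line is an entry of the current directory
        tolower($0) == tolower(name) { print prefix "/" $0 }
    '
}
```

Unlike ls_duplicates.sh, this variant prints paths relative to the current directory and doesn’t distinguish files from directories.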

9. Finding All Files With Duplicate Names

In this section, we’ll write a Bash script to find all duplicate occurrences for each filename within the my_dir directory.

Let’s start by taking a comprehensive view of the find_all_duplicates.sh Bash script:

$ cat find_all_duplicates.sh
#!/bin/bash
LOWERCASE_FILENAMES=($(find my_dir -type f | sed -n -E -e 's#(.*)/(.*)#\2#p' | tr 'A-Z' 'a-z' | sort -u))
SCRIPT="${1}"
EXTENSION="${SCRIPT##*.}"

for file in "${LOWERCASE_FILENAMES[@]}"
do
    if [[ $EXTENSION == "sh" ]]
    then
        # Run the helper script from the current directory
        CMD="./$SCRIPT $file my_dir"
    elif [[ $EXTENSION == "sed" ]]
    then
        # Generate a sed script for this filename and use it instead of $1
        SCRIPT=$(./sed_script_generator.sh "$file")
        CMD="find my_dir/ | sed -n -E -f $SCRIPT"
    elif [[ $EXTENSION == "awk" ]]
    then
        CMD="find my_dir | awk -v SEARCH_FILE=$file -f $SCRIPT"
    fi

    echo "Using $SCRIPT script to find duplicates for $file"
    eval "${CMD}"
done

Essentially, we can notice that the find_all_duplicates.sh script accepts an argument ($1) to define a script-specific command (CMD). Then, it uses the eval command to evaluate the CMD variable for each file from the LOWERCASE_FILENAMES array.

Finally, let’s see how to invoke the find_all_duplicates.sh for various scripts:

$ ./find_all_duplicates.sh find_duplicates.sh
$ ./find_all_duplicates.sh find.sed
$ ./find_all_duplicates.sh find_duplicates.awk
$ ./find_all_duplicates.sh ls_duplicates.sh
# output omitted

We must note that since sed scripts are generated on the fly, we can pass any filename with a .sed extension.
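Because the LOWERCASE_FILENAMES array holds every unique lowercase filename, the loop also runs for names that occur only once, such as def.jpeg. If we only care about names that actually repeat, we could narrow the list first. The following is a sketch with a hypothetical duplicate_names function:

```shell
#!/bin/bash
# Hypothetical helper: prints each lowercase filename that occurs more than
# once under the given directory when letter case is ignored
duplicate_names() {
    local dir="$1"
    # awk prints the lowercased last path component; uniq -d keeps repeated lines only
    find "$dir" -type f | awk -F/ '{ print tolower($NF) }' | sort | uniq -d
}
```

For the my_dir tree from our scenario, this would report abc.jpeg as the only duplicated name.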

10. Conclusion

In this article, we learned how to find files with duplicate names in any letter case using the find, ls, sed, awk, grep, and cut commands. Additionally, we learned to write awk, sed, and Bash scripts to solve the use case.
