1. Overview

Showing a character at a specific position in a file in Linux can be useful in various situations, particularly when we need to manipulate or extract specific data from a file.

In this tutorial, we’ll explore various command-line utilities to show characters at specific positions in a file.

2. Understanding the Scenario

Let’s imagine we’ve got a set of sample PDF files, namely, sample1.pdf, sample2.pdf, and sample3.pdf. If we check the first line from each of these files, we notice an interesting similarity, irrespective of the content within these files:

$ head -n 1 sample1.pdf sample2.pdf sample3.pdf
==> sample1.pdf <==
%PDF-1.3

==> sample2.pdf <==
%PDF-1.4

==> sample3.pdf <==
%PDF-1.5

Each of the files contains a PDF version in the first line. It’s part of the header information that precedes the actual content of the file. Further, the version information identifies the version of the PDF specification so that PDF reader applications can render them correctly.

Our goal is to extract the version value from the header information for the given set of PDF files. For this purpose, we’ll extract the characters at the 6th, 7th, and 8th positions and show them sequentially.
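To follow along without real PDF files, we can create stand-in files that contain only the header line. The sketch below is an assumption for experimentation: the workdir variable and the header-only files aren't part of the original samples, and real PDF files carry binary content after the header.

```shell
# Create stand-in files that hold only a PDF-style header line.
# Assumption: header-only files are enough to try out the commands.
workdir=$(mktemp -d)
i=1
for version in 1.3 1.4 1.5; do
    printf '%%PDF-%s\n' "$version" > "$workdir/sample$i.pdf"
    i=$((i + 1))
done

# Show the first line of each stand-in file
head -n 1 "$workdir"/sample*.pdf
```

Running the later commands from inside this directory lets us use the same file names as in the rest of the tutorial.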

3. Using head and tail

Using the -c option in the head command, we can get the first N bytes from the input stream. Similarly, we can use the tail command with the -c option to get the last N bytes:

$ head -c N
$ tail -c N

To get the version value from the header, we can first get the first 8 bytes using the head command. From these, we can extract the last 3 bytes using the tail command. Since the header consists of single-byte ASCII characters, each byte corresponds to exactly one character:

$ head -c8 sample1.pdf | tail -c3
1.3
$ head -c8 sample2.pdf | tail -c3
1.4
$ head -c8 sample3.pdf | tail -c3
1.5

Perfect! We got this one right.
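The same idea extends to any range of positions. As a minimal sketch, the hypothetical chars_between function below takes a file, a start position, and an end position (1-based, inclusive). It assumes the file has at least as many bytes as the end position; otherwise, tail would count from a shorter stream and return the wrong bytes.

```shell
# chars_between FILE START END
# Prints the bytes at positions START..END (1-based, inclusive) of FILE.
# Assumption: FILE contains at least END bytes.
chars_between() {
    head -c "$3" "$1" | tail -c "$(( $3 - $2 + 1 ))"
}

# Try it on a stand-in file that holds only a header line
demo=$(mktemp)
printf '%%PDF-1.3\n' > "$demo"
chars_between "$demo" 6 8; echo   # prints 1.3
```

The trailing echo only adds a newline, because the extracted bytes don't include one.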

4. Using read

In this section, let’s learn how we can use the read command for reading characters at specific positions in the file. To get the version information, we use the -n option of the Bash read builtin, which reads at most the given number of characters, and execute our approach in three stages.

In the first stage, we read the first 8 characters from the sample1.pdf file into the pdf_version_header variable:

$ IFS= read -n 8 pdf_version_header < sample1.pdf

At this step, let’s also verify the contents of the pdf_version_header variable:

$ echo "$pdf_version_header"
%PDF-1.3

Now, in the second stage, we read the first 5 characters from the sample1.pdf file into the version_prefix variable:

$ IFS= read -n 5 version_prefix < sample1.pdf

Again, let’s check the contents of the version_prefix variable:

$ echo "$version_prefix"
%PDF-

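Lastly, in the third stage, we apply parameter expansion using the ${parameter#prefix} syntax to remove version_prefix from the beginning of pdf_version_header. We quote the prefix so that the shell treats it as a literal string rather than a pattern:

$ echo "${pdf_version_header#"$version_prefix"}"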
1.3

That’s it! We got the expected results.

We must also note that we set IFS= in all read operations to use an empty string as the input field separator, so that read doesn’t trim leading or trailing whitespace from the input.
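To see the effect of the empty IFS, we can compare both variants on a line padded with spaces. This is only an illustration: the padded line and the variable names are made up for the comparison, and the -r option is added so that read leaves backslashes untouched.

```shell
# A header-like line with leading and trailing spaces (illustrative input)
line='  %PDF-1.3  '

# Default IFS: read strips leading and trailing whitespace
trimmed=$(printf '%s\n' "$line" | { read -r value; printf '[%s]' "$value"; })

# Empty IFS: read keeps the line exactly as it is
kept=$(printf '%s\n' "$line" | { IFS= read -r value; printf '[%s]' "$value"; })

printf '%s\n%s\n' "$trimmed" "$kept"
# prints:
# [%PDF-1.3]
# [  %PDF-1.3  ]
```

The square brackets make the surrounding spaces visible in the output.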

5. Using od

In this section, let’s explore the od command-line utility for reading characters at specific positions from the files.

Let’s start by looking at the -j and -N options that we can use to specify how many bytes to skip and read, respectively:

$ od -j <skip_bytes> -N <read_bytes>

Now, let’s use the -c option to show printable characters along with the -j and -N options for reading three bytes starting at the sixth byte of sample1.pdf:

$ od -c -j5 -N 3 sample1.pdf
0000005   1   .   3
0000010

At this stage, we see the version information, but the first column shows the address offset (in octal). Additionally, the characters in the version are separated by whitespace.

Lastly, let’s use the -An option to suppress the address offset and remove the unnecessary whitespace using the tr command:

$ od -An -c -j5 -N 3 sample1.pdf | tr -d ' '
1.3

The output looks as expected.
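We can wrap these options in a small helper for reuse. The od_chars function below is a hypothetical name, and the sketch assumes the selected bytes are printable and aren't spaces: od -c renders non-printable bytes as escapes such as \n, and tr would also delete genuine space characters.

```shell
# od_chars FILE SKIP COUNT
# Prints COUNT bytes of FILE after skipping SKIP bytes, without offsets or padding.
# -v keeps od from collapsing repeated output lines into "*".
od_chars() {
    od -An -v -c -j "$2" -N "$3" "$1" | tr -d ' \n'
}

# Try it on a stand-in file that holds only a header line
demo=$(mktemp)
printf '%%PDF-1.3\n' > "$demo"
od_chars "$demo" 5 3; echo   # prints 1.3
```

Since tr also removes the newlines that od emits, the helper works for ranges longer than a single od output line.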

6. Using awk

Awk is a powerful programming language for text processing. So, it’s natural to explore awk for getting the characters at a specific position in a file.

6.1. Getting the PDF Version

For our use case, we need to extract the characters at positions 6 to 8. So, let’s write an awk script and use the -v option to pass the start_position and end_position parameters to it:

$ awk -v start_position=6 -v end_position=8 \
'{
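    # substr() takes a start position and a length, not an end position
    version=substr($0, start_position, end_position - start_position + 1);
    print(version);
    exit;
}' sample1.pdf
1.3

We must note that the core logic relies on the substr() function to get the substring from the entire record ($0). Since the third argument of substr() is a length rather than an end position, we pass end_position - start_position + 1. Additionally, we exit early, as we don’t want to process the remaining content of the PDF file.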
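The difference between a length and an end position is easy to miss when the requested range reaches the end of the line. A short illustration with a made-up ten-character string shows how substr() interprets its third argument:

```shell
# Illustrative input; the string and variable names are made up
text='abcdefghij'

# Third argument as a length: 4 characters starting at position 3
by_length=$(printf '%s\n' "$text" | awk '{ print substr($0, 3, 4) }')

# Positions 3 to 4: convert the end position into a length first
by_range=$(printf '%s\n' "$text" |
    awk -v start_position=3 -v end_position=4 \
        '{ print substr($0, start_position, end_position - start_position + 1) }')

printf '%s\n%s\n' "$by_length" "$by_range"
# prints:
# cdef
# cd
```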

6.2. Generic Scenario

Let’s say we want to get characters at specific positions from the numbers.txt file:

$ cat numbers.txt
one
two
three
four
five
six
seven
eight
nine
ten

Our existing awk script exits after the first line, so it doesn’t handle the scenario where the positions go beyond the first line.

However, we can make minor modifications to our script to handle the generic scenario:

$ cat show_chars_specific_positions.awk
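{
    len = length($0);
    if (len >= start_position) {
        # Clamp the range to the current line; substr() expects a length
        from = (start_position < 1) ? 1 : start_position;
        to = (end_position > len) ? len : end_position;
        result = result substr($0, from, to - from + 1);
    }
    if (len >= end_position) {
        print(result);
        exit;
    }
    start_position -= len;
    end_position -= len;
}

We reduce start_position and end_position by the length of the current line until we reach the line that contains end_position. On each line that overlaps the range, we clamp the range to the line boundaries and append the matching characters to the result variable. We must note that the positions count only the characters within the lines, so newline characters aren’t included.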

Now, let’s verify our script with a few sample scenarios:

$ awk -v start_position=1 -v end_position=3 -f show_chars_specific_positions.awk numbers.txt
one
$ awk -v start_position=4 -v end_position=6 -f show_chars_specific_positions.awk numbers.txt
two
$ awk -v start_position=7 -v end_position=14 -f show_chars_specific_positions.awk numbers.txt
threefou

Fantastic! It looks like we’ve nailed this one.
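As a sanity check, we can compute the same ranges in a different way: strip the newlines with tr and reuse head and tail on the remaining stream. The chars_ignoring_newlines function is a hypothetical helper written for this comparison, and it assumes the file has at least as many non-newline characters as the end position.

```shell
# Recreate the numbers.txt dataset in a temporary file
numbers=$(mktemp)
printf '%s\n' one two three four five six seven eight nine ten > "$numbers"

# chars_ignoring_newlines FILE START END
# Prints positions START..END (1-based, inclusive), not counting newlines.
chars_ignoring_newlines() {
    tr -d '\n' < "$1" | head -c "$3" | tail -c "$(( $3 - $2 + 1 ))"
}

chars_ignoring_newlines "$numbers" 7 14; echo   # prints threefou
```

If both approaches agree on a few ranges that span several lines, we can be more confident about the position arithmetic in the awk script.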

7. Using sed and cut

Two alternative popular text-processing utilities, sed and cut, can help solve the current use case of getting the PDF version with ease. However, for more advanced use cases, it’s recommended to use one of the earlier approaches.

7.1. With sed

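For writing a sed script to get the version information, we can represent the header line using an extended regular expression:

^.{5}(.{3})$

We use the dot (.) symbol to denote any character and quantify its occurrence with the {} symbol. Further, we anchor the pattern with ^ (start of line) and $ (end of line) and group the last three characters using parentheses.

Next, let’s write a one-liner sed script and see it in action. Here, the -n option suppresses automatic printing, the -E option enables extended regular expressions, the p flag prints the line after a successful substitution, and the q command quits after the first line:

$ sed -n -E 's/^.{5}(.{3})$/\1/p; q' sample1.pdf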
1.3

Great! We solved our use case conveniently using substitution with grouping.
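We can also build the regular expression from a start and an end position. The sed_chars function below is a hypothetical helper that only looks at the first line of the file. Unlike the pattern above, it ends with .* instead of $, so the line may be longer than the requested range.

```shell
# sed_chars FILE START END
# Prints the characters at positions START..END (1-based, inclusive)
# of the first line of FILE.
sed_chars() {
    skip=$(( $2 - 1 ))          # characters before the range
    count=$(( $3 - $2 + 1 ))    # characters inside the range
    sed -n -E "s/^.{${skip}}(.{${count}}).*/\\1/p; q" "$1"
}

# Try it on a stand-in file that holds only a header line
demo=$(mktemp)
printf '%%PDF-1.3\n' > "$demo"
sed_chars "$demo" 6 8   # prints 1.3
```

If the first line is shorter than the end position, the substitution doesn’t match, and the function prints nothing.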

7.2. With cut

We can use the -c option available with the cut command to show the character at the Nth position:

$ cut -c N

Let’s extract the PDF header from the sample1.pdf file into the header_sample1 variable:

$ header_sample1="$(head -n 1 sample1.pdf)"

It’s important to note that the cut command operates line-by-line, so we use it specifically for the PDF header.

Now, we can write a for loop to get the 6th, 7th, and 8th character from the header string:

$ for pos in 6 7 8
do
    echo "$header_sample1" | cut -c $pos | tr -d '\n'
done
1.3

We got the correct result for the sample1.pdf file. Similarly, we can get the version value for the sample2.pdf and sample3.pdf files. Further, we must note that we used the tr command to remove the newline characters, as each invocation of the cut command adds a new line in the output.

In this approach, we called the cut command for each position. However, we can do this more efficiently by specifying a range of positions with the -c option. Let’s see this in action by fetching the characters at positions 6 to 8:

$ echo "$header_sample1" | cut -c 6-8
1.3

It looks convenient and far more efficient than our initial approach.
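To cover all three files in one go, we can combine head and cut in a loop. The sketch below uses stand-in files in a temporary directory, so the pdf_dir variable and the pdf_version function are assumptions for illustration only.

```shell
# Stand-in files that hold only a PDF-style header line
pdf_dir=$(mktemp -d)
printf '%%PDF-1.3\n' > "$pdf_dir/sample1.pdf"
printf '%%PDF-1.4\n' > "$pdf_dir/sample2.pdf"
printf '%%PDF-1.5\n' > "$pdf_dir/sample3.pdf"

# pdf_version FILE
# Prints the characters at positions 6 to 8 of the first line of FILE.
pdf_version() {
    head -n 1 "$1" | cut -c 6-8
}

for file in "$pdf_dir"/sample*.pdf; do
    printf '%s: %s\n' "${file##*/}" "$(pdf_version "$file")"
done
```

Piping through head first keeps cut away from the binary content that follows the header in a real PDF file.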

8. Conclusion

In this article, we learned how to show a character at a specific position in a file. Further, we applied the learning to a practical use case of identifying the version of the PDF specification for a set of PDF files.

Lastly, we explored command-line utilities, such as head, tail, sed, awk, od, cut, and read, while solving the use case.
