1. Overview

Showing a character at a specific position in a file in Linux can be useful in various situations, particularly when we need to manipulate or extract specific data from a file.

In this tutorial, we’ll explore various command-line utilities to show characters at specific positions in a file.

2. Understanding the Scenario

Let’s imagine we’ve got a set of sample PDF files, namely, sample1.pdf, sample2.pdf, and sample3.pdf. If we check the first line from each of these files, we notice an interesting similarity, irrespective of the content within these files:

$ head -1 sample1.pdf sample2.pdf sample3.pdf
==> sample1.pdf <==
%PDF-1.3

==> sample2.pdf <==
%PDF-1.4

==> sample3.pdf <==
%PDF-1.5

Each of the files contains a PDF version in the first line. It’s part of the header information that precedes the actual content of the file. Further, the version information identifies the version of the PDF specification so that PDF reader applications can render them correctly.

Our goal is to extract the version value from the header information for the given set of PDF files. For this purpose, we’ll extract the characters at the 6th, 7th, and 8th positions and show them sequentially.
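To follow along without real PDF documents, we can create small stand-in files. This is only a sketch: the files below contain a header line plus a placeholder body, so they aren't valid PDFs, and the demo_dir variable is an illustrative name:

```shell
# Create a scratch directory with three stand-in "PDF" files.
# Assumption: only the header line matters for this tutorial.
demo_dir="$(mktemp -d)"

i=1
for version in 1.3 1.4 1.5; do
    # %% prints a literal percent sign in a printf format string.
    printf '%%PDF-%s\nplaceholder body\n' "$version" > "$demo_dir/sample$i.pdf"
    i=$((i + 1))
done

# Show the first line of each file, as in the listing above.
head -n 1 "$demo_dir"/sample*.pdf
```

After changing into the scratch directory, the commands in the following sections can run against these files.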

3. Using head and tail

Using the -c option in the head command, we can get the first N bytes from the input stream. Similarly, we can use the tail command with the -c option to get the last N bytes:

$ head -c N
$ tail -c N

To get the version value from the header, first, we can get the first 8 characters using the head command. From these eight characters, we can extract the last 3 characters using the tail command:

$ head -c8 sample1.pdf | tail -c3
1.3
$ head -c8 sample2.pdf | tail -c3
1.4
$ head -c8 sample3.pdf | tail -c3
1.5

Perfect! We got this one right.
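We can generalize this pipeline to any inclusive range of byte positions. The following is a minimal sketch: chars_between is a hypothetical helper name, and it assumes the file contains at least as many bytes as the end position:

```shell
# Hypothetical helper: print the bytes from position start to end
# (1-based, inclusive) of a file.
chars_between() {
    file=$1
    start=$2
    end=$3
    # Keep the first "end" bytes, then the last (end - start + 1) of those.
    head -c "$end" "$file" | tail -c "$((end - start + 1))"
}

# Demo on a throwaway file that stands in for a PDF.
sample="$(mktemp)"
printf '%%PDF-1.7\nbody\n' > "$sample"

version="$(chars_between "$sample" 6 8)"
echo "$version"
```

If the file is shorter than the end position, tail counts back from the real end of the data, so the result no longer corresponds to the requested range.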

4. Using read

In this section, let’s learn how we can use the Bash built-in read command to read characters at specific positions in a file. To get the version information, we use the -n option with the read command and proceed in three stages.

In the first stage, we read the first 8 characters from the sample1.pdf file into the pdf_version_header variable:

$ IFS= read -n 8 pdf_version_header < sample1.pdf

At this step, let’s also verify the contents of the pdf_version_header variable:

$ echo "$pdf_version_header"
%PDF-1.3

Now, in the second stage, we read the first 5 characters from the sample1.pdf file into the version_prefix variable:

$ IFS= read -n 5 version_prefix < sample1.pdf

Again, let’s check the contents of the version_prefix variable:

$ echo "$version_prefix"
%PDF-

Lastly, in the third stage, we apply parameter substitution using the ${parameter#prefix} syntax to remove version_prefix from pdf_version_header:

$ echo "${pdf_version_header#"$version_prefix"}"
1.3

That’s it! We got the expected results.

Finally, we must note that we set IFS= in all read operations. This makes the input field separator an empty string, so read doesn’t trim leading or trailing whitespace from the data it reads.
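As a variation, we can read the header once and slice it with Bash substring expansion instead of reading twice and removing a prefix. This is a minimal sketch that assumes Bash, since both read -n and the ${variable:offset:length} syntax are Bash features, and it works on a throwaway file:

```shell
#!/usr/bin/env bash
# Throwaway file that stands in for a PDF.
sample="$(mktemp)"
printf '%%PDF-1.6\nbody\n' > "$sample"

# Read the first 8 characters once; -r keeps any backslashes literal.
IFS= read -r -n 8 header < "$sample"

# The offset is 0-based, so the 6th character sits at offset 5.
version="${header:5:3}"
echo "$version"
```

The trade-off is that the offset is 0-based, while the positions used elsewhere in this tutorial are 1-based.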

5. Using od

In this section, let’s explore the od command-line utility for reading characters at specific positions from the files.

Let’s start by looking at the -j and -N options, which we can use to specify how many bytes to skip and how many to read, respectively:

$ od -j <skip_bytes> -N <read_bytes>

Now, let’s use the -c option to show printable characters, along with the -j and -N options, to read bytes at a specific offset in sample1.pdf:

$ od -c -j5 -N 3 sample1.pdf
0000005   1   .   3
0000010

At this stage, we see the version information, but the first column also shows the address offset. Additionally, the characters in the version are separated by whitespace.

Lastly, let’s use the -An option to suppress the address offset and remove the unnecessary whitespace using the tr command:

$ od -An -c -j5 -N 3 sample1.pdf | tr -d ' '
1.3

The output looks as expected.
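One caveat is that tr -d ' ' also deletes genuine space characters from the selected range. As a hedged sketch on a throwaway file, we can capture the characters in a variable and, when whitespace matters, inspect the same bytes as hexadecimal values with the -t x1 format:

```shell
# Throwaway file that stands in for a PDF.
sample="$(mktemp)"
printf '%%PDF-1.4\nbody\n' > "$sample"

# Skip 5 bytes, read 3, and drop the offset column. Then remove the
# padding spaces and the trailing newline that od emits.
version="$(od -An -c -j 5 -N 3 "$sample" | tr -d ' \n')"

# The same three bytes as two-digit hex values: 1 is 31, . is 2e, 4 is 34.
hex_bytes="$(od -An -t x1 -j 5 -N 3 "$sample" | tr -d ' \n')"

echo "$version $hex_bytes"
```

The hex view is unambiguous even when the selected bytes are spaces or non-printable characters.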

6. Using awk

Awk is a robust programming language for text-processing use cases. So, it’s natural to explore awk for solving our use case for getting the characters at a specific position in a file.

6.1. Getting the PDF Version

For our use case, we need to extract the characters at positions 6 to 8. So, let’s write an awk script and use the -v option to pass the start_position and end_position parameters to it:

$ awk -v start_position=6 -v end_position=8 \
'{
    # substr() expects a length as its third argument, not an end position
    version=substr($0, start_position, end_position - start_position + 1);
    print(version);
    exit;
}' sample1.pdf
1.3

We must note that the core logic relies on the substr() function to get the substring from the entire record ($0). Additionally, we did an early exit as we didn’t want to process the remaining content of the PDF file.
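Because substr() takes a start position and a length rather than a start and an end, it helps to see the difference on a line that's longer than the header. The following sketch uses a made-up input line and illustrative variable names (first, last):

```shell
# A made-up line that continues after the version.
line='%PDF-1.3 trailing text'

# Third argument as a length: exactly three characters from position 6.
as_length="$(printf '%s\n' "$line" | awk '{ print substr($0, 6, 3) }')"

# Third argument mistaken for an end position: eight characters come back.
as_end="$(printf '%s\n' "$line" | awk '{ print substr($0, 6, 8) }')"

# Converting an inclusive range [first, last] into a length.
as_range="$(printf '%s\n' "$line" | awk -v first=6 -v last=8 \
    '{ print substr($0, first, last - first + 1) }')"

printf '%s\n' "$as_length" "$as_end" "$as_range"
```

On an 8-character header line, both forms print the same text because substr() stops at the end of the record, which can hide the difference.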

6.2. Generic Scenario

Let’s say we want to get characters at specific positions from the numbers.txt file:

$ cat numbers.txt
one
two
three
four
five
six
seven
eight
nine
ten

Our existing awk script exits after the first record, so it doesn’t handle the scenario where the positions go beyond the first line.

However, we can make minor modifications to our script to handle the generic scenario:

$ cat show_chars_specific_positions.awk
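{
    len = length($0);
    # Clamp the requested range to the current line.
    from = (start_position < 1) ? 1 : start_position;
    to = (end_position > len) ? len : end_position;
    if (from <= to) {
        # substr() takes a length, so we convert the inclusive range.
        result = result substr($0, from, to - from + 1);
    }
    if (len >= end_position) {
        print(result);
        exit;
    }
    start_position -= len;
    end_position -= len;
}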

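We reduce the start_position and end_position by the length of the current line until we reach the line that contains the end_position. Along the way, we append the characters that fall within the range to the result variable. We must note that the positions count only the characters within each line, so the newline characters aren’t counted.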

Now, let’s verify our script with a few sample scenarios:

$ awk -v start_position=1 -v end_position=3 -f show_chars_specific_positions.awk numbers.txt
one
$ awk -v start_position=4 -v end_position=6 -f show_chars_specific_positions.awk numbers.txt
two
$ awk -v start_position=7 -v end_position=14 -f show_chars_specific_positions.awk numbers.txt
threefou

Fantastic! It looks like we’ve nailed this one.
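We can cross-check such results with an independent pipeline. The sketch below recreates the dataset in a temporary file, removes the newlines so that the content becomes one long line, and then selects the range with cut. It assumes, like the awk script, that newlines aren't counted as positions:

```shell
# Recreate the numbers dataset in a temporary file.
numbers="$(mktemp)"
printf '%s\n' one two three four five six seven eight nine ten > "$numbers"

# Join all lines into one, then select positions 7 to 14.
joined="$(tr -d '\n' < "$numbers" | cut -c 7-14)"
echo "$joined"
```

This approach reads the whole file, so the awk script with its early exit remains the better fit for large inputs.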

7. Using sed and cut

Two alternative popular text-processing utilities, sed and cut, can help solve the current use case of getting the PDF version with ease. However, for more advanced use cases, it’s recommended to use one of the earlier approaches.

7.1. With sed

For writing a sed script to get the version information, we can represent the header line using a regular expression:

.{5}(.{3})$

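We use the dot (.) symbol to denote any character and quantify its occurrence with a count in curly braces, such as {5}. Further, we group the last three characters before $ (end of line) using parentheses. Because the pattern is anchored only at the end of the line, it relies on the header line being exactly eight characters long.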

Next, let’s write a one-liner sed script and see it in action:

$ sed -n -E "s/.{5}(.{3})$/\1/p; q;" sample1.pdf
1.3

Great! We solved our use case conveniently using substitution with grouping.
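If the header line might contain extra characters after the version, a variant anchored at the start of the line is a safer choice. This is a sketch rather than the only way to write it, and it uses a made-up header with trailing text:

```shell
# Throwaway file whose first line continues after the version.
sample="$(mktemp)"
printf '%%PDF-1.5 extra\nbody\n' > "$sample"

# ^ pins the match to the start of the line, the group captures
# positions 6 to 8, and .* swallows the rest of the line.
version="$(sed -n -E 's/^.{5}(.{3}).*/\1/p; q' "$sample")"
echo "$version"
```

Single quotes around the script keep the shell from interpreting $ and the backslash before sed sees them.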

7.2. With cut

We can use the -c option available with the cut command to show the character at the Nth position of each line:

$ cut -c N

Let’s extract the PDF header from the sample1.pdf file into the header_sample1 variable:

$ header_sample1="$(head -1 sample1.pdf)"

It’s important to note that the cut command operates line-by-line, so we use it specifically for the PDF header.

Now, we can write a for loop to get the 6th, 7th, and 8th character from the header string:

$ for pos in 6 7 8
do
    echo "$header_sample1" | cut -c $pos | tr -d '\n'
done
1.3

We got the correct result for the sample1.pdf file. Similarly, we can get the version value for the sample2.pdf and sample3.pdf files. Further, we must note that we used the tr command to remove the newline characters, as each invocation of the cut command ends its output with a newline.

In this approach, we called the cut command for each position. However, we can do this more efficiently by specifying a range of positions with the -c option. Let’s see this in action by fetching the characters at positions 6 to 8:

$ echo "$header_sample1" | cut -c 6-8
1.3

It looks convenient and far more efficient than our initial approach.
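To cover the whole set of files, we can wrap the range-based command in a small function and loop over the files. The sketch below uses stand-in files in a temporary directory, and pdf_version is a hypothetical helper name:

```shell
# Stand-in files; real PDFs would carry the same kind of header line.
demo_dir="$(mktemp -d)"
printf '%%PDF-1.3\nbody\n' > "$demo_dir/sample1.pdf"
printf '%%PDF-1.4\nbody\n' > "$demo_dir/sample2.pdf"
printf '%%PDF-1.5\nbody\n' > "$demo_dir/sample3.pdf"

# Hypothetical helper: print characters 6 to 8 of a file's first line.
pdf_version() {
    head -n 1 "$1" | cut -c 6-8
}

# Print one "name: version" line per file.
for f in "$demo_dir"/sample*.pdf; do
    printf '%s: %s\n' "$(basename "$f")" "$(pdf_version "$f")"
done
```

Restricting cut to the first line with head also avoids feeding it the binary body of a real PDF file.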

8. Conclusion

In this article, we learned how to show a character at a specific position in a file. Further, we applied the learning for a practical use case of identifying the version of PDF specification for a set of PDF files.

Lastly, we explored command-line utilities, such as head, tail, sed, awk, od, cut, and read, while solving the use case.