1. Introduction
The terminal is often the preferred way to work with Linux. Although a command-line interface (CLI) or terminal user interface (TUI) is text-based and simpler than a graphical user interface (GUI), we can still run into display problems when the encoding is set incorrectly.
In this tutorial, we look at the locale and ways to see the encoding set for the current terminal. First, we go over the basic idea of a locale. Next, we understand how Linux configures and uses it. After that, we explore the main command to check locale settings. Finally, we show how to check the encoding in different contexts.
We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments.
2. Locale
When it comes to software, a locale is a group of settings usually related to regional specifics:
- number formats like 1666000 versus 1,666,000
- character formats like ъ versus X
- date formats like 20.10.2010 versus 2010/10/20
- time formats like 16:56 versus 04:56PM
- currency formats like лв versus $
- paper sizes like A4 versus letter
- other settings
In most instances, a language code and a country code are enough to select a set of the above characteristics. For example, bg_BG might define the first from each set of examples in the items above, while en_US might define the second.
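To make these categories concrete, here is a minimal sketch that prints a few of the settings behind the list above with the locale command (covered in section 4). It assumes a glibc-style locale tool that accepts the -k flag. The show_formats helper name is ours, and C is used only because that locale is always available:

```shell
# Hypothetical helper: print number, date, and time settings for one locale.
# $1 must name an installed locale (see `locale -a`).
show_formats() {
  LC_ALL=$1 locale -k decimal_point thousands_sep d_fmt t_fmt
}

show_formats C
```

Swapping C for another installed locale, such as bg_BG.UTF-8 where available, shows how the same keywords change by region.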
2.1. POSIX Format and Standardization
POSIX defines a common format for locale names:
language[_territory][.codeset][@modifier]
Here, we see the abovementioned language code as language and country code as territory. However, there are two other optional parameters:
- codeset specifies the encoding
- modifier is a name for even more specific or custom variants of a locale
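As an illustration, a locale name of the form language[_territory][.codeset][@modifier] can be split into its parts with plain POSIX parameter expansion. This is only a sketch: parse_locale is a name we made up, and the sample names are illustrative rather than locales that must be installed:

```shell
# Hypothetical helper: split a locale name into its four components.
parse_locale() {
  name=$1
  modifier= codeset= territory=
  # Peel the optional parts off from right to left.
  case $name in
    *@*) modifier=${name#*@}; name=${name%%@*} ;;
  esac
  case $name in
    *.*) codeset=${name#*.}; name=${name%%.*} ;;
  esac
  case $name in
    *_*) territory=${name#*_}; name=${name%%_*} ;;
  esac
  language=$name
  printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
    "$language" "$territory" "$codeset" "$modifier"
}

parse_locale 'bg_BG.UTF-8'
parse_locale 'sr_RS.UTF-8@latin'
parse_locale 'C'
```

Parts that are missing from the name, as with the plain C locale, are simply left empty.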
In detail, code sets contain the encoding values for a character set.
Basically, to encode a character means to assign it a numerical value, also called a code point. Multiple code points can get grouped into code pages, otherwise known as character maps.
Simply assigning 1 to a, 2 to b, and continuing from there is a possible encoding. However, such an overly simplistic scheme would be impractical for computers.
When it comes to software character encodings, there are many:
- basic ASCII encoding
- Unicode encodings like UTF-8, UTF-16, UTF-32
- ISO encodings like ISO 8859-5
- extended Cyrillic KOI-8 encodings
- Windows encodings like Windows-1251
For example, a character in Unicode might not exist or might have a completely different value in another encoding. Hence, the same numeric value can translate to a different character, with a different visual appearance, depending on the encoding we assume.
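As a rough illustration of how one character maps to different numeric values, the sketch below converts the Cyrillic letter ъ into several of the encodings listed above and prints the resulting bytes in hexadecimal. It assumes iconv and od are installed and that iconv recognizes these encoding names (CP1251 is the usual iconv name for Windows-1251). bytes_in is a helper name chosen for this example:

```shell
# Hypothetical helper: show the bytes of ъ (U+044A) in the encoding named by $1.
# The letter is given as its UTF-8 bytes in octal (\321\212), so the script
# does not depend on the encoding of the file it is saved in.
bytes_in() {
  printf '\321\212' | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

for enc in UTF-8 UTF-16BE ISO-8859-5 KOI8-R CP1251; do
  printf '%-12s %s\n' "$enc" "$(bytes_in "$enc")"
done
```

Each line shows a different byte sequence for the same letter, which is why reading those bytes under the wrong encoding produces the wrong character.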
3. Linux Locale
Like any other operating system (OS), Linux offers options to change its locale via a set of environment variables:
- $LANG – general language specification
- $LC_CTYPE – character map, lowercase, uppercase, and alphanumeric detection
- $LC_NUMERIC – number formats
- $LC_TIME – date and time formats
- $LC_COLLATE – collation control and string comparison
- $LC_MONETARY – currency formatting
- $LC_MESSAGES – message control
- $LC_PAPER – paper sizes and formats
- $LC_NAME – person naming convention
- $LC_ADDRESS – format of addresses
- $LC_TELEPHONE – format of telephone numbers
- $LC_MEASUREMENT – measurement units and formats
- $LC_IDENTIFICATION – further customization
Further, $LC_ALL, when set, overrides every individual $LC_* variable, while $LANG supplies the default for any $LC_* variable that's missing. Also, the $LANGUAGE variable, which selects message translations, is independent and can even take precedence over $LC_ALL.
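The lookup for a single category can be sketched in a few lines of shell. This simplified version follows the POSIX order: $LC_ALL first, then the category's own variable, then $LANG, and finally the POSIX default. It deliberately leaves out $LANGUAGE, which GNU gettext consults only for message translations. effective_locale is a name we picked for the example:

```shell
# Hypothetical helper: print the locale name that applies to category $1
# (for example LC_CTYPE), based only on the environment variables.
effective_locale() {
  category=$1
  # Indirectly read the variable whose name is stored in $category.
  eval "own=\${$category:-}"
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "$own" ]; then
    printf '%s\n' "$own"
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"
  else
    printf 'POSIX\n'
  fi
}

effective_locale LC_CTYPE
```

The result only reflects what the variables request. Whether that locale is actually installed is a separate question.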
The encoding of any context can depend on many factors:
- user settings
- desired language
- current region
- device and software capability
On the last point, both the GUI and CLI can pose limitations when it comes to character presentation.
4. The locale Command
Indeed, the main Linux command to provide locale information is locale.
By default, locale returns the variable values we talked about earlier:
$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
For example, here, we have the fairly common UTF-8 encoding in an English-based (en) environment with the US-region formats.
To get a list of supported locales, we can use the -a or --all-locales flag:
$ locale --all-locales
C
C.UTF-8
en_US.utf8
POSIX
Notably, this is a very limited choice, but it covers all of UTF-8 in two of the four cases. The other two locales, American National Standards Institute (ANSI) C and POSIX, are predefined with 7-bit ASCII and values for the other $LC_* variables.
Notably, in these two locales, we can still represent non-ASCII values via escape sequences, but our current context might not be able to show them.
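To check which character map each installed locale actually uses, we can combine locale -a with the charmap keyword. The following is a sketch that assumes a locale tool supporting both, as on glibc systems. list_charmaps is our own helper name:

```shell
# Hypothetical helper: print every installed locale next to its character map.
list_charmaps() {
  locale -a | while IFS= read -r loc; do
    # Run locale once per entry, with that entry forced via LC_ALL.
    printf '%-16s %s\n' "$loc" "$(LC_ALL=$loc locale charmap 2>/dev/null)"
  done
}

list_charmaps
```

On a glibc system, the C and POSIX entries typically report ANSI_X3.4-1968, which is the formal name for 7-bit ASCII.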
To see other supported encodings, -m or --charmaps is useful:
$ locale --charmaps
ANSI_X3.110-1983
ANSI_X3.4-1968
ARMSCII-8
ASMO_449
BIG5
BIG5-HKSCS
BRF
[...]
In fact, in this case, we can pick from 236 character maps.
Finally, we’re able to add a list of space-separated category or keyword names to get targeted output:
$ locale LC_TIME
Sun;Mon;Tue;Wed;Thu;Fri;Sat
Sunday;Monday;Tuesday;Wednesday;Thursday;Friday;Saturday
Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec
January;February;March;April;May;June;July;August;September;October;November;December
AM;PM
%a %d %b %Y %r %Z
[...]
$ locale --category-name --keyword-name am_pm date_fmt
LC_TIME
am_pm="AM;PM"
LC_TIME
date_fmt="%a %d %b %Y %r %Z"
In the last example, -c or --category-name prepends a line with the category name before each output block, while -k or --keyword-name prepends the name of the keyword for the values, e.g., am_pm. Using locale --keyword-name with an LC_* category name, we can acquire lists of possible keyword names.
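For instance, a short sketch can reduce that output to just the keyword names by cutting each line at the first equals sign. It assumes the name="value" output format shown above. keywords_of is a hypothetical helper name:

```shell
# Hypothetical helper: list only the keyword names of category $1.
keywords_of() {
  locale -k "$1" | cut -d= -f1
}

keywords_of LC_NUMERIC
```

Any of the printed names can then be passed back to locale to query a single value.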
5. Encoding by Context
After looking at the locale, its constituents, as well as the main command to output their values, let’s continue with encoding checks in different contexts.
5.1. File Encoding
Let’s see a simple example with file:
$ file --mime /etc/hosts
/etc/hosts: text/plain; charset=us-ascii
Here, we verify the encoding of /etc/hosts is ASCII.
Of course, tools like vi can also tell us the same.
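For scripting, file can also print only the encoding through its -b and --mime-encoding options. The sketch below assumes file is installed and creates two throwaway files to compare. encoding_of is our own helper name, and the result is a guess that file makes from the content, not a guarantee:

```shell
# Hypothetical helper: print only the encoding that file detects for $1.
encoding_of() {
  file -b --mime-encoding "$1"
}

tmpdir=$(mktemp -d)
printf 'plain ASCII text\n' > "$tmpdir/ascii.txt"
# \321\212 are the UTF-8 bytes of the Cyrillic letter ъ.
printf 'hard sign: \321\212\n' > "$tmpdir/utf8.txt"

for f in "$tmpdir/ascii.txt" "$tmpdir/utf8.txt"; do
  printf '%s: %s\n' "${f##*/}" "$(encoding_of "$f")"
done

rm -r "$tmpdir"
```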
5.2. GUI Encoding
In most window management systems like the X Window system, we can configure our encoding. Consequently, text on any visual elements uses our setting by default.
Importantly, most GUI environments also have one or more terminal emulators. To set the encoding of a terminal emulator, we modify its settings in the interface or directly via a file.
For example, the GNOME Terminal has the Edit -> Preferences -> Encodings settings with their /org/gnome/terminal/legacy/encodings dconf key counterpart.
5.3. Terminal Encoding
Of course, we can always manually echo a specific $LC_* variable or $LANG:
$ echo $LC_CTYPE

$ echo $LANG
en_US.UTF-8
Notably, the value of $LC_CTYPE is empty. As we discussed, any empty $LC_* variables use the values of $LC_ALL, $LANG, and $LANGUAGE.
Moreover, the locale command can be useful to clear up any inheritance confusion:
$ locale --category-name --keyword-name charmap
LC_CTYPE
charmap="UTF-8"
Here, we verify the encoding as UTF-8.
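In a script, the same check is often reduced to a simple test on the charmap keyword. The following is a minimal sketch, with is_utf8_locale as a helper name of our own choosing:

```shell
# Hypothetical helper: succeed when the current locale's character map is UTF-8.
is_utf8_locale() {
  [ "$(locale charmap 2>/dev/null)" = "UTF-8" ]
}

if is_utf8_locale; then
  echo "UTF-8 locale: programs will emit UTF-8 text"
else
  echo "non-UTF-8 locale: $(locale charmap)"
fi
```

This only reports what programs are told to produce. The terminal emulator still has to be set to the same encoding for the text to display as intended.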
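Finally, some languages can report the encoding they detect for the terminal, e.g., Perl via the Term::Encoding module and Python via sys.stdout.encoding:
$ perl -e 'use Term::Encoding; print Term::Encoding::get_encoding();'
utf-8
$ python -c "import sys; print(sys.stdout.encoding)"
UTF-8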
6. Summary
In this article, we talked about locales, how they are used in Linux, and checking the encoding in different environments.
In conclusion, since they control the visual representation of information, knowing the current locale and encoding can be vital, especially in a terminal.