1. Overview

We have come to expect a great deal of reliability from data storage, so many users never even consider the possibility of losing data to hardware damage. Yet hard drive failures are a well-documented issue. In this tutorial, we will see how to check the health of our disks in Linux.

We will not address how to recover information from damaged disks, which is a different topic. Instead, we are interested in whether it is time to change our hard drives because they are in a pre-fail state or some malfunction has already occurred.

2. How to Identify Disks and Partitions

Everything is a file in Unix and its derivatives, such as Linux. All our hardware components correspond to device files, and the system communicates with them using these files.

Most device files are part of a virtual file system traditionally mounted on /dev. For instance, /dev/sda is usually the first disk in most Linux installations.

Device files can also match specific resources of devices, such as disk partitions. So, if our disk is /dev/sda, its first partition is /dev/sda1.
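
As a small illustration of this naming scheme, the sketch below builds the device file of a partition from the device file of its disk. The partition_node function is a hypothetical helper, and it assumes the usual Linux convention that a "p" separator is inserted only when the disk name itself ends in a digit, as happens with NVMe disks:

```shell
# partition_node DISK N: print the device file of partition N on DISK.
# Assumption: the usual Linux naming rule, where a "p" separator is added
# only when the disk name ends in a digit (e.g. nvme0n1 -> nvme0n1p1).
partition_node() {
    disk=$1
    num=$2
    case $disk in
        *[0-9]) printf '%sp%s\n' "$disk" "$num" ;;
        *)      printf '%s%s\n' "$disk" "$num" ;;
    esac
}

partition_node /dev/sda 1       # prints /dev/sda1
partition_node /dev/nvme0n1 2   # prints /dev/nvme0n1p2
```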

In the rest of this tutorial, we need to know which device files identify our disks and their partitions. Let’s look at some examples: /dev/hd[a-z], /dev/sd[a-z], /dev/vd[a-z], /dev/nst*, /dev/sg*, /dev/twe[0-9], and so on.

There are several methods to find all storage devices. First, let’s identify the disks via lshw:

# lshw -class disk
       description: ATA Disk
       product: SanDisk SD8SN8U2
       logical name: /dev/sda
       size: 238GiB (256GB)
       description: ATA Disk
       product: HGST HTS721010A9
       logical name: /dev/sdb
       size: 931GiB (1TB)
       description: SCSI Disk
       product: 024 HN-M101M
       logical name: /dev/sdc
       size: 931GiB (1TB)

In this case, we have three disks. The detailed information provided by lshw helps us tell them apart.
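
When a machine has many disks, a one-line summary per device can be easier to scan. The following is a minimal sketch, assuming each lshw entry starts with a description line and lists the product, logical name, and size fields as in the output above. The list_disks name is a hypothetical helper, not part of lshw:

```shell
# list_disks: read `lshw -class disk` output on stdin and print one line per
# disk as "<device file><TAB><product><TAB><size>".
# Assumption: each entry begins with a "description:" line and reports
# product, logical name, and size; entries without a size line are skipped.
list_disks() {
    awk '
        # Return the text that follows the "key: " prefix of a line.
        function value(line) { sub(/^[^:]*: */, "", line); return line }
        /^ *description:/  { name = product = "" }
        /^ *product:/      { product = value($0) }
        /^ *logical name:/ { name = value($0) }
        /^ *size:/         { printf "%s\t%s\t%s\n", name, product, value($0) }
    '
}

# Usage (as root):
#   lshw -class disk | list_disks
```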

Then, for each disk, we can list its partitions with lsblk. For example, selecting the NAME, KNAME, FSTYPE, MOUNTPOINT, and SIZE columns for /dev/sda gives:

sda                 sda                          238,5G
├─sda1              sda1  vfat        /boot/efi    512M
├─sda2              sda2  ext4        /boot        732M
└─sda3              sda3  crypto_LUKS            237,3G
  └─sda3_crypt      dm-0  LVM2_member            237,2G
    ├─vgmint-root   dm-1  ext4        /          236,3G
    └─vgmint-swap_1 dm-2  swap        [SWAP]       976M

In this case, /dev/sda holds an encrypted LUKS partition (sda3) that contains an LVM volume group with a root and a swap volume. This tree-like view of partitions is quite intuitive. The first two columns are NAME and KNAME: the former is the device name, while the latter is the internal kernel device name.
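
To see where the device name and the kernel name diverge, we can ask lsblk for raw output without headings (the -r, -n, and -o options) and compare the two columns. This is only a sketch, and mapped_names is a hypothetical helper name:

```shell
# mapped_names: read `lsblk -rno NAME,KNAME` output on stdin and print the
# entries whose device name differs from the kernel name, which is typical
# of device-mapper devices such as LUKS and LVM volumes.
mapped_names() {
    awk '$1 != $2 { printf "%s -> %s\n", $1, $2 }'
}

# Usage:
#   lsblk -rno NAME,KNAME /dev/sda | mapped_names
```

On the disk shown above, this would list the entries under sda3, such as sda3_crypt and the two LVM volumes, next to their dm-N kernel names.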

3. smartctl and gsmartcontrol

Disk drive manufacturers have taken a giant step toward predicting reliability in disk drives through Self-Monitoring, Analysis and Reporting Technology (SMART), included in most modern ATA/SATA, SCSI/SAS, and NVMe disks.

Its primary function is to detect and report various indicators of drive reliability to anticipate imminent hardware failures. It also carries out different types of drive self-tests.

The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using SMART.

We will focus on smartctl, which is meant for interactive use, while smartd is a daemon to monitor SMART continuously.

gsmartcontrol is a graphical user interface for smartctl.

3.1. SMART Attributes With Thresholds

SMART technology includes a series of attributes, or diagnostics, chosen for each drive model. Tailoring the attributes to the model is important because drive architectures vary from one model to another.

Let’s see an example using the --all flag, which prints all SMART information about a physical disk. A virtual disk exposes no SMART data, so looking for the same information there is neither meaningful nor possible.

Since smartctl is very verbose, we’ve filtered the following output to highlight the most relevant information:

# smartctl --all /dev/sdc
Model Family:     Seagate Samsung SpinPoint M8 (AF)
Device Model:     ST1000LM024 HN-M101MBB
Serial Number:    S31CJ9DF228176
LU WWN Device Id: 5 0004cf 20c9ecae7
Firmware Version: 2BA30001
User Capacity: bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu May 19 12:15:17 2022 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
SMART overall-health self-assessment test result: PASSED
Auto Offline Data Collection: Disabled.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 215) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       46
  2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
  3 Spin_Up_Time            0x0023   090   090   025    Pre-fail  Always       -       3158
  4 Start_Stop_Count        0x0032   078   078   000    Old_age   Always       -       22556
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0

First, the “start of information section” informs us that SMART support is available and enabled. That’s good.

After that, we read that the “SMART overall-health self-assessment test result” is “PASSED”. That’s fine: the drive itself doesn’t predict an imminent failure, so the disk is still usable.

“Auto Offline Data Collection” periodically updates specific SMART attributes, but it could impact performance. It’s disabled, and we can leave it that way. However, if we want to enable it, we can run “smartctl -s on -o on -S on /dev/sdc”, where -s on enables SMART, -o on enables automatic offline data collection, and -S on enables attribute autosave.

Further on, we see the estimated times for performing the self-tests: 2 minutes for the short one and about three and a half hours (215 minutes) for the extended one.
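
These estimates can also be extracted and converted to hours with a small filter. The sketch below assumes the two-line layout shown in the output above, with the test name on one line and “recommended polling time” on the next. The polling_times name is a hypothetical helper:

```shell
# polling_times: read smartctl output on stdin and print the recommended
# polling time of each self-test routine as hours and minutes.
polling_times() {
    awk '
        / self-test routine/ { kind = $1 }      # e.g. "Short" or "Extended"
        /recommended polling time/ {
            minutes = $0
            gsub(/[^0-9]/, "", minutes)         # keep only the digits
            printf "%s: %dh %02dmin\n", kind, minutes / 60, minutes % 60
        }
    '
}

# Usage (as root):
#   smartctl --all /dev/sdc | polling_times
```

For the disk above, the extended test comes out as 3h 35min.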

Later in this tutorial, we will run a short test. We should be cautious with extended tests because such a long scan puts additional load and wear on the disk. On larger disks, they can take ten hours or more.

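The “Vendor Specific SMART Attributes with Thresholds” table needs careful reading. The “Pre-fail” and “Old_age” labels describe the type of each attribute, not the current state of the disk: a Pre-fail attribute signals an imminent failure only when its normalized value drops to or below its threshold, while an Old_age attribute tracks normal wear. In this output, every value is above its threshold and no attribute is marked as failed. Still, the raw Start_Stop_Count of 22556 tells us that the disk has seen a lot of use.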

Let’s look at the first attribute, “Raw_Read_Error_Rate”. It is mainly relevant to traditional hard drives, where it can indicate problems with the disk surface that may lead to permanent data loss.

We can consult the “Known ATA SMART attributes” chart for more information.
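
Whether an attribute is actually failing depends on its normalized VALUE compared with its THRESH, not on the TYPE label. As a rough sketch, the filter below applies that comparison to every row of the table. It assumes the standard column layout printed by smartctl, and check_attributes is a hypothetical helper name:

```shell
# check_attributes: read the SMART attribute table from `smartctl -A` (or
# `--all`) on stdin and print "<name> <type> <state>" for every attribute.
# An attribute is reported as FAILING when its normalized VALUE (column 4)
# is at or below its THRESH (column 6).
check_attributes() {
    awk '
        # Attribute rows start with a numeric ID and have a hex FLAG in column 3.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            state = ($4 + 0 <= $6 + 0) ? "FAILING" : "ok"
            print $2, $7, state
        }
    '
}

# Usage (as root):
#   smartctl -A /dev/sdc | check_attributes
```

Applied to the five rows shown above, every attribute is reported as ok, since each value is above its threshold.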

3.2. Smartmontools’ Database

The Vendor Specific SMART Attributes with Thresholds table reported by smartctl in the previous section relies on the Smartmontools’ database for ATA/SATA drives. In contrast, SCSI/SAS and NVMe drives do not provide ATA/SATA-like SMART Attributes, so the database doesn’t contain any entries for these drives.

The -P show option shows whether the drive is in the database and, if so, its presets, and then exits:

# smartctl -P show /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-110-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

Drive found in smartmontools Database.  Drive identity strings:
MODEL:              ST1000LM024 HN-M101MBB
FIRMWARE:           2BA30001
match smartmontools Drive Database entry:
MODEL REGEXP:       ST(250|320|500|640|750|1000)LM0[012][124] HN-M[0-9]*MBB|APPLE HDD ST500LM012
MODEL FAMILY:       Seagate Samsung SpinPoint M8 (AF)
ATTRIBUTE OPTIONS:  None preset; no -v options are required.

However, let’s suppose our ATA/SATA drive is not in the database. In that case, the attributes’ names displayed in the ATTRIBUTE_NAME column and the format of the raw attributes shown in the RAW_VALUE column may be incorrect. This issue is primarily cosmetic.

Luckily, the essential drive health monitoring/testing functionality of Smartmontools does not depend upon the database.
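
In a script, we may want to know whether a drive is in the database before trusting the attribute names. A minimal sketch, assuming the “Drive found in smartmontools Database” line appears exactly as in the output above (in_drive_db is a hypothetical helper name):

```shell
# in_drive_db: read `smartctl -P show` output on stdin and succeed only if
# it reports that the drive was found in the smartmontools database.
in_drive_db() {
    grep -q 'Drive found in smartmontools Database'
}

# Usage (as root):
#   smartctl -P show /dev/sdc | in_drive_db && echo "drive is in the database"
```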

3.3. Self-Tests

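Let’s do a short self-test. smartctl doesn’t display a progress bar: the command returns immediately while the drive runs the test in the background. So, we wait for the announced completion time before reading the results. In the meantime, the “Self-test execution status” field printed by smartctl -c reports how much of the test remains: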

# smartctl -t short /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-110-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Thu May 19 12:42:54 2022 CEST
Use smartctl -X to abort test.

After about two minutes, we can read the results from the self-test log. Two entries appear because we had previously performed the same test:

# smartctl -l selftest /dev/sdc
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-110-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2575         -
# 2  Short offline       Completed without error       00%      2533         -

The error-free completion status is good news, and it is consistent with the attribute table, where every normalized value is above its threshold.
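
When the log grows, we can filter it so that only the entries needing a closer look remain. The sketch below assumes log entries start with “#” followed by the entry number, as in the output above. The suspicious_selftests name is a hypothetical helper:

```shell
# suspicious_selftests: read `smartctl -l selftest` output on stdin and print
# every log entry whose status is not "Completed without error".
# Note: this also prints tests that are still running or were aborted.
suspicious_selftests() {
    awk '/^# *[0-9]+/ && !/Completed without error/'
}

# Usage (as root):
#   smartctl -l selftest /dev/sdc | suspicious_selftests
```

For the two entries above, the filter prints nothing.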

3.4. gsmartcontrol

We will not delve into the use of gsmartcontrol because it provides us with the same information and functionalities as smartctl. It’s just more convenient to use because it has a graphical interface. However, we should remember to run it with root permissions.

When we open gsmartcontrol, we can choose which disks to analyze. We have the essential information to recognize the disk and the ability to enable or disable SMART and Auto Offline Data Collection.

In this example, we choose the same disk we saw earlier:

[Figure: gsmartcontrol screenshot 1]

Then let’s see all the basic SMART information:

[Figure: gsmartcontrol screenshot 2]

We can note that the negotiated speed of the SATA interface is lower than the maximum supported one, as reported in the row “SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)”. This is a fairly common situation and is discussed in various forums and articles. It is not a bottleneck in this case, though: a 3.0 Gb/s link carries about 300 MB/s of payload, roughly twice the maximum “Media to/from Buffer” speed of 145MB/s stated in the datasheet provided by the manufacturer:

[Figure: hard disk datasheet example]

Moreover, we can easily run short and extended tests:

[Figure: gsmartcontrol screenshot 3]

SMART attributes are also easy to consult:

[Figure: gsmartcontrol screenshot 4]

As we saw in the previous screenshots, the main benefit of gsmartcontrol is that it spares us from having to specify smartctl options. In many cases, the defaults will be fine. When they aren’t, gsmartcontrol lets us pass custom options to smartctl, as reported on the troubleshooting page.

4. Gnome Disk Utility

We can install and run Gnome Disk Utility in desktop environments other than Gnome, such as XFCE, Cinnamon, etc. So, despite its name, using Gnome is not mandatory. Let’s remember to use root permissions.

It has a lot of functionality, but here we are interested in its disk health report. Let’s select the disk we want to check from the list of storage devices on the left. Gnome Disk Utility will show the disk’s information, partitions, and status. Then, “SMART Data & Self-Tests” is available in the three-dot menu button. Ideally, the “Overall Assessment” reports that “Disk is OK.” In this window, we can also start a self-test:

[Figure: Gnome Disk Utility (disk health status, SMART info)]

We will not elaborate further on this tool because what we previously said about SMART applies here too.

5. Badblocks

SMART information is not available on all commonly used devices. For example, USB flash drives do not have it.

In such cases, the first thing to consider is that inexpensive devices such as USB sticks are not suitable for the permanent storage of our data. They are better suited to temporarily moving data from one computer to another, without any claim to durability. In fact, compared to hard drives, they tend to break more easily: their internal components are likely to fail before the NAND memory reaches its read and write limits.

When there is no SMART information, we can do a device health check with badblocks. It is part of the e2fsprogs project, and e2fsck runs it when given the -c option to scan for bad blocks and prevent data from being stored on them.

When run as a standalone program, badblocks gives a list of blocks with problems, if any. It might take a long time to complete.
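
If we save that list with the -o option of badblocks, a short helper can summarize it afterwards. This is only a sketch: summarize_badblocks is a hypothetical name, and the file is assumed to contain one bad block number per line, which is what -o writes:

```shell
# summarize_badblocks FILE: summarize a list written by `badblocks -o FILE`,
# where each line holds the number of one bad block.
summarize_badblocks() {
    count=$(awk 'END { print NR }' "$1")
    if [ "$count" -eq 0 ]; then
        echo "no bad blocks"
    else
        echo "$count bad blocks, first at $(head -n 1 "$1")"
    fi
}

# Usage (as root, on an unmounted partition):
#   badblocks -sv -o /tmp/bad.list /dev/sdd1
#   summarize_badblocks /tmp/bad.list
```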

We should remember to execute badblocks only on a partition that is not mounted. Root permissions are required. The -s flag shows the progress, and the -v flag enables verbose output, which includes the number of errors found:

# badblocks -sv /dev/sdd1
Checking blocks 0 to 15247359
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)

In this example, /dev/sdd1 is the only partition on the tested USB stick. This check was read-only. badblocks can also perform read-write tests, either non-destructive (-n) or destructive (-w), and we can consult its documentation before trying them.
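
Since badblocks should only run on an unmounted partition, a script can check that first. The sketch below looks for the device among the mount sources listed in /proc/mounts. The is_mounted name and its optional second argument are hypothetical, and the latter exists only to make the function easy to try on sample data:

```shell
# is_mounted DEVICE [MOUNTS_FILE]: succeed if DEVICE is listed as a mount
# source in MOUNTS_FILE (default: /proc/mounts).
# Assumption: the device appears there under the same path we pass in.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Usage (as root), scanning only when the partition is not mounted:
#   is_mounted /dev/sdd1 || badblocks -sv /dev/sdd1
```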

6. Conclusion

In this article, we saw how to check the health of our disks. Our best friends are SMART or, when not available, badblocks.

We should never think of our devices as eternally reliable memories. Instead, our suggestion for not losing data is to perform regular backups on multiple media.