When Was Your Hard-Disks’ Last Checkup?


Periodic Hard-Drive Checks using smartmontools

Which do you prefer: knowing about a problem before it happens or finding out after-the-fact?

I like to know ahead of time, particularly if I can do something about it!

Let me introduce a few handy Linux tools in this article to help you monitor your hard-disks.

Tools that will (hopefully) alert you to potential problems, giving you time to replace your hardware.

Before bad things happen.

Install, Configure and Run smartmontools

We will start by installing a handy-dandy set of utilities and services called smartmontools.

See this article for details along with this article, this one, another one, and that one for additional set-up and configuration steps.

Begin by installing smartmontools:

sudo apt-get install smartmontools

Inspect one of your hard-drives:

sudo smartctl -i /dev/<drive_name>

In my case, the drive I tested was /dev/sda.

Enable SMART support on the drive:

sudo smartctl -s on /dev/<drive_name>

Repeat this command for any additional drives you wish to check with smartmontools such as <drive_name1>, <drive_name2>, etc.
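If you have several drives, a small loop saves retyping the command. The sketch below is a hypothetical helper (enable_smart_all is a name made up for this example, and the device names in the usage line are placeholders). The SMARTCTL variable is also an invention of this sketch: it defaults to sudo smartctl, and setting it to echo smartctl gives you a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: enable SMART on every drive given as an argument.
# SMARTCTL defaults to "sudo smartctl"; set SMARTCTL="echo smartctl" for a dry run.
enable_smart_all() {
  local dev status=0
  for dev in "$@"; do
    # Left unquoted on purpose so "sudo smartctl" splits into two words
    if ! ${SMARTCTL:-sudo smartctl} -s on "$dev"; then
      echo "Could not enable SMART on $dev" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage (replace with your own drives):
#   enable_smart_all /dev/sda /dev/sdb
```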

Determine the amount of time it will take to run various tests:

sudo smartctl -c /dev/<drive_name>

Perform a short test:

sudo smartctl -t short /dev/<drive_name>

Perform a long test:

sudo smartctl -t long /dev/<drive_name>

Once tests are completed, run the following command to get the result:

sudo smartctl -l selftest /dev/<drive_name>
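The steps above can be strung together in a small script that starts a test, waits for the time smartctl -c recommends, and then prints the log. This is a sketch under a few assumptions: it parses the "recommended polling time" lines that smartctl -c prints for ATA/SATA drives (NVMe output looks different, so the parse may come back empty), and polling_minutes and run_selftest are names invented for this example.

```shell
#!/usr/bin/env bash
# Read `smartctl -c` output on stdin and print the recommended polling time
# (in minutes) for the given routine: "Short" or "Extended".
polling_minutes() {
  awk -v kind="$1" '
    $0 ~ (kind " self-test routine") { found = 1 }
    found && /polling time/ {
      # The value looks like "(   2) minutes."; strip the parentheses and spaces
      if (match($0, /\( *[0-9]+\)/)) {
        value = substr($0, RSTART + 1, RLENGTH - 2)
        gsub(/ /, "", value)
        print value
      }
      exit
    }'
}

# Start a self-test ("short" or "long"), wait for the recommended time, then show the log.
run_selftest() {
  local dev="$1" type="${2:-short}" kind minutes
  if [ "$type" = long ]; then kind=Extended; else kind=Short; fi
  minutes=$(sudo smartctl -c "$dev" | polling_minutes "$kind")
  sudo smartctl -t "$type" "$dev" || return 1
  if [ -z "$minutes" ]; then
    echo "Could not read the polling time; check later with: sudo smartctl -l selftest $dev"
    return 0
  fi
  echo "Waiting about $minutes minute(s) for the $type test on $dev..."
  # One extra minute of margin on top of the recommended time
  sleep $(( (minutes + 1) * 60 ))
  sudo smartctl -l selftest "$dev"
}

# Usage:
#   run_selftest /dev/sda short
```

Keep in mind that the recommended polling time is only an estimate; a drive under load may take longer, in which case the log will still show the test as in progress.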

I encountered issues while testing certain SSDs; further work is needed to configure smartmontools properly for these types of drives. See this post for details.

Run smartmontools as a Daemon

Activate the smartmontools service via:

sudo nano /etc/default/smartmontools

Un-comment the line start_smartd=yes, or add it if it is not already present:

# Modified on YYYY-MM-DD by <author>.

start_smartd=yes

Press Ctrl+O to save and Ctrl+X to exit.
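If you would rather not edit the file by hand (for example, to apply the same change on several machines), a sed-based sketch like the one below does the same job. enable_start_smartd is a hypothetical name, the sketch relies on GNU sed's -i option, and it needs to run as root when pointed at the real /etc/default/smartmontools.

```shell
#!/usr/bin/env bash
# Set start_smartd=yes in the given file (default /etc/default/smartmontools):
# un-comment or overwrite an existing start_smartd line, or append one if none exists.
enable_start_smartd() {
  local file="${1:-/etc/default/smartmontools}"
  if grep -Eq '^[[:space:]]*#?[[:space:]]*start_smartd=' "$file"; then
    # GNU sed: edit the file in place
    sed -i -E 's/^[[:space:]]*#?[[:space:]]*start_smartd=.*/start_smartd=yes/' "$file"
  else
    printf 'start_smartd=yes\n' >> "$file"
  fi
}

# Usage (from a root shell):
#   enable_start_smartd
#   grep start_smartd /etc/default/smartmontools
```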

Configure the smartmontools service via:

sudo nano /etc/smartd.conf

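In this file, you can either un-comment the DEVICESCAN line (which tells smartd to scan for and monitor all detected drives) or write your own entries if you wish to check individual drives. If you write your own entries, comment out the DEVICESCAN line or place your entries above it, because smartd ignores any lines that follow DEVICESCAN: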

# Modified on YYYY-MM-DD by <author>.

/dev/<drive_name1> -a -H -l error -l selftest -f -s(S/../../7/12|L/../01/./01) -m root -M exec /usr/share/smartmontools/smartd-runner

/dev/<drive_name2> -a -H -l error -l selftest -f -s(S/../../7/14|L/../02/./01) -m root -M exec /usr/share/smartmontools/smartd-runner

/dev/<drive_name3> -a -H -l error -l selftest -f -s(S/../../7/16|L/../03/./01) -m root -M exec /usr/share/smartmontools/smartd-runner

/dev/<drive_name4> -a -H -l error -l selftest -f -s(S/../../7/18|L/../04/./01) -m root -M exec /usr/share/smartmontools/smartd-runner

The first line in the above list schedules a short test on <drive_name1> every Sunday, starting between 12PM and 1PM, as well as a long test on the first of each month, starting between 1AM and 2AM. The subsequent lines follow the same pattern: each drive's short test starts two hours after the previous drive's (2PM, 4PM and 6PM), and its long test runs on the next day of the month (the 2nd, 3rd and 4th). See this Linux man page for additional parameters.
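With more than a couple of drives, typing these lines by hand gets error-prone. The hypothetical gen_smartd_conf function below prints one line per drive using the same staggering (short tests two hours apart on Sundays starting at noon, long tests on successive days of the month at 1AM). One deliberate difference: it puts a space between -s and its argument, matching the fix described in the Troubleshooting section. It refuses a seventh drive, since that would need hour 24, which does not exist.

```shell
#!/usr/bin/env bash
# Print smartd.conf entries for each drive given as an argument.
# Drive N (counting from 0) gets a short test on Sunday at hour 12+2N and a
# long test on day N+1 of each month at hour 01.
gen_smartd_conf() {
  local i=0 dev hour day
  for dev in "$@"; do
    if (( 12 + 2 * i > 23 )); then
      echo "Too many drives: no hour slot left for $dev" >&2
      return 1
    fi
    # Always two digits, otherwise the schedule can never match
    hour=$(printf '%02d' $(( 12 + 2 * i )))
    day=$(printf '%02d' $(( 1 + i )))
    printf '%s -a -H -l error -l selftest -f -s (S/../../7/%s|L/../%s/./01) -m root -M exec /usr/share/smartmontools/smartd-runner\n' \
      "$dev" "$hour" "$day"
    i=$(( i + 1 ))
  done
}

# Usage: review the output, then paste it into /etc/smartd.conf
#   gen_smartd_conf /dev/sda /dev/sdb /dev/sdc /dev/sdd
```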

Press Ctrl+O to save and Ctrl+X to exit.

Restart smartmontools service:

sudo /etc/init.d/smartmontools restart

Scheduling Explained

The commands to scan a device may appear cryptic but do not fear: there is a logical explanation for most things computer-related. Let’s unpack the values for the -s parameter in the following:

/dev/<drive_name1> -a -H -l error -l selftest -f -s(S/../../7/12|L/../01/./01) -m root -M exec /usr/share/smartmontools/smartd-runner

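The above command tells smartmontools to schedule both a short (S) and a long (L) test. The value passed to -s is an extended regular expression that smartd compares against a string of the form T/MM/DD/d/HH, where T is the test type. In that expression, a dot (.) matches any single character, so .. means "any month" or "any day of the month", and the | separates the alternative schedules. Scheduling parameters for these tests are as follows: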

-s (S/MM/DD/d/HH|L/MM/DD/d/HH)

MM

Is the month of the year, expressed with two decimal digits. The range is from 01 (January) to 12 (December) inclusive.

Do not use a single decimal digit or the match will always fail!

DD

Is the day of the month, expressed with two decimal digits. The range is from 01 to 31 inclusive.

Do not use a single decimal digit or the match will always fail!

d

Is the day of the week, expressed with one decimal digit. The range is from 1 (Monday) to 7 (Sunday) inclusive.

HH

Is the hour of the day, written with two decimal digits, and given in hours after midnight. The range is 00 (midnight to just before 1AM) to 23 (11PM to just before midnight) inclusive.

Do not use a single decimal digit or the match will always fail!
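According to the smartd.conf man page, a test runs only if all 12 characters of the T/MM/DD/d/HH string for the current hour match your expression. You can try a schedule out before restarting the service with grep -E -x, which likewise demands a whole-line match. This is a sketch that approximates smartd's matching with grep, not smartd's own code; schedule_matches and smartd_now are made-up names.

```shell
#!/usr/bin/env bash
# Succeed if the extended regex in $1 matches ALL of the string in $2,
# e.g. schedule_matches '(S/../../7/12|L/../01/./01)' 'S/06/15/7/12'
schedule_matches() {
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Print the T/MM/DD/d/HH string for the current hour; $1 is the test type (S or L).
# %u is the day of the week: 1 = Monday ... 7 = Sunday, as in smartd's "d" field.
smartd_now() {
  printf '%s/%s\n' "$1" "$(date '+%m/%d/%u/%H')"
}

# Usage:
#   schedule_matches '(S/../../7/12|L/../01/./01)' "$(smartd_now S)" && echo "short test due this hour"
```

The single-digit warning above is easy to confirm this way: a pattern such as L/../1/./01 never matches, because the day field in the string is always two characters wide.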

More Fun with Scheduling

The above examples are all fine and good, but what if you want to schedule a test at multiple times?

No problem.

To schedule a long self-test between 10-11PM on the first and eleventh day of each month, use:

-s L/../(01|11)/./22

To schedule a short self-test every day at midnight, 5AM, noon, and 5PM, use:

-s S/../.././(00|05|12|17)
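To see which hours of a particular day a pattern would fire on, you can loop over all 24 hours and test each T/MM/DD/d/HH string. The sketch below is a hypothetical matching_hours helper that again uses grep -E -x as a stand-in for smartd's own matching; it takes the pattern, the test type, and a date part in MM/DD/d form.

```shell
#!/usr/bin/env bash
# Print the hours (00-23) at which the regex $1 would match for test type $2
# (S or L) on the date given in $3 as MM/DD/d, e.g. 06/11/4.
matching_hours() {
  local regex="$1" type="$2" date_part="$3" h hours=()
  # seq -w pads to equal width: 00, 01, ..., 23
  for h in $(seq -w 0 23); do
    if printf '%s/%s/%s\n' "$type" "$date_part" "$h" | grep -Eqx -- "$regex"; then
      hours+=("$h")
    fi
  done
  echo "${hours[*]}"
}

# Usage:
#   matching_hours 'S/../.././(00|05|12|17)' S 06/15/7   # prints 00 05 12 17
#   matching_hours 'L/../(01|11)/./22' L 06/11/4         # prints 22
```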

Run smartmontools via Crontab

Here is a helpful article on how to configure smartmontools to run from a crontab. Personally, I prefer to configure recurring runs via smartd.conf. What is your set-up like?

Send E-Mail

You can configure your system to forward e-mail to an external address whenever errors occur while scanning your drives. Messages sent to root can then be redirected to an e-mail address of your choice.

The -m root directive in the examples above tells smartd to send its warning e-mails to the root user, from where they can be forwarded to other addresses. I smell another article coming on how to set up a server to auto-forward e-mail to the mailbox of your choice!

The -M Directive

If you run your hard-disk tests using smartmontools more than once a day and errors are detected, there is a good chance only one message will be sent. By default, smartd sends a single warning e-mail for each type of problem it detects; I think this is because the developers do not want to bog people down with too many error messages.

You can specify when and how messages are sent using the -M directive in your smartd.conf, which in the above examples is:

-M exec /usr/share/smartmontools/smartd-runner

Other options for -M (such as once, daily, diminishing, always and test) are described here.

Optional — GSmartControl

Install GSmartControl for graphical access to smartmontools, including a summary of tests; see this article for details.

Troubleshooting

Periodically check to ensure the smartmontools service has run as scheduled:

sudo smartctl -l selftest /dev/<drive_name>

I realised the SMART service was not running tests at or around the times I had specified in the config file. Was the service running properly? Was it a problem with how the service schedules its runs? I stumbled across the following page, which describes how to install and enable smartd.

Rather than running systemctl to enable smartd straight away, I thought: wait, how about checking the status of the service first?

Run the following command to get the status of your SMART service:

sudo systemctl status smartd

Hmmm. Results are not good. There appears to be a parameter issue.

I went back into the smartd configuration and added a space between -s and the opening parenthesis:

# Modified on YYYY-MM-DD by <author>.

/dev/<drive_name1> -a -H -l error -l selftest -f -s (S/../../7/12|L/../01/./01) -m root -M exec /usr/share/smartmontools/smartd-runner

/dev/<drive_name2> -a -H -l error -l selftest -f -s (S/../../7/14|L/../02/./01) -m root -M exec /usr/share/smartmontools/smartd-runner

/dev/<drive_name3> -a -H -l error -l selftest -f -s (S/../../7/16|L/../03/./01) -m root -M exec /usr/share/smartmontools/smartd-runner

/dev/<drive_name4> -a -H -l error -l selftest -f -s (S/../../7/18|L/../04/./01) -m root -M exec /usr/share/smartmontools/smartd-runner

Press Ctrl+O to save and Ctrl+X to exit.
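With many entries, it is easy to miss one. A grep/sed pair like the hypothetical check_schedule_spacing below lists any remaining -s( with no space and, with --fix, inserts the missing space. It assumes GNU sed, keeps a .bak copy when fixing, and must run as root against the real /etc/smartd.conf.

```shell
#!/usr/bin/env bash
# List lines in a smartd.conf-style file where "-s" is immediately followed by "(".
# With --fix as the second argument, insert the missing space in place first.
# Returns 0 when no such lines remain.
check_schedule_spacing() {
  local file="${1:-/etc/smartd.conf}" mode="$2"
  if [ "$mode" = --fix ]; then
    # GNU sed: edit in place, keeping a backup with a .bak suffix
    sed -i.bak -E 's/(^|[[:space:]])-s\(/\1-s (/g' "$file"
  fi
  # Print offending lines with their line numbers; succeed only if none are found
  ! grep -nE -- '(^|[[:space:]])-s\(' "$file"
}

# Usage (from a root shell):
#   check_schedule_spacing /etc/smartd.conf
#   check_schedule_spacing /etc/smartd.conf --fix
```

After restarting the service, sudo smartd -q showtests should print the upcoming scheduled tests for each device, which is a quick way to confirm the expressions are being read the way you intended.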

Restart smartmontools service:

sudo /etc/init.d/smartmontools restart

Next, re-check status by running:

sudo systemctl status smartd

Ahh, that’s much better!

Now smartd should run the tests on schedule without issues. It is always a good idea to verify this periodically via:

sudo smartctl -l selftest /dev/<drive_name>
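Rather than reading the whole log each time, you can filter it for entries that did not complete cleanly. The sketch below assumes the ATA-style layout that smartctl -l selftest prints (entries starting with # 1, # 2 and so on, followed by a Status column); selftest_problems is a hypothetical name. Tests that were aborted or are still in progress are listed too, which is arguably what you want when checking that the schedule is working.

```shell
#!/usr/bin/env bash
# Read `smartctl -l selftest` output on stdin and print every log entry whose
# status is not "Completed without error". Exit status is 1 if any were found.
selftest_problems() {
  awk '
    /^# *[0-9]+/ && !/Completed without error/ { print; bad = 1 }
    END { exit bad }'
}

# Usage:
#   sudo smartctl -l selftest /dev/sda | selftest_problems || echo "check /dev/sda"
```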

More Troubleshooting

Temperature is a useful measurement when monitoring the health of your hard-disks.

Too much heat spells trouble!

Not to worry: hard-disk temperature is one of the things reported by smartmontools.

If you happen to check some of the temperature values you may be in for a shock.

Like, what is with all these crazy-high temperature values being reported???

Do not panic; see this post for details. The values being reported are not the temperature in any regular unit but rather a number between 255 (cold) and 0 (hot).
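If you want an actual figure in degrees Celsius, many ATA/SATA drives report it in the RAW_VALUE column of attribute 194 (Temperature_Celsius) in the smartctl -A output, rather than in the normalized VALUE column; NVMe drives print a plain Temperature: line instead. The drive_temp_c helper below is a hypothetical sketch built on those assumptions. Some vendors encode the raw value differently, so treat its output as a hint rather than gospel.

```shell
#!/usr/bin/env bash
# Read `smartctl -A` output on stdin and print the current temperature in Celsius.
drive_temp_c() {
  awk '
    # ATA/SATA: attribute ID 194; RAW_VALUE starts in the 10th column
    $1 == 194 { print $10; exit }
    # NVMe: a line such as "Temperature:   38 Celsius"
    $1 == "Temperature:" { print $2; exit }'
}

# Usage:
#   sudo smartctl -A /dev/sda | drive_temp_c
```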

Better Temperature Monitoring

If in doubt about the temperature your disks are running at, try hddtemp per this article.

Whoa, slow down there, champ: it looks like hddtemp is no longer available as of Ubuntu 22.04 LTS!

Not to worry.

You can opt for other packages, both GUI and command-line, to help monitor temperatures!

Install lm-sensors to check temperatures from the command line.

You can install Psensor from Ubuntu Software if you want to run a GUI.

See this post and this one for details.

Are my HDD Tests Operational?

Another useful thing to check is whether or not your hard-drive tests are running correctly. A bit of time had passed since I last received any warnings or errors resulting from one of my tests, so the first thing I did was ensure the daemon was in fact running on my machines:

sudo systemctl status smartd

Next, I inspected the log for additional details on most recent tests as per this post:

grep "smartd" /var/log/syslog*
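Two refinements, both offered as sketches: rotated logs are often gzip-compressed (syslog.2.gz and so on), which plain grep cannot search but zgrep can, and on systemd machines journalctl -t smartd shows the messages smartd logged under that identifier. To see how often tests are being skipped on each drive, the hypothetical count_skipped_tests below counts the skip messages, assuming each log line names the drive as Device: /dev/sdX (check this against your own log lines before relying on it).

```shell
#!/usr/bin/env bash
# Count "skip scheduled" messages per device in smartd log lines read from stdin.
# Assumes each line contains "Device: /dev/..." followed by a space or comma.
count_skipped_tests() {
  grep 'skip scheduled' | sed -nE 's/.*Device: ([^ ,]+).*/\1/p' | sort | uniq -c
}

# Usage (zgrep also reads uncompressed files):
#   zgrep -h smartd /var/log/syslog* | count_skipped_tests
#   journalctl -t smartd --no-pager | count_skipped_tests
```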

Hmm, interesting.

I am getting a lot of:

skip scheduled Short Self-Test; 10% remaining of current Self-Test.

I think my next step will be to change when my tests are scheduled to run; perhaps separating each test by two hours vice one hour will work better.

If you are still up for some fun with monitoring hard-disks, there are some cool apps that can improve the monitoring of your systems; Munin is showing a lot of promise.

Postscript

Hopefully this article was helpful in getting things set-up to monitor your hard-disks in Linux.

Have you come across a monitoring problem with your storage? What are your thoughts on the subject?
