Periodic Hard-Drive Checks using smartmontools
Which do you prefer: knowing about a problem before it happens or finding out after-the-fact?
I like to know ahead of time particularly if I can do something about it!
Let me introduce a few handy Linux tools in this article to help you monitor your hard-disks.
Tools that will (hopefully) alert you to potential problems, giving you time to replace your hardware.
Before bad things happen.
Install, Configure and Run smartmontools
We will start by installing a handy-dandy set of utilities and services called smartmontools.
See this article for details along with this article, this one, another one, and that one for additional set-up and configuration steps.
Begin by installing smartmontools:
sudo apt-get install smartmontools
Inspect one of your hard-drives:
sudo smartctl -i /dev/<drive_name>
In my case I used /dev/sda for my drive to be tested.
Enable SMART service:
sudo smartctl -s on /dev/<drive_name>
Repeat this command for any additional drives you wish to check with smartmontools, such as <drive_name1>, <drive_name2>, etc.
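If you have several drives, a small loop saves some typing. The following is a minimal sketch, not part of smartmontools: the function name enable_smart_on is hypothetical, and the drive names passed to it (sda, sdb) are placeholders for your own.

```shell
# Hypothetical helper: enable SMART on every drive named as an argument.
# Drive names such as sda or sdb are placeholders; substitute your own.
enable_smart_on() {
    for drive in "$@"; do
        sudo smartctl -s on "/dev/$drive"
    done
}

# Example call, left commented out so nothing runs by accident:
# enable_smart_on sda sdb
```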
Determine the amount of time it will take to run various tests:
sudo smartctl -c /dev/<drive_name>
Perform a short test:
sudo smartctl -t short /dev/<drive_name>
Perform a long test:
sudo smartctl -t long /dev/<drive_name>
Once tests are completed, run the following command to get the result:
sudo smartctl -l selftest /dev/<drive_name>
I encountered issues while testing certain SSDs. Further work is needed to properly configure smartmontools for these types of drives; see this post for details.
Run smartmontools as a Daemon
Activate the smartmontools service via:
sudo nano /etc/default/smartmontools
Un-comment the line start_smartd=yes, or add this line if it is not already present:
# Modified on YYYY-DD-MM by <author>.
start_smartd=yes
Hit Ctrl-O to save, Ctrl-X to exit.
Configure the smartmontools service via:
sudo nano /etc/smartd.conf
In this file you can un-comment the DEVICESCAN line containing the command to run the smartmontools service, or write your own commands if you wish to check individual drives:
# Modified on YYYY-DD-MM by <author>.
/dev/<drive_name1> -a -H -l error -l selftest -f -s(S/../../7/12|L/../01/./01) -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/<drive_name2> -a -H -l error -l selftest -f -s(S/../../7/14|L/../02/./01) -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/<drive_name3> -a -H -l error -l selftest -f -s(S/../../7/16|L/../03/./01) -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/<drive_name4> -a -H -l error -l selftest -f -s(S/../../7/18|L/../04/./01) -m root -M exec /usr/share/smartmontools/smartd-runner
The first command in the above list will perform a short test on <drive_name1> starting between 12PM and 1PM every Sunday, as well as a long test on the first of each month starting between 1AM and 2AM. Subsequent drives follow the same pattern: each short test starts two hours after the previous drive's, and each long test runs one day later in the month. See this Linux man page for additional parameters.
Hit Ctrl-O to save, Ctrl-X to exit.
Restart the smartmontools service:
sudo /etc/init.d/smartmontools restart
Scheduling Explained
The commands to scan a device may appear cryptic but do not fear: there is a logical explanation for most things computer-related. Let’s unpack the values for the -s parameter in the following:
/dev/<drive_name1> -a -H -l error -l selftest -f -s(S/../../7/12|L/../01/./01) -m root -M exec /usr/share/smartmontools/smartd-runner
The above command is telling smartmontools to schedule both a short and a long test. Scheduling parameters for these tests are as follows:
-s (S/MM/DD/d/HH|L/MM/DD/d/HH)
MM
Is the month of the year, expressed with two decimal digits. The range is from 01 (January) to 12 (December) inclusive.
Do not use a single decimal digit or the match will always fail!
DD
Is the day of the month, expressed with two decimal digits. The range is from 01 to 31 inclusive.
Do not use a single decimal digit or the match will always fail!
d
Is the day of the week, expressed with one decimal digit. The range is from 1 (Monday) to 7 (Sunday) inclusive.
HH
Is the hour of the day, written with two decimal digits, and given in hours after midnight. The range is 00 (midnight to just before 1AM) to 23 (11pm to just before midnight) inclusive.
Do not use a single decimal digit or the match will always fail!
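To get a feel for how these fields line up, you can build the same 12-character T/MM/DD/d/HH string with date and match it against a schedule expression using grep. This is only a sketch that approximates how smartd evaluates -s; the function names schedule_matches and now_stamp are hypothetical, and whole-string matching with grep -E -x is my assumption about the closest equivalent.

```shell
# Hypothetical helper: succeed if a T/MM/DD/d/HH string matches a schedule
# expression. grep -E uses extended regular expressions and -x requires the
# whole string to match; this approximates smartd, it is not smartd itself.
schedule_matches() {
    regex=$1
    stamp=$2
    printf '%s\n' "$stamp" | grep -Eqx -- "$regex"
}

# Build the string for the current hour with a short-test (S) prefix.
# %m = month, %d = day of month, %u = weekday (1 = Monday), %H = hour.
now_stamp() {
    date +'S/%m/%d/%u/%H'
}

# Would the short-test part of the example schedule be due this hour?
if schedule_matches '(S/../../7/12|L/../01/./01)' "$(now_stamp)"; then
    echo "short test would be due this hour"
else
    echo "no short test due this hour"
fi
```

The same helper shows why single digits fail: the expression L/../1/./01 is one character short, so it can never match a stamp such as L/05/01/3/01.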
More Fun with Scheduling
The above examples are all fine and good but what if you want to schedule a test for multiple times?
No problem.
To schedule a long self-test between 10PM and 11PM on the first and eleventh day of each month, use:
-s L/../(01|11)/./22
To schedule a short self-test after midnight, 5AM, noon, and 5PM every day, use:
-s S/../.././(00|05|12|17)
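Before committing an expression like these to smartd.conf, it can help to list the hours at which it would match. Here is a rough sketch under the same assumption as before, namely that grep -E -x is a fair stand-in for how smartd matches; matching_hours is a hypothetical name, and the prefix argument is an arbitrary test type, month, day and weekday.

```shell
# Hypothetical helper: print each hour (00-23) at which a schedule
# expression matches, for one fixed test type, month, day and weekday.
matching_hours() {
    regex=$1
    prefix=$2            # e.g. "S/01/01/1": short test, month 01, day 01, weekday 1
    h=0
    while [ "$h" -le 23 ]; do
        hh=$(printf '%02d' "$h")
        if printf '%s/%s\n' "$prefix" "$hh" | grep -Eqx -- "$regex"; then
            echo "$hh"
        fi
        h=$((h + 1))
    done
}

# Hours matched by the short-test example:
matching_hours 'S/../.././(00|05|12|17)' 'S/01/01/1'
```

For the short-test example this lists 00, 05, 12 and 17, one per line.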
Run smartmontools as a Crontab
Here is a helpful article on how to configure smartmontools to run from a crontab. Personally, I prefer to configure recurring runs via smartd.conf. What is your set-up like?
Send E-Mail
You can configure your system to forward email from root to an external address for any errors that occur while scanning your drives. Errors sent to root can then be re-directed to an e-mail address of your choice.
The -m directive followed by root in our above example tells smartmontools to forward all error messages to the root user, whereupon they can be forwarded to other addresses. I smell another article coming on how to set up a server to auto-forward e-mail to the mailbox of your choice!
The -M Directive
If you run your hard-disk tests using smartmontools more than once a day and errors are detected, there is a good chance only one message will be sent. I think this is because the developers do not want to bog people down with too many error messages.
You can specify when and how messages are sent using the -M directive in your smartd.conf, which in the above examples is:
-M exec /usr/share/smartmontools/smartd-runner
Other options for -M can be found here.
Optional — GSmartControl
Install GSmartControl for graphical access to smartmontools, including a summary of tests; see this article for details.
Troubleshooting
Periodically check to ensure the smartmontools service has run as scheduled:
sudo smartctl -l selftest /dev/<drive_name>
I encountered an issue with the SMART service when I realised it was not running on or around the times I specified in the config file. Was the service running properly? Was it a problem with how the service schedules its runs? I stumbled across the following page that described how to install and enable smartctl.
Rather than run systemctl to enable smartd, I thought ‘wait, how about checking the status of the service’?
Run the following command to get the status of your SMART service:
sudo systemctl status smartd
Hmmm. Results are not good. There appears to be a parameter issue.
The fix was to go back into the smartd configuration and add a space between the -s and (..):
# Modified on YYYY-DD-MM by <author>.
/dev/<drive_name1> -a -H -l error -l selftest -f -s (S/../../7/12|L/../01/./01) -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/<drive_name2> -a -H -l error -l selftest -f -s (S/../../7/14|L/../02/./01) -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/<drive_name3> -a -H -l error -l selftest -f -s (S/../../7/16|L/../03/./01) -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/<drive_name4> -a -H -l error -l selftest -f -s (S/../../7/18|L/../04/./01) -m root -M exec /usr/share/smartmontools/smartd-runner
Hit Ctrl-O to save, Ctrl-X to exit.
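If typing these lines by hand feels error-prone, they could be generated instead. The following is only a sketch: smartd_conf_lines is a hypothetical function of my own, it hard-codes the same directives used in this article, and it assumes short tests staggered two hours apart on Sundays from 12PM with long tests on consecutive days of the month at 1AM.

```shell
# Hypothetical generator: print one smartd.conf line per drive, mirroring
# the entries in this article. Short tests run on Sundays, two hours apart
# starting at 12; long tests run at 01 on consecutive days of the month.
# Assumes at most six drives so the short-test hour stays within 00-23.
smartd_conf_lines() {
    i=0
    for drive in "$@"; do
        short_hour=$(printf '%02d' $((12 + 2 * i)))
        long_day=$(printf '%02d' $((i + 1)))
        printf '/dev/%s -a -H -l error -l selftest -f -s (S/../../7/%s|L/../%s/./01) -m root -M exec /usr/share/smartmontools/smartd-runner\n' \
            "$drive" "$short_hour" "$long_day"
        i=$((i + 1))
    done
}

# Review the output before pasting it into /etc/smartd.conf:
smartd_conf_lines sda sdb
```

Note the space between -s and the opening parenthesis in the generated lines.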
Restart the smartmontools service:
sudo /etc/init.d/smartmontools restart
Next, re-check status by running:
sudo systemctl status smartd
Ahh, that’s much better!
Now smartd should run its tests on schedule without issues. It is always a good idea to verify periodically via:
sudo smartctl -l selftest /dev/<drive_name>
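The status check from this section can also be scripted, which is handy if you want a cron job or login script to nag you when the daemon is down. A minimal sketch, assuming the unit is reachable under the name smartd as in the commands above; smartd_is_running is a hypothetical function name.

```shell
# Hypothetical helper: succeed only if systemd reports the smartd unit as
# active. The --quiet flag suppresses output so only the exit status is used.
smartd_is_running() {
    systemctl is-active --quiet smartd
}

# Example use, left commented out:
# smartd_is_running || echo "smartd is not running" >&2
```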
More Troubleshooting
Temperature is a useful measurement when monitoring the health of your hard-disks.
Too much heat spells trouble!
Not to worry: hard-disk temperature is one of the things reported by smartmontools.
If you happen to check some of the temperature values you may be in for a shock.
Like, what is with all these crazy-high values for temperature being reported???
Do not panic; see this post for details. The values being reported are not the temperature in any regular unit but rather a number between 255 (cold) and 0 (hot).
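To see both numbers side by side, you can pull the temperature row out of the attribute table printed by smartctl -A. This is a sketch under a few assumptions of mine: that your drive reports temperature as attribute ID 194, that the normalized value is the fourth column, and that the raw value is the tenth. On many drives the raw value is in degrees Celsius, but this is vendor-dependent. The function name smart_temperature is hypothetical.

```shell
# Hypothetical helper: read "smartctl -A" output on stdin and print the
# normalized and raw values for attribute 194 (assumed to be temperature).
# Column positions follow the usual attribute table layout:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_temperature() {
    awk '$1 == 194 { print "normalized=" $4 " raw=" $10 }'
}

# Example use, left commented out:
# sudo smartctl -A /dev/sda | smart_temperature
```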
Better Temperature Monitoring
If in doubt about what temperature your disks are running at, try hddtemp per this article.
Whoah slow down there champ, looks like hddtemp is not available starting at Ubuntu 22.04 LTS!
Not to worry.
You can opt for other packages, both GUI and command-line, to help monitor temperature!
Install lm-sensors to check temperatures from the command line.
You can install Psensor from Ubuntu Software if you want to run a GUI.
See this post and this one for details.
Are my HDD Tests Operational?
Another useful thing to check is whether or not your hard-drive tests are running correctly. A bit of time had passed since I last received any warnings or errors resulting from one of my tests, so the first thing I did was ensure the daemon was in fact running on my machines:
sudo systemctl status smartd
Next, I inspected the log for additional details on most recent tests as per this post:
grep "smartd" /var/log/syslog*
Hmm, interesting.
I am getting a lot of:
skip scheduled Short Self-Test; 10% remaining of current Self-Test.
I think my next step will be to change when my tests are scheduled to run; perhaps separating each test by two hours rather than one hour will work better.
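To judge whether a schedule change helps, it is useful to know how often tests are being skipped. A small sketch that counts those messages; count_skipped_tests is a hypothetical name, and the pattern is based only on the log message shown above.

```shell
# Hypothetical helper: read log lines on stdin and count the ones where
# smartd reports skipping a scheduled self-test. grep -c exits non-zero
# when the count is 0, so "|| true" keeps the helper's status clean.
count_skipped_tests() {
    grep -c 'skip scheduled .* Self-Test' || true
}

# Example use, left commented out:
# grep "smartd" /var/log/syslog* | count_skipped_tests
```

Run it before and after changing the schedule to compare the counts.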
If you are still up for some fun with monitoring hard-disks, there are some cool apps that you can use to improve the monitoring of your systems; munin is showing a lot of promise.
Postscript
Hopefully this article was helpful in getting things set-up to monitor your hard-disks in Linux.
Have you come across a monitoring problem with your storage? What are your thoughts on the subject?