This repo contains a shell script and a configd action to collect SMART data from an NVMe drive and expose it to Prometheus via the Node Exporter plugin and textfile collector on an OPNsense router.
I've created a Grafana dashboard for visualizing these smartctl metrics. The most up-to-date version is in this repo as 24771.json.
It is also available at https://grafana.com/grafana/dashboards/24771, although the Grafana site can take some time to update.
- os-node_exporter: collects and exposes metrics to Prometheus
- os-smart: required for smartctl
- smart-metrics.sh: The data collection script. It runs smartctl, parses the output with Python, and writes a .prom file to the Node Exporter textfile directory.
- actions_smartmetrics.conf: Configuration file allowing the script to be triggered via OPNsense's configd system (and thus the Cron scheduler).
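To make the parsing step concrete, here is a minimal Python sketch of turning smartctl JSON into textfile-collector lines. It is not the code from smart-metrics.sh: the `FIELDS` mapping covers only a hypothetical subset of fields, and `render_prom` is a name chosen for illustration.

```python
import json

# Hypothetical subset of smartctl JSON keys mapped to exported metric names.
# The real script may cover more fields or structure this differently.
FIELDS = {
    "critical_warning": "node_disk_smart_critical_warning",
    "temperature": "node_disk_smart_temperature_celsius",
    "percentage_used": "node_disk_smart_percentage_used_percent",
    "power_on_hours": "node_disk_smart_power_on_hours_total",
}


def render_prom(smartctl_json, device):
    """Turn `smartctl -j -a` output into Prometheus text exposition lines."""
    log = json.loads(smartctl_json).get("nvme_smart_health_information_log", {})
    lines = []
    for key, metric in FIELDS.items():
        if key in log:  # skip fields the drive does not report
            lines.append(f'{metric}{{device="{device}"}} {log[key]}')
    return "\n".join(lines) + "\n"


# Stand-in for real smartctl output, so the sketch runs without a drive.
sample = json.dumps({
    "nvme_smart_health_information_log": {
        "critical_warning": 0,
        "temperature": 41,
        "percentage_used": 3,
        "power_on_hours": 1200,
    }
})
print(render_prom(sample, "nvme0"), end="")
```

Skipping missing keys rather than failing keeps the output usable on drives that report only part of the health log.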
- Grafana, Prometheus, and opnsense-exporter set up
- Install the required plugins listed above
- Add the script:
  - Upload or copy-paste smart-metrics.sh to your OPNsense router, place it in /usr/local/bin/, and make it executable with chmod +x /usr/local/bin/smart-metrics.sh
  - Modify the DEVICE variable to match your NVMe drive path if necessary.
  - Modify the SCRAPE_INTERVAL_MINUTES variable to match whatever interval you decide to use in Cron below.
- Add the action configuration: Upload or copy-paste actions_smartmetrics.conf to /usr/local/opnsense/service/conf/actions.d/
- Reload configd: To register the new action, restart the config daemon with service configd restart
You can test if the action is registered correctly by running configctl smartmetrics collect
This should generate the file at /var/tmp/node_exporter/smart_metrics.prom
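Beyond checking that the file exists, it can help to confirm that its contents parse. The following is a rough sketch, assuming each sample line carries only a `device` label; `parse_prom` and the sample text are illustrative and not part of this repo.

```python
import re

# Matches lines such as: metric_name{device="nvme0"} 41
# Assumption: `device` is the only label on each sample.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'\{device="(?P<device>[^"]*)"\}\s+(?P<value>\S+)$'
)


def parse_prom(text):
    """Return {(metric_name, device): value} for every sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError(f"unparseable line: {line!r}")
        samples[(m["name"], m["device"])] = float(m["value"])
    return samples


# Stand-in for the contents of smart_metrics.prom.
sample = """# HELP node_disk_smart_temperature_celsius Current drive temperature
# TYPE node_disk_smart_temperature_celsius gauge
node_disk_smart_temperature_celsius{device="nvme0"} 41
node_disk_smart_critical_warning{device="nvme0"} 0
"""
parsed = parse_prom(sample)
print(parsed[("node_disk_smart_temperature_celsius", "nvme0")])
```

To check the real file, pass the text read from /var/tmp/node_exporter/smart_metrics.prom to `parse_prom` instead of the sample.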
To collect metrics on a schedule:
- Go to System > Settings > Cron.
- Click the + button to add a new job.
- Command: Select Collect SMART Metrics for Node Exporter from the dropdown.
- Schedule: Set your interval; I use every 5 minutes. (You'll need to update the script's SCRAPE_INTERVAL_MINUTES variable to match this interval in minutes.)
- Click Save.
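Once the job is scheduled, one way to confirm it keeps running is to compare the age of the .prom file with the collection interval. The sketch below is only an illustration: this README does not describe how the script itself uses SCRAPE_INTERVAL_MINUTES, and `is_stale`, `prom_file_is_stale`, and the `grace` factor are hypothetical names and choices.

```python
import os
import time


def is_stale(mtime, now, interval_minutes, grace=2.0):
    """Return True when the file is older than `grace` collection intervals."""
    return (now - mtime) > interval_minutes * 60 * grace


def prom_file_is_stale(path, interval_minutes):
    """Apply is_stale to a real file's modification time."""
    return is_stale(os.path.getmtime(path), time.time(), interval_minutes)


# Worked example with fixed timestamps, assuming a 5-minute Cron interval.
now = 1_000_000.0
print(is_stale(now - 4 * 60, now, interval_minutes=5))   # False
print(is_stale(now - 11 * 60, now, interval_minutes=5))  # True
```

A grace factor of 2 tolerates one missed or delayed run before reporting the file as stale.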
The following metrics are exported and can be found in Prometheus. I use Grafana to visualize them. Each metric includes a device label (e.g. device="nvme0").
NOTE: These are all metrics found under the nvme_smart_health_information_log object of the JSON output of smartctl. If you're having problems, run smartctl -j -a /dev/nvme0 (or your device path) to see if your drive supports these metrics.
- node_disk_smart_critical_warning: Critical warning state (0 = good).
- node_disk_smart_temperature_celsius: Current drive temperature (°C).
- node_disk_smart_available_spare_percent: Available spare capacity (%).
- node_disk_smart_available_spare_threshold_percent: Available spare threshold (%).
- node_disk_smart_percentage_used_percent: Percentage of drive life used (%).
- node_disk_smart_data_units_read_total: Total data units read (units reported by firmware).
- node_disk_smart_data_units_written_total: Total data units written (units reported by firmware).
- node_disk_smart_host_read_commands_total: Total host read commands.
- node_disk_smart_host_write_commands_total: Total host write commands.
- node_disk_smart_controller_busy_time_minutes_total: Controller busy time (minutes).
- node_disk_smart_power_cycles_total: Total power cycles.
- node_disk_smart_power_on_hours_total: Total power-on hours.
- node_disk_smart_unsafe_shutdowns_total: Total unsafe shutdowns.
- node_disk_smart_media_errors_total: Total media and data integrity errors.
- node_disk_smart_error_log_entries_total: Total error log entries.
- node_disk_smart_warning_temp_time_minutes_total: Time (minutes) the drive has been above the warning temperature.
- node_disk_smart_critical_comp_time_minutes_total: Time (minutes) the drive has been above the critical temperature.
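The data-unit counters are easier to read once converted to bytes. In the NVMe specification one data unit is 1000 blocks of 512 bytes, so the sketch below assumes your drive's firmware follows that definition; `data_units_to_terabytes` is an illustrative helper, not something the script exports.

```python
# Per the NVMe specification, one data unit is 1000 * 512 bytes.
# Assumption: the drive's firmware reports units as the specification defines.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_terabytes(units):
    """Convert an NVMe data-unit counter to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12


print(data_units_to_terabytes(1_953_125))  # 1.0
```

The same conversion can be done in a Grafana panel or a PromQL expression by multiplying the counter by 512000.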
- Add support for SATA drives.
- If you're running a SATA drive, please contribute a sample output to smartctl-outputs.md
- Collect Model and Serial Number?
- Collect self test info?
- Add param for drive TBW and calculate estimated remaining life?
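As a rough idea of what the TBW item could look like, here is a sketch that estimates remaining write endurance from the data-units-written counter and power-on hours. Nothing here exists in the script today: `estimate_remaining` and `rated_tbw` are hypothetical, the conversion assumes the NVMe definition of a data unit (1000 * 512 bytes), and the projection assumes the past average write rate continues.

```python
def estimate_remaining(data_units_written, power_on_hours, rated_tbw):
    """Return (remaining_tb, remaining_hours) for a drive.

    remaining_hours is None when there is no write history to project from.
    """
    written_tb = data_units_written * 512_000 / 1e12  # data units -> decimal TB
    remaining_tb = max(rated_tbw - written_tb, 0.0)
    if power_on_hours <= 0 or written_tb <= 0:
        return remaining_tb, None
    tb_per_hour = written_tb / power_on_hours  # average write rate so far
    return remaining_tb, remaining_tb / tb_per_hour


# Example: 10 TB written over 1000 power-on hours on a drive rated for 600 TBW.
remaining_tb, remaining_hours = estimate_remaining(19_531_250, 1000, 600)
print(remaining_tb, round(remaining_hours))
```

The rated TBW has to come from the manufacturer's datasheet, since smartctl does not report it.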
