Pull etcd WAL fsync data from prometheus in clusters with rancher-monitoring installed #386

@axeal

For etcd IO issues, the WAL fsync graph on the etcd Grafana dashboard is a useful measure of IO performance over time, in contrast to the point-in-time check that fio provides. I suggest collecting this data via the log collector on clusters where rancher-monitoring is installed, to save asking users to check the Grafana dashboard manually. Below is just an example of collecting this data from the Prometheus service. It was generated by Gemini, is based on the Grafana dashboard query, and I tested it in a lab:

# Define variables for clarity
# (the address is the Prometheus service as seen in the lab; adjust it per cluster)
PROMETHEUS_URL="http://10.43.250.198:9090"
PROMQL_QUERY='histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="kube-etcd"}[1m])) by (instance, le))'

# No separate URL-encoding step is needed: curl's --data-urlencode encodes the query below

# Set time range (e.g., the last hour)
END_TIME=$(date +%s)
START_TIME=$((END_TIME - 3600))

# Make the API call (-G sends the parameters as a GET query string)
curl -G "${PROMETHEUS_URL}/api/v1/query_range" \
  --data-urlencode "query=${PROMQL_QUERY}" \
  --data "start=${START_TIME}" \
  --data "end=${END_TIME}" \
  --data "step=1m"
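
The query_range response is a JSON matrix with one series per etcd instance, which is awkward to read in a support bundle. As a rough sketch of how the collector could condense it, the helper below reduces "instance value" pairs to the peak p99 per instance. The function name summarise_fsync and the input layout are hypothetical. The 0.010 s default limit is an assumption taken from the commonly cited etcd guidance that p99 WAL fsync should stay under 10 ms, and it can be overridden with the first argument.

```shell
#!/bin/sh
# Sketch only: summarise_fsync is a hypothetical helper, not part of the log collector.
# It reads "instance value" lines on stdin (value = p99 fsync latency in seconds)
# and prints the peak per instance, marking peaks above a limit.
# The default limit of 0.010 s is an assumption; pass another value as $1.
summarise_fsync() {
  awk -v limit="${1:-0.010}" '
    $2 == "NaN" { next }   # skip samples where Prometheus had no data for the window
    !($1 in peak) || $2 + 0 > peak[$1] { peak[$1] = $2 + 0 }
    END {
      for (i in peak)
        printf "%s %.4f %s\n", i, peak[i], (peak[i] > limit + 0 ? "SLOW" : "ok")
    }' | sort
}

# Assumed usage, with jq flattening a saved query_range response (fsync.json):
#   jq -r '.data.result[] | .metric.instance as $i | .values[] | "\($i) \(.[1])"' \
#     fsync.json | summarise_fsync
```

The jq step in the trailing comment assumes jq is available on the node and that the curl output was saved to a file. Only the awk part runs as written.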
