Or, « What I’ve learned about Graphite configuration ».
Last week, I worked on configuring Graphite and had to understand how it stores and aggregates data. So here are a few facts.
Graphite Retention
The way our data will be stored is described in /opt/graphite/conf/storage-schemas.conf. As an example:
[default]
pattern = .*
retentions = 1s:30m,1m:1d,5m:2y
This worked great when I looked at data from the last 30 minutes.
When I tried to display the last hour of metrics: nothing.
Drawing null as zero only gave me a horizontal line at the bottom of the graph.
The magic of aggregation
This behaviour comes from the file /opt/graphite/conf/storage-aggregation.conf where we find the following lines:
[99_default_avg]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
Our problem comes from xFilesFactor. It means that by default, we need at least 50% of the data to be non-null to store an average value. Think about it.
So here, I'm storing one value per second for 30 minutes. If Graphite doesn't receive anything for a given second, the value is set to null. Fine, let's move forward.
For intervals longer than 30 minutes (and shorter than a day), Graphite serves data aggregated with the configured method. So it averages the data and sets the value to null if fewer than 50% of the values are usable (not null).
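To make the switch between archives concrete, here is a minimal sketch of how a requested time span maps to an archive, assuming the finest archive whose retention covers the whole span is the one used. The names ARCHIVES and pick_archive are hypothetical, and the fallback to the coarsest archive is a simplification for illustration.

```python
# Archives from the example schema, as (seconds_per_point, retention_seconds),
# ordered from finest to coarsest (1s:30m, 1m:1d, 5m:2y with 365-day years).
ARCHIVES = [(1, 30 * 60), (60, 24 * 3600), (300, 2 * 365 * 24 * 3600)]


def pick_archive(span_seconds, archives=ARCHIVES):
    """Return the first (finest) archive whose retention covers the span."""
    for seconds_per_point, retention in archives:
        if retention >= span_seconds:
            return seconds_per_point, retention
    # Simplification: if nothing covers the span, use the coarsest archive.
    return archives[-1]


print(pick_archive(30 * 60))  # (1, 1800): served from raw 1s data
print(pick_archive(60 * 60))  # (60, 86400): served from the aggregated 1m archive
```

This is why the 30-minute view and the one-hour view can behave so differently: they do not read the same archive.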
In our case, Graphite tries to average one minute of data (1m:1d) with the precision of 1s from the first retention rule (1s:30m). To understand why nothing is displayed, consider that Collectd is sending data to Graphite. On average, metrics arrive every 3s. Over a one-minute interval, we gather 20 values but Graphite expects 60, so 40 of them are null. Only 33% (0.33) of the values are usable, which is lower than the 50% Graphite requires, so the averaged value is set to null.
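The arithmetic above can be replayed with a small sketch. This is not Graphite's code, only an illustration of the rule as described here; aggregate_average is a hypothetical helper and the value 42.0 is made up.

```python
def aggregate_average(points, x_files_factor):
    """Average the non-null points, or return None when too few are known."""
    known = [p for p in points if p is not None]
    if not known:
        return None  # nothing to average at all
    if len(known) / len(points) < x_files_factor:
        return None  # not enough usable values for this xFilesFactor
    return sum(known) / len(known)


# One minute at 1s resolution, with a value arriving only every 3 seconds:
# 20 known values and 40 nulls.
minute = [42.0 if second % 3 == 0 else None for second in range(60)]

print(aggregate_average(minute, 0.5))  # None, since 20/60 is below 0.5
print(aggregate_average(minute, 0.1))  # 42.0
```

With xFilesFactor at 0.5 the minute is discarded; any value at or below 0.33 would keep it.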
The art of confusion
Now that we have updated our configuration (setting xFilesFactor to 0 to be sure) and restarted carbon-cache, everything should work fine…
But that's not the case; no change.
In fact, the previous configuration is still stored in the .wsp storage files. We can check it with whisper-info.py.
whisper-info.py /opt/graphite/storage/whisper/collectd/test-java01/cpu-0/cpu-user.wsp
maxRetention: 63072000
xFilesFactor: 0.5
aggregationMethod: average
fileSize: 2561812
Archive 0
retention: 1800
secondsPerPoint: 1
points: 1800
size: 21600
offset: 52
Archive 1
retention: 86400
secondsPerPoint: 60
points: 1440
size: 17280
offset: 21652
Archive 2
retention: 63072000
secondsPerPoint: 300
points: 210240
size: 2522880
offset: 38932
See, we still have xFilesFactor: 0.5.
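As a side note, the size and offset values in this output can be recomputed by hand. The sketch below assumes a 16-byte file header, 12 bytes of header per archive and 12 bytes per stored point (the struct formats are my assumption about the Whisper layout; they do reproduce the numbers printed above). The layout function is a hypothetical helper.

```python
import struct

# Assumed on-disk formats (network byte order, no padding).
METADATA_SIZE = struct.calcsize("!2LfL")    # aggregation, maxRetention, xFilesFactor, archive count
ARCHIVE_INFO_SIZE = struct.calcsize("!3L")  # offset, secondsPerPoint, points
POINT_SIZE = struct.calcsize("!Ld")         # timestamp, value


def layout(archives):
    """archives: list of (seconds_per_point, points). Returns (details, file_size)."""
    offset = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    details = []
    for seconds_per_point, points in archives:
        size = points * POINT_SIZE
        details.append({
            "retention": seconds_per_point * points,
            "secondsPerPoint": seconds_per_point,
            "points": points,
            "size": size,
            "offset": offset,
        })
        offset += size  # the next archive starts right after this one
    return details, offset


details, file_size = layout([(1, 1800), (60, 1440), (300, 210240)])
for archive in details:
    print(archive)
print("fileSize:", file_size)  # fileSize: 2561812
```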
If you don't care about previous data, a simple solution is to delete the files so that they are recreated with the new parameters (rm -rf /opt/graphite/storage/whisper/collectd/). It may be a little overkill, but it is easy and fast.
The other solution is to use whisper-resize.py to enforce the new configuration.
whisper-resize.py /opt/graphite/storage/whisper/collectd/test-java01/cpu-0/cpu-user.wsp 3s:30m 1m:1d 5m:2y --xFilesFactor=0.1
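Resizing one file at a time gets tedious with many metrics. Here is a minimal sketch that only builds the whisper-resize.py command line for every .wsp file under a directory, assuming each retention is passed as its own argument and that whisper-resize.py is on the PATH. The helper name resize_commands is hypothetical, and nothing is executed.

```python
import os


def resize_commands(root, retentions, x_files_factor):
    """Build one whisper-resize.py command (as an argument list) per .wsp file."""
    commands = []
    for directory, _subdirs, files in os.walk(root):
        for name in sorted(files):
            if name.endswith(".wsp"):
                commands.append([
                    "whisper-resize.py",
                    os.path.join(directory, name),
                    *retentions,
                    "--xFilesFactor=%s" % x_files_factor,
                ])
    return commands


# Dry run: print the commands so they can be reviewed first.
for command in resize_commands("/opt/graphite/storage/whisper/collectd",
                               ["3s:30m", "1m:1d", "5m:2y"], 0.1):
    print(" ".join(command))
```

Once the printed commands look right, each list can be handed to subprocess.run(command, check=True).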
The above works fine, but there is another way to express how many points Graphite keeps. It has the format n:i, which means we store one measure every n seconds and we want i points to be stored (computed as the retention period in seconds / n).
Example: 3s:30m
30m = 1800s
1800 / 3 = 600
3:600
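So 3s:30m,1m:1d,5m:2y gives us 3:600 60:1440 300:210240. Whisper counts a year as 365 days (63072000 seconds for 2y, as whisper-info.py reports above), so the last archive holds 63072000 / 300 = 210240 points. Using the average Gregorian year instead would give about 210380:
« An average Gregorian year is 365.2425 days = 52.1775 weeks = 8765.82 hours = 525949.2 minutes = 31556952 seconds (mean solar, not SI). » Wikipedia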
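The conversion to the n:i form can be sketched in a few lines. This is an illustration, not Whisper's parser: to_seconds and to_points are hypothetical helpers, only single-letter units are handled, and a year is assumed to be 365 days, which matches the retention: 63072000 reported by whisper-info.py for 2y.

```python
import re

# Seconds per unit; a year is assumed to be 365 days.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400,
                "w": 7 * 86400, "y": 365 * 86400}


def to_seconds(value):
    """Convert a duration such as '30m' or '2y' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", value)
    if match is None:
        raise ValueError("unsupported duration: %r" % value)
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def to_points(retentions):
    """Turn '3s:30m,1m:1d' into a list of (seconds_per_point, points)."""
    result = []
    for definition in retentions.split(","):
        precision, duration = definition.strip().split(":")
        step = to_seconds(precision)
        result.append((step, to_seconds(duration) // step))
    return result


print(to_points("3s:30m,1m:1d,5m:2y"))
# [(3, 600), (60, 1440), (300, 210240)]
```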
Note
One thing to remember concerning storage-schemas.conf (taken from the Graphite doc):
« Changing this file will not affect already-created .wsp files. Use whisper-resize.py to change those. »