Commit 4f9eb07

proposal[PROM-60]: Prometheus CT Storage
Signed-off-by: bwplotka <[email protected]>
1 parent 52d0bae commit 4f9eb07

1 file changed

+137
-0
lines changed

proposals/0060-ct-storage.md

Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
1+
## Native TSDB Support for Cumulative Created Timestamp (CT) (and Delta Start Timestamp (ST) on the way)
2+
3+
* **Owners:**
4+
* [`@bwplotka`](https://github.com/bwplotka)
5+
* <[delta-type-WG](https://docs.google.com/document/d/1G0d_cLHkgrnWhXYG9oXEmjy2qp6GLSX2kxYiurLYUSQ/edit) members?>
6+
7+
* **Implementation Status:** `Partially implemented`
8+
9+
* **Related Issues and PRs:**
10+
* [WAL](https://github.com/prometheus/prometheus/issues/14218), [PRW2](https://github.com/prometheus/prometheus/issues/14220), [CT Meta](https://github.com/prometheus/prometheus/issues/14217).
11+
* [appender](https://github.com/prometheus/prometheus/pull/17104)
12+
* [initial attempt for ct per sample](https://github.com/prometheus/prometheus/pull/16046)
13+
* [rw2 proto change for ct per sample](https://github.com/prometheus/prometheus/pull/17036)
14+
15+
* **Other docs or links:**
16+
* [PROM-29 (Created Timestamp)](https://github.com/prometheus/proposals/blob/main/proposals/0029-created-timestamp.md)
17+
* [Delta type proposal](https://github.com/prometheus/proposals/pull/48), [Delta WG](https://docs.google.com/document/d/1G0d_cLHkgrnWhXYG9oXEmjy2qp6GLSX2kxYiurLYUSQ/edit)
18+
19+
> TL;DR: We propose to extend Prometheus TSDB storage sample definition to include an extra int64 that will represent the cumulative created timestamp (CT) and, for the future delta temporality ([PROM-48](https://github.com/prometheus/proposals/pull/48)), a delta start timestamp (ST).
20+
> Once implemented, we propose to deprecate the `created-timestamps-zero-injection` experimental feature.
21+
22+
## Why
23+
24+
The main goal of this proposal is to make sure [PROM-29's created timestamp (CT)](0029-created-timestamp.md) information is reliably and efficiently stored in Prometheus TSDB, so that it is:
25+
26+
* Written via TSDB Appender interfaces.
27+
* Query-able via TSDB Querier interfaces.
28+
* Persistent in WAL.
29+
* Watch-able (WAL) by Remote Writer.
30+
* (eventually) Persistent in TSDB block storage.
31+
32+
To do it reliably, we propose to extend TSDB storage to "natively" support CT as something you can attach to a sample and use later on.
33+
Native CT support in Prometheus TSDB would unblock the practical use of CT information for:
34+
35+
* Remote storages (Remote Write 2.0) (e.g. Otel, Chronosphere, Google)
36+
* PromQL and other read APIs (including federation) (e.g. more accurate cumulative-based operations)
37+
38+
Furthermore, it would unblock future Prometheus features for a wider range of monitoring cases like:
39+
40+
* Delta temporality support
41+
* UpAndDown counters (i.e. non-monotonic counters), e.g. StatsD
42+
43+
On top of that, this would simplify some existing features, e.g. detecting (exponential) native histogram resets (instead of relying on reset hints).
44+
45+
### Background: CT feature
46+
47+
[PROM-29](0029-created-timestamp.md) introduced the "created timestamp" (CT) concept for Prometheus cumulative metrics. Semantically, CT represents the time when "counting" (from 0) started.
48+
In other words, CT is the time when the counter "instance" was created.
49+
50+
Conceptually, CT extends the Prometheus data model for cumulative monotonic counters as follows:
51+
52+
* (new) int64 Timestamp (CT): When counting started.
53+
* float64 or [Histogram](https://github.com/prometheus/prometheus/blob/d7e9a2ffb0f0ee0b6835cda6952d12ceee1371d0/model/histogram/histogram.go#L50) Value (V): The current value of the count, since the CT time.
54+
* int64 Timestamp (T): When this value was observed.
55+
* Labels: Unique identity of a series.
56+
* This includes special metadata labels like: `__name__`, `__type__`, `__unit__`
57+
* Exemplars
58+
* Metadata
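The extended model can be pictured with a minimal Go sketch. This is an illustration only, not the TSDB representation this proposal will settle on: the `Sample` type, its field names, the millisecond unit and the "0 means unknown CT" convention are all assumptions made for the example.

```go
package main

// Sample is a hypothetical shape for a cumulative float sample extended
// with a created timestamp. Names and conventions are illustrative only.
type Sample struct {
	CT int64   // created timestamp (ms since epoch); 0 means "unknown" (assumption)
	T  int64   // timestamp (ms since epoch) at which V was observed
	V  float64 // value counted since CT
}

// HasCT reports whether the sample carries a usable created timestamp.
// A CT later than T would contradict "counting started at CT", so it is rejected.
func (s Sample) HasCT() bool {
	return s.CT != 0 && s.CT <= s.T
}

// CountingWindowMs returns how long the counter had been counting when the
// sample was observed. The boolean is false when no usable CT is attached.
func (s Sample) CountingWindowMs() (int64, bool) {
	if !s.HasCT() {
		return 0, false
	}
	return s.T - s.CT, true
}
```

With this shape, a single sample is enough to know both the value and the period it covers, without consulting earlier samples of the same series.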
59+
60+
Since the CT concept introduction in Prometheus we:
61+
62+
* Extended the Prometheus protobuf scrape format to include CT per cumulative sample (TODO link).
63+
* Proposed (for OM 2) text format changes for CT scraping (improvement over existing OM1 `_created` lines) (TODO link).
64+
* Expanded Scrape parser interface to return `CreatedTimestamp` per sample (aka per line).
65+
* Optimized Protobuf and OpenMetrics parsers for CT use (TODO links).
66+
* Implemented an opt-in, experimental [`created-timestamps-zero-injection`](https://prometheus.io/docs/prometheus/latest/feature_flags/#created-timestamps-zero-injection) feature flag that injects a fake sample (V: 0, T: CT).
67+
* Included CT in Remote Write 2 specification (TODO link).
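To make the zero-injection idea concrete, here is a simplified Go sketch of injecting a (V: 0, T: CT) sample when a new CT is seen. It is not the actual Prometheus implementation; the `Sample` type, the `InjectZeros` function and the "0 means unknown CT" convention are hypothetical.

```go
package main

// Sample is a hypothetical scraped sample; CT == 0 means "no CT" (assumption).
type Sample struct {
	CT, T int64
	V     float64
}

// InjectZeros returns the samples that would be appended for one series,
// adding a synthetic (V: 0, T: CT) sample whenever a new CT is seen.
// The returned samples carry no CT: after injection the information only
// survives as the position of the synthetic zero sample.
func InjectZeros(in []Sample) []Sample {
	out := make([]Sample, 0, len(in))
	var lastCT, lastT int64
	seen := false
	for _, s := range in {
		if s.CT != 0 && s.CT != lastCT {
			lastCT = s.CT
			// Only inject when the zero lands strictly between the last
			// appended sample and the current one (no out-of-order write).
			if s.CT < s.T && (!seen || s.CT > lastT) {
				out = append(out, Sample{T: s.CT, V: 0})
			}
		}
		out = append(out, Sample{T: s.T, V: s.V})
		lastT, seen = s.T, true
	}
	return out
}
```

Note that the logic needs `lastCT` and `lastT` from earlier samples of the series, which is the statefulness this proposal wants to avoid.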
68+
69+
### Background: Delta temporality
70+
71+
See the details, motivations and discussions about the delta temporality in [PROM-48](https://github.com/prometheus/proposals/pull/48).
72+
73+
The core TL;DR relevant for this proposal is that a delta temporality counter sample can be conceptually seen as a "mini-cumulative counter". Essentially, a delta is a single-sample (value) cumulative counter for a period between (inclusive) the start (ST) / created (CT) timestamp and an (end) timestamp.
74+
75+
In other words, `increase(<counter>[5m])` produces a single delta sample for a `[t-5m, t]` period (V: `increase(<counter>[5m])`, CT/ST: `now()-5m`, T: `now()`).
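The equivalence can be sketched in Go as follows. This is a simplified illustration under stated assumptions: the `Point` and `Delta` types and the `DeltaOverWindow` function are hypothetical, and the increase is the raw sum of differences (with value-based reset handling), whereas PromQL's `increase` additionally extrapolates to the window boundaries.

```go
package main

// Point is a cumulative counter sample (hypothetical type).
type Point struct {
	T int64 // observation timestamp in ms
	V float64
}

// Delta is a single delta sample covering [ST, T] (hypothetical type).
type Delta struct {
	ST, T int64
	V     float64
}

// DeltaOverWindow turns the cumulative points that fall into
// [end-windowMs, end] into one delta sample. Points must be sorted by T.
// A decrease in value is treated as a counter reset.
func DeltaOverWindow(points []Point, end, windowMs int64) Delta {
	start := end - windowMs
	d := Delta{ST: start, T: end}
	var prev Point
	have := false
	for _, p := range points {
		if p.T < start || p.T > end {
			continue
		}
		if have {
			if p.V < prev.V {
				d.V += p.V // reset: counting restarted from 0
			} else {
				d.V += p.V - prev.V
			}
		}
		prev, have = p, true
	}
	return d
}
```

The resulting `Delta` carries its own start timestamp, so a consumer does not need any neighbouring sample to interpret it.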
76+
77+
This shows that it's worth considering delta when designing CT support.
78+
79+
### Background: CT (cumulative) vs ST (delta)
80+
81+
The [previous section](#background-delta-temporality) argues that conceptually the Cumulative Created Timestamp (CT) and Delta Start Timestamp (ST) are essentially the same thing. This is why they are typically stored in the same "field" in other systems' APIs and storages (e.g. start time in OpenTelemetry TODO link).
82+
83+
The notable difference between using this special timestamp for cumulative vs delta samples is the dynamicity **characteristics** of the timestamp, i.e. how often it changes.
84+
85+
* For the cumulatives we expect CT to change on every new counter restart, so:
86+
* Average: in the order of ~weeks/months for stable workloads, ~days/weeks for more dynamic environments (Kubernetes).
87+
* Best case: it never changes (infinite count), e.g. `days_since_X_total`.
88+
* Worst case: it changes for every sample.
89+
* For the delta we expect CT to change for every sample.
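Because CT changes on every counter restart, a per-sample CT can be used to detect resets that a value-only heuristic misses. The Go sketch below is a hypothetical illustration (the `Sample` type and `Increase` function are not Prometheus APIs, and "0 means unknown CT" is an assumption).

```go
package main

// Sample is a hypothetical cumulative sample with an attached CT;
// CT == 0 means "unknown" (assumption).
type Sample struct {
	CT, T int64
	V     float64
}

// Increase returns the increase between two consecutive samples of a series.
func Increase(prev, cur Sample) float64 {
	if cur.CT != 0 && cur.CT > prev.T {
		// Counting restarted after prev was observed, so everything in
		// cur.V accrued since cur.CT, even if cur.V >= prev.V.
		return cur.V
	}
	if cur.V < prev.V {
		// Fallback: classic value-based reset detection.
		return cur.V
	}
	return cur.V - prev.V
}
```

For example, with `prev = {CT: 100, T: 1000, V: 50}` and `cur = {CT: 1500, T: 2000, V: 70}`, a value-only comparison would report an increase of 20, while the CT shows the counter restarted at 1500 and the increase is 70.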
90+
91+
### Pitfalls of the current solution(s)
92+
93+
* The `created-timestamps-zero-injection` feature allows some CT use cases, but it's limited in practice:
94+
* It's stateful, which means it can't be used effectively across the ecosystem. Essentially you can't miss a single sample (and/or you have to process all samples since 0) to find CT information per sample. For example:
95+
* Remote Write ingestion would need to be persistent and stateful, which blocks horizontal scalability of receiving.
96+
* It limits effectiveness of using CT for PromQL operations like `rate`, `resets` etc.
97+
* It makes "rollups" (write-time recording rules that pre-calculate rates) difficult to implement.
98+
* Given an immutability invariant (e.g. in Prometheus), you can't effectively inject CT at a later time (out-of-order writes are sometimes possible, but expensive, especially when a single sample per series has to be written in the past).
99+
* It's prone to OOO false positives (we ignore this error for CTs now in Prometheus).
100+
* It's producing an artificial sample, which looks like it was scraped.
101+
* We can't implement delta temporality effectively.
102+
103+
## Goals
104+
105+
* [MUST] Prometheus can reliably store, query, ingest and export cumulative created timestamp (CT) information (long term plan for [PROM-29](https://github.com/prometheus/proposals/blob/main/proposals/0029-created-timestamp.md#:~:text=For%20those%20reasons%2C%20created%20timestamps%20will%20also%20be%20stored%20as%20metadata%20per%20series%2C%20following%20the%20similar%20logic%20used%20for%20the%20zero%2Dinjection.))
106+
* [SHOULD] Prometheus can reliably store, query, ingest and export delta start time information. This unblocks [PROM-48 delta proposal](https://github.com/prometheus/proposals/pull/48). Notably adding delta feature later on should ideally not require another complex storage design or implementation.
107+
* [SHOULD] Overhead of the solution should be minimal. The initial overhead target is a maximum of 10% CPU, 10% memory and 15% disk space.
108+
109+
## Non-Goals
110+
111+
In this proposal we don't want to:
112+
113+
* Expand details on delta temporality. For this proposal it's enough to assume about delta only what's described in [Background: Delta Temporality](#background-delta-temporality).
114+
* Expand or motivate the non-monotonic counter feature.
115+
116+
## How
117+
118+
TODO:
119+
* Describe touch points
120+
* Section for TSDB interfaces
121+
* Section for WAL changes + benchmark
122+
* Section for TSDB + benchmark
123+
* Section for PRW2 changes
124+
* Bonus: Should we rename CT to ST?
125+
* Expand on potential plan
126+
127+
## Alternatives
128+
129+
1. This is why not solution Z...
130+
131+
## Action Plan
132+
133+
The tasks to do in order to migrate to the new idea.
134+
135+
* [ ] Task one
136+
137+
* [ ] Task two
