|
| 1 | +## Native TSDB Support for Cumulative Created Timestamp (CT) (and Delta Start Timestamp (ST) on the way) |
| 2 | + |
| 3 | +* **Owners:** |
| 4 | + * [`@bwplotka`](https://github.com/bwplotka) |
| 5 | + * <[delta-type-WG](https://docs.google.com/document/d/1G0d_cLHkgrnWhXYG9oXEmjy2qp6GLSX2kxYiurLYUSQ/edit) members?> |
| 6 | + |
| 7 | +* **Implementation Status:** `Partially implemented` |
| 8 | + |
| 9 | +* **Related Issues and PRs:** |
| 10 | + * [WAL](https://github.com/prometheus/prometheus/issues/14218), [PRW2](https://github.com/prometheus/prometheus/issues/14220), [CT Meta](https://github.com/prometheus/prometheus/issues/14217). |
| 11 | + * [appender](https://github.com/prometheus/prometheus/pull/17104) |
| 12 | + * [initial attempt for ct per sample](https://github.com/prometheus/prometheus/pull/16046) |
| 13 | + * [rw2 proto change for ct per sample](https://github.com/prometheus/prometheus/pull/17036) |
| 14 | + |
| 15 | +* **Other docs or links:** |
| 16 | + * [PROM-29 (Created Timestamp)](https://github.com/prometheus/proposals/blob/main/proposals/0029-created-timestamp.md) |
| 17 | + * [Delta type proposal](https://github.com/prometheus/proposals/pull/48), [Delta WG](https://docs.google.com/document/d/1G0d_cLHkgrnWhXYG9oXEmjy2qp6GLSX2kxYiurLYUSQ/edit) |
| 18 | + |
| 19 | +> TL;DR: We propose to extend Prometheus TSDB storage sample definition to include an extra int64 that will represent the cumulative created timestamp (CT) and, for the future delta temporality ([PROM-48](https://github.com/prometheus/proposals/pull/48)), a delta start timestamp (ST). |
| 20 | +> Once implemented, we propose to deprecate the `created-timestamps-zero-injection` experimental feature. |
| 21 | +
|
| 22 | +## Why |
| 23 | + |
| 24 | +The main goal of this proposal is to make sure [PROM-29's created timestamp (CT)](0029-created-timestamp.md) information is reliably and efficiently stored in Prometheus TSDB, so that it is: |
| 25 | + |
| 26 | +* Written via TSDB Appender interfaces. |
| 27 | +* Query-able via TSDB Querier interfaces. |
| 28 | +* Persistent in WAL. |
| 29 | +* Watch-able (WAL) by Remote Writer. |
| 30 | +* (eventually) Persistent in TSDB block storage. |
| 31 | + |
| 32 | +To do it reliably, we propose to extend TSDB storage to "natively" support CT as something you can attach to a sample and use later on. |
| 33 | +Native CT support in Prometheus TSDB would unblock the practical use of CT information for: |
| 34 | + |
| 35 | +* Remote storages (Remote Write 2.0) (e.g. Otel, Chronosphere, Google) |
| 36 | +* PromQL and other read APIs (including federation) (e.g. increased cumulative based operation accuracy) |
| 37 | + |
| 38 | +Furthermore, it would unblock future Prometheus features for a wider range of monitoring cases like: |
| 39 | + |
| 40 | +* Delta temporality support |
| 41 | +* UpAndDown counters (i.e. non-monotonic counters), e.g. StatsD |
| 42 | + |
| 43 | +On top of that, this would simplify some existing features, e.g. detecting (exponential) native histogram resets (instead of relying on reset hints). |
| 44 | + |
| 45 | +### Background: CT feature |
| 46 | + |
| 47 | +[PROM-29](0029-created-timestamp.md) introduced the "created timestamp" (CT) concept for Prometheus cumulative metrics. Semantically, CT represents the time when "counting" (from 0) started. |
| 48 | +In other words, CT is the time when the counter "instance" was created. |
| 49 | + |
| 50 | +Conceptually, CT extends the Prometheus data model for cumulative monotonic counters as follows: |
| 51 | + |
| 52 | +* (new) int64 Timestamp (CT): When counting started. |
| 53 | +* float64 or [Histogram](https://github.com/prometheus/prometheus/blob/d7e9a2ffb0f0ee0b6835cda6952d12ceee1371d0/model/histogram/histogram.go#L50) Value (V): The current value of the count, since the CT time. |
| 54 | +* int64 Timestamp (T): When this value was observed. |
| 55 | +* Labels: Unique identity of a series. |
| 56 | + * This includes special metadata labels like: `__name__`, `__type__`, `__unit__` |
| 57 | +* Exemplars |
| 58 | +* Metadata |
| 59 | + |
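As a minimal sketch of the conceptual model above (not the actual TSDB types), a float sample carrying a CT could look as follows. The `Sample` type, the `RateSinceCreation` helper and the convention that `CT == 0` means "unknown" are assumptions made only for this illustration; histograms, labels, exemplars and metadata are omitted for brevity.

```go
package main

import "fmt"

// Sample is a hypothetical, simplified cumulative counter sample.
// It only illustrates the conceptual model; it is not a Prometheus type.
type Sample struct {
	CT int64   // created timestamp in ms; 0 means "unknown" in this sketch (assumption)
	T  int64   // timestamp in ms when the value was observed
	V  float64 // value counted since CT
}

// RateSinceCreation returns the average per-second rate over [CT, T].
// ok is false when CT is unknown or not strictly before T.
func RateSinceCreation(s Sample) (rate float64, ok bool) {
	if s.CT == 0 || s.CT >= s.T {
		return 0, false
	}
	seconds := float64(s.T-s.CT) / 1000
	return s.V / seconds, true
}

func main() {
	// 120 events counted over 60 seconds since the counter was created.
	s := Sample{CT: 1_000, T: 61_000, V: 120}
	rate, ok := RateSinceCreation(s)
	fmt.Println(rate, ok) // prints "2 true"
}
```

The point of the sketch is that, with CT attached to the sample, a single sample is enough to know over which period the value was accumulated.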
| 60 | +Since the CT concept was introduced in Prometheus, we have: |
| 61 | + |
| 62 | +* Extended Prometheus protobuf scrape format to include CT for each cumulative sample (TODO link). |
| 63 | +* Proposed (for OM 2) text format changes for CT scraping (improvement over existing OM1 `_created` lines) (TODO link). |
| 64 | +* Expanded Scrape parser interface to return `CreatedTimestamp` per sample (aka per line). |
| 65 | +* Optimized Protobuf and OpenMetrics parsers for CT use (TODO links). |
| 66 | +* Implemented an opt-in, experimental [`created-timestamps-zero-injection`](https://prometheus.io/docs/prometheus/latest/feature_flags/#created-timestamps-zero-injection) feature flag that injects a fake sample (V: 0, T: CT). |
| 67 | +* Included CT in Remote Write 2 specification (TODO link). |
| 68 | + |
| 69 | +### Background: Delta temporality |
| 70 | + |
| 71 | +See the details, motivations and discussions about the delta temporality in [PROM-48](https://github.com/prometheus/proposals/pull/48). |
| 72 | + |
| 73 | +The core TL;DR relevant for this proposal is that a delta temporality counter sample can conceptually be seen as a "mini-cumulative counter". Essentially, a delta is a single-sample (value) cumulative counter for the period between (inclusive) a start (ST) / created (CT) timestamp and an end timestamp (T). |
| 74 | + |
| 75 | +In other words, `increase(<counter>[5m])` produces a single delta sample for a `[t-5m, t]` period (V: `increase(<counter>[5m])`, CT/ST: `now()-5m`, T: `now()`). |
| 76 | + |
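To make the "mini-cumulative counter" view concrete, here is a rough sketch that derives one delta sample from two consecutive cumulative samples of the same series. The types and the `ToDelta` function are hypothetical names for this illustration only. The sketch assumes CT is known and ignores the extrapolation that PromQL's `increase` performs.

```go
package main

import "fmt"

// CumulativeSample and DeltaSample are hypothetical types for illustration.
type CumulativeSample struct {
	CT, T int64 // created and observation timestamps in ms
	V     float64
}

type DeltaSample struct {
	ST, T int64 // start and end timestamps in ms
	V     float64
}

// ToDelta turns two consecutive cumulative samples into one delta sample.
// If the counter was re-created after prev was observed, everything counted
// since the new CT is new, so the delta starts at that CT.
func ToDelta(prev, cur CumulativeSample) DeltaSample {
	if cur.CT > prev.T {
		return DeltaSample{ST: cur.CT, T: cur.T, V: cur.V}
	}
	return DeltaSample{ST: prev.T, T: cur.T, V: cur.V - prev.V}
}

func main() {
	prev := CumulativeSample{CT: 1_000, T: 300_000, V: 100}
	cur := CumulativeSample{CT: 1_000, T: 600_000, V: 160}
	fmt.Printf("%+v\n", ToDelta(prev, cur)) // prints "{ST:300000 T:600000 V:60}"

	// Counter restarted at 450s: the delta only covers the time since the new CT.
	restarted := CumulativeSample{CT: 450_000, T: 600_000, V: 7}
	fmt.Printf("%+v\n", ToDelta(prev, restarted)) // prints "{ST:450000 T:600000 V:7}"
}
```

In both cases the resulting delta sample carries a start timestamp and a value counted since that start, which is the same shape as a cumulative sample with a CT.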
| 77 | +This shows that it's worth considering delta when designing CT feature support. |
| 78 | + |
| 79 | +### Background: CT (cumulative) vs ST (delta) |
| 80 | + |
| 81 | +The [previous section](#background-delta-temporality) argues that conceptually the Cumulative Created Timestamp (CT) and Delta Start Timestamp (ST) are essentially the same thing. This is why other systems' APIs and storages typically keep them in the same "field" (e.g. start time in OpenTelemetry TODO link). |
| 82 | + |
| 83 | +The notable difference between using this special timestamp for cumulative vs delta samples is how often it changes, i.e. its dynamicity **characteristics**. |
| 84 | + |
| 85 | +* For the cumulatives we expect CT to change on every new counter restart, so: |
| 86 | + * Average: in the order of ~weeks/months for stable workloads, ~days/weeks for more dynamic environments (Kubernetes). |
| 87 | + * Best case: it never changes (infinite count), e.g. `days_since_X_total`. |
| 88 | + * Worst case: it changes for every sample. |
| 89 | +* For the delta we expect ST to change for every sample. |
| 90 | + |
| 91 | +### Pitfalls of the current solution(s) |
| 92 | + |
| 93 | +* The `created-timestamps-zero-injection` feature allows some CT use cases, but it's limited in practice: |
| 94 | + * It's stateful, which means it can't be used effectively across the ecosystem. Essentially you can't miss a single sample (and/or you have to process all samples since 0) to find CT information per sample. For example: |
| 95 | + * Remote Write ingestion would need to be persistent and stateful, which blocks horizontal scalability of receiving. |
| 96 | + * It limits effectiveness of using CT for PromQL operations like `rate`, `resets` etc. |
| 97 | + * It makes "rollup" (write time recording rules that pre-calculate rates) difficult to implement. |
| 98 | + * Given the immutability invariant (e.g. Prometheus), you can't effectively inject CT at a later time (out of order writes are sometimes possible, but expensive, especially for a single sample to be written in the past per series). |
| 99 | + * It's prone to OOO false positives (we ignore this error for CTs now in Prometheus). |
| 100 | + * It's producing an artificial sample, which looks like it was scraped. |
| 101 | +* We can't implement delta temporality effectively. |
| 102 | + |
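The statefulness pitfall can be sketched as follows. With zero injection, a consumer has to scan back through the samples it has seen to find the injected zero. The `CreatedFromZeroInjection` helper and the "most recent zero-valued sample marks the CT" heuristic are assumptions made for this illustration, not how Prometheus implements anything.

```go
package main

import "fmt"

// Sample is a hypothetical sample without a native CT field.
type Sample struct {
	T int64 // timestamp in ms
	V float64
}

// CreatedFromZeroInjection tries to recover the CT for window[i] by scanning
// backwards for the most recent zero-valued sample, which this sketch treats
// as the injected (V: 0, T: CT) marker. It cannot tell an injected zero from
// a genuinely scraped zero, and it fails if the marker is not in the window.
func CreatedFromZeroInjection(window []Sample, i int) (ct int64, ok bool) {
	for j := i; j >= 0; j-- {
		if window[j].V == 0 {
			return window[j].T, true
		}
	}
	return 0, false
}

func main() {
	full := []Sample{{T: 1_000, V: 0}, {T: 15_000, V: 4}, {T: 30_000, V: 9}}
	ct, ok := CreatedFromZeroInjection(full, 2)
	fmt.Println(ct, ok) // prints "1000 true"

	// A receiver that missed the first sample can no longer recover the CT.
	partial := full[1:]
	ct, ok = CreatedFromZeroInjection(partial, 1)
	fmt.Println(ct, ok) // prints "0 false"
}
```

With a CT stored natively on each sample, the second case would not need any history: the information would travel with the sample itself.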
| 103 | +## Goals |
| 104 | + |
| 105 | +* [MUST] Prometheus can reliably store, query, ingest and export cumulative created timestamp (CT) information (long term plan for [PROM-29](https://github.com/prometheus/proposals/blob/main/proposals/0029-created-timestamp.md#:~:text=For%20those%20reasons%2C%20created%20timestamps%20will%20also%20be%20stored%20as%20metadata%20per%20series%2C%20following%20the%20similar%20logic%20used%20for%20the%20zero%2Dinjection.)) |
| 106 | +* [SHOULD] Prometheus can reliably store, query, ingest and export delta start time information. This unblocks [PROM-48 delta proposal](https://github.com/prometheus/proposals/pull/48). Notably, adding the delta feature later on should ideally not require another complex storage design or implementation. |
| 107 | +* [SHOULD] Overhead of the solution should be minimal. The initial overhead target is a maximum of 10% CPU, 10% memory and 15% disk space. |
| 108 | + |
| 109 | +## Non-Goals |
| 110 | + |
| 111 | +In this proposal we don't want to: |
| 112 | + |
| 113 | +* Expand details on delta temporality. For this proposal it's enough to assume about delta what is described in [Background: Delta Temporality](#background-delta-temporality). |
| 114 | +* Expand or motivate the non-monotonic counter feature. |
| 115 | + |
| 116 | +## How |
| 117 | + |
| 118 | +TODO: |
| 119 | +* Describe touch points |
| 120 | +* Section for TSDB interfaces |
| 121 | +* Section for WAL changes + benchmark |
| 122 | +* Section for TSDB + benchmark |
| 123 | +* Section for PRW2 changes |
| 124 | +* Bonus: Should we rename CT to ST? |
| 125 | +* Expand on potential plan |
| 126 | + |
| 127 | +## Alternatives |
| 128 | + |
| 129 | +1. This is why not solution Z... |
| 130 | + |
| 131 | +## Action Plan |
| 132 | + |
| 133 | +The tasks to do in order to migrate to the new idea. |
| 134 | + |
| 135 | +* [ ] Task one |
| 136 | + |
| 137 | +* [ ] Task two |