
Conversation

@alex-hunt-materialize commented Dec 17, 2025

Design doc for Prometheus metrics from queries

Motivation

Design doc.

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@alex-hunt-materialize marked this pull request as ready for review December 17, 2025 13:02
@teskje left a comment


This design does a good job of explaining the SQL-level changes, but it's missing details on the implementation, specifically about how the view results are to be collected.

In particular, I'm interested in:

  • Is the view run as an IMV, or is it queried periodically?
  • I assume the cluster specified in CREATE API is the cluster on which we run the IVM/view queries?
  • How do we plan to support per-replica metrics, i.e. those collected from compute introspection sources?

I'm also not sure about the concept of inserting into a special table to cause DDL to happen. We don't have precedent for this, afaik. It introduces difficulties around error handling: if the user inserts an invalid config, how do we report the error, and do we silently remove the broken config again or leave it around?

@alex-hunt-materialize commented

@teskje Thank you for taking a look!

> Is the view run as an IMV, or is it queried periodically?

The individual metrics views are queried every time the HTTP endpoint receives a GET request. The actual view is user-defined, so they can use views, indexed views, materialized views, tables, whatever. If you think we should restrict this to only indexed views or materialized views, we can probably do that.
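
To make that concrete, here is a hypothetical sketch of what such a user-defined metric view could look like. The column contract (metric name, label columns, gauge value) and the `orders` table are assumptions for illustration only; the design doesn't pin this down in the thread:

```sql
-- Hypothetical sketch: the exact column contract the endpoint expects is an
-- assumption here, and the "orders" table is made up for illustration.
CREATE VIEW orders_in_flight_metric AS
SELECT
    'orders_in_flight' AS metric,          -- Prometheus metric name
    status AS label_status,                -- one column per Prometheus label
    count(*)::double precision AS value    -- gauge value
FROM orders
WHERE status IN ('pending', 'shipped')
GROUP BY status;
```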

> I assume the cluster specified in CREATE API is the cluster on which we run the IVM/view queries?

This is solely for querying the metrics registry to find the views. Perhaps the ON CLUSTER is unnecessary?

> How do we plan to support per-replica metrics, i.e. those collected from compute introspection sources?

If a user can define a view for it, it should work here. How are these different from other things? Most likely, they would define a metric view which has labels for the cluster name or id and the replica name or id. The view would just have one row per replica.
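
As a sketch (assuming the same illustrative column contract as above), such a per-replica view could take its labels straight from the catalog; the constant value is just a placeholder:

```sql
-- Sketch: one row per replica, with cluster/replica names as labels.
-- mz_clusters and mz_cluster_replicas are existing catalog relations;
-- the constant value stands in for a real per-replica measurement.
CREATE VIEW replica_up_metric AS
SELECT
    c.name AS label_cluster,
    r.name AS label_replica,
    1::double precision AS value
FROM mz_catalog.mz_cluster_replicas r
JOIN mz_catalog.mz_clusters c ON c.id = r.cluster_id;
```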

> I'm also not sure about the concept of inserting into a special table to cause DDL to happen. We don't have precedent for this, afaik. It introduces difficulties around error handling: if the user inserts an invalid config, how do we report the error, and do we silently remove the broken config again or leave it around?

Perhaps this should be its own type with its own syntax for inserting new metric views, rather than a table. The table is just a holdover from the hackdays implementation. I'd love recommendations for how to better handle this.

@teskje commented Dec 18, 2025

Making the view queries every time the API is scraped makes sense to me!

> If you think we should restrict this to only indexed views or materialized views, we can probably do that.

Probably not necessary, but we should document that creating indexes is recommended. Hitting persist a lot isn't great for performance, as we learned from the queries the Console is making. Also it probably risks the Prometheus scraper timing out if queries take too long.
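
If we go that route, the documented recommendation could be as small as an index on the metric view in a user cluster; a sketch, where the cluster, view, and column names are placeholders:

```sql
-- Keep the metric view's results in memory so each Prometheus scrape reads
-- from the index instead of spinning up an ad-hoc dataflow or hitting persist.
-- "metrics_cluster" and "orders_in_flight_metric" are placeholder names.
CREATE INDEX orders_in_flight_metric_idx
    IN CLUSTER metrics_cluster
    ON orders_in_flight_metric (label_status);
```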

> Perhaps the ON CLUSTER is unnecessary?

I think we need a way to specify on which cluster the queries should run. Indexes are per cluster after all. And we can't default to mz_catalog_server (or any system cluster) because users can't define their own indexes on that cluster.

> If a user can define a view for it, it should work here. How are these different from other things? Most likely, they would define a metric view which has labels for the cluster name or id and the replica name or id. The view would just have one row per replica.

Yeah, that would be nice but is not how things work today 😅 To query the compute introspection relations, you need to target both a cluster and a replica, and the query will return results only for that replica. There is no way to collect results from all replicas with a single query. The promsql exporter has special logic to loop over all current replicas and run the introspection query on each of them.
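
For context, the per-replica targeting is roughly the following, repeated for each replica of each cluster of interest; the introspection relation named here is only an example, and the schema it lives in has differed across versions:

```sql
-- Introspection queries only return data for the targeted replica, so an
-- exporter has to loop over replicas and repeat this per replica.
-- The relation below is illustrative; use whichever introspection relation
-- (and schema) your version exposes.
SET cluster = 'prod';
SET cluster_replica = 'r1';
SELECT * FROM mz_introspection.mz_arrangement_sizes;
```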

> Perhaps this should be its own type with its own syntax for inserting new metric views, rather than a table. The table is just a holdover from the hackdays implementation. I'd love recommendations for how to better handle this.

I'm thinking something like ALTER API <api> ADD METRIC <spec> and ALTER API <api> DROP METRIC <spec>? Or make metrics a first-class object and then CREATE/DROP METRIC <name> IN API <api> AS <spec>.

The second option is how we usually do things, so I think I'd prefer that. It would come with system tables mz_apis and mz_metrics, for introspection.
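
Spelled out, that might look like the following; none of this syntax exists today, it's just the proposal above with <spec> being a reference to an existing relation:

```sql
-- Purely illustrative spelling of the proposed first-class metric objects;
-- names are placeholders and none of this syntax exists yet.
CREATE METRIC orders_in_flight IN API prometheus_api AS orders_in_flight_metric;
DROP METRIC orders_in_flight IN API prometheus_api;
-- with mz_apis and mz_metrics as the accompanying system tables
```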

@alex-hunt-materialize commented

> Making the view queries every time the API is scraped makes sense to me!

> > If you think we should restrict this to only indexed views or materialized views, we can probably do that.

> Probably not necessary, but we should document that creating indexes is recommended. Hitting persist a lot isn't great for performance, as we learned from the queries the Console is making. Also it probably risks the Prometheus scraper timing out if queries take too long.

Is there a downside to restricting it to indexed things? Prometheus queries are usually run frequently on a schedule, so any memory used by these views would likely need to be allocated in frequent spikes if not indexed. Is there higher CPU used by keeping an index up to date, rather than querying every few seconds or minutes?

> > Perhaps the ON CLUSTER is unnecessary?

> I think we need a way to specify on which cluster the queries should run. Indexes are per cluster after all. And we can't default to mz_catalog_server (or any system cluster) because users can't define their own indexes on that cluster.

Do we just want to extend the metric definition to include a cluster to run it on?

> > If a user can define a view for it, it should work here. How are these different from other things? Most likely, they would define a metric view which has labels for the cluster name or id and the replica name or id. The view would just have one row per replica.

> Yeah, that would be nice but is not how things work today 😅 To query the compute introspection relations, you need to target both a cluster and a replica, and the query will return results only for that replica. There is no way to collect results from all replicas with a single query. The promsql exporter has special logic to loop over all current replicas and run the introspection query on each of them.

Is this something we think will eventually be supported? We would like this internally, and I'm sure external users would like some per-replica metrics too (mostly: are all replicas healthy and "caught up"?). Maybe we'll need to move replacing the promsql-exporter to a stretch goal :(.

> > Perhaps this should be its own type with its own syntax for inserting new metric views, rather than a table. The table is just a holdover from the hackdays implementation. I'd love recommendations for how to better handle this.

> I'm thinking something like ALTER API <api> ADD METRIC <spec> and ALTER API <api> DROP METRIC <spec>? Or make metrics a first-class object and then CREATE/DROP METRIC <name> IN API <api> AS <spec>.

> The second option is how we usually do things, so I think I'd prefer that. It would come with system tables mz_apis and mz_metrics, for introspection.

CREATE/DROP METRIC <name> IN API <api> AS <spec> seems great to me. Would <spec> be a reference to a view, like we have here, or would it be the actual query?

@teskje commented Dec 18, 2025

> Is there a downside to restricting it to indexed things? Prometheus queries are usually run frequently on a schedule, so any memory used by these views would likely need to be allocated in frequent spikes if not indexed. Is there higher CPU used by keeping an index up to date, rather than querying every few seconds or minutes?

If the query is cheap enough, running one ad-hoc dataflow every minute is probably fine. We also have persist peeks nowadays, which can make reading directly from persist very efficient, potentially even more efficient than reading from an index. But I'd say the main downside of requiring an index is the operational overhead of having to create an index for each of your Prometheus metrics, even if it shouldn't be necessary.

> Do we just want to extend the metric definition to include a cluster to run it on?

This sounds like a good idea to me, but I'd be curious about product/FE takes.

> Is this something we think will eventually be supported?

In the fullness of time, sure :D I also really want this, but it's hard to find a good design, and there hasn't been much of a desire for better compute introspection from product, afaik. It certainly won't happen soon.

> Would <spec> be a reference to a view, like we have here, or would it be the actual query?

Good question! Things are easier implementation-wise if it's a relation item (a view, materialized view, table, or source). I don't really like the idea of introducing a new item type that has to do SQL planning and optimization. UX-wise it probably doesn't make too much of a difference?
