Break up the huge `MetricsService` interface by aditya1702 · Pull Request #543 · stellar/wallet-backend

aditya1702 · 2026-03-19T13:27:41Z

Closes #547

Break up the large `MetricsService` interface into a top-level struct composed of per-service metric structs, each defined in its own file. This makes it easier to add metrics for existing services and for future features.

Phase 1 of metrics refactor: create domain-specific metric structs (DBMetrics, RPCMetrics, IngestionMetrics, HTTPMetrics, GraphQLMetrics, AuthMetrics) with constructors taking prometheus.Registerer. Add pool registration functions. Rewrite metrics.go to compose sub-structs in a top-level Metrics struct. The legacy MetricsService interface is kept temporarily and now delegates to the new structs.
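The composition described in Phase 1 might look roughly like the sketch below. To stay runnable with only the Go standard library, it swaps the real Prometheus client for a small stand-in `Registerer`, `Registry`, and `Counter`. The field names (`QueriesTotal`, `RequestsTotal`), the metric names, and the `mustRegister` helper are illustrative assumptions, not taken from the PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter is a stand-in for a Prometheus counter (assumption: stdlib-only sketch).
type Counter struct {
	name string
	n    atomic.Int64
}

func (c *Counter) Inc()         { c.n.Add(1) }
func (c *Counter) Value() int64 { return c.n.Load() }

// Registerer is a stand-in for prometheus.Registerer.
type Registerer interface {
	Register(c *Counter) error
}

// Registry rejects duplicate metric names, similar in spirit to a real registry.
type Registry struct{ byName map[string]*Counter }

func NewRegistry() *Registry { return &Registry{byName: map[string]*Counter{}} }

func (r *Registry) Register(c *Counter) error {
	if _, ok := r.byName[c.name]; ok {
		return fmt.Errorf("duplicate metric %q", c.name)
	}
	r.byName[c.name] = c
	return nil
}

// mustRegister is a hypothetical helper: create a counter and register it, or panic.
func mustRegister(reg Registerer, name string) *Counter {
	c := &Counter{name: name}
	if err := reg.Register(c); err != nil {
		panic(err)
	}
	return c
}

// DBMetrics and RPCMetrics are domain-specific structs; field names are assumptions.
type DBMetrics struct{ QueriesTotal *Counter }

func NewDBMetrics(reg Registerer) *DBMetrics {
	return &DBMetrics{QueriesTotal: mustRegister(reg, "wallet_db_queries_total")}
}

type RPCMetrics struct{ RequestsTotal *Counter }

func NewRPCMetrics(reg Registerer) *RPCMetrics {
	return &RPCMetrics{RequestsTotal: mustRegister(reg, "wallet_rpc_requests_total")}
}

// Metrics composes the sub-structs so callers can take only the part they need.
type Metrics struct {
	DB  *DBMetrics
	RPC *RPCMetrics
}

func NewMetrics(reg Registerer) *Metrics {
	return &Metrics{DB: NewDBMetrics(reg), RPC: NewRPCMetrics(reg)}
}

func main() {
	m := NewMetrics(NewRegistry())
	m.DB.QueriesTotal.Inc() // call sites touch the metric directly, no interface method
	fmt.Println(m.DB.QueriesTotal.Value(), m.RPC.RequestsTotal.Value())
}
```

Because each constructor takes the registerer as a parameter, a test can pass a fresh registry per case and avoid duplicate-registration collisions between tests.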

Phase 2: Replace MetricsService interface with *metrics.DBMetrics in all 11 data model structs. Call sites now use direct Prometheus API (e.g., m.Metrics.QueryDuration.WithLabelValues(...).Observe(...)). Add DBMetrics() bridge method to legacy MetricsService interface for callers that still create via NewMetricsService(). Update NewModels() signature and all wiring in serve.go, ingest.go, and loadtest/runner.go.

Phase 3: Replace MetricsService interface with *metrics.RPCMetrics in rpcService. Call sites now use direct Prometheus API (e.g., r.metrics.MethodCallsTotal.WithLabelValues(...).Inc()). Add RPCMetrics() bridge method to legacy MetricsService interface. Update all NewRPCService callers.

Phase 4: Replace MetricsService interface in all middleware:

- MetricsMiddleware: accepts *metrics.HTTPMetrics
- GraphQLFieldMetrics: accepts *metrics.GraphQLMetrics
- ComplexityLogger: accepts *metrics.GraphQLMetrics
- AuthenticationMiddleware: accepts *metrics.AuthMetrics

Update serve.go wiring to pass sub-structs from *metrics.Metrics.

Phase 5+7: Replace MetricsService in ingestion pipeline:

- IngestServiceConfig.Metrics now holds *metrics.Metrics
- ingestService uses m.appMetrics.Ingestion.* for all metric calls
- Indexer accepts *metrics.IngestionMetrics directly
- All processors accept *metrics.IngestionMetrics instead of MetricsServiceInterface, calling StateChangeProcessingDuration directly
- loadtest/runner.go and ingest/ingest.go create *metrics.Metrics directly instead of going through the legacy interface

Phase 6: Replace MockMetricsService + .On().Maybe() chains with real prometheus.NewRegistry() + metrics.NewMetrics(reg) in all 23 test files. Delete MetricsService interface, metricsService struct, mocks.go, processors/metrics.go, and metrics_test.go (to be rewritten). Update resolver.go to accept *metrics.Metrics directly. Remove legacy MetricsService field from serve.go handlerDeps. Update cmd/channel_account to use *metrics.Metrics. Net effect: -2050 lines of mock boilerplate removed.

internal/metrics/pool.go

internal/indexer/processors/sac_balances.go

internal/serve/graphql/resolvers/resolver.go

internal/metrics/ingestion.go

internal/metrics/rpc.go

* Break up the huge `MetricsService` interface (#543)
* metrics: add concrete metric structs with wallet_ namespace prefix (Phase 1)
* metrics: migrate data models to use concrete *DBMetrics struct (Phase 2)
* metrics: migrate RPC service to use concrete *RPCMetrics struct (Phase 3)
* metrics: migrate middleware to use concrete metric structs (Phase 4)
* metrics: migrate ingestion, indexer, and processors to concrete structs (Phase 5+7)
* metrics: migrate all tests to real registries, delete legacy interface (Phase 6)
* make check
* Add metrics tests
* Add CollectAndCompare tests
* Fix all metrics (#545)
* metrics: add concrete metric structs with wallet_ namespace prefix (Phase 1)
* metrics: migrate data models to use concrete *DBMetrics struct (Phase 2)
* metrics: migrate RPC service to use concrete *RPCMetrics struct (Phase 3)
* metrics: migrate middleware to use concrete metric structs (Phase 4)
* metrics: migrate ingestion, indexer, and processors to concrete structs (Phase 5+7)
* metrics: migrate all tests to real registries, delete legacy interface (Phase 6)
* refactor db metrics
* make check
* Add metrics tests
* Add CollectAndCompare tests
* fix db test
* Add operation-level GraphQL metrics and middleware

  Introduce operation-level Prometheus collectors (operation duration histogram, operations counter, in-flight gauge, response size histogram) and rename the constructor to NewGraphQLMetrics. Replace heavy per-field timing/counters with a lightweight deprecated-field counter and complexity/response histograms to reduce cardinality and provide SLO-friendly metrics. Add GraphQLOperationMetrics middleware to record duration, throughput, errors and response size; add tests for operation and field middleware and update existing tests and registrations. Wire the new operation and field middlewares into the server handler.
* Create graphql_field_metrics_test.go
* make check
* Add comments for DB metrics
* Refactor ingestion metrics; add retries/errors

  Refactors Prometheus ingestion metrics and updates instrumentation across ingestion code. Duration was changed from a HistogramVec to a Histogram (calls updated), several metric names were renamed (ledgers/transactions/operations totals), BatchSize removed, and new metrics added: LagLedgers, LedgerFetchDuration, RetriesTotal, RetryExhaustionsTotal, ErrorsTotal (and adjusted Participants metric name/buckets). Instrumentation now observes ledger fetch duration, increments retry and exhaustion counters in fetch/flush/persist paths, reports errors on live ingestion failures, and updates lag when available. Tests updated to match new metric types, bucket counts, and include unit tests for the new metrics.
* Enhance RPC metrics with histograms and gauges

  Refactor and expand RPC Prometheus instrumentation for better SLOs and observability.

  - Replace per-endpoint summary metrics and separate success/failure counters with:
    - wallet_rpc_request_duration_seconds (HistogramVec by method)
    - wallet_rpc_request_duration_seconds and wallet_rpc_method_duration_seconds use explicit rpcDurationBuckets
    - wallet_rpc_requests_total now has (method,status) labels for success/failure
  - Add wallet_rpc_in_flight_requests (Gauge) and wallet_rpc_response_size_bytes (HistogramVec)
  - Convert MethodDuration to a histogram and keep MethodErrorsTotal and MethodCallsTotal counters
  - Update registration to include new collectors and remove deprecated ones.
  - Update tests to assert new metrics, add histogram and bucket checks, and adjust transport counter tests to use (method,status) labels.
  - RPC service changes:
    - Remove heartbeat channel accessor from the interface and implementation
    - GetHealth now sets ServiceHealth and LatestLedger based on response and marks health=0 on errors
    - sendRPCRequest now tracks InFlightRequests, observes RequestDuration, records ResponseSizeBytes, and increments RequestsTotal with success/failure labels instead of old endpoint counters

  These changes improve latency and size visibility, simplify error/success accounting, and provide gauges useful for detecting RPC node stalls or connection exhaustion.
* Update rpc.go
* Rename pool label and expand pool/DB metrics

  Replace the pond pool "channel" label with a clearer "pool_name" label and rename the RegisterPoolMetrics parameter accordingly. Update pool metrics (use wallet_pool_tasks_dropped_total instead of tasks_completed) and tests to reflect the label/name changes. Add extensive documentation comments and new Prometheus metrics for pgxpool (constructing_conns gauge, acquire/empty-acquire counters, wait time counters, new_conns/canceled/max_lifetime/max_idle destroy counters) and improve help text for several metrics to provide better observability of pool and DB connection behavior.
* Add QueryExecMode to DB pool config

  Expose pgx.QueryExecMode on PoolConfig and apply it when opening the connection pool. If non-zero, the value is copied into cfg.ConnConfig.DefaultQueryExecMode so callers can override pgx's default (cached prepared statements). The serve config now sets QueryExecMode to Exec to avoid server-side prepared statement caching which conflicts with PgBouncer in transaction pooling mode (SQLSTATE 42P05), and imports github.com/jackc/pgx/v5.
* Refactor GraphQL metrics and remove RPC heartbeat

  Ensure GraphQL operation metrics properly decrement InFlightOperations exactly once by adding a responded guard and defer. Normalize GraphQL error labels: unrecognized extension codes now map to "unknown" (and the comment documents the closed set). Remove the heartbeatChannel from rpcService and its mock/tests, simplifying the RPC service surface and cleaning up related test assertions.
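The "decrement InFlightOperations exactly once" behaviour mentioned in the last commit above can be sketched as follows. This is a minimal stdlib-only illustration, not the PR's code: `InFlight` is a stand-in for a Prometheus gauge, `trackOperation` is a hypothetical helper, and `sync.Once` is used here to play the role of the "responded" guard.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// InFlight is a stand-in for a Prometheus gauge (assumption: stdlib-only sketch).
type InFlight struct{ n atomic.Int64 }

func (g *InFlight) Inc()         { g.n.Add(1) }
func (g *InFlight) Dec()         { g.n.Add(-1) }
func (g *InFlight) Value() int64 { return g.n.Load() }

// trackOperation increments the gauge, runs op, and guarantees a single decrement.
// op receives a done callback it may call when the response is produced.
func trackOperation(g *InFlight, op func(done func())) {
	g.Inc()
	var responded sync.Once // guard: only the first call decrements
	done := func() { responded.Do(g.Dec) }
	defer done() // covers handlers that return early or panic without calling done
	op(done)
}

func main() {
	g := &InFlight{}
	trackOperation(g, func(done func()) {
		done()
		done() // a second call is a no-op thanks to the guard
	})
	fmt.Println(g.Value())
}
```

Without the guard, a path that both calls `done` and then returns through the deferred call would decrement twice and drive the gauge negative.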

* remove envelope_xdr and meta_xdr - 1
* fix all tests
* Add back the envelopeXDR and metaXDR temporarily for tests
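The error-label normalization described in the "Refactor GraphQL metrics and remove RPC heartbeat" commit could be sketched like this. The PR does not list the closed set here, so the codes in `knownErrorCodes` and the function name `normalizeErrorCode` are purely hypothetical; only the fallback to "unknown" comes from the commit message.

```go
package main

import "fmt"

// knownErrorCodes is a hypothetical closed set of extension codes.
// The real set lives in the PR and is not reproduced here.
var knownErrorCodes = map[string]struct{}{
	"BAD_USER_INPUT":        {},
	"UNAUTHENTICATED":       {},
	"INTERNAL_SERVER_ERROR": {},
}

// normalizeErrorCode maps any code outside the closed set to "unknown",
// which keeps the number of distinct label values bounded.
func normalizeErrorCode(code string) string {
	if _, ok := knownErrorCodes[code]; ok {
		return code
	}
	return "unknown"
}

func main() {
	fmt.Println(normalizeErrorCode("UNAUTHENTICATED"))
	fmt.Println(normalizeErrorCode("SOME_CLIENT_SUPPLIED_CODE"))
}
```

Bounding label values matters because every distinct label value creates a separate time series, so passing arbitrary codes through as labels can inflate cardinality.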

aditya1702 added 6 commits March 18, 2026 16:45

aditya1702 changed the base branch from main to feature/finalize-metrics March 19, 2026 13:27

aditya1702 added 3 commits March 19, 2026 09:37

make check

c3b624d

Add metrics tests

c91fa31

Add CollectAndCompare tests

a3ad0a6

aditya1702 marked this pull request as ready for review March 27, 2026 20:45