
Control plane is shard rate limiting when indexing capacity is available #5980

@srspnda

Description


I have a cluster with 100 indexers (4 cores, 15 GB each); config and build info are below. The cluster is otherwise reasonably healthy, but a single index gets shard rate limited every day: the 429s start as morning traffic ramps up, continue throughout the day, and go away in the evening.

The architecture is Vector http sinks writing into the ingest V2 API. Looking at the Vector logs, I can see that all of the rate limiting hits the same sink/index pair.

2025-11-11T17:50:14.919422Z  WARN sink{component_kind="sink" component_id=shared_log_to_quickwit_foo component_type=http}:request{request_id=1753}: vector::sinks::util::retries: Retrying after response. reason=too many requests
<this log repeats constantly; no other components report rate limiting>
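
For completeness, the sink's behavior boils down to the request pattern below. This is only a minimal sketch with a placeholder host, index id, and document shape, not our actual Vector config:

```python
# Minimal stand-in for the Vector http sink: POST newline-delimited JSON to the
# Quickwit ingest endpoint and back off on 429, which is roughly what the sink's
# retry logic does. Host, index id, and document shape are placeholders.
import json
import time

import requests  # third-party: pip install requests

QUICKWIT_URL = "http://quickwit-indexer:7280"  # placeholder endpoint
INDEX_ID = "foo"

doc = {
    "timestamp": "2025-11-11T17:50:14Z",
    "message": {"source": "vector", "payload": "example"},
    "quickwit_message": "example log line",
}
batch = "\n".join(json.dumps(doc) for _ in range(500))

resp = requests.post(f"{QUICKWIT_URL}/api/v1/{INDEX_ID}/ingest", data=batch)
if resp.status_code == 429:
    # This is the "too many requests" the Vector sink keeps retrying on.
    print("shard rate limited, backing off")
    time.sleep(1)
else:
    resp.raise_for_status()
    print("batch accepted")
```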

Some metrics captured while the 429s are occurring:

[4 screenshots: ingestion and shard metrics while the 429s are occurring]

Sometimes the control plane decides to add shards and the 429s go away:

[3 screenshots: metrics after the control plane added shards and the 429s stopped]

My best guess at the moment is that we're simply running too many indexers and that index throughput is highly skewed: 2 indexes account for roughly 70% of the throughput, and the other 23 account for the rest. My plan was to reduce the indexer count to see whether the average short-term/long-term (ST/LT) ingestion throughput per node would increase enough to fix the issue (see the back-of-envelope sketch below). Still, I wanted to report my current state to check whether my understanding is correct and whether this issue is worth some attention. Thanks for all the work, and let me know if you need more information.
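Here is that back-of-envelope reasoning in code form. All inputs are illustrative placeholders, not measurements from this cluster:

```python
# Rough sizing: how many open shards a hot index needs given the per-shard
# throughput limit from ingest_api_config, and how much average ingest load
# that spreads over the indexer fleet. Numbers below are made up for illustration.
import math

shard_throughput_limit_mb_s = 5.2     # from ingest_api_config above
hot_index_throughput_mb_s = 150.0     # hypothetical peak for one hot index
num_indexers = 100

# Open shards the hot index needs to stay under the per-shard limit.
shards_needed = math.ceil(hot_index_throughput_mb_s / shard_throughput_limit_mb_s)

# Average ingest load per indexer if that traffic spreads evenly over the fleet;
# shrinking the fleet raises this per-node figure.
per_indexer_mb_s = hot_index_throughput_mb_s / num_indexers

print(f"shards needed: ~{shards_needed}")
print(f"avg load per indexer: ~{per_indexer_mb_s:.2f} MB/s across {num_indexers} nodes")
```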

build info

{
  "build": {
    "build_date": "2025-05-23T00:29:41Z",
    "build_profile": "release",
    "build_target": "aarch64-unknown-linux-gnu",
    "cargo_pkg_version": "0.8.0",
    "commit_date": "unknown",
    "commit_hash": "unknown",
    "commit_short_hash": "unknown",
    "commit_tags": [],
    "version": "0.8.0-nightly"
  },
  "runtime": {
    "num_cpus": 4,
    "num_threads_blocking": 3,
    "num_threads_non_blocking": 1
  }
}

ingest and indexer config

{
  "ingest_api_config": {
    "max_queue_memory_usage": "2.1 GB",
    "max_queue_disk_usage": "4.3 GB",
    "replication_factor": 1,
    "content_length_limit": "10.5 MB",
    "shard_throughput_limit": "5.2 MB",
    "shard_burst_limit": "52.4 MB",
    "shard_scale_up_factor": 1.5
  },
  "indexer_config": {
    "split_store_max_num_bytes": "100.0 GB",
    "split_store_max_num_splits": 1000,
    "max_concurrent_split_uploads": 12,
    "max_merge_write_throughput": null,
    "merge_concurrency": 2,
    "enable_otlp_endpoint": true,
    "enable_cooperative_indexing": false,
    "cpu_capacity": "4000m"
  }
}

25 indexes, all with the same config

# index config
version: 0.9
index_id: foo
doc_mapping:
  mode: dynamic
  field_mappings:
    - name: timestamp
      type: datetime
      input_formats:
        - rfc3339
        - unix_timestamp
        - iso8601
      fast: true
      fast_precision: milliseconds
    - name: message
      type: json
      tokenizer: raw
      fast: true
    - name: quickwit_message
      type: text
      tokenizer: default
      record: position
      fieldnorms: true
      fast:
        normalizer: lowercase
  timestamp_field: timestamp

indexing_settings:
  commit_timeout_secs: 30

search_settings:
  default_search_fields: [quickwit_message]
