Description
I have a cluster of 100 indexers (4 cores / 15 GB each), with config and build info below. The cluster is otherwise reasonably healthy, but a single index gets shard rate limited every day: the 429s start as morning traffic increases, continue throughout the day, and go away in the evening.
The architecture is Vector HTTP sinks posting into the ingest V2 API. Looking at the Vector logs, I can see that all of the rate limiting happens to the same sink/index pair.
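For reference, the failing sink looks roughly like the sketch below (the hostname, port, and input name are placeholders, not our actual values; the component id is the one from the log that follows):

```yaml
sinks:
  shared_log_to_quickwit_foo:    # component_id seen in the warning below
    type: http
    inputs:
      - shared_log               # placeholder input name
    uri: http://quickwit-indexer.internal:7280/api/v1/foo/ingest
    encoding:
      codec: json                # one JSON document per event
    framing:
      method: newline_delimited  # the ingest endpoint takes NDJSON
```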
```
2025-11-11T17:50:14.919422Z WARN sink{component_kind="sink" component_id=shared_log_to_quickwit_foo component_type=http}:request{request_id=1753}: vector::sinks::util::retries: Retrying after response. reason=too many requests
```

(This log repeats continuously; no other components report rate limiting.)
Some metrics from when the 429s are occurring:
Sometimes the control plane decides to add shards and the 429s go away:
My best guess at the moment is that we're simply running too many indexers and that per-index throughput is highly skewed: 2 indexes account for ~70% of the throughput, and the other 23 account for the rest. My plan was to reduce the indexer count to see whether the average short-term/long-term (ST/LT) ingestion throughput per node would increase enough to fix the issue. But I wanted to report my current state first, to check whether my understanding is correct and whether this issue is worth some attention. Thanks for all the work, and let me know if you need more information.
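To make the shard math concrete (the 60 MB/s figure below is hypothetical, just for illustration): with shard_throughput_limit at 5.2 MB, each shard accepts roughly that much per second, so an index's sustained ingest rate puts a floor on the number of shards it needs before persists start coming back as 429:

```
min_shards ≈ index_ingest_rate / shard_throughput_limit
           ≈ (60 MB/s) / (5.2 MB/s)
           ≈ 12 shards
```

If the control plane's scale-up lags the morning traffic ramp, the existing shards saturate and everything over the limit is rejected until new shards are added, which would match the behavior above.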
build info
```json
{
  "build": {
    "build_date": "2025-05-23T00:29:41Z",
    "build_profile": "release",
    "build_target": "aarch64-unknown-linux-gnu",
    "cargo_pkg_version": "0.8.0",
    "commit_date": "unknown",
    "commit_hash": "unknown",
    "commit_short_hash": "unknown",
    "commit_tags": [],
    "version": "0.8.0-nightly"
  },
  "runtime": {
    "num_cpus": 4,
    "num_threads_blocking": 3,
    "num_threads_non_blocking": 1
  }
}
```
ingest and indexer config
```json
{
  "ingest_api_config": {
    "max_queue_memory_usage": "2.1 GB",
    "max_queue_disk_usage": "4.3 GB",
    "replication_factor": 1,
    "content_length_limit": "10.5 MB",
    "shard_throughput_limit": "5.2 MB",
    "shard_burst_limit": "52.4 MB",
    "shard_scale_up_factor": 1.5
  },
  "indexer_config": {
    "split_store_max_num_bytes": "100.0 GB",
    "split_store_max_num_splits": 1000,
    "max_concurrent_split_uploads": 12,
    "max_merge_write_throughput": null,
    "merge_concurrency": 2,
    "enable_otlp_endpoint": true,
    "enable_cooperative_indexing": false,
    "cpu_capacity": "4000m"
  }
}
```
25 indexes, all with the same config
```yaml
# index config
version: 0.9
index_id: foo
doc_mapping:
  mode: dynamic
  field_mappings:
    - name: timestamp
      type: datetime
      input_formats:
        - rfc3339
        - unix_timestamp
        - iso8601
      fast: true
      fast_precision: milliseconds
    - name: message
      type: json
      tokenizer: raw
      fast: true
    - name: quickwit_message
      type: text
      tokenizer: default
      record: position
      fieldnorms: true
      fast:
        normalizer: lowercase
  timestamp_field: timestamp
indexing_settings:
  commit_timeout_secs: 30
search_settings:
  default_search_fields: [quickwit_message]
```
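One possible workaround, instead of shrinking the cluster, would be raising the per-shard limits. A minimal sketch, assuming these keys are settable under ingest_api in the node-config YAML the same way they appear in the effective-config dump above (I haven't verified this against the 0.8 docs):

```yaml
# Assumption: keys settable under `ingest_api`, mirroring the dump above.
ingest_api:
  shard_throughput_limit: 10MiB  # effective value above: "5.2 MB" (~5 MiB)
  shard_burst_limit: 100MiB      # effective value above: "52.4 MB" (~50 MiB)
```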