Skip to content

[BUG] doc_count values for a terms/multi-terms aggregation is approximate #4833

@LantaoJin

Description

@LantaoJin

What is the bug?
The correctness issue happens on following PPL queries on Clickbench:
Q15 and its plan
Q16 and its plan
Q17 and its plan
Q22 and its plan
Q34 and its plan
Q35 and its plan

For example, the query of Q16 is fully pushed down to DSL.
The following DSL query returns incorrect (approximate) TopK:

{
  "from": 0,
  "size": 0,
  "timeout": "1m",
  "aggregations": {
    "UserID": {
      "terms": {
        "field": "UserID",
        "size": 10,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "order": [
          {
            "count()": "desc"
          },
          {
            "_key": "asc"
          }
        ]
      },
      "aggregations": {
        "count()": {
          "value_count": {
            "field": "_index"
          }
        }
      }
    }
  }
}

The result is

{"took":4205,"timed_out":false,"terminated_early":true,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":10000,"relation":"gte"},"max_score":null,"hits":[]},"aggregations":{"UserID":{"doc_count_error_upper_bound":-1,"sum_other_doc_count":99890741,"buckets":[{"key":1313338681122956954,"doc_count":29097,"count()":{"value":29097}},{"key":1907779576417363396,"doc_count":25331,"count()":{"value":25331}},{"key":2305303682471783379,"doc_count":10594,"count()":{"value":10594}},{"key":7982623143712728547,"doc_count":7584,"count()":{"value":7584}},{"key":6018350421959114808,"doc_count":6678,"count()":{"value":6678}},{"key":1090981537032625727,"doc_count":6197,"count()":{"value":6197}},{"key":5730251990344211405,"doc_count":6019,"count()":{"value":6019}},{"key":7280399273658728997,"doc_count":6018,"count()":{"value":6018}},{"key":835157184735512989,"doc_count":5211,"count()":{"value":5211}},{"key":770542365400669095,"doc_count":4027,"count()":{"value":4027}}]}}}

But the expected result is

UserID,count(1)
1313338681122956954,29097
1907779576417363396,25333
2305303682471783379,10597
7982623143712728547,7584
6018350421959114808,6678
7280399273658728997,6411
1090981537032625727,6197
5730251990344211405,6019
835157184735512989,5211
770542365400669095,4906

The actual doc_count is 6411 by querying following DSL query:

{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "UserID": "7280399273658728997"
          }
        }
      ]
    }
  }
}

This is an OpenSearch Core issue, cannot be fixed within SQL plugin.

Metadata

Metadata

Assignees

Type

No type

Projects

Status

Not Started

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions