-
Notifications
You must be signed in to change notification settings - Fork 178
Description
Problem Description
OpenSearch SQL experienced severe memory exhaustion and Out-of-Memory (OOM) errors when executing PPL queries with 10+ consecutive filtering operations (regex and/or where clauses), regardless of dataset size. The issue manifested as:
- Memory consumption growing exponentially with the number of filtering operations (regex/where)
- GC overhead limit exceeded errors
- Query failures even with small datasets
- System becoming unresponsive during query execution
Example Failing Queries
Example 1: Multiple regex operations
source=accounts
| regex firstname="^[A-Z][a-z]+"
| regex lastname="^[A-Z][a-z]+$"
| regex address="\\d+\\s+[A-Za-z]+\\s+(Street|Lane|Avenue|Place|Court|Terrace)"
| regex employer="^[A-Z][a-z]+"
| regex email="[a-z]+[a-z]+@[a-z]+\\.(com|net|org)$"
| regex city="^[A-Z][a-z]+$"
| regex state="^[A-Z]{2}$"
| regex gender="^(M|F)$"
| regex firstname=".*[aeiouAEIOU].*"
| regex lastname=".*[bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ].*"
| regex address=".*[0-9]{3}.*"
| regex employer=".*[aeiouAEIOU]$"
| regex email=".*[a-z]{5,}@.*"
| regex city=".*[aeiouyAEIOUY]{2}.*"
| regex address="^[0-9]+\\s.*"
| regex lastname="[A-Z][a-z]*[aeiouy]$"
| where age >= 25 AND age <= 40
| fields firstname, lastname, email, employer, address, city, state
| head 10Example 2: Multiple where operations
source=accounts
| where age > 20
| where age < 50
| where balance > 1000
| where balance < 50000
| where account_number > 0
| where account_number < 1000
| where age != 25
| where age != 30
| where age != 35
| where balance != 5000
| where balance != 10000
| where account_number != 100
| where account_number != 200
| where age > 21
| where age < 49
| where balance > 1001
| fields firstname, lastname, age, balance
| head 10Example Failures in Log
[2025-11-20T20:55:54,723][INFO ][o.o.s.p.PPLService ] [integTest-0] [9d49584e-83ad-484b-9fc8-5489de5517a5] Incoming request source=table | where identifier > *** | where identifier < *** | where identifier > *** | where identifier < *** | where identifier > *** | where identifier < *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier > *** | where identifier < *** | where identifier > *** | fields + identifier,identifier,identifier,identifier | head 10
[2025-11-20T20:56:04,871][INFO ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][123] overhead, spent [372ms] collecting in the last [1s]
[2025-11-20T20:56:05,878][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][124] overhead, spent [686ms] collecting in the last [1s]
[2025-11-20T20:56:07,136][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][125] overhead, spent [1.1s] collecting in the last [1.2s]
[2025-11-20T20:56:08,412][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][126] overhead, spent [1.2s] collecting in the last [1.2s]
[2025-11-20T20:56:09,536][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][127] overhead, spent [1s] collecting in the last [1.1s]
[2025-11-20T20:56:10,601][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][128] overhead, spent [1s] collecting in the last [1s]
[2025-11-20T20:56:11,753][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][129] overhead, spent [1s] collecting in the last [1.1s]
[2025-11-20T20:56:12,914][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][130] overhead, spent [1.1s] collecting in the last [1.1s]
[2025-11-20T20:56:13,920][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][131] overhead, spent [967ms] collecting in the last [1s]
...Root Cause Analysis
Initial Misdiagnosis
Initially suspected that Pattern.compile() calls were causing memory issues due to lack of caching. However, implementing pattern caching did not resolve the problem.
Actual Root Cause
The real issue was identified by examining the Calcite query planning process:
-
Deep Filter RelNode Chains: Each
regexorwherecommand in the PPL query created a separateFilterRelNode during the analysis phase, resulting in a deeply nested chain:Filter(regex15) └─ Filter(regex14) └─ Filter(regex13) └─ ... (10+ levels deep) └─ Scan(table) -
Calcite FilterMergeRule Explosion: During the optimization phase, Calcite's
FilterMergeRuleattempts to merge these Filter nodes. However, the intersection with various push-down rules caused a combinatorial explosion:- The rule tried to generate all possible combinations of filter orderings
- With 10+ filters, this created millions of intermediate RelNode objects
- Each intermediate object consumed memory
- The planning queue grew exponentially
-
Memory Exhaustion: The memory was exhausted not during query execution, but during the query planning/optimization phase itself, before any data was processed.
Key Insight
The problem was architectural: too many separate Filter RelNodes triggered exponential growth in Calcite's optimization queue due to the intersection of FilterMergeRule and push-down optimization rules.
Metadata
Metadata
Assignees
Labels
Type
Projects
Status