
[BUG] Memory Exhaustion for Multiple Filtering Operations in PPL #4842

@RyanL1997

Description


Problem Description

OpenSearch SQL experienced severe memory exhaustion and Out-of-Memory (OOM) errors when executing PPL queries with 10+ consecutive filtering operations (regex and/or where clauses), regardless of dataset size. The issue manifested as:

  • Memory consumption growing exponentially with the number of filtering operations (regex/where)
  • GC overhead limit exceeded errors
  • Query failures even with small datasets
  • System becoming unresponsive during query execution

Example Failing Queries

Example 1: Multiple regex operations

source=accounts
| regex firstname="^[A-Z][a-z]+"
| regex lastname="^[A-Z][a-z]+$"
| regex address="\\d+\\s+[A-Za-z]+\\s+(Street|Lane|Avenue|Place|Court|Terrace)"
| regex employer="^[A-Z][a-z]+"
| regex email="[a-z]+[a-z]+@[a-z]+\\.(com|net|org)$"
| regex city="^[A-Z][a-z]+$"
| regex state="^[A-Z]{2}$"
| regex gender="^(M|F)$"
| regex firstname=".*[aeiouAEIOU].*"
| regex lastname=".*[bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ].*"
| regex address=".*[0-9]{3}.*"
| regex employer=".*[aeiouAEIOU]$"
| regex email=".*[a-z]{5,}@.*"
| regex city=".*[aeiouyAEIOUY]{2}.*"
| regex address="^[0-9]+\\s.*"
| regex lastname="[A-Z][a-z]*[aeiouy]$"
| where age >= 25 AND age <= 40
| fields firstname, lastname, email, employer, address, city, state
| head 10

Example 2: Multiple where operations

source=accounts
| where age > 20
| where age < 50
| where balance > 1000
| where balance < 50000
| where account_number > 0
| where account_number < 1000
| where age != 25
| where age != 30
| where age != 35
| where balance != 5000
| where balance != 10000
| where account_number != 100
| where account_number != 200
| where age > 21
| where age < 49
| where balance > 1001
| fields firstname, lastname, age, balance
| head 10

Example Failures in Logs

[2025-11-20T20:55:54,723][INFO ][o.o.s.p.PPLService       ] [integTest-0] [9d49584e-83ad-484b-9fc8-5489de5517a5] Incoming request source=table | where identifier > *** | where identifier < *** | where identifier > *** | where identifier < *** | where identifier > *** | where identifier < *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier != *** | where identifier > *** | where identifier < *** | where identifier > *** | fields + identifier,identifier,identifier,identifier | head 10
[2025-11-20T20:56:04,871][INFO ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][123] overhead, spent [372ms] collecting in the last [1s]
[2025-11-20T20:56:05,878][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][124] overhead, spent [686ms] collecting in the last [1s]
[2025-11-20T20:56:07,136][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][125] overhead, spent [1.1s] collecting in the last [1.2s]
[2025-11-20T20:56:08,412][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][126] overhead, spent [1.2s] collecting in the last [1.2s]
[2025-11-20T20:56:09,536][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][127] overhead, spent [1s] collecting in the last [1.1s]
[2025-11-20T20:56:10,601][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][128] overhead, spent [1s] collecting in the last [1s]
[2025-11-20T20:56:11,753][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][129] overhead, spent [1s] collecting in the last [1.1s]
[2025-11-20T20:56:12,914][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][130] overhead, spent [1.1s] collecting in the last [1.1s]
[2025-11-20T20:56:13,920][WARN ][o.o.m.j.JvmGcMonitorService] [integTest-0] [gc][131] overhead, spent [967ms] collecting in the last [1s]
...

Root Cause Analysis

Initial Misdiagnosis

The memory issues were initially suspected to be caused by repeated Pattern.compile() calls and a lack of pattern caching. However, implementing pattern caching (sketched below) did not resolve the problem.
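
For context, the attempted workaround was roughly a compiled-pattern cache of the following shape. This is a minimal sketch; the class name and placement are illustrative, not the actual OpenSearch SQL code:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Illustrative only: cache compiled regexes so each distinct pattern string is
// compiled at most once instead of on every evaluated row.
final class PatternCache {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    private PatternCache() {}

    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }
}

Caching of this kind removes redundant compilation work at execution time, but it never touches the planner, which is why it had no effect on the failure described below.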

Actual Root Cause

The real issue was identified by examining the Calcite query planning process:

  1. Deep Filter RelNode Chains: Each regex or where command in the PPL query created a separate Filter RelNode during the analysis phase, resulting in a deeply nested chain (see the RelBuilder sketch after this list):

    Filter(regex15)
    └─ Filter(regex14)
       └─ Filter(regex13)
          └─ ... (10+ levels deep)
             └─ Scan(table)
    
  2. Calcite FilterMergeRule Explosion: During the optimization phase, Calcite's FilterMergeRule attempts to merge these Filter nodes. However, its interaction with various push-down rules caused a combinatorial explosion:

    • The rule tried to generate all possible combinations of filter orderings
    • With 10+ filters, this created millions of intermediate RelNode objects
    • Each intermediate object consumed memory
    • The planning queue grew exponentially
  3. Memory Exhaustion: Memory was exhausted not during query execution, but during the query planning/optimization phase itself, before any data was processed.
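
The shape of the problem can be illustrated with Calcite's RelBuilder. The table name, fields, and predicates below are placeholders and this is not the OpenSearch SQL analyzer code; it only shows how planning each filtering command in isolation stacks Filter RelNodes:

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.RelBuilder;

// Illustrative only: translating each PPL filtering command into its own
// Filter RelNode stacks one Filter on top of another.
final class FilterChainSketch {
    static RelNode chainedFilters(RelBuilder b) {
        b.scan("accounts");
        b.filter(b.call(SqlStdOperatorTable.GREATER_THAN, b.field("age"), b.literal(20)));
        b.filter(b.call(SqlStdOperatorTable.LESS_THAN, b.field("age"), b.literal(50)));
        b.filter(b.call(SqlStdOperatorTable.GREATER_THAN, b.field("balance"), b.literal(1000)));
        // ...a dozen more of these and the optimizer receives a Filter-over-Filter
        // tree more than 15 levels deep, as in the examples above.
        return b.build();
    }
}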

Key Insight

The problem was architectural: too many separate Filter RelNodes triggered exponential growth in Calcite's optimization queue due to the interaction between FilterMergeRule and the push-down optimization rules.
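
One direction this insight suggests is folding consecutive predicates into a single Filter before the planner ever sees them, as in the following RelBuilder sketch. This is an illustrative example under the same placeholder schema, not the fix that was actually implemented:

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.RelBuilder;

// Illustrative only: handing all predicates to one filter() call lets RelBuilder
// AND them into a single Filter RelNode, so there is no deep chain for
// FilterMergeRule and the push-down rules to churn through.
final class CombinedFilterSketch {
    static RelNode combinedFilter(RelBuilder b) {
        return b.scan("accounts")
            .filter(
                b.call(SqlStdOperatorTable.GREATER_THAN, b.field("age"), b.literal(20)),
                b.call(SqlStdOperatorTable.LESS_THAN, b.field("age"), b.literal(50)),
                b.call(SqlStdOperatorTable.GREATER_THAN, b.field("balance"), b.literal(1000)))
            .build();
    }
}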

Labels

PPL (Piped processing language), bug (Something isn't working), performance (Make it fast!)
