Conversation
…he same stage as the dataflow stage.
…re's no reason not to do so.
There was a problem hiding this comment.
Pull request overview
This PR refactors the C++ dataflow implementation to reduce re-evaluation of predicates across compilation stages, achieving approximately 4-5% improvement in total analysis time. The changes reorganize how cached predicates are structured and introduce a new DataFlowNodes.qll file to centralize node definitions, thereby reducing stage overlap.
Changes:
- Introduces
DataFlowNodes.qllto centralize dataflow node definitions, moving classes likeNode,ParameterNode,VariableNode, etc. fromDataFlowUtil.qll - Wraps cached predicates in
Cachedmodules with appropriate stage collapsing to prevent unnecessary recomputation - Adds new cached predicates in
IRConstruction.qllto expose translation information without causing re-evaluation - Reorganizes
TaintTrackingUtil.qllto wrap predicates in aCachedmodule for proper stage management - Moves several cached predicates from
DataFlowPrivate.qllinto theCachedmodule for better stage control
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| FlowSummaryNode.ql | Adds import of DataFlowNodes for test access to node classes |
| CaptureModels.qll | Updates references from DataFlow::FieldAddress to DataFlowNodes::FieldAddress |
| IRConstruction.qll | Adds cached predicates to expose translation information (allocation extents, transparent conversions, etc.) |
| TaintTrackingUtil.qll | Wraps taint tracking predicates in a Cached module with stage collapsing |
| SsaImplCommon.qll | Moves utility predicates into Cached module for proper stage management |
| SsaImpl.qll | Delegates to IRConstruction for initialization target address |
| PrintIRUtilities.qll, PrintIRLocalFlow.qll, PrintIRFieldFlowSteps.qll | Adds DataFlowNodes imports for access to node classes |
| ModelUtil.qll | Adds DataFlowNodes import |
| ExprNodes.qll | Uses IRConstruction cached predicates instead of direct translation class access |
| DataFlowUtil.qll | Major refactoring: moves node definitions to DataFlowNodes.qll, imports DataFlowNodes::Public |
| DataFlowPrivate.qll | Consolidates cached predicates (toString, getLocation, newtype definitions) into Cached module |
| DataFlowNodes.qll | New file containing node class definitions, Content types, and related predicates |
| FlowSummaryImpl.qll | Updates toString call to use toStringImpl() |
| ExternalFlow.qll | Adds pragma[nomagic] for performance optimization |
| import cpp | ||
| import semmle.code.cpp.rangeanalysis.RangeAnalysisUtils | ||
| import semmle.code.cpp.ir.dataflow.DataFlow | ||
| private import semmle.code.cpp.ir.dataflow.internal.DataFlowNodes |
There was a problem hiding this comment.
Doesn't look ideal that we now need to import something internal here. Does this query do something odd that requires this?
There was a problem hiding this comment.
Yes: 18dea55
I'm not sure why we did it this way, but it imports internal dataflow nodes to find definitions outside a loop. I think we could do this with the proper SSA API we have these days, but I didnt want to change that in this PR
There was a problem hiding this comment.
For standard queries you still only have to import semmle.code.cpp.dataflow.new.DataFlow
jketema
left a comment
There was a problem hiding this comment.
LGTM. As far as I can tell this is really just moving around stuff (and adding some missing comments). DCA numbers look ok.
This PR removes most of the re-evaluation of predicates from previous stages. In particular, the dataflow stage made use of various internal predicates from IR generation which caused some unfortunate re-evaluation.
There were also some other stage-related issues where various predicates in dataflow were
cached, but not put inside acachedmodule, which produced unnecessary stages.Unfortunately, this was easier said than done. It required a large refactoring of the dataflow files because
Iwe have made some fairly arbitrary choices as to what belonged inDataFlowUtil.qllandDataFlowPrivate.qllwhich made it hard to unify the stages. I tried my best to make this refactoring commit-by-commit reviewable.Most of this PR has been guided by the output of Schack's stageoverlap.py script. There should be no changes to any resulting tuples.
DCA looks great. We shave off about 4-5% of the total analysis time 🎉 Additionally, my experience is that this PR removes a lot of recomputation of
DataFlowUtil.qll/DataFlowPrivate.qllpredicates when I modify anisSourceorisSinkin a query.