-
Notifications
You must be signed in to change notification settings - Fork 204
Fix: Allow list[str] for pandas DataFrame columns and index parameters #1664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Fixes facebook#1657. This PR adds test cases demonstrating that list[str] can be used with pandas DataFrame columns and index parameters when using the corrected SequenceNotStr protocol definition from pandas main branch. ## The Issue When attempting to pass list[str] to DataFrame's columns or index parameters, type checkers reject the assignment due to a protocol incompatibility: ```python df = pd.DataFrame([[1, 2, 3]], columns=["A", "B", "C"]) # ERROR: Argument `list[str]` is not assignable to parameter `columns` ``` The issue is in pandas 2.x's SequenceNotStr protocol definition, where the index() method has mismatched parameter kinds: - pandas 2.x (broken): `def index(self, value: Any, /, start: int = 0, stop: int = ...) -> int` - list.index (actual): `def index(self, value, start=0, stop=sys.maxsize, /) -> int` Only `value` is position-only in the protocol, but all parameters are position-only in list.index, causing structural subtyping to fail. ## The Solution Pandas fixed this in their main branch (for 3.0) by making all parameters position-only to match the actual list.index signature: ```python def index(self, value: Any, start: int = ..., stop: int = ..., /) -> int ``` This PR adds test cases using inline stubs with the corrected protocol definition, verifying that the fix resolves the issue. ## Test Plan Added test suite in pyrefly/lib/test/pandas/dataframe.rs: - test_dataframe_list_str_columns: DataFrame with list[str] columns - test_dataframe_list_str_both: DataFrame with list[str] for both params Both tests use inline stubs with the corrected SequenceNotStr protocol. Verification: ``` cargo test test::pandas --lib # All tests pass ``` ## References - Issue: pandas-dev/pandas#56995 - Fixed in pandas main: https://github.com/pandas-dev/pandas/blob/main/pandas/_typing.py
This adds minimal pandas stubs with the corrected SequenceNotStr protocol to pyrefly's bundled typeshed, fixing issue facebook#1657 for all users. The stubs include: - Corrected SequenceNotStr protocol (all params position-only) - DataFrame and Series classes - Common read functions (read_csv, read_excel, read_json) - concat function These bundled stubs will be used for pandas 2.x installations and provide the fix until pandas 3.0 is released with the corrected protocol.
|
Thanks for the PR, and for looking into Pandas to diagnose our issues! Requesting review from @rchen152 because I think she has the best context on stub bundling - as a rule of thumb I don't think we would want to override released stubs, but if the typeshed stubs are badly broken for a library this important we might need to consider it. We also might be able to hack Pyrefly internals a bit if necessary as an alternative - in theory I think an edge case in |
Instead of bundling corrected pandas stubs, implement an edge case in is_subset_param_list that allows position-only parameters (PosOnly) to match regular positional parameters (Pos) in protocol checking. This fixes the pandas SequenceNotStr protocol compatibility issue where list.index() (which has position-only params) was failing to match SequenceNotStr.index() from pandas 2.x typeshed stubs (which incorrectly lacks position-only markers). The edge case is type-safe: if a protocol allows a parameter to be passed by position or keyword (Pos), an implementation that only allows positional (PosOnly) is more restrictive, which is acceptable for subtyping. Benefits over bundled stubs: - Single code change instead of maintaining separate stub files - Won't conflict with future typeshed updates - Easier to track and maintain - General improvement to protocol checking, not pandas-specific Changes: - Add edge case in pyrefly/lib/solver/subset.rs:123-136 - Add test case test_dataframe_with_broken_stubs to verify fix works with actual broken pandas 2.x stubs - Remove bundled pandas stubs (no longer needed) Fixes facebook#1657
|
Hi @stroxler, Thanks for the suggestion! I added an edge case in Added Testingcargo test test_dataframeResult: passes |
| // This handles cases like list.index() (which has position-only params) matching | ||
| // SequenceNotStr.index() from pandas 2.x typeshed stubs (which incorrectly | ||
| // lacks position-only markers, fixed in pandas 3.0). | ||
| // From a typing perspective, this is sound: if a protocol allows a parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems backwards to me. The implementation should be less restrictive, so that it can be used everywhere the protocol can be used.
Summary
Fixes #1657.
This fixes a type checking bug where Pyrefly incorrectly flagged errors when using
list[str]forcolumnsandindexparameters in thepd.DataFrameconstructor:The root cause is pandas 2.x's incorrect
SequenceNotStrprotocol definition in typeshed. The protocol'sindex()method signature is:This makes only
valueposition-only, butlist.index()requires ALL parameters to be position-only. Sincelist.index(value, start, stop, /)doesn't structurally match the protocol (parameter kinds differ),list[str]fails to satisfy the protocol constraint.The Fix
Added corrected pandas stubs to pyrefly's bundled typeshed matching the fix from pandas main branch
(targeted for pandas 2.3/3.0). The corrected
SequenceNotStrprotocol inpandas/_typing.pyinow defines:All parameters are position-only, matching
list.index()signature. This allowslist[str]to properly satisfy theSequenceNotStr[str]protocol used by DataFrame'scolumnsandindexparameters.The stubs include:
pandas/_typing.pyi- CorrectedSequenceNotStrprotocol and type aliasespandas/__init__.pyi- DataFrame, Series exports and common functionspandas/core/frame.pyi- DataFrame class definitionpandas/core/series.pyi- Series class definitionMETADATA.toml- Package metadatapandas/core/__init__.pyi- Core module stubTotal: 6 stub files (82 lines) + test (140 lines)
Test Plan
pyrefly/lib/test/pandas/dataframe.rs:cargo test test_dataframeResult: passes
references :
DataFrameargumentcolumnsfails to type-check when passed a list of strings (pandas 2.2.0) pandas-dev/pandas#56995