UN-3349 [FIX] Sanitize SQL identifiers in database connectors#1872
kirtimanmishrazipstack merged 9 commits into `main`
Conversation
Add defense-in-depth SQL injection prevention: validate identifiers against an allowlist regex, quote them with DB-specific styles, and parameterize WHERE clause values. Fixes GHSA-8mj6-f7hp-pwp9 (CVSS 8.8).

- New `sql_safety.py` utility: `validate_identifier()`, `quote_identifier()`, `safe_identifier()` with a `QuoteStyle` enum (DOUBLE_QUOTE, BACKTICK, SQUARE_BRACKET)
- Fix all 8 connectors: PostgreSQL, Redshift, MSSQL, MySQL, MariaDB, Oracle, Snowflake, BigQuery
- Parameterize auth token queries in prompt-service and x2text-service
- Fix Snowflake error message leaking the SQL query into the HTTP response
- Add `get_quote_style()` abstract method to the `UnstractDB` base class
- Add `params` support to `UnstractDB.execute()`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
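The description above maps to a small API surface. A minimal sketch of what such a module can look like (a hypothetical reconstruction from the names listed; the project's actual regex and escaping rules may differ):

```python
import re
from enum import Enum

# Hypothetical reconstruction of the sql_safety API named above;
# the real module's allowlist and escaping details may differ.
_IDENTIFIER_PATTERN = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_-]*$")

class QuoteStyle(Enum):
    DOUBLE_QUOTE = '"'    # PostgreSQL, Redshift, Snowflake, Oracle
    BACKTICK = "`"        # MySQL, MariaDB, BigQuery
    SQUARE_BRACKET = "["  # MSSQL

def validate_identifier(name: str) -> str:
    """Allowlist check: reject identifiers containing SQL metacharacters."""
    if not name or not _IDENTIFIER_PATTERN.match(name):
        raise ValueError(f"Invalid SQL identifier: {name!r}")
    return name

def quote_identifier(name: str, style: QuoteStyle) -> str:
    """Wrap an identifier in DB-specific quotes, doubling embedded closers."""
    if style is QuoteStyle.SQUARE_BRACKET:
        return "[" + name.replace("]", "]]") + "]"
    ch = style.value
    return ch + name.replace(ch, ch + ch) + ch

def safe_identifier(name: str, style: QuoteStyle) -> str:
    """Validate first, then quote: the defense-in-depth combination."""
    return quote_identifier(validate_identifier(name), style)
```

Even if validation were somehow bypassed, the quoting layer still neutralizes embedded quote characters, which is the defense-in-depth idea.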
26 tests covering validate_identifier, quote_identifier, and safe_identifier across all QuoteStyle variants. Tests include real-world injection payloads from the security advisory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
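A sketch of the flavor of such payload tests; the payloads and the inline validator stand-in are illustrative, not copied from the advisory or the actual test file:

```python
import re

# Stand-in validator so this snippet runs standalone; the real tests
# would import validate_identifier from the sql_safety module.
_PATTERN = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_-]*$")

def validate_identifier(name):
    if not name or not _PATTERN.match(name):
        raise ValueError(f"Invalid SQL identifier: {name!r}")
    return name

# Illustrative injection payloads in the spirit of the advisory
PAYLOADS = [
    'users"; DROP TABLE users;--',
    "users'); DELETE FROM tokens;--",
    "users/**/OR/**/1=1",
    "users`; INSERT INTO x VALUES(1);--",
]

def test_payloads_rejected():
    for payload in PAYLOADS:
        try:
            validate_identifier(payload)
        except ValueError:
            continue  # rejected, as expected
        raise AssertionError(f"payload accepted: {payload!r}")

test_payloads_rejected()
```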
Summary by CodeRabbit: Release Notes

Walkthrough: Introduces a `sql_safety` module and applies identifier validation/quoting plus parameterized query support across database connectors and authentication code; updates `UnstractDB` execute/insert APIs and adds tests for identifier safety.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
| Filename | Overview |
|---|---|
| unstract/connectors/src/unstract/connectors/databases/sql_safety.py | New utility module: allowlist regex + DB-specific quoting + combined safe_identifier(); P1 issue: allowlist permits hyphens which are safe only for quoted identifiers but cause SQL errors in unquoted column-name paths |
| unstract/connectors/src/unstract/connectors/databases/unstract_db.py | Base class: execute() gains params support, get_sql_insert_query() becomes instance method, column names validated-only (not quoted); unquoted hyphenated column names would cause SQL syntax errors |
| unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py | BigQuery: get_information_schema uses job_config parameterization, DDL uses safe_identifier with backtick; execute() override silently discards params arg (P2) |
| unstract/connectors/src/unstract/connectors/databases/postgresql/postgresql.py | PostgreSQL: SET search_path uses psycopg2.sql.Identifier(), INSERT columns fully quoted via safe_identifier, no issues found |
| unstract/connectors/src/unstract/connectors/databases/snowflake/snowflake.py | Snowflake: table name quoted, column names validated-only (intentional for UPPERCASE normalization); hyphenated column names would fail SQL; error message sanitized |
| unstract/connectors/src/unstract/connectors/databases/oracle_db/oracle_db.py | Oracle: table name quoted, column names validate-only (intentional for UPPERCASE normalization); parameterized user_tab_columns query with named :table_name parameter |
| unstract/connectors/src/unstract/connectors/databases/mssql/mssql.py | MSSQL: uses SQUARE_BRACKET quoting, IF NOT EXISTS WHERE literals use validated raw values (acknowledged in prior review, documented with comments) |
| unstract/connectors/src/unstract/connectors/databases/redshift/redshift.py | Redshift: validate_identifier(self.schema) in get_engine() is safe because constructor defaults schema to 'public' and guards against empty values |
| unstract/connectors/src/unstract/connectors/databases/mysql/mysql.py | MySQL: information_schema query parameterized with %s, DDL uses safe_identifier with backtick quoting, no issues found |
| unstract/connectors/src/unstract/connectors/databases/mariadb/mariadb.py | MariaDB: identical improvements to MySQL, information_schema parameterized, DDL uses backtick quoting, no issues found |
| unstract/connectors/tests/databases/test_sql_safety.py | 26 unit tests covering validation, quoting styles, real-world injection payloads, and dot-qualified names; good coverage of the new module |
| prompt-service/src/unstract/prompt_service/helpers/auth.py | Token lookup queries parameterized with %s placeholders, eliminating direct string interpolation of user-supplied tokens |
| x2text-service/app/authentication_middleware.py | Bearer token lookup query parameterized with (token,) tuple, same fix as prompt-service auth |
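The last two table rows describe the auth fix: the token travels as a bound parameter instead of being interpolated into the SQL string. A minimal sketch of that pattern using sqlite3 (which uses `?` placeholders, whereas psycopg2 and the MySQL drivers use `%s`); the table and column names here are hypothetical, not the services' real schema:

```python
import sqlite3

# Hypothetical schema for illustration; the services' real tables differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE platform_keys (key TEXT, is_active INTEGER)")
conn.execute("INSERT INTO platform_keys VALUES ('secret-token', 1)")

def lookup_token(conn: sqlite3.Connection, token: str) -> bool:
    # The driver binds the value; the token never becomes SQL text,
    # so injection payloads are matched literally (and fail to match).
    row = conn.execute(
        "SELECT is_active FROM platform_keys WHERE key = ?", (token,)
    ).fetchone()
    return bool(row and row[0])
```

An injection attempt like `"' OR '1'='1"` is simply compared as a literal key and finds no row.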
Comments Outside Diff (1)
unstract/connectors/src/unstract/connectors/databases/sql_safety.py, line 29: **Hyphens allowed by allowlist but unsafe in unquoted column-name paths**

`_IDENTIFIER_PATTERN` explicitly allows hyphens (`-`) so that table/schema names like `my-project` pass validation. However, several code paths call `validate_identifier` on column names and then embed them unquoted directly in SQL:

- `unstract_db.py` `create_table_query` (line ~178): `validate_identifier(key)` then `sql_query += f"{key} {sql_type}, "`
- `unstract_db.py` `get_sql_insert_query` (lines ~202-204): `validate_identifier(k)` then `keys_str = ",".join(sql_keys)`
- `snowflake.py` `get_sql_insert_query` (lines ~350-352): same pattern
- `oracle_db.py` `get_sql_insert_query`: same pattern

If a workflow's prompt schema defines a field named `my-field`, the key reaches these paths, passes `validate_identifier`, and produces SQL like:

```sql
CREATE TABLE … (my-field VARCHAR(…)) -- parsed as: my MINUS field VARCHAR(...)
INSERT INTO … (my-field) VALUES (…)  -- same parse error
```

This causes a hard SQL syntax error at runtime, while the validation layer gave a false green light.

**Root cause:** the same allowlist is shared between identifiers that will always be quoted (table/schema names via `safe_identifier`) and identifiers that are intentionally left unquoted (column names, to preserve Snowflake/Oracle upper-case normalization).

**Possible fix:** strip hyphens from `_IDENTIFIER_PATTERN` (i.e., `re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")`) so the single pattern is safe for both quoted and unquoted contexts. Alternatively, introduce a stricter `_COLUMN_IDENTIFIER_PATTERN` for unquoted column-name contexts and use it in the paths that skip quoting.
Prompt To Fix All With AI
---
This is a comment left during a code review.
Path: unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py
Line: 129-134
Comment:
**`execute` silently drops `params` argument**
The base class `execute` signature was updated to accept `params` and the base-class `get_information_schema` now always passes `params=(table_name,)`. BigQuery overrides `get_information_schema`, so that particular call path is safe today. However, BigQuery's `execute` override accepts `params` and silently discards it:
```python
def execute(self, query: str, params: Any = None) -> Any:
try:
query_job = self.get_engine().query(query) # params never forwarded
return query_job.result()
```
Any future caller that does `self.execute(query, params=(...))` on a BigQuery instance will receive incorrect results without any error. The correct behavior is to either forward params via a `QueryJobConfig`, or raise `NotImplementedError` if called with params — rather than silently ignoring them.
How can I resolve this? If you propose a fix, please make it concise.

Reviews (7): Last reviewed commit: "Merge branch 'main' into fix/UN-3349-san..."
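One way to forward `params` is to convert them into `bigquery.ScalarQueryParameter` objects inside a `QueryJobConfig` (both real classes in google-cloud-bigquery). The library-free sketch below shows only the hypothetical type-mapping step; real code would wrap each resulting `(name, type, value)` triple in `ScalarQueryParameter` and pass the list as `QueryJobConfig(query_parameters=...)`:

```python
# Hypothetical helper: map positional Python values to the
# (name, bigquery_type, value) triples that ScalarQueryParameter takes.
# Positional parameters are anonymous, hence name=None.
_BQ_TYPES = {bool: "BOOL", int: "INT64", float: "FLOAT64", str: "STRING"}

def to_scalar_param_specs(params):
    if params is None:
        return []
    specs = []
    for value in params:
        bq_type = _BQ_TYPES.get(type(value))
        if bq_type is None:
            # Better to fail loudly than to silently drop the parameter.
            raise NotImplementedError(f"Unsupported param type: {type(value)!r}")
        specs.append((None, bq_type, value))
    return specs
```

Raising for unsupported types follows the review's point: silently ignoring `params` produces wrong results with no error.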
Actionable comments posted: 3
Caution: some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
prompt-service/src/unstract/prompt_service/helpers/auth.py (1)
29-40: ⚠️ Potential issue | 🟡 Minor: Consider not logging bearer tokens in error messages.
Similar to the x2text-service, the error messages log the actual bearer token value (lines 29, 35-36, 40), which could expose sensitive credentials in log files.
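A minimal masking helper in the spirit of the suggestion (the name and the 4-character tail are assumptions, not the project's code):

```python
def mask_token(token, visible: int = 4) -> str:
    """Return a log-safe form of a credential: only the tail is shown."""
    if not token:
        return "<missing>"
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]
```

The logger calls would then read, e.g., `app.logger.error(f"Authentication failed for token {mask_token(token)}")`, keeping the context without the plaintext credential.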
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `prompt-service/src/unstract/prompt_service/helpers/auth.py` around lines 29-40: the error logs currently print the full bearer token (variable `token`) in multiple `app.logger.error` calls (the checks around `platform_key`, `is_active`, and authentication failure). Update these logger calls to avoid emitting the raw token by either removing the token from the message or logging a sanitized version (e.g., mask all but the last 4 chars or log a hashed representation), keeping the contextual text ("Authentication failed. bearer token not found" / "Token is not active" / "Authentication failed. Invalid bearer token") without the plaintext credential; change the three `app.logger.error` invocations that reference `token` to use the sanitized variable (e.g., `sanitized_token` or `token_hash`) or omit the token entirely.

x2text-service/app/authentication_middleware.py (1)
36-51: ⚠️ Potential issue | 🟡 Minor: Consider not logging bearer tokens in error messages.
The error messages log the actual bearer token value (lines 37, 44-45, 50), which could expose sensitive credentials in log files. Consider logging only a truncated/masked version or omitting the token entirely.
🛡️ Suggested approach

```diff
 if not result_row or len(result_row) == 0:
     current_app.logger.error(
-        f"Authentication failed. bearer token not found {token}"
+        "Authentication failed. bearer token not found"
     )
     return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `x2text-service/app/authentication_middleware.py` around lines 36-51: the `current_app.logger.error` calls are logging the full bearer token. Change the three error logs (the "bearer token not found", "Token is not active", and "Invalid bearer token" messages, which currently interpolate `{token}`) to avoid emitting the raw token: either omit the token entirely or replace it with a masked/truncated value (e.g., show only the last 4 chars or a fixed number of asterisks) before passing it to `current_app.logger.error`, so the log statements no longer contain the full sensitive token string.

unstract/connectors/src/unstract/connectors/databases/snowflake/snowflake.py (2)
152-153: ⚠️ Potential issue | 🟠 Major: Stop logging raw SQL text and bound values.
The exception detail is sanitized now, but these debug/error logs still dump the full statement and row payloads. That leaks query internals back into logs and undercuts the security goal of this change.
Also applies to: 171-177
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@unstract/connectors/src/unstract/connectors/databases/snowflake/snowflake.py` around lines 152 - 153, Remove the debug logs that print raw SQL and bound parameters in the execute_query flow: do not log sql_query or sql_values directly in SnowflakeConnector.execute_query (and corresponding debug/error calls around lines 171-177). Instead log a non-sensitive fingerprint (e.g., hash of sql_query), the query type or first N characters if needed, and a count of bound parameters, or redact values entirely. Update any logger.debug/logger.error calls in that function to use the fingerprint/count/redacted message so no full SQL text or row payloads are written to logs.
112-124: ⚠️ Potential issue | 🔴 Critical: Quote the fixed Snowflake columns consistently in DDL and DML statements.
`get_create_table_base_query()` creates built-in columns unquoted (e.g., `created_by TEXT`), which Snowflake stores as uppercase `CREATED_BY`. However, `get_sql_insert_query()` now references these columns with quoted lowercase identifiers (e.g., `"created_by"`), which Snowflake treats as a different column. Insert operations for rows including built-in columns will fail with "column not found" errors. Apply one consistent convention across all DDL and DML statements: either quote all fixed column names (and normalize to uppercase), or leave all unquoted (and normalize identifiers before insertion).
Also applies to: 133-142, 346-348
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@unstract/connectors/src/unstract/connectors/databases/snowflake/snowflake.py` around lines 112 - 124, get_create_table_base_query defines built-in columns unquoted while get_sql_insert_query uses quoted lowercase identifiers, causing Snowflake to treat them as different columns; make quoting consistent by either quoting all built-in column names in get_create_table_base_query (use safe_identifier/QuoteStyle.DOUBLE_QUOTE or explicit double quotes and normalize names to lowercase in DML), or remove quotes from get_sql_insert_query and normalize identifiers to uppercase so both DDL and DML reference the same identifier form; update the column definitions in get_create_table_base_query and the corresponding column references in get_sql_insert_query (and the related blocks at the ranges mentioned) to use the same quoting/normalization approach.unstract/connectors/src/unstract/connectors/databases/oracle_db/oracle_db.py (1)
119-129: ⚠️ Potential issue | 🔴 Critical: Quoted table name with unquoted built-in columns breaks Oracle identifier semantics.
`get_create_table_base_query()` quotes the table identifier but leaves built-in columns unquoted, while `get_sql_insert_query()` quotes every column reference, and `get_information_schema()` uses `UPPER(:table_name)` to query the dictionary. In Oracle, unquoted identifiers normalize to uppercase at creation but are stored as-is when quoted. This creates a mismatch: unquoted columns in CREATE TABLE are stored uppercase (ID, CREATED_BY, etc.), but quoted columns in INSERT preserve case (lowercase "id", "created_by"), so subsequent INSERT operations will fail with ORA-00904 (column not found). Additionally, the schema lookup will fail for non-uppercase table names. Normalize table and column names to uppercase before quoting, or keep all identifiers unquoted consistently.
Also applies to: 148-157, 173-183, 224-226
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@unstract/connectors/src/unstract/connectors/databases/oracle_db/oracle_db.py` around lines 119 - 129, get_create_table_base_query() currently quotes the table name via safe_identifier(..., QuoteStyle.DOUBLE_QUOTE) but emits unquoted column names (which Oracle uppercases), while get_sql_insert_query() quotes every column (preserving case), and get_information_schema() uses UPPER(:table_name), causing identifier mismatches and ORA-00904. Fix by normalizing identifiers to uppercase before quoting (or stop quoting all identifiers) so table and column names are consistently created and referenced; update get_create_table_base_query(), get_sql_insert_query(), and get_information_schema() to call .upper() on table and column names prior to passing to safe_identifier or to use unquoted identifiers consistently, and ensure safe_identifier usage is uniform (QuoteStyle.DOUBLE_QUOTE) across these functions so creation, insert and schema lookup use the same canonical form.
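The "normalize to uppercase before quoting" option from both comments can be sketched as a standalone helper (names assumed; the real connectors would route this through `safe_identifier`):

```python
import re

def safe_upper_identifier(name: str) -> str:
    # Uppercase first, then double-quote: the quoted form in DML now
    # matches columns that Snowflake/Oracle created unquoted in DDL
    # (unquoted identifiers are normalized to UPPERCASE on creation).
    if not re.match(r"^[a-zA-Z_][a-zA-Z0-9_]*$", name):
        raise ValueError(f"Invalid identifier: {name!r}")
    return '"' + name.upper() + '"'

# CREATE TABLE t (created_by TEXT)   -- stored as CREATED_BY
# INSERT INTO t ("CREATED_BY") ...   -- resolves to the same column
```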
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py`:
- Around line 129-132: The execute method currently ignores params; change it so
when params is provided you build a google.cloud.bigquery.QueryJobConfig with
query_parameters set (convert positional tuple/list into BigQuery
ScalarQueryParameter objects, or map dict keys to NamedQueryParameter objects)
and pass that as job_config to self.get_engine().query(query,
job_config=job_config) before returning query_job.result(); update execute and,
if necessary, add imports for QueryJobConfig and the appropriate Parameter
classes and handle the None/no-params path to keep current behavior.
- Around line 228-233: The BigQuery parameter names are being generated with
backticks (e.g., "@`{key}`") which is invalid for BigQuery named parameters;
update the code in the INSERT builder and the other occurrence (previously
around lines 279–281) to emit parameters as "@{key}" instead while keeping the
identifier validation via validate_identifier(key); specifically change the list
comprehensions/formatting that build escaped_params/parameter placeholders to
use f"@{key}" (or equivalent) instead of f"@`{key}`" so binding with
ScalarQueryParameter works.
In `@unstract/connectors/src/unstract/connectors/databases/unstract_db.py`:
- Around line 178-179: The function get_sql_insert_query currently uses an
implicit Optional by annotating sql_values: list[str] = None; update its type
hint to be explicit (e.g., sql_values: list[str] | None or sql_values:
Optional[list[str]]) and add the corresponding import (from typing import
Optional) if you choose Optional, so the parameter explicitly allows None;
adjust only the annotation for the sql_values parameter in get_sql_insert_query.
…fix type hints

- Remove redundant `validate_identifier()` calls in BigQuery and Oracle (`safe_identifier` already validates internally)
- Fix type annotation: `list[str] = None` → `list[str] | None = None` in unstract_db.py and snowflake.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🧹 Nitpick comments (1)
unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py (1)
330-384: Good security improvement with a minor exception-handling issue. The `get_information_schema` method now properly:

- Validates `project`, `dataset`, and `table` via `validate_identifier()` before embedding them in the schema path
- Uses a parameterized query via `QueryJobConfig` with `ScalarQueryParameter` for the `table_name` filter

The static analysis S608 warning on lines 358-360 is a false positive - the identifiers are validated before being embedded.
However, the exception handling on lines 372-373 could be improved by chaining the exception:
♻️ Proposed improvement for exception chaining

```diff
 try:
     query_job = self.get_engine().query(query, job_config=job_config)
     results = query_job.result()
 except Exception as e:
-    raise ConnectorError(str(e))
+    raise ConnectorError(str(e)) from e
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@unstract/connectors/src/unstract/connectors/databases/bigquery/bigquery.py` around lines 330 - 384, In get_information_schema, the except block currently raises ConnectorError with only str(e); change it to chain the original exception so the traceback is preserved (use "raise ConnectorError(... ) from e" or construct ConnectorError with the original exception) in the except Exception as e handler where query_job = self.get_engine().query(...) is called; this keeps the same message but attaches the original exception for debugging while preserving existing behavior.
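The effect of `from e` is easy to see in a self-contained example (`ConnectorError` here is a stand-in for the connector's real exception type):

```python
class ConnectorError(Exception):
    """Stand-in for the connector's error type."""

def run_query():
    try:
        raise ValueError("bigquery: syntax error")  # simulated driver failure
    except Exception as e:
        # "from e" records the original exception as __cause__,
        # preserving the driver traceback for debugging.
        raise ConnectorError(str(e)) from e
```

Without `from e`, the original traceback is only loosely attached as `__context__`; with it, tooling and logs show the explicit "The above exception was the direct cause" chain.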
…patibility

Snowflake and Oracle normalize unquoted identifiers to UPPERCASE. Quoting column names (e.g., "created_by") preserves lowercase, causing a mismatch with existing tables where columns are stored as CREATED_BY. Changed approach: validate column names with the allowlist regex (blocks injection) but don't quote them, preserving the original unquoted behavior. Table names remain quoted since they're quoted consistently in both CREATE and INSERT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addressing outside-diff findings from CodeRabbit review (#3997809940):

- Snowflake/Oracle column case mismatch (Critical): Fixed in 228c129. Changed to validate-only for column names (no quoting) in Snowflake, Oracle, and the base class. Snowflake/Oracle normalize unquoted identifiers to UPPERCASE, so quoting lowercase column names would break existing tables.
- Bearer tokens logged in error messages (Minor, prompt-service + x2text): Pre-existing pattern outside our diff. Valid concern but out of scope for this security fix; should be addressed in a follow-up.
- Snowflake debug logs still dump raw SQL (Major): Pre-existing
Code review

Found 1 issue:
🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎.
@athul-rs Already addressed in this reply.
kirtimanmishrazipstack left a comment
Code review
Found 3 issues:
- MSSQL `get_create_table_base_query` still uses string interpolation in the WHERE clause -- While `safe_identifier()` validates the values first (blocking metacharacters), the existence check still interpolates `schema_name` and `table_name` directly as string literals (`WHERE TABLE_SCHEMA = '{schema_name}'`). This is inconsistent with the defense-in-depth approach used in `get_information_schema` in the same file, which correctly uses `%s` parameterized queries. Consider parameterizing these, or adding a comment explaining why parameterization is not feasible in this DDL `IF NOT EXISTS` context.
unstract/unstract/connectors/src/unstract/connectors/databases/mssql/mssql.py
Lines 126 to 142 in bf1c577
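Where binding is feasible, the existence check can be assembled in the same style as `get_information_schema` (pymssql's paramstyle is `%s`; the query text below is a sketch, and as the comment notes, parameter binding may not be possible inside every DDL `IF NOT EXISTS` construct):

```python
def build_existence_check(schema_name: str, table_name: str):
    # The WHERE literals become bound parameters instead of being
    # interpolated as '{schema_name}' / '{table_name}' string literals.
    query = (
        "SELECT 1 FROM INFORMATION_SCHEMA.TABLES "
        "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s"
    )
    return query, (schema_name, table_name)
```

`cursor.execute(query, params)` then lets the driver handle quoting of the values.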
- Allowlist regex excludes `$` and `#` without documentation -- The regex `^[a-zA-Z_][a-zA-Z0-9_-]*$` is a good secure default but intentionally excludes `$` (valid in Oracle/PostgreSQL identifiers) and `#` (MSSQL temp tables). If any deployment uses these patterns, this change will break them. Consider documenting the intentional exclusions above the pattern so future maintainers understand the tradeoff.
- BigQuery `get_information_schema` has a redundant local import -- The method imports `from google.cloud import bigquery as bq_module` at line 362, but the class `__init__` already imports and stores the module as `self.bigquery` (line 35). This creates a second import path. Consider using `bq_module = self.bigquery` instead to reuse the existing reference.
Overall this is a solid security fix with good defense-in-depth design. The sql_safety.py module is clean, well-tested (26 tests), and the parameterization changes across connectors and auth services are correct.
@kirtimanmishrazipstack Thanks for the review and the positive feedback on the overall approach!

1. MSSQL WHERE clause interpolation - Already addressed in this reply.
2. Regex excludes
3. BigQuery redundant local import - This is intentional. The class-level comment says
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test Results

Summary
Runner Tests - Full Report
SDK1 Tests - Full Report
What
- `sql_safety.py` utility for SQL identifier validation and quoting across all 8 database connectors
- Unit tests for the `sql_safety` module

Why
How
- New `sql_safety.py` module: `validate_identifier()` (allowlist), `quote_identifier()` (DB-specific), `safe_identifier()` (combined), `QuoteStyle` enum
- `unstract_db.py`: added `get_quote_style()` abstract method, parameterized `execute()`, updated `get_information_schema()`, `create_table_query()`, `get_sql_insert_query()`
- Each connector implements `get_quote_style()` and uses safe identifiers in DDL/DML/metadata queries
- PostgreSQL uses `psycopg2.sql.Identifier()` for SET search_path
- Auth services use `%s` parameterized queries (pattern already existed in the same files)

Can this PR break any existing features?
- For valid identifiers, behavior is unchanged (`"my_table"` behaves identically to `my_table`). Parameterized queries return identical results. The only behavioral change is that identifiers containing SQL metacharacters are now rejected with `ValueError`.
- `get_sql_insert_query` changed from `@staticmethod` to an instance method; all callers already invoke it on instances (`db_class.get_sql_insert_query(...)`), so this is backward compatible.

Database Migrations
Env Config
Related Issues or PRs
Dependencies Versions
Notes on Testing
- Unit tests for `sql_safety.py` covering validation, quoting, and real-world payloads

Checklist