-
Notifications
You must be signed in to change notification settings - Fork 30
feat(debug log): enable debug log through environment variable and have HTTP request/responses being logged as airbyte messages #711
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… have HTTP request/responses being logged as airbyte messages
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Testing This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@maxi297/debug-log-project#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch maxi297/debug-log-projectHelpful ResourcesPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
📝 WalkthroughWalkthroughAdds a platform-driven debug flag API (is_platform_debug_log_enabled) and uses it in entrypoint to choose DEBUG vs INFO. Reworks AirbyteLogFormatter to emit structured AirbyteMessage JSON for all levels, with DEBUG formatting differing when platform debug is enabled. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Entrypoint
participant LoggerAPI as logger.is_platform_debug_log_enabled
participant PythonLogging as logging
User->>Entrypoint: run(..., --debug?)
Entrypoint->>LoggerAPI: is_platform_debug_log_enabled()
LoggerAPI-->>Entrypoint: bool
alt --debug or platform debug
Entrypoint->>PythonLogging: setLevel(DEBUG)
else
Entrypoint->>PythonLogging: setLevel(INFO)
end
sequenceDiagram
participant Code as Calling Code
participant Logger as AirbyteLogFormatter
participant Env as LOG_LEVEL
participant Serializer as AirbyteMessageSerializer
Code->>Logger: format(record)
alt record.level == DEBUG
Logger->>Env: is_platform_debug_log_enabled()
alt platform debug = true
Logger->>Serializer: build AirbyteMessage(LOG, level=DEBUG, message+extras)
Serializer-->>Logger: json
else
Logger->>Serializer: build AirbyteMessage(LOG, level=DEBUG, simple fields)
Serializer-->>Logger: json
end
else
Logger->>Serializer: build AirbyteMessage(LOG, level=record.level, filtered message)
Serializer-->>Logger: json
end
Logger-->>Code: structured JSON string
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Would you like me to suggest specific unit tests or examples to validate the new debug-vs-platform behavior, wdyt? Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (2)
airbyte_cdk/logger.py (1)
75-79: Default to Level.INFO enum, not string
airbyte_levelfalls back to the literal "INFO" string which is inconsistent with the Level enum used elsewhere and could cause subtle type issues. Can we default toLevel.INFO(and optionally annotate the variable)?Apply this diff?
- airbyte_level = self.level_mapping.get(record.levelno, "INFO") + airbyte_level: Level = self.level_mapping.get(record.levelno, Level.INFO)wdyt?
airbyte_cdk/entrypoint.py (1)
161-165: Annotateis_platform_debug_log_enabledwith-> boolto satisfy mypy (and maybe capture its result in a local variable for clearer logs? wdyt?)We’re currently calling an untyped helper, which is tripping mypy. Let’s update its signature and optionally surface the flag in
entrypoint.pyfor a more descriptive debug message.• In
airbyte_cdk/logger.py(around line 43):--- a/airbyte_cdk/logger.py +++ b/airbyte_cdk/logger.py @@ -43,7 +43,7 @@ -def is_platform_debug_log_enabled(): +def is_platform_debug_log_enabled() -> bool: return os.environ.get("LOG_LEVEL", "info").lower() == "debug"• In
airbyte_cdk/entrypoint.py(lines 161–165), optionally capture the platform flag and update the log text:--- a/airbyte_cdk/entrypoint.py +++ b/airbyte_cdk/entrypoint.py @@ -161,7 +161,11 @@ - if (hasattr(parsed_args, "debug") and parsed_args.debug) or is_platform_debug_log_enabled(): + platform_debug = is_platform_debug_log_enabled() + if (hasattr(parsed_args, "debug") and parsed_args.debug) or platform_debug: self.logger.setLevel(logging.DEBUG) logger.setLevel(logging.DEBUG) - self.logger.debug("Debug logs enabled") + self.logger.debug( + "Debug logs enabled%s" % (" (platform override)" if platform_debug else "") + )These changes should clear the mypy error and give a clearer “platform override” indicator when appropriate—wdyt?
🧹 Nitpick comments (3)
airbyte_cdk/logger.py (3)
92-93: Preserve structure of extras (avoid str-casting) and make JSON robust
extract_extra_args_from_recordcoerces all extras tostr, which loses structure for request/response payloads that are dicts/lists. This contradicts the stated goal of having structureddatalocally. Could we keep native types and only fall back tostrduring serialization?Apply these diffs?
- debug_dict = {"type": "DEBUG", "message": record.getMessage(), "data": extras} - return filter_secrets(json.dumps(debug_dict)) + debug_dict = {"type": "DEBUG", "message": record.getMessage(), "data": extras} + # Ensure non-serializable extras don't break logging + return filter_secrets(json.dumps(debug_dict, default=str))- default_attrs = logging.LogRecord("", 0, "", 0, None, None, None).__dict__.keys() - extra_keys = set(record.__dict__.keys()) - default_attrs - return {k: str(getattr(record, k)) for k in extra_keys if hasattr(record, k)} + default_attrs = logging.LogRecord("", 0, "", 0, None, None, None).__dict__.keys() + extra_keys = set(record.__dict__.keys()) - default_attrs + # Preserve native types; let JSON serialization handle them (with fallback to str at call site) + return {k: getattr(record, k) for k in extra_keys if hasattr(record, k)}wdyt?
Also applies to: 103-111
80-91: Consistent and safe serialization for platform debug pathMinor: For consistency and performance, would you consider using orjson for the extras JSON in the platform path and adding a
default=strfallback? Also, this avoids double filtering passes by filtering once on the final composed string.Apply this diff?
- message = f"{filter_secrets(record.getMessage())} ///\nExtra logs: {filter_secrets(json.dumps(extras))}" + extras_json = orjson.dumps(extras, default=str).decode() + message = filter_secrets(f"{record.getMessage()} ///\nExtra logs: {extras_json}") log_message = AirbyteMessage(type=Type.LOG, log=AirbyteLogMessage(level=airbyte_level, message=message)) return orjson.dumps(AirbyteMessageSerializer.dump(log_message)).decode()wdyt?
80-93: Optional hardening: mask common sensitive HTTP headers in extrasWhile
filter_secretshelps for values sourced from config, responses/headers often contain credentials not present in the config (e.g., Set-Cookie, Authorization). Would you like to mask a small set of common sensitive headers before logging to reduce leakage risk?As a follow-up, we could add this helper:
SENSITIVE_HEADER_NAMES = {"authorization", "proxy-authorization", "cookie", "set-cookie", "x-api-key", "x-auth-token"} def _sanitize_http_extras(extras: Mapping[str, Any]) -> Mapping[str, Any]: if not isinstance(extras, dict): return extras sanitized = dict(extras) headers = sanitized.get("headers") if isinstance(headers, Mapping): sanitized_headers = {} for k, v in headers.items(): if isinstance(k, str) and k.lower() in SENSITIVE_HEADER_NAMES: sanitized_headers[k] = "****" else: sanitized_headers[k] = v sanitized["headers"] = sanitized_headers return sanitized…then call
_sanitize_http_extras(extras)right before serializing in both DEBUG branches. Happy to wire this in if you want, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
airbyte_cdk/entrypoint.py(2 hunks)airbyte_cdk/logger.py(3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
airbyte_cdk/entrypoint.py (4)
airbyte_cdk/sources/declarative/partition_routers/substream_partition_router.py (1)
logger(421-422)airbyte_cdk/sources/streams/core.py (1)
logger(128-129)unit_tests/test_logger.py (1)
logger(15-17)airbyte_cdk/logger.py (2)
init_logger(47-52)is_platform_debug_log_enabled(43-44)
airbyte_cdk/logger.py (2)
airbyte_cdk/utils/airbyte_secrets_utils.py (1)
filter_secrets(73-80)airbyte_cdk/models/airbyte_protocol.py (1)
AirbyteMessage(79-88)
🪛 GitHub Actions: Linters
airbyte_cdk/entrypoint.py
[error] 161-161: mypy: Call to untyped function is_platform_debug_log_enabled in typed context [no-untyped-call]. Step: poetry run mypy --config-file mypy.ini airbyte_cdk
airbyte_cdk/logger.py
[error] 43-43: mypy: Function is missing a return type annotation [no-untyped-def]. Step: poetry run mypy --config-file mypy.ini airbyte_cdk
[error] 80-80: mypy: Call to untyped function is_platform_debug_log_enabled in typed context [no-untyped-call]. Step: poetry run mypy --config-file mypy.ini airbyte_cdk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-shopify
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/logger.py (4)
43-45: Add a brief docstring to clarify contract and scope?Since this function is a new public API, could we add a short docstring clarifying that it only checks the platform-controlled LOG_LEVEL env var (case-insensitive), and that CLI vs env precedence is handled by the caller (entrypoint)? It helps future readers avoid assuming it reads CLI flags. wdyt?
def is_platform_debug_log_enabled() -> bool: - return os.environ.get("LOG_LEVEL", "info").lower() == "debug" + """True if platform-level debug logging is enabled via LOG_LEVEL=debug (case-insensitive). + + Note: CLI vs env precedence is determined by the caller (e.g., entrypoint), not here. + """ + return os.environ.get("LOG_LEVEL", "info").lower() == "debug"
75-79: Use Level.INFO for the default instead of the string "INFO"The fallback currently returns a bare string, while the mapping returns Level enum values. For type consistency (and to avoid subtle comparisons), can we default to Level.INFO? wdyt?
- airbyte_level = self.level_mapping.get(record.levelno, "INFO") + airbyte_level = self.level_mapping.get(record.levelno, Level.INFO)
80-96: Consider preserving structured extras and standardizing serializationTwo small improvements here to better align with the PR objective of structured request/response data and to keep serialization consistent:
- Today, extract_extra_args_from_record() stringifies extras, so the "data" object under local DEBUG is actually strings (e.g., dicts become single-quoted strings), which defeats structured data. Would you be open to making extras JSON-safe but still structured (fallback to str only when necessary)?
- Also, we mix json and orjson; can we standardize on orjson for speed and consistency?
Proposed tweaks in this block (switch to orjson), wdyt?
- message = f"{filter_secrets(record.getMessage())} ///\nExtra logs: {filter_secrets(json.dumps(extras))}" + extras_json = orjson.dumps(extras).decode() + message = f"{filter_secrets(record.getMessage())} ///\nExtra logs: {filter_secrets(extras_json)}" ... - debug_dict = {"type": "DEBUG", "message": record.getMessage(), "data": extras} - return filter_secrets(json.dumps(debug_dict)) + debug_dict = {"type": "DEBUG", "message": record.getMessage(), "data": extras} + return filter_secrets(orjson.dumps(debug_dict).decode())And to actually keep extras structured, consider updating extract_extra_args_from_record() (outside this hunk) to only stringify values that aren’t JSON-serializable:
@staticmethod def extract_extra_args_from_record(record: logging.LogRecord) -> Mapping[str, Any]: default_attrs = logging.LogRecord("", 0, "", 0, None, None, None).__dict__.keys() extra_keys = set(record.__dict__.keys()) - default_attrs def _json_safe(v: Any) -> Any: try: orjson.dumps(v) return v except TypeError: return str(v) return {k: _json_safe(getattr(record, k)) for k in extra_keys if hasattr(record, k)}This should keep request/response bodies and headers as genuine objects in local DEBUG while still being robust. wdyt?
80-96: Guardrails for very large extras payloads?HTTP bodies/headers can be big. Do you want to consider truncating extras (e.g., cap body length) in platform mode to keep single LOG messages manageable, or at least put a size cap and note truncation in the message? Happy to propose a small helper if helpful. wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
airbyte_cdk/logger.py(3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
airbyte_cdk/logger.py (2)
airbyte_cdk/utils/airbyte_secrets_utils.py (1)
filter_secrets(73-80)airbyte_cdk/models/airbyte_protocol.py (1)
AirbyteMessage(79-88)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-intercom
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-shopify
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/logger.py (3)
7-7: Import looks good and scopedos is used by is_platform_debug_log_enabled; no concerns here.
43-45: Nice: this resolves the mypy error noted earlierThe explicit -> bool return type should satisfy the linter and downstream call sites.
75-103: Verify entrypoint debug vs. LOG_LEVEL precedence
- In airbyte_cdk/entrypoint.py (lines 161–164), debug is enabled when
(parsed_args.debug) OR is_platform_debug_log_enabled().is_platform_debug_log_enabled()returns true only ifLOG_LEVELenv var =="debug".- As written, CLI
--debugstill enables debug even ifLOG_LEVEL!="debug", and vice versa.Could you confirm this OR-based logic matches the contract that
LOG_LEVELshould “take priority” when both are present? For example, if the intent is for the env var to override CLI in both directions, we may need to checkLOG_LEVELfirst and ignoreparsed_args.debugwhen it’s set. wdyt?
What
We want debug logs in the form of AirbyteMessages to be logged when the flag
LOG_LEVELisdebug.How
Identify if the debug log has been enable by the environment variable (i.e. the platform) or by the arg
--debug. If both, we prioritize the platform logs. Debug logs locally looks like this:In the platform, they will look like this:
More context as to why we implemented this solution
We want those logs to be human readable as much as possible. Right now, the logs are printed as this:
The flow of the developer is usually to format the DEBUG message, copy/paste the body, decode it and read it. The Airbyte protocol equivalent would be this:
However even though it is allowed, I'm not sure what the platform does with the
extraseven though additional properties are allowed. Therefore, it'll lead to a solution that the developer will have to find where in the log message the body is and to decode it twice which adds a lot of work to be done on any request that needs to be read. The alternatives I see are:extrapart and put everything in a single string (something likeReceiving response. Status: %s, headers: %s, body: %s). It'll degrade the DX compared to what we do locally because between formatting the DEBUG message and copy/pasting the body, the dev will have to locate the body within the string. We can add a sequence of character before and after that we expect will not be present in must HTTP bodies to ease that but it is still not ideal.extraand in themessagewhich allow to cover both cases but can add a lot in terms of size. This might be dangerous in terms of the JSON limit there was in jackson when the platform deserialize (I don't know if it is still a problem)--debugparam, we print this through theextrasdictionarySummary by CodeRabbit