Skip to content

Conversation

darynaishchenko
Copy link
Contributor

@darynaishchenko darynaishchenko commented Oct 2, 2025

Summary by CodeRabbit

  • New Features
    • Built-in upload support for file-based sources with a 1.5 GB per-file size limit and clear error messaging when exceeded.
    • Enhanced transfer logging showing file sizes, paths, and transfer duration for improved observability.
  • Bug Fixes
    • Prevents oversized file transfers that could cause failures or timeouts by enforcing size limits early.
  • Tests
    • Expanded test coverage for upload flows and size-limit scenarios to ensure reliability.

@github-actions github-actions bot added the chore label Oct 2, 2025
Copy link

github-actions bot commented Oct 2, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/refactor-upload-method#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/refactor-upload-method

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

@darynaishchenko
Copy link
Contributor Author

darynaishchenko commented Oct 2, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Copy link
Contributor

coderabbitai bot commented Oct 2, 2025

📝 Walkthrough

Walkthrough

Implements a concrete upload flow in AbstractFileBasedStreamReader with a 1.5GB file-size guard, introduces an UploadableRemoteFile interface with download/size semantics, updates imports and type hints, removes the abstract file_size method, and adds unit tests exercising upload and size-limit behavior.

Changes

Cohort / File(s) Summary of changes
Stream reader upload implementation
airbyte_cdk/sources/file_based/file_based_stream_reader.py
Converts AbstractFileBasedStreamReader.upload to a concrete method accepting UploadableRemoteFile, adds FILE_SIZE_LIMIT = 1_500_000_000, enforces file-size guard using FileSizeLimitError with FailureType.config_error, computes transfer paths, logs and times download, downloads file to local directory, returns FileRecordData and AirbyteRecordMessageFileReference, removes abstract file_size method, and updates imports/types.
Remote file interface for uploads
airbyte_cdk/sources/file_based/remote_file.py
Adds UploadableRemoteFile ABC (extends RemoteFile) with id/created_at/updated_at, abstract size property and download_to_local_directory method, and helpers source_file_relative_path and file_uri_for_logging; imports ABC/abstractmethod.
File-transfer type hint update
airbyte_cdk/sources/file_based/file_types/file_transfer.py
Changes upload method parameter type from RemoteFile to UploadableRemoteFile and updates the corresponding import.
Unit tests for upload and limits
unit_tests/sources/file_based/test_file_based_stream_reader.py
Adds TestStreamReaderWithDefaultUpload test subclass and test_upload_with_file_transfer_reader() to exercise upload flow using UploadableRemoteFile, validates size-limit behavior (raises FileSizeLimitError), and introduces necessary test imports and scaffolding.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Reader as AbstractFileBasedStreamReader
  participant File as UploadableRemoteFile
  participant Local as LocalFS
  participant Log as Logger

  Caller->>Reader: upload(file, local_directory, logger)
  Reader->>File: size (property)
  alt size > FILE_SIZE_LIMIT
    Reader-->>Caller: raise FileSizeLimitError (FailureType.config_error)
  else size within limit
    Reader->>Reader: _get_file_transfer_paths(file)
    Reader->>Log: log start (file_uri_for_logging, size)
    Reader->>File: download_to_local_directory(local_path)
    File->>Local: write bytes
    Reader->>Log: log completion (duration)
    Reader-->>Caller: return (FileRecordData, AirbyteRecordMessageFileReference)
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement

Suggested reviewers

  • maxi297

Would you like adding a negative test asserting the exact FailureType on FileSizeLimitError and a boundary test at exactly 1_500_000_000 bytes, wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly summarizes the main changes by noting the addition of UploadableRemoteFile and the refactoring of the upload method in the file-based CDK, matching the pull request’s content.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch daryna/file-based/refactor-upload-method

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1)

22-26: Consider using datetime instead of str for timestamps.

The file_created_at property returns a str, but timestamps are typically better represented as datetime objects for type safety and easier manipulation. The same applies to file_updated_at (lines 30-34). This would align with how RemoteFile.last_modified is typed as datetime. Wdyt about using datetime here?

airbyte_cdk/sources/file_based/file_based_stream_reader.py (2)

193-194: Remove commented-out code.

These commented lines duplicate the guard logic now in __init__. They should be removed to keep the code clean, wdyt?

Apply this diff:

-        # if self.file_transfer_reader_class is None and self.upload.__func__ == AbstractFileBasedStreamReader.upload:
-        #     raise NotImplementedError("One of file_transfer_reader_class or upload method must be defined to support file transfer.")
-

213-215: Consider extracting the size formatting logic.

The size formatting appears here and potentially elsewhere. Extracting it to a helper function could improve maintainability, wdyt?

Something like:

def _format_file_size(size_bytes: int) -> str:
    mb = size_bytes / (1024 * 1024)
    gb = mb / 1024
    return f"{mb:,.2f} MB ({gb:.2f} GB)"

Then use it as:

logger.info(
    f"Starting to download the file {file_transfer.file_uri_for_logging} with size: {self._format_file_size(file_size)}"
)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ab013d and d5941c3.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (3)
unit_tests/sources/file_based/file_types/test_excel_parser.py (1)
  • remote_file (46-47)
airbyte_cdk/sources/file_based/remote_file.py (1)
  • RemoteFile (11-18)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
  • file_size (131-138)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (6)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (8)
  • AbstractFileBasedFileTransferReader (6-64)
  • file_size (38-42)
  • source_file_relative_path (53-57)
  • file_uri_for_logging (60-64)
  • download_to_local_directory (45-49)
  • file_id (14-18)
  • file_created_at (22-26)
  • file_updated_at (30-34)
airbyte_cdk/sources/file_based/file_record_data.py (1)
  • FileRecordData (11-23)
airbyte_cdk/sources/file_based/remote_file.py (1)
  • RemoteFile (11-18)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
  • FailureType (749-752)
airbyte_cdk/sources/file_based/exceptions.py (1)
  • FileSizeLimitError (158-159)
airbyte_cdk/sources/file_based/file_types/file_transfer.py (1)
  • upload (18-30)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py

[error] 60-60: Function is missing a return type annotation [no-untyped-def]

airbyte_cdk/sources/file_based/file_based_stream_reader.py

[error] 49-49: "Callable[[RemoteFile, str, Logger], tuple[FileRecordData, Any]]" has no attribute "func" [attr-defined]


[error] 196-196: "AbstractFileBasedFileTransferReader" not callable [operator]


[error] 196-196: "None" not callable [misc]

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-shopify
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

225-240: LGTM on the FileRecordData and file reference construction!

The logic correctly assembles the metadata from the file transfer object and constructs both FileRecordData and AirbyteRecordMessageFileReference with all the necessary fields. Nice work on maintaining the complete audit trail!

Copy link

github-actions bot commented Oct 2, 2025

PyTest Results (Fast)

3 779 tests  +1   3 767 ✅ +1   6m 23s ⏱️ -11s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 7a0f962. ± Comparison against base commit 7ab013d.

♻️ This comment has been updated with latest results.

Copy link

github-actions bot commented Oct 2, 2025

PyTest Results (Full)

3 782 tests  +1   3 770 ✅ +1   11m 7s ⏱️ +3s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 7a0f962. ± Comparison against base commit 7ab013d.

♻️ This comment has been updated with latest results.

@darynaishchenko
Copy link
Contributor Author

darynaishchenko commented Oct 2, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

176-181: Docstring return type is outdated.

The method returns a tuple (FileRecordData, AirbyteRecordMessageFileReference) but the docstring documents only the file reference. Update to avoid confusion, wdyt?

-           Returns:
-               AirbyteRecordMessageFileReference: A file reference object containing:
-                   - staging_file_url (str): The absolute path to the referenced file in the staging area.
-                   - file_size_bytes (int): The size of the referenced file in bytes.
-                   - source_file_relative_path (str): The relative path to the referenced file in source.
+           Returns:
+               Tuple[FileRecordData, AirbyteRecordMessageFileReference]:
+                   - FileRecordData:
+                       - folder (str): The folder of the referenced file.
+                       - file_name (str): The name of the referenced file.
+                       - bytes (int): Size in bytes.
+                       - id (Optional[str]): File identifier.
+                       - mime_type (Optional[str]): MIME type.
+                       - created_at (Optional[str]): Creation timestamp (string).
+                       - updated_at (Optional[str]): Update timestamp (string).
+                       - source_uri (str): Original source URI.
+                   - AirbyteRecordMessageFileReference:
+                       - staging_file_url (str): Absolute path in staging.
+                       - source_file_relative_path (str): Source-relative path.
+                       - file_size_bytes (int): Size in bytes.
🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1)

7-7: Consider ClassVar and unit consistency for FILE_SIZE_LIMIT.

Would you like to:

  • Type it as a ClassVar[int] for clarity with type-checkers?
  • Confirm whether limits/logs use decimal GB or binary GiB and align across messages to avoid confusion, wdyt?

Apply:

+from typing import ClassVar
@@
-class AbstractFileBasedFileTransferReader(ABC):
-    FILE_SIZE_LIMIT = 1_500_000_000
+class AbstractFileBasedFileTransferReader(ABC):
+    FILE_SIZE_LIMIT: ClassVar[int] = 1_500_000_000
airbyte_cdk/sources/file_based/file_based_stream_reader.py (2)

190-194: Unify size units with logs (GB vs GiB) and stable formatting.

The error uses decimal GB (1e9) while logs use 1024-based MB/GB. Would you want to standardize on binary units and label as MiB/GiB for consistency, plus format to a fixed precision, wdyt?

-            message = f"File size exceeds the {file_transfer.FILE_SIZE_LIMIT / 1e+9} GB limit."
+            message = f"File size exceeds the {file_transfer.FILE_SIZE_LIMIT / (1024 ** 3):.2f} GiB limit."

204-214: Refactor logging to lazy formatting; also likely fixes Ruff formatting failure.

These f-strings are long and pre-format eagerly. Would you switch to logger’s lazy % formatting, shorten lines, and align to MiB/GiB units, wdyt?

-        logger.info(
-            f"Starting to download the file {file_transfer.file_uri_for_logging} with size: {file_size / (1024 * 1024):,.2f} MB ({file_size / (1024 * 1024 * 1024):.2f} GB)"
-        )
+        logger.info(
+            "Starting to download the file %s with size: %.2f MiB (%.2f GiB)",
+            file_transfer.file_uri_for_logging,
+            file_size / (1024 ** 2),
+            file_size / (1024 ** 3),
+        )
@@
-        logger.info(
-            f"Finished downloading the file {file_transfer.file_uri_for_logging} and saved to {local_file_path} in {write_duration:,.2f} seconds."
-        )
+        logger.info(
+            "Finished downloading the file %s and saved to %s in %.2f seconds.",
+            file_transfer.file_uri_for_logging,
+            local_file_path,
+            write_duration,
+        )

Please also run the project formatter (e.g., poetry run ruff format .) to ensure the pipeline’s “would be reformatted” check passes.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 997c451 and bee6d01.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (2)
unit_tests/sources/file_based/file_types/test_excel_parser.py (1)
  • remote_file (46-47)
airbyte_cdk/sources/file_based/remote_file.py (1)
  • RemoteFile (11-18)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (6)
airbyte_cdk/sources/file_based/exceptions.py (1)
  • FileSizeLimitError (158-159)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (8)
  • AbstractFileBasedFileTransferReader (6-64)
  • file_size (38-42)
  • source_file_relative_path (53-57)
  • file_uri_for_logging (60-64)
  • download_to_local_directory (45-49)
  • file_id (14-18)
  • file_created_at (22-26)
  • file_updated_at (30-34)
airbyte_cdk/sources/file_based/file_types/file_transfer.py (1)
  • upload (18-30)
unit_tests/sources/file_based/in_memory_files_source.py (2)
  • upload (143-146)
  • file_size (140-141)
unit_tests/sources/file_based/test_file_based_stream_reader.py (2)
  • upload (87-90)
  • file_size (84-85)
airbyte_cdk/sources/file_based/file_record_data.py (1)
  • FileRecordData (11-23)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_stream_reader.py

[error] 188-188: Ruff format check failed. 1 file would be reformatted by 'poetry run ruff format --diff .'

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-shopify
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (4)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1)

59-64: Nice: return type annotation added.

The property type hint resolves the earlier linter error and matches RemoteFile.uri’s str type. LGTM.

airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)

46-52: Good runtime guard for configuration/override.

Comparing type(self).upload to the base method avoids the func mypy issue. LGTM.


159-161: Correct return annotation for class property.

Returning Type[AbstractFileBasedFileTransferReader] | None is the right contract. LGTM.


182-186: Explicit None check before instantiation looks good.

This prevents calling None and clarifies the default method contract. LGTM.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
unit_tests/sources/file_based/test_file_based_stream_reader.py (2)

72-99: Test stub looks good, but consider delegating file_uri_for_logging to the base class?

The test class correctly implements all abstract members. However, overriding file_uri_for_logging to return a hardcoded string might cause confusion if logs are checked. The base class already provides a sensible default (self.remote_file.uri). Would it make sense to remove the override and rely on the base implementation, wdyt?

If you decide to remove the override, apply this diff:

-    @property
-    def file_uri_for_logging(self) -> str:
-        return "logging/url"
-

535-541: Strengthen test assertions for better coverage?

The test only checks that file_record_data and file_reference are truthy. Could you strengthen these assertions to verify specific fields or types? For example, you could assert that file_reference.file_size_bytes == 200 and file_record_data.id == "test_file_id", wdyt?

Apply this diff to strengthen the assertions:

     file_record_data, file_reference = stream_reader.upload(remote_file, "test_directory", logger)
-    assert file_record_data
-    assert file_reference
+    assert file_record_data.id == "test_file_id"
+    assert file_record_data.bytes == 200
+    assert file_reference.file_size_bytes == 200
+    assert file_reference.source_file_relative_path == "source/path"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e1b850 and 3f7d458.

📒 Files selected for processing (1)
  • unit_tests/sources/file_based/test_file_based_stream_reader.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (4)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (4)
  • config (55-56)
  • config (60-70)
  • file_transfer_reader_class (160-161)
  • upload (163-231)
airbyte_cdk/sources/file_based/exceptions.py (1)
  • FileSizeLimitError (158-159)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1)
  • AbstractFileBasedFileTransferReader (6-64)
airbyte_cdk/sources/file_based/remote_file.py (1)
  • RemoteFile (11-18)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-shopify
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (3)
unit_tests/sources/file_based/test_file_based_stream_reader.py (3)

10-10: LGTM!

The new imports are necessary for the test scaffolding and align with the file-transfer reader pattern introduced in the PR.

Also applies to: 16-19


101-133: LGTM!

The test stream reader correctly implements all abstract members and sets file_transfer_reader_class to enable the file-transfer-backed upload path. The __test__ = False marker correctly prevents pytest from treating this as a test class.


531-549: Fix missing dependency and initialize config

  • Could you add dunamai to the project dependencies (or adjust pytest warning filters) to resolve the ModuleNotFoundError in airbyte_cdk/__init__.py?
  • Could you initialize stream_reader.config = TestSpec(**{"streams": []}) before invoking upload in test_upload_with_file_transfer_reader?

wdyt?

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, this is cool! Just a last concern on the design and I think we're there after that

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (1)

524-528: Remove the duplicate size limit test?

Lines 525-526 and 527-528 are identical assertions testing the same FileSizeLimitError behavior. Could you consolidate these into a single assertion, wdyt?

Apply this diff to remove the duplication:

     blob.size = 2_500_000_000
     with pytest.raises(FileSizeLimitError):
         stream_reader.upload(uploadable_remote_file, "test_directory", logger)
-    with pytest.raises(FileSizeLimitError):
-        stream_reader.upload(uploadable_remote_file, "test_directory", logger)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1be0b73 and b8c9a2c.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (5 hunks)
  • airbyte_cdk/sources/file_based/remote_file.py (2 hunks)
  • unit_tests/sources/file_based/test_file_based_stream_reader.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
airbyte_cdk/sources/file_based/remote_file.py (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (2)
  • size (504-505)
  • download_to_local_directory (507-508)
unit_tests/sources/file_based/test_file_based_stream_reader.py (3)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (6)
  • config (46-47)
  • config (51-61)
  • AbstractFileBasedStreamReader (34-247)
  • get_matching_files (79-99)
  • open_file (64-76)
  • upload (150-215)
airbyte_cdk/sources/file_based/exceptions.py (1)
  • FileSizeLimitError (158-159)
airbyte_cdk/sources/file_based/remote_file.py (4)
  • RemoteFile (11-18)
  • UploadableRemoteFile (21-57)
  • size (32-36)
  • download_to_local_directory (39-43)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
airbyte_cdk/sources/file_based/exceptions.py (1)
  • FileSizeLimitError (158-159)
airbyte_cdk/sources/file_based/file_record_data.py (1)
  • FileRecordData (11-23)
airbyte_cdk/sources/file_based/remote_file.py (6)
  • RemoteFile (11-18)
  • UploadableRemoteFile (21-57)
  • size (32-36)
  • source_file_relative_path (46-50)
  • file_uri_for_logging (53-57)
  • download_to_local_directory (39-43)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/remote_file.py

[error] 4-4: I001 Import block is unsorted or un-formatted. Found 1 error. Run 'poetry run ruff check --fix .' to fix.

airbyte_cdk/sources/file_based/file_based_stream_reader.py

[error] 206-206: Command: poetry run mypy --config-file mypy.ini airbyte_cdk. Argument 'created_at' to 'FileRecordData' has incompatible type 'datetime | None'; expected 'str | None' [arg-type]


[error] 207-207: Command: poetry run mypy --config-file mypy.ini airbyte_cdk. Argument 'updated_at' to 'FileRecordData' has incompatible type 'datetime | None'; expected 'str | None' [arg-type]

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (5)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)

6-6: LGTM!

The new imports and FILE_SIZE_LIMIT constant are well-chosen. The 1.5GB limit provides a reasonable balance for file transfer operations.

Also applies to: 12-12, 14-14, 24-24, 26-26, 40-40


150-178: LGTM!

The type guard and size validation logic are solid. The error message correctly shows the 1.5 GB limit dynamically, and using FailureType.config_error is appropriate here.


180-198: LGTM!

The download logic is well-instrumented with detailed logging and timing information. This will be helpful for debugging and monitoring file transfer operations.

unit_tests/sources/file_based/test_file_based_stream_reader.py (2)

10-10: LGTM!

The new imports are appropriate for the test coverage being added.

Also applies to: 16-16, 18-18


69-99: LGTM!

The test helper class correctly implements all required abstract methods with minimal stubs, allowing the default upload() implementation to be tested. The __test__ = False flag is a nice touch to prevent pytest confusion.

@darynaishchenko
Copy link
Contributor Author

darynaishchenko commented Oct 3, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/remote_file.py (4)

21-24: Minor formatting nit in the docstring.

There's a missing space before the opening parenthesis in "uploading(file transferring)". Should it be "uploading (file transferring)" or perhaps just "uploading or file transferring", wdyt?

Apply this diff:

-    A file in a file-based stream that supports uploading(file transferring).
+    A file in a file-based stream that supports uploading (file transferring).

26-28: Consider using datetime for timestamp fields.

The created_at and updated_at fields are typed as Optional[str], but the parent class uses datetime for last_modified. Using datetime objects for timestamps provides better type safety and easier manipulation. Is there a specific reason these need to be strings (e.g., varying formats from different sources), wdyt?

If datetime is appropriate, apply this diff:

+from datetime import datetime
+
 class UploadableRemoteFile(RemoteFile, ABC):
     """
     A file in a file-based stream that supports uploading(file transferring).
     """
 
     id: Optional[str] = None
-    created_at: Optional[str] = None
-    updated_at: Optional[str] = None
+    created_at: Optional[datetime] = None
+    updated_at: Optional[datetime] = None

38-43: Consider documenting error handling behavior.

The download_to_local_directory method could encounter various errors (network failures, disk I/O issues, permission problems). Would it be helpful to add a note in the docstring about expected exceptions or error handling behavior that implementers should follow, wdyt?

For example:

     @abstractmethod
     def download_to_local_directory(self, local_file_path: str) -> None:
         """
         Download the file from remote source to local storage.
+        
+        Args:
+            local_file_path: The local path where the file should be saved.
+        
+        Raises:
+            Should raise appropriate exceptions for network, I/O, or permission errors.
         """
         ...

45-57: Both properties return the same uri value.

The source_file_relative_path and file_uri_for_logging properties both return self.uri. While they convey different semantic intents, having multiple accessors for the same value might create confusion about which to use when. Is there a plan for these to diverge in behavior, or could we simplify by using just one property, wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8c9a2c and eb1c883.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/remote_file.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/file_based/remote_file.py (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (2)
  • size (504-505)
  • download_to_local_directory (507-508)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-shopify
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very clean, I like it very much. Thanks for tackling those comments and this issue. This will definitely make the maintenance of the file upload easier

@darynaishchenko darynaishchenko changed the title chore(file-based-cdk): add AbstractFileBasedFileTransferReader, refactor upload method chore(file-based-cdk): add UploadableRemoteFile, refactor upload method Oct 3, 2025
@darynaishchenko darynaishchenko merged commit c67570b into main Oct 6, 2025
28 of 30 checks passed
@darynaishchenko darynaishchenko deleted the daryna/file-based/refactor-upload-method branch October 6, 2025 09:41
Comment on lines +21 to +51
class UploadableRemoteFile(RemoteFile, ABC):
"""
A file in a file-based stream that supports uploading(file transferring).
"""

id: Optional[str] = None
created_at: Optional[str] = None
updated_at: Optional[str] = None

@property
@abstractmethod
def size(self) -> int:
"""
Returns the file size in bytes.
"""
...

@abstractmethod
def download_to_local_directory(self, local_file_path: str) -> None:
"""
Download the file from remote source to local storage.
"""
...

@property
def source_file_relative_path(self) -> str:
"""
Returns the relative path of the source file.
"""
return self.uri

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants