-
Notifications
You must be signed in to change notification settings - Fork 30
chore(file-based-cdk): add UploadableRemoteFile, refactor upload method #780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(file-based-cdk): add UploadableRemoteFile, refactor upload method #780
Conversation
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Testing This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/refactor-upload-method#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/refactor-upload-method Helpful ResourcesPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
/autofix
|
📝 WalkthroughWalkthroughImplements a concrete upload flow in AbstractFileBasedStreamReader with a 1.5GB file-size guard, introduces an UploadableRemoteFile interface with download/size semantics, updates imports and type hints, removes the abstract file_size method, and adds unit tests exercising upload and size-limit behavior. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Caller
participant Reader as AbstractFileBasedStreamReader
participant File as UploadableRemoteFile
participant Local as LocalFS
participant Log as Logger
Caller->>Reader: upload(file, local_directory, logger)
Reader->>File: size (property)
alt size > FILE_SIZE_LIMIT
Reader-->>Caller: raise FileSizeLimitError (FailureType.config_error)
else size within limit
Reader->>Reader: _get_file_transfer_paths(file)
Reader->>Log: log start (file_uri_for_logging, size)
Reader->>File: download_to_local_directory(local_path)
File->>Local: write bytes
Reader->>Log: log completion (duration)
Reader-->>Caller: return (FileRecordData, AirbyteRecordMessageFileReference)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Would you like adding a negative test asserting the exact Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1)
22-26
: Consider usingdatetime
instead ofstr
for timestamps.The
file_created_at
property returns astr
, but timestamps are typically better represented asdatetime
objects for type safety and easier manipulation. The same applies tofile_updated_at
(lines 30-34). This would align with howRemoteFile.last_modified
is typed asdatetime
. Wdyt about usingdatetime
here?airbyte_cdk/sources/file_based/file_based_stream_reader.py (2)
193-194
: Remove commented-out code.These commented lines duplicate the guard logic now in
__init__
. They should be removed to keep the code clean, wdyt?Apply this diff:
- # if self.file_transfer_reader_class is None and self.upload.__func__ == AbstractFileBasedStreamReader.upload: - # raise NotImplementedError("One of file_transfer_reader_class or upload method must be defined to support file transfer.") -
213-215
: Consider extracting the size formatting logic.The size formatting appears here and potentially elsewhere. Extracting it to a helper function could improve maintainability, wdyt?
Something like:
def _format_file_size(size_bytes: int) -> str: mb = size_bytes / (1024 * 1024) gb = mb / 1024 return f"{mb:,.2f} MB ({gb:.2f} GB)"Then use it as:
logger.info( f"Starting to download the file {file_transfer.file_uri_for_logging} with size: {self._format_file_size(file_size)}" )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py
(1 hunks)airbyte_cdk/sources/file_based/file_based_stream_reader.py
(5 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (3)
unit_tests/sources/file_based/file_types/test_excel_parser.py (1)
remote_file
(46-47)airbyte_cdk/sources/file_based/remote_file.py (1)
RemoteFile
(11-18)airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
file_size
(131-138)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (6)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (8)
AbstractFileBasedFileTransferReader
(6-64)file_size
(38-42)source_file_relative_path
(53-57)file_uri_for_logging
(60-64)download_to_local_directory
(45-49)file_id
(14-18)file_created_at
(22-26)file_updated_at
(30-34)airbyte_cdk/sources/file_based/file_record_data.py (1)
FileRecordData
(11-23)airbyte_cdk/sources/file_based/remote_file.py (1)
RemoteFile
(11-18)airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
FailureType
(749-752)airbyte_cdk/sources/file_based/exceptions.py (1)
FileSizeLimitError
(158-159)airbyte_cdk/sources/file_based/file_types/file_transfer.py (1)
upload
(18-30)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py
[error] 60-60: Function is missing a return type annotation [no-untyped-def]
airbyte_cdk/sources/file_based/file_based_stream_reader.py
[error] 49-49: "Callable[[RemoteFile, str, Logger], tuple[FileRecordData, Any]]" has no attribute "func" [attr-defined]
[error] 196-196: "AbstractFileBasedFileTransferReader" not callable [operator]
[error] 196-196: "None" not callable [misc]
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: Check: source-google-drive
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-intercom
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-shopify
- GitHub Check: SDM Docker Image Build
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
225-240
: LGTM on the FileRecordData and file reference construction!The logic correctly assembles the metadata from the file transfer object and constructs both
FileRecordData
andAirbyteRecordMessageFileReference
with all the necessary fields. Nice work on maintaining the complete audit trail!
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py
Outdated
Show resolved
Hide resolved
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
176-181
: Docstring return type is outdated.The method returns a tuple (FileRecordData, AirbyteRecordMessageFileReference) but the docstring documents only the file reference. Update to avoid confusion, wdyt?
- Returns: - AirbyteRecordMessageFileReference: A file reference object containing: - - staging_file_url (str): The absolute path to the referenced file in the staging area. - - file_size_bytes (int): The size of the referenced file in bytes. - - source_file_relative_path (str): The relative path to the referenced file in source. + Returns: + Tuple[FileRecordData, AirbyteRecordMessageFileReference]: + - FileRecordData: + - folder (str): The folder of the referenced file. + - file_name (str): The name of the referenced file. + - bytes (int): Size in bytes. + - id (Optional[str]): File identifier. + - mime_type (Optional[str]): MIME type. + - created_at (Optional[str]): Creation timestamp (string). + - updated_at (Optional[str]): Update timestamp (string). + - source_uri (str): Original source URI. + - AirbyteRecordMessageFileReference: + - staging_file_url (str): Absolute path in staging. + - source_file_relative_path (str): Source-relative path. + - file_size_bytes (int): Size in bytes.
🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1)
7-7
: Consider ClassVar and unit consistency for FILE_SIZE_LIMIT.Would you like to:
- Type it as a ClassVar[int] for clarity with type-checkers?
- Confirm whether limits/logs use decimal GB or binary GiB and align across messages to avoid confusion, wdyt?
Apply:
+from typing import ClassVar @@ -class AbstractFileBasedFileTransferReader(ABC): - FILE_SIZE_LIMIT = 1_500_000_000 +class AbstractFileBasedFileTransferReader(ABC): + FILE_SIZE_LIMIT: ClassVar[int] = 1_500_000_000airbyte_cdk/sources/file_based/file_based_stream_reader.py (2)
190-194
: Unify size units with logs (GB vs GiB) and stable formatting.The error uses decimal GB (1e9) while logs use 1024-based MB/GB. Would you want to standardize on binary units and label as MiB/GiB for consistency, plus format to a fixed precision, wdyt?
- message = f"File size exceeds the {file_transfer.FILE_SIZE_LIMIT / 1e+9} GB limit." + message = f"File size exceeds the {file_transfer.FILE_SIZE_LIMIT / (1024 ** 3):.2f} GiB limit."
204-214
: Refactor logging to lazy formatting; also likely fixes Ruff formatting failure.These f-strings are long and pre-format eagerly. Would you switch to logger’s lazy % formatting, shorten lines, and align to MiB/GiB units, wdyt?
- logger.info( - f"Starting to download the file {file_transfer.file_uri_for_logging} with size: {file_size / (1024 * 1024):,.2f} MB ({file_size / (1024 * 1024 * 1024):.2f} GB)" - ) + logger.info( + "Starting to download the file %s with size: %.2f MiB (%.2f GiB)", + file_transfer.file_uri_for_logging, + file_size / (1024 ** 2), + file_size / (1024 ** 3), + ) @@ - logger.info( - f"Finished downloading the file {file_transfer.file_uri_for_logging} and saved to {local_file_path} in {write_duration:,.2f} seconds." - ) + logger.info( + "Finished downloading the file %s and saved to %s in %.2f seconds.", + file_transfer.file_uri_for_logging, + local_file_path, + write_duration, + )Please also run the project formatter (e.g., poetry run ruff format .) to ensure the pipeline’s “would be reformatted” check passes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py
(1 hunks)airbyte_cdk/sources/file_based/file_based_stream_reader.py
(5 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (2)
unit_tests/sources/file_based/file_types/test_excel_parser.py (1)
remote_file
(46-47)airbyte_cdk/sources/file_based/remote_file.py (1)
RemoteFile
(11-18)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (6)
airbyte_cdk/sources/file_based/exceptions.py (1)
FileSizeLimitError
(158-159)airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (8)
AbstractFileBasedFileTransferReader
(6-64)file_size
(38-42)source_file_relative_path
(53-57)file_uri_for_logging
(60-64)download_to_local_directory
(45-49)file_id
(14-18)file_created_at
(22-26)file_updated_at
(30-34)airbyte_cdk/sources/file_based/file_types/file_transfer.py (1)
upload
(18-30)unit_tests/sources/file_based/in_memory_files_source.py (2)
upload
(143-146)file_size
(140-141)unit_tests/sources/file_based/test_file_based_stream_reader.py (2)
upload
(87-90)file_size
(84-85)airbyte_cdk/sources/file_based/file_record_data.py (1)
FileRecordData
(11-23)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_stream_reader.py
[error] 188-188: Ruff format check failed. 1 file would be reformatted by 'poetry run ruff format --diff .'
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-google-drive
- GitHub Check: Check: source-shopify
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (4)
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1)
59-64
: Nice: return type annotation added.The property type hint resolves the earlier linter error and matches RemoteFile.uri’s str type. LGTM.
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
46-52
: Good runtime guard for configuration/override.Comparing type(self).upload to the base method avoids the func mypy issue. LGTM.
159-161
: Correct return annotation for class property.Returning Type[AbstractFileBasedFileTransferReader] | None is the right contract. LGTM.
182-186
: Explicit None check before instantiation looks good.This prevents calling None and clarifies the default method contract. LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (2)
unit_tests/sources/file_based/test_file_based_stream_reader.py (2)
72-99
: Test stub looks good, but consider delegatingfile_uri_for_logging
to the base class?The test class correctly implements all abstract members. However, overriding
file_uri_for_logging
to return a hardcoded string might cause confusion if logs are checked. The base class already provides a sensible default (self.remote_file.uri
). Would it make sense to remove the override and rely on the base implementation, wdyt?If you decide to remove the override, apply this diff:
- @property - def file_uri_for_logging(self) -> str: - return "logging/url" -
535-541
: Strengthen test assertions for better coverage?The test only checks that
file_record_data
andfile_reference
are truthy. Could you strengthen these assertions to verify specific fields or types? For example, you could assert thatfile_reference.file_size_bytes == 200
andfile_record_data.id == "test_file_id"
, wdyt?Apply this diff to strengthen the assertions:
file_record_data, file_reference = stream_reader.upload(remote_file, "test_directory", logger) - assert file_record_data - assert file_reference + assert file_record_data.id == "test_file_id" + assert file_record_data.bytes == 200 + assert file_reference.file_size_bytes == 200 + assert file_reference.source_file_relative_path == "source/path"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py
(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (4)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (4)
config
(55-56)config
(60-70)file_transfer_reader_class
(160-161)upload
(163-231)airbyte_cdk/sources/file_based/exceptions.py (1)
FileSizeLimitError
(158-159)airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py (1)
AbstractFileBasedFileTransferReader
(6-64)airbyte_cdk/sources/file_based/remote_file.py (1)
RemoteFile
(11-18)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-google-drive
- GitHub Check: Check: source-intercom
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-shopify
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: Pytest (Fast)
🔇 Additional comments (3)
unit_tests/sources/file_based/test_file_based_stream_reader.py (3)
10-10
: LGTM!The new imports are necessary for the test scaffolding and align with the file-transfer reader pattern introduced in the PR.
Also applies to: 16-19
101-133
: LGTM!The test stream reader correctly implements all abstract members and sets
file_transfer_reader_class
to enable the file-transfer-backed upload path. The__test__ = False
marker correctly prevents pytest from treating this as a test class.
531-549
: Fix missing dependency and initialize config
- Could you add
dunamai
to the project dependencies (or adjust pytest warning filters) to resolve the ModuleNotFoundError inairbyte_cdk/__init__.py
?- Could you initialize
stream_reader.config = TestSpec(**{"streams": []})
before invokingupload
intest_upload_with_file_transfer_reader
?wdyt?
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, this is cool! Just a last concern on the design and I think we're there after that
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/file_based/file_based_file_transfer_reader.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (1)
524-528
: Remove the duplicate size limit test?Lines 525-526 and 527-528 are identical assertions testing the same
FileSizeLimitError
behavior. Could you consolidate these into a single assertion, wdyt?Apply this diff to remove the duplication:
blob.size = 2_500_000_000 with pytest.raises(FileSizeLimitError): stream_reader.upload(uploadable_remote_file, "test_directory", logger) - with pytest.raises(FileSizeLimitError): - stream_reader.upload(uploadable_remote_file, "test_directory", logger)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
airbyte_cdk/sources/file_based/file_based_stream_reader.py
(5 hunks)airbyte_cdk/sources/file_based/remote_file.py
(2 hunks)unit_tests/sources/file_based/test_file_based_stream_reader.py
(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
airbyte_cdk/sources/file_based/remote_file.py (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (2)
size
(504-505)download_to_local_directory
(507-508)
unit_tests/sources/file_based/test_file_based_stream_reader.py (3)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (6)
config
(46-47)config
(51-61)AbstractFileBasedStreamReader
(34-247)get_matching_files
(79-99)open_file
(64-76)upload
(150-215)airbyte_cdk/sources/file_based/exceptions.py (1)
FileSizeLimitError
(158-159)airbyte_cdk/sources/file_based/remote_file.py (4)
RemoteFile
(11-18)UploadableRemoteFile
(21-57)size
(32-36)download_to_local_directory
(39-43)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
airbyte_cdk/sources/file_based/exceptions.py (1)
FileSizeLimitError
(158-159)airbyte_cdk/sources/file_based/file_record_data.py (1)
FileRecordData
(11-23)airbyte_cdk/sources/file_based/remote_file.py (6)
RemoteFile
(11-18)UploadableRemoteFile
(21-57)size
(32-36)source_file_relative_path
(46-50)file_uri_for_logging
(53-57)download_to_local_directory
(39-43)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/remote_file.py
[error] 4-4: I001 Import block is unsorted or un-formatted. Found 1 error. Run 'poetry run ruff check --fix .' to fix.
airbyte_cdk/sources/file_based/file_based_stream_reader.py
[error] 206-206: Command: poetry run mypy --config-file mypy.ini airbyte_cdk. Argument 'created_at' to 'FileRecordData' has incompatible type 'datetime | None'; expected 'str | None' [arg-type]
[error] 207-207: Command: poetry run mypy --config-file mypy.ini airbyte_cdk. Argument 'updated_at' to 'FileRecordData' has incompatible type 'datetime | None'; expected 'str | None' [arg-type]
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-shopify
- GitHub Check: Check: source-google-drive
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-pokeapi
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: SDM Docker Image Build
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (5)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
6-6
: LGTM!The new imports and FILE_SIZE_LIMIT constant are well-chosen. The 1.5GB limit provides a reasonable balance for file transfer operations.
Also applies to: 12-12, 14-14, 24-24, 26-26, 40-40
150-178
: LGTM!The type guard and size validation logic are solid. The error message correctly shows the 1.5 GB limit dynamically, and using
FailureType.config_error
is appropriate here.
180-198
: LGTM!The download logic is well-instrumented with detailed logging and timing information. This will be helpful for debugging and monitoring file transfer operations.
unit_tests/sources/file_based/test_file_based_stream_reader.py (2)
10-10
: LGTM!The new imports are appropriate for the test coverage being added.
Also applies to: 16-16, 18-18
69-99
: LGTM!The test helper class correctly implements all required abstract methods with minimal stubs, allowing the default
upload()
implementation to be tested. The__test__ = False
flag is a nice touch to prevent pytest confusion.
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/remote_file.py (4)
21-24
: Minor formatting nit in the docstring.There's a missing space before the opening parenthesis in "uploading(file transferring)". Should it be "uploading (file transferring)" or perhaps just "uploading or file transferring", wdyt?
Apply this diff:
- A file in a file-based stream that supports uploading(file transferring). + A file in a file-based stream that supports uploading (file transferring).
26-28
: Consider usingdatetime
for timestamp fields.The
created_at
andupdated_at
fields are typed asOptional[str]
, but the parent class usesdatetime
forlast_modified
. Usingdatetime
objects for timestamps provides better type safety and easier manipulation. Is there a specific reason these need to be strings (e.g., varying formats from different sources), wdyt?If datetime is appropriate, apply this diff:
+from datetime import datetime + class UploadableRemoteFile(RemoteFile, ABC): """ A file in a file-based stream that supports uploading(file transferring). """ id: Optional[str] = None - created_at: Optional[str] = None - updated_at: Optional[str] = None + created_at: Optional[datetime] = None + updated_at: Optional[datetime] = None
38-43
: Consider documenting error handling behavior.The
download_to_local_directory
method could encounter various errors (network failures, disk I/O issues, permission problems). Would it be helpful to add a note in the docstring about expected exceptions or error handling behavior that implementers should follow, wdyt?For example:
@abstractmethod def download_to_local_directory(self, local_file_path: str) -> None: """ Download the file from remote source to local storage. + + Args: + local_file_path: The local path where the file should be saved. + + Raises: + Should raise appropriate exceptions for network, I/O, or permission errors. """ ...
45-57
: Both properties return the sameuri
value.The
source_file_relative_path
andfile_uri_for_logging
properties both returnself.uri
. While they convey different semantic intents, having multiple accessors for the same value might create confusion about which to use when. Is there a plan for these to diverge in behavior, or could we simplify by using just one property, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/file_based/remote_file.py
(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/file_based/remote_file.py (1)
unit_tests/sources/file_based/test_file_based_stream_reader.py (2)
size
(504-505)download_to_local_directory
(507-508)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-google-drive
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-shopify
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: SDM Docker Image Build
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's very clean, I like it very much. Thanks for tackling those comments and this issue. This will definitely make the maintenance of the file upload easier
class UploadableRemoteFile(RemoteFile, ABC): | ||
""" | ||
A file in a file-based stream that supports uploading(file transferring). | ||
""" | ||
|
||
id: Optional[str] = None | ||
created_at: Optional[str] = None | ||
updated_at: Optional[str] = None | ||
|
||
@property | ||
@abstractmethod | ||
def size(self) -> int: | ||
""" | ||
Returns the file size in bytes. | ||
""" | ||
... | ||
|
||
@abstractmethod | ||
def download_to_local_directory(self, local_file_path: str) -> None: | ||
""" | ||
Download the file from remote source to local storage. | ||
""" | ||
... | ||
|
||
@property | ||
def source_file_relative_path(self) -> str: | ||
""" | ||
Returns the relative path of the source file. | ||
""" | ||
return self.uri | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice @darynaishchenko
Summary by CodeRabbit