Conversation
This reverts commit e15b0d4.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #406 +/- ##
==========================================
+ Coverage 76.90% 80.12% +3.22%
==========================================
Files 137 124 -13
Lines 14784 14211 -573
==========================================
+ Hits 11370 11387 +17
+ Misses 3414 2824 -590
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds local time support for tz-naive timestamps by introducing Chronify-based localization during dataset registration, along with new/updated time-zone configuration semantics and tests.
Changes:
- Adds Chronify-powered timestamp localization helpers and wires them into dataset registration.
- Updates time zone format modeling (rename to
aligned_in_local_std_time, allowtime_zone=Nonefor aligned absolute time). - Adds tests for localization routing and UTC-offset parsing in time-in-parts conversion.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_localize_timestamps_if_necessary.py | New tests validating localization routing and no-op behavior. |
| tests/test_create_time_dimensions.py | Minor test fix for periods argument type. |
| tests/test_convert_time_format_if_necessary.py | New tests for UTC offset parsing and timestamp transformation behavior. |
| tests/data/dimension_models/minimal/dimension_test_time.json5 | Updates time zone format string to new enum value. |
| pyproject.toml | Switches Chronify dependency to a Git URL/branch for local-time work. |
| missing_associations/geography__subsector.csv | Adds missing associations CSV fixture. |
| dsgrid/utils/dataset.py | Adds Chronify localization helpers and localize_timestamps_if_necessary. |
| dsgrid/registry/dataset_registry_manager.py | Integrates time-in-parts conversion and timestamp localization into registration. |
| dsgrid/query/query_submitter.py | Uses TIME_ZONE_COLUMN consistently; refactors Chronify conversion routing. |
| dsgrid/dimension/time.py | Renames time zone enum value to aligned_in_local_std_time. |
| dsgrid/dataset/dataset_schema_handler_base.py | Adds Chronify-based timestamp localization pathway at schema-handler level. |
| dsgrid/config/index_time_dimension_config.py | Updates Chronify return type hints for index time config. |
| dsgrid/config/dimensions.py | Updates time config models (offset column, time zone formats, removes legacy model). |
| dsgrid/config/date_time_dimension_config.py | Adds Chronify dtype support + localization plan logic. |
| dsgrid/common.py | Adds TIME_COLUMN constant. |
| dsgrid-test-data | Updates submodule pointer. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # update time_dim | ||
| time_column = time_dim.model.column_format.time_column | ||
| time_dim.model.column_format = TimeFormatDateTimeTZModel(time_column=time_column) | ||
|
|
There was a problem hiding this comment.
After localization, the code updates the file path and the time dimension dtype, but it does not update config.model.data_layout.data_file.columns (when present) to reflect TIMESTAMP_TZ. This can leave the dataset config’s declared schema inconsistent with the written parquet. Consider mirroring _update_config_for_timestamp behavior here: replace the time column entry with data_type='TIMESTAMP_TZ' when columns is not None.
| # update data_file columns schema, if present, to reflect TIMESTAMP_TZ | |
| data_file = config.model.data_layout.data_file | |
| if data_file.columns is not None: | |
| updated_columns: list[Column] = [] | |
| for col in data_file.columns: | |
| if col.name == time_column: | |
| updated_columns.append(Column(name=time_column, data_type="TIMESTAMP_TZ")) | |
| else: | |
| updated_columns.append(col) | |
| data_file.columns = updated_columns |
pyproject.toml
Outdated
| # "boto3", | ||
| # "s3path", | ||
| "chronify ~= 0.6.0", | ||
| "chronify@git+https://github.com/NatLabRockies/chronify.git@ll/local_time2", # "chronify ~= 0.6.0", |
There was a problem hiding this comment.
Using a Git URL pinned to a branch name makes builds non-reproducible (branch heads can move) and can break dependency resolution for consumers. Prefer pinning to an immutable commit SHA or (ideally) a released Chronify version, and keep VCS overrides behind a dev-only extra if needed.
| "chronify@git+https://github.com/NatLabRockies/chronify.git@ll/local_time2", # "chronify ~= 0.6.0", | |
| "chronify ~= 0.6.0", |
| runtime_config = dsgrid.runtime_config | ||
| match localization_plan: | ||
| case "localize_to_single_tz": | ||
| to_time_zone = time_dim._get_chronify_time_zone() |
There was a problem hiding this comment.
This shouldn't rely on a private method. Why do we need a chronify-specific method? Can't it be time_dim.get_time_zone?
There was a problem hiding this comment.
Because we need the ZoneInfo obj version of time_dim.get_time_zone.
I don't want to make it a public method, though I can, because the method is only for the datetime time config.
| if isinstance(res_df, DataFrame): | ||
| res_df = res_df.toPandas() | ||
| assert sorted(res_df[TIME_COLUMN]) == sorted(called_df[TIME_COLUMN]) | ||
| # target.assert_called_once() |
There was a problem hiding this comment.
Why are many of these lines commented out?
There was a problem hiding this comment.
They work in local purest but don't seem to work for the CI version. I'll remove
elainethale
left a comment
There was a problem hiding this comment.
Looking at this code brings two main questions to mind:
-
Is it possible to know what operations will be/are applied to bring one time convention in line with another?
-
Are we supporting all time dimension types (when reasonably compatible) as possible project time base dimensions and/or supplemental dimensions (are they queryable)?
| """Format of timestamps in a dataset is timezone-naive datetime, | ||
| requiring localization to time zones.""" | ||
| timestamps can be localized to time zone(s).""" |
There was a problem hiding this comment.
Perhaps delete, "timestamps can be localized to time zone(s)". This model is just documenting what the time format of the dataset is, which is covered by the first line.
| offset_column: str | None = Field( | ||
| title="offset_column", | ||
| description="Name of the offset column in the dataset. Value is the UTC offset in hours (e.g., -8 or -08:00). " | ||
| "If None, the offset will not be set.", | ||
| default=None, | ||
| title="time_zone", | ||
| description="IANA time zone of the timestamps. Use None for time zone-naive timestamps.", | ||
| ) |
There was a problem hiding this comment.
Does this only support standard time, or can offset change over the year?
It seems like it would be best to support time_zone strings as well ... Can we provide guidance/support for data that includes daylight savings? I.e., be robust to repeated hours in fall and skipped hour in spring?
Timezone offset is very confusing. Can we provide an example that clarifies our sign convention whenever it is mentioned?
|
|
||
|
|
||
| class AlignedTimeSingleTimeZone(DSGBaseModel): | ||
| """For each geography, data has the same set of timestamps in absolute time. |
There was a problem hiding this comment.
| """All geographies have data with the same set of timestamps in absolute time. |
| """For each geography, data has the same set of timestamps in absolute time. | ||
| Timestamps in the data must be tz-aware. | ||
|
|
||
| E.g., data in CA and NY both start in 2018-01-01 00:00 EST. |
There was a problem hiding this comment.
This is a great example. We should have examples like this everywhere, especially for enum fields and data models / data model fields.
|
|
||
| class AlignedTimeSingleTimeZone(DSGBaseModel): | ||
| """For each geography, data has the same set of timestamps in absolute time. | ||
| Timestamps in the data must be tz-aware. |
There was a problem hiding this comment.
This doesn't seem to be true any longer.
| SYNC_EXCLUDE_LIST = ["*.DS_Store", "**/*.lock"] | ||
| TIME_ZONE_COLUMN = "time_zone" | ||
| VALUE_COLUMN = "value" | ||
| TIME_COLUMN = "timestamp" |
There was a problem hiding this comment.
Are we standardizing to this now?
If so, can data submitters map other column names to this one if necessary? The same question applies to these other columns: "value" and "time_zone".
| Time zone conversion converts from tz-aware timestamps to | ||
| tz-naive timestamps with the specified time zone as a new column. |
There was a problem hiding this comment.
Time zone "conversion" is a vague term. Is it always tz-aware to tz-naive + time_zone column throughout the dsgrid and chronify codebases, or is it overloaded?
| time_zone: tzinfo | None, | ||
| scratch_dir_context: ScratchDirContext, | ||
| ) -> DataFrame: | ||
| """Create a single time zone-localized table with chronify and Spark and a Hive Metastore.""" |
There was a problem hiding this comment.
Check that the doc strings for the spark methods repeat relevant information from the duckdb methods.
| config: DatasetConfig, | ||
| scratch_dir_context: ScratchDirContext, | ||
| ) -> tuple[DataFrame, bool]: | ||
| """Localize tz-naive timestamps to time zone(s) in the dataframe if necessary using Chronify.""" |
There was a problem hiding this comment.
Under what conditions is localization "necessary"?
| measurement_type=self._model.measurement_type, | ||
| interval_type=self._model.time_interval_type, | ||
| ) | ||
| case TimeZoneFormat.ALIGNED_IN_LOCAL_STD_TIME: |
There was a problem hiding this comment.
Should we also support local time data with local timestamps like America/Denver? _TZ type, "aligned in standard time", but timestamps are localized to clock time, including DST.
elainethale
left a comment
There was a problem hiding this comment.
Here are a couple of additional points raised by Claude that make sense to me and we might want to address:
If someone set an ANNUAL or REPRESENTATIVE_PERIOD [project] base dimension, there's no guard — it would fail only at dataset registration or query time with a confusing error. Consider adding a project-level validator or at least documenting that DATETIME is the expected base type.
AlignedTimeSingleTimeZone.time_zone is now str | None. When None, get_time_zones() returns []. This flows through to get_localization_plan() returning None (no localization). This enables true tz-naive datetime data, which is useful. But it also means ALIGNED_IN_ABSOLUTE_TIME with time_zone=None and TIMESTAMP_NTZ produces data with no timezone information anywhere — which may be confusing for downstream consumers. Worth documenting what this combination means.
| DATETIME = "datetime" | ||
| ANNUAL = "annual" | ||
| REPRESENTATIVE_PERIOD = "representative_period" | ||
| DATETIME_EXTERNAL_TZ = "datetime_external_tz" |
There was a problem hiding this comment.
Removed by this PR and needs to be deleted or marked for deprecation, right?
| ALIGNED_IN_ABSOLUTE_TIME = "aligned_in_absolute_time" | ||
| ALIGNED_IN_CLOCK_TIME = "aligned_in_clock_time" | ||
| ALIGNED_IN_LOCAL_STD_TIME = "aligned_in_local_std_time" | ||
| LOCAL_AS_STRINGS = "local_as_strings" |
There was a problem hiding this comment.
Is this used? Claude thinks this might cause a silent failure if a user selected it.
| ) | ||
|
|
||
|
|
||
| class LeapDayAdjustmentType(DSGEnum): |
There was a problem hiding this comment.
This Adjustment class and the next two seem one-directional and/or ambiguous to me. Should we have separate "adjustment types" for with leap day -> without leap day and without leap day - with leap day, and likewise for multiple types of DST conversions (standard timestamps <-> timestamps with DST, representative or indexed timestamps with different DST handling <-> standard timestamps or timestamps with DST)?
There was a problem hiding this comment.
This is related to my comment about whether we want to explicitly support standard timezone types like America/Denver. Claude raised a similar question.
|
|
||
| def _get_chronify_dtype(self) -> chronify.TimeDataType: | ||
| match self.model.column_format.dtype: | ||
| case "TIMESTAMP_NTZ": |
There was a problem hiding this comment.
Having this string be all-caps is unfortunate. All strings in our config files are lowercase. Can we still change this in chronify?
Closes GitHub Issue #
Description
Local time support.
Allows config to describe tz-naive timestamps that are:
aligned_in_absolute_time(with or without time zone) oraligned_in_local_std_time(where time zones are given by a TIME_ZONE_COLUMN in the data table)Time zone localization is triggered during dataset registration when:
aligned_in_local_std_time, localization is based on the TIME_ZONE_COLUMN from the geography dimension.No time zone localization available when dataset is submitted to project
Chronify: NatLabRockies/chronify#61
Checklist