Skip to content

Ll/local time part2#406

Open
lixiliu wants to merge 29 commits intomainfrom
ll/local_time_part2
Open

Ll/local time part2#406
lixiliu wants to merge 29 commits intomainfrom
ll/local_time_part2

Conversation

@lixiliu
Copy link
Contributor

@lixiliu lixiliu commented Jan 23, 2026

Closes GitHub Issue #

Description

Local time support.

Allows config to describe tz-naive timestamps that are:

  • aligned_in_absolute_time(with or without time zone) or
  • aligned_in_local_std_time (where time zones are given by a TIME_ZONE_COLUMN in the data table)

Time zone localization is triggered during dataset registration when:

  • timestamps in data table are tz-naive (TIMESTAMP_NTZ) but time config has a time zone.
  • For aligned_in_local_std_time, localization is based on the TIME_ZONE_COLUMN from the geography dimension.

No time zone localization available when dataset is submitted to project

Chronify: NatLabRockies/chronify#61

Checklist

  • Tests exercising the new feature or bug fix
  • All tests pass
  • At least one code review approval
  • Consider transferring TODOs to GitHub

@codecov
Copy link

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 75.29412% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.12%. Comparing base (f3f5546) to head (17ced0b).
⚠️ Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
dsgrid/utils/dataset.py 61.70% 36 Missing ⚠️
dsgrid/config/dimensions.py 50.00% 9 Missing ⚠️
dsgrid/config/date_time_dimension_config.py 84.00% 8 Missing ⚠️
dsgrid/query/query_submitter.py 72.72% 6 Missing ⚠️
dsgrid/dimension/time.py 50.00% 3 Missing ⚠️
dsgrid/registry/dataset_registry_manager.py 98.43% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #406      +/-   ##
==========================================
+ Coverage   76.90%   80.12%   +3.22%     
==========================================
  Files         137      124      -13     
  Lines       14784    14211     -573     
==========================================
+ Hits        11370    11387      +17     
+ Misses       3414     2824     -590     
Flag Coverage Δ
Linux 80.05% <75.29%> (?)
Windows 80.12% <75.29%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lixiliu lixiliu self-assigned this Jan 24, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds local time support for tz-naive timestamps by introducing Chronify-based localization during dataset registration, along with new/updated time-zone configuration semantics and tests.

Changes:

  • Adds Chronify-powered timestamp localization helpers and wires them into dataset registration.
  • Updates time zone format modeling (rename to aligned_in_local_std_time, allow time_zone=None for aligned absolute time).
  • Adds tests for localization routing and UTC-offset parsing in time-in-parts conversion.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
tests/test_localize_timestamps_if_necessary.py New tests validating localization routing and no-op behavior.
tests/test_create_time_dimensions.py Minor test fix for periods argument type.
tests/test_convert_time_format_if_necessary.py New tests for UTC offset parsing and timestamp transformation behavior.
tests/data/dimension_models/minimal/dimension_test_time.json5 Updates time zone format string to new enum value.
pyproject.toml Switches Chronify dependency to a Git URL/branch for local-time work.
missing_associations/geography__subsector.csv Adds missing associations CSV fixture.
dsgrid/utils/dataset.py Adds Chronify localization helpers and localize_timestamps_if_necessary.
dsgrid/registry/dataset_registry_manager.py Integrates time-in-parts conversion and timestamp localization into registration.
dsgrid/query/query_submitter.py Uses TIME_ZONE_COLUMN consistently; refactors Chronify conversion routing.
dsgrid/dimension/time.py Renames time zone enum value to aligned_in_local_std_time.
dsgrid/dataset/dataset_schema_handler_base.py Adds Chronify-based timestamp localization pathway at schema-handler level.
dsgrid/config/index_time_dimension_config.py Updates Chronify return type hints for index time config.
dsgrid/config/dimensions.py Updates time config models (offset column, time zone formats, removes legacy model).
dsgrid/config/date_time_dimension_config.py Adds Chronify dtype support + localization plan logic.
dsgrid/common.py Adds TIME_COLUMN constant.
dsgrid-test-data Updates submodule pointer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# update time_dim
time_column = time_dim.model.column_format.time_column
time_dim.model.column_format = TimeFormatDateTimeTZModel(time_column=time_column)

Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After localization, the code updates the file path and the time dimension dtype, but it does not update config.model.data_layout.data_file.columns (when present) to reflect TIMESTAMP_TZ. This can leave the dataset config’s declared schema inconsistent with the written parquet. Consider mirroring _update_config_for_timestamp behavior here: replace the time column entry with data_type='TIMESTAMP_TZ' when columns is not None.

Suggested change
# update data_file columns schema, if present, to reflect TIMESTAMP_TZ
data_file = config.model.data_layout.data_file
if data_file.columns is not None:
updated_columns: list[Column] = []
for col in data_file.columns:
if col.name == time_column:
updated_columns.append(Column(name=time_column, data_type="TIMESTAMP_TZ"))
else:
updated_columns.append(col)
data_file.columns = updated_columns

Copilot uses AI. Check for mistakes.
pyproject.toml Outdated
# "boto3",
# "s3path",
"chronify ~= 0.6.0",
"chronify@git+https://github.com/NatLabRockies/chronify.git@ll/local_time2", # "chronify ~= 0.6.0",
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a Git URL pinned to a branch name makes builds non-reproducible (branch heads can move) and can break dependency resolution for consumers. Prefer pinning to an immutable commit SHA or (ideally) a released Chronify version, and keep VCS overrides behind a dev-only extra if needed.

Suggested change
"chronify@git+https://github.com/NatLabRockies/chronify.git@ll/local_time2", # "chronify ~= 0.6.0",
"chronify ~= 0.6.0",

Copilot uses AI. Check for mistakes.
runtime_config = dsgrid.runtime_config
match localization_plan:
case "localize_to_single_tz":
to_time_zone = time_dim._get_chronify_time_zone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't rely on a private method. Why do we need a chronify-specific method? Can't it be time_dim.get_time_zone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we need the ZoneInfo obj version of time_dim.get_time_zone.

I don't want to make it a public method, though I can, because the method is only for the datetime time config.

if isinstance(res_df, DataFrame):
res_df = res_df.toPandas()
assert sorted(res_df[TIME_COLUMN]) == sorted(called_df[TIME_COLUMN])
# target.assert_called_once()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are many of these lines commented out?

Copy link
Contributor Author

@lixiliu lixiliu Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They work in local purest but don't seem to work for the CI version. I'll remove

@lixiliu lixiliu requested a review from daniel-thom February 16, 2026 10:28
@elainethale elainethale self-requested a review February 19, 2026 20:12
Copy link
Contributor

@elainethale elainethale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this code brings two main questions to mind:

  1. Is it possible to know what operations will be/are applied to bring one time convention in line with another?

  2. Are we supporting all time dimension types (when reasonably compatible) as possible project time base dimensions and/or supplemental dimensions (are they queryable)?

Comment on lines 254 to +255
"""Format of timestamps in a dataset is timezone-naive datetime,
requiring localization to time zones."""
timestamps can be localized to time zone(s)."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps delete, "timestamps can be localized to time zone(s)". This model is just documenting what the time format of the dataset is, which is covered by the first line.

Comment on lines +292 to 297
offset_column: str | None = Field(
title="offset_column",
description="Name of the offset column in the dataset. Value is the UTC offset in hours (e.g., -8 or -08:00). "
"If None, the offset will not be set.",
default=None,
title="time_zone",
description="IANA time zone of the timestamps. Use None for time zone-naive timestamps.",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only support standard time, or can offset change over the year?

It seems like it would be best to support time_zone strings as well ... Can we provide guidance/support for data that includes daylight savings? I.e., be robust to repeated hours in fall and skipped hour in spring?

Timezone offset is very confusing. Can we provide an example that clarifies our sign convention whenever it is mentioned?



class AlignedTimeSingleTimeZone(DSGBaseModel):
"""For each geography, data has the same set of timestamps in absolute time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""All geographies have data with the same set of timestamps in absolute time.

"""For each geography, data has the same set of timestamps in absolute time.
Timestamps in the data must be tz-aware.

E.g., data in CA and NY both start in 2018-01-01 00:00 EST.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great example. We should have examples like this everywhere, especially for enum fields and data models / data model fields.


class AlignedTimeSingleTimeZone(DSGBaseModel):
"""For each geography, data has the same set of timestamps in absolute time.
Timestamps in the data must be tz-aware.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be true any longer.

SYNC_EXCLUDE_LIST = ["*.DS_Store", "**/*.lock"]
TIME_ZONE_COLUMN = "time_zone"
VALUE_COLUMN = "value"
TIME_COLUMN = "timestamp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we standardizing to this now?

If so, can data submitters map other column names to this one if necessary? The same question applies to these other columns: "value" and "time_zone".

Comment on lines +371 to +372
Time zone conversion converts from tz-aware timestamps to
tz-naive timestamps with the specified time zone as a new column.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Time zone "conversion" is a vague term. Is it always tz-aware to tz-naive + time_zone column throughout the dsgrid and chronify codebases, or is it overloaded?

time_zone: tzinfo | None,
scratch_dir_context: ScratchDirContext,
) -> DataFrame:
"""Create a single time zone-localized table with chronify and Spark and a Hive Metastore."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check that the doc strings for the spark methods repeat relevant information from the duckdb methods.

config: DatasetConfig,
scratch_dir_context: ScratchDirContext,
) -> tuple[DataFrame, bool]:
"""Localize tz-naive timestamps to time zone(s) in the dataframe if necessary using Chronify."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what conditions is localization "necessary"?

measurement_type=self._model.measurement_type,
interval_type=self._model.time_interval_type,
)
case TimeZoneFormat.ALIGNED_IN_LOCAL_STD_TIME:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also support local time data with local timestamps like America/Denver? _TZ type, "aligned in standard time", but timestamps are localized to clock time, including DST.

Copy link
Contributor

@elainethale elainethale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a couple of additional points raised by Claude that make sense to me and we might want to address:

If someone set an ANNUAL or REPRESENTATIVE_PERIOD [project] base dimension, there's no guard — it would fail only at dataset registration or query time with a confusing error. Consider adding a project-level validator or at least documenting that DATETIME is the expected base type.
AlignedTimeSingleTimeZone.time_zone is now str | None. When None, get_time_zones() returns []. This flows through to get_localization_plan() returning None (no localization). This enables true tz-naive datetime data, which is useful. But it also means ALIGNED_IN_ABSOLUTE_TIME with time_zone=None and TIMESTAMP_NTZ produces data with no timezone information anywhere — which may be confusing for downstream consumers. Worth documenting what this combination means.

DATETIME = "datetime"
ANNUAL = "annual"
REPRESENTATIVE_PERIOD = "representative_period"
DATETIME_EXTERNAL_TZ = "datetime_external_tz"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed by this PR and needs to be deleted or marked for deprecation, right?

ALIGNED_IN_ABSOLUTE_TIME = "aligned_in_absolute_time"
ALIGNED_IN_CLOCK_TIME = "aligned_in_clock_time"
ALIGNED_IN_LOCAL_STD_TIME = "aligned_in_local_std_time"
LOCAL_AS_STRINGS = "local_as_strings"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used? Claude thinks this might cause a silent failure if a user selected it.

)


class LeapDayAdjustmentType(DSGEnum):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Adjustment class and the next two seem one-directional and/or ambiguous to me. Should we have separate "adjustment types" for with leap day -> without leap day and without leap day - with leap day, and likewise for multiple types of DST conversions (standard timestamps <-> timestamps with DST, representative or indexed timestamps with different DST handling <-> standard timestamps or timestamps with DST)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to my comment about whether we want to explicitly support standard timezone types like America/Denver. Claude raised a similar question.


def _get_chronify_dtype(self) -> chronify.TimeDataType:
match self.model.column_format.dtype:
case "TIMESTAMP_NTZ":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this string be all-caps is unfortunate. All strings in our config files are lowercase. Can we still change this in chronify?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants