climate_ref.datasets.utils
Shared utility functions for dataset adapters
build_instance_id(datasets, drs_items, prefix, transform=None, *, copy=True)
Add an instance_id column built from DRS components.
Rows where any required DRS component is None/NA are dropped with a warning so a single malformed file does not abort the whole ingestion batch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `DataFrame` | Data catalog with one row per file. | required |
| `drs_items` | `list[str]` | Column names that make up the instance id, in order. | required |
| `prefix` | `str` | Prefix to use for the instance id (e.g. …) | required |
| `transform` | `Callable[[str, Any], str] \| None` | Optional per-column value transform; defaults to … | `None` |
| `copy` | `bool` | If … | `True` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | Catalog with the … |
Source code in packages/climate-ref/src/climate_ref/datasets/utils.py
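The drop-and-warn behaviour described for build_instance_id can be sketched as below. This is an illustration and not the library's implementation: the function name `build_instance_id_sketch`, the `"."` separator, and the warning text are assumptions, and the `transform` and `copy` options are left out because their exact semantics are not spelled out here.

```python
import warnings
from typing import Sequence

import pandas as pd


def build_instance_id_sketch(
    datasets: pd.DataFrame, drs_items: Sequence[str], prefix: str
) -> pd.DataFrame:
    """Hypothetical stand-in for build_instance_id (no transform/copy handling)."""
    columns = list(drs_items)

    # A row is unusable if any required DRS component is None/NA
    missing = datasets[columns].isna().any(axis=1)
    if missing.any():
        warnings.warn(
            f"Dropping {int(missing.sum())} row(s) with missing DRS components",
            stacklevel=2,
        )

    # Work on a copy so the caller's catalog is left untouched
    kept = datasets.loc[~missing].copy()

    # Assumption: components are joined with "." after the prefix
    kept["instance_id"] = [
        ".".join([prefix, *map(str, row)])
        for row in kept[columns].itertuples(index=False, name=None)
    ]
    return kept
```

With a two-row catalog in which one row has a missing `source_id`, the sketch returns a single row and emits one warning, so the rest of the batch still goes through.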
clean_branch_time(branch_time)
Clean branch time values, handling missing values and EC-Earth3 suffixes.
This handles the EC-Earth3 encoding where branch_time_in_child and
branch_time_in_parent have a trailing 'D' suffix (e.g. "123D").
We strip the 'D' and coerce the remaining value to a float,
treating any missing or malformed entries as NaN.
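As a rough illustration of the cleaning rule described above, the following sketch handles a single value. The name `clean_branch_time_value` is hypothetical, and working on scalars (instead of whatever container the real function accepts) is an assumption made to keep the example dependency-free.

```python
import math
from typing import Any


def clean_branch_time_value(value: Any) -> float:
    """Hypothetical scalar version of the branch-time cleaning rule."""
    if value is None:
        return math.nan

    text = str(value).strip()
    # EC-Earth3 style encoding: drop a trailing "D", e.g. "123D" -> "123"
    if text.endswith("D"):
        text = text[:-1]

    try:
        return float(text)
    except ValueError:
        # Malformed entries are treated as missing
        return math.nan
```

For example, `clean_branch_time_value("123D")` gives `123.0`, while `None` or a non-numeric string gives NaN.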
extract_version_from_path(parent)
Extract the dataset version from a directory path.
Splits the path into individual directory segments and matches version patterns (vYYYYMMDD or vN) against standalone segments only. When multiple segments match, the longest (most specific) match wins. Falls back to "v0" if no segment matches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `parent` | `str` | Parent directory path | required |
Returns:
| Type | Description |
|---|---|
| `str` | Version string (e.g., "v20250622", "v1", or "v0" as fallback) |
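The segment-matching rule can be sketched as follows. This is a minimal approximation, not the library's code: the name `extract_version_sketch`, the regular expression `v\d+` (which covers both vYYYYMMDD and vN), and the use of POSIX-style paths are assumptions.

```python
import re
from pathlib import PurePosixPath

# Assumed pattern: "v" followed only by digits, covering vYYYYMMDD and vN
_VERSION_RE = re.compile(r"v\d+")


def extract_version_sketch(parent: str) -> str:
    """Hypothetical re-implementation of the documented matching rule."""
    segments = PurePosixPath(parent).parts

    # Only standalone segments count; "model-v2" or "version2" do not match
    matches = [seg for seg in segments if _VERSION_RE.fullmatch(seg)]
    if not matches:
        return "v0"

    # When several segments match, prefer the longest (most specific) one
    return max(matches, key=len)
```

Under these assumptions, `"/data/v1/model/v20250622/files"` yields `"v20250622"`, and a path with no standalone version segment yields `"v0"`.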
parse_cftime_dates(dt_str, calendar='standard')
Parse date strings into cftime.datetime objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dt_str` | `Series[str]` | Series of date strings in "YYYY-MM-DD" or "YYYY-MM-DD HH:MM:SS" format | required |
| `calendar` | `Series[str] \| str` | Calendar name(s). Either a single string applied to all rows, or a Series with per-row calendar values. | `'standard'` |
parse_drs_daterange(date_range)
Parse a DRS date range string into start and end dates.
The returned range is only an estimate, used until the file itself has been fully parsed.
Supports date formats used in CMIP6 and CMIP7 filenames:
- YYYY-YYYY (4 chars, yearly)
- YYYYMM-YYYYMM (6 chars, monthly)
- YYYYMMDD-YYYYMMDD (8 chars, daily)
- YYYYMMDDhhmm-YYYYMMDDhhmm (12 chars, sub-daily)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `date_range` | `str` | Date range string | required |
Returns:
| Type | Description |
|---|---|
| `tuple[str \| None, str \| None]` | Tuple containing start and end dates as strings in the format "YYYY-MM-DD" |
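One way the four supported widths could be mapped to "YYYY-MM-DD" strings is sketched below. Several details are assumptions and not confirmed behaviour of parse_drs_daterange: the name `parse_drs_daterange_sketch`, returning `(None, None)` for unrecognised input, filling a missing month or day with the first value for the start and the last value for the end, and using the Gregorian month length for that fill.

```python
import calendar
import re
from typing import Optional, Tuple

# Accepted token widths: YYYY, YYYYMM, YYYYMMDD, YYYYMMDDhhmm
_TOKEN = r"\d{4}|\d{6}|\d{8}|\d{12}"
_RANGE_RE = re.compile(rf"({_TOKEN})-({_TOKEN})")


def _to_iso_date(token: str, is_end: bool) -> Optional[str]:
    year = int(token[:4])
    # Assumption: a missing month is January for the start, December for the end
    month = int(token[4:6]) if len(token) >= 6 else (12 if is_end else 1)
    if not 1 <= month <= 12:
        return None
    if len(token) >= 8:
        day = int(token[6:8])  # any hhmm part is ignored
    else:
        # Assumption: Gregorian month length for the end-of-period estimate
        day = calendar.monthrange(year, month)[1] if is_end else 1
    return f"{year:04d}-{month:02d}-{day:02d}"


def parse_drs_daterange_sketch(date_range: str) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical estimate of start/end dates from a DRS date range."""
    match = _RANGE_RE.fullmatch(date_range)
    # Both halves must use the same width, e.g. YYYYMM-YYYYMM
    if match is None or len(match.group(1)) != len(match.group(2)):
        return None, None
    return _to_iso_date(match.group(1), False), _to_iso_date(match.group(2), True)
```

Under these assumptions, `"185001-201412"` maps to `("1850-01-01", "2014-12-31")`, and a string such as `"fx"` maps to `(None, None)`.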
sort_data_catalog(catalog)
Sort a dataset catalog DataFrame by instance_id and start_time (with NA values last).
This provides a stable ordering for testing and debugging.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `catalog` | `DataFrame` | Dataset catalog DataFrame with at least "instance_id" and "start_time" columns | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | Sorted DataFrame |
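The ordering described above corresponds closely to a plain pandas sort, sketched here. The name `sort_data_catalog_sketch` is hypothetical, and whether the real function also resets the index is not stated, so the sketch leaves the index untouched.

```python
import pandas as pd


def sort_data_catalog_sketch(catalog: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical equivalent: sort by instance_id, then start_time, NA last."""
    return catalog.sort_values(["instance_id", "start_time"], na_position="last")


catalog = pd.DataFrame(
    {
        "instance_id": ["b", "a", "a"],
        "start_time": pd.to_datetime(["2000-01-01", None, "1990-01-01"]),
    }
)
sorted_catalog = sort_data_catalog_sketch(catalog)
```

Here the two `"a"` rows come first, with the row whose `start_time` is missing placed after the dated one.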
validate_path(raw_path)
Validate the prefix of a dataset against the data directory