climate_ref.datasets.catalog_builder
Catalog builder for discovering and parsing dataset files into a DataFrame
build_catalog(paths, parsing_func, include_patterns=None, depth=0, n_jobs=1)
Build a catalog DataFrame by discovering and parsing dataset files
Orchestrates file discovery, parallel parsing, DataFrame construction, and INVALID_ASSET row filtering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `paths` | `list[str]` | Root directories to search for files | required |
| `parsing_func` | `DatasetParsingFunction` | Function that parses each file and returns a metadata dictionary. Must return a dict with an | required |
| `include_patterns` | `list[str] \| None` | Glob patterns to include (e.g. | `None` |
| `depth` | `int` | Maximum directory depth to search | `0` |
| `n_jobs` | `int` | Number of parallel workers for parsing. | `1` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame containing parsed metadata for all valid files |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no files matching the include patterns are found in the specified paths |
Source code in packages/climate-ref/src/climate_ref/datasets/catalog_builder.py
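The orchestration described above (discover, fail if nothing is found, parse, drop invalid entries) can be sketched with the standard library alone. This is an illustration, not the `climate_ref` implementation: `build_records_sketch` and its `pattern` argument are hypothetical, it returns a list of dicts instead of a DataFrame, and it assumes a failed parse is flagged by an `INVALID_ASSET` key, which this page names but does not define.

```python
import glob
import os


def build_records_sketch(paths, parsing_func, pattern="*.nc", invalid_key="INVALID_ASSET"):
    """Hypothetical stand-in for the discover -> parse -> filter pipeline."""
    # 1. Discovery: only the top level of each root (roughly depth=0).
    files = sorted({f for root in paths for f in glob.glob(os.path.join(root, pattern))})
    if not files:
        raise ValueError(f"No files matching {pattern!r} found in {paths}")
    # 2. Parsing: one metadata dict per file.
    records = [parsing_func(f) for f in files]
    # 3. Filtering: drop entries flagged as invalid (assumed marker key).
    return [r for r in records if invalid_key not in r]
```

If pandas is available, passing the returned list to `pandas.DataFrame(...)` would correspond to the DataFrame construction step.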
discover_files(paths, include_patterns=None, depth=0)
Discover files matching the given glob patterns within the specified paths
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `paths` | `list[str]` | Root directories (or single files) to search | required |
| `include_patterns` | `list[str] \| None` | Glob patterns to include (e.g. | `None` |
| `depth` | `int` | Maximum directory depth below each root to search. | `0` |
Returns:
| Type | Description |
|---|---|
| `list[str]` | Sorted, deduplicated list of matching file paths |
Source code in packages/climate-ref/src/climate_ref/datasets/catalog_builder.py
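To make the interaction of `paths`, `include_patterns`, and `depth` concrete, here is a minimal sketch of depth-limited discovery built on `os.walk` and `fnmatch`. It is not the library's code: `discover_files_sketch` is a hypothetical name, and two behaviours are assumptions, namely that `include_patterns=None` matches every file and that `depth=0` means "the root directory only".

```python
import fnmatch
import os


def discover_files_sketch(paths, include_patterns=None, depth=0):
    """Illustrative stand-in, not the climate_ref implementation."""
    patterns = include_patterns or ["*"]  # assumption: None matches everything
    found = set()  # a set removes duplicates across overlapping roots
    for root in paths:
        if os.path.isfile(root):
            # A root may be a single file rather than a directory.
            if any(fnmatch.fnmatch(os.path.basename(root), p) for p in patterns):
                found.add(root)
            continue
        root = os.path.normpath(root)
        base_level = root.count(os.sep)
        for dirpath, dirnames, filenames in os.walk(root):
            level = dirpath.count(os.sep) - base_level
            if level >= depth:
                dirnames[:] = []  # stop descending (assumption: depth=0 is root only)
            for name in filenames:
                if any(fnmatch.fnmatch(name, p) for p in patterns):
                    found.add(os.path.join(dirpath, name))
    return sorted(found)
```

Clearing `dirnames` in place is the documented way to prune an `os.walk` traversal, so subdirectories past the depth limit are never visited.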
iter_built_catalogs(paths, parsing_func, include_patterns=None, depth=0, n_jobs=1, chunk_size=10000)
Yield catalog DataFrames in chunks, parsing files chunk by chunk.
Peak memory is bounded by chunk_size files because each chunk's
parsed entries and DataFrame are released before the next chunk starts parsing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `paths` | `list[str]` | Root directories to search for files. | required |
| `parsing_func` | `DatasetParsingFunction` | Function that parses each file and returns a metadata dictionary. Must return a dict with an | required |
| `include_patterns` | `list[str] \| None` | Glob patterns to include (e.g. | `None` |
| `depth` | `int` | Maximum directory depth to search. | `0` |
| `n_jobs` | `int` | Number of parallel workers per chunk for parsing. | `1` |
| `chunk_size` | `int` | Soft target for the number of files per chunk. | `10000` |
Yields:
| Type | Description |
|---|---|
| `DataFrame` | DataFrames with parsed metadata for each chunk. Empty chunks (all invalid) are skipped. |
Source code in packages/climate-ref/src/climate_ref/datasets/catalog_builder.py
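The memory bound comes from processing one chunk at a time inside a generator. The sketch below illustrates that pattern under stated assumptions: `iter_valid_chunks_sketch` is a hypothetical name, chunks are passed in as plain lists of paths, each chunk is yielded as a list of dicts rather than a DataFrame, and invalid entries are assumed to carry an `INVALID_ASSET` key.

```python
def iter_valid_chunks_sketch(chunks, parsing_func, invalid_key="INVALID_ASSET"):
    """Parse one chunk at a time and yield only the valid entries of each."""
    for chunk in chunks:
        parsed = [parsing_func(path) for path in chunk]
        valid = [entry for entry in parsed if invalid_key not in entry]
        if valid:  # chunks where every file was invalid are skipped
            yield valid
        # `parsed` and `valid` are rebound on the next iteration, so earlier
        # chunks can be garbage-collected once the caller is done with them.
```

Because the function is a generator, no file is parsed until the caller asks for the next chunk, for example in a `for` loop that writes each chunk to a database before moving on.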
iter_discovered_chunks(paths, include_patterns=None, depth=0, chunk_size=10000)
Yield batches of discovered file paths in chunks.
Walks the directory tree once and yields batches that target chunk_size paths each.
Batches are flushed only at directory boundaries, so files within a single directory are kept together and a batch can exceed chunk_size when one directory holds more matching files.
This assumption holds for the current directory structures used by CMIP6 and CMIP7.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `paths` | `list[str]` | Root directories (or single files) to search. | required |
| `include_patterns` | `list[str] \| None` | Glob patterns to include (e.g. | `None` |
| `depth` | `int` | Maximum directory depth below each root to search. | `0` |
| `chunk_size` | `int` | Soft target for the number of files per batch. A batch may exceed this if a single directory contains more matching files. | `10000` |
Yields:
| Type | Description |
|---|---|
| `list[str]` | Lists of file paths. Each list is sorted and deduplicated. |
Source code in packages/climate-ref/src/climate_ref/datasets/catalog_builder.py
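The directory-boundary rule is easiest to see in a small example. The following sketch is one possible flushing policy, not necessarily the one `climate_ref` uses: `iter_dir_chunks_sketch` is a hypothetical name, it takes a single root, it ignores `include_patterns` and `depth`, and it flushes the pending batch just before a directory would push it past `chunk_size`.

```python
import os


def iter_dir_chunks_sketch(root, chunk_size=10000):
    """Yield sorted batches of file paths, flushing only between directories."""
    batch = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        in_dir = [os.path.join(dirpath, name) for name in sorted(filenames)]
        # Flush before adding a directory that would push the batch over the target.
        if batch and len(batch) + len(in_dir) > chunk_size:
            yield sorted(batch)
            batch = []
        # A directory is never split, so one large directory can exceed chunk_size.
        batch.extend(in_dir)
    if batch:
        yield sorted(batch)
```

With directories holding 3, 2, and 4 files and `chunk_size=3`, this yields batches of 3, 2, and 4 paths: the last batch exceeds the target because its directory is kept whole.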
parse_files(assets, parsing_func, n_jobs=1)
Parse files using the given parsing function, optionally in parallel
Parsing is I/O-bound (opening netCDF files), so threads are used rather than processes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `assets` | `list[str]` | List of file paths to parse | required |
| `parsing_func` | `DatasetParsingFunction` | Function to extract metadata from each file | required |
| `n_jobs` | `int` | Number of parallel workers. | `1` |
Returns:
| Type | Description |
|---|---|
| `list[dict[str, Any]]` | List of parsed metadata dictionaries, in the same order as |
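Thread-based parsing that keeps results in input order can be sketched with `concurrent.futures.ThreadPoolExecutor`. This is an illustration rather than the library's code: `parse_files_sketch` is a hypothetical name, and it assumes `n_jobs` is a positive integer (any special values the real function accepts are not covered).

```python
from concurrent.futures import ThreadPoolExecutor


def parse_files_sketch(assets, parsing_func, n_jobs=1):
    """Apply parsing_func to every path, optionally with a thread pool."""
    if n_jobs == 1:
        # Serial path: no pool overhead for the default setting.
        return [parsing_func(path) for path in assets]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in input order, regardless of
        # which worker thread finishes first.
        return list(pool.map(parsing_func, assets))
```

Threads suit this workload because time spent waiting on file I/O releases the interpreter for other workers; a CPU-bound parser would gain little from this approach.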