climate_ref_core.dataset_registry
Data registries for non-published reference data
These data are placeholders until they have been added to obs4MIPs. The CMIP7 Assessment Fast Track REF requires that reference datasets are openly licensed before they are included in any published data catalogs.
DatasetRegistryManager
A collection of reference dataset registries
The REF requires reference datasets in addition to the obs4MIPs data that can be downloaded via ESGF. Each provider may need a different set of reference data. These provider-specific datasets are either not yet available in obs4MIPs or are post-processed from obs4MIPs.
A dataset registry consists of a file that lists files and their checksums, combined with a base URL that is used to fetch the files. Pooch is used within the DatasetRegistryManager to manage the caching, downloading and validation of the files.
All datasets that are registered here are expected to be openly licensed and freely available.
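The pairing of a registry file with a base URL can be illustrated with a small standard-library sketch. It assumes the whitespace-separated "file name, then checksum" line layout that pooch registry files use; the file names, checksums and base URL below are hypothetical, and the helper functions are illustrative stand-ins rather than part of `climate_ref_core` or pooch.

```python
from __future__ import annotations


def parse_registry(text: str) -> dict[str, str]:
    """Parse 'filename checksum' lines, skipping blank lines and comments."""
    entries: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Only the first two columns are used in this sketch.
        filename, checksum = line.split()[:2]
        entries[filename] = checksum
    return entries


def file_url(base_url: str, filename: str) -> str:
    """Join the base URL and a file name from the registry."""
    return base_url.rstrip("/") + "/" + filename


# Hypothetical registry content and base URL, for illustration only.
EXAMPLE_REGISTRY = """\
# file name, then checksum
obs/tas_example.nc sha256:0123abcd
obs/pr_example.nc sha256:4567ef01
"""
EXAMPLE_BASE_URL = "https://example.org/ref-data/v1/"

entries = parse_registry(EXAMPLE_REGISTRY)
urls = {name: file_url(EXAMPLE_BASE_URL, name) for name in entries}
```

In the real package these steps are delegated to pooch, which also handles the download, the cache and the checksum comparison.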
Source code in packages/climate-ref-core/src/climate_ref_core/dataset_registry.py
__getitem__(item)
keys()
register(name, base_url, package, resource, cache_name=None, version=None, legacy_cache_dirs=None)
Register a new dataset registry
This creates a new Pooch registry and adds it to the list of registries. Providers typically use this to register a new collection of datasets at runtime.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the registry. This is used to identify the registry. | required |
| `base_url` | `str` | Common URL prefix for the files. | required |
| `package` | `str` | Name of the package containing the registry resource. | required |
| `resource` | `str` | Name of the resource in the package that contains a list of files and checksums. This must be formatted in the way that pooch expects. | required |
| `version` | `str \| None` | The version of the data. Changing the version will invalidate the cache and force a re-download of the data. | `None` |
| `cache_name` | `str \| None` | Name to use to generate the cache directory. This defaults to the value of | `None` |
| `legacy_cache_dirs` | `list[Path] \| None` | Previous cache directories to migrate files from. If provided, any files that exist in a legacy directory but not in the current cache will be moved to the new location. This avoids re-downloading data after a cache layout change. | `None` |
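The migration behaviour described for `legacy_cache_dirs` can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the package's implementation: the function name `migrate_legacy_cache` and the directory layout are hypothetical, and the sketch assumes that relative paths are preserved when files move.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def migrate_legacy_cache(cache_dir: Path, legacy_dirs: list[Path]) -> list[Path]:
    """Move files that exist in a legacy directory but not in the current cache."""
    moved: list[Path] = []
    for legacy_dir in legacy_dirs:
        if not legacy_dir.is_dir():
            continue
        # Materialise the listing first so moving files does not disturb iteration.
        for src in sorted(legacy_dir.rglob("*")):
            if not src.is_file():
                continue
            dest = cache_dir / src.relative_to(legacy_dir)
            if dest.exists():
                continue  # a file already in the current cache is left untouched
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved


# Hypothetical layout: one file only in the old cache, one present in both.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    legacy, current = root / "old-cache", root / "new-cache"
    (legacy / "obs").mkdir(parents=True)
    (legacy / "obs" / "a.nc").write_bytes(b"legacy a")
    (legacy / "obs" / "b.nc").write_bytes(b"legacy b")
    (current / "obs").mkdir(parents=True)
    (current / "obs" / "b.nc").write_bytes(b"current b")

    moved = migrate_legacy_cache(current, [legacy])
    moved_names = [p.relative_to(current).as_posix() for p in moved]
    kept_b = (current / "obs" / "b.nc").read_bytes()
```

Only `obs/a.nc` is moved here; `obs/b.nc` already exists in the current cache, so the current copy is kept.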
fetch_all_files(registry, name, output_dir, symlink=False, verify=True)
Fetch all files associated with a pooch registry and write them to an output directory.
Pooch fetches, caches and validates the downloaded files. Subsequent calls to this function will not refetch any previously downloaded files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `registry` | `Pooch` | Pooch directory containing a set of files that should be fetched. | required |
| `name` | `str` | Name of the registry. | required |
| `output_dir` | `Path \| None` | The root directory to write the files to. The directory will be created if it doesn't exist, and matching files will be overwritten. If no directory is provided, the files will be fetched from the remote server, but not copied anywhere. | required |
| `symlink` | `bool` | If True, symlink all files to this directory. Otherwise, perform a copy. | `False` |
| `verify` | `bool` | If True, verify the checksums of the local files against the registry. | `True` |
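The difference between `symlink=True` and the default copy can be sketched as follows. This is an illustration under assumptions rather than the function's actual body: `place_file` is a hypothetical helper, the cached file stands in for one that pooch has already fetched, and the sketch assumes the file's relative path is kept under the output directory.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def place_file(cached: Path, output_dir: Path, relative_name: str, symlink: bool = False) -> Path:
    """Write one cached file into output_dir, as a symlink or as a copy."""
    dest = output_dir / relative_name
    dest.parent.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    if dest.exists() or dest.is_symlink():
        dest.unlink()  # matching files are overwritten
    if symlink:
        dest.symlink_to(cached.resolve())
    else:
        shutil.copy2(cached, dest)
    return dest


# Hypothetical cached file, standing in for a file already fetched into the cache.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cached = root / "cache" / "obs" / "a.nc"
    cached.parent.mkdir(parents=True)
    cached.write_bytes(b"data")

    copied = place_file(cached, root / "out-copy", "obs/a.nc")
    linked = place_file(cached, root / "out-link", "obs/a.nc", symlink=True)
    copy_is_link = copied.is_symlink()
    link_is_link = linked.is_symlink()
    link_content = linked.read_bytes()
```

Symlinking avoids duplicating large files on disk, but the output directory then depends on the cache staying in place; copying produces an output directory that stands on its own.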
resolve_cache_dir(cache_name)
Resolve the cache directory for a registry.
If the REF_DATASET_CACHE_DIR environment variable is set, use that as the root.
Otherwise, fall back to the OS cache under climate_ref.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cache_name` | `str` | Subdirectory name within the cache root. | required |
Returns:
| Type | Description |
|---|---|
| | The resolved cache directory path. |
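The lookup order described above (environment variable first, OS cache second) can be sketched as below. The function name `resolve_cache_dir_sketch` is hypothetical, and the fallback path is an assumption: a Linux-style `~/.cache/climate_ref` stands in for whatever OS-specific cache location the real function uses. Treating an empty variable as unset is also an assumption.

```python
import os
from pathlib import Path


def resolve_cache_dir_sketch(cache_name: str) -> Path:
    """Return <root>/<cache_name>, preferring REF_DATASET_CACHE_DIR as the root."""
    root = os.environ.get("REF_DATASET_CACHE_DIR")
    if root:  # assumption: an empty value is treated like an unset variable
        return Path(root) / cache_name
    # Assumption: Linux-style user cache; the real OS cache location may differ.
    return Path.home() / ".cache" / "climate_ref" / cache_name


# Demonstrate both branches, then restore the caller's environment.
_previous = os.environ.get("REF_DATASET_CACHE_DIR")
try:
    os.environ["REF_DATASET_CACHE_DIR"] = "/data/ref-cache"
    with_override = resolve_cache_dir_sketch("example-provider")
    del os.environ["REF_DATASET_CACHE_DIR"]
    without_override = resolve_cache_dir_sketch("example-provider")
finally:
    os.environ.pop("REF_DATASET_CACHE_DIR", None)
    if _previous is not None:
        os.environ["REF_DATASET_CACHE_DIR"] = _previous
```

Setting `REF_DATASET_CACHE_DIR` is useful on shared systems where the home directory has a small quota and the data should live on a larger volume.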
validate_registry_cache(registry, name)
Validate that all files in a registry are cached and have correct checksums.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `registry` | `Pooch` | Pooch registry to validate. | required |
| `name` | `str` | Name of the registry (for error messages). | required |
Returns:
| Type | Description |
|---|---|
| `list[str]` | List of error messages for any validation failures. Empty list if all files are valid. |
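The "collect errors instead of raising" contract can be sketched with `hashlib`. This is a minimal stand-in under assumptions: `validate_cache` is a hypothetical helper that takes a plain mapping of file names to SHA-256 hex digests rather than a `Pooch` object, and the error message wording is illustrative.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def validate_cache(cache_dir: Path, checksums: dict[str, str], name: str) -> list[str]:
    """Return one message per missing or corrupted file; an empty list means all valid."""
    errors: list[str] = []
    for filename, expected in sorted(checksums.items()):
        path = cache_dir / filename
        if not path.is_file():
            errors.append(f"{name}: {filename} is not cached")
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            errors.append(f"{name}: checksum mismatch for {filename}")
    return errors


# Hypothetical cache with one valid file, one corrupted file and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    (cache / "good.nc").write_bytes(b"good")
    (cache / "bad.nc").write_bytes(b"corrupted")
    expected_checksums = {
        "good.nc": hashlib.sha256(b"good").hexdigest(),
        "bad.nc": hashlib.sha256(b"original").hexdigest(),
        "missing.nc": hashlib.sha256(b"anything").hexdigest(),
    }
    errors = validate_cache(cache, expected_checksums, "example")
```

Returning a list lets the caller report every problem in one pass instead of stopping at the first failure.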