Datasets

Utilities for accessing and managing climate datasets in ClimAID.


Load Data

High-level function for loading climate projection datasets.

Load a CMIP6 climate projection dataset for a region.

If the dataset is not already cached locally, it is downloaded from the remote source (e.g., HuggingFace or Zenodo) before being read.

Parameters

region : str

Region to load. Must be "india" (default) or "south_asia".

columns : list of str, optional

Columns to read from the Parquet file. If None (default), all columns are read.

Returns

pandas.DataFrame

Contents of the dataset.

Raises

ValueError

If region is not "india" or "south_asia".

Notes

  • Datasets are cached under ~/.climaid/datasets/
  • Subsequent calls read the cached file without re-downloading
Source code in climaid\projections\loader.py
import pandas as pd

# _manager is assumed to be a module-level DatasetManager instance defined
# earlier in loader.py, outside this excerpt.


def load_cmip6(region="india", columns=None):
    """
    Load a CMIP6 climate projection dataset for a region.

    If the dataset is not already cached locally, it will be downloaded
    from the remote source (e.g., HuggingFace or Zenodo).

    Parameters
    ----------
    region : str, default "india"
        Region to load. Must be "india" or "south_asia".
    columns : list of str, optional
        Columns to read from the Parquet file. If None, all columns are read.

    Returns
    -------
    pandas.DataFrame
        Contents of the dataset.

    Raises
    ------
    ValueError
        If region is not "india" or "south_asia".

    Notes
    -----
    - Datasets are cached under ~/.climaid/datasets/
    - Subsequent calls read the cached file without re-downloading
    """

    # Map the public region name to its key in the dataset registry.
    mapping = {
        "india": "cmip6_india",
        "south_asia": "cmip6_south_asia"
    }

    if region not in mapping:
        raise ValueError("region must be 'india' or 'south_asia'")

    dataset_name = mapping[region]

    # Download on first use, otherwise reuse the cached file.
    path = _manager.fetch(dataset_name)

    return pd.read_parquet(path, columns=columns)
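
The region lookup in load_cmip6 can be tried in isolation, without triggering a download. The following is a minimal sketch, not ClimAID's implementation: resolve_dataset_key and REGION_TO_DATASET are hypothetical names, and only the two mapping entries come from the source above. As one possible variation, the error message is built from the mapping keys so that it stays accurate if regions are added later.

```python
# Hypothetical helper, not part of ClimAID: isolates the region lookup
# performed inside load_cmip6 so it can be exercised without any download.
REGION_TO_DATASET = {
    "india": "cmip6_india",
    "south_asia": "cmip6_south_asia",
}


def resolve_dataset_key(region: str = "india") -> str:
    """Return the registry key for a region, or raise ValueError."""
    if region not in REGION_TO_DATASET:
        # List the valid regions directly from the mapping.
        valid = ", ".join(repr(name) for name in REGION_TO_DATASET)
        raise ValueError(f"region must be one of: {valid}")
    return REGION_TO_DATASET[region]


if __name__ == "__main__":
    print(resolve_dataset_key("south_asia"))  # cmip6_south_asia
    try:
        resolve_dataset_key("europe")
    except ValueError as err:
        print(err)
```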

Dataset Management (Advanced)

Handles dataset downloading, caching, and retrieval.

Manages retrieval and caching of external datasets.

This class ensures that:

  • datasets are downloaded only once
  • files are stored locally for reuse
  • users can work offline after first download
Source code in climaid\datasets\manager.py
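from pathlib import Path

import pooch

# CACHE_DIR (cache root) and DATASETS (dataset registry) are defined
# elsewhere in the package, outside this excerpt.


class DatasetManager:
    """
    Manages retrieval and caching of external datasets.

    This class ensures that:

    - datasets are downloaded only once
    - files are stored locally for reuse
    - users can work offline after first download

    """

    def __init__(self):
        # Create the cache root on first use.
        self.base_dir = CACHE_DIR
        self.base_dir.mkdir(parents=True, exist_ok=True)

    def fetch(self, dataset_name: str) -> Path:
        """
        Retrieve a dataset by name.

        If the dataset is not already cached locally, it is downloaded
        from the URL given in the registry.

        Parameters
        ----------
        dataset_name : str
            Key identifying the dataset (must exist in the registry).

        Returns
        -------
        pathlib.Path
            Local file path to the dataset.

        Raises
        ------
        ValueError
            If dataset_name is not defined in the registry.
        """

        if dataset_name not in DATASETS:
            raise ValueError(f"Dataset '{dataset_name}' not found")

        meta = DATASETS[dataset_name]

        # Each dataset version gets its own directory under the cache root.
        dataset_dir = self.base_dir / dataset_name / meta["version"]
        dataset_dir.mkdir(parents=True, exist_ok=True)

        file_path = dataset_dir / meta["filename"]

        # already cached
        if file_path.exists():
            return file_path

        print(f"\nDownloading {dataset_name}...")
        print(f"Saving to: {file_path}\n")

        # known_hash=None skips checksum verification of the downloaded file.
        path = pooch.retrieve(
            url=meta["url"],
            fname=meta["filename"],
            path=dataset_dir,
            progressbar=True,
            known_hash=None,
        )

        return Path(path)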
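
The download-once behaviour of fetch can be illustrated without network access. The sketch below is an assumption-laden stand-in rather than the library's code: fetch_once and fake_download are hypothetical names, the version "v1" and filename "data.parquet" are placeholders, and a callable replaces the pooch download. It follows the same <dataset_name>/<version>/<filename> layout used in the source above.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_once(base_dir: Path, name: str, version: str, filename: str,
               download: Callable[[Path], None]) -> Path:
    """Return the cached file path, calling `download` only if it is missing."""
    dataset_dir = base_dir / name / version
    dataset_dir.mkdir(parents=True, exist_ok=True)
    file_path = dataset_dir / filename
    if not file_path.exists():
        download(file_path)  # stand-in for the real network retrieval
    return file_path


calls = []


def fake_download(target: Path) -> None:
    """Hypothetical downloader: records the call and writes placeholder bytes."""
    calls.append(target)
    target.write_bytes(b"placeholder")


# Use a temporary directory as the cache root so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    first = fetch_once(base, "cmip6_india", "v1", "data.parquet", fake_download)
    second = fetch_once(base, "cmip6_india", "v1", "data.parquet", fake_download)
    same_path = first == second
    relative = first.relative_to(base).as_posix()

print(len(calls), same_path, relative)
```

The second call finds the file already on disk, so the downloader runs only once and the script prints `1 True cmip6_india/v1/data.parquet`.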

Available Datasets (Internal)

Defines dataset metadata such as source URLs and versions.
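
As a rough illustration of what a registry entry might look like: DatasetManager.fetch reads the keys "url", "version", and "filename" from each entry, so a registry needs at least those three. Everything else below is an assumption. EXAMPLE_DATASETS, REQUIRED_KEYS, and missing_keys are hypothetical names, and every value is a placeholder rather than a real ClimAID source.

```python
# Hypothetical registry: the keys "url", "version", and "filename" are the ones
# DatasetManager.fetch reads; every value below is a placeholder.
EXAMPLE_DATASETS = {
    "cmip6_india": {
        "url": "https://example.org/cmip6_india.parquet",
        "version": "v1",
        "filename": "cmip6_india.parquet",
    },
}

REQUIRED_KEYS = {"url", "version", "filename"}


def missing_keys(registry: dict) -> dict:
    """Map each dataset name to the required metadata keys it lacks."""
    problems = {}
    for name, meta in registry.items():
        missing = REQUIRED_KEYS - set(meta)
        if missing:
            problems[name] = missing
    return problems


if __name__ == "__main__":
    # An empty result means every entry has the keys fetch relies on.
    print(missing_keys(EXAMPLE_DATASETS))
```

A check of this kind could run at import time or in a test, so that an incomplete entry is reported before fetch fails with a KeyError.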