Utilities
Utility functions for the ClimAID package.
This module provides a collection of helper utilities for handling climate and epidemiological datasets used throughout ClimAID.
It includes functionality for:
- Data ingestion and validation
- Cleaning and preprocessing of datasets
- Temporal train/test splitting for modeling workflows
- Basic normalization and summary statistics
These utilities are designed to support reproducible and consistent data preparation across the ClimAID pipeline.
Notes
- Functions in this module are independent of modeling and reporting layers.
- Intended for internal use, but can be used externally for custom workflows.
Author
Avik Sam
Created
November 2025
build_district_tree(district_list)
Convert district keys such as IND_pune_maharashtra into a hierarchical structure: country → state → district.
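Example
The nesting described above can be sketched as follows. This is a hypothetical re-implementation, not the ClimAID source: the name build_district_tree_sketch, the assumed key layout COUNTRY_district_state (read off the sample key), and the choice to skip keys that do not split into exactly three parts are all assumptions.

```python
def build_district_tree_sketch(district_list):
    """Hypothetical sketch: nest keys as country -> state -> [districts]."""
    tree = {}
    for key in district_list:
        parts = key.split("_")
        if len(parts) != 3:
            # Assumption: keys that do not look like COUNTRY_district_state are skipped.
            continue
        country, district, state = parts
        tree.setdefault(country, {}).setdefault(state, []).append(district)
    return tree


tree = build_district_tree_sketch(
    ["IND_pune_maharashtra", "IND_mumbai_maharashtra", "IND_patna_bihar"]
)
print(tree)
# {'IND': {'maharashtra': ['pune', 'mumbai'], 'bihar': ['patna']}}
```

The real function may return a different container or handle multi-word names differently; check the source in climaid\utils.py before relying on the exact shape.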
Source code in climaid\utils.py
check_data_consistency(df, key_cols=['District', 'time'])
Verify that key identifiers (e.g., District, time) are unique and complete.
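Example
A minimal sketch of what a uniqueness and completeness check on the key columns could look like, assuming pandas. The function name check_keys_sketch and the returned dictionary are hypothetical; the documentation does not say whether the real function returns a value, prints, or raises.

```python
import pandas as pd


def check_keys_sketch(df, key_cols=("District", "time")):
    """Hypothetical sketch: count rows with missing or duplicated keys."""
    key_cols = list(key_cols)
    # Completeness: a row is incomplete if any key column is missing.
    n_missing = int(df[key_cols].isna().any(axis=1).sum())
    # Uniqueness: count rows that repeat an earlier key combination.
    n_duplicates = int(df.duplicated(subset=key_cols).sum())
    return {"missing_keys": n_missing, "duplicate_keys": n_duplicates}


df = pd.DataFrame(
    {
        "District": ["pune", "pune", "mumbai", None],
        "time": ["2020-01", "2020-01", "2020-01", "2020-02"],
        "cases": [12, 12, 30, 4],
    }
)
report = check_keys_sketch(df)
print(report)  # {'missing_keys': 1, 'duplicate_keys': 1}
```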
Source code in climaid\utils.py
clean_numeric_column(series)
Clean messy numeric data like '7.71-09' or '295.2005/092' and convert to float.
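Example
The description does not state how malformed strings are resolved. The sketch below assumes one plausible rule, keeping only the leading number in each string and coercing everything else to NaN. The function name clean_numeric_sketch and the regular expression are assumptions, and the package's actual behavior may differ (for example, it might treat '7.71-09' as scientific notation).

```python
import pandas as pd


def clean_numeric_sketch(series):
    """Hypothetical sketch: keep the leading number of each string, else NaN."""
    # Assumption: the valid value is the leading signed decimal in the string.
    leading = series.astype(str).str.extract(r"^\s*([-+]?\d*\.?\d+)", expand=False)
    return pd.to_numeric(leading, errors="coerce")


raw = pd.Series(["7.71-09", "295.2005/092", "12.5", "n/a"])
cleaned = clean_numeric_sketch(raw)
print(cleaned.tolist())  # [7.71, 295.2005, 12.5, nan]
```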
Source code in climaid\utils.py
ensure_directory(path)
Create the directory if it does not exist.
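Example
A create-if-missing helper is commonly written with os.makedirs and exist_ok=True. The sketch below assumes that approach; ensure_directory_sketch is a hypothetical stand-in, not the package's code.

```python
import os
import tempfile


def ensure_directory_sketch(path):
    """Hypothetical sketch: create path (and parents) if it is missing."""
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "outputs", "figures")
    ensure_directory_sketch(target)
    ensure_directory_sketch(target)  # calling again is harmless
    created = os.path.isdir(target)

print(created)  # True
```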
Source code in climaid\utils.py
load_csv_safe(filepath, parse_dates=['time'])
Safely load a CSV file and parse datetime columns if present.
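Example
The "parse datetime columns if present" behavior can be sketched by converting only those requested columns that actually exist in the file. This assumes pandas; load_csv_sketch is a hypothetical name, and whatever additional safety checks the real function performs (such as handling a missing file) are not documented here.

```python
import io

import pandas as pd


def load_csv_sketch(source, parse_dates=("time",)):
    """Hypothetical sketch: read a CSV and parse date columns only if present."""
    df = pd.read_csv(source)
    for col in parse_dates:
        if col in df.columns:
            # Unparseable values become NaT instead of raising.
            df[col] = pd.to_datetime(df[col], errors="coerce")
    return df


with_time = load_csv_sketch(
    io.StringIO("District,time,cases\npune,2020-01-01,12\npune,2020-02-01,7\n")
)
without_time = load_csv_sketch(io.StringIO("District,cases\npune,3\n"))

print(with_time.dtypes)
print(without_time.columns.tolist())  # ['District', 'cases']
```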
Source code in climaid\utils.py
normalize_features(df, cols)
Normalize selected numeric columns to [0, 1] range using MinMaxScaler.
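Example
With its default feature range, min-max scaling maps each value x to (x - min) / (max - min). The sketch below reproduces that arithmetic with plain pandas rather than calling MinMaxScaler, so it is an illustration of the transform and not the package's code. normalize_sketch is a hypothetical name, and returning a copy instead of modifying df in place is an assumption.

```python
import pandas as pd


def normalize_sketch(df, cols):
    """Hypothetical sketch: min-max scale the given columns to [0, 1]."""
    out = df.copy()
    for col in cols:
        lo, hi = out[col].min(), out[col].max()
        span = hi - lo
        # A constant column has zero span; map it to 0.0 to avoid dividing by zero.
        out[col] = (out[col] - lo) / span if span != 0 else 0.0
    return out


df = pd.DataFrame({"temp": [20.0, 25.0, 30.0], "rain": [0, 50, 200], "cases": [3, 9, 4]})
scaled = normalize_sketch(df, ["temp", "rain"])
print(scaled)
```

Columns not listed in cols (here, cases) are left untouched.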
Source code in climaid\utils.py
pretty_country(code)
Convert an ISO3 (ISO 3166-1 alpha-3) country code to a human-readable name.
Example
>>> pretty_country("IND")
'India'
Source code in climaid\utils.py
print_summary(df, label='Data')
Print summary statistics and missing-value information.
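Example
A summary of this kind is typically built from DataFrame.describe() and a per-column count of missing values. The sketch below assumes that combination. print_summary_sketch is a hypothetical name, the exact layout printed by the real function is not documented here, and returning the missing-value counts is an addition made only so the result can be inspected.

```python
import pandas as pd


def print_summary_sketch(df, label="Data"):
    """Hypothetical sketch: print shape, describe() output and missing counts."""
    print(f"--- {label}: {df.shape[0]} rows x {df.shape[1]} columns ---")
    print(df.describe())
    missing = df.isna().sum()
    print("Missing values per column:")
    print(missing)
    return missing  # assumption: returned here for convenience only


df = pd.DataFrame({"cases": [12, None, 7], "temp": [24.1, 25.3, None]})
missing = print_summary_sketch(df, label="Demo")
```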
Source code in climaid\utils.py
split_train_test(df, date_col='time', cutoff_year=2020)
Split a dataset into training and testing subsets by year.
- Training: all data before cutoff_year
- Testing: all data after cutoff_year
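Example
A year-based temporal split can be sketched with a boolean mask on the year of the date column. The description above does not say which subset receives rows from cutoff_year itself. The sketch assumes they go to the test set so that no rows are dropped, which may not match the package. split_by_year_sketch is a hypothetical name.

```python
import pandas as pd


def split_by_year_sketch(df, date_col="time", cutoff_year=2020):
    """Hypothetical sketch: rows before cutoff_year train, the rest test."""
    years = pd.to_datetime(df[date_col]).dt.year
    train = df[years < cutoff_year]
    # Assumption: rows from cutoff_year itself are placed in the test set.
    test = df[years >= cutoff_year]
    return train, test


df = pd.DataFrame(
    {
        "time": pd.to_datetime(["2018-06-01", "2019-06-01", "2020-06-01", "2021-06-01"]),
        "cases": [5, 8, 13, 21],
    }
)
train, test = split_by_year_sketch(df)
print(train["cases"].tolist())  # [5, 8]
print(test["cases"].tolist())   # [13, 21]
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for forecasting-style evaluation.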
Source code in climaid\utils.py