# Config Reference
Complete reference for every key in the pipeline YAML config. All keys are optional unless marked required.
## pipeline

```yaml
pipeline:
  name: dengue                          # used in logger names and artifact paths
  title: "Dengue Intelligence — GBA"    # used in report document title
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `name` | string | | Short identifier, no spaces |
| `title` | string | | Appears in PDF report header |
## run

```yaml
run:
  run_date: "2026-03-18"   # leave blank to use today
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `run_date` | YYYY-MM-DD | today | Reference date for cutoff and prediction window |
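The blank-means-today fallback amounts to the following minimal sketch (`resolve_run_date` is a hypothetical name for illustration, not the pipeline's actual function):

```python
from datetime import date

def resolve_run_date(raw: str) -> date:
    """Parse run.run_date; an empty or blank string falls back to today."""
    if not raw.strip():
        return date.today()
    return date.fromisoformat(raw)
```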
## data.case_download

Controls where raw case CSV files are read from.

```yaml
data:
  case_download:
    enabled: true
    source_backend: "filesystem"   # or "s3"
    source_path: "datasets/raw_linelist_data/AP_IHIP"
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `enabled` | bool | | Set … |
| `source_backend` | string | | `"filesystem"` or `"s3"` |
| `source_path` | string | | Filesystem dir or … |
| … | list[string] | | Explicit file list; if empty, all files under … |
| … | bool | | Cache S3 downloads locally |
| … | string | | Local cache directory |
| … | string | | |
## data.case_parse

Required section.

```yaml
data:
  case_parse:
    region_types:
      - "district"
    date_start: "2021-09-01"
    date_end: ""
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `region_types` | required list[string] | — | e.g. `["district"]` |
| `date_start` | required YYYY-MM-DD | — | Earliest case date to include |
| `date_end` | YYYY-MM-DD | run_date | Latest case date; empty → run_date |

Accepted region types: `corp`, `zone`, `ward`, `district`, `subdistrict`
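The date-window rule can be sketched as follows (hypothetical helper; the real parser operates on whole CSV files, not single records):

```python
from datetime import date
from typing import Optional

def in_case_window(record: date, date_start: date,
                   date_end: Optional[date], run_date: date) -> bool:
    """Keep a case record if it falls inside [date_start, date_end];
    an unset date_end (empty string in YAML) falls back to run_date."""
    end = date_end if date_end is not None else run_date
    return date_start <= record <= end
```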
## data.case_sufficiency

Early gate that aborts the run if case data is too sparse.

```yaml
data:
  case_sufficiency:
    enabled: true
    min_total_rows: 30
    min_distinct_regions: 2
    min_date_span_days: 14
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `enabled` | bool | | |
| `min_total_rows` | int | | Minimum case rows across all regions |
| `min_distinct_regions` | int | | Minimum distinct region IDs |
| `min_date_span_days` | int | | Minimum date range in days |
## data.geojson

```yaml
data:
  geojson:
    base_path: "datasets/geojsons/geojsons_GBA"
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `base_path` | string | | Root folder; files resolved as … |
## data.weather_download

```yaml
data:
  weather_download:
    enabled: true
    source_backend: "filesystem"
    source_path: "ap_datasets/parsednetcdf/district"
    region_type: "district"
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `enabled` | bool | | Must be … |
| `source_backend` | string | | |
| `source_path` | string | | Directory of pre-parsed CSVs, or … |
| … | | | |
| … | string | | Path to local … |
| … | string | | Where parsed per-region CSVs are written (NetCDF/CDS mode) |
| `region_type` | string | | Subfolder used when looking up GeoJSON boundaries |
| … | list[string] | | NetCDF variable short names |
| … | float | | Max distance from region boundary for ERA5 grid-point filtering |
| … | float | | Grid snap resolution for auto-computed region bounds |
| … | YYYY-MM-DD | | CDS download start |
| … | YYYY-MM-DD | run_date | CDS download end |
## data.weather_parse

```yaml
data:
  weather_parse:
    region_type: "district"
    weather_variables:
      - "2mTemperature"
      - "totalPrecipitation"
      - "2mDewpointTemperature"
    daily_agg:
      - {name: "2mTemperature", op: "max", output_name: "2mTemperature_max"}
      - {name: "2mTemperature", op: "min", output_name: "2mTemperature_min"}
      - {name: "2mTemperature", op: "mean"}
      - {name: "2mDewpointTemperature", op: "mean"}
      - {name: "totalPrecipitation", op: "sum"}
    rolling_n_days: 7
    rolling_agg:
      - {name: "2mTemperature", op: "mean"}
      - {name: "2mDewpointTemperature", op: "mean"}
      - {name: "totalPrecipitation", op: "sum"}
    sampling_rate: 7
    intermediate_col_rename:
      "2mTemperature_max": "t2m_max"
      "2mTemperature_min": "t2m_min"
      "2mTemperature": "t2m_mean"
      "2mDewpointTemperature": "d2m_mean"
      "totalPrecipitation": "tp_sum"
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `region_type` | string | | Must match … |
| `weather_variables` | list[string] | | Variables to read from parsed CSVs |
| `daily_agg` | list[{name, op, output_name?}] | mean/sum defaults | Aggregations applied per day |
| `rolling_n_days` | int | | Rolling window size in days |
| `rolling_agg` | list[{name, op}] | mean/sum defaults | Aggregations applied over rolling window |
| `sampling_rate` | int | | Sample every N days (7 = weekly) |
| `intermediate_col_rename` | dict | see above | Renames columns in intermediate files |
| … | bool | | Write … |
| … | bool | | Write … |
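The interaction between `rolling_n_days` and `sampling_rate` can be illustrated in plain Python (a simplified, hypothetical helper; the real pipeline operates on per-region tables with named aggregations):

```python
def rolling_then_sample(daily, n_days=7, op=sum, sampling_rate=7):
    """Aggregate each trailing n_days window with `op`, then keep every
    sampling_rate-th window (7 and 7 turn daily values into a weekly series)."""
    rolled = [op(daily[i - n_days + 1 : i + 1])
              for i in range(n_days - 1, len(daily))]
    return rolled[::sampling_rate]
```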
## cutoff

```yaml
cutoff:
  case_min_regions: 2
  weather_min_regions: 5
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `case_min_regions` | int | | Minimum regions with recent case data to determine cutoff |
| `weather_min_regions` | int | | Minimum regions with recent weather data |
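One plausible reading of the cutoff rule, sketched as a hypothetical helper: the cutoff is the most recent date for which at least `min_regions` regions still have data.

```python
from datetime import date

def determine_cutoff(latest_by_region, min_regions):
    """latest_by_region: {region_id: latest date with data}.
    Returns the most recent date d such that at least min_regions regions
    have data on or after d, or None if there are too few regions."""
    latest = sorted(latest_by_region.values(), reverse=True)
    if len(latest) < min_regions:
        return None
    return latest[min_regions - 1]
```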
## thresholds

```yaml
thresholds:
  region_type: "district"
  n_weeks: 4
  historical_n_years: 4
  excluded_years: []
  included_years: []
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `region_type` | string | | Must match … |
| `n_weeks` | int | | Number of future weeks to predict |
| `historical_n_years` | int \| null | | Limit historical data to last N years |
| `excluded_years` | list[int] | | Years to skip (e.g. COVID anomaly) |
| `included_years` | list[int] | | If non-empty, only use these years |
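How the three year filters interact can be sketched as follows (hypothetical helper; it assumes `included_years`, when non-empty, takes precedence over the other two filters, per the notes above):

```python
def select_years(available, historical_n_years=None,
                 excluded_years=(), included_years=()):
    """Apply the thresholds year filters: included_years wins outright;
    otherwise drop excluded_years, then keep only the last N years if set."""
    if included_years:
        return sorted(y for y in available if y in included_years)
    years = sorted(y for y in available if y not in excluded_years)
    if historical_n_years is not None:
        years = years[-historical_n_years:]
    return years
```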
## model

```yaml
model:
  spatial_res: "district"
  data_features:
    - "case"
    - "recordDate"
    - "recordYear"
    - "recordMonth"
    - "ISOWeek"
    - "t2m_mean"
    - "tp_sum"
    - "d2m_mean"
  lag:
    lag_temp: [12]
    lag_rf: [4]
  years_to_exclude: []
  years_to_include: []
  list_alpha: [1.0, 2.0]
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `spatial_res` | string | | Must match … |
| `data_features` | list[string] | temperature/rain/case defaults | Feature columns fed to the model |
| `lag.lag_temp` | list[int] | | Temperature lag in weeks |
| `lag.lag_rf` | list[int] | | Rainfall lag in weeks |
| `years_to_exclude` | list[int] | | Exclude from model training |
| `years_to_include` | list[int] | | If non-empty, train only on these years |
| `list_alpha` | list[float] | | Ridge regression alpha values to try |
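The lag lists shift weather features back in time so that each week's prediction sees earlier weather. For a single weekly series the idea is (illustrative sketch, not the pipeline's code):

```python
def lagged(series, lag_weeks):
    """series: weekly values, oldest first. A lag of k weeks pairs each
    week with the value observed k weeks earlier (None where unavailable)."""
    return [series[i - lag_weeks] if i >= lag_weeks else None
            for i in range(len(series))]
```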
## assess

```yaml
assess:
  total_district_regions: 26
```

Set `total_{region_type}_regions` for each region type present in your data.

| Key | Type | Default | Notes |
|---|---|---|---|
| `total_corp_regions` | int | — | Total number of corp regions in the geography |
| `total_zone_regions` | int | — | Total number of zone regions |
| `total_ward_regions` | int | — | Total number of ward regions |
| `total_district_regions` | int | — | Total number of districts |
| `total_subdistrict_regions` | int | — | Total number of subdistricts |

Used to compute coverage ratios for threshold method assessment. If omitted for a region type, defaults to 10.
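A coverage ratio in that sense can be computed as follows (hypothetical helper; the 10 fallback reflects the default noted above):

```python
def coverage_ratio(regions_with_predictions, total_regions=None):
    """Share of the geography covered by predictions; a missing
    total_{region_type}_regions value falls back to 10."""
    total = total_regions if total_regions is not None else 10
    return len(set(regions_with_predictions)) / total
```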
## maps

```yaml
maps:
  output_dir: "plots"
  figure_title: "Andhra Pradesh Dengue Risk Map"
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `output_dir` | string | | Relative path under artifacts for map PNGs |
| `figure_title` | string | | Map title (first line); prediction date is added as second line |
## report

```yaml
report:
  output_dir: "reports"
  compile_pdf: true
  caption_primary: "AP districts"
  caption_secondary: ""
  bundle_prefix: "Report"
  document_title: ""
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `output_dir` | string | | Relative path under artifacts for report files |
| `compile_pdf` | bool | | Compile LaTeX bundle to PDF (requires …) |
| `caption_primary` | string | | Region label for primary region type in captions |
| `caption_secondary` | string | | Region label for secondary region type |
| `bundle_prefix` | string | | Prefix for the output zip filename |
| `document_title` | string | derived from … | PDF document title; auto-set if blank |
## report_distribution

Metadata embedded in the report and used in email notifications.

```yaml
report_distribution:
  system_name: "Dengue Early Warning System"
  organization: "ARTPARK, IISc Bengaluru"
  state: "Andhra Pradesh"
  region: "Andhra Pradesh (26 Districts)"
  department: "Directorate of Public Health, GoAP"
  contact_email: ""
  footer_note: ""
```
## email

```yaml
email:
  enabled: false
  on: ["success", "failed"]
  to: ["recipient@example.com"]
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `enabled` | bool | | |
| `on` | list[string] | | When to send: `success`, `failed` |
| `to` | list[string] | | Recipient addresses |

SMTP credentials are set via environment variables — see `.env.example`.
## logging

```yaml
logging:
  level: INFO
```

| Key | Type | Default | Notes |
|---|---|---|---|
| `level` | string | — | |
## storages

```yaml
storages:
  artifacts:
    kind: filesystem
    filesystem:
      base_path: "./artifacts"
```

Filesystem:

| Key | Type | Notes |
|---|---|---|
| `kind` | string | `filesystem` |
| `filesystem.base_path` | string | Root directory for all run artifacts |

S3:

```yaml
storages:
  artifacts:
    kind: s3
    s3:
      bucket: your-bucket
      base_prefix: artifacts/
      region: ap-south-1
```

| Key | Type | Notes |
|---|---|---|
| `kind` | string | `s3` |
| `s3.bucket` | string | S3 bucket name |
| `s3.base_prefix` | string | Key prefix (folder) within the bucket |
| `s3.region` | string | AWS region |
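How the two backends resolve an artifact location can be sketched with a hypothetical helper (names like `artifact_uri` are illustrative; the pipeline's actual storage layer may differ):

```python
def artifact_uri(cfg, relative_path):
    """cfg mirrors the storages.artifacts block. Join under base_path for
    the filesystem backend, or build an s3:// URI from bucket and prefix."""
    if cfg["kind"] == "filesystem":
        return cfg["filesystem"]["base_path"].rstrip("/") + "/" + relative_path
    s3 = cfg["s3"]
    return "s3://{}/{}/{}".format(s3["bucket"], s3["base_prefix"].strip("/"),
                                  relative_path)
```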