Input Data Specification¶
The Acestor pipeline requires three types of input data:
Cases Data: Daily dengue case counts at the specified granularity
Configuration File: Pipeline configuration settings
GeoJSON Files: Geographic boundary data for regions
1. Cases Data Specification¶
The primary input to the Acestor pipeline is the daily dengue case data at the granularity level specified in the configuration file.
File Location¶
datasets/cases_{granularity}_daily.csv
Where {granularity} is specified in the dengue_pipeline.yaml configuration file (e.g., district, subdistrict, etc.)
File Format¶
The file is a comma-separated values (CSV) file with the following structure:
Column Specifications¶
| Column Name | Data Type | Description |
|---|---|---|
| `region_id` | String | LGD Standard identifier for the region in the format `{region_type}_{lgd_code}` (e.g., "district_524", "subdistrict_2345") |
| `date` | String | Date of the observation in YYYY-MM-DD format (e.g., "2021-12-11") |
| `case` | Float | Number of confirmed dengue cases on that date in the region |
Data Characteristics¶
Temporal Coverage: The model performs better with more data. Threshold calculation uses one of two approaches: one based on historical (previous years') data, the other on the past n weeks of data. The data must therefore either extend up to the current day, or include previous years, for the model to work accurately.
Spatial Granularity: Configurable (district, subdistrict, etc.) as specified in `dengue_pipeline.yaml`
Geographic Scope: Indian states and their administrative divisions
Primary Metric: Number of confirmed dengue cases per day per region
Converting from Linelist Data¶
If your starting point is linelist (individual case records) rather than aggregated daily case counts, you can add a preprocessing step to convert it to the required daily cases format.
The Acestor module includes a linelist-to-cases conversion function for Karnataka linelist data (parse_linelist_to_no_of_cases in ParseCaseData.py) that can be used as a reference implementation. This function:
Takes raw linelist data with individual case records
Filters and aggregates by date and region
Outputs the required `cases_{granularity}_daily.csv` format
See ParseCaseData.py:parse_linelist_to_no_of_cases for the Karnataka implementation example.
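For illustration, here is a minimal sketch of such a conversion in pandas. The linelist column names (`result_date`, `district_lgd_code`, `test_result`) are hypothetical placeholders; adapt them to your own linelist schema, or use the Karnataka implementation referenced above.

```python
import pandas as pd

def linelist_to_daily_cases(linelist_path: str, granularity: str = "district") -> pd.DataFrame:
    # Hypothetical column names -- substitute those of your own linelist.
    df = pd.read_csv(linelist_path, parse_dates=["result_date"])

    # Keep confirmed cases only (hypothetical filter condition).
    df = df[df["test_result"] == "POSITIVE"]

    # Build LGD-style region identifiers, e.g. "district_524".
    df["region_id"] = granularity + "_" + df["district_lgd_code"].astype(str)

    # Aggregate individual records into daily counts per region.
    daily = (
        df.groupby(["region_id", df["result_date"].dt.date])
          .size()
          .rename("case")
          .astype(float)  # the pipeline expects float case counts
          .reset_index()
          .rename(columns={"result_date": "date"})
    )
    # Note: dates with zero cases (as in the example records below) must
    # still appear as rows; reindex each region over its full date range
    # and fill missing days with 0.0 before writing the CSV.
    return daily

# linelist_to_daily_cases("raw_linelist.csv").to_csv(
#     "datasets/cases_district_daily.csv", index=False)
```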
Example Records¶
region_id,date,case
district_524,2021-12-11,0.0
district_524,2021-12-12,0.0
district_524,2021-12-13,1.0
district_525,2021-12-11,2.0
district_525,2021-12-12,0.0
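As a quick pre-flight check, you can verify that your file matches this structure with a few lines of pandas. This is a sketch only; the pipeline runs its own, stricter validation (see below).

```python
import pandas as pd

cases = pd.read_csv("datasets/cases_district_daily.csv")

# Columns and types per the specification above.
assert list(cases.columns) == ["region_id", "date", "case"]
assert (cases["case"] >= 0).all(), "case counts must be non-negative"

# Raises if any date is not in YYYY-MM-DD (ISO 8601) format.
pd.to_datetime(cases["date"], format="%Y-%m-%d")
```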
Data Quality Requirements¶
The pipeline performs strict data quality checks before processing. Your data must meet the following requirements:
Data Completeness
Case Values: Case counts are non-negative floats
Date Format: All dates follow the YYYY-MM-DD (ISO 8601) format
Identifiers: `region_id` values must match the corresponding GeoJSON filenames and follow the LGD Standard
Continuity Requirements
The pipeline requires continuous weekly data to function properly:
Weekly Coverage: Each ISO week must contain at least 4 days of data
No Gaps: Data must be continuous with no missing weeks between the start and end dates
Cross-Region Consistency: All regions must have overlapping continuous data periods
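These rules can be sketched in pandas as follows. This illustrates the documented requirements for a single region; it is not the pipeline's internal implementation:

```python
import pandas as pd

def weekly_continuity_ok(region_df: pd.DataFrame, min_days: int = 4) -> bool:
    dates = pd.to_datetime(region_df["date"])
    iso = dates.dt.isocalendar()  # ISO year / week / day for each date

    # Days of data observed in each ISO week.
    days_per_week = iso.groupby([iso.year, iso.week]).size()

    # Every observed week must contain at least `min_days` days of data ...
    if (days_per_week < min_days).any():
        return False

    # ... and no week between the first and last date may be missing.
    # "W" periods end on Sunday, so they line up with ISO (Mon-Sun) weeks.
    expected_weeks = pd.period_range(dates.min(), dates.max(), freq="W")
    return len(days_per_week) == len(expected_weeks)
```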
Minimum Data Requirements
The amount of data determines what operations the pipeline can perform:
| Data Duration | Pipeline Capability |
|---|---|
| < 4 months | ❌ Insufficient - Pipeline will exit with an error |
| 4-12 months | ⚠ Can generate alert thresholds only (no predictions) |
| ≥ 12 months | ✓ Full functionality - Can generate thresholds AND run predictions |
Data Validation Process
When you run the pipeline, it automatically:
Checks each region for continuous weekly data (≥4 days per ISO week)
Identifies the longest continuous data range for each region
Finds the common overlapping period across all regions
Calculates the data span in months
Determines if there’s sufficient data to proceed
If validation fails, the pipeline will log detailed error messages indicating:
Which regions lack continuous data
The actual data span available
What minimum data is required
Example Error Messages
ERROR: Insufficient data: Only 3.2 months of continuous data available
ERROR: Required: At least 4 months to generate thresholds, 12 months to run predictions
WARNING: Only 8.5 months of data available
WARNING: Can generate thresholds but cannot run predictions (need >= 12 months)
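The span check behind these messages can be sketched as follows. The exact month calculation used internally may differ; the 4- and 12-month cutoffs come from the table above:

```python
import pandas as pd

def check_capability(start: str, end: str) -> str:
    # Approximate span in months (mean month length of ~30.44 days).
    months = (pd.Timestamp(end) - pd.Timestamp(start)).days / 30.44
    if months < 4:
        raise ValueError(
            f"Insufficient data: Only {months:.1f} months of continuous data available"
        )
    if months < 12:
        print(f"WARNING: Only {months:.1f} months of data available; "
              "can generate thresholds but cannot run predictions")
        return "thresholds_only"
    return "full"

# check_capability("2021-01-04", "2021-09-19")  # -> "thresholds_only"
```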
2. Configuration File Specification¶
The pipeline is configured using a YAML configuration file located at config/dengue_pipeline.yaml.
File Location¶
config/dengue_pipeline.yaml
Configuration Parameters¶
| Parameter | Data Type | Description |
|---|---|---|
| `root_dir` | String (Path) | Root directory path for the project. All relative paths are resolved from this location. |
| `debug` | Boolean | Set to `true` to process only the most recent 3 years of data (reduces processing time for testing). |
| `region_name` | String | Name of the region for weather data downloads (e.g., "Karnataka", "Maharashtra"). |
| `weather_data_path` | String (Path) | Path to the directory containing weather data files (absolute or relative to `root_dir`). |
| `geojson_folder` | String (Path) | Path to the folder containing GeoJSON files for region boundaries (absolute or relative to `root_dir`). |
| `raw_linelist_path` | String (Path) | Path to the raw linelist data directory (if using linelist data as input). |
| `granularity` | String | Spatial granularity level for analysis. Common values: `district`, `subdistrict`. |
| `ihip_s3_location` | String (S3 URI) | S3 bucket location for IHIP data fetching (optional, used only if fetching data from IHIP). |
| `enable_email_reports` | Boolean | Set to `true` to send email reports after pipeline completion. |
| `email_recipients` | List of Strings | List of email addresses to receive reports, in the format "Name <email@example.com>". |
Example Configuration¶
# main pipeline
root_dir: /Users/akhil/workspace/acestor
debug: true # Set to true to process only last 3 years of data for testing
region_name: Karnataka # Name of the region for weather data downloads
weather_data_path: /Users/akhil/workspace/acestor/weather_data_from_s3
geojson_folder: /Users/akhil/workspace/acestor/geojsons # Path to geojson folder
raw_linelist_path: /Users/akhil/workspace/acestor/datasets/raw_linelist_data/KA_linelist
granularity: district
# fetch ihip data
ihip_s3_location: s3://dsih-artpark-01-raw-data/EPRDS34-KA_IHIP_Dengue_LL/KA/
# email configuration
enable_email_reports: true
email_recipients:
- "User Name <user@example.com>"
Configuration Notes¶
The `granularity` parameter is crucial as it determines:
- The expected input file name: `datasets/cases_{granularity}_daily.csv`
- The spatial level at which the model runs
- Which GeoJSON files are required
When `debug` is enabled, only the most recent 3 years of data are processed, significantly reducing processing time for testing.
All path parameters support both absolute and relative paths (relative to `root_dir`).
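A minimal sketch of loading this file with PyYAML and resolving paths as described above (the `resolve` helper is illustrative, not part of the pipeline):

```python
from pathlib import Path
import yaml

with open("config/dengue_pipeline.yaml") as fh:
    cfg = yaml.safe_load(fh)

root = Path(cfg["root_dir"])

def resolve(key: str) -> Path:
    # Absolute paths pass through; relative ones are resolved against root_dir.
    p = Path(cfg[key])
    return p if p.is_absolute() else root / p

geojson_folder = resolve("geojson_folder")
cases_file = root / "datasets" / f"cases_{cfg['granularity']}_daily.csv"
```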
Email Configuration¶
The pipeline can automatically send email reports with pipeline status, attached results, and maps after completion.
Environment Variables
Email functionality requires SMTP credentials to be configured in a .env file in the project root:
# SMTP Configuration
SMTP_SERVER=smtp.gmail.com
PORT=587
EMAIL=your.email@gmail.com
PASSWORD=your_app_specific_password
For Gmail Users:
Enable 2-factor authentication on your Google account
Generate an App-Specific Password at https://myaccount.google.com/apppasswords
Use the app-specific password (not your regular Gmail password)
Email Report Contents:
When enabled, the pipeline sends an HTML email report containing:
Pipeline status for each major step (data processing, thresholds, predictions, maps)
Color-coded status indicators (green for success, red for errors, orange for warnings)
Attached files:
- `thresholds_df.csv` - Generated alert thresholds
- `Predictions_*.csv` - Model prediction results
- Map images (PNG/PDF) if the `-gm` flag was used
Pipeline execution time and completion summary
Error Handling:
If email sending fails, the pipeline logs the error but continues execution
The pipeline will complete successfully even if email delivery fails
Check the log files for email-related error messages
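A sketch of this "log but continue" behaviour, using Python's standard `smtplib` with `python-dotenv` to read the `.env` settings (the `send_report` function is illustrative, not the pipeline's actual API):

```python
import logging
import os
import smtplib
from email.message import EmailMessage

from dotenv import load_dotenv

def send_report(html_body: str, recipients: list) -> None:
    load_dotenv()  # reads SMTP_SERVER, PORT, EMAIL, PASSWORD from .env
    msg = EmailMessage()
    msg["Subject"] = "Acestor pipeline report"
    msg["From"] = os.environ["EMAIL"]
    msg["To"] = ", ".join(recipients)
    msg.set_content("Acestor pipeline report (HTML version attached).")
    msg.add_alternative(html_body, subtype="html")
    try:
        with smtplib.SMTP(os.environ["SMTP_SERVER"], int(os.environ["PORT"])) as smtp:
            smtp.starttls()
            smtp.login(os.environ["EMAIL"], os.environ["PASSWORD"])
            smtp.send_message(msg)
    except Exception:
        # Email failure must not abort the run: log and continue.
        logging.exception("Email delivery failed; continuing pipeline")
```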
3. GeoJSON File Specification¶
The pipeline requires GeoJSON files for each region being analyzed. These files define the geographic boundaries used for spatial analysis and visualization.
Folder Structure¶
GeoJSON files should be organized in the following directory structure:
{geojson_folder}/
└── {state_name}/
├── districts/
│ ├── district_{lgd_code}.geojson
│ └── district_{lgd_code}.geojson
└── subdistricts/
├── subdistrict_{lgd_code}.geojson
└── subdistrict_{lgd_code}.geojson
Where:
- {geojson_folder} is the base path specified in dengue_pipeline.yaml
- {state_name} is the name of the state (e.g., “Karnataka”, “Maharashtra”)
- Files are organized in districts/ or subdistricts/ subdirectories based on granularity
File Location¶
{geojson_folder}/{state_name}/{granularity_plural}/{region_type}_{lgd_code}.geojson
Example paths:
geojsons/Karnataka/districts/district_524.geojson
geojsons/Karnataka/subdistricts/subdistrict_2345.geojson
geojsons/Maharashtra/districts/district_360.geojson
Naming Convention¶
GeoJSON files must follow the LGD Standard naming format:
District: `district_529.geojson` (where 529 is the LGD district code)
Subdistrict: `subdistrict_2345.geojson` (where 2345 is the LGD subdistrict code)
State: `state_29.geojson` (where 29 is the LGD state code)
This naming must match the identifiers used in the region_id column of the cases data file.
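A sketch of checking this correspondence up front, assuming the Karnataka district layout from the example paths above:

```python
from pathlib import Path
import pandas as pd

cases = pd.read_csv("datasets/cases_district_daily.csv")
geo_dir = Path("geojsons/Karnataka/districts")

# Filename stems double as region identifiers, e.g. "district_524".
available = {p.stem for p in geo_dir.glob("*.geojson")}
missing = set(cases["region_id"]) - available
if missing:
    raise FileNotFoundError(f"No GeoJSON file for regions: {sorted(missing)}")
```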
File Format¶
GeoJSON files must follow the standard GeoJSON specification (RFC 7946) with the following structure:
{
"id": "district_123",
"type": "Feature",
"properties": {
"regionName": "DISTRICT NAME",
"Shape_Leng": 123.456,
"Shape_Area": 789.012,
"regionType": "district",
"parentID": "state_12",
"parentName": "STATE NAME"
},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[75.123, 15.456],
[75.234, 15.567],
[75.345, 15.678],
[75.123, 15.456]
]
]
}
}
Requirements¶
Feature ID: Each feature must have a top-level
idfield matching the filename pattern (e.g., “district_524”)Coordinate System: WGS84 (EPSG:4326) - longitude/latitude coordinates
Geometry Types: Typically
PolygonorMultiPolygonfor administrative boundariesProperties: The following properties are strictly required for each feature:
regionName: Name of the region (e.g., “BAGALKOTE”, “KARNATAKA”)regionType: Type of administrative division (e.g., “district”, “subdistrict”, “state”)parentID: ID of the parent administrative unit (e.g., “state_29”)parentName: Name of the parent administrative unit (e.g., “KARNATAKA”)Shape_Leng: Perimeter/length of the shape (float)Shape_Area: Area of the shape (float)
File Organization: All GeoJSON files must be in the folder structure specified by
geojson_folderin the configuration
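These requirements can be checked per file with a short script; a sketch (the validator name is illustrative):

```python
import json
from pathlib import Path

REQUIRED_PROPS = {"regionName", "regionType", "parentID",
                  "parentName", "Shape_Leng", "Shape_Area"}

def validate_geojson(path: str) -> None:
    feature = json.loads(Path(path).read_text())
    stem = Path(path).stem  # e.g. "district_524"

    # Top-level id must match the filename.
    if feature.get("id") != stem:
        raise ValueError(f"{path}: id {feature.get('id')!r} != filename {stem!r}")

    # All strictly required properties must be present.
    missing = REQUIRED_PROPS - set(feature.get("properties", {}))
    if missing:
        raise ValueError(f"{path}: missing properties {sorted(missing)}")

    if feature["geometry"]["type"] not in {"Polygon", "MultiPolygon"}:
        raise ValueError(f"{path}: unexpected geometry type")

# validate_geojson("geojsons/Karnataka/districts/district_524.geojson")
```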
GeoJSON Quality Notes¶
ID Matching: The `id` field in the GeoJSON must exactly match:
- The filename (e.g., file `district_524.geojson` must have `"id": "district_524"`)
- The corresponding `region_id` values in the cases data CSV
Geometry Accuracy: Ensure boundary geometries are accurate and up-to-date
Coordinate Order: Coordinate pairs must be in [longitude, latitude] order (per GeoJSON specification)
Performance: Complex polygons may need simplification for better performance
Validation: Files must be valid GeoJSON (can be validated using tools like geojsonlint.com)
Naming Convention: Region names in `regionName` should follow LGD Standard naming conventions (typically uppercase)