IHIP Data Fetcher and Parser¶
Overview¶
The IHIP Data Fetcher (`fetch_and_parse_ihip_data.py`) is a standalone script that automates the complete pipeline for fetching, processing, and aggregating IHIP (Integrated Health Information Platform) dengue surveillance data from AWS S3. It transforms raw linelist data into analysis-ready datasets for epidemiological modeling.
Location: `src/acestor/fetch_and_parse_ihip_data.py`
Pipeline Workflow¶
The script executes four sequential steps:
1. S3 Download: Fetches all IHIP linelist files from the configured S3 bucket
2. Format Conversion: Converts XLSX files to CSV format for standardized processing
3. Data Merging: Consolidates multiple CSV files into a single unified linelist (steps 2 and 3 are sketched below)
4. Aggregation: Creates district-level daily positive case counts with standardized region identifiers
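For illustration, steps 2 and 3 could look roughly like the following pandas sketch. The paths follow the output layout described under Output Files; the loop structure is an assumption, not the script's actual code:

```python
# Sketch of steps 2 and 3 (assumed structure, not the script's exact code).
from pathlib import Path

import pandas as pd

raw_dir = Path("data/ihip/linelist")

# Step 2: convert each downloaded XLSX file to CSV (pandas reads .xlsx via
# openpyxl, listed under Dependencies).
for xlsx_path in raw_dir.glob("*.xlsx"):
    pd.read_excel(xlsx_path).to_csv(xlsx_path.with_suffix(".csv"), index=False)

# Step 3: concatenate all converted CSVs into one consolidated linelist.
csv_files = sorted(raw_dir.glob("*.csv"))
linelist = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
linelist.to_csv("data/ihip/linelist.csv", index=False)
```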
Data Source Configuration¶
The script retrieves data from AWS S3 using the following configuration:
| Parameter | Details |
|---|---|
| S3 Location | Defined in `config/dengue_pipeline.yaml` via the `ihip_s3_location` parameter |
| File Format | XLSX spreadsheets containing dengue test records |
| Authentication | AWS credentials from `.env` |
| Bucket | Parsed from the S3 URI (e.g., `s3://bucket-name/prefix/path/`) |
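The download step amounts to standard boto3 prefix listing plus per-object downloads. A minimal sketch, assuming the credentials from `.env` are already in the environment; the URI value is the example from the configuration section, and everything else is illustrative:

```python
# Sketch of the S3 download step using boto3 pagination.
import os

import boto3

s3_uri = "s3://bucket-name/prefix/path/"  # value of ihip_s3_location
bucket, _, prefix = s3_uri.removeprefix("s3://").partition("/")

s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY"],
    aws_secret_access_key=os.environ["AWS_SECRET_KEY"],
)

# List every object under the prefix and download each file locally.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        filename = obj["Key"].rsplit("/", 1)[-1]
        if filename:  # skip the prefix placeholder key itself
            s3.download_file(bucket, obj["Key"], f"data/ihip/linelist/{filename}")
```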
Data Transformations¶
Column Mapping¶
The script performs the following column transformations:
| Original Column | Final Column | Action |
|---|---|---|
|  |  | Renamed |
|  |  | Renamed |
|  |  | Renamed |
|  |  | Renamed |
| `Date Of Onset` | (merged into `date`) | Priority 1 |
| `Sample Collected Date` | (merged into `date`) | Priority 2 |
| `Test Performed Date` | (merged into `date`) | Priority 3 |
Date Consolidation Logic¶
A single date column is created using priority-based selection:
- First priority: use `Date Of Onset` if available
- Second priority: use `Sample Collected Date` if the onset date is missing
- Third priority: use `Test Performed Date` if both previous dates are missing
All dates are standardized to YYYY-MM-DD format using pandas.to_datetime().
Note
Unparseable dates are set to null and logged as warnings.
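In pandas, the consolidation can be sketched as chained fallbacks; the three-row DataFrame here is illustrative, while the real script applies the same logic to the merged linelist:

```python
import pandas as pd

# Tiny illustrative linelist; the column names match the priority list above.
df = pd.DataFrame({
    "Date Of Onset":         ["2024-06-01", None,         None],
    "Sample Collected Date": ["2024-06-03", "2024-06-04", None],
    "Test Performed Date":   ["2024-06-05", "2024-06-06", "bad-value"],
})

def parse(col):
    # errors="coerce" turns unparseable values into NaT (the script logs these).
    return pd.to_datetime(col, errors="coerce")

# Priority 1 -> 2 -> 3: fall back only where the higher-priority date is missing.
date = parse(df["Date Of Onset"])
date = date.fillna(parse(df["Sample Collected Date"]))
date = date.fillna(parse(df["Test Performed Date"]))
df["date"] = date.dt.strftime("%Y-%m-%d")  # NaT renders as null (NaN)

print(df["date"].tolist())  # ['2024-06-01', '2024-06-04', nan]
```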
District Name Corrections¶
The script corrects known spelling discrepancies between IHIP data and the region ID database:
| IHIP Spelling | Corrected Name (in `regionids.csv`) |
|---|---|
| Uttar Kannad | UTTARA KANNADA |
| Dakshin Kannad | DAKSHINA KANNADA |
| Bagalkot | BAGALKOTE |
| Chikballapur | CHIKKABALLAPURA |
These corrections ensure accurate matching with standardized region identifiers.
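A sketch of how the corrections might be applied, assuming they live in the `district_name_fixes` dictionary mentioned under Error Handling and that normalization (uppercase, stripped whitespace) happens first; the column name and sample rows are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"district.name": ["Uttar Kannad", " Bagalkot ", "MYSURU"]})

# Keys are uppercase because corrections are applied after normalization.
district_name_fixes = {
    "UTTAR KANNAD": "UTTARA KANNADA",
    "DAKSHIN KANNAD": "DAKSHINA KANNADA",
    "BAGALKOT": "BAGALKOTE",
    "CHIKBALLAPUR": "CHIKKABALLAPURA",
}

# Normalize, then fix spellings; unmapped names pass through unchanged.
df["district.name"] = (
    df["district.name"].str.upper().str.strip().replace(district_name_fixes)
)

print(df["district.name"].tolist())
# ['UTTARA KANNADA', 'BAGALKOTE', 'MYSURU']
```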
Region ID Mapping¶
The script merges the linelist with `data/regionids.csv` to add standardized identifiers:

- `district.ID`: Standard district identifier (e.g., `district_550`)
- `state.ID`: Standard state identifier (e.g., `state_29`)

The mapping process (sketched below):

1. Normalizes district and state names (uppercase, strip whitespace)
2. Applies district name corrections
3. Performs a left join with the region ID database
4. Logs warnings for unmatched districts
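Putting those steps together, the join could be sketched as below. The `regionID`/`regionName`/`parentID` schema matches Configuration Requirements; the sample rows, the rename mapping, and the use of a district's `parentID` as its state ID are illustrative assumptions:

```python
import pandas as pd

# Illustrative linelist rows, already normalized and spelling-corrected.
df = pd.DataFrame({"district.name": ["BAGALKOTE", "UNKNOWNPUR"]})

# Illustrative slice of data/regionids.csv (the ID pairings are made up).
regions = pd.DataFrame({
    "regionID":   ["district_550", "state_29"],
    "regionName": ["BAGALKOTE",    "KARNATAKA"],
    "parentID":   ["state_29",     "country_1"],
})

districts = regions[regions["regionID"].str.startswith("district_")]
df = df.merge(
    districts.rename(columns={
        "regionID": "district.ID",      # standard district identifier
        "regionName": "district.name",  # join key: corrected district name
        "parentID": "state.ID",         # assume a district's parent is its state
    }),
    on="district.name",
    how="left",  # left join keeps rows whose district could not be matched
)

unmatched = df.loc[df["district.ID"].isna(), "district.name"].unique().tolist()
if unmatched:
    print(f"WARNING: Could not match {len(unmatched)} districts to region IDs: {unmatched}")
```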
Case Aggregation¶
The aggregation logic creates daily district-level summaries (sketched below):

- Grouping: by `date` and `district.name`
- Filtering: only records with `test_result = "POSITIVE"` (case-insensitive)
- Counting: number of positive test results per group
- Preserving: district name, state name, district ID, and state ID
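In pandas this aggregation can be sketched as a filter plus a groupby; the sample rows are illustrative, and the count column is named `size` here only because that is pandas' default:

```python
import pandas as pd

# Illustrative linelist with region IDs already attached.
df = pd.DataFrame({
    "date":          ["2024-06-01", "2024-06-01", "2024-06-01", "2024-06-02"],
    "district.name": ["BAGALKOTE"] * 4,
    "district.ID":   ["district_550"] * 4,
    "state.ID":      ["state_29"] * 4,
    "test_result":   ["POSITIVE", "positive", "NEGATIVE", "POSITIVE"],
})

# Case-insensitive filter on positives, then count per date x district,
# carrying the identifier columns through the groupby.
positives = df[df["test_result"].str.upper() == "POSITIVE"]
cases = (
    positives
    .groupby(["date", "district.name", "district.ID", "state.ID"], as_index=False)
    .size()  # adds a "size" column holding the per-group counts
    .sort_values(["date", "district.name"])
)

print(cases)  # 2 positives on 2024-06-01, 1 on 2024-06-02
```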
Output Files¶
The script generates three output artifacts:
1. Raw Downloads (data/ihip/linelist/)¶
Individual files downloaded from S3:
- Original XLSX files
- Converted CSV files
- Intermediate processing artifacts
2. Consolidated Linelist (data/ihip/linelist.csv)¶
Merged dataset with all individual test records.
| Column | Description |
|---|---|
| `date` | Test/onset date (YYYY-MM-DD) |
|  | Sub-district name |
| `district.name` | District name (corrected) |
| `state.name` | State name |
| `test_result` | Test result (e.g., POSITIVE, NEGATIVE) |
3. Aggregated Cases (datasets/cases_district_daily.csv)¶
Daily positive case counts by district with region identifiers.
| Column | Description |
|---|---|
| `date` | Date of cases (YYYY-MM-DD) |
| `district.ID` | Standard district identifier (e.g., `district_550`) |
| `district.name` | District name (uppercase, corrected) |
| `state.ID` | Standard state identifier (e.g., `state_29`) |
| `state.name` | State name |
|  | Number of positive test results for this district on this date |
Data Characteristics:
- One row per district per date (only dates with positive cases; see the sketch below for filling zero-case days)
- Sorted by date and district name
- All district names in uppercase
- Missing region IDs logged as warnings
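Because only dates with positive cases appear, downstream models that need a dense daily series must fill in the zero-case days themselves. A sketch for one district, with the count column named `cases` purely for illustration:

```python
import pandas as pd

# Illustrative slice of the aggregated file for a single district.
cases = pd.DataFrame({
    "date":  pd.to_datetime(["2024-06-01", "2024-06-04"]),
    "cases": [3, 1],
})

# Reindex onto a complete daily range so missing days become explicit zeros.
full_range = pd.date_range(cases["date"].min(), cases["date"].max(), freq="D")
dense = cases.set_index("date")["cases"].reindex(full_range, fill_value=0)

print(dense.tolist())  # [3, 0, 0, 1]
```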
Usage¶
Basic Command¶
Run with default output location:
`python src/acestor/fetch_and_parse_ihip_data.py`

This writes the aggregated cases file to `datasets/cases_district_daily.csv`.
Custom Output File¶
Specify a custom output location:
`python src/acestor/fetch_and_parse_ihip_data.py -o path/to/custom_output.csv`
Or using the long form:
`python src/acestor/fetch_and_parse_ihip_data.py --output-file path/to/custom_output.csv`
Help¶
Display command-line options:
`python src/acestor/fetch_and_parse_ihip_data.py -h`
Command-Line Arguments¶
| Flag | Type | Description |
|---|---|---|
| `-o`, `--output-file` | string | Output file path for aggregated cases (default: `datasets/cases_district_daily.csv`) |
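A minimal argparse setup consistent with this interface (the description string is an assumption):

```python
# Sketch of the command-line interface described above.
import argparse

parser = argparse.ArgumentParser(
    description="Fetch, parse, and aggregate IHIP dengue surveillance data."
)
parser.add_argument(
    "-o", "--output-file",
    default="datasets/cases_district_daily.csv",
    help="Output file path for aggregated cases",
)
args = parser.parse_args()

print(args.output_file)  # argparse exposes --output-file as output_file
```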
Configuration Requirements¶
Required Files¶
The script requires the following files to be present:
- `config/dengue_pipeline.yaml`: must contain the `ihip_s3_location` parameter, e.g. `ihip_s3_location: s3://bucket-name/prefix/path/`
- `.env`: must contain AWS credentials, i.e. `AWS_ACCESS_KEY=your_access_key` and `AWS_SECRET_KEY=your_secret_key`
- `data/regionids.csv`: region ID mapping database with columns:
  - `regionID`: Identifier (e.g., `district_550`, `state_29`)
  - `regionName`: Human-readable name
  - `parentID`: Parent region identifier
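A sketch of how these requirements might be loaded, assuming PyYAML and python-dotenv (both listed under Dependencies); the error messages mirror the ones under Error Handling:

```python
import os

import yaml
from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment

with open("config/dengue_pipeline.yaml") as f:
    config = yaml.safe_load(f)

s3_location = config.get("ihip_s3_location")
if not s3_location:
    raise ValueError("ihip_s3_location not found in dengue_pipeline.yaml")

access_key = os.getenv("AWS_ACCESS_KEY")
secret_key = os.getenv("AWS_SECRET_KEY")
if not access_key or not secret_key:
    raise ValueError("AWS credentials not found in environment variables")
```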
Logging and Monitoring¶
Console Output¶
The script provides real-time progress updates including:
- Step-by-step progress through the pipeline
- File download progress bars (via `tqdm`)
- Row counts and column information
- Summary statistics (total cases, date ranges, unique districts)
Log Files¶
Detailed logs are written to:
`logs/fetch_ihip_data_YYYYMMDD-HHMMSS.log`
Log entries include:
- Timestamps for all operations
- File processing details
- Data quality warnings
- Error messages with stack traces
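A sketch of a logging setup that produces this kind of timestamped log file alongside console output; the handler and format details are assumptions:

```python
import logging
import os
from datetime import datetime

os.makedirs("logs", exist_ok=True)
log_file = f"logs/fetch_ihip_data_{datetime.now():%Y%m%d-%H%M%S}.log"

# Write every record both to the timestamped file and to the console.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[logging.FileHandler(log_file), logging.StreamHandler()],
)

logging.info("Starting IHIP pipeline")
```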
Warnings¶
The script logs warnings for:
- Missing columns: expected columns are not found in source files
- Unmatched districts: districts that couldn't be mapped to region IDs
- Unparseable dates: date values that couldn't be converted to YYYY-MM-DD format
- Empty results: no positive test results found in the data
Dependencies¶
| Package | Purpose |
|---|---|
| `boto3` | AWS S3 client for file downloads |
| `pandas` | Data manipulation, CSV/XLSX handling |
| `PyYAML` | Configuration file parsing |
| `python-dotenv` | Environment variable management |
| `tqdm` | Progress bar display |
| `openpyxl` | XLSX file reading (pandas dependency) |
Error Handling¶
Common Errors¶
AWS Authentication Failure

`ValueError: AWS credentials not found in environment variables`

Solution: Ensure the `.env` file contains valid `AWS_ACCESS_KEY` and `AWS_SECRET_KEY` entries.

Missing Configuration

`ValueError: ihip_s3_location not found in dengue_pipeline.yaml`

Solution: Add the `ihip_s3_location` parameter to the configuration file.

No Files in S3

`WARNING: No files found in s3://bucket/prefix/`

Solution: Verify that the S3 path is correct and the bucket contains IHIP data files.

Region ID Mismatches

`WARNING: Could not match N districts to region IDs: ['DISTRICT1', 'DISTRICT2', ...]`

Solution: Add the missing district name corrections to the `district_name_fixes` dictionary in the script.