IHIP Data Fetcher and Parser

Overview

The IHIP Data Fetcher (fetch_and_parse_ihip_data.py) is a standalone script that automates the complete pipeline for fetching, processing, and aggregating IHIP (Integrated Health Information Platform) dengue surveillance data from AWS S3. It transforms raw linelist data into analysis-ready datasets for epidemiological modeling.

Location: src/acestor/fetch_and_parse_ihip_data.py

Pipeline Workflow

The script executes four sequential steps (see the sketch after this list):

  1. S3 Download: Fetches all IHIP linelist files from the configured S3 bucket

  2. Format Conversion: Converts XLSX files to CSV format for standardized processing

  3. Data Merging: Consolidates multiple CSV files into a single unified linelist

  4. Aggregation: Creates district-level daily positive case counts with standardized region identifiers
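A minimal sketch of this orchestration, using hypothetical helper names (download_from_s3, convert_xlsx_to_csv, and aggregate_daily_cases are illustrative, not the script's actual API; each is sketched in its section below):

```python
# Hypothetical shape of the pipeline; helper names are illustrative.
import pandas as pd

def main(output_file: str = "datasets/cases_district_daily.csv") -> None:
    xlsx_files = download_from_s3()                           # Step 1
    csv_files = [convert_xlsx_to_csv(p) for p in xlsx_files]  # Step 2
    # Step 3: consolidate the per-file CSVs into one unified linelist
    linelist = pd.concat(map(pd.read_csv, csv_files), ignore_index=True)
    linelist.to_csv("data/ihip/linelist.csv", index=False)
    # Step 4: aggregate to district-level daily positive case counts
    aggregate_daily_cases(linelist).to_csv(output_file, index=False)
```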

Data Source Configuration

The script retrieves data from AWS S3 using the following configuration:

Data Source Parameters

| Parameter      | Details                                                          |
|----------------|------------------------------------------------------------------|
| S3 Location    | Defined in config/dengue_pipeline.yaml (ihip_s3_location)        |
| File Format    | XLSX spreadsheets containing dengue test records                 |
| Authentication | AWS credentials from .env file (AWS_ACCESS_KEY, AWS_SECRET_KEY)  |
| Bucket         | Parsed from the S3 URI (e.g., s3://dsih-artpark-01-raw-data/...) |
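
A sketch of the download step under these assumptions (the helper name download_from_s3 is illustrative):

```python
import os
from pathlib import Path
from urllib.parse import urlparse

import boto3
import yaml
from dotenv import load_dotenv

def download_from_s3(out_dir: Path = Path("data/ihip/linelist")) -> list[Path]:
    load_dotenv()  # pulls AWS_ACCESS_KEY / AWS_SECRET_KEY from .env
    with open("config/dengue_pipeline.yaml") as fh:
        uri = urlparse(yaml.safe_load(fh)["ihip_s3_location"])
    bucket, prefix = uri.netloc, uri.path.lstrip("/")
    s3 = boto3.client(
        "s3",
        aws_access_key_id=os.getenv("AWS_ACCESS_KEY"),
        aws_secret_access_key=os.getenv("AWS_SECRET_KEY"),
    )
    out_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    # List every object under the prefix and fetch the XLSX linelists.
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".xlsx"):
                target = out_dir / Path(obj["Key"]).name
                s3.download_file(bucket, obj["Key"], str(target))
                downloaded.append(target)
    return downloaded
```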

Data Transformations

Column Mapping

The script performs the following column transformations:

Column Renaming

| Original Column       | Final Column       | Action     |
|-----------------------|--------------------|------------|
| Sub District          | subdistrict.name   | Renamed    |
| District              | district.name      | Renamed    |
| State                 | state.name         | Renamed    |
| Test Result           | test_result        | Renamed    |
| Date Of Onset         | (merged into date) | Priority 1 |
| Sample Collected Date | (merged into date) | Priority 2 |
| Test Performed Date   | (merged into date) | Priority 3 |

Date Consolidation Logic

A single date column is created using priority-based selection (see the sketch below):

  1. First priority: Use Date Of Onset if available

  2. Second priority: Use Sample Collected Date if onset date is missing

  3. Third priority: Use Test Performed Date if both previous dates are missing

All dates are standardized to YYYY-MM-DD format using pandas.to_datetime().

Note

Unparseable dates are set to null and logged as warnings.
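
A minimal sketch of the consolidation, assuming the original column names listed above; errors="coerce" reproduces the null-on-unparseable behaviour:

```python
import pandas as pd

def consolidate_date(df: pd.DataFrame) -> pd.DataFrame:
    # errors="coerce" turns unparseable values into NaT (null); the real
    # script also logs these as warnings.
    onset = pd.to_datetime(df["Date Of Onset"], errors="coerce")
    sample = pd.to_datetime(df["Sample Collected Date"], errors="coerce")
    tested = pd.to_datetime(df["Test Performed Date"], errors="coerce")
    # Priority order: onset date, then sample collection, then test date.
    df["date"] = onset.fillna(sample).fillna(tested).dt.strftime("%Y-%m-%d")
    return df
```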

District Name Corrections

The script corrects known spelling discrepancies between IHIP data and the region ID database:

District Name Fixes

| IHIP Spelling  | Corrected Name (in regionids.csv) |
|----------------|-----------------------------------|
| Uttar Kannad   | UTTARA KANNADA                    |
| Dakshin Kannad | DAKSHINA KANNADA                  |
| Bagalkot       | BAGALKOTE                         |
| Chikballapur   | CHIKKABALLAPURA                   |

These corrections ensure accurate matching with standardized region identifiers.
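
In code, this amounts to a small lookup table applied after normalization (a sketch; the script's own dictionary is named district_name_fixes, and the uppercase keys here assume the fixes run after names are uppercased):

```python
DISTRICT_NAME_FIXES = {
    "UTTAR KANNAD": "UTTARA KANNADA",
    "DAKSHIN KANNAD": "DAKSHINA KANNADA",
    "BAGALKOT": "BAGALKOTE",
    "CHIKBALLAPUR": "CHIKKABALLAPURA",
}

def fix_district_names(df):
    # Normalize first (uppercase, strip), then apply the known corrections.
    df["district.name"] = (
        df["district.name"].str.upper().str.strip().replace(DISTRICT_NAME_FIXES)
    )
    return df
```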

Region ID Mapping

The script merges the linelist with data/regionids.csv to add standardized identifiers:

  • district.ID: Standard district identifier (e.g., district_550)

  • state.ID: Standard state identifier (e.g., state_29)

The mapping process (sketched below):

  1. Normalizes district and state names (uppercase, strip whitespace)

  2. Applies district name corrections

  3. Performs left join with region ID database

  4. Logs warnings for unmatched districts
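
A sketch of the join, assuming district rows in regionids.csv carry the state as parentID and that regionName holds the corrected uppercase district name:

```python
import logging
import pandas as pd

def add_region_ids(df: pd.DataFrame) -> pd.DataFrame:
    regions = pd.read_csv("data/regionids.csv")
    districts = regions[regions["regionID"].str.startswith("district_")]
    merged = df.merge(
        districts.rename(columns={
            "regionID": "district.ID",
            "parentID": "state.ID",
            "regionName": "district.name",
        }),
        on="district.name",
        how="left",  # keep all linelist rows, even unmatched ones
    )
    # Surface any districts that failed to match, as the script does.
    unmatched = merged.loc[merged["district.ID"].isna(), "district.name"].unique()
    if len(unmatched):
        logging.warning("Could not match %d districts to region IDs: %s",
                        len(unmatched), list(unmatched))
    return merged
```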

Case Aggregation

The aggregation logic creates daily district-level summaries (see the sketch after this list):

  • Grouping: By date and district.name

  • Filtering: Only records with test_result = "POSITIVE" (case-insensitive)

  • Counting: Number of positive test results per group

  • Preserving: District name, state name, district ID, and state ID
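
A minimal sketch of these rules:

```python
import pandas as pd

def aggregate_daily_cases(linelist: pd.DataFrame) -> pd.DataFrame:
    # Case-insensitive filter for positive test results.
    positives = linelist[linelist["test_result"].str.upper().str.strip() == "POSITIVE"]
    return (
        positives.groupby(
            ["date", "district.ID", "district.name", "state.ID", "state.name"],
            as_index=False,
            dropna=False,  # keep districts whose region IDs are missing
        )
        .size()
        .rename(columns={"size": "case"})
        .sort_values(["date", "district.name"], ignore_index=True)
    )
```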

Output Files

The script generates three output artifacts:

1. Raw Downloads (data/ihip/linelist/)

Individual files from the download and conversion steps:

  • Original XLSX files

  • Converted CSV files

  • Intermediate processing artifacts
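
The XLSX-to-CSV conversion (step 2) can be sketched as follows, assuming pandas with the openpyxl engine; the helper name is illustrative:

```python
from pathlib import Path
import pandas as pd

def convert_xlsx_to_csv(xlsx_path: Path) -> Path:
    # pandas reads .xlsx via the openpyxl engine listed under Dependencies.
    csv_path = xlsx_path.with_suffix(".csv")
    pd.read_excel(xlsx_path).to_csv(csv_path, index=False)
    return csv_path
```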

2. Consolidated Linelist (data/ihip/linelist.csv)

Merged dataset with all individual test records.

Linelist Schema

| Column           | Description                            |
|------------------|----------------------------------------|
| date             | Test/onset date (YYYY-MM-DD)           |
| subdistrict.name | Sub-district name                      |
| district.name    | District name (corrected)              |
| state.name       | State name                             |
| test_result      | Test result (e.g., POSITIVE, NEGATIVE) |

3. Aggregated Cases (datasets/cases_district_daily.csv)

Daily positive case counts by district with region identifiers.

Cases Schema

| Column        | Description                                                    |
|---------------|----------------------------------------------------------------|
| date          | Date of cases (YYYY-MM-DD)                                     |
| district.ID   | Standard district identifier (e.g., district_550)              |
| district.name | District name (uppercase, corrected)                           |
| state.ID      | Standard state identifier (e.g., state_29)                     |
| state.name    | State name                                                     |
| case          | Number of positive test results for this district on this date |

Data Characteristics:

  • One row per district per date (only dates with positive cases)

  • Sorted by date and district name

  • All district names in uppercase

  • Missing region IDs logged as warnings

Usage

Basic Command

Run with default output location:

python src/acestor/fetch_and_parse_ihip_data.py

This outputs the aggregated cases file to datasets/cases_district_daily.csv.

Custom Output File

Specify a custom output location:

python src/acestor/fetch_and_parse_ihip_data.py -o path/to/custom_output.csv

Or using the long form:

python src/acestor/fetch_and_parse_ihip_data.py --output-file path/to/custom_output.csv

Help

Display command-line options:

python src/acestor/fetch_and_parse_ihip_data.py -h

Command-Line Arguments

| Flag              | Type   | Description                                                                         |
|-------------------|--------|-------------------------------------------------------------------------------------|
| -o, --output-file | string | Output file path for aggregated cases (default: datasets/cases_district_daily.csv) |
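
The parser is presumably a thin argparse wrapper; a sketch consistent with the flags above:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Fetch, parse, and aggregate IHIP dengue surveillance data"
)
parser.add_argument(
    "-o", "--output-file",
    default="datasets/cases_district_daily.csv",
    help="Output file path for aggregated cases",
)
args = parser.parse_args()
```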

Configuration Requirements

Required Files

The script requires the following files to be present:

  1. config/dengue_pipeline.yaml

    Must contain the ihip_s3_location parameter:

    ihip_s3_location: s3://bucket-name/prefix/path/
    
  2. .env

    Must contain AWS credentials:

    AWS_ACCESS_KEY=your_access_key
    AWS_SECRET_KEY=your_secret_key
    
  3. data/regionids.csv

    Region ID mapping database with columns:

    • regionID: Identifier (e.g., district_550, state_29)

    • regionName: Human-readable name

    • parentID: Parent region identifier
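
For illustration only, rows might look like the following (the names and parent links here are hypothetical):

```csv
regionID,regionName,parentID
state_29,EXAMPLE STATE,country_1
district_550,EXAMPLE DISTRICT,state_29
```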

Logging and Monitoring

Console Output

The script provides real-time progress updates including:

  • Step-by-step progress through the pipeline

  • File download progress bars (via tqdm)

  • Row counts and column information

  • Summary statistics (total cases, date ranges, unique districts)

Log Files

Detailed logs are written to:

logs/fetch_ihip_data_YYYYMMDD-HHMMSS.log

Log entries include:

  • Timestamps for all operations

  • File processing details

  • Data quality warnings

  • Error messages with stack traces
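
A minimal logging setup consistent with this naming scheme (a sketch, not the script's exact configuration):

```python
import logging
from datetime import datetime
from pathlib import Path

Path("logs").mkdir(exist_ok=True)
log_file = f"logs/fetch_ihip_data_{datetime.now():%Y%m%d-%H%M%S}.log"
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    # Write detailed logs to the file and mirror progress to the console.
    handlers=[logging.FileHandler(log_file), logging.StreamHandler()],
)
```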

Warnings

The script logs warnings for:

  • Missing columns: When expected columns are not found in source files

  • Unmatched districts: Districts that couldn’t be mapped to region IDs

  • Unparseable dates: Date values that couldn’t be converted to YYYY-MM-DD format

  • Empty results: No positive test results found in the data

Dependencies

| Package       | Purpose                               |
|---------------|---------------------------------------|
| boto3         | AWS S3 client for file downloads      |
| pandas        | Data manipulation, CSV/XLSX handling  |
| pyyaml        | Configuration file parsing            |
| python-dotenv | Environment variable management       |
| tqdm          | Progress bar display                  |
| openpyxl      | XLSX file reading (pandas dependency) |

Error Handling

Common Errors

AWS Authentication Failure

ValueError: AWS credentials not found in environment variables

Solution: Ensure the .env file contains valid AWS_ACCESS_KEY and AWS_SECRET_KEY values.

Missing Configuration

ValueError: ihip_s3_location not found in dengue_pipeline.yaml

Solution: Add the ihip_s3_location parameter to the configuration file.

No Files in S3

WARNING: No files found in s3://bucket/prefix/

Solution: Verify that the S3 path is correct and the bucket contains IHIP data files.

Region ID Mismatches

WARNING: Could not match N districts to region IDs: ['DISTRICT1', 'DISTRICT2', ...]

Solution: Add missing district name corrections to the district_name_fixes dictionary in the script.