IHIP Data Fetcher and Parser
============================

Overview
--------

The IHIP Data Fetcher (``fetch_and_parse_ihip_data.py``) is a standalone script that automates the complete pipeline for fetching, processing, and aggregating IHIP (Integrated Health Information Platform) dengue surveillance data from AWS S3. It transforms raw linelist data into analysis-ready datasets for epidemiological modeling.

Location: ``src/acestor/fetch_and_parse_ihip_data.py``

Pipeline Workflow
-----------------

The script executes four sequential steps:

1. **S3 Download**: Fetches all IHIP linelist files from the configured S3 bucket
2. **Format Conversion**: Converts XLSX files to CSV format for standardized processing
3. **Data Merging**: Consolidates multiple CSV files into a single unified linelist
4. **Aggregation**: Creates district-level daily positive case counts with standardized region identifiers

Data Source Configuration
-------------------------

The script retrieves data from AWS S3 using the following configuration:

.. list-table:: Data Source Parameters
   :header-rows: 1
   :widths: 30 70

   * - Parameter
     - Details
   * - **S3 Location**
     - Defined in ``config/dengue_pipeline.yaml`` (``ihip_s3_location``)
   * - **File Format**
     - XLSX spreadsheets containing dengue test records
   * - **Authentication**
     - AWS credentials from ``.env`` file (``AWS_ACCESS_KEY``, ``AWS_SECRET_KEY``)
   * - **Bucket**
     - Parsed from S3 URI (e.g., ``s3://dsih-artpark-01-raw-data/...``)

Data Transformations
--------------------

Column Mapping
~~~~~~~~~~~~~~

The script performs the following column transformations:

.. list-table:: Column Renaming
   :header-rows: 1
   :widths: 40 40 20

   * - Original Column
     - Final Column
     - Action
   * - ``Sub District``
     - ``subdistrict.name``
     - Renamed
   * - ``District``
     - ``district.name``
     - Renamed
   * - ``State``
     - ``state.name``
     - Renamed
   * - ``Test Result``
     - ``test_result``
     - Renamed
   * - ``Date Of Onset``
     - (merged into ``date``)
     - Priority 1
   * - ``Sample Collected Date``
     - (merged into ``date``)
     - Priority 2
   * - ``Test Performed Date``
     - (merged into ``date``)
     - Priority 3

Date Consolidation Logic
~~~~~~~~~~~~~~~~~~~~~~~~

A single ``date`` column is created using priority-based selection:

1. **First priority**: Use ``Date Of Onset`` if available
2. **Second priority**: Use ``Sample Collected Date`` if the onset date is missing
3. **Third priority**: Use ``Test Performed Date`` if both previous dates are missing

All dates are standardized to **YYYY-MM-DD** format using ``pandas.to_datetime()``.

.. note::

   Unparseable dates are set to null and logged as warnings.

District Name Corrections
~~~~~~~~~~~~~~~~~~~~~~~~~

The script corrects known spelling discrepancies between IHIP data and the region ID database:

.. list-table:: District Name Fixes
   :header-rows: 1
   :widths: 50 50

   * - IHIP Spelling
     - Corrected Name (in regionids.csv)
   * - Uttar Kannad
     - UTTARA KANNADA
   * - Dakshin Kannad
     - DAKSHINA KANNADA
   * - Bagalkot
     - BAGALKOTE
   * - Chikballapur
     - CHIKKABALLAPURA

These corrections ensure accurate matching with standardized region identifiers.

Region ID Mapping
~~~~~~~~~~~~~~~~~

The script merges with ``data/regionids.csv`` to add standardized identifiers:

- **district.ID**: Standard district identifier (e.g., ``district_550``)
- **state.ID**: Standard state identifier (e.g., ``state_29``)

The mapping process:

1. Normalizes district and state names (uppercase, strip whitespace)
2. Applies district name corrections
3. Performs a left join with the region ID database
4. Logs warnings for unmatched districts
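As a rough illustration of the transformations above (date consolidation, name corrections, and region ID mapping), the following pandas sketch mirrors the documented steps. It is not the script's actual code: the helper name ``add_region_ids``, the treatment of only ``district.ID`` (``state.ID`` would be attached analogously), and the assumption that ``regionName`` values in ``data/regionids.csv`` are uppercase are all illustrative.

.. code-block:: python

   import logging

   import pandas as pd

   logger = logging.getLogger(__name__)

   # Spelling fixes from the table above (IHIP spelling -> regionids.csv name).
   DISTRICT_NAME_FIXES = {
       "Uttar Kannad": "UTTARA KANNADA",
       "Dakshin Kannad": "DAKSHINA KANNADA",
       "Bagalkot": "BAGALKOTE",
       "Chikballapur": "CHIKKABALLAPURA",
   }

   # Columns checked in priority order when building the consolidated ``date``.
   DATE_PRIORITY = ["Date Of Onset", "Sample Collected Date", "Test Performed Date"]


   def add_region_ids(linelist: pd.DataFrame,
                      regionids_path: str = "data/regionids.csv") -> pd.DataFrame:
       """Illustrative sketch: consolidate dates, normalize names, attach district IDs."""
       df = linelist.copy()

       # Priority-based date selection: take the first parseable value across the
       # three date columns; unparseable values become null after coercion.
       dates = df[DATE_PRIORITY].apply(pd.to_datetime, errors="coerce")
       df["date"] = dates.bfill(axis=1).iloc[:, 0].dt.strftime("%Y-%m-%d")

       # Normalize names: strip whitespace, apply spelling fixes, uppercase.
       df["district.name"] = (
           df["District"].astype(str).str.strip().replace(DISTRICT_NAME_FIXES).str.upper()
       )
       df["state.name"] = df["State"].astype(str).str.strip().str.upper()

       # Left join against the region ID database to pick up ``district.ID``.
       regions = pd.read_csv(regionids_path)
       districts = regions.loc[regions["regionID"].str.startswith("district_", na=False),
                               ["regionName", "regionID"]]
       districts = districts.rename(columns={"regionName": "district.name",
                                             "regionID": "district.ID"})
       df = df.merge(districts, on="district.name", how="left")

       # Warn about districts that could not be matched to a region ID.
       unmatched = sorted(df.loc[df["district.ID"].isna(), "district.name"].unique())
       if unmatched:
           logger.warning("Could not match %d districts to region IDs: %s",
                          len(unmatched), unmatched)
       return df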
Case Aggregation
~~~~~~~~~~~~~~~~

The aggregation logic creates daily district-level summaries:

- **Grouping**: By ``date`` and ``district.name``
- **Filtering**: Only records with ``test_result = "POSITIVE"`` (case-insensitive)
- **Counting**: Number of positive test results per group
- **Preserving**: District name, state name, district ID, and state ID

Output Files
------------

The script generates three output artifacts:

1. Raw Downloads (``data/ihip/linelist/``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Individual files downloaded from S3:

- Original XLSX files
- Converted CSV files
- Intermediate processing artifacts

2. Consolidated Linelist (``data/ihip/linelist.csv``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Merged dataset with all individual test records.

.. list-table:: Linelist Schema
   :header-rows: 1
   :widths: 30 70

   * - Column
     - Description
   * - ``date``
     - Test/onset date (YYYY-MM-DD)
   * - ``subdistrict.name``
     - Sub-district name
   * - ``district.name``
     - District name (corrected)
   * - ``state.name``
     - State name
   * - ``test_result``
     - Test result (e.g., POSITIVE, NEGATIVE)

3. Aggregated Cases (``datasets/cases_district_daily.csv``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Daily positive case counts by district with region identifiers.

.. list-table:: Cases Schema
   :header-rows: 1
   :widths: 25 75

   * - Column
     - Description
   * - ``date``
     - Date of cases (YYYY-MM-DD)
   * - ``district.ID``
     - Standard district identifier (e.g., district_550)
   * - ``district.name``
     - District name (uppercase, corrected)
   * - ``state.ID``
     - Standard state identifier (e.g., state_29)
   * - ``state.name``
     - State name
   * - ``case``
     - Number of positive test results for this district on this date

**Data Characteristics**:

- One row per district per date (only dates with positive cases)
- Sorted by date and district name
- All district names in uppercase
- Missing region IDs logged as warnings

Usage
-----

Basic Command
~~~~~~~~~~~~~

Run with the default output location:

.. code-block:: bash

   python src/acestor/fetch_and_parse_ihip_data.py

This outputs the aggregated cases file to ``datasets/cases_district_daily.csv``.

Custom Output File
~~~~~~~~~~~~~~~~~~

Specify a custom output location:

.. code-block:: bash

   python src/acestor/fetch_and_parse_ihip_data.py -o path/to/custom_output.csv

Or using the long form:

.. code-block:: bash

   python src/acestor/fetch_and_parse_ihip_data.py --output-file path/to/custom_output.csv

Help
~~~~

Display command-line options:

.. code-block:: bash

   python src/acestor/fetch_and_parse_ihip_data.py -h

Command-Line Arguments
----------------------

.. list-table::
   :header-rows: 1
   :widths: 20 20 60

   * - Flag
     - Type
     - Description
   * - ``-o``, ``--output-file``
     - string
     - Output file path for aggregated cases (default: ``datasets/cases_district_daily.csv``)

Configuration Requirements
--------------------------

Required Files
~~~~~~~~~~~~~~

The script requires the following files to be present:

1. **config/dengue_pipeline.yaml**

   Must contain the ``ihip_s3_location`` parameter:

   .. code-block:: yaml

      ihip_s3_location: s3://bucket-name/prefix/path/

2. **.env**

   Must contain AWS credentials:

   .. code-block:: bash

      AWS_ACCESS_KEY=your_access_key
      AWS_SECRET_KEY=your_secret_key

3. **data/regionids.csv**

   Region ID mapping database with columns:

   - ``regionID``: Identifier (e.g., district_550, state_29)
   - ``regionName``: Human-readable name
   - ``parentID``: Parent region identifier
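Putting the three required files together, the sketch below shows one way the download step could read the configuration and credentials and pull every object under the IHIP prefix. It is illustrative rather than the script's actual implementation: the function name ``fetch_ihip_files`` and the default download directory are assumptions, and the error messages simply mirror those listed under Error Handling below.

.. code-block:: python

   import os
   from pathlib import Path
   from urllib.parse import urlparse

   import boto3
   import yaml
   from dotenv import load_dotenv


   def fetch_ihip_files(config_path: str = "config/dengue_pipeline.yaml",
                        download_dir: str = "data/ihip/linelist") -> list[Path]:
       """Illustrative sketch: download every object under the configured IHIP S3 prefix."""
       # Read the S3 location from the pipeline configuration.
       with open(config_path) as fh:
           config = yaml.safe_load(fh)
       s3_location = config.get("ihip_s3_location")
       if not s3_location:
           raise ValueError("ihip_s3_location not found in dengue_pipeline.yaml")

       # Load AWS credentials from the .env file.
       load_dotenv()
       access_key = os.getenv("AWS_ACCESS_KEY")
       secret_key = os.getenv("AWS_SECRET_KEY")
       if not access_key or not secret_key:
           raise ValueError("AWS credentials not found in environment variables")

       # Split s3://bucket/prefix/ into bucket and prefix.
       parsed = urlparse(s3_location)
       bucket, prefix = parsed.netloc, parsed.path.lstrip("/")

       s3 = boto3.client("s3",
                         aws_access_key_id=access_key,
                         aws_secret_access_key=secret_key)
       out_dir = Path(download_dir)
       out_dir.mkdir(parents=True, exist_ok=True)

       # List and download every object under the prefix.
       downloaded = []
       paginator = s3.get_paginator("list_objects_v2")
       for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
           for obj in page.get("Contents", []):
               target = out_dir / Path(obj["Key"]).name
               s3.download_file(bucket, obj["Key"], str(target))
               downloaded.append(target)
       return downloaded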
Logging and Monitoring
----------------------

Console Output
~~~~~~~~~~~~~~

The script provides real-time progress updates including:

- Step-by-step progress through the pipeline
- File download progress bars (via ``tqdm``)
- Row counts and column information
- Summary statistics (total cases, date ranges, unique districts)

Log Files
~~~~~~~~~

Detailed logs are written to:

.. code-block:: text

   logs/fetch_ihip_data_YYYYMMDD-HHMMSS.log

Log entries include:

- Timestamps for all operations
- File processing details
- Data quality warnings
- Error messages with stack traces

Warnings
~~~~~~~~

The script logs warnings for:

- **Missing columns**: When expected columns are not found in source files
- **Unmatched districts**: Districts that couldn't be mapped to region IDs
- **Unparseable dates**: Date values that couldn't be converted to YYYY-MM-DD format
- **Empty results**: No positive test results found in the data

Dependencies
------------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Package
     - Purpose
   * - ``boto3``
     - AWS S3 client for file downloads
   * - ``pandas``
     - Data manipulation, CSV/XLSX handling
   * - ``pyyaml``
     - Configuration file parsing
   * - ``python-dotenv``
     - Environment variable management
   * - ``tqdm``
     - Progress bar display
   * - ``openpyxl``
     - XLSX file reading (pandas dependency)

Error Handling
--------------

Common Errors
~~~~~~~~~~~~~

**AWS Authentication Failure**

.. code-block:: text

   ValueError: AWS credentials not found in environment variables

**Solution**: Ensure the ``.env`` file contains valid ``AWS_ACCESS_KEY`` and ``AWS_SECRET_KEY``.

**Missing Configuration**

.. code-block:: text

   ValueError: ihip_s3_location not found in dengue_pipeline.yaml

**Solution**: Add the ``ihip_s3_location`` parameter to the configuration file.

**No Files in S3**

.. code-block:: text

   WARNING: No files found in s3://bucket/prefix/

**Solution**: Verify the S3 path is correct and the bucket contains IHIP data files.

**Region ID Mismatches**

.. code-block:: text

   WARNING: Could not match N districts to region IDs: ['DISTRICT1', 'DISTRICT2', ...]

**Solution**: Add missing district name corrections to the ``district_name_fixes`` dictionary in the script.
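For the last warning, the fix usually amounts to one new dictionary entry. The snippet below is illustrative only; the actual dictionary lives inside ``src/acestor/fetch_and_parse_ihip_data.py`` and its exact key casing may differ, and the final entry is a hypothetical example rather than a real district.

.. code-block:: python

   # Illustrative: map each IHIP spelling reported in the warning to the
   # corresponding name used in data/regionids.csv.
   district_name_fixes = {
       "Uttar Kannad": "UTTARA KANNADA",
       "Dakshin Kannad": "DAKSHINA KANNADA",
       "Bagalkot": "BAGALKOTE",
       "Chikballapur": "CHIKKABALLAPURA",
       "New Ihip Spelling": "NAME IN REGIONIDS.CSV",  # hypothetical new entry
   }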