Infrastructure Data Normalization

Raw data from external partners (such as KMZ files from telecom providers) often arrives with partner-specific layouts and column names. GigaSpatial provides dedicated schemas and processors to normalize this data into a standardized infrastructure database.

The "Why": Schemas and Processors

Schemas (e.g., TransmissionNodeTable): Define the "Structure" of how infrastructure data should look, ensuring that fields like node_type and transmission_medium are consistent across different sources.
Processors (e.g., EntityProcessor): Handle the "Action" of cleaning and deduplicating raw dataframes to match the schema requirements.

The Workflow

1. Reading Raw Datasets

GigaSpatial provides read_dataset as a unified entry point for KMZ, GPKG, Parquet, and Shapefiles.

from gigaspatial.core.io import read_dataset

# Load a raw KMZ or GPKG file from partner data
raw_df = read_dataset("/path/to/partner_data/kenya_fiber.kmz")
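
To illustrate what a "unified entry point" implies, here is a minimal sketch of choosing a reader from the file extension. The `FORMATS` table and `detect_format` helper are hypothetical names invented for this illustration; they are not part of GigaSpatial, and `read_dataset` may resolve formats differently.

```python
from pathlib import Path

# Hypothetical suffix-to-format table (illustrative, not GigaSpatial's registry).
FORMATS = {
    ".kmz": "kmz",
    ".gpkg": "geopackage",
    ".parquet": "parquet",
    ".shp": "shapefile",
}


def detect_format(path: str) -> str:
    """Return a format label for a path based on its file extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}") from None
```

Failing early on an unknown extension keeps unsupported partner files from reaching the normalization step.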

2. Normalizing to a Schema

Once loaded, we wrap the dataframe in a Schema class to enforce consistency and apply default values.

from gigaspatial.core.schemas.transmission_node import TransmissionNodeTable

# Initialize the table schema with our raw data
# This allows the library to map non-standard column names to our core schema
node_table = TransmissionNodeTable(raw_df)
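
The idea of mapping non-standard column names and applying defaults can be sketched with plain dictionaries. This is only a conceptual illustration: `COLUMN_ALIASES`, `DEFAULTS`, `normalize_record`, and the partner column names are assumptions made for the example, and `TransmissionNodeTable` may handle mapping in a different way.

```python
# Hypothetical partner-to-schema column mapping and default values (illustrative only).
COLUMN_ALIASES = {"Name": "name", "Type": "node_type", "Medium": "transmission_medium"}
DEFAULTS = {"node_type": "unknown", "transmission_medium": "unknown"}


def normalize_record(raw: dict) -> dict:
    """Rename known partner columns and fill missing schema fields with defaults."""
    # Unknown columns pass through unchanged; known ones take the schema name.
    renamed = {COLUMN_ALIASES.get(key, key): value for key, value in raw.items()}
    # Missing (None) values are dropped so the defaults can take their place.
    present = {key: value for key, value in renamed.items() if value is not None}
    return {**DEFAULTS, **present}


record = normalize_record({"Name": "Joint 5 - KITUI", "Medium": "fiber", "Type": None})
```

Here `record` carries the schema field names, with `node_type` falling back to its default because the partner value was missing.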

3. Cleaning and Deduplication

We use the EntityProcessor to perform high-level cleaning operations like geometry validation and row-level deduplication.

from gigaspatial.core.schemas.entity import EntityProcessor

# Initialize the processor for our node table
processor = EntityProcessor(node_table.df)

# Execute standard cleaning pipeline
# This removes duplicates based on coordinates and name
cleaned_df = processor.process(drop_duplicates=True)
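
As a rough sketch of what coordinate validation and deduplication on coordinates and name can look like, the snippet below works on plain dictionaries. The helper names, the `latitude`/`longitude`/`name` keys, and the rounding precision are assumptions for illustration; the actual rules applied by `EntityProcessor` may differ.

```python
def is_valid_point(lat: float, lon: float) -> bool:
    """Check that a coordinate pair lies within valid latitude/longitude bounds."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def clean_records(records: list[dict], precision: int = 6) -> list[dict]:
    """Drop invalid points, then keep the first record per (name, lat, lon) key."""
    seen = set()
    cleaned = []
    for rec in records:
        if not is_valid_point(rec["latitude"], rec["longitude"]):
            continue  # geometry validation: skip out-of-range coordinates
        # Names are compared case-insensitively; coordinates are rounded so
        # tiny floating-point differences do not hide duplicates.
        key = (
            rec["name"].strip().lower(),
            round(rec["latitude"], precision),
            round(rec["longitude"], precision),
        )
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)
    return cleaned
```

Keeping the first occurrence preserves the original row order, which makes reruns on the same input produce the same output.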

Rationale for this Combination

Separating reading (DataStore/IO), normalization (Schemas), and processing (EntityProcessor) keeps each stage independent, so the same cleaning pipeline can be rerun unchanged on new partner data.

For example, when processing ken-joints-manholes-handholes.gpkg seen in our production notebooks, this workflow allows the library to:

1. Normalize partner-specific names (like Joint 5 - KITUI) into a standard TransmissionNode.
2. Automatically extract longitude/latitude from nested KMZ geometries.
3. Identify "logical" nodes (exchanges) that serve as primary connectivity hubs.
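
The longitude/latitude extraction mentioned above can be sketched using only the standard library, since a KMZ file is a ZIP archive containing a KML document whose coordinates are written as `lon,lat[,alt]`. The `read_kmz_points` helper is a hypothetical name for this illustration and is not how GigaSpatial necessarily implements the step.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by KML 2.2 documents.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def read_kmz_points(source) -> list[dict]:
    """Return name/longitude/latitude for every point Placemark in a KMZ."""
    with zipfile.ZipFile(source) as archive:
        # Use the first .kml entry found in the archive.
        kml_name = next(n for n in archive.namelist() if n.lower().endswith(".kml"))
        root = ET.fromstring(archive.read(kml_name))

    points = []
    # ".//" searches at any depth, so Placemarks inside nested Folders are found.
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        coords = placemark.findtext(".//kml:Point/kml:coordinates", namespaces=KML_NS)
        if coords is None:
            continue  # not a point geometry (e.g. a line or polygon)
        lon, lat = (float(v) for v in coords.strip().split(",")[:2])
        points.append({
            "name": placemark.findtext("kml:name", default="", namespaces=KML_NS),
            "longitude": lon,
            "latitude": lat,
        })
    return points
```

`source` can be a file path or a binary file-like object, which makes the helper easy to try against an in-memory archive.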

This "Giga-ready" standardized data can then be passed into the PoiViewGenerator for catchment analysis.