Infrastructure Data Normalization
Raw data from external partners (like KMZs from telecom providers) often arrives in proprietary formats. GigaSpatial provides dedicated schemas and processors to normalize this data into a standardized infrastructure database.
The "Why": Schemas and Processors
- Schemas (e.g., TransmissionNodeTable): Define the "Structure" of how infrastructure data should look, ensuring that fields like node_type and transmission_medium are consistent across different sources.
- Processors (e.g., EntityProcessor): Handle the "Action" of cleaning and deduplicating raw dataframes to match the schema requirements.
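The split between "Structure" and "Action" can be illustrated without GigaSpatial itself. The following is a minimal standard-library sketch, not the library's implementation: `NodeRecord`, `clean_records`, and the `ALLOWED_NODE_TYPES` vocabulary are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

# Assumed vocabulary for illustration only; the real schema may allow other values.
ALLOWED_NODE_TYPES = {"exchange", "joint", "manhole", "handhole"}


@dataclass(frozen=True)
class NodeRecord:
    """Structure: the fields every normalized record must carry."""
    name: str
    node_type: str
    transmission_medium: str

    def __post_init__(self):
        if self.node_type not in ALLOWED_NODE_TYPES:
            raise ValueError(f"unknown node_type: {self.node_type!r}")


def clean_records(raw_rows):
    """Action: coerce raw dicts into NodeRecord instances, skipping invalid rows."""
    cleaned = []
    for row in raw_rows:
        try:
            cleaned.append(
                NodeRecord(
                    name=row["name"].strip(),
                    node_type=row["node_type"].strip().lower(),
                    transmission_medium=row.get("transmission_medium", "unknown").strip().lower(),
                )
            )
        except (KeyError, ValueError):
            continue  # a real processor would likely log or collect rejected rows
    return cleaned


# Sample rows are made up for the example.
raw_rows = [
    {"name": " Joint 5 - KITUI ", "node_type": "Joint", "transmission_medium": "Fiber"},
    {"name": "Mystery site", "node_type": "???"},
]
records = clean_records(raw_rows)
```

Keeping validation in the record type and cleaning in a separate function means the same structure can be reused with different cleaning rules per partner.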
The Workflow
1. Reading Raw Datasets
GigaSpatial provides read_dataset as a unified entry point for KMZ, GPKG, Parquet, and Shapefiles.
from gigaspatial.core.io import read_dataset
# Load a raw KMZ or GPKG file from partner data
raw_df = read_dataset("/path/to/partner_data/kenya_fiber.kmz")
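To give a sense of what reading a KMZ involves, the sketch below uses only the Python standard library. A KMZ file is a ZIP archive containing a KML document, so it can be opened with `zipfile` and parsed with `xml.etree.ElementTree`. This is not how `read_dataset` is implemented; `read_kmz_points` is a hypothetical helper, and the coordinates in the sample are illustrative values, not real partner data.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def read_kmz_points(source):
    """Return one dict per Point placemark found anywhere in the KMZ's KML file."""
    with zipfile.ZipFile(source) as archive:
        kml_name = next(n for n in archive.namelist() if n.lower().endswith(".kml"))
        root = ET.fromstring(archive.read(kml_name))

    rows = []
    # iter() walks the whole tree, so placemarks nested inside folders are found too.
    for placemark in root.iter("{http://www.opengis.net/kml/2.2}Placemark"):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS)
        coords = placemark.findtext(".//kml:Point/kml:coordinates", namespaces=KML_NS)
        if coords is None:
            continue  # lines and polygons are skipped in this sketch
        lon, lat, *_ = coords.strip().split(",")  # KML order is lon,lat[,alt]
        rows.append({"name": name.strip(), "longitude": float(lon), "latitude": float(lat)})
    return rows


# Build a tiny in-memory KMZ so the example runs without any file on disk.
sample_kml = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Folder>
      <Placemark>
        <name>Joint 5 - KITUI</name>
        <Point><coordinates>38.0106,-1.3670,0</coordinates></Point>
      </Placemark>
    </Folder>
  </Document>
</kml>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("doc.kml", sample_kml)
buffer.seek(0)

points = read_kmz_points(buffer)
```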
2. Normalizing to a Schema
Once loaded, we wrap the dataframe in a Schema class to enforce consistency and apply default values.
from gigaspatial.core.schemas.transmission_node import TransmissionNodeTable
# Initialize the table schema with our raw data
# This allows the library to map non-standard column names to our core schema
node_table = TransmissionNodeTable(raw_df)
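Conceptually, mapping non-standard column names and applying defaults could look like the sketch below. It uses plain dictionaries rather than the library's classes; `COLUMN_ALIASES`, `DEFAULTS`, and `normalize_row` are hypothetical, and the partner column names and the "Exchange A" row are invented for the example.

```python
# Hypothetical alias map: partner column name -> standard column name.
COLUMN_ALIASES = {
    "Name": "name",
    "site_name": "name",
    "Type": "node_type",
    "medium": "transmission_medium",
}

# Hypothetical defaults applied when a standard field is missing.
DEFAULTS = {"node_type": "unknown", "transmission_medium": "unknown"}


def normalize_row(raw_row):
    """Rename known aliases, then fill in defaults for missing standard fields."""
    renamed = {COLUMN_ALIASES.get(key, key): value for key, value in raw_row.items()}
    # Values present in the row win over the defaults.
    return {**DEFAULTS, **renamed}


partner_rows = [
    {"Name": "Joint 5 - KITUI", "Type": "joint"},
    {"site_name": "Exchange A", "medium": "fiber"},
]
normalized = [normalize_row(row) for row in partner_rows]
```

After this step every row exposes the same keys, whichever partner supplied it.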
3. Cleaning and Deduplication
We use the EntityProcessor to perform high-level cleaning operations like geometry validation and row-level deduplication.
from gigaspatial.core.schemas.entity import EntityProcessor
# Initialize the processor for our node table
processor = EntityProcessor(node_table.df)
# Execute standard cleaning pipeline
# This removes duplicates based on coordinates and name
cleaned_df = processor.process(drop_duplicates=True)
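For intuition, deduplication on coordinates and name might work along the lines of the sketch below. This is an assumption about the general technique, not a description of what `EntityProcessor` does internally; `has_valid_coordinates`, `drop_duplicate_nodes`, the rounding precision, and the sample rows are all hypothetical.

```python
def has_valid_coordinates(row):
    """Basic geometry check: longitude and latitude must fall within WGS84 bounds."""
    return -180.0 <= row["longitude"] <= 180.0 and -90.0 <= row["latitude"] <= 90.0


def drop_duplicate_nodes(rows, precision=6):
    """Keep the first valid row for each (name, rounded lon, rounded lat) key."""
    seen = set()
    unique = []
    for row in rows:
        if not has_valid_coordinates(row):
            continue
        key = (
            row["name"].strip().casefold(),  # ignore case and stray whitespace
            round(row["longitude"], precision),
            round(row["latitude"], precision),
        )
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


rows = [
    {"name": "Joint 5 - KITUI", "longitude": 38.0106, "latitude": -1.3670},
    {"name": "joint 5 - kitui ", "longitude": 38.0106, "latitude": -1.3670},  # duplicate
    {"name": "Joint 6 - KITUI", "longitude": 38.0106, "latitude": -1.3670},  # same spot, other name
    {"name": "Bad row", "longitude": 380.0, "latitude": -1.3670},  # invalid longitude
]
deduplicated = drop_duplicate_nodes(rows)
```

Rounding coordinates before comparison is a design choice: it treats points that differ only by floating-point noise as the same location, at the cost of merging nodes closer together than the chosen precision.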
Rationale for this Combination
By separating Reading (DataStore/IO), Normalization (Schemas), and Processing (EntityProcessor) into distinct steps, GigaSpatial makes data cleaning pipelines repeatable.
For example, when processing ken-joints-manholes-handholes.gpkg, as seen in our production notebooks, this workflow allows the library to:
1. Normalize partner-specific names (like Joint 5 - KITUI) into a standard TransmissionNode.
2. Automatically extract longitude/latitude from nested KMZ geometries.
3. Identify "logical" nodes (exchanges) that serve as primary connectivity hubs.
This "Giga-ready" standardized data can then be passed into the PoiViewGenerator for catchment analysis.