SNAP Data Ingestion
This page simulates how datasets are onboarded into SNAP from external providers and internal collection workflows.
Ingestion pipeline stages
- Source registration
- Schema mapping
- Validation checks
- Transformation and harmonization
- Load into staging
- Quality review
- Publish to catalog
Source registration
At registration time, SNAP stores:
- Source organization
- Endpoint or delivery method
- File format
- Update frequency
- Contact owner
- Licensing terms
Accepted formats (debug set)
- CSV (UTF-8)
- JSON / NDJSON
- Parquet
- XLSX (with explicit tab selection)
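As a rough illustration, the registration fields and the accepted-format list could be modeled as below. The class name `SourceRegistration`, its field names, the format labels, and the sample values are assumptions made for this sketch, not SNAP's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed lowercase labels for the accepted formats listed above.
ACCEPTED_FORMATS = {"csv", "json", "ndjson", "parquet", "xlsx"}


@dataclass(frozen=True)
class SourceRegistration:
    """Hypothetical record holding what is stored at registration time."""
    organization: str
    delivery: str            # endpoint or delivery method
    file_format: str
    update_frequency: str
    contact_owner: str
    licensing_terms: str
    xlsx_tab: Optional[str] = None  # only meaningful for XLSX sources

    def __post_init__(self):
        # Reject formats outside the accepted set.
        if self.file_format not in ACCEPTED_FORMATS:
            raise ValueError(f"unsupported format: {self.file_format}")
        # XLSX is accepted only with an explicit tab selection.
        if self.file_format == "xlsx" and not self.xlsx_tab:
            raise ValueError("XLSX sources need an explicit tab selection")


# Sample values are invented for illustration.
registration = SourceRegistration(
    organization="Example Statistics Office",
    delivery="sftp-drop",
    file_format="csv",
    update_frequency="quarterly",
    contact_owner="data-steward@example.org",
    licensing_terms="CC BY 4.0",
)
```

Validating in `__post_init__` means a malformed registration fails at creation time rather than later in the pipeline.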
Validation checks
Validation is performed before publication:
- Required columns exist
- Data types match schema
- Date fields are parseable and normalized
- Categorical values match allowed lists
- Numeric fields pass bounds checks
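The checks above can be sketched as a single row-level function. The schema layout, the column names (`region`, `ref_date`, `rate`), the allowed values, and the bounds are all hypothetical; the real rules would come from the registered schema.

```python
from datetime import date

# Hypothetical schema: one rule set per required column.
SCHEMA = {
    "region": {"type": str, "allowed": {"DE2", "FR1"}},
    "ref_date": {"type": str, "is_date": True},
    "rate": {"type": (int, float), "min": 0.0, "max": 100.0},
}


def validate_row(row: dict) -> list:
    """Return a list of failure messages; an empty list means the row passes."""
    failures = []
    for column, rule in SCHEMA.items():
        # Required columns exist.
        if column not in row:
            failures.append(f"{column}: required column missing")
            continue
        value = row[column]
        # Data types match the schema.
        if not isinstance(value, rule["type"]):
            failures.append(f"{column}: wrong data type")
            continue
        # Date fields are parseable (ISO format assumed here).
        if rule.get("is_date"):
            try:
                date.fromisoformat(value)
            except ValueError:
                failures.append(f"{column}: unparseable date")
        # Categorical values match the allowed list.
        if "allowed" in rule and value not in rule["allowed"]:
            failures.append(f"{column}: value not in allowed list")
        # Numeric fields pass bounds checks.
        if "min" in rule and not (rule["min"] <= value <= rule["max"]):
            failures.append(f"{column}: out of bounds")
    return failures
```

Collecting every failure instead of stopping at the first one keeps the per-rule detail that a run log would need.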
Harmonization rules
To align heterogeneous inputs:
- Region codes are mapped to standard NUTS codes.
- Time references are normalized to ISO 8601 dates.
- Units are converted to canonical forms where possible.
- Missing values are tagged with explicit null reasons.
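A minimal sketch of these four rules follows. The lookup tables, the accepted input date formats, the unit names, and the null-reason labels are illustrative assumptions; a real deployment would load them from maintained reference data.

```python
from datetime import datetime

# Illustrative lookups only; real mappings would come from reference tables.
REGION_TO_NUTS = {"Bayern": "DE2"}
INPUT_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")
UNIT_TO_PERSONS = {"persons": 1, "thousand persons": 1000}


def to_iso_date(text: str):
    """Try each assumed input format; return an ISO date string or None."""
    for fmt in INPUT_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def harmonize(record: dict) -> dict:
    """Return a harmonized copy with explicit reasons for every null field."""
    out = {"null_reasons": {}}

    # Region names or codes -> NUTS code.
    out["region"] = REGION_TO_NUTS.get(record.get("region"))
    if out["region"] is None:
        out["null_reasons"]["region"] = "unmapped-region"

    # Time reference -> ISO date.
    raw_date = record.get("ref_date")
    out["ref_date"] = to_iso_date(raw_date) if raw_date else None
    if out["ref_date"] is None:
        reason = "unparseable-date" if raw_date else "not-reported"
        out["null_reasons"]["ref_date"] = reason

    # Unit conversion to a canonical unit (persons), where possible.
    value, factor = record.get("value"), UNIT_TO_PERSONS.get(record.get("unit"))
    if value is None:
        out["value"] = None
        out["null_reasons"]["value"] = "not-reported"
    elif factor is None:
        out["value"] = None
        out["null_reasons"]["value"] = "unknown-unit"
    else:
        out["value"] = value * factor
    return out
```

Keeping the null reasons in a separate mapping lets downstream consumers distinguish "not reported" from "could not be converted" without overloading the value itself.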
Ingestion status model
Every ingestion run has a status:
- queued
- running
- failed-validation
- loaded-staging
- published
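The five statuses (queued, running, failed-validation, loaded-staging, published) could be modeled as an enumeration with guarded transitions. The transition table below is an assumption inferred from the order of the pipeline stages; the source does not specify which moves are permitted.

```python
from enum import Enum


class RunStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    FAILED_VALIDATION = "failed-validation"
    LOADED_STAGING = "loaded-staging"
    PUBLISHED = "published"


# Assumed transitions: a run either fails validation or moves on to staging,
# and only staged runs can be published.
ALLOWED_TRANSITIONS = {
    RunStatus.QUEUED: {RunStatus.RUNNING},
    RunStatus.RUNNING: {RunStatus.FAILED_VALIDATION, RunStatus.LOADED_STAGING},
    RunStatus.LOADED_STAGING: {RunStatus.PUBLISHED},
    RunStatus.FAILED_VALIDATION: set(),
    RunStatus.PUBLISHED: set(),
}


def advance(current: RunStatus, target: RunStatus) -> RunStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

For example, `advance(RunStatus.QUEUED, RunStatus.RUNNING)` succeeds, while attempting to publish a run that failed validation raises `ValueError`.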
Operational metrics
Useful debug metrics:
- Total files processed
- Success/failure ratio
- Average processing duration
- Rows rejected by validation
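These metrics can be derived from per-file results, as in the sketch below. `FileResult` and `summarize` are hypothetical names, and the success/failure ratio is reported here as a success rate (successes divided by total files), which is one possible reading of that metric.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    """Hypothetical outcome of processing one file."""
    succeeded: bool
    duration_seconds: float
    rows_rejected: int


def summarize(results: list) -> dict:
    """Aggregate per-file results into the debug metrics listed above."""
    total = len(results)
    successes = sum(1 for r in results if r.succeeded)
    return {
        "total_files": total,
        "successes": successes,
        "failures": total - successes,
        # Rates and averages are undefined when no files were processed.
        "success_rate": successes / total if total else None,
        "avg_duration_seconds": (
            sum(r.duration_seconds for r in results) / total if total else None
        ),
        "rows_rejected": sum(r.rows_rejected for r in results),
    }
```

Returning `None` for an empty run avoids a division by zero and makes "no data" distinguishable from a genuine rate of 0.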
Simulated CLI snippets
echo "ingest source: eu-lfs"
echo "validate schema: employment-v2"
echo "publish dataset: regional-employment-rate"
Failure handling
When ingestion fails:
- The dataset remains hidden from the public catalog.
- A run log is retained with per-rule failures.
- The data steward receives a notification entry.
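The failure path might look like the following sketch. The function name, the run-log fields, and the use of a plain list as the steward's notification inbox are assumptions for illustration only.

```python
from collections import Counter


def handle_failed_run(run_id: str, rule_failures: list, notifications: list) -> dict:
    """Build a retained run log and queue a notification for the data steward.

    rule_failures holds one rule name per rejected check (hypothetical shape).
    """
    run_log = {
        "run_id": run_id,
        "status": "failed-validation",
        "publicly_visible": False,  # dataset stays hidden from the catalog
        "failures_by_rule": dict(Counter(rule_failures)),
    }
    notifications.append(
        f"run {run_id}: {len(rule_failures)} validation failure(s)"
    )
    return run_log


# Usage with invented values.
inbox = []
log = handle_failed_run("run-001", ["bounds", "bounds", "date"], inbox)
```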
Debug checklist
- Run appears in ingestion history.
- Validation summary is visible.
- Published datasets include a refresh timestamp.
