Reporting and Statistics Design
1. Goals
- Provide a single source of truth for building time-series facts.
- Derive deterministic global and spatial statistics from facts.
- Reuse the same reducer library for single runs and ensembles.
- Standardize output artifacts (Parquet + manifest/provenance/spec/catalog).
- Keep the design MPI-friendly, mergeable, and extensible.
2. Scope
- This document defines Phase 2 contracts: schemas, interfaces, and integration points.
- Implementation details are deferred to Phase 3.
3. Artifacts and Data Flow
| Artifact | Purpose | Location (logical) |
|---|---|---|
| | Facts store (building time series) | |
| | Building spatial index (district mappings) | |
| | Global stats (city/district/type) | |
| | Optional per-building summary | |
| | Per-building output time series (HDF5) | |
| | Run artifact index + schema ids | |
| | Run provenance (git, cmdline, env) | |
| | Reporting window + KPI selection | |
| | KPI definitions and aggregations | |
4. Schema IDs and Versions
| Schema ID | Version | Artifact |
|---|---|---|
| | | |
| | | |
| | | |
| | | |
| | | |
Manifest entries must include schema_id, schema_version, and sha256 for each artifact.
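The requirement above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical in-memory struct `ManifestEntry` and a hypothetical helper `missingFields`; neither name comes from the codebase, and real manifest parsing is out of scope here.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory view of one manifest artifact entry.
struct ManifestEntry
{
    std::string dataset;
    std::string schema_id;
    std::string schema_version;
    std::string sha256;
};

// Returns the names of the required fields that are empty for this entry.
std::vector<std::string> missingFields( const ManifestEntry& e )
{
    std::vector<std::string> missing;
    if ( e.schema_id.empty() )
        missing.push_back( "schema_id" );
    if ( e.schema_version.empty() )
        missing.push_back( "schema_version" );
    if ( e.sha256.empty() )
        missing.push_back( "sha256" );
    return missing;
}
```

An entry is acceptable when `missingFields` returns an empty vector; a writer could run this check on every artifact before serializing the manifest.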
5. Manifest Schema (v4)
The run-root manifest.json provides a complete, checksummed inventory of artifacts.
Key fields:
- schema_version: Manifest schema version (4 for the enriched format).
- artifact_base: Absolute path to the run root.
- repository_root: Optional absolute path to the source repository.
- artifacts: List of datasets with schema_id, schema_version, sha256, and rank metadata.
- refs: References to setup.json, provenance.json, and optional report_spec.yaml and kpi_catalog.yaml.
- Legacy artifacts use schema ids with the prefix "kub.legacy.".
{
  "schema_version": 4,
  "run_id": "run_2025-12-27_10-17-24",
  "run_type": "Single",
  "artifact_base": "/path/to/run_2025-12-27_10-17-24",
  "refs": {
    "setup": "setup.json",
    "provenance": "provenance.json",
    "report_spec": "report_spec.yaml",
    "kpi_catalog": "kpi_catalog.yaml"
  },
  "artifacts": [
    {
      "dataset": "global_kpi_summary",
      "path": "database/stats/global_kpi_summary.parquet",
      "type": "shared",
      "schema_id": "kub.global_kpi_summary",
      "schema_version": "1.0",
      "sha256": "..."
    },
    {
      "dataset": "building_kpi_ts",
      "path": "database/report/building_kpi_ts.part-r*.parquet",
      "type": "partitioned",
      "schema_id": "kub.building_kpi_ts",
      "schema_version": "1.0",
      "partitioned": true,
      "files": [
        { "path": "database/report/building_kpi_ts.part-r00000.parquet", "sha256": "...", "rank": 0 }
      ],
      "sha256": "..."
    }
  ]
}
6. Manifest Validation
Use the validation helper to check file existence and checksum integrity:
./validate_manifest /path/to/run_2025-12-27_10-17-24/manifest.json
7. Parquet Schemas (v1.0)
7.1. Facts Store: building_kpi_ts.parquet
| Column | Type | Nullable | Description |
|---|---|---|---|
| | | no | Timestamp in seconds (UTC, run clock). |
| | | no | Building identifier. |
| | | no | KPI identifier (from catalog). |
| | | no | Observed value. |
| | | yes | Unit (optional). |
| | | yes | Run identifier (for multi-run ingestion). |
| | | yes | Scenario identifier if available. |
| | | yes | Ensemble sample identifier. |
7.2. Global Summary: global_kpi_summary.parquet
| Column | Type | Nullable | Description |
|---|---|---|---|
| | | no | Aggregation level: |
| | | no | Entity identifier (city id, district id, type). |
| | | no | KPI identifier. |
| | | no | Metric name (mean, stddev, p50, peak_total, etc.). |
| | | no | Metric value. |
| | | yes | Unit (optional). |
| | | yes | Sample count (buildings or timesteps). |
| | | yes | Report window start. |
| | | yes | Report window end. |
| | | yes | EPC proxy mode if relevant. |
7.3. Building Summary: building_kpi_summary.parquet
| Column | Type | Nullable | Description |
|---|---|---|---|
| | | yes | Ensemble sample identifier. |
| | | no | Building identifier. |
| | | no | KPI identifier. |
| | | no | Metric name (mean, stddev, p50, etc.). |
| | | no | Metric value. |
| | | yes | Unit (optional). |
| | | yes | Sample count (timesteps or samples). |
| | | yes | Report window start. |
| | | yes | Report window end. |
| | | yes | EPC/DPE class label. |
| | | yes | EPC proxy mode. |
7.4. Spatial Index: building_spatial_index.parquet
| Column | Type | Nullable | Description |
|---|---|---|---|
| | | no | Building identifier. |
| | | no | Longitude (EPSG:4326). |
| | | no | Latitude (EPSG:4326). |
| | | no | Scheme id (e.g., |
| | | no | District identifier under scheme. |
| | | yes | LAU identifier (if available). |
| | | yes | NUTS1 id. |
| | | yes | NUTS2 id. |
| | | yes | NUTS3 id. |
8. HDF5 Schemas (v1.0)
8.1. Building Outputs Time Series: building_outputs_ts.h5
The HDF5 file stores per-building, per-output time series. The logical schema
is described below; the physical layout is an aggregated group with arrays
under /aggregated/variables/<output_name>/.
8.1.1. Single Run (Logical Schema)
| Field | Type | Description |
|---|---|---|
| | | Simulation time (seconds from start). |
| | | Building identifier. |
| | | Output variable name (e.g., |
| | | Simulated value (single run). |
| | | Physical unit (e.g., |
8.1.2. Ensemble Run (Logical Schema)
| Field | Type | Description |
|---|---|---|
| | | Simulation time (seconds from start). |
| | | Building identifier. |
| | | Output variable name. |
| | | Mean across samples. |
| | | Standard deviation across samples. |
| | | Minimum across samples. |
| | | Maximum across samples. |
| | | 5th percentile. |
| | | Median (50th percentile). |
| | | 95th percentile. |
| | | Number of samples contributing. |
9. Report Spec and KPI Catalog (YAML/JSON)
9.1. report_spec.yaml
Minimum fields:
- window.mode: full | last_day | last_week | last_month | custom
- window.start_time, window.end_time: required for custom mode
- levels: ["city", "district", "typology", "building"]
- kpis: list of KPI ids to compute
- building_level: boolean to enable building_kpi_summary.parquet
- export_legacy_json: boolean to keep report.json during transition
- unit_validation_mode: strict | warn | off
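The window modes listed above could be resolved to a concrete time range along the following lines. This is only a sketch under stated assumptions: the `Window` struct and `resolveWindow` function are hypothetical names, times are taken as run-clock seconds, `last_month` is treated as 30 days, and the window is clamped to the start of the run. None of these choices is fixed by the spec.

```cpp
#include <algorithm>
#include <optional>
#include <stdexcept>
#include <string>

// Reporting interval in run-clock seconds (hypothetical type).
struct Window
{
    double start;
    double end;
};

Window resolveWindow( const std::string& mode, double runStart, double runEnd,
                      std::optional<double> customStart = std::nullopt,
                      std::optional<double> customEnd = std::nullopt )
{
    constexpr double day = 86400.0;
    // Trailing window of the given length, clamped to the start of the run.
    auto tail = [&]( double length ) {
        return Window{ std::max( runStart, runEnd - length ), runEnd };
    };
    if ( mode == "full" )
        return Window{ runStart, runEnd };
    if ( mode == "last_day" )
        return tail( day );
    if ( mode == "last_week" )
        return tail( 7 * day );
    if ( mode == "last_month" )
        return tail( 30 * day ); // assumption: a month is 30 days
    if ( mode == "custom" )
    {
        // The spec requires both bounds in custom mode.
        if ( !customStart || !customEnd )
            throw std::invalid_argument( "custom mode requires window.start_time and window.end_time" );
        return Window{ *customStart, *customEnd };
    }
    throw std::invalid_argument( "unknown window.mode: " + mode );
}
```

Rejecting unknown modes early keeps a typo in report_spec.yaml from silently producing a full-run report.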
10. Reducer Library (Contracts)
Reducer concepts and interfaces live under src/cpp/ktirio/ub/stats/:
- Mergeable, Finalizable, ReducerType concepts
- IReducer interface with update, merge, finalize, clone
- ReducerFactory to create reducers by KPI definition
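#include <concepts>

// Mergeable: another partial state can be folded into this one in place.
template <typename T>
concept Mergeable = requires( T a, const T& b )
{
    { a.merge( b ) } -> std::same_as<void>;
};

// Finalizable: a const instance can produce a value of its result_type.
template <typename T>
concept Finalizable = requires( const T& a )
{
    typename T::result_type;
    { a.finalize() } -> std::convertible_to<typename T::result_type>;
};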
11. StatsEngine (Facade + Strategy)
The engine provides a unified entry point for single-run and ensemble workflows.
Standard pipeline (Template Method):
- Load inputs
- Validate inputs
- Compute (delegates to strategy)
- Write outputs (Parquet + manifest updates)
class StatsEngine
{
public:
    explicit StatsEngine( StatsEngineInputs inputs );
    StatsEngineOutputs run();
};
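The pipeline and the strategy delegation might fit together as in the sketch below. Everything beyond the names `StatsEngine`, `StatsEngineInputs`, `StatsEngineOutputs`, and `IComputeStrategy` is an assumption made for illustration: the payload fields, the `SingleRunStrategy` class, and in particular the two-argument constructor that injects the strategy (the declaration above takes only the inputs).

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical payloads; the real field sets are not specified in this document.
struct StatsEngineInputs
{
    std::vector<double> values;
};
struct StatsEngineOutputs
{
    std::string strategy;
    double total;
};

// Strategy: the compute step differs between single runs and ensembles.
class IComputeStrategy
{
public:
    virtual ~IComputeStrategy() = default;
    virtual StatsEngineOutputs compute( const StatsEngineInputs& in ) const = 0;
};

// Hypothetical single-run strategy: sums the input values.
class SingleRunStrategy : public IComputeStrategy
{
public:
    StatsEngineOutputs compute( const StatsEngineInputs& in ) const override
    {
        StatsEngineOutputs out{ "single", 0.0 };
        for ( double v : in.values )
            out.total += v;
        return out;
    }
};

class StatsEngine
{
public:
    // Assumption: the strategy is injected; the real engine may select it internally.
    StatsEngine( StatsEngineInputs inputs, std::unique_ptr<IComputeStrategy> strategy )
        : inputs_( std::move( inputs ) ), strategy_( std::move( strategy ) )
    {
    }

    // Template Method: the step order is fixed, only the compute step varies.
    StatsEngineOutputs run()
    {
        load();
        validate();
        StatsEngineOutputs out = strategy_->compute( inputs_ );
        write( out );
        return out;
    }

private:
    void load() {} // placeholder: would read facts, spec, and catalog
    void validate() const
    {
        if ( inputs_.values.empty() )
            throw std::invalid_argument( "no input values" );
    }
    void write( const StatsEngineOutputs& ) {} // placeholder: Parquet + manifest updates

    StatsEngineInputs inputs_;
    std::unique_ptr<IComputeStrategy> strategy_;
};
```

An ensemble workflow would pass a different `IComputeStrategy` implementation while reusing the same load, validate, and write steps.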
12. Ensemble Integration Points
Current entry points that will call the StatsEngine:
- CemRunner::emitAnalytics (src/cpp/ktirio/ub/app/cemrunner.cpp:576)
- Ensemble rounds and sample execution (src/cpp/ktirio/ub/ensemble/ensembleworkflow_rounds.cpp:171)
- Ensemble time-series IO (src/cpp/ktirio/ub/ensemble/ensembleworkflow_internal.hpp:260)
- Single-run reporting aggregation (src/cpp/ktirio/ub/cityenergymodel.cpp:1215)
13. Spatial Index Strategy
- Use building_spatial_index.parquet as a reusable pre-step.
- If districts are missing, compute from centroid using scheme-specific fallback.
- Schemes: eu_grid_100km (default), h3_resolution_7, lau, custom.
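The "use the mapping if present, otherwise compute from the centroid" rule can be sketched as follows. All names here (`Centroid`, `DistrictFallback`, `resolveDistrict`, `oneDegreeCell`) are hypothetical, and the toy fallback uses plain 1-degree longitude/latitude cells purely for illustration; it does not implement eu_grid_100km, h3_resolution_7, or lau.

```cpp
#include <cmath>
#include <functional>
#include <optional>
#include <string>

// Building centroid in EPSG:4326 degrees.
struct Centroid
{
    double lon;
    double lat;
};

// Hypothetical scheme-specific fallback: maps a centroid to a district id.
using DistrictFallback = std::function<std::string( const Centroid& )>;

// Prefer the district already present in the index; otherwise derive one.
std::string resolveDistrict( const std::optional<std::string>& mapped,
                             const Centroid& centroid,
                             const DistrictFallback& fallback )
{
    if ( mapped && !mapped->empty() )
        return *mapped;
    return fallback( centroid );
}

// Toy fallback for illustration only: 1-degree lon/lat cells.
// This is NOT the eu_grid_100km scheme, whose definition is not given here.
std::string oneDegreeCell( const Centroid& c )
{
    const long x = static_cast<long>( std::floor( c.lon ) );
    const long y = static_cast<long>( std::floor( c.lat ) );
    return "cell_" + std::to_string( x ) + "_" + std::to_string( y );
}
```

Passing the fallback as a callable keeps the lookup logic identical across schemes; only the function supplied for each scheme changes.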
14. EPC/DPE Proxy
- Compute E_primary_m2_y (kWh/m²/year).
- Classification modes:
  - country_rules (if rules exist)
  - distribution_screening (quantile-based)
- Expose probabilities P(class) and P(passoire=F/G) from ensemble histograms.
- Outputs must include epc_mode and a disclaimer: "screening indicator, not legal classification".
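One way to read "probabilities from ensemble histograms" is as empirical class frequencies over the ensemble samples. The sketch below assumes each sample has already been assigned a class label 'A' to 'G' by whichever classification mode is active; `classProbabilities` and `passoireProbability` are hypothetical names.

```cpp
#include <map>
#include <vector>

// Empirical P(class): fraction of ensemble samples that fall in each class.
// Assumption: one label per sample, already classified upstream.
std::map<char, double> classProbabilities( const std::vector<char>& labels )
{
    std::map<char, double> p;
    if ( labels.empty() )
        return p;
    for ( char c : labels )
        p[c] += 1.0; // histogram counts
    for ( auto& [cls, value] : p )
        value /= static_cast<double>( labels.size() );
    return p;
}

// P(passoire = F/G): probability mass of classes F and G combined.
double passoireProbability( const std::map<char, double>& p )
{
    auto get = [&]( char c ) {
        const auto it = p.find( c );
        return it == p.end() ? 0.0 : it->second;
    };
    return get( 'F' ) + get( 'G' );
}
```

For example, an ensemble of four samples labeled D, E, F, G gives P(class) = 0.25 for each of those classes and P(passoire=F/G) = 0.5.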
15. Unit Validation
StatsEngine validates KPI units at load time.
- Accepted canonical units: W, kW, kWh, degC, K, m3/s, kg_per_m3, 1, %, Pa, ppm, kWh/m2/y.
- Synonyms are normalized (e.g., celsius → degC, kg/m3 → kg_per_m3, kWh/m2/year → kWh/m2/y).
- Custom units must be prefixed with custom: to bypass validation.
- Behavior is controlled by unit_validation_mode in report_spec.yaml:
  - strict: throw on invalid units
  - warn: log warnings and continue
  - off: skip validation
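The rules above can be sketched as a small validator. The function and enum names (`normalizeUnit`, `isAcceptedUnit`, `validateUnit`, `UnitValidationMode`) are hypothetical, the synonym table contains only the three examples given in the text, and logging to `std::cerr` stands in for whatever logger the engine actually uses.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

enum class UnitValidationMode { Strict, Warn, Off };

// Maps known synonyms to their canonical spelling; other strings pass through.
std::string normalizeUnit( const std::string& unit )
{
    static const std::map<std::string, std::string> synonyms{
        { "celsius", "degC" }, { "kg/m3", "kg_per_m3" }, { "kWh/m2/year", "kWh/m2/y" } };
    const auto it = synonyms.find( unit );
    return it == synonyms.end() ? unit : it->second;
}

// True for canonical units, their synonyms, and anything prefixed with "custom:".
bool isAcceptedUnit( const std::string& unit )
{
    static const std::set<std::string> canonical{
        "W", "kW", "kWh", "degC", "K", "m3/s", "kg_per_m3", "1", "%", "Pa", "ppm", "kWh/m2/y" };
    if ( unit.rfind( "custom:", 0 ) == 0 )
        return true; // custom units bypass validation
    return canonical.count( normalizeUnit( unit ) ) > 0;
}

// Off: always true. Strict: throws on an invalid unit.
// Warn: logs and returns false on an invalid unit so the caller can continue.
bool validateUnit( const std::string& unit, UnitValidationMode mode )
{
    if ( mode == UnitValidationMode::Off )
        return true;
    if ( isAcceptedUnit( unit ) )
        return true;
    if ( mode == UnitValidationMode::Strict )
        throw std::invalid_argument( "invalid unit: " + unit );
    std::cerr << "warning: invalid unit '" << unit << "'\n";
    return false;
}
```

Normalizing before the lookup means the canonical set stays small and new synonyms only touch one table.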
16. Manifest and Provenance
Manifest must include:
- schema_id + schema_version per artifact
- sha256 checksum for integrity checks
- References to setup.json, provenance.json, report_spec.yaml, kpi_catalog.yaml
Provenance must capture (schema_version 2):
- git commit hash and dirty state
- command line and environment
- timestamps and resource summary
- cem_version and feelpp_version
17. Fallback Plan
- If Arrow/Parquet is unavailable, fall back to JSON/CSV.
- Manifest must mark backend=json and list schema ids with schema_version.
- Document a migration path back to Parquet when enabled.
18. UML (Facade + Strategy + Factory)
+------------------+         +---------------------+
| StatsEngine      |-------> | IComputeStrategy    |
| (Facade/Template)|         | (Strategy)          |
+------------------+         +----------+----------+
         |                              |
         v                              v
+------------------+         +---------------------+
| ReducerFactory   |-------> | IReducer            |
| (Factory)        |         | (Mergeable)         |
+------------------+         +---------------------+
19. Arrow Schema Definitions
Code-ready Arrow schema builders are defined in:
- src/cpp/ktirio/ub/stats/schema.hpp

In the manifest, shared artifacts appear once with "type": "shared". Partitioned artifacts are grouped under a single entry with "type": "partitioned" and a files array containing per-rank paths and checksums.