Reporting and Statistics Design

1. Goals

  • Provide a single source of truth for building time-series facts.

  • Derive deterministic global and spatial statistics from facts.

  • Reuse the same reducer library for single runs and ensembles.

  • Standardize output artifacts (Parquet + manifest/provenance/spec/catalog).

  • Keep the design MPI-friendly, mergeable, and extensible.

2. Scope

  • This document defines Phase 2 contracts: schemas, interfaces, and integration points.

  • Implementation details are deferred to Phase 3.

3. Artifacts and Data Flow

Artifact                       | Purpose                                    | Location (logical)
-------------------------------+--------------------------------------------+-----------------------------------------------------------
building_kpi_ts.parquet        | Facts store (building time series)         | outputs/timeseries/
building_spatial_index.parquet | Building spatial index (district mappings) | outputs/mappings/
global_kpi_summary.parquet     | Global stats (city/district/type)          | outputs/stats/
building_kpi_summary.parquet   | Optional per-building summary              | outputs/stats/
building_outputs_ts.h5         | Per-building output time series (HDF5)     | database/timeseries/ (single) / ensemble/stats/ (ensemble)
manifest.json                  | Run artifact index + schema ids            | run_root/
provenance.json                | Run provenance (git, cmdline, env)         | run_root/
report_spec.yaml               | Reporting window + KPI selection           | run_root/
kpi_catalog.yaml               | KPI definitions and aggregations           | run_root/

4. Schema IDs and Versions

Schema ID                    | Version | Artifact
-----------------------------+---------+-------------------------------
kub.building_kpi_ts          | 1.0     | building_kpi_ts.parquet
kub.global_kpi_summary       | 1.0     | global_kpi_summary.parquet
kub.building_kpi_summary     | 1.0     | building_kpi_summary.parquet
kub.building_spatial_index   | 1.0     | building_spatial_index.parquet
kub.building_outputs_ts_hdf5 | 1.0     | building_outputs_ts.h5

Manifest entries must include schema_id, schema_version, and sha256 for each artifact.

5. Manifest Schema (v4)

The run-root manifest.json provides a complete, checksummed inventory of artifacts.

Key fields:

  • schema_version: Manifest schema version (4 for the enriched format).

  • artifact_base: Absolute path to the run root.

  • repository_root: Optional absolute path to the source repository.

  • artifacts: List of datasets with schema_id, schema_version, sha256, and rank metadata.

  • refs: References to setup.json, provenance.json, and optional report_spec.yaml and kpi_catalog.yaml.

  • Schema ids of legacy artifacts start with the prefix "kub.legacy." (trailing dot included).

{
  "schema_version": 4,
  "run_id": "run_2025-12-27_10-17-24",
  "run_type": "Single",
  "artifact_base": "/path/to/run_2025-12-27_10-17-24",
  "refs": {
    "setup": "setup.json",
    "provenance": "provenance.json",
    "report_spec": "report_spec.yaml",
    "kpi_catalog": "kpi_catalog.yaml"
  },
  "artifacts": [
    {
      "dataset": "global_kpi_summary",
      "path": "database/stats/global_kpi_summary.parquet",
      "type": "shared",
      "schema_id": "kub.global_kpi_summary",
      "schema_version": "1.0",
      "sha256": "..."
    },
    {
      "dataset": "building_kpi_ts",
      "path": "database/report/building_kpi_ts.part-r*.parquet",
      "type": "partitioned",
      "schema_id": "kub.building_kpi_ts",
      "schema_version": "1.0",
      "partitioned": true,
      "files": [
        { "path": "database/report/building_kpi_ts.part-r00000.parquet", "sha256": "...", "rank": 0 }
      ],
      "sha256": "..."
    }
  ]
}

6. Manifest Validation

Use the validation helper to check file existence and checksum integrity:

./validate_manifest /path/to/run_2025-12-27_10-17-24/manifest.json
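The two checks named above (file existence and checksum integrity) can be sketched as follows. This is a minimal illustration, not the helper's actual implementation: the ManifestFile struct, the validateFiles function, and the injected Sha256Fn hasher are hypothetical names, and the sketch assumes the manifest has already been parsed into memory.

```cpp
#include <cassert>
#include <filesystem>
#include <functional>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical in-memory view of one file listed in manifest.json.
struct ManifestFile
{
    std::string path;    // relative to artifact_base
    std::string sha256;  // expected hex digest from the manifest
};

// The hasher is injected; a real one would wrap a SHA-256 library.
using Sha256Fn = std::function<std::string( const fs::path& )>;

// Returns one message per problem; an empty result means the files are valid.
std::vector<std::string> validateFiles( const fs::path& artifactBase,
                                        const std::vector<ManifestFile>& files,
                                        const Sha256Fn& sha256Of )
{
    std::vector<std::string> errors;
    for ( const auto& f : files )
    {
        const fs::path full = artifactBase / f.path;
        if ( !fs::is_regular_file( full ) )
        {
            errors.push_back( "missing: " + f.path );
            continue;  // no point hashing a file that is not there
        }
        if ( sha256Of( full ) != f.sha256 )
            errors.push_back( "checksum mismatch: " + f.path );
    }
    return errors;
}
```

Injecting the hasher keeps the sketch free of any particular crypto dependency and makes the existence check testable on its own.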

7. Parquet Schemas (v1.0)

7.1. Facts Store: building_kpi_ts.parquet

Column      | Type        | Nullable | Description
------------+-------------+----------+------------------------------------------
ts          | timestamp_s | no       | Timestamp in seconds (UTC, run clock).
building_id | int64       | no       | Building identifier.
kpi_id      | string      | no       | KPI identifier (from catalog).
value       | float64     | no       | Observed value.
unit        | string      | yes      | Unit (optional).
run_id      | string      | yes      | Run identifier (for multi-run ingestion).
scenario_id | string      | yes      | Scenario identifier if available.
sample_id   | string      | yes      | Ensemble sample identifier.
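As an illustration of how the nullable columns of the facts store might map to an in-memory row, here is a minimal sketch. The struct name BuildingKpiFact and the helper isEnsembleFact are hypothetical; the column names and nullability come from the schema above, and representing ts as a 64-bit count of seconds is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// One row of building_kpi_ts.parquet; nullable columns become std::optional.
struct BuildingKpiFact
{
    std::int64_t ts;                         // seconds, UTC run clock (assumed epoch-based)
    std::int64_t building_id;                // not nullable
    std::string kpi_id;                      // not nullable, must exist in the KPI catalog
    double value;                            // not nullable
    std::optional<std::string> unit;         // nullable
    std::optional<std::string> run_id;       // nullable, set for multi-run ingestion
    std::optional<std::string> scenario_id;  // nullable
    std::optional<std::string> sample_id;    // nullable, set for ensemble samples
};

// A fact belongs to an ensemble sample only when sample_id is present.
inline bool isEnsembleFact( const BuildingKpiFact& fact )
{
    return fact.sample_id.has_value();
}
```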

7.2. Global Summary: global_kpi_summary.parquet

Column       | Type        | Nullable | Description
-------------+-------------+----------+---------------------------------------------------
level        | string      | no       | Aggregation level: city, district, typology.
entity_id    | string      | no       | Entity identifier (city id, district id, type).
kpi_id       | string      | no       | KPI identifier.
metric       | string      | no       | Metric name (mean, stddev, p50, peak_total, etc.).
value        | float64     | no       | Metric value.
unit         | string      | yes      | Unit (optional).
count        | int64       | yes      | Sample count (buildings or timesteps).
window_start | timestamp_s | yes      | Report window start.
window_end   | timestamp_s | yes      | Report window end.
epc_mode     | string      | yes      | EPC proxy mode if relevant.

7.3. Building Summary: building_kpi_summary.parquet

Column       | Type        | Nullable | Description
-------------+-------------+----------+---------------------------------------
sample_id    | string      | yes      | Ensemble sample identifier.
building_id  | int64       | no       | Building identifier.
kpi_id       | string      | no       | KPI identifier.
metric       | string      | no       | Metric name (mean, stddev, p50, etc.).
value        | float64     | no       | Metric value.
unit         | string      | yes      | Unit (optional).
count        | int64       | yes      | Sample count (timesteps or samples).
window_start | timestamp_s | yes      | Report window start.
window_end   | timestamp_s | yes      | Report window end.
epc_class    | string      | yes      | EPC/DPE class label.
epc_mode     | string      | yes      | EPC proxy mode.

7.4. Spatial Index: building_spatial_index.parquet

Column          | Type    | Nullable | Description
----------------+---------+----------+---------------------------------------------------------------
building_id     | int64   | no       | Building identifier.
centroid_lon    | float64 | no       | Longitude (EPSG:4326).
centroid_lat    | float64 | no       | Latitude (EPSG:4326).
district_scheme | string  | no       | Scheme id (e.g., eu_grid_100km, h3_resolution_7, lau, custom).
district_id     | string  | no       | District identifier under scheme.
lau_id          | string  | yes      | LAU identifier (if available).
nuts1           | string  | yes      | NUTS1 id.
nuts2           | string  | yes      | NUTS2 id.
nuts3           | string  | yes      | NUTS3 id.

8. HDF5 Schemas (v1.0)

8.1. Building Outputs Time Series: building_outputs_ts.h5

The HDF5 file stores per-building, per-output time series. The logical schema is described below; the physical layout is an aggregated group with arrays under /aggregated/variables/<output_name>/.

8.1.1. Single Run (Logical Schema)

Field       | Type    | Description
------------+---------+--------------------------------------------
timestamp   | float64 | Simulation time (seconds from start).
building_id | string  | Building identifier.
output_name | string  | Output variable name (e.g., heating_power).
value       | float64 | Simulated value (single run).
unit        | string  | Physical unit (e.g., W, degC).

8.1.2. Ensemble Run (Logical Schema)

Field       | Type    | Description
------------+---------+--------------------------------------
timestamp   | float64 | Simulation time (seconds from start).
building_id | string  | Building identifier.
output_name | string  | Output variable name.
mean        | float64 | Mean across samples.
std         | float64 | Standard deviation across samples.
min         | float64 | Minimum across samples.
max         | float64 | Maximum across samples.
p05         | float64 | 5th percentile.
p50         | float64 | Median (50th percentile).
p95         | float64 | 95th percentile.
n_samples   | int32   | Number of samples contributing.
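To make the ensemble fields concrete, the sketch below computes them for one (timestamp, building_id, output_name) cell from the per-sample values. It rests on two assumptions that the schema does not settle: percentiles use linear interpolation between order statistics, and std is the population standard deviation (divide by n). The names EnsembleCellStats, percentileSorted, and summarize are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the ensemble fields; stddev corresponds to the "std" column.
struct EnsembleCellStats
{
    double mean, stddev, min, max, p05, p50, p95;
    std::int32_t n_samples;
};

// Percentile by linear interpolation on a sorted, non-empty vector (q in [0, 1]).
inline double percentileSorted( const std::vector<double>& sorted, double q )
{
    const double pos = q * static_cast<double>( sorted.size() - 1 );
    const std::size_t lo = static_cast<std::size_t>( std::floor( pos ) );
    const std::size_t hi = std::min( lo + 1, sorted.size() - 1 );
    const double frac = pos - static_cast<double>( lo );
    return sorted[lo] + frac * ( sorted[hi] - sorted[lo] );
}

// Takes the samples by value because it sorts its own copy.
inline EnsembleCellStats summarize( std::vector<double> samples )
{
    assert( !samples.empty() );
    std::sort( samples.begin(), samples.end() );
    const double n = static_cast<double>( samples.size() );
    double sum = 0.0;
    for ( double v : samples ) sum += v;
    const double mean = sum / n;
    double squares = 0.0;
    for ( double v : samples ) squares += ( v - mean ) * ( v - mean );
    const double stddev = std::sqrt( squares / n );  // population form (assumption)
    return { mean, stddev, samples.front(), samples.back(),
             percentileSorted( samples, 0.05 ), percentileSorted( samples, 0.50 ),
             percentileSorted( samples, 0.95 ),
             static_cast<std::int32_t>( samples.size() ) };
}
```

Note that exact percentiles need all samples in memory; a mergeable reducer would instead rely on a histogram or sketch, which trades exactness for mergeability.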

9. Report Spec and KPI Catalog (YAML/JSON)

9.1. report_spec.yaml

Minimum fields:

  • window.mode: full | last_day | last_week | last_month | custom

  • window.start_time, window.end_time: required for custom mode

  • levels: ["city", "district", "typology", "building"]

  • kpis: list of KPI ids to compute

  • building_level: boolean to enable building_kpi_summary.parquet

  • export_legacy_json: boolean to keep report.json during transition

  • unit_validation_mode: strict | warn | off
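The field list above could be held in memory as shown in this sketch, together with the one cross-field rule the list states (custom windows need both bounds). The struct and function names are hypothetical, the default values are assumptions, and timestamps are kept as opaque strings because their format is not specified here.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

enum class WindowMode { Full, LastDay, LastWeek, LastMonth, Custom };
enum class UnitValidationMode { Strict, Warn, Off };

// In-memory form of report_spec.yaml; defaults are illustrative assumptions.
struct ReportSpec
{
    WindowMode window_mode = WindowMode::Full;
    std::optional<std::string> window_start_time;  // required for Custom
    std::optional<std::string> window_end_time;    // required for Custom
    std::vector<std::string> levels;
    std::vector<std::string> kpis;
    bool building_level = false;      // enables building_kpi_summary.parquet
    bool export_legacy_json = false;  // keeps report.json during transition
    UnitValidationMode unit_validation_mode = UnitValidationMode::Strict;
};

// Returns one message per violated rule; empty means the spec is consistent.
inline std::vector<std::string> checkReportSpec( const ReportSpec& spec )
{
    std::vector<std::string> errors;
    if ( spec.window_mode == WindowMode::Custom &&
         ( !spec.window_start_time || !spec.window_end_time ) )
        errors.push_back( "custom window requires start_time and end_time" );
    for ( const auto& level : spec.levels )
    {
        if ( level != "city" && level != "district" && level != "typology" &&
             level != "building" )
            errors.push_back( "unknown level: " + level );
    }
    return errors;
}
```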

9.2. kpi_catalog.yaml

Minimum fields per KPI:

  • id

  • unit

  • reducer (factory id)

  • aggregations (mean, min, max, p50, p95, etc.)

  • thresholds (optional, for exceedance and EPC classes)

10. Reducer Library (Contracts)

Reducer concepts and interfaces live under src/cpp/ktirio/ub/stats/:

  • Mergeable, Finalizable, ReducerType concepts

  • IReducer interface with update, merge, finalize, clone

  • ReducerFactory to create reducers by KPI definition

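#include <concepts>

// A reducer state that can absorb another partial state of the same type.
template <typename T>
concept Mergeable = requires( T a, const T& b )
{
    { a.merge( b ) } -> std::same_as<void>;
};

// A reducer state that can produce its final result without being modified.
template <typename T>
concept Finalizable = requires( const T& a )
{
    typename T::result_type;
    { a.finalize() } -> std::convertible_to<typename T::result_type>;
};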

11. StatsEngine (Facade + Strategy)

The engine provides a unified entry point for single-run and ensemble workflows.

Standard pipeline (Template Method):

  1. Load inputs

  2. Validate inputs

  3. Compute (delegates to strategy)

  4. Write outputs (Parquet + manifest updates)

class StatsEngine
{
public:
    explicit StatsEngine( StatsEngineInputs inputs );
    StatsEngineOutputs run();
};
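The four-step pipeline can be sketched as a fixed run() sequence whose compute step is delegated to a strategy. This is only an illustration under stated assumptions: StatsEngineSketch, SumStrategy, and the placeholder input/output structs are hypothetical, and passing the strategy through the constructor is one possible wiring that this document does not prescribe.

```cpp
#include <cassert>
#include <memory>
#include <numeric>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Placeholder input/output types; the real ones are not specified here.
struct StatsEngineInputs
{
    std::vector<double> facts;
};

struct StatsEngineOutputs
{
    double value = 0.0;
    std::vector<std::string> steps;  // trace of executed pipeline steps
};

// Strategy: the only step that differs between workflows.
class IComputeStrategy
{
public:
    virtual ~IComputeStrategy() = default;
    virtual double compute( const StatsEngineInputs& inputs ) const = 0;
};

class SumStrategy final : public IComputeStrategy
{
public:
    double compute( const StatsEngineInputs& inputs ) const override
    {
        return std::accumulate( inputs.facts.begin(), inputs.facts.end(), 0.0 );
    }
};

class StatsEngineSketch
{
public:
    StatsEngineSketch( StatsEngineInputs inputs, std::unique_ptr<IComputeStrategy> strategy )
        : inputs_( std::move( inputs ) ), strategy_( std::move( strategy ) )
    {
    }

    // Template Method: the step order is fixed, only compute() varies.
    StatsEngineOutputs run()
    {
        StatsEngineOutputs out;
        out.steps.push_back( "load" );  // inputs are already in memory in this sketch
        validate();
        out.steps.push_back( "validate" );
        out.value = strategy_->compute( inputs_ );
        out.steps.push_back( "compute" );
        out.steps.push_back( "write" );  // a real engine would write Parquet here
        return out;
    }

private:
    void validate() const
    {
        if ( !strategy_ ) throw std::invalid_argument( "no compute strategy" );
        if ( inputs_.facts.empty() ) throw std::invalid_argument( "no facts to reduce" );
    }

    StatsEngineInputs inputs_;
    std::unique_ptr<IComputeStrategy> strategy_;
};
```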

11.1. MapReduce Key

Key: (level, entity_id, kpi_id, metric)

Reducer merges must be associative and commutative so that the result does not depend on the order in which MPI ranks contribute their partial states.
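A minimal sketch of keyed merging follows, assuming each rank holds a map from the four-part key to a partial state. StatKey, SumCount, PartialMap, and mergeInto are hypothetical names. Because the partial state only adds sums and counts, merging rank maps in either order gives the same totals for exactly representable values; with general floating-point data, addition is only approximately associative, so bitwise reproducibility may additionally require a fixed merge order.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// (level, entity_id, kpi_id, metric)
using StatKey = std::tuple<std::string, std::string, std::string, std::string>;

// Partial state for a mean: merging is plain addition of both members.
struct SumCount
{
    double sum = 0.0;
    long long count = 0;

    void merge( const SumCount& other )
    {
        sum += other.sum;
        count += other.count;
    }
};

// std::map keeps keys sorted, so iteration order is the same on every rank.
using PartialMap = std::map<StatKey, SumCount>;

// Folds the partial states of src into dst, creating missing keys on demand.
inline void mergeInto( PartialMap& dst, const PartialMap& src )
{
    for ( const auto& [key, partial] : src )
        dst[key].merge( partial );
}
```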

12. Ensemble Integration Points

Current entry points that will call the StatsEngine:

  • CemRunner::emitAnalytics (src/cpp/ktirio/ub/app/cemrunner.cpp:576)

  • Ensemble rounds and sample execution (src/cpp/ktirio/ub/ensemble/ensembleworkflow_rounds.cpp:171)

  • Ensemble time-series IO (src/cpp/ktirio/ub/ensemble/ensembleworkflow_internal.hpp:260)

  • Single-run reporting aggregation (src/cpp/ktirio/ub/cityenergymodel.cpp:1215)

13. Spatial Index Strategy

  • Use building_spatial_index.parquet as a reusable pre-step.

  • If districts are missing, compute from centroid using scheme-specific fallback.

  • Schemes: eu_grid_100km (default), h3_resolution_7, lau, custom.
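The fallback rule above can be sketched as a lookup of a scheme-specific function that is called only when no district id is present. All names here are hypothetical, and the one-degree cell function is a toy stand-in registered under the custom scheme; it does not implement eu_grid_100km, h3_resolution_7, or lau, which would need a projection, the H3 library, or a boundary lookup respectively.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Maps a centroid (EPSG:4326 lon/lat) to a district id for one scheme.
using FallbackFn = std::function<std::string( double lon, double lat )>;

// Toy fallback for illustration only: one-degree lon/lat cells.
inline std::string degreeCellId( double lon, double lat )
{
    return "cell_" + std::to_string( static_cast<long>( std::floor( lon ) ) ) + "_" +
           std::to_string( static_cast<long>( std::floor( lat ) ) );
}

// Keeps an existing district id; otherwise derives one from the centroid.
inline std::string resolveDistrict( const std::optional<std::string>& districtId,
                                    const std::string& scheme, double lon, double lat,
                                    const std::map<std::string, FallbackFn>& fallbacks )
{
    if ( districtId ) return *districtId;
    const auto it = fallbacks.find( scheme );
    if ( it == fallbacks.end() )
        throw std::invalid_argument( "no fallback for scheme: " + scheme );
    return it->second( lon, lat );
}
```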

14. EPC/DPE Proxy

  • Compute E_primary_m2_y (kWh/m²/year).

  • Classification modes:

    • country_rules (if rules exist)

    • distribution_screening (quantile-based)

  • Expose probabilities P(class) and P(passoire), the probability of falling in class F or G (French "passoire thermique", an energy sieve), from ensemble histograms.

  • Outputs must include epc_mode and a disclaimer: "screening indicator, not legal classification".
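A minimal sketch of how class probabilities might be derived from ensemble samples of E_primary_m2_y is given below. The class bounds are an input: in country_rules mode they would come from the applicable rules, in distribution_screening mode from quantiles. The function names are hypothetical, and treating a value equal to a bound as belonging to the better class is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Upper bounds (kWh/m²/year) for classes A..F; values above the last bound are G.
using ClassBounds = std::array<double, 6>;

// Returns 0 for A up to 6 for G; a value equal to a bound stays in that class.
inline std::size_t classIndex( double ePrimary, const ClassBounds& bounds )
{
    std::size_t i = 0;
    while ( i < bounds.size() && ePrimary > bounds[i] ) ++i;
    return i;
}

// Empirical P(class): the fraction of ensemble samples that fall in each class.
inline std::array<double, 7> classProbabilities( const std::vector<double>& samples,
                                                 const ClassBounds& bounds )
{
    std::array<double, 7> p{};
    if ( samples.empty() ) return p;
    for ( double e : samples ) p[classIndex( e, bounds )] += 1.0;
    for ( double& v : p ) v /= static_cast<double>( samples.size() );
    return p;
}

// P(passoire): probability mass of classes F and G.
inline double passoireProbability( const std::array<double, 7>& p )
{
    return p[5] + p[6];
}
```

The result remains a screening indicator; any output built from it should still carry epc_mode and the disclaimer required above.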

15. Unit Validation

StatsEngine validates KPI units at load time.

  • Accepted canonical units: W, kW, kWh, degC, K, m3/s, kg_per_m3, 1, %, Pa, ppm, kWh/m2/y.

  • Synonyms are normalized (e.g., celsius → degC, kg/m3 → kg_per_m3, kWh/m2/year → kWh/m2/y).

  • Custom units must be prefixed with custom: to bypass validation.

  • Behavior is controlled by unit_validation_mode in report_spec.yaml:

    • strict: throw on invalid units

    • warn: log warnings and continue

    • off: skip validation
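The rules in this section can be sketched as follows. The canonical units, the three synonyms, the custom: prefix, and the three modes are taken from the list above; the function names, the use of std::invalid_argument in strict mode, and logging to std::cerr in warn mode are assumptions, and the real synonym table may be larger.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

enum class UnitValidationMode { Strict, Warn, Off };

// Maps known synonyms to their canonical spelling; other units pass through.
inline std::string normalizeUnit( const std::string& unit )
{
    static const std::map<std::string, std::string> synonyms = {
        { "celsius", "degC" }, { "kg/m3", "kg_per_m3" }, { "kWh/m2/year", "kWh/m2/y" } };
    const auto it = synonyms.find( unit );
    return it == synonyms.end() ? unit : it->second;
}

// Accepts canonical units (after normalization) and anything prefixed "custom:".
inline bool isAcceptedUnit( const std::string& unit )
{
    static const std::set<std::string> canonical = {
        "W", "kW", "kWh", "degC", "K", "m3/s", "kg_per_m3", "1", "%", "Pa", "ppm", "kWh/m2/y" };
    if ( unit.rfind( "custom:", 0 ) == 0 ) return true;  // prefix bypasses validation
    return canonical.count( normalizeUnit( unit ) ) > 0;
}

// Returns true when the unit is accepted or validation is off.
inline bool validateUnit( const std::string& unit, UnitValidationMode mode )
{
    if ( mode == UnitValidationMode::Off || isAcceptedUnit( unit ) ) return true;
    if ( mode == UnitValidationMode::Strict )
        throw std::invalid_argument( "invalid unit: " + unit );
    std::cerr << "warning: invalid unit '" << unit << "'\n";  // Warn: log and continue
    return false;
}
```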

16. Manifest and Provenance

Manifest must include:

  • schema_id + schema_version per artifact

  • sha256 checksum for integrity checks

  • References to setup.json, provenance.json, report_spec.yaml, kpi_catalog.yaml

Provenance must capture (schema_version 2):

  • git commit hash and dirty state

  • command line and environment

  • timestamps and resource summary

  • cem_version and feelpp_version

17. Fallback Plan

  • If Arrow/Parquet is unavailable, fall back to JSON/CSV.

  • Manifest must mark backend=json and list schema ids with schema_version.

  • Document a migration path back to Parquet when enabled.

18. UML (Facade + Strategy + Factory)

+------------------+         +---------------------+
|   StatsEngine    |-------> |   IComputeStrategy  |
| (Facade/Template)|         |  (Strategy)         |
+------------------+         +----------+----------+
         |                               |
         v                               v
+------------------+         +---------------------+
| ReducerFactory   |-------> |   IReducer          |
|  (Factory)       |         |  (Mergeable)        |
+------------------+         +---------------------+

19. Arrow Schema Definitions

Code-ready Arrow schema builders are defined in:

  • src/cpp/ktirio/ub/stats/schema.hpp

In manifest.json, shared artifacts appear once with "type": "shared". Partitioned artifacts are grouped under a single entry with "type": "partitioned" and a files array containing per-rank paths and checksums.