CEMDB Data Layout

1. Overview

The City Energy Model Database (CEMDB) uses a hierarchical directory structure to organize simulation inputs, shared resources, and outputs. The layout separates single runs from ensemble runs at the directory level.

2. Core Design Principles

  1. Location-based organization for GIS, weather, and preprocessing data

  2. Run isolation via unique run identifiers

  3. Run type separation between single and ensemble simulations

  4. Resource sharing for FMU datasets

  5. Ensemble member isolation with per-member databases

  6. Ensemble aggregation stored at run level (not per member)

3. Directory Structure

3.1. Root Structure

cemdb/
|-- simulators/              # Shared FMU datasets (read-only across all simulations)
|   |-- ideal/               # Ideal heating system FMUs
|   |-- boiler/              # Boiler heating system FMUs
|   `-- heatPump/            # Heat pump system FMUs
`-- locations/               # Location-specific data
    `-- <location>/          # e.g., "kernante", "strasbourg", "paris"
        |-- gis/             # GIS input data
        |-- weather/         # Weather data files
        |-- scenarios/       # Scenario definitions
        |-- preprocessing/   # Cached preprocessing data
        `-- simulations/     # Simulation outputs
            |-- single/
            |   `-- run_<timestamp>/
            |       |-- manifest.json
            |       |-- setup.json
            |       `-- database/
            `-- ensemble/
                `-- run_<timestamp>/
                    |-- manifest.json
                    |-- setup.json
                    |-- ensemble/      # Aggregated ensemble statistics
                    |-- sample_000/
                    |   `-- database/
                    |-- sample_001/
                    |   `-- database/
                    `-- database/      # Run-level logs/config

4. Key Directories

4.1. Simulators (cemdb/simulators/)

Purpose: Store shared FMU (Functional Mock-up Unit) datasets used by simulations.

Characteristics:

  • Read-only during simulations

  • Shared across all locations and runs

  • Organized by heating system type

  • FMU naming: App{Walls}{Floors}{Roof}{HeatingSystem}.fmu

4.2. Location Root (cemdb/locations/<location>/)

Purpose: Group all data specific to a geographic location.

Subdirectories:

  • gis/: GIS input files (building geometries, terrain, metadata)

  • weather/: Weather data files

  • scenarios/: Scenario definition files

  • preprocessing/: Cached preprocessing results

  • simulations/: All simulation outputs (single and ensemble)

4.3. Simulations Base (cemdb/locations/<location>/simulations/)

Simulations are separated by run type.

4.3.1. Single Simulations (simulations/single/)

Purpose: Standalone, single-run simulations.

single/
`-- run_<timestamp>/
    |-- manifest.json           # Run manifest
    |-- setup.json              # Run setup
    `-- database/
        |-- buildings/          # Per-building simulation outputs
        |-- visualization/      # Visualization exports
        |-- stats/              # StatsEngine outputs
        |-- report/             # Legacy reports and facts
        `-- timeseries/
            `-- building_outputs_ts.h5

Use cases: baseline simulations, single scenario testing, debugging.

4.3.2. Ensemble Simulations (simulations/ensemble/)

Purpose: Coordinated ensemble simulations with multiple parameter samples.

ensemble/
`-- run_<timestamp>/
    |-- manifest.json          # Run manifest
    |-- setup.json             # Run setup
    |-- ensemble/              # Ensemble-level outputs
    |   `-- stats/
    |       |-- global_kpi_summary.parquet
    |       `-- building_outputs_ts.h5
    |-- sample_000/
    |   `-- database/
    |       |-- stats/
    |       |   `-- global_kpi_summary.parquet
    |       `-- report/
    |           `-- building_kpi_ts.part-r*.parquet
    `-- database/              # Run-level logs/config

Use cases: uncertainty quantification, sensitivity analysis, parameter space exploration.

5. RunType Classification

5.1. C++ Enum

namespace Feel::Ktirio::Ub {

enum class RunType {
    Single,     // Standalone simulation
    Ensemble    // Coordinated ensemble simulation
};

} // namespace Feel::Ktirio::Ub

5.2. Python Enum

from feelpp.ktirio.ub import RunType

# Available values:
RunType.Single    # Single simulation
RunType.Ensemble  # Ensemble simulation

6. PathManager API

6.1. Run ID Generation

C++ API:

// Basic run ID (no type prefix)
std::string runId = PathManager::generateRunId();
// -> "run_2025-01-15_14-30-45"

// Type-prefixed run IDs
std::string singleRunId = PathManager::generateRunId(RunType::Single, true);
// -> "single_run_2025-01-15_14-30-45"

std::string ensembleRunId = PathManager::generateRunId(RunType::Ensemble, true);
// -> "ensemble_run_2025-01-15_14-30-45"

Python API:

from feelpp.ktirio.ub import core as ktirio_core

# Basic run ID (no type prefix)
run_id = ktirio_core.PathManager.generateRunId()
# -> "run_2025-01-15_14-30-45"

# Type-prefixed run IDs
single_run_id = ktirio_core.PathManager.generateRunId(
    ktirio_core.RunType.Single,
    includeTypePrefix=True
)
# -> "single_run_2025-01-15_14-30-45"

ensemble_run_id = ktirio_core.PathManager.generateRunId(
    ktirio_core.RunType.Ensemble,
    includeTypePrefix=True
)
# -> "ensemble_run_2025-01-15_14-30-45"

Type prefixes are optional but useful when manually inspecting directories.
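
The run ID format shown in the example outputs can be reproduced without the library, which is handy in scripts that only need to name or parse run directories. The following is a minimal pure-Python sketch. It assumes the ID is run_ followed by a YYYY-MM-DD_HH-MM-SS timestamp and that the optional prefix is the lowercase run type name; make_run_id is a hypothetical helper and does not call PathManager.

```python
from datetime import datetime
from typing import Optional


def make_run_id(now: datetime, run_type: Optional[str] = None) -> str:
    """Build a run ID in the format shown in the examples above.

    Hypothetical helper: it only mirrors the documented output strings.
    """
    base = "run_" + now.strftime("%Y-%m-%d_%H-%M-%S")
    if run_type is None:
        return base
    # Assumption: the type prefix is the lowercase run type name.
    return f"{run_type.lower()}_{base}"


stamp = datetime(2025, 1, 15, 14, 30, 45)
print(make_run_id(stamp))              # run_2025-01-15_14-30-45
print(make_run_id(stamp, "Ensemble"))  # ensemble_run_2025-01-15_14-30-45
```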

6.2. Initialization

C++ API:

#include <ktirio/ub/pathmanager.hpp>

using namespace Feel::Ktirio::Ub;

// Single simulation
PathManager::instance().initialize(
    customRoot,
    locationName,
    runId,
    std::nullopt,
    RunType::Single
);

// Ensemble simulation (coordinator)
PathManager::instance().initialize(
    customRoot,
    locationName,
    PathManager::generateRunId(RunType::Ensemble),
    std::nullopt,
    RunType::Ensemble
);

// Ensemble simulation (member)
PathManager::instance().initialize(
    customRoot,
    locationName,
    runId,
    "000",
    RunType::Ensemble
);

Python API:

from feelpp.ktirio.ub import core as ktirio_core

pm = ktirio_core.PathManager.instance()

# Single simulation
pm.initialize(
    customRoot="/path/to/cemdb",
    locationName="kernante",
    runId=ktirio_core.PathManager.generateRunId(ktirio_core.RunType.Single),
    ensembleMemberId=None,
    runType=ktirio_core.RunType.Single
)

# Ensemble simulation (coordinator)
ensemble_run_id = ktirio_core.PathManager.generateRunId(
    ktirio_core.RunType.Ensemble,
    includeTypePrefix=True
)

pm.initialize(
    customRoot="/path/to/cemdb",
    locationName="kernante",
    runId=ensemble_run_id,
    ensembleMemberId=None,
    runType=ktirio_core.RunType.Ensemble
)

# Ensemble simulation (member): reuses the coordinator's run ID
pm.initialize(
    customRoot="/path/to/cemdb",
    locationName="kernante",
    runId=ensemble_run_id,
    ensembleMemberId="000",
    runType=ktirio_core.RunType.Ensemble
)

6.3. Path Getters

// Root paths
std::filesystem::path cemdbRoot() const;
std::filesystem::path simulatorsDir() const;
std::filesystem::path locationRoot() const;

// Input data paths
std::filesystem::path locationGisDir() const;
std::filesystem::path locationWeatherDir() const;
std::filesystem::path locationScenariosDir() const;
std::filesystem::path locationPreprocessingDir() const;

// Simulation paths (type-aware)
std::filesystem::path locationSimulationsDir() const;
std::filesystem::path currentRunDir() const;
std::filesystem::path databaseDir() const;

// Ensemble-specific paths
std::filesystem::path ensembleDir() const;
std::filesystem::path ensembleMemberDir(const std::string& memberId) const;

// Run type
RunType runType() const;
bool isEnsembleMode() const;

Path Behavior:

Method                    Single Run                 Ensemble Coordinator         Ensemble Member
------------------------  -------------------------  ---------------------------  --------------------------------------
locationSimulationsDir()  simulations/single/        simulations/ensemble/        simulations/ensemble/
runRootDir()              single/run_<id>/           ensemble/run_<id>/           ensemble/run_<id>/
currentRunDir()           single/run_<id>/           ensemble/run_<id>/           ensemble/run_<id>/member_<N>/
databaseDir()             single/run_<id>/database/  ensemble/run_<id>/database/  ensemble/run_<id>/member_<N>/database/
ensembleDir()             N/A                        ensemble/run_<id>/ensemble/  ensemble/run_<id>/ensemble/

ensembleDir() always returns the run-level ensemble directory, never a member-specific path.
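
The path behavior above can be mirrored with plain path arithmetic, for example in a script that has to locate outputs without loading the bindings. This is a sketch only: resolve_paths is a hypothetical helper, it assumes the member_<N> directory naming used in the table, and it treats the run ID as already containing the run_ prefix.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional


def resolve_paths(root: str, location: str, run_id: str, run_type: str,
                  member_id: Optional[str] = None) -> Dict[str, PurePosixPath]:
    """Mirror the Path Behavior table (hypothetical helper, not the real API)."""
    kind = "ensemble" if run_type == "Ensemble" else "single"
    simulations = PurePosixPath(root) / "locations" / location / "simulations" / kind
    run_root = simulations / run_id
    # Only ensemble members get a nested member_<N> directory.
    if kind == "ensemble" and member_id is not None:
        current = run_root / f"member_{member_id}"
    else:
        current = run_root
    paths = {
        "locationSimulationsDir": simulations,
        "runRootDir": run_root,
        "currentRunDir": current,
        "databaseDir": current / "database",
    }
    if kind == "ensemble":
        # ensembleDir() is always run-level, never member-specific.
        paths["ensembleDir"] = run_root / "ensemble"
    return paths


member = resolve_paths("/data/cemdb", "kernante", "run_2025-01-15_15-00-00", "Ensemble", "000")
for name, path in member.items():
    print(f"{name}: {path}")
```

For a single run the returned mapping has no ensembleDir entry, which corresponds to the N/A cell in the table.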

7. File Formats

7.1. Building Output Files (database/buildings/*.json)

Per-building simulation results with timeseries and aggregated statistics.

{
  "buildingId": "building_001",
  "outputs": {
    "FinalEnergy": { "val": 125.4, "unit": "MJ" },
    "UsefullEnergy": { "val": 112.8, "unit": "MJ" },
    "IndoorTemperature": { "mean": 20.5, "std": 1.2, "unit": "degC" }
  }
}

7.2. Run Manifest (run_<id>/manifest.json)

Metadata about the simulation run, including configuration and artifacts.

{
  "schema_version": 4,
  "run_id": "run_2025-01-15_14-30-00",
  "location": "kernante",
  "run_type": "Single",
  "world_size": 1,
  "start_time": 0.0,
  "stop_time": 86400.0,
  "step_time": 3600.0,
  "backend": "json",
  "artifact_base": "cemdb/locations/kernante/simulations/single/run_2025-01-15_14-30-00",
  "artifacts": [
    { "dataset": "global_kpi_summary", "path": "database/stats/global_kpi_summary.parquet" },
    { "dataset": "report", "path": "database/report/report.json" }
  ],
  "setup_summary": {
    "buildings_total": 125,
    "buildings_simulated": 120,
    "buildings_skipped": 5,
    "solar_shading_enabled": true,
    "solar_shading_components": ["building", "terrain"],
    "ideal_flows_enabled": false,
    "outputs": {
      "hdf5": true,
      "csv": false,
      "visualization": true,
      "report": true,
      "building_reports": false
    }
  }
}
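
artifact_base lets tools resolve the relative artifact paths outside the original runtime environment. world_size records the number of MPI processes, which is also the number of per-rank files to expect for a partitioned artifact.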
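
Resolving artifacts from a manifest amounts to joining artifact_base with each relative path. The sketch below shows this on a trimmed copy of the example manifest; resolve_artifacts is a hypothetical helper, and the manifest is given inline rather than read from disk so that the snippet stands alone.

```python
from pathlib import PurePosixPath
from typing import Dict


def resolve_artifacts(manifest: dict) -> Dict[str, PurePosixPath]:
    """Map each dataset name to artifact_base / path (hypothetical helper)."""
    base = PurePosixPath(manifest["artifact_base"])
    return {a["dataset"]: base / a["path"] for a in manifest["artifacts"]}


# Trimmed copy of the example manifest above.
manifest = {
    "artifact_base": "cemdb/locations/kernante/simulations/single/run_2025-01-15_14-30-00",
    "artifacts": [
        {"dataset": "global_kpi_summary", "path": "database/stats/global_kpi_summary.parquet"},
        {"dataset": "report", "path": "database/report/report.json"},
    ],
}

for dataset, path in resolve_artifacts(manifest).items():
    print(dataset, "->", path)
```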

8. Manifest Schema Specification (v4)

The manifest v4 schema is the current standard for simulation result discovery and artifact management.

8.1. Schema Versions

Version  Location                Description
-------  ----------------------  -------------------------------------------------
v1       database/manifest.json  Legacy per-database manifest
v3       database/manifest.json  Intermediate schema with provenance
v4       run_<id>/manifest.json  Run-root manifest with unified artifact discovery

8.2. v4 Schema Definition

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "KUB Manifest v4",
  "type": "object",
  "required": ["schema_version", "run_id", "run_type", "artifact_base", "artifacts"],
  "properties": {
    "schema_version": {
      "type": "integer",
      "const": 4,
      "description": "Schema version identifier"
    },
    "run_id": {
      "type": "string",
      "description": "Unique run identifier (e.g., run_2025-01-15_14-30-00)"
    },
    "run_type": {
      "type": "string",
      "enum": ["Single", "Ensemble"],
      "description": "Simulation run type"
    },
    "location": {
      "type": "string",
      "description": "Geographic location identifier"
    },
    "world_size": {
      "type": "integer",
      "minimum": 1,
      "description": "Number of MPI processes used"
    },
    "start_time": {
      "type": "number",
      "description": "Simulation start time (seconds)"
    },
    "stop_time": {
      "type": "number",
      "description": "Simulation stop time (seconds)"
    },
    "step_time": {
      "type": "number",
      "description": "Simulation time step (seconds)"
    },
    "backend": {
      "type": "string",
      "enum": ["json", "parquet"],
      "description": "Output backend type"
    },
    "artifact_base": {
      "type": "string",
      "description": "Base path for resolving relative artifact paths"
    },
    "repository_root": {
      "type": "string",
      "description": "Repository root path (for run-root manifests)"
    },
    "artifacts": {
      "type": "array",
      "items": { "$ref": "#/$defs/artifact" },
      "description": "List of output artifacts"
    },
    "setup_summary": {
      "$ref": "#/$defs/setup_summary",
      "description": "Simulation configuration summary"
    },
    "provenance": {
      "$ref": "#/$defs/provenance",
      "description": "Software version and build info"
    },
    "samples": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Ensemble sample identifiers (ensemble runs only)"
    }
  },
  "$defs": {
    "artifact": {
      "type": "object",
      "required": ["dataset", "path"],
      "properties": {
        "dataset": {
          "type": "string",
          "description": "Dataset identifier"
        },
        "path": {
          "type": "string",
          "description": "Relative path from artifact_base"
        },
        "type": {
          "type": "string",
          "enum": ["shared", "partitioned", "resource", "composite"],
          "description": "Artifact type for MPI-parallel outputs"
        },
        "schema_id": {
          "type": "string",
          "description": "Schema identifier (e.g., kub.global_kpi_summary)"
        },
        "schema_version": {
          "type": "string",
          "description": "Schema version (e.g., 1.0)"
        },
        "sha256": {
          "type": "string",
          "description": "SHA-256 hash of file content"
        },
        "files": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "path": { "type": "string" },
              "rank": { "type": "integer" },
              "sha256": { "type": "string" }
            }
          },
          "description": "Individual files for composite artifacts"
        }
      }
    },
    "setup_summary": {
      "type": "object",
      "properties": {
        "buildings_total": { "type": "integer" },
        "buildings_simulated": { "type": "integer" },
        "buildings_skipped": { "type": "integer" },
        "solar_shading_enabled": { "type": "boolean" },
        "solar_shading_components": {
          "type": "array",
          "items": { "type": "string" }
        },
        "ideal_flows_enabled": { "type": "boolean" },
        "outputs": {
          "type": "object",
          "properties": {
            "hdf5": { "type": "boolean" },
            "csv": { "type": "boolean" },
            "visualization": { "type": "boolean" },
            "report": { "type": "boolean" },
            "building_reports": { "type": "boolean" }
          }
        }
      }
    },
    "provenance": {
      "type": "object",
      "properties": {
        "schema_version": { "type": "integer" },
        "run_id": { "type": "string" },
        "run_type": { "type": "string" },
        "software": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "cem_version": { "type": "string" },
            "feelpp_version": { "type": "string" }
          }
        },
        "git": {
          "type": "object",
          "properties": {
            "commit": { "type": "string" },
            "branch": { "type": "string" },
            "dirty": { "type": "boolean" }
          }
        },
        "timestamp_utc": { "type": "string" },
        "hostname": { "type": "string" }
      }
    }
  }
}
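
A full validation would normally go through a JSON Schema validator. Where that is not available, the top-level constraints of the schema above can be checked by hand. The following stdlib-only sketch covers just the required fields, the schema_version constant, the run_type enum, and the required artifact keys; check_manifest is a hypothetical helper and is not a substitute for validating against the complete schema.

```python
from typing import List

# Taken from the "required" and "enum" entries of the v4 schema above.
REQUIRED = ["schema_version", "run_id", "run_type", "artifact_base", "artifacts"]
RUN_TYPES = {"Single", "Ensemble"}


def check_manifest(manifest: dict) -> List[str]:
    """Return a list of problems; an empty list means these basic checks pass."""
    problems = [f"missing required field: {key}" for key in REQUIRED if key not in manifest]
    if "schema_version" in manifest and manifest["schema_version"] != 4:
        problems.append("schema_version must be 4")
    if "run_type" in manifest and manifest["run_type"] not in RUN_TYPES:
        problems.append("run_type must be 'Single' or 'Ensemble'")
    for index, artifact in enumerate(manifest.get("artifacts", [])):
        for key in ("dataset", "path"):
            if key not in artifact:
                problems.append(f"artifacts[{index}] missing required field: {key}")
    return problems


valid = {
    "schema_version": 4,
    "run_id": "run_2025-01-15_14-30-00",
    "run_type": "Single",
    "artifact_base": "cemdb/locations/kernante/simulations/single/run_2025-01-15_14-30-00",
    "artifacts": [{"dataset": "report", "path": "database/report/report.json"}],
}
print(check_manifest(valid))  # []
```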

8.3. Dataset Types

Dataset ID                     Description                          Schema ID
-----------------------------  -----------------------------------  ----------------------------
global_kpi_summary             City-level KPI aggregates            kub.global_kpi_summary
building_kpi_summary           Building-level KPI aggregates        kub.building_kpi_summary
building_kpi_ts                Building KPI time series facts       kub.building_kpi_ts
building_spatial_index         Building geolocation index           kub.building_spatial_index
building_outputs_ts            Building outputs time series (HDF5)  kub.building_outputs_ts_hdf5
building_metadata              Building metadata (HDF5)             kub.building_metadata
analysis_notebook              Auto-deployed Jupyter notebook       N/A
ensemble_global_kpi_summary    Ensemble-aggregated city KPIs        kub.global_kpi_summary
ensemble_building_kpi_summary  Ensemble-aggregated building KPIs    kub.building_kpi_summary

8.4. Artifact Types

Type         Description
-----------  ----------------------------------------------------------------------------
shared       Single file written by master rank only
partitioned  Multiple files, one per MPI rank (e.g., building_kpi_ts.part-r00000.parquet)
composite    Logical dataset with multiple physical files
resource     Static resource file (e.g., analysis notebook)
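
For partitioned artifacts, a reader usually needs the list of per-rank files. The sketch below generates the expected names from a dataset name and a rank count. The five-digit zero padding is an assumption inferred from the single example name building_kpi_ts.part-r00000.parquet, and partition_files is a hypothetical helper.

```python
from typing import List


def partition_files(dataset: str, world_size: int, extension: str = "parquet") -> List[str]:
    """List the per-rank file names expected for a partitioned artifact.

    Assumption: ranks are zero-padded to five digits, as in the example
    name building_kpi_ts.part-r00000.parquet.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return [f"{dataset}.part-r{rank:05d}.{extension}" for rank in range(world_size)]


print(partition_files("building_kpi_ts", 2))
# ['building_kpi_ts.part-r00000.parquet', 'building_kpi_ts.part-r00001.parquet']
```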

9. StatsEngine Configuration

The StatsEngine processes building-level facts and produces aggregated KPI summaries.

9.1. Report Specification (JSON)

{
  "window": {
    "mode": "full",
    "start_time": 0.0,
    "end_time": 86400.0
  },
  "levels": ["city", "district"],
  "kpis": ["FinalEnergy", "IndoorTemperature", "HeatingPower"],
  "building_level": true,
  "unit_validation_mode": "warn"
}

9.2. Report Specification (YAML)

window:
  mode: full
  start_time: 0.0
  end_time: 86400.0
levels:
  - city
  - district
kpis:
  - FinalEnergy
  - IndoorTemperature
  - HeatingPower
building_level: true
unit_validation_mode: warn

9.3. Configuration Options

Field                 Type    Description
--------------------  ------  --------------------------------------------------------------
window.mode           string  Window mode: full (entire simulation), custom (explicit range)
window.start_time     float   Window start time in seconds (optional)
window.end_time       float   Window end time in seconds (optional)
levels                array   Aggregation levels: city, district, building
kpis                  array   KPI identifiers to include (empty = all)
building_level        bool    Generate per-building summaries
unit_validation_mode  string  Unit handling: strict (error), warn (log), off (ignore)
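
The allowed values listed above lend themselves to a quick sanity check before a report specification is handed to the StatsEngine. The following sketch is illustrative only: check_report_spec is a hypothetical helper, absent fields are skipped rather than rejected, and the rule that end_time must exceed start_time is an assumption.

```python
from typing import List

# Allowed values copied from the configuration options above.
WINDOW_MODES = {"full", "custom"}
LEVELS = {"city", "district", "building"}
UNIT_MODES = {"strict", "warn", "off"}


def check_report_spec(spec: dict) -> List[str]:
    """Return a list of problems found in a report specification dict."""
    problems: List[str] = []
    window = spec.get("window", {})
    mode = window.get("mode")
    if mode is not None and mode not in WINDOW_MODES:
        problems.append(f"unknown window.mode: {mode}")
    start, end = window.get("start_time"), window.get("end_time")
    # Assumption: a window must have positive length when both bounds are given.
    if start is not None and end is not None and end <= start:
        problems.append("window.end_time must be greater than window.start_time")
    for level in spec.get("levels", []):
        if level not in LEVELS:
            problems.append(f"unknown level: {level}")
    unit_mode = spec.get("unit_validation_mode")
    if unit_mode is not None and unit_mode not in UNIT_MODES:
        problems.append(f"unknown unit_validation_mode: {unit_mode}")
    return problems


spec = {
    "window": {"mode": "full", "start_time": 0.0, "end_time": 86400.0},
    "levels": ["city", "district"],
    "kpis": ["FinalEnergy", "IndoorTemperature", "HeatingPower"],
    "building_level": True,
    "unit_validation_mode": "warn",
}
print(check_report_spec(spec))  # []
```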

9.4. KPI Catalog

The KPI catalog defines available metrics and their computation methods.

{
  "kpis": [
    {
      "id": "FinalEnergy",
      "unit": "kWh",
      "reducer": "sum",
      "aggregations": ["mean", "sum", "min", "max"]
    },
    {
      "id": "IndoorTemperature",
      "unit": "degC",
      "reducer": "weighted_mean",
      "aggregations": ["mean", "std", "min", "max"]
    },
    {
      "id": "HeatingPower",
      "unit": "W",
      "reducer": "weighted_mean",
      "parameters": {
        "integration_method": "trapezoidal"
      },
      "aggregations": ["mean", "peak", "integrated"]
    }
  ]
}
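
The HeatingPower entry names a trapezoidal integration method. As a worked illustration of that rule, the sketch below integrates a power series in watts over time in seconds and converts the result to kWh. Whether the "integrated" aggregation is reported as energy in this way is an assumption, and integrate_trapezoidal is a hypothetical helper rather than the StatsEngine implementation.

```python
from typing import Sequence


def integrate_trapezoidal(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power (W) over time (s) with the trapezoidal rule; returns joules."""
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    energy_j = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Area of the trapezoid between two consecutive samples.
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j


times = [0.0, 3600.0, 7200.0]    # two one-hour steps
power = [1000.0, 2000.0, 1000.0]
energy_j = integrate_trapezoidal(times, power)
print(energy_j / 3.6e6)  # 3.0 (kWh, since 1 kWh = 3.6e6 J)
```

Each one-hour interval contributes 0.5 * (1000 + 2000) * 3600 = 5.4e6 J, so the total is 1.08e7 J, or 3 kWh.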

9.5. Reducer Types

Reducer        Description
-------------  -------------------------------
sum            Sum of values across buildings
mean           Arithmetic mean
weighted_mean  Mean weighted by building count
min            Minimum value
max            Maximum value
count          Count of observations
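
The reducers above correspond to elementary operations on a list of values. The sketch below spells them out in plain Python. reduce_values is a hypothetical helper; for weighted_mean the caller supplies the weights explicitly, because how the StatsEngine derives building-count weights is not specified here.

```python
from typing import Optional, Sequence


def reduce_values(reducer: str, values: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Apply one of the documented reducers to a sequence of values."""
    if reducer == "count":
        return float(len(values))
    if not values:
        raise ValueError("values must not be empty")
    if reducer == "sum":
        return float(sum(values))
    if reducer == "mean":
        return sum(values) / len(values)
    if reducer == "weighted_mean":
        if weights is None or len(weights) != len(values):
            raise ValueError("weighted_mean needs one weight per value")
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)
    if reducer == "min":
        return float(min(values))
    if reducer == "max":
        return float(max(values))
    raise ValueError(f"unknown reducer: {reducer}")


print(reduce_values("sum", [1.0, 2.0, 3.0]))                      # 6.0
print(reduce_values("weighted_mean", [20.0, 22.0], [1.0, 3.0]))   # 21.5
```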

10. Parquet Schema Reference

10.1. Building KPI Time Series (kub.building_kpi_ts v1.0)

Column       Type         Nullable  Description
-----------  -----------  --------  --------------------------
ts           timestamp_s  No        Timestamp in seconds (UTC)
building_id  int64        No        Building identifier
kpi_id       string       No        KPI identifier
value        float64      No        Observed value
unit         string       Yes       Unit string
run_id       string       Yes       Run identifier
scenario_id  string       Yes       Scenario identifier
sample_id    string       Yes       Ensemble sample identifier
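
The Nullable column can be turned into a simple row-level check, which is useful when producing or ingesting these facts outside the main pipeline. The sketch below works on plain dictionaries so that it needs no Parquet library; null_violations is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Column -> nullable flag, copied from the kub.building_kpi_ts table above.
BUILDING_KPI_TS_NULLABLE: Dict[str, bool] = {
    "ts": False, "building_id": False, "kpi_id": False, "value": False,
    "unit": True, "run_id": True, "scenario_id": True, "sample_id": True,
}


def null_violations(row: Dict[str, Optional[object]]) -> List[str]:
    """Return the non-nullable columns that are missing or None in a row."""
    return [col for col, nullable in BUILDING_KPI_TS_NULLABLE.items()
            if not nullable and row.get(col) is None]


row = {"ts": 0, "building_id": 1, "kpi_id": "FinalEnergy", "value": 125.4}
print(null_violations(row))  # []
```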

10.2. Global KPI Summary (kub.global_kpi_summary v1.0)

Column        Type         Nullable  Description
------------  -----------  --------  ----------------------------------
level         string       No        Aggregation level
entity_id     string       No        Entity identifier
kpi_id        string       No        KPI identifier
metric        string       No        Metric name (mean, sum, std, etc.)
value         float64      No        Metric value
unit          string       Yes       Unit string
count         int64        Yes       Sample count
window_start  timestamp_s  Yes       Window start
window_end    timestamp_s  Yes       Window end
epc_mode      string       Yes       EPC proxy mode

10.3. Building KPI Summary (kub.building_kpi_summary v1.0)

Column        Type         Nullable  Description
------------  -----------  --------  --------------------------
sample_id     string       Yes       Ensemble sample identifier
building_id   int64        No        Building identifier
kpi_id        string       No        KPI identifier
metric        string       No        Metric name
value         float64      No        Metric value
unit          string       Yes       Unit string
count         int64        Yes       Sample count
window_start  timestamp_s  Yes       Window start
window_end    timestamp_s  Yes       Window end
epc_class     string       Yes       EPC/DPE class label
epc_mode      string       Yes       EPC proxy mode

10.4. Building Spatial Index (kub.building_spatial_index v1.0)

Column           Type     Nullable  Description
---------------  -------  --------  ------------------------------
building_id      int64    No        Building identifier
centroid_lon     float64  No        Centroid longitude (EPSG:4326)
centroid_lat     float64  No        Centroid latitude (EPSG:4326)
district_scheme  string   No        District scheme identifier
district_id      string   No        District identifier
lau_id           string   Yes       LAU identifier
nuts1            string   Yes       NUTS1 identifier
nuts2            string   Yes       NUTS2 identifier
nuts3            string   Yes       NUTS3 identifier

10.5. Ensemble Summary (ensemble/stats/global_kpi_summary.parquet)

Aggregated statistics across all ensemble samples, produced by the StatsEngine ensemble aggregator. The schema matches global_kpi_summary.parquet with metric values such as ensemble_mean, ensemble_std, ensemble_min, ensemble_max, and ensemble_p05/p50/p95.
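
To make the ensemble metrics concrete, the sketch below computes them for one KPI from a list of per-sample values. Several details are assumptions and may differ from the StatsEngine aggregator: the percentile metric names are read as ensemble_p05, ensemble_p50, and ensemble_p95, percentiles use the nearest-rank method, and the standard deviation is the sample (n - 1) form. ensemble_metrics is a hypothetical helper.

```python
import math
import statistics
from typing import Dict, Sequence


def ensemble_metrics(sample_values: Sequence[float]) -> Dict[str, float]:
    """Aggregate one KPI across ensemble samples (illustrative only)."""
    values = sorted(sample_values)
    if not values:
        raise ValueError("at least one sample value is required")

    def percentile(q: float) -> float:
        # Nearest-rank percentile (assumption; other definitions interpolate).
        rank = max(1, math.ceil(q / 100 * len(values)))
        return values[rank - 1]

    return {
        "ensemble_mean": statistics.fmean(values),
        # Sample standard deviation (assumption); undefined for one sample.
        "ensemble_std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "ensemble_min": values[0],
        "ensemble_max": values[-1],
        "ensemble_p05": percentile(5),
        "ensemble_p50": percentile(50),
        "ensemble_p95": percentile(95),
    }


metrics = ensemble_metrics([100.0, 110.0, 120.0, 130.0, 140.0])
for name, value in metrics.items():
    print(name, value)
```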

11. Migration Guide

11.1. Existing Data

Old structure (before v1.0):

simulations/
`-- run_<timestamp>/
    |-- database/
    `-- ensemble/

New structure (v1.0+):

simulations/
|-- single/
|   `-- run_<timestamp>/
|       `-- database/
`-- ensemble/
    `-- run_<timestamp>/
        |-- member_<N>/database/
        `-- ensemble/

Migration Steps:

  1. Identify run type from existing data

  2. Move single runs to simulations/single/

  3. Move ensemble runs to simulations/ensemble/

  4. Update scripts to use the RunType parameter

Migration Script (Python example):

import shutil
from pathlib import Path

cemdb_root = Path("/path/to/cemdb")
location = "kernante"
simulations = cemdb_root / "locations" / location / "simulations"

(simulations / "single").mkdir(exist_ok=True)
(simulations / "ensemble").mkdir(exist_ok=True)

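# Materialize the listing first: moving directories out of a folder while
# lazily iterating over it is not guaranteed to visit every entry once.
for run_dir in sorted(simulations.glob("run_*")):
    if not (run_dir / "database").exists():
        continue  # not a recognizable run directory; leave it in place

    # Ensemble runs contain an ensemble/ directory or member_* subdirectories.
    has_ensemble = (run_dir / "ensemble").exists()
    has_members = any(run_dir.glob("member_*"))

    run_type = "ensemble" if has_ensemble or has_members else "single"
    shutil.move(str(run_dir), str(simulations / run_type / run_dir.name))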

12. Benefits of the Separated Structure

  1. Clarity: run type is visible from filesystem layout

  2. Organization: easier to manage and archive different run types

  3. Performance: faster directory listings, since each run-type directory holds fewer entries

  4. Policies: apply different retention policies per run type

  5. Tooling: scripts can determine the run type from the path without parsing run metadata

  6. Debugging: easier to locate runs during development

13. Common Workflows

13.1. Single Simulation

PathManager::instance().initialize(
    "/data/cemdb",
    "kernante",
    "run_2025-01-15_14-30-00",
    std::nullopt,
    RunType::Single
);

auto model = cityEnergyModel();
auto instance = model->newInstance(startTime, stopTime, stepTime);
instance->execute();

// Results saved to:
// /data/cemdb/locations/kernante/simulations/single/run_2025-01-15_14-30-00/database/

13.2. Ensemble Simulation

PathManager::instance().initialize(
    "/data/cemdb",
    "kernante",
    "run_2025-01-15_15-00-00",
    std::nullopt,
    RunType::Ensemble
);

auto model = cityEnergyModel();
EnsemblePlan plan = loadEnsemblePlan("uncertainty.json");
auto stats = model->executeEnsemble(plan, startTime, stopTime, stepTime);

// Results saved to:
// /data/cemdb/locations/kernante/simulations/ensemble/run_2025-01-15_15-00-00/
//    |-- member_000/database/
//    |-- member_001/database/
//    `-- ensemble/

13.3. Notebook Analysis

from feelpp.ktirio.ub import core as ktirio_core
from pathlib import Path
import json

pm = ktirio_core.PathManager.instance()
pm.initialize(
    customRoot="/data/cemdb",
    locationName="kernante",
    runId="run_2025-01-15_15-00-00",
    runType=ktirio_core.RunType.Ensemble
)

ensemble_dir = Path(pm.ensembleDir())
statistics_file = ensemble_dir / "statistics.json"

with open(statistics_file) as f:
    stats = json.load(f)

print(f"Mean FinalEnergy: {stats['outputs']['FinalEnergy']['mean']} MJ")
print(f"95% CI: [{stats['outputs']['FinalEnergy']['ci_lower']}, "
      f"{stats['outputs']['FinalEnergy']['ci_upper']}]")

14. Troubleshooting

RunType not found error

  • Cause: older code not updated to use the RunType parameter

  • Solution: add RunType::Single or RunType::Ensemble to initialize() calls

Ensemble statistics saved in wrong directory

  • Cause: using currentRunDir() instead of ensembleDir()

  • Solution: always use ensembleDir() for ensemble-level statistics

Cannot find simulation outputs

  • Cause: looking in the old simulations/ structure

  • Solution: use simulations/single/ or simulations/ensemble/
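
When it is unclear which layout a given run lives in, checking the candidate locations in order can save some manual searching. The sketch below looks in simulations/single/, then simulations/ensemble/, then the legacy flat location. find_run is a hypothetical helper, and the demo builds a throwaway directory tree instead of touching a real CEMDB root.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_run(simulations: Path, run_id: str) -> Optional[Path]:
    """Look for run_id in the separated layout first, then in the legacy one."""
    candidates = (
        simulations / "single" / run_id,
        simulations / "ensemble" / run_id,
        simulations / run_id,  # legacy location (before v1.0)
    )
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None


# Demo on a temporary tree.
with tempfile.TemporaryDirectory() as tmp:
    sims = Path(tmp) / "simulations"
    (sims / "ensemble" / "run_demo").mkdir(parents=True)
    found = find_run(sims, "run_demo")
    found_rel = found.relative_to(sims).as_posix()
    missing = find_run(sims, "run_missing")

print(found_rel)  # ensemble/run_demo
print(missing)    # None
```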

15. References

  • src/cpp/ktirio/ub/pathmanager.{hpp,cpp}

  • src/python/feelpp/ktirio/ub/_cpp/bindings.cpp

  • src/cpp/ktirio/ub/ensemble.cpp

  • src/notebooks/ensemble/notebook_utils.py

16. Version History

  • v1.0 (2025-01-15): initial separated structure with RunType classification