Dataset Workflow

This stage prepares a reproducible local dataset state before simulation submission.

--cemdb-root can be omitted in many cases:

  • If the current directory contains ./cemdb, kub-dataset and kub-simulate default to cemdb/locations.

  • kub-dashboard auto-detects ./cemdb.

CEMDB_ROOT environment variable:

  • kub-dashboard uses CEMDB_ROOT as its default CEMDB path.

  • For kub-dataset and kub-simulate, use --cemdb-root when your data is not under ./cemdb/locations.
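
The rules above can be folded into a small shell helper. This is only a sketch, not part of the kub tooling: resolve_locations_root is a hypothetical function name, and the fallback to $CEMDB_ROOT/locations assumes your location data sits in a locations/ subdirectory of CEMDB_ROOT, which the tools themselves do not guarantee.

```shell
# Hypothetical helper: prints the path to pass to --cemdb-root for
# kub-dataset and kub-simulate.
resolve_locations_root() {
  if [ -d ./cemdb/locations ]; then
    # Default layout is present in the current directory.
    printf '%s\n' cemdb/locations
    return 0
  fi
  if [ -n "${CEMDB_ROOT:-}" ]; then
    # Assumption: location data lives under $CEMDB_ROOT/locations.
    printf '%s\n' "${CEMDB_ROOT}/locations"
    return 0
  fi
  echo "no ./cemdb/locations and CEMDB_ROOT is unset; pass --cemdb-root explicitly" >&2
  return 1
}

# Possible usage (requires kub-dataset on PATH):
#   kub-dataset summary arz --cemdb-root "$(resolve_locations_root)"
```

Passing the resolved path explicitly keeps scripts independent of the directory they are launched from.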

1. Discover Available Data Sources

kub-dataset list-dmps
kub-dataset list-locations --show-versions

If you need a specific backend only:

kub-dataset list-locations --dmp girder-unistra --show-versions

2. Pull Location Dataset

  • Public Dataset

kub-dataset pull arz \
  --version 0.1.0 \
  --cemdb-root cemdb/locations \
  --dmp girder-unistra

  • Authenticated Pull

kub-dataset pull arz \
  --version 0.1.0 \
  --cemdb-root cemdb/locations \
  --dmp girder-unistra \
  --api-key "$GIRDER_API_KEY"

By default, kub-dataset currently uses split archives for both pull and push when --type all is given.
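
If the same script has to serve both public and authenticated pulls, one option is to append --api-key only when a key is available. The sketch below uses only the flags shown in this section; pull_location and DRY_RUN are hypothetical names introduced for illustration, not kub-dataset features.

```shell
# Hypothetical wrapper around the pull command shown above.
# $1 = location name, $2 = dataset version.
pull_location() {
  # Rebuild the positional parameters as the full command line.
  set -- kub-dataset pull "$1" \
    --version "$2" \
    --cemdb-root cemdb/locations \
    --dmp girder-unistra
  if [ -n "${GIRDER_API_KEY:-}" ]; then
    # Authenticated pull: add the key only when it is set and non-empty.
    set -- "$@" --api-key "$GIRDER_API_KEY"
  fi
  if [ -n "${DRY_RUN:-}" ]; then
    # Print the command instead of running it (this echoes the key if set).
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Possible usage (requires kub-dataset on PATH):
#   pull_location arz 0.1.0
```

Setting DRY_RUN=1 before calling the function shows the command that would run, which is handy for checking the flags without touching the network.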

3. Pull Simulator Dataset

kub-dataset pull-simulator \
  --version 0.2.0 \
  --cemdb-root cemdb \
  --force

4. Verify Local Dataset State

kub-dataset summary arz --cemdb-root cemdb/locations
kub-dataset manifest-show arz --version 0.1.0 --cemdb-root cemdb/locations

5. Optional: Track Pulls With DVC

kub-dataset init
kub-dataset pull arz --version 0.1.0 --dvc --force --cemdb-root cemdb/locations
kub-dataset status

6. Expected Layout Check

find cemdb/locations/arz/v0.1.0 -maxdepth 2 -type d | sort

You should see key directories such as geo/, weather/, scenarios/, preprocessing/, and simulations/.
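
To make this check scriptable rather than visual, a loop over the directory names listed above can report anything missing. This is a minimal sketch: check_layout is a hypothetical helper, and it tests only the five directories named here, so a complete dataset may contain more.

```shell
# Hypothetical check: returns non-zero if any key directory is missing
# under the dataset version directory given as $1.
check_layout() {
  layout_missing=0
  for layout_dir in geo weather scenarios preprocessing simulations; do
    if [ ! -d "$1/$layout_dir" ]; then
      echo "missing: $1/$layout_dir" >&2
      layout_missing=1
    fi
  done
  return "$layout_missing"
}

# Possible usage:
#   check_layout cemdb/locations/arz/v0.1.0 && echo "layout OK"
```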

7. Next Step

Proceed to Run Simulation.