Dataset Naming Conventions
1. Overview
Dataset names in CEMDB must follow strict conventions to ensure cross-platform compatibility and consistency between C++ and Python code paths. This document describes the naming rules, normalization process, and versioning scheme used for datasets.
2. Naming Rules
2.1. Allowed Characters
Dataset names must contain only:
- Lowercase letters (a-z)
- Numbers (0-9)
- Underscores (_)
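The allowed-character rule can be expressed as a single regular expression. The following is a minimal sketch, not the CEMDB implementation: the helper name `is_allowed_name` is hypothetical, and rejecting the empty string is an assumption.

```python
import re

# Only lowercase ASCII letters, digits, and underscores.
# Requiring at least one character is an assumption of this sketch.
_ALLOWED = re.compile(r"[a-z0-9_]+")


def is_allowed_name(name: str) -> bool:
    """Return True if `name` uses only the allowed characters (hypothetical helper)."""
    return _ALLOWED.fullmatch(name) is not None


print(is_allowed_name("paris_6km_with_storeys_nd"))  # True
print(is_allowed_name("paris-6km"))                  # False (dash)
print(is_allowed_name("Paris"))                      # False (uppercase)
```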
2.2. Forbidden Characters
The following characters are not allowed in dataset names:
| Character | Reason |
|---|---|
| `-` (dash) | Can be confused with command-line flags in shell scripts |
| ` ` (space) | Causes argument parsing issues in shells |
| Accented characters (é, è, ü, etc.) | May not be preserved consistently across systems |
| Uppercase letters | Normalized to lowercase for consistency |
| Special characters | Path separators or reserved characters on various filesystems |
2.3. Why These Restrictions?
The naming conventions ensure:
- Shell safety: Paths can be used in shell scripts without quoting or escaping
- Cross-platform compatibility: Works reliably on Linux, macOS, and Windows
- Path expansion: Environment variable expansion (${location}) works cleanly
- Web integration: Dashboard can use location names in URLs directly
- Python compatibility: Underscores are valid in Python variable names
3. Normalization
The system automatically normalizes location names through the PathManager::normalizeLocationName() function (C++) and its Python equivalent.
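The exact rules are defined by PathManager::normalizeLocationName() and are not reproduced here. As an illustration only, the sketch below approximates a normalization that is consistent with the examples in this document (lowercasing, stripping accents, replacing other characters with underscores). The function name `normalize_name_sketch` is hypothetical, and the handling of repeated separators is an assumption: each disallowed character simply becomes one underscore.

```python
import re
import unicodedata


def normalize_name_sketch(name: str) -> str:
    """Approximate normalization (hypothetical, not the CEMDB implementation)."""
    # Decompose accented characters (e.g. "è" -> "e" + combining accent),
    # then drop everything that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    lowered = ascii_only.lower()
    # Assumption: every remaining disallowed character maps to one underscore.
    return re.sub(r"[^a-z0-9_]", "_", lowered)


print(normalize_name_sketch("Paris 15ème"))                # paris_15eme
print(normalize_name_sketch("paris-6km-with-storeys-nd"))  # paris_6km_with_storeys_nd
```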
4. Validation in kub-dataset
The kub-dataset CLI tool validates dataset names and rejects invalid ones with helpful error messages.
4.1. Default Behavior (Strict Mode)
By default, invalid names are rejected:
$ kub-dataset pack paris-6km-with-storeys-nd
Error: Invalid dataset name 'paris-6km-with-storeys-nd': dashes (-) are not allowed.
Use underscores (_) instead. Suggested: 'paris_6km_with_storeys_nd'
4.2. Auto-Normalization Mode
Use the --normalize flag to automatically fix invalid names:
$ kub-dataset pack paris-6km-with-storeys-nd --normalize
Normalized: 'paris-6km-with-storeys-nd' -> 'paris_6km_with_storeys_nd'
Created: paris_6km_with_storeys_nd_input.zip
The --normalize flag is available on pack, push, and pull commands.
5. Versioning (Semantic Versioning)
Datasets use Semantic Versioning (semver) for version numbers.
5.1. Version Format
MAJOR.MINOR.PATCH[-prerelease]
| Component | Description | Example |
|---|---|---|
| MAJOR | Incompatible changes to dataset structure | |
| MINOR | New data added in backward-compatible manner | |
| PATCH | Bug fixes or data corrections | |
| prerelease | Optional pre-release identifier | |
5.2. Version Examples
- 1.0.0: First stable release
- 0.99.0: Pre-1.0 development version approaching 1.0
- 1.0.0-beta: Beta version
- 2.1.3: Third patch of version 2.1
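To make the MAJOR.MINOR.PATCH[-prerelease] format concrete, here is a minimal parsing and ordering sketch. The names `DatasetVersion`, `parse_version`, and `sort_key` are hypothetical and not part of kub-dataset. The ordering is a simplification of semver precedence: a release sorts after any pre-release of the same MAJOR.MINOR.PATCH, but pre-release identifiers are compared as plain strings.

```python
import re
from typing import NamedTuple, Optional

_VERSION = re.compile(
    r"(?P<major>0|[1-9][0-9]*)\.(?P<minor>0|[1-9][0-9]*)\.(?P<patch>0|[1-9][0-9]*)"
    r"(?:-(?P<prerelease>[0-9A-Za-z.-]+))?"
)


class DatasetVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    prerelease: Optional[str]


def parse_version(text: str) -> DatasetVersion:
    """Parse 'MAJOR.MINOR.PATCH[-prerelease]' or raise ValueError."""
    match = _VERSION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH[-prerelease] version: {text!r}")
    return DatasetVersion(
        int(match["major"]), int(match["minor"]), int(match["patch"]),
        match["prerelease"],
    )


def sort_key(version: DatasetVersion):
    # False sorts before True, so "1.0.0-beta" comes before "1.0.0".
    return (version.major, version.minor, version.patch,
            version.prerelease is None, version.prerelease or "")


versions = ["2.1.3", "1.0.0", "1.0.0-beta", "0.99.0"]
print(sorted(versions, key=lambda v: sort_key(parse_version(v))))
# ['0.99.0', '1.0.0-beta', '1.0.0', '2.1.3']
```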
5.3. Version in Filenames
Dataset archives follow this naming pattern:
{location}_input-v{version}.zip
Examples:
- kernante_input-v0.99.0.zip
- strasbourg_input-v1.0.0.zip
- paris_6km_with_storeys_nd_input-v1.0.0.zip
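The pattern above can be built and taken apart with a few lines of code. This is a sketch under the assumption that the location part already follows the naming rules; `archive_name` and `split_archive_name` are hypothetical helpers, not kub-dataset functions.

```python
import re

# {location}_input-v{version}.zip, with location limited to allowed characters.
_ARCHIVE = re.compile(r"(?P<location>[a-z0-9_]+)_input-v(?P<version>.+)\.zip")


def archive_name(location: str, version: str) -> str:
    """Build an archive filename from a location and a version string."""
    return f"{location}_input-v{version}.zip"


def split_archive_name(filename: str) -> tuple:
    """Return (location, version) for a filename matching the pattern."""
    match = _ARCHIVE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected archive name: {filename!r}")
    return match["location"], match["version"]


print(archive_name("kernante", "0.99.0"))
# kernante_input-v0.99.0.zip
print(split_archive_name("paris_6km_with_storeys_nd_input-v1.0.0.zip"))
# ('paris_6km_with_storeys_nd', '1.0.0')
```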
5.4. Version Tracking with DVC
When pulling datasets with the --dvc flag, version information is recorded in .dvc files:
outs:
- md5: abc123...
  path: kernante
  hash: md5
  size: 12345678
  nfiles: 42
meta:
  source: girder-unistra
  version: 0.99.0      # Resolved version
  requested: latest    # What was requested
  tool: kub-dataset
This allows tracking both the actual version pulled and what was originally requested (e.g., latest).
6. Python API
Validation functions are available in the Python API:
from feelpp.ktirio.ub.dataset import (
    validate_dataset_name,
    normalize_dataset_name,
    validate_or_normalize,
)

# Validate a name (raises ValueError if invalid)
validate_dataset_name("kernante")   # OK
validate_dataset_name("paris-6km")  # Raises ValueError

# Normalize a name
normalized = normalize_dataset_name("Paris 15ème")
# -> "paris_15eme"

# Validate or normalize based on flag
name = validate_or_normalize("paris-6km", normalize=True)
# -> "paris_6km"
7. C++ API
The normalization function is available via the PathManager class:
#include <optional>
#include <string>

#include <ktirio/ub/pathmanager.hpp>

// Static method for normalization
std::string normalized = Feel::Ktirio::Ub::PathManager::normalizeLocationName("Paris 15ème");
// -> "paris_15eme"

// Used automatically during PathManager initialization
auto& pm = Feel::Ktirio::Ub::PathManager::instance();
pm.initialize(std::nullopt, "paris-6km-with-storeys-nd");
// Location name is normalized to "paris_6km_with_storeys_nd"
8. Best Practices
- Use underscores from the start: When creating new datasets, use underscores instead of dashes
- Lowercase everything: Avoid uppercase letters in dataset names
- Keep names simple: Use only alphanumeric characters and underscores
- Version consistently: Follow semver for all dataset releases
- Use DVC tracking: Enable --dvc when pulling to track dataset provenance