Dataset Naming Conventions

1. Overview

Dataset names in CEMDB must follow strict conventions to ensure cross-platform compatibility and consistency between C++ and Python code paths. This document describes the naming rules, normalization process, and versioning scheme used for datasets.

2. Naming Rules

2.1. Allowed Characters

Dataset names must contain only:

  • Lowercase letters (a-z)

  • Numbers (0-9)

  • Underscores (_)
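These rules can be checked with a short regular expression. The sketch below is illustrative (the helper name is hypothetical, not part of the documented API); the real validator may impose additional constraints:

```python
import re

# A valid dataset name: one or more lowercase letters, digits, or underscores.
VALID_NAME = re.compile(r"[a-z0-9_]+")

def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return VALID_NAME.fullmatch(name) is not None
```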

2.2. Forbidden Characters

The following characters are not allowed in dataset names:

  • - (dash): can be confused with command-line flags in shell scripts

  • ` ` (space): causes argument parsing issues in shells

  • Accented characters (é, è, ü, etc.): may not be preserved consistently across systems

  • Uppercase letters: normalized to lowercase for consistency

  • Special characters (/, \, :, *, ?, ", <, >, |): path separators or reserved characters on various filesystems

2.3. Why These Restrictions?

The naming conventions ensure:

  • Shell safety: Paths can be used in shell scripts without quoting or escaping

  • Cross-platform compatibility: Works reliably on Linux, macOS, and Windows

  • Path expansion: Environment variable expansion (${location}) works cleanly

  • Web integration: Dashboard can use location names in URLs directly

  • Python compatibility: Underscores are valid in Python variable names

3. Normalization

The system automatically normalizes dataset names (called location names in the C++ API) through the PathManager::normalizeLocationName() function and its Python equivalent.

3.1. Normalization Algorithm

  1. Convert to lowercase

  2. Transliterate accented characters to their ASCII equivalents (é -> e)

  3. Replace all remaining non-alphanumeric characters with underscores

  4. Collapse consecutive underscores

  5. Strip leading and trailing underscores
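The steps above can be sketched in Python. Note that the examples in 3.2 show accented characters being transliterated (é -> e) rather than replaced with underscores, so the sketch folds accents to ASCII via Unicode decomposition; the actual PathManager::normalizeLocationName() may handle non-Latin scripts differently:

```python
import re
import unicodedata

def normalize_location_name(name: str) -> str:
    """Illustrative reimplementation of the normalization algorithm."""
    name = name.lower()
    # Transliterate accents: decompose, then drop the combining marks (é -> e).
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii")
    # Replace each run of non-alphanumeric characters with a single underscore
    # (this also collapses consecutive underscores), then strip the ends.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")
```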

3.2. Examples

Input                        Normalized Output

Strasbourg                   strasbourg
Paris 15ème                  paris_15eme
Saint-Étienne                saint_etienne
paris-6km-with-storeys-nd    paris_6km_with_storeys_nd
New York City                new_york_city

4. Validation in kub-dataset

The kub-dataset CLI tool validates dataset names and rejects invalid ones with helpful error messages.

4.1. Default Behavior (Strict Mode)

By default, invalid names are rejected:

$ kub-dataset pack paris-6km-with-storeys-nd
Error: Invalid dataset name 'paris-6km-with-storeys-nd': dashes (-) are not allowed.
       Use underscores (_) instead. Suggested: 'paris_6km_with_storeys_nd'

4.2. Auto-Normalization Mode

Use the --normalize flag to automatically fix invalid names:

$ kub-dataset pack paris-6km-with-storeys-nd --normalize
Normalized: 'paris-6km-with-storeys-nd' -> 'paris_6km_with_storeys_nd'
Created: paris_6km_with_storeys_nd_input.zip

The --normalize flag is available on pack, push, and pull commands.

5. Versioning (Semantic Versioning)

Datasets use Semantic Versioning (semver) for version numbers.

5.1. Version Format

MAJOR.MINOR.PATCH[-prerelease]
Component    Description                                      Example

MAJOR        Incompatible changes to dataset structure        2.0.0
MINOR        New data added in a backward-compatible manner   1.1.0
PATCH        Bug fixes or data corrections                    1.0.1
prerelease   Optional pre-release identifier                  1.0.0-beta, 1.0.0-rc1

5.2. Version Examples

  • 1.0.0 - First stable release

  • 0.99.0 - Pre-release version approaching 1.0

  • 1.0.0-beta - Beta version

  • 2.1.3 - Third patch of version 2.1
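A version string in this format can be parsed with a small regular expression. This is a simplified sketch of the MAJOR.MINOR.PATCH[-prerelease] pattern (full semver additionally allows build metadata after a +, which is not covered here):

```python
import re

# Matches MAJOR.MINOR.PATCH with an optional -prerelease suffix.
SEMVER = re.compile(
    r"^(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(?:-(?P<prerelease>[0-9A-Za-z.-]+))?$"
)

def parse_version(version: str) -> dict:
    """Split a dataset version into its components; raise on malformed input."""
    m = SEMVER.match(version)
    if m is None:
        raise ValueError(f"not a valid dataset version: {version!r}")
    return m.groupdict()
```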

5.3. Version in Filenames

Dataset archives follow this naming pattern:

{location}_input-v{version}.zip

Examples:

  • kernante_input-v0.99.0.zip

  • strasbourg_input-v1.0.0.zip

  • paris_6km_with_storeys_nd_input-v1.0.0.zip
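The pattern is straightforward to build programmatically; this helper is hypothetical (not part of the documented API) and simply applies the {location}_input-v{version}.zip template:

```python
def archive_name(location: str, version: str) -> str:
    """Build a dataset archive filename from a normalized location and version."""
    return f"{location}_input-v{version}.zip"
```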

5.4. Version Tracking with DVC

When pulling datasets with the --dvc flag, version information is recorded in .dvc files:

outs:
- md5: abc123...
  path: kernante
  hash: md5
  size: 12345678
  nfiles: 42
meta:
  source: girder-unistra
  version: 0.99.0          # Resolved version
  requested: latest        # What was requested
  tool: kub-dataset

This allows tracking both the actual version pulled and what was originally requested (e.g., latest).
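The distinction between the resolved and requested version can be modeled as a small record. The dataclass below mirrors the meta block shown above for illustration only; the exact schema is defined by kub-dataset, not by this sketch:

```python
from dataclasses import dataclass

@dataclass
class DatasetMeta:
    """Mirror of the `meta:` block written into .dvc files."""
    source: str
    version: str       # resolved version actually pulled, e.g. "0.99.0"
    requested: str     # what the user asked for, e.g. "latest"
    tool: str = "kub-dataset"

meta = DatasetMeta(source="girder-unistra", version="0.99.0", requested="latest")
```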

6. Python API

Validation functions are available in the Python API:

from feelpp.ktirio.ub.dataset import (
    validate_dataset_name,
    normalize_dataset_name,
    validate_or_normalize,
)

# Validate a name (raises ValueError if invalid)
validate_dataset_name("kernante")  # OK
validate_dataset_name("paris-6km")  # Raises ValueError

# Normalize a name
normalized = normalize_dataset_name("Paris 15ème")
# -> "paris_15eme"

# Validate or normalize based on flag
name = validate_or_normalize("paris-6km", normalize=True)
# -> "paris_6km"

7. C++ API

The normalization function is available via the PathManager class:

#include <ktirio/ub/pathmanager.hpp>

// Static method for normalization
std::string normalized = Feel::Ktirio::Ub::PathManager::normalizeLocationName("Paris 15ème");
// -> "paris_15eme"

// Used automatically during PathManager initialization
auto& pm = Feel::Ktirio::Ub::PathManager::instance();
pm.initialize(std::nullopt, "paris-6km-with-storeys-nd");
// Location name is normalized to "paris_6km_with_storeys_nd"

8. Best Practices

  1. Use underscores from the start: When creating new datasets, use underscores instead of dashes

  2. Lowercase everything: Avoid uppercase letters in dataset names

  3. Keep names simple: Use only alphanumeric characters and underscores

  4. Version consistently: Follow semver for all dataset releases

  5. Use DVC tracking: Enable --dvc when pulling to track dataset provenance