Dataset Naming Conventions

1. Overview

Dataset names in CEMDB must follow strict conventions to ensure cross-platform compatibility and consistency between C++ and Python code paths. This document describes the naming rules, normalization process, and versioning scheme used for datasets.

2. Naming Rules

2.1. Allowed Characters

Dataset names must contain only:

  • Lowercase letters (a-z)

  • Numbers (0-9)

  • Underscores (_)
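These rules can be checked with a short regular expression. The sketch below is illustrative (the helper name is hypothetical, not part of the documented API); the real validator may impose additional constraints:

```python
import re

# A valid dataset name: one or more lowercase letters, digits, or underscores.
VALID_NAME = re.compile(r"[a-z0-9_]+")

def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return VALID_NAME.fullmatch(name) is not None
```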

2.2. Forbidden Characters

The following characters are not allowed in dataset names:

  • - (dash): can be confused with command-line flags in shell scripts

  • ` ` (space): causes argument parsing issues in shells

  • Accented characters (é, è, ü, etc.): may not be preserved consistently across systems

  • Uppercase letters: normalized to lowercase for consistency

  • Special characters (/, \, :, *, ?, ", <, >, |): path separators or reserved characters on various filesystems

2.3. Why These Restrictions?

The naming conventions ensure:

  • Shell safety: Paths can be used in shell scripts without quoting or escaping

  • Cross-platform compatibility: Works reliably on Linux, macOS, and Windows

  • Path expansion: Environment variable expansion (${location}) works cleanly

  • Web integration: Dashboard can use location names in URLs directly

  • Python compatibility: Underscores are valid in Python variable names

3. Normalization

The system automatically normalizes dataset names (called location names in the C++ API) through the PathManager::normalizeLocationName() function and its Python equivalent.

3.1. Normalization Algorithm

  1. Convert to lowercase

  2. Transliterate accented characters to their ASCII equivalents (é -> e)

  3. Replace all remaining non-alphanumeric characters with underscores

  4. Collapse consecutive underscores

  5. Strip leading and trailing underscores
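The steps above can be sketched in Python. Note that the examples in 3.2 show accented characters being transliterated (é -> e) rather than replaced with underscores, so the sketch folds accents to ASCII via Unicode decomposition; the actual PathManager::normalizeLocationName() may handle non-Latin scripts differently:

```python
import re
import unicodedata

def normalize_location_name(name: str) -> str:
    """Illustrative reimplementation of the normalization algorithm."""
    name = name.lower()
    # Transliterate accents: decompose, then drop the combining marks (é -> e).
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii")
    # Replace each run of non-alphanumeric characters with a single underscore
    # (this also collapses consecutive underscores), then strip the ends.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    return name.strip("_")
```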

3.2. Examples

Input                        Normalized Output

Strasbourg                   strasbourg
Paris 15ème                  paris_15eme
Saint-Étienne                saint_etienne
paris-6km-with-storeys-nd    paris_6km_with_storeys_nd
New York City                new_york_city

4. Validation in kub-dataset

The kub-dataset CLI tool validates dataset names and rejects invalid ones with helpful error messages.

4.1. Default Behavior (Strict Mode)

By default, invalid names are rejected:

$ kub-dataset pack paris-6km-with-storeys-nd
Error: Invalid dataset name 'paris-6km-with-storeys-nd': dashes (-) are not allowed.
       Use underscores (_) instead. Suggested: 'paris_6km_with_storeys_nd'

4.2. Auto-Normalization Mode

Use the --normalize flag to automatically fix invalid names:

$ kub-dataset pack paris-6km-with-storeys-nd --normalize
Normalized: 'paris-6km-with-storeys-nd' -> 'paris_6km_with_storeys_nd'
Created: paris_6km_with_storeys_nd_input.zip

The --normalize flag is available on pack, push, and pull commands.

5. Versioning (Semantic Versioning)

Datasets use Semantic Versioning (semver) for version numbers.

5.1. Version Format

MAJOR.MINOR.PATCH[-prerelease]
Component    Description                                      Example

MAJOR        Incompatible changes to dataset structure        2.0.0
MINOR        New data added in a backward-compatible manner   1.1.0
PATCH        Bug fixes or data corrections                    1.0.1
prerelease   Optional pre-release identifier                  1.0.0-beta, 1.0.0-rc1

5.2. Version Examples

  • 1.0.0 - First stable release

  • 0.99.0 - Pre-release version approaching 1.0

  • 1.0.0-beta - Beta version

  • 2.1.3 - Third patch of version 2.1
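A version string in this format can be parsed with a small regular expression. This is a simplified sketch of the MAJOR.MINOR.PATCH[-prerelease] pattern (full semver additionally allows build metadata after a +, which is not covered here):

```python
import re

# Matches MAJOR.MINOR.PATCH with an optional -prerelease suffix.
SEMVER = re.compile(
    r"^(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(?:-(?P<prerelease>[0-9A-Za-z.-]+))?$"
)

def parse_version(version: str) -> dict:
    """Split a dataset version into its components; raise on malformed input."""
    m = SEMVER.match(version)
    if m is None:
        raise ValueError(f"not a valid dataset version: {version!r}")
    return m.groupdict()
```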

5.3. Version in Filenames

Dataset archives follow this naming pattern:

{location}_input-v{version}.zip

Examples:

  • kernante_input-v0.99.0.zip

  • strasbourg_input-v1.0.0.zip

  • paris_6km_with_storeys_nd_input-v1.0.0.zip
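The pattern is straightforward to build programmatically; this helper is hypothetical (not part of the documented API) and simply applies the {location}_input-v{version}.zip template:

```python
def archive_name(location: str, version: str) -> str:
    """Build a dataset archive filename from a normalized location and version."""
    return f"{location}_input-v{version}.zip"
```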

5.4. Version Tracking with DVC

When pulling datasets with the --dvc flag, version information is recorded in .dvc files:

outs:
- md5: abc123...
  path: kernante
  hash: md5
  size: 12345678
  nfiles: 42
meta:
  source: girder-unistra
  version: 0.99.0          # Resolved version
  requested: latest        # What was requested
  tool: kub-dataset

This allows tracking both the actual version pulled and what was originally requested (e.g., latest).
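The distinction between the resolved and requested version can be modeled as a small record. The dataclass below mirrors the meta block shown above for illustration only; the exact schema is defined by kub-dataset, not by this sketch:

```python
from dataclasses import dataclass

@dataclass
class DatasetMeta:
    """Mirror of the `meta:` block written into .dvc files."""
    source: str
    version: str       # resolved version actually pulled, e.g. "0.99.0"
    requested: str     # what the user asked for, e.g. "latest"
    tool: str = "kub-dataset"

meta = DatasetMeta(source="girder-unistra", version="0.99.0", requested="latest")
```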

6. Python API

Validation functions are available in the Python API:

from feelpp.ktirio.ub.dataset import (
    validate_dataset_name,
    normalize_dataset_name,
    validate_or_normalize,
)

# Validate a name (raises ValueError if invalid)
validate_dataset_name("kernante")  # OK
validate_dataset_name("paris-6km")  # Raises ValueError

# Normalize a name
normalized = normalize_dataset_name("Paris 15ème")
# -> "paris_15eme"

# Validate or normalize based on flag
name = validate_or_normalize("paris-6km", normalize=True)
# -> "paris_6km"

7. C++ API

The normalization function is available via the PathManager class:

#include <ktirio/ub/pathmanager.hpp>

// Static method for normalization
std::string normalized = Feel::Ktirio::Ub::PathManager::normalizeLocationName("Paris 15ème");
// -> "paris_15eme"

// Used automatically during PathManager initialization
auto& pm = Feel::Ktirio::Ub::PathManager::instance();
pm.initialize(std::nullopt, "paris-6km-with-storeys-nd");
// Location name is normalized to "paris_6km_with_storeys_nd"

8. Best Practices

  1. Use underscores from the start: When creating new datasets, use underscores instead of dashes

  2. Lowercase everything: Avoid uppercase letters in dataset names

  3. Keep names simple: Use only alphanumeric characters and underscores

  4. Version consistently: Follow semver for all dataset releases

  5. Use DVC tracking: Enable --dvc when pulling to track dataset provenance