HiDALGO2 D4.2 — kub-dataset and Multi-Backend Data Management (Draft Section)
This section describes the KUB contribution to HiDALGO2 D4.2 through kub-dataset. The objective is to provide reproducible dataset packaging and transfer across multiple data management platforms (CKAN, HDFS, S3-compatible storage), while preserving a single user workflow.
1. kub-dataset design and dataset layout
kub-dataset manages location datasets as versioned assets (vX.Y.Z) with a current pointer for operational selection. This supports patch/minor/major update policies while keeping reproducible historical snapshots available. Local path resolution is centralized in the path manager layer so tooling can transparently consume flat or versioned datasets.
Operationally, the CLI delegates storage operations to a backend factory and a common backend protocol (push, pull, list_locations, list_versions, component-level operations). This design isolates platform-specific logic and allows the same command set to target different storage systems.
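The factory/protocol split can be illustrated with a minimal sketch. The signatures below are assumptions for illustration only; the real kub-dataset protocol also declares component-level operations not shown here.

```python
# Sketch of the common backend protocol and factory lookup (illustrative
# signatures; the real kub-dataset protocol declares more operations).
from typing import Protocol

class StorageBackend(Protocol):
    def push(self, location: str, version: str) -> None: ...
    def pull(self, location: str, version: str) -> None: ...
    def list_locations(self) -> list[str]: ...
    def list_versions(self, location: str) -> list[str]: ...

def make_backend(name: str, registry: dict[str, type]) -> StorageBackend:
    """Minimal backend factory: map a backend name to its class and
    instantiate it, so the CLI never touches platform-specific code."""
    try:
        return registry[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None
```

The point of the pattern is that adding a storage platform means registering one class; the command set stays unchanged.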
2. Backend focus: NiFi facade + native connectors
For HiDALGO2 integration, the HiDALGO2 backend acts as a NiFi-based transfer facade through hid_data_transfer_lib:
- CKAN-oriented pipelines: local2ckan, ckan2local.
- HDFS-oriented pipelines: local2hdfs, hdfs2local (method-name compatibility handling included).
- Optional proxy fallback to native CKAN/HDFS connectors when a NiFi transfer call fails.
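The facade-with-fallback behavior can be sketched as below. All class and method names other than local2ckan are hypothetical; the actual hid_data_transfer_lib API and the native connector interface may differ.

```python
# Sketch of the NiFi-facade-with-fallback pattern (hypothetical class and
# connector names; hid_data_transfer_lib's actual API may differ).
import logging

class Hidalgo2Backend:
    def __init__(self, nifi_client, native_ckan):
        self.nifi = nifi_client          # NiFi-based transfer facade
        self.ckan = native_ckan          # native connector used as fallback

    def push(self, local_path: str, package: str) -> str:
        """Try the orchestrated NiFi pipeline first (local2ckan); on
        failure, proxy the transfer to the native CKAN connector."""
        try:
            self.nifi.local2ckan(local_path, package)
            return "nifi"
        except Exception as exc:
            logging.warning("NiFi transfer failed (%s); falling back to native CKAN", exc)
            self.ckan.upload(local_path, package)
            return "native-fallback"
```

Returning which path served the transfer also gives the metrics layer (Section 4) its fallback-rate signal for free.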
In parallel, native connectors are available and can be used directly:
- CKAN backend: package/resource lifecycle through the CKAN API.
- HDFS backend: hierarchical archive storage and retrieval.
- S3 backend: S3-compatible object storage (AWS-compatible and MinIO-target model), including a web-identity flow for temporary credentials.
This provides resilience: NiFi for orchestrated transfer workflows, native backends for continuity and diagnostics.
3. Login and tenant management
Authentication uses Keycloak/OIDC providers with:
- interactive PKCE browser login,
- non-interactive login for automation,
- token/context propagation to backends.
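For the interactive login, the PKCE material follows the standard RFC 7636 S256 method; a minimal sketch of generating the verifier/challenge pair (utility name is illustrative, not kub-dataset's actual helper):

```python
# Sketch of PKCE material generation for the interactive browser login
# (standard RFC 7636 S256 method; the function name is illustrative).
import base64, hashlib, secrets

def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) for an OIDC authorization
    request using the S256 challenge method."""
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode()).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge
```

The verifier stays on the client; only the challenge travels in the authorization URL, which is what lets a public CLI client log in without a stored secret.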
The same auth context carries organization identity, enabling tenant-aware routing (e.g., CKAN organization selection, S3 bucket naming convention, HDFS path partitioning by organization).
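The tenant-aware routing conventions can be sketched as pure derivations from the organization identity. The naming schemes below are illustrative assumptions, not the project's fixed conventions:

```python
# Sketch of tenant-aware routing derived from the auth context (the naming
# conventions here are illustrative, not the project's fixed scheme).
import re

def s3_bucket_for(org: str, prefix: str = "kub") -> str:
    """Derive an S3 bucket name from the organization identity
    (lowercased and hyphenated to satisfy S3 naming restrictions)."""
    slug = re.sub(r"[^a-z0-9-]", "-", org.lower()).strip("-")
    return f"{prefix}-{slug}"

def hdfs_path_for(org: str, location: str, version: str) -> str:
    """Partition HDFS archives by organization, then by location/version."""
    return f"/datasets/{org}/{location}/{version}"
```

Deriving routing deterministically from the auth context means no per-tenant configuration is needed on the client side.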
4. Metrics and future capabilities for D4.2
The current architecture already exposes the basis for operational observability. Recommended future metrics:
- transfer throughput/latency by backend and archive type;
- NiFi success rate, fallback rate, and retry/recovery time;
- integrity and quality metrics (checksum pass rate, manifest/schema validation status);
- version governance indicators (patch cadence, age of current, rollback frequency);
- access/security telemetry (login success/failure, token refresh issues, per-organization volume).
A practical next step is standardized per-transfer event export (JSON and Prometheus-compatible) to unify monitoring across the CKAN, HDFS, S3, and NiFi execution paths.
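One possible shape for such an export is sketched below. The field names and the metric name kub_transfers_total are a proposal for illustration, not an agreed schema:

```python
# Sketch of a standardized per-transfer event record (field names are a
# proposal, not an agreed schema), serializable to JSON and aggregable
# into Prometheus exposition-format counters.
import json
from dataclasses import dataclass, asdict

@dataclass
class TransferEvent:
    backend: str          # e.g. "ckan", "hdfs", "s3", "nifi"
    operation: str        # "push" or "pull"
    organization: str
    bytes_moved: int
    duration_s: float
    ok: bool
    fallback_used: bool = False

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

def prometheus_lines(events: list[TransferEvent]) -> list[str]:
    """Aggregate events into Prometheus exposition-format counter lines."""
    counts: dict[tuple[str, str], int] = {}
    for e in events:
        key = (e.backend, "success" if e.ok else "failure")
        counts[key] = counts.get(key, 0) + 1
    return [
        f'kub_transfers_total{{backend="{b}",status="{s}"}} {n}'
        for (b, s), n in sorted(counts.items())
    ]
```

Emitting one such event per push/pull, regardless of execution path, would make the fallback-rate and per-organization-volume metrics listed above directly computable.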