Files
VoxCPM/src/voxcpm/training/__init__.py
T
supermario_leo 4457617953 feat: add voxcpm validate CLI for pre-flight training data checks
Add a new `validate` subcommand that checks JSONL training manifests
before starting expensive fine-tuning jobs. This catches format issues,
missing audio files, and data quality problems early.

The validator performs:
- JSONL format validation (each line must be valid JSON)
- Required column checks (text, audio)
- Audio file existence and readability verification
- Duration and text length statistics (min, max, mean, median)
- Optional ref_audio column validation
- Warnings for very short (<0.3s) or very long (>30s) audio samples

Usage:
  voxcpm validate --manifest train.jsonl
  voxcpm validate --manifest train.jsonl --sample-rate 16000 --verbose

The module uses lazy imports for soundfile, so it works even in
minimal environments. Includes 11 unit tests covering all validation
paths.
2026-04-13 03:15:50 +08:00

31 lines
739 B
Python

"""
Training utilities for VoxCPM fine-tuning.
This package mirrors the training mechanics used in the minicpm-audio
tooling while relying solely on local audio-text datasets managed via
the HuggingFace ``datasets`` library.
"""
from .accelerator import Accelerator
from .tracker import TrainingTracker
from .data import (
load_audio_text_datasets,
HFVoxCPMDataset,
build_dataloader,
BatchProcessor,
)
from .state import TrainingState
from .validate import validate_manifest, ValidationResult
__all__ = [
"Accelerator",
"TrainingTracker",
"HFVoxCPMDataset",
"BatchProcessor",
"TrainingState",
"load_audio_text_datasets",
"build_dataloader",
"validate_manifest",
"ValidationResult",
]