4457617953
Add a new `validate` subcommand that checks JSONL training manifests before starting expensive fine-tuning jobs. This catches format issues, missing audio files, and data quality problems early.

The validator performs:
- JSONL format validation (each line must be valid JSON)
- Required column checks (text, audio)
- Audio file existence and readability verification
- Duration and text length statistics (min, max, mean, median)
- Optional ref_audio column validation
- Warnings for very short (<0.3s) or very long (>30s) audio samples

Usage:
    voxcpm validate --manifest train.jsonl
    voxcpm validate --manifest train.jsonl --sample-rate 16000 --verbose

The module uses lazy imports for soundfile, so it works even in minimal environments. Includes 11 unit tests covering all validation paths.
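The checks listed in the commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `validate.py`: the names `check_manifest_lines` and `audio_duration_seconds`, the `base_dir` parameter, and the returned dict layout are all hypothetical. It covers JSON parsing, required columns, file existence, and text length statistics, and shows one way a lazy `soundfile` import could be arranged.

```python
import json
import os
import statistics


def audio_duration_seconds(path):
    """Hypothetical helper: return the duration in seconds, or None if soundfile is missing."""
    try:
        import soundfile as sf  # imported lazily so the other checks work without it
    except ImportError:
        return None
    return sf.info(path).duration


def check_manifest_lines(lines, required=("text", "audio"), base_dir="."):
    """Hypothetical helper: validate JSONL manifest lines and collect text length stats."""
    errors = []
    text_lengths = []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        try:
            row = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        missing = [col for col in required if col not in row]
        if missing:
            errors.append(f"line {lineno}: missing column(s) {', '.join(missing)}")
            continue
        audio_path = os.path.join(base_dir, str(row["audio"]))
        if not os.path.isfile(audio_path):
            errors.append(f"line {lineno}: audio file not found: {row['audio']}")
        text_lengths.append(len(str(row["text"])))

    stats = {}
    if text_lengths:
        stats = {
            "min": min(text_lengths),
            "max": max(text_lengths),
            "mean": statistics.mean(text_lengths),
            "median": statistics.median(text_lengths),
        }
    return {"ok": not errors, "errors": errors, "text_length": stats}
```

A caller could pass an open manifest file directly, for example `check_manifest_lines(open("train.jsonl", encoding="utf-8"))`, and then apply the duration thresholds (<0.3s, >30s) to `audio_duration_seconds(...)` for each existing file when `soundfile` is available.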
31 lines
739 B
Python
"""
Training utilities for VoxCPM fine-tuning.

This package mirrors the training mechanics used in the minicpm-audio
tooling while relying solely on local audio-text datasets managed via
the HuggingFace ``datasets`` library.
"""

from .accelerator import Accelerator
from .tracker import TrainingTracker
from .data import (
    load_audio_text_datasets,
    HFVoxCPMDataset,
    build_dataloader,
    BatchProcessor,
)
from .state import TrainingState
from .validate import validate_manifest, ValidationResult

__all__ = [
    "Accelerator",
    "TrainingTracker",
    "HFVoxCPMDataset",
    "BatchProcessor",
    "TrainingState",
    "load_audio_text_datasets",
    "build_dataloader",
    "validate_manifest",
    "ValidationResult",
]