Move `import soundfile as sf` and `from voxcpm.core import VoxCPM` from
module-level into the functions that require model inference (load_model,
_run_single, cmd_batch), so `voxcpm validate` can run without loading
the model/inference stack.
Pass manifest path via --manifest flag (required) instead of as a
positional argument, so the test exercises cmd_validate rather than
argparse error handling. Also assert returncode==1 and check stderr
for the FAILED/error message to prevent false positives.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Invalid audio rows (bad path or sample-rate mismatch) no longer
increment valid_samples; has_error is now set on any audio failure
- _check_audio_file now enforces the expected sample rate when soundfile
is available, making --sample-rate actually useful
- ref_audio missing-file warning is emitted for every invalid entry
independently, not only before the first valid one is seen
- New tests cover each of the four corrected behaviours: invalid audio
count, sample-rate mismatch, mixed ref_audio, and CLI exit code
Drop "half" from _VALID_DTYPE_OVERRIDES / _LOW_PRECISION_DTYPES.
get_dtype() has never accepted "half", so VOXCPM_MPS_DTYPE=half would
pass override validation and then crash downstream with
"Unsupported dtype: half". The remaining aliases (bfloat16/bf16,
float16/fp16, float32/fp32) already cover the intended dtype space.
Adds a standalone unit check under scripts/ to guard the invariant
that every accepted override parses through get_dtype().
Addresses review feedback on #263.
Added regex to strip parentheses from control instructions in the text synthesis method to ensure compatibility with the expected prompt format. This change improves the robustness of the input handling.
LoRA is a first-class workflow in VoxCPM, and the project already prefers
safetensors plus weights-only fallback loading for base model artifacts. The
legacy LoRA .ckpt/.pth path was the remaining place that still deserialized
arbitrary pickle objects, so this switches it to weights_only=True and adds
focused regression coverage for both model loaders.
Constraint: Must preserve compatibility with tensor-only legacy LoRA checkpoints
Rejected: Remove .ckpt/.pth support entirely | too disruptive for existing users
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep LoRA artifact handling aligned with the existing safetensors-first, weights-only loading pattern
Tested: python3 -m pytest -q tests/test_lora_checkpoint_loading.py tests/test_model_utils.py -q
Not-tested: Full end-to-end LoRA hot-load with heavyweight model assets
Document vLLM-Omni as a production serving option for VoxCPM2
alongside the existing Nano-vLLM reference. Mirrors the addition in
README_zh.md, and adds an ecosystem table entry.
Install snippet follows the upstream vLLM-Omni installation guide
(from source, since vllm-omni is rapidly evolving).
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
- StreamingVAEDecoder caches CausalConv1d/CausalTransposeConv1d left-pad
state between calls — one patch in, one patch out, no overlap
- _inference yields single-patch latents in streaming mode
- 2x faster streaming VAE decode, more accurate (max diff 0.0005 vs 0.0011)
VoxCPM checkpoints default to bfloat16. Following commit e4e0496 which
added MPS device routing, running with `device=mps` selects bf16 on
Apple Silicon. On Metal, bf16 introduces enough numerical drift in the
diffusion AR loop that the synthesized audio is glitched and trips the
model's badcase detector, which retries until the per-call retry budget
is exhausted. Effectively MPS support is unusable in the default config.
This patch adds a single helper, `pick_runtime_dtype(device, dtype)`,
that promotes any low-precision dtype to float32 when the resolved
device is `mps`. CUDA and CPU paths are untouched. An opt-out env var
`VOXCPM_MPS_DTYPE` lets users force a specific dtype on MPS once future
PyTorch / macOS releases improve bf16 stability.
Both VoxCPMModel and VoxCPM2Model adopt the helper in their __init__,
replacing what would otherwise be duplicated inline checks.
Verified locally on Apple M5 Max, PyTorch 2.11, macOS 15:
- VoxCPM2 (2B): clean output, RTF ~0.78 steady state
- VoxCPM 0.5B: clean output, RTF ~0.92
- No badcase retries fired in any test
- VOXCPM_MPS_DTYPE=bfloat16 round-trips and reproduces the original
glitched output, confirming the override path.
Move generator close handling into a shared utility and wire the core generation pipeline through it so partially-consumed prompt cache generators are cleaned up consistently across both model variants and the public VoxCPM wrapper.
Made-with: Cursor
Factor repeated next-and-close patterns into a shared helper in both VoxCPM model variants so non-streaming inference cleans up generators consistently while keeping the issue reference close to the workaround.
Made-with: Cursor
Use an explicit broadcastable attention mask shape during MiniCPM incremental decoding so CPU runtimes avoid a PyTorch SDPA dimension error without changing attention semantics.
Made-with: Cursor
Add a new `validate` subcommand that checks JSONL training manifests
before starting expensive fine-tuning jobs. This catches format issues,
missing audio files, and data quality problems early.
The validator performs:
- JSONL format validation (each line must be valid JSON)
- Required column checks (text, audio)
- Audio file existence and readability verification
- Duration and text length statistics (min, max, mean, median)
- Optional ref_audio column validation
- Warnings for very short (<0.3s) or very long (>30s) audio samples
Usage:
voxcpm validate --manifest train.jsonl
voxcpm validate --manifest train.jsonl --sample-rate 16000 --verbose
The module uses lazy imports for soundfile, so it works even in
minimal environments. Includes 11 unit tests covering all validation
paths.
Use context managers when reading config.json in VoxCPMModel.from_local()
and VoxCPM2Model.from_local() to prevent file descriptor leaks. Also add
explicit encoding="utf-8" to avoid locale-dependent decode errors.
Closes#235
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Support optional ref_audio samples in finetuning and make runtime device selection explicit while keeping auto fallback behavior consistent. Also ignore the local app override file to avoid accidental commits.
Made-with: Cursor
The previous guard `not text.strip() or not isinstance(text, str)` called
.strip() before verifying that text is actually a string, causing an
AttributeError (e.g. for int input) instead of the intended ValueError.
Swap operand order so isinstance check short-circuits first.
Closes#228
Streaming decode previously re-decoded 4 overlapping patches through
the VAE each step, discarding 75% of the output. Replace with stateful
decode that carries causal conv padding buffers between calls — one
patch in, one patch out, no overlap.
Changes:
- Add StreamingVAEDecoder to audiovae/audio_vae_v2.py — caches
CausalConv1d and CausalTransposeConv1d left-pad state between calls
- AudioVAE.streaming_decode() context manager for clean lifecycle
- _inference yields single-patch latents in streaming mode
- _generate and _generate_with_prompt_cache use StreamingVAEDecoder
Streaming VAE decode time (isolated): 289ms → 148ms (2x faster)
Stateful vs full decode: cosine 1.0000, max diff 0.0005
(more accurate than previous overlap approach at max diff 0.001)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>