LoRA is a first-class workflow in VoxCPM, and the project already prefers
safetensors plus weights-only fallback loading for base model artifacts. The
legacy LoRA .ckpt/.pth path was the remaining place that still deserialized
arbitrary pickle objects, so this switches it to weights_only=True and adds
focused regression coverage for both model loaders.
Constraint: Must preserve compatibility with tensor-only legacy LoRA checkpoints
Rejected: Remove .ckpt/.pth support entirely | too disruptive for existing users
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep LoRA artifact handling aligned with the existing safetensors-first, weights-only loading pattern
Tested: python3 -m pytest -q tests/test_lora_checkpoint_loading.py tests/test_model_utils.py -q
Not-tested: Full end-to-end LoRA hot-load with heavyweight model assets
Document vLLM-Omni as a production serving option for VoxCPM2
alongside the existing Nano-vLLM reference. Mirrors the addition in
README_zh.md, and adds an ecosystem table entry.
Install snippet follows the upstream vLLM-Omni installation guide
(from source, since vllm-omni is rapidly evolving).
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
- StreamingVAEDecoder caches CausalConv1d/CausalTransposeConv1d left-pad
state between calls — one patch in, one patch out, no overlap
- _inference yields single-patch latents in streaming mode
- 2x faster streaming VAE decode, more accurate (max diff 0.0005 vs 0.0011)
Move generator close handling into a shared utility and wire the core generation pipeline through it so partially-consumed prompt cache generators are cleaned up consistently across both model variants and the public VoxCPM wrapper.
Made-with: Cursor
Factor repeated next-and-close patterns into a shared helper in both VoxCPM model variants so non-streaming inference cleans up generators consistently while keeping the issue reference close to the workaround.
Made-with: Cursor
Use an explicit broadcastable attention mask shape during MiniCPM incremental decoding so CPU runtimes avoid a PyTorch SDPA dimension error without changing attention semantics.
Made-with: Cursor
Use context managers when reading config.json in VoxCPMModel.from_local()
and VoxCPM2Model.from_local() to prevent file descriptor leaks. Also add
explicit encoding="utf-8" to avoid locale-dependent decode errors.
Closes#235
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Support optional ref_audio samples in finetuning and make runtime device selection explicit while keeping auto fallback behavior consistent. Also ignore the local app override file to avoid accidental commits.
Made-with: Cursor
The previous guard `not text.strip() or not isinstance(text, str)` called
.strip() before verifying that text is actually a string, causing an
AttributeError (e.g. for int input) instead of the intended ValueError.
Swap operand order so isinstance check short-circuits first.
Closes#228
Streaming decode previously re-decoded 4 overlapping patches through
the VAE each step, discarding 75% of the output. Replace with stateful
decode that carries causal conv padding buffers between calls — one
patch in, one patch out, no overlap.
Changes:
- Add StreamingVAEDecoder to audiovae/audio_vae_v2.py — caches
CausalConv1d and CausalTransposeConv1d left-pad state between calls
- AudioVAE.streaming_decode() context manager for clean lifecycle
- _inference yields single-patch latents in streaming mode
- _generate and _generate_with_prompt_cache use StreamingVAEDecoder
Streaming VAE decode time (isolated): 289ms → 148ms (2x faster)
Stateful vs full decode: cosine 1.0000, max diff 0.0005
(more accurate than previous overlap approach at max diff 0.001)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add `residual_lm_no_rope` config option in VoxCPMConfig and propagate to MiniCPMModel
- Add `no_rope` field to MiniCPM4Config; make RoPE embedding optional in MiniCPMModel and MiniCPMAttention
- Add `streaming_prefix_len` parameter to generation interface
- Fix non-streaming audio decode in continuation mode to trim leading prefix patches consistently
- Refactor streaming prefix context preparation: distinguish continuation vs. zero-shot via feat_mask trailing bit instead of audio_mask sum
Made-with: Cursor