From a010d621ff9a1dce83ce9c934919b24f28073e90 Mon Sep 17 00:00:00 2001
From: Labmem-Zhouyx <913703649@qq.com>
Date: Mon, 6 Apr 2026 22:09:24 +0800
Subject: [PATCH] update readme

Made-with: Cursor
---
 README.md | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 502c741..104b7b2 100644
--- a/README.md
+++ b/README.md
@@ -6,6 +6,8 @@
 Documentation Hugging Face ModelScope
+
+

@@ -40,7 +42,7 @@ VoxCPM is a **tokenizer-free** Text-to-Speech system that directly generates con
 - 🎙️ **Ultimate Cloning** — Reproduce every vocal nuance: provide both reference audio and its transcript, and the model continues seamlessly from the reference, faithfully preserving every vocal detail — timbre, rhythm, emotion, and style (same as VoxCPM1.5)
 - 🔊 **48kHz High-Quality Audio** — Accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio via AudioVAE V2's asymmetric encode/decode design, with built-in super-resolution — no external upsampler needed
 - 🧠 **Context-Aware Synthesis** — Automatically infers appropriate prosody and expressiveness from text content
-- ⚡ **Real-Time Streaming** — RTF as low as ~0.13 on NVIDIA RTX 4090 by [Nano-VLLM](https://github.com/huggingface/nano-vllm)
+- ⚡ **Real-Time Streaming** — RTF as low as ~0.3 on NVIDIA RTX 4090, and ~0.13 accelerated by [Nano-VLLM](https://github.com/a710128/nanovllm-voxcpm)
 - 📜 **Fully Open-Source & Commercial-Ready** — Weights and code released under the [Apache-2.0](LICENSE) license, free for commercial use
@@ -53,10 +55,10 @@ Chinese Dialect: 四川话, 粤语, 吴语, 东北话, 河南话, 陕西话, 山
 
 ### News
 
-* **[2026.04]** 🔥 We release **VoxCPM2** — 2B, 30 languages, Voice Design & Controllable Voice Cloning, 48kHz audio output! [Weights](https://huggingface.co/openbmb/VoxCPM2) | [Docs](https://voxcpm.readthedocs.io/en/latest/)
+* **[2026.04]** 🔥 We release **VoxCPM2** — 2B, 30 languages, Voice Design & Controllable Voice Cloning, 48kHz audio output! [Weights](https://huggingface.co/openbmb/VoxCPM2) | [Docs](https://voxcpm.readthedocs.io/en/latest/) | [Playground](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo)
 * **[2025.12]** 🎉 Open-source **VoxCPM1.5** [weights](https://huggingface.co/openbmb/VoxCPM1.5) with SFT & LoRA fine-tuning. (**🏆 #1 GitHub Trending**)
 * **[2025.09]** 🔥 Release VoxCPM [Technical Report](https://arxiv.org/abs/2509.24650).
-* **[2025.09]** 🎉 Open-source **VoxCPM-0.5B** [weights](https://huggingface.co/openbmb/VoxCPM-0.5B) & [Playground](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo). (**🏆 #1 HuggingFace Trending**)
+* **[2025.09]** 🎉 Open-source **VoxCPM-0.5B** [weights](https://huggingface.co/openbmb/VoxCPM-0.5B) (**🏆 #1 HuggingFace Trending**)
 
 ---
 
@@ -181,7 +183,7 @@ voxcpm design \
   --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
   --output out.wav
 
-# Voice design with style control
+# Controllable voice cloning with style control
 voxcpm design \
   --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
   --control "Young female voice, warm and gentle, slightly smiling" \
@@ -233,7 +235,7 @@ server.stop()
 
 > **RTF as low as ~0.13 on NVIDIA RTX 4090** (vs ~0.3 with the standard PyTorch implementation), with support for batched concurrent requests and a FastAPI HTTP server. See the [Nano-vLLM-VoxCPM repo](https://github.com/a710128/nanovllm-voxcpm) for deployment details.
 
-> **Full parameter reference, multi-scenario examples, and voice cloning tips →** [Quick Start Guide](https://voxcpm.readthedocs.io/en/latest/quickstart.html) | [Usage Guide & Best Practices](https://voxcpm.readthedocs.io/en/latest/cookbook.html)
+> **Full parameter reference, multi-scenario examples, and voice cloning tips →** [Quick Start Guide](https://voxcpm.readthedocs.io/en/latest/quickstart.html) | [Usage Guide](https://voxcpm.readthedocs.io/en/latest/usage_guide.html) | [Cookbook](https://voxcpm.readthedocs.io/en/latest/cookbook.html)
 
 ---
 
@@ -246,15 +248,15 @@ server.stop()
 | **Audio Sample Rate** | 48kHz | 44.1kHz | 16kHz |
 | **LM Token Rate** | 6.25Hz | 6.25Hz | 12.5Hz |
 | **Languages** | 30 | 2 (zh, en) | 2 (zh, en) |
+| **Cloning Mode** | Isolated Reference & Continuation | Continuation only | Continuation only |
 | **Voice Design** | ✅ | — | — |
-| **Style Control** | ✅ | — | — |
-| **Reference Cloning** | Isolated Reference & Continuation | Continuation only | Continuation only |
+| **Controllable Voice Cloning** | ✅ | — | — |
 | **SFT / LoRA** | ✅ | ✅ | ✅ |
 | **RTF (RTX 4090)** | ~0.30 | ~0.15 | ~0.17 |
 | **RTF in Nano-VLLM (RTX 4090)** | ~0.13 | ~0.08 | ~0.10 |
 | **VRAM** | ~8 GB | ~6 GB | ~5 GB |
 | **Weights** | [🤗 HF](https://huggingface.co/openbmb/VoxCPM2) / [MS](https://modelscope.cn/models/OpenBMB/VoxCPM2) | [🤗 HF](https://huggingface.co/openbmb/VoxCPM1.5) / [MS](https://modelscope.cn/models/OpenBMB/VoxCPM1.5) | [🤗 HF](https://huggingface.co/openbmb/VoxCPM-0.5B) / [MS](https://modelscope.cn/models/OpenBMB/VoxCPM-0.5B) |
-| **Technical Report** | Coming soon | — | [arXiv](https://arxiv.org/abs/2509.24650) |
+| **Technical Report** | Coming soon | — | [arXiv](https://arxiv.org/abs/2509.24650) [ICLR 2026](https://openreview.net/forum?id=h5KLpGoqzC) |
 | **Demo Page** | [Audio Samples](https://openbmb.github.io/voxcpm2-demopage) | — | [Audio Samples](https://openbmb.github.io/VoxCPM-demopage) |
 
 VoxCPM2 is built on a **tokenizer-free, diffusion autoregressive** paradigm. The model operates entirely in the latent space of **AudioVAE V2**, following a four-stage pipeline: **LocEnc → TSLM → RALM → LocDiT**, enabling rich expressiveness and 48kHz native audio output.
@@ -263,7 +265,7 @@ VoxCPM2 is built on a **tokenizer-free, diffusion autoregressive** paradigm. The
 VoxCPM2 Model Architecture
 
-> For full architectural details, VoxCPM2-specific upgrades, and a model comparison table, see the [Architecture & Design Docs](https://voxcpm.readthedocs.io/en/latest/models/version_history.html).
+> For full architectural details, VoxCPM2-specific upgrades, and a model comparison table, see the [Architecture Design](https://voxcpm.readthedocs.io/en/latest/models/architecture.html).
 
 ---
 
@@ -470,7 +472,7 @@ Full documentation: **[voxcpm.readthedocs.io](https://voxcpm.readthedocs.io/en/l
 ## ⚠️ Risks and Limitations
 
 - **Potential for Misuse:** VoxCPM's voice cloning can generate highly realistic synthetic speech. It is **strictly forbidden** to use VoxCPM for impersonation, fraud, or disinformation. We strongly recommend clearly marking any AI-generated content.
-- **Controllable Generation Stability:** Voice Design and Style Control results can vary between runs — you may try to generate 1~3 times to obtain the desired voice or style. We are actively working on improving controllability consistency.
+- **Controllable Generation Stability:** Voice Design and Controllable Voice Cloning results can vary between runs — you may try to generate 1~3 times to obtain the desired voice or style. We are actively working on improving controllability consistency.
 - **Language Coverage:** VoxCPM2 officially supports 30 languages. For languages not on the list, you are welcome to test directly or try fine-tuning on your own data. We plan to expand language coverage in future releases.
 - **Usage:** This model is released under the Apache-2.0 license. For production deployments, we recommend conducting thorough testing and safety evaluation tailored to your use case.