Update app.py UI, adjust streaming_prefix_len, remove legacy docs

- Refine app.py: Ultimate Cloning naming, NFE slider, i18n polish
- Change streaming_prefix_len default from 3 to 4 for smoother decoding
- Remove legacy docs/ directory (migrated to ReadTheDocs)

Made-with: Cursor
@@ -24,118 +24,121 @@ logger = logging.getLogger(__name__)
 # ---------- Inline i18n (en + zh-CN only) ----------

 _USAGE_INSTRUCTIONS_EN = (
-    "**Usage Instructions:**\n\n"
-    "🎨 **Voice Design** — Create a voice from scratch \n"
-    "No reference audio needed. Simply describe the desired gender, tone, and emotion "
-    "in Control Instruction, and VoxCPM will generate a unique voice for you.\n\n"
-    "🎛️ **Controllable Voice Cloning** — Clone with style control \n"
-    "Upload reference audio and use Control Instruction to guide speed, emotion, style, and more.\n\n"
-    "🎙️ **Hi-Fi Cloning** — Maximum voice similarity \n"
-    "For the best cloning quality, enable and provide the reference audio transcript "
-    "to reproduce the original voice as closely as possible."
+    "**VoxCPM2 — Three Modes of Speech Generation:**\n\n"
+    "🎨 **Voice Design** — Create a brand-new voice \n"
+    "No reference audio required. Describe the desired voice characteristics "
+    "(gender, age, tone, emotion, pace …) in **Control Instruction**, and VoxCPM2 "
+    "will craft a unique voice from your description alone.\n\n"
+    "🎛️ **Controllable Cloning** — Clone a voice with optional style guidance \n"
+    "Upload a reference audio clip, then use **Control Instruction** to steer "
+    "emotion, speaking pace, and overall style while preserving the original timbre.\n\n"
+    "🎙️ **Ultimate Cloning** — Reproduce every vocal nuance through audio continuation \n"
+    "Turn on **Ultimate Cloning Mode** and provide (or auto-transcribe) the reference audio's transcript. "
+    "The model treats the reference clip as a spoken prefix and seamlessly **continues** from it, faithfully preserving every vocal detail."
+    "Note: This mode will disable Control Instruction."
 )

 _EXAMPLES_FOOTER_EN = (
     "---\n"
-    "**Voice Description Examples:** \n"
-    "You can describe it like this: \n"
-    "【Example 1: Melancholic/Tsundere Female】 \n"
-    'Control Instruction: "A young beautiful girl with a sweet voice, '
-    'tsundere tone, slow speaking pace, and a touch of sadness." \n'
-    'Target Text: "I never asked you to stay... It\'s not like I care or anything. '
-    'But... why does it still hurt so much now that you\'re gone?" \n\n'
-    "【Example 2: Lazy/Casual Male】 \n"
-    'Control Instruction: "Lazy and drawling male voice, nasal, '
-    'very relaxed and casual." \n'
-    'Target Text: "Dude, did you see that set? The waves out there are totally gnarly today, bro. '
-    "Just catching barrels all morning. It's like, totally righteous, you know what I mean?\""
+    "**💡 Voice Description Examples:** \n"
+    "Try the following Control Instructions to explore different voices: \n\n"
+    "**Example 1 — Gentle & Melancholic Girl** \n"
+    '`Control Instruction`: *"A young girl with a soft, sweet voice. '
+    'Speaks slowly with a melancholic, slightly tsundere tone."* \n'
+    '`Target Text`: *"I never asked you to stay… It\'s not like I care or anything. '
+    'But… why does it still hurt so much now that you\'re gone?"* \n\n'
+    "**Example 2 — Laid-Back Surfer Dude** \n"
+    '`Control Instruction`: *"Relaxed young male voice, slightly nasal, '
+    'lazy drawl, very casual and chill."* \n'
+    '`Target Text`: *"Dude, did you see that set? The waves out there are totally gnarly today. '
+    "Just catching barrels all morning — it's like, totally righteous, you know what I mean?\"*"
 )

 _USAGE_INSTRUCTIONS_ZH = (
-    "**使用说明:**\n\n"
-    "🎨 **Voice Design — 声音定制** \n"
-    "无需上传参考音频,只需在 Control Instruction 中描述你想要的性别、音色和情绪,"
-    "VoxCPM 即可凭空为你生成专属音色。\n\n"
-    "🎛️ **Controllable Voice Cloning — 可控音色克隆** \n"
-    "支持上传参考音频,并可以给instruction文本来指导控制语速、情绪、风格等表现。\n\n"
-    "🎙️ **Hi-Fi Cloning — 高保真克隆** \n"
-    "启用并上传参考音频文本,同时开启参考音频 + 音频续写,保留最佳一致性体验。\n\n"
+    "**VoxCPM2 — 三种语音生成方式:**\n\n"
+    "🎨 **声音设计(Voice Design)** \n"
+    "无需参考音频。在 **Control Instruction** 中描述目标音色特征"
+    "(性别、年龄、语气、情绪、语速等),VoxCPM2 即可为你从零创造独一无二的声音。\n\n"
+    "🎛️ **可控克隆(Controllable Cloning)** \n"
+    "上传参考音频,同时可选地使用 **Control Instruction** 来指定情绪、语速、风格等表达方式,"
+    "在保留原始音色的基础上灵活控制说话风格。\n\n"
+    "🎙️ **极致克隆(Ultimate Cloning)** \n"
+    "开启 **极致克隆模式** 并提供参考音频的文字内容(可自动识别)。"
+    "模型会将参考音频视为已说出的前文,以**音频续写**的方式完整还原参考音频中的所有声音细节。"
+    "注意:该模式与可控克隆模式互斥,将禁用Control Instruction。\n\n"
 )

 _EXAMPLES_FOOTER_ZH = (
     "---\n"
-    "**声音描述示例:** \n"
-    "你可以这样输入(中英文均可): \n"
-    "【示例1:深宫太后】 \n"
-    '`Control Instruction`: `"中老年女性,声音低沉阴冷,语速慢而有力,'
-    '每个字都像是深思熟虑后说出,带有深不可测的城府和威胁感。"` \n'
-    '`Target Text`: `"哀家在这深宫待了四十年,什么风浪没见过?你以为瞒得过哀家?"` \n\n'
-    "【示例2:暴躁男声】 \n"
-    '`Control Instruction`: `"暴躁的中年男声,语速较快,充满无奈和愤怒"` \n'
-    '`Target Text`: `"踩离合!踩刹车啊!你往哪儿开呢?前面是树你看不见吗?'
-    '我教了你八百遍了,打死方向盘!你是不是想把车给我开到沟里去?"`\n\n'
-    "💡 **方言生成特别说明:** \n"
-    '当前版本若要生成纯正的方言,请务必在"Target Text"中直接输入方言专属的词汇和表达,'
-    "并配合方言的音色描述。 \n\n"
-    "【示例一:广东话】 \n"
-    '`Control Instruction`: `"广东话,中年男性,语气平淡"` \n'
-    "✅ 正确的 `Target Text`(使用粤语表达):"
-    '`"伙計,唔該一個A餐,凍奶茶少甜!"` \n'
-    "❌ 错误的 `Target Text`(使用普通话):"
-    '`"伙计,麻烦来一个A餐,冻奶茶少甜!"` \n\n'
-    "【示例二:河南话】 \n"
-    '`Control Instruction`: `"河南话,接地气的大叔"` \n'
-    "✅ 正确的 `Target Text`(使用河南话表达):"
-    '`"恁这是弄啥嘞?晌午吃啥饭?"` \n'
-    "❌ 错误的 `Target Text`(使用普通话):"
-    '`"你这是在干什么呢?中午吃什么饭?"` \n\n'
-    "🤖 **实用小技巧:不知道怎么写地道的方言?** \n"
-    "您可以先在 豆包、DeepSeek、Kimi 等 AI 助手中输入普通话,"
-    "让它们帮你翻译成方言文本,然后再复制粘贴到 `Target Text` 中直接使用! \n\n"
-    "📢 **研发小贴士:** \n"
-    '我们正在努力优化 AI!后续版本将支持"输入普通话文本,一键生成方言口音"的功能,敬请期待!'
+    "**💡 声音描述示例(中英文均可):** \n\n"
+    "**示例 1 — 深宫太后** \n"
+    '`Control Instruction`: *"中老年女性,声音低沉阴冷,语速缓慢而有力,'
+    '字字深思熟虑,带有深不可测的城府与威慑感。"* \n'
+    '`Target Text`: *"哀家在这深宫待了四十年,什么风浪没见过?你以为瞒得过哀家?"* \n\n'
+    "**示例 2 — 暴躁驾校教练** \n"
+    '`Control Instruction`: *"暴躁的中年男声,语速快,充满无奈和愤怒"* \n'
+    '`Target Text`: *"踩离合!踩刹车啊!你往哪儿开呢?前面是树你看不见吗?'
+    '我教了你八百遍了,打死方向盘!你是不是想把车给我开到沟里去?"* \n\n'
+    "---\n"
+    "**🗣️ 方言生成指南:** \n"
+    "要生成地道的方言语音,请在 **Target Text** 中直接使用方言词汇和句式,"
+    "并在 **Control Instruction** 中描述方言特征。 \n\n"
+    "**示例 — 广东话** \n"
+    '`Control Instruction`: *"粤语,中年男性,语气平淡"* \n'
+    '✅ 正确(粤语表达):*"伙計,唔該一個A餐,凍奶茶少甜!"* \n'
+    '❌ 错误(普通话原文):*"伙计,麻烦来一个A餐,冻奶茶少甜!"* \n\n'
+    "**示例 — 河南话** \n"
+    '`Control Instruction`: *"河南话,接地气的大叔"* \n'
+    '✅ 正确(河南话表达):*"恁这是弄啥嘞?晌午吃啥饭?"* \n'
+    '❌ 错误(普通话原文):*"你这是在干什么呢?中午吃什么饭?"* \n\n'
+    "🤖 **小技巧:** 不知道方言怎么写?可以用豆包、DeepSeek、Kimi 等 AI 助手"
+    "将普通话翻译为方言文本,再粘贴到 Target Text 中即可。 \n\n"
 )

 _I18N_TRANSLATIONS = {
     "en": {
-        "reference_audio_label": "Reference Audio (optional — for cloning)",
-        "show_prompt_text_label": "Enable Prompt Text (improves voice similarity)",
-        "show_prompt_text_info": "Uses the ASR transcript of reference audio for higher cloning fidelity. Control Instruction will be disabled.",
-        "prompt_text_label": "Prompt Text (auto-filled by ASR, editable)",
-        "prompt_text_placeholder": "The transcript of your reference audio will appear here...",
-        "control_label": "Control Instruction (optional, only support English and Chinese)",
-        "control_placeholder": "e.g. 年轻女性,温柔甜美 / sadly / an excited young man",
-        "target_text_label": "Target Text",
-        "generate_btn": "Generate Speech",
+        "reference_audio_label": "🎤 Reference Audio (optional — upload for cloning)",
+        "show_prompt_text_label": "🎙️ Ultimate Cloning Mode (transcript-guided cloning)",
+        "show_prompt_text_info": "Auto-transcribes reference audio for every vocal nuance reproduced. Control Instruction will be disabled when active.",
+        "prompt_text_label": "Transcript of Reference Audio (auto-filled via ASR, editable)",
+        "prompt_text_placeholder": "The transcript of your reference audio will appear here …",
+        "control_label": "🎛️ Control Instruction (optional — supports Chinese & English)",
+        "control_placeholder": "e.g. A warm young woman / 年轻女性,温柔甜美 / Excited and fast-paced",
+        "target_text_label": "✍️ Target Text — the content to speak",
+        "generate_btn": "🔊 Generate Speech",
         "generated_audio_label": "Generated Audio",
-        "advanced_settings_title": "Advanced Settings",
+        "advanced_settings_title": "⚙️ Advanced Settings",
         "ref_denoise_label": "Reference audio enhancement",
-        "ref_denoise_info": "Denoise reference audio with ZipEnhancer",
+        "ref_denoise_info": "Apply ZipEnhancer denoising to the reference audio before cloning",
         "normalize_label": "Text normalization",
-        "normalize_info": "Normalize input text with wetext",
+        "normalize_info": "Normalize numbers, dates, and abbreviations via wetext",
         "cfg_label": "CFG (guidance scale)",
-        "cfg_info": "Higher = stronger prompt adherence; lower = more variation",
+        "cfg_info": "Higher → closer to the prompt / reference; lower → more creative variation",
+        "dit_steps_label": "LocDiT flow-matching steps",
+        "dit_steps_info": "LocDiT flow-matching steps — more steps → maybe better audio quality, but slower",
         "usage_instructions": _USAGE_INSTRUCTIONS_EN,
         "examples_footer": _EXAMPLES_FOOTER_EN,
     },
     "zh-CN": {
-        "reference_audio_label": "参考音频(可选 - 用于克隆)",
-        "show_prompt_text_label": "启用 Prompt Text(提升音色还原度)",
-        "show_prompt_text_info": "使用参考音频的文本内容提升克隆相似度,开启后 Control Instruction 将被禁用",
-        "prompt_text_label": "Prompt Text(ASR 自动填充,可编辑)",
-        "prompt_text_placeholder": "参考音频的文本内容将自动识别到这里...",
-        "control_label": "Control Instruction(可选,仅支持中文和英文)",
-        "control_placeholder": "如:年轻女性,温柔甜美 / sadly / an excited young man",
-        "target_text_label": "Target Text(要合成的文本)",
-        "generate_btn": "开始生成",
-        "generated_audio_label": "生成音频",
-        "advanced_settings_title": "高级设置",
+        "reference_audio_label": "🎤 参考音频(可选 — 上传后用于克隆)",
+        "show_prompt_text_label": "🎙️ 极致克隆模式(基于文本引导的极致克隆)",
+        "show_prompt_text_info": "自动识别参考音频文本,完整还原音色、节奏、情感等全部声音细节。开启后 Control Instruction 将暂时禁用",
+        "prompt_text_label": "参考音频内容文本(ASR 自动填充,可手动编辑)",
+        "prompt_text_placeholder": "参考音频的文字内容将自动识别并显示在此处 …",
+        "control_label": "🎛️ Control Instruction(可选 — 支持中英文描述)",
+        "control_placeholder": "如:年轻女性,温柔甜美 / A warm young woman / 暴躁老哥,语速飞快",
+        "target_text_label": "✍️ Target Text — 要合成的目标文本",
+        "generate_btn": "🔊 开始生成",
+        "generated_audio_label": "生成结果",
+        "advanced_settings_title": "⚙️ 高级设置",
         "ref_denoise_label": "参考音频降噪增强",
-        "ref_denoise_info": "使用 ZipEnhancer 对参考音频进行降噪",
+        "ref_denoise_info": "克隆前使用 ZipEnhancer 对参考音频进行降噪处理",
         "normalize_label": "文本规范化",
-        "normalize_info": "使用 wetext 对输入文本进行规范化处理",
-        "cfg_label": "CFG Value(引导强度)",
-        "cfg_info": "数值越高,越贴合提示要求;数值越低,变化空间越大",
+        "normalize_info": "自动规范化数字、日期及缩写(基于 wetext)",
+        "cfg_label": "CFG(引导强度)",
+        "cfg_info": "数值越高 → 越贴合提示/参考音色;数值越低 → 生成风格更自由",
+        "dit_steps_label": "LocDiT 流匹配迭代步数",
+        "dit_steps_info": "LocDiT 流匹配生成迭代步数 — 步数越多 → 可能生成更好的音频质量,但速度变慢",
         "usage_instructions": _USAGE_INSTRUCTIONS_ZH,
         "examples_footer": _EXAMPLES_FOOTER_ZH,
     },
@@ -153,7 +156,7 @@ for _d in _I18N_TRANSLATIONS.values():
 I18N = gr.I18n(**_I18N_TRANSLATIONS)

 DEFAULT_TARGET_TEXT = (
-    "VoxCPM is an innovative end-to-end TTS model from ModelBest, "
+    "VoxCPM2 is a creative multilingual TTS model from ModelBest, "
     "designed to generate highly realistic speech."
 )

@@ -279,12 +282,13 @@ class VoxCPMDemo:
         cfg_value_input: float,
         do_normalize: bool,
         denoise: bool,
+        inference_timesteps: int = 10,
     ) -> dict:
         generate_kwargs = dict(
             text=final_text,
             reference_wav_path=audio_path,
             cfg_value=float(cfg_value_input),
-            inference_timesteps=10,
+            inference_timesteps=inference_timesteps,
             normalize=do_normalize,
             denoise=denoise,
         )
@@ -302,6 +306,7 @@ class VoxCPMDemo:
         cfg_value_input: float = 2.0,
         do_normalize: bool = True,
         denoise: bool = True,
+        inference_timesteps: int = 10,
     ) -> Tuple[int, np.ndarray]:
         current_model = self.get_or_load_voxcpm()

@@ -330,6 +335,7 @@ class VoxCPMDemo:
             cfg_value_input=cfg_value_input,
             do_normalize=do_normalize,
             denoise=denoise,
+            inference_timesteps=inference_timesteps,
         )
         wav = current_model.generate(**generate_kwargs)
         return (current_model.tts_model.sample_rate, wav)
@@ -349,6 +355,7 @@ def create_demo_interface(demo: VoxCPMDemo):
         cfg_value: float,
         do_normalize: bool,
         denoise: bool,
+        dit_steps: int,
     ):
         actual_prompt_text = prompt_text_value.strip() if use_prompt_text else ""
         actual_control = "" if use_prompt_text else control_instruction
@@ -360,6 +367,7 @@ def create_demo_interface(demo: VoxCPMDemo):
             cfg_value_input=cfg_value,
             do_normalize=do_normalize,
             denoise=denoise,
+            inference_timesteps=int(dit_steps),
         )
         return (sr, wav_np)

@@ -450,6 +458,14 @@ def create_demo_interface(demo: VoxCPMDemo):
                     label=I18N("cfg_label"),
                     info=I18N("cfg_info"),
                 )
+                dit_steps = gr.Slider(
+                    minimum=1,
+                    maximum=50,
+                    value=10,
+                    step=1,
+                    label=I18N("dit_steps_label"),
+                    info=I18N("dit_steps_info"),
+                )

                 run_btn = gr.Button(I18N("generate_btn"), variant="primary", size="lg")

@@ -478,6 +494,7 @@ def create_demo_interface(demo: VoxCPMDemo):
                 cfg_value,
                 DoNormalizeText,
                 DoDenoisePromptAudio,
+                dit_steps,
             ],
             outputs=[audio_output],
             show_progress=True,

@@ -1,468 +0,0 @@
# VoxCPM Fine-tuning Guide

This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.

### 🎓 SFT (Supervised Fine-Tuning)

Full fine-tuning updates all model parameters. Suitable for:
- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed

### ⚡ LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:
- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping

## Table of Contents

- [Quick Start: WebUI](#quick-start-webui)
- [Data Preparation](#data-preparation)
- [Full Fine-tuning](#full-fine-tuning)
- [LoRA Fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)

---

## Quick Start: WebUI

For users who prefer a graphical interface, we provide `lora_ft_webui.py` - a comprehensive WebUI for training and inference:

### Launch WebUI

```bash
python lora_ft_webui.py
```

Then open `http://localhost:7860` in your browser.

### Features

- **🚀 Training Tab**: Configure and start LoRA training with an intuitive interface
  - Set training parameters (learning rate, batch size, LoRA rank, etc.)
  - Monitor training progress in real-time
  - Resume training from existing checkpoints

- **🎵 Inference Tab**: Generate audio with trained models
  - Automatic base model loading from LoRA checkpoint config
  - Voice cloning with automatic ASR (reference text recognition)
  - Hot-swap between multiple LoRA models
  - Zero-shot TTS without reference audio

## Data Preparation

Training data should be prepared as a JSONL manifest file, with one sample per line:

```jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
```

### Required Fields

| Field | Description |
|-------|-------------|
| `audio` | Path to audio file (absolute or relative) |
| `text` | Corresponding transcript |

### Optional Fields

| Field | Description |
|-------|-------------|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |

### Requirements

- Audio format: WAV
- Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
- Text: Transcript matching the audio content

See `examples/train_data_example.jsonl` for a complete example.
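
A quick sanity check of a manifest before training can catch missing fields early. The following is a minimal sketch and not part of the VoxCPM codebase: the function name `validate_manifest_lines` is hypothetical, and it only checks the field rules listed above (`audio` and `text` required, `duration` optional and numeric).

```python
import json

REQUIRED_FIELDS = ("audio", "text")


def validate_manifest_lines(lines):
    """Return a list of (line_number, problem) tuples for a JSONL manifest."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        try:
            sample = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(sample, dict):
            problems.append((lineno, "sample is not a JSON object"))
            continue
        # Required fields must be non-empty strings.
        for field in REQUIRED_FIELDS:
            value = sample.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((lineno, f"missing or empty '{field}'"))
        # 'duration' is optional, but must be a positive number when present.
        duration = sample.get("duration")
        if duration is not None:
            is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
            if not is_number or duration <= 0:
                problems.append((lineno, "'duration' must be a positive number"))
    return problems
```

Passing an open manifest file (any iterable of lines works) returns an empty list when every sample satisfies these rules.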

---

## Full Fine-tuning

Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:

```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.00001  # Use smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
```

### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
```

### Checkpoint Structure

Full fine-tuning saves a complete model directory that can be loaded directly:

```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors         # Model weights (excluding audio_vae)
    ├── config.json               # Model config
    ├── audiovae.pth              # Audio VAE weights
    ├── tokenizer.json            # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth
```
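
Because each save writes a `step_XXXXXXX` directory under `save_path`, picking the most recent checkpoint comes down to comparing step numbers. The sketch below assumes the zero-padded `step_` naming shown in the tree above; `latest_checkpoint` is a hypothetical helper, not a function from the training scripts.

```python
import re
from pathlib import Path

# Matches directory names such as "step_0002000".
STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_checkpoint(save_path):
    """Return the step_* subdirectory with the highest step number, or None."""
    best = None
    for child in Path(save_path).iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, child)
    return None if best is None else best[1]
```

The returned path can then be passed as `--ckpt_dir` to the inference script described later in this guide.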

---

## LoRA Fine-tuning

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.

### Configuration

Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:

```yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""

sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000

learning_rate: 0.0001  # LoRA can use larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192

save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora

lambdas:
  loss/diff: 1.0
  loss/stop: 1.0

# LoRA configuration
lora:
  enable_lm: true        # Apply LoRA to Language Model
  enable_dit: true       # Apply LoRA to Diffusion Transformer
  enable_proj: false     # Apply LoRA to projection layers (optional)

  r: 32                  # LoRA rank (higher = more capacity)
  alpha: 16              # LoRA alpha, scaling = alpha / r
  dropout: 0.0

  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]

# Distribution options (optional)
# hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
# distribute: true                   # If true, save hf_model_id in lora_config.json
```

### LoRA Parameters

| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `enable_lm` | Apply LoRA to LM (language model) | `true` |
| `enable_dit` | Apply LoRA to DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA | attention layers |
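
To make the `scaling = alpha / r` relationship concrete, here is a small arithmetic sketch. The helper names are hypothetical, and the 1024-dimensional layer is an illustrative assumption rather than a VoxCPM layer size; the parameter count uses the standard LoRA factorization (one `r x in` matrix plus one `out x r` matrix per adapted layer).

```python
def lora_scaling(alpha, r):
    """Scaling applied to the low-rank update: alpha / r."""
    if r <= 0:
        raise ValueError("r must be a positive integer")
    return alpha / r


def lora_param_count(in_features, out_features, r):
    """Extra trainable parameters for one adapted linear layer (A: r x in, B: out x r)."""
    return r * (in_features + out_features)


# Values from the example config: r=32, alpha=16 gives a scaling of 0.5.
print(lora_scaling(16, 32))
# Hypothetical 1024 x 1024 projection adapted with r=32.
print(lora_param_count(1024, 1024, 32))
```

Raising `r` increases adapter capacity linearly, while `alpha` only changes how strongly the learned update is weighted.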

### Distribution Options (Optional)

| Parameter | Description | Default |
|-----------|-------------|---------|
| `hf_model_id` | HuggingFace model ID (e.g., `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | If `true`, save `hf_model_id` as `base_model` in checkpoint; otherwise save local `pretrained_path` | `false` |

> **Note**: If `distribute: true`, `hf_model_id` is required.

### Training

```bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml

# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```

### Checkpoint Structure

LoRA training saves LoRA parameters and configuration:

```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors  # Only lora_A, lora_B parameters
    ├── lora_config.json          # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth
```

The `lora_config.json` contains:
```json
{
  "base_model": "/path/to/VoxCPM1.5/",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    ...
  }
}
```

The `base_model` field contains:
- Local path (default): when `distribute: false` or not set
- HuggingFace ID: when `distribute: true` (e.g., `"openbmb/VoxCPM1.5"`)

This allows loading LoRA checkpoints without the original training config file.
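
The rule for what ends up in `base_model` can be expressed in a few lines. This is an illustrative sketch of the behavior described above, not the training script's actual code; `resolve_base_model` is a hypothetical name.

```python
def resolve_base_model(pretrained_path, hf_model_id="", distribute=False):
    """Pick the value stored as `base_model` in lora_config.json.

    distribute=True  -> the HuggingFace ID (which must be provided)
    distribute=False -> the local pretrained_path
    """
    if distribute:
        if not hf_model_id:
            raise ValueError("hf_model_id is required when distribute is true")
        return hf_model_id
    return pretrained_path


# Default: the local path is recorded.
print(resolve_base_model("/path/to/VoxCPM1.5/"))
# Distribution mode: the HuggingFace ID is recorded instead.
print(resolve_base_model("/path/to/VoxCPM1.5/", "openbmb/VoxCPM1.5", distribute=True))
```

Recording the HuggingFace ID makes a checkpoint portable to machines where the original local path does not exist.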

---

## Inference

### Full Fine-tuning Inference

The checkpoint directory is a complete model, so it can be loaded directly:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "Hello, this is the fine-tuned model." \
    --output output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_ft_infer.py \
    --ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
    --text "This is voice cloning result." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```

### LoRA Inference

LoRA inference only requires the checkpoint directory (base model path and LoRA config are read from `lora_config.json`):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello, this is LoRA fine-tuned result." \
    --output lora_output.wav
```

With voice cloning:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "This is voice cloning with LoRA." \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript" \
    --output cloned_output.wav
```

Override base model path (optional):

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --base_model /path/to/another/VoxCPM1.5 \
    --text "Use different base model." \
    --output output.wav
```

---

## LoRA Hot-swapping

LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.

### API Reference

```python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=32,
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or local path
    load_denoiser=False,              # Optional: disable denoiser for faster loading
    optimize=True,                    # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",  # Optional: for voice cloning
)

# 3. Disable LoRA (use base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get current LoRA weights
lora_state = model.get_lora_state_dict()
```

### Simplified Usage (Load from lora_config.json)

If your checkpoint contains `lora_config.json` (saved by the training script), you can load everything automatically:

```python
import json
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load config from checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,
)
```

Or use the test script directly:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
    --text "Hello world"
```

### Method Reference

| Method | Description | torch.compile Compatible |
|--------|-------------|--------------------------|
| `load_lora(path)` | Load LoRA weights from file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to initial values | ✅ |
| `get_lora_state_dict()` | Get current LoRA weights | ✅ |
| `lora_enabled` | Property: check if LoRA is configured | ✅ |

---

## FAQ

### 1. How Much Data is Needed for LoRA Fine-tuning to Converge to a Single Voice?

We have tested with 5 minutes and 10 minutes of data (all audio clips are 3-6s in length). In our experiments, both datasets converged to a single voice after 2000 training steps with default configurations. You can adjust the data amount and training configurations based on your available data and computational resources.

### 2. Out of Memory (OOM)

- Increase `grad_accum_steps` (gradient accumulation)
- Decrease `batch_size`
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter long samples
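
The first two suggestions work together: lowering `batch_size` while raising `grad_accum_steps` reduces peak memory without changing how many samples contribute to each optimizer update. The sketch below illustrates the arithmetic; it assumes `batch_size` is counted per GPU per step, which is the common convention but is not confirmed by this guide, and `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(batch_size, grad_accum_steps, num_gpus=1):
    """Samples contributing to one optimizer update (assumes per-GPU batch_size)."""
    return batch_size * grad_accum_steps * num_gpus


# Default config: batch_size=16, grad_accum_steps=1.
print(effective_batch_size(16, 1))
# Halving the batch and doubling accumulation keeps the same effective size.
print(effective_batch_size(8, 2))
```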

### 3. Poor LoRA Performance

- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Increase training steps
- Add more target modules

### 4. Training Not Converging

- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality

### 5. LoRA Not Taking Effect at Inference

- Check that `lora_config.json` exists in the checkpoint directory
- Check `load_lora()` return value - `skipped_keys` should be empty
- Verify `set_lora_enabled(True)` is called

### 6. Checkpoint Loading Errors

- Full fine-tuning: checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, `audiovae.pth`
- LoRA: checkpoint directory should contain:
  - `lora_weights.safetensors` (or `lora_weights.ckpt`) - LoRA weights
  - `lora_config.json` - LoRA config and base model path
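
The file requirements above are easy to verify before attempting to load a checkpoint. The following is a minimal sketch built only from the file names listed in this FAQ entry; `missing_files` is a hypothetical helper, not part of the VoxCPM scripts.

```python
from pathlib import Path

# Each tuple lists acceptable alternatives for one required file.
REQUIRED = {
    "full": [("model.safetensors", "pytorch_model.bin"), ("config.json",), ("audiovae.pth",)],
    "lora": [("lora_weights.safetensors", "lora_weights.ckpt"), ("lora_config.json",)],
}


def missing_files(ckpt_dir, kind):
    """Return descriptions of required files absent from ckpt_dir ('full' or 'lora')."""
    root = Path(ckpt_dir)
    missing = []
    for alternatives in REQUIRED[kind]:
        if not any((root / name).is_file() for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing
```

An empty result means the directory has the expected layout; anything else names the file (or its accepted alternative) that needs to be restored.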
@@ -1,46 +0,0 @@
# 📊 Performance Highlights

VoxCPM achieves competitive results on public zero-shot TTS benchmarks.

## Seed-TTS-eval Benchmark

| Model | Parameters | Open-Source | test-EN | | test-ZH | | test-Hard | |
|------|------|------|:------------:|:--:|:------------:|:--:|:-------------:|:--:|
| | | | WER/%⬇ | SIM/%⬆| CER/%⬇| SIM/%⬆ | CER/%⬇ | SIM/%⬆ |
| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 | - | - |
| CosyVoice3 | 0.5B | ❌ | 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
| CosyVoice3 | 1.5B | ❌ | 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
| Seed-TTS | - | ❌ | 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
| MiniMax-Speech | - | ❌ | 1.65 | 69.2 | 0.83 | 78.3 | - | - |
| F5-TTS | 0.3B | ✅ | 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |
| MaskGCT | 1B | ✅ | 2.62 | 71.7 | 2.27 | 77.4 | - | - |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 | **6.83** | 72.4 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66.0 | - | - |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |
| FireRedTTS-2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 | - | - |
| Qwen2.5-Omni | 7B | ✅ | 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | **74.7** |
| OpenAudio-s1-mini | 0.5B | ✅ | 1.94 | 55.0 | 1.18 | 68.5 | 23.37 | 64.3 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 | 7.12 | 75.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 | - | - |
| HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.50 | 74.0 | 55.07 | 65.6 |
| **VoxCPM** | 0.5B | ✅ | **1.85** | **72.9** | **0.93** | **77.2** | 8.87 | 73.0 |

## CV3-eval Benchmark

| Model | zh | en | hard-zh | | | hard-en | | |
|-------|:--:|:--:|:-------:|:--:|:--:|:-------:|:--:|:--:|
| | CER/%⬇ | WER/%⬇ | CER/%⬇ | SIM/%⬆ | DNSMOS⬆ | WER/%⬇ | SIM/%⬆ | DNSMOS⬆ |
| F5-TTS | 5.47 | 8.90 | - | - | - | - | - | - |
| SparkTTS | 5.15 | 11.0 | - | - | - | - | - | - |
| GPT-SoVits | 7.34 | 12.5 | - | - | - | - | - | - |
| CosyVoice2 | 4.08 | 6.32 | 12.58 | 72.6 | 3.81 | 11.96 | 66.7 | 3.95 |
| OpenAudio-s1-mini | 4.00 | 5.54 | 18.1 | 58.2 | 3.77 | 12.4 | 55.7 | 3.89 |
| IndexTTS2 | 3.58 | 4.45 | 12.8 | 74.6 | 3.65 | 8.78 | 74.5 | 3.80 |
| HiggsAudio-v2 | 9.54 | 7.89 | 41.0 | 60.2 | 3.39 | 10.3 | 61.8 | 3.68 |
| CosyVoice3-0.5B | 3.89 | 5.24 | 14.15 | 78.6 | 3.75 | 9.04 | 75.9 | 3.92 |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 78.5 | 3.79 | 10.55 | 76.1 | 3.95 |
| **VoxCPM** | **3.40** | **4.04** | 12.9 | 66.1 | 3.59 | **7.89** | 64.3 | 3.74 |

@@ -1,116 +0,0 @@
# VoxCPM1.5 Release Notes

**Release Date:** December 5, 2025

## 🎉 Overview

We’re thrilled to introduce a major upgrade that improves the audio quality and efficiency of VoxCPM while maintaining its core capabilities: context-aware speech generation and zero-shot voice cloning.

| Feature | VoxCPM | VoxCPM1.5 |
|---------|------------|------------|
| **Audio VAE Sampling Rate** | 16kHz | 44.1kHz |
| **LM Token Rate** | 12.5Hz | 6.25Hz |
| **Patch Size** | 2 | 4 |
| **SFT Support** | ✅ | ✅ |
| **LoRA Support** | ✅ | ✅ |

## 🎵 Model Updates

### 🔊 AudioVAE Sampling Rate: 16kHz → 44.1kHz

The AudioVAE now supports a 44.1kHz sampling rate, which allows the model to:
- 🎯 Clone voices more faithfully, preserving more high-frequency detail and generating higher-quality output

*Note: This upgrade enables higher-quality generation when using high-quality reference audio, but does not guarantee that all generated audio will be high-fidelity. The output quality depends on the quality of the **prompt speech**.*

### ⚡ Token Rate: 12.5Hz → 6.25Hz

We reduced the token rate of the LM backbone from 12.5Hz to 6.25Hz (the LocEnc & LocDiT patch size increased from 2 to 4) while maintaining similar performance on evaluation benchmarks. This change:
- 💨 Reduces computational requirements for generating the same length of audio
- 📈 Provides a foundation for longer audio generation
- 🏗️ Paves the way for training larger models in the future

**Model Architecture Clarification**: The core architecture of VoxCPM1.5 remains unchanged from the technical report. The key modification is adjusting the patch size of the local modules (LocEnc & LocDiT) from 2 to 4, which reduces the LM processing rate from 12.5Hz to 6.25Hz. Since the local modules now need to handle longer contexts, we expanded their network depth, resulting in a slightly larger overall parameter count.

**Generation Speed Clarification**: Although the parameter count has increased, VoxCPM1.5 needs only 6.25 tokens to generate 1 second of audio (compared to 12.5 tokens in the previous version). While the displayed generation speed (xx it/s) may appear slower, the actual Real-Time Factor (RTF = processing time / audio duration) is comparable or even lower, so generation is no slower and may be faster.
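
The arithmetic behind the token-rate change can be checked with a short sketch. It uses the common convention RTF = processing time / audio duration (lower is faster) and makes two assumptions not stated in the notes: one LM token covers `patch_size` latent frames, and one displayed iteration produces one LM token. The iteration speeds are hypothetical values chosen only for illustration.

```python
def lm_tokens(audio_seconds: float, token_rate_hz: float) -> float:
    """Number of LM tokens needed for a clip of the given length."""
    return audio_seconds * token_rate_hz


def latent_frame_rate(token_rate_hz: float, patch_size: int) -> float:
    """Latent frames per second, assuming one LM token covers patch_size frames."""
    return token_rate_hz * patch_size


def rtf_from_iteration_speed(its_per_second: float, token_rate_hz: float) -> float:
    """RTF (processing time / audio duration), assuming one iteration = one LM token."""
    # Producing 1 s of audio takes token_rate_hz iterations.
    return token_rate_hz / its_per_second


# Token rate and patch size per version, as listed in the release notes.
for name, rate, patch in [("VoxCPM", 12.5, 2), ("VoxCPM1.5", 6.25, 4)]:
    print(name, lm_tokens(10.0, rate), latent_frame_rate(rate, patch))

# Hypothetical progress-bar speeds: the slower it/s figure still gives a lower RTF.
print(rtf_from_iteration_speed(25.0, 12.5))
print(rtf_from_iteration_speed(15.0, 6.25))
```

Under these assumptions both versions work at the same latent frame rate, and a 10-second clip needs half as many LM steps in VoxCPM1.5. That is why a lower it/s reading does not by itself mean slower generation.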

## 🔧 Fine-tuning Support

Both full fine-tuning and LoRA fine-tuning are now supported. See the [Fine-tuning Guide](finetune.md) for detailed instructions.

## 📚 Documentation

- Updated README with version comparison
- Added comprehensive fine-tuning guide
- Improved code comments and documentation

## 🙏 Our Thanks to You
This release wouldn’t be possible without the incredible feedback, testing, and contributions from our open-source community. Thank you for helping shape VoxCPM1.5!

## 📞 Let's Build Together
Questions, ideas, or want to contribute?

- 🐛 Report an issue: [GitHub Issues on OpenBMB/VoxCPM](https://github.com/OpenBMB/VoxCPM/issues)

- 📖 Dig into the docs: Check the [docs/](../docs/) folder for guides and API details

Enjoy the richer sound and powerful new features of VoxCPM1.5 🎉

We can't wait to hear what you create next! 🥂

## 🚀 What We're Working On

We're continuously improving VoxCPM and working on exciting new features:

- 🌍 **Multilingual TTS Support**: We are actively developing support for languages beyond Chinese and English.
- 🎯 **Controllable Expressive Speech Generation**: We are researching controllable speech generation that allows fine-grained control over speech attributes (emotion, timbre, prosody, etc.) through natural language instructions.
- 🎵 **Universal Audio Generation Foundation**: We also hope to explore VoxCPM as a unified audio generation foundation model capable of jointly generating speech, music, and sound effects. This is a longer-term vision, however.

**📅 Next Release**: We plan to release the next version in Q1 2026, which will include significant improvements and new features. Stay tuned for updates! We're committed to making VoxCPM even more powerful and versatile.

## ❓ Frequently Asked Questions (FAQ)

### Q: Does VoxCPM support fine-tuning for personalized voice customization?

**A:** Yes! VoxCPM now supports both full fine-tuning (SFT) and efficient LoRA fine-tuning. You can train personalized voice models on your own data. Please refer to the [Fine-tuning Guide](finetune.md) for detailed instructions and examples.

### Q: Is 16kHz audio quality sufficient for my use case?

**A:** We have upgraded the AudioVAE to support a 44.1kHz sampling rate in VoxCPM1.5, which provides higher-quality audio output with better preservation of high-frequency details. This upgrade enables better voice cloning quality and more natural speech synthesis when using high-quality reference audio.

### Q: Has the stability issue been resolved?

**A:** We have made stability optimizations in VoxCPM1.5, including improvements to the inference code logic, training data, and model architecture. Community feedback surfaced stability issues such as:
- Increased noise and reverberation
- Audio artifacts (e.g., howling/squealing)
- Unstable speaking rate (speeding up)
- Volume fluctuations (increases or decreases)
- Noise artifacts at the beginning and end of audio
- Synthesis issues with very short texts (e.g., "hello")

**What we've improved:**
- By adjusting the inference code logic and optimizing the training data, we have largely fixed the beginning/ending artifacts.
- By reducing the LM processing rate (12.5Hz → 6.25Hz), we have improved stability when generating longer speech.

**What remains:** We acknowledge that long-speech stability issues have not been completely resolved. Particularly for highly expressive or complex reference speech, error accumulation during autoregressive generation can still occur. We will continue to analyze and optimize this in future versions.

### Q: Does VoxCPM plan to support multilingual TTS?

**A:** Currently, VoxCPM is primarily trained on Chinese and English data. We are actively researching and developing multilingual TTS support for more languages. Please let us know which languages you'd like to see supported!

### Q: Does VoxCPM plan to support controllable generation (emotion, style, fine-grained control)?

**A:** Currently, VoxCPM only supports zero-shot voice cloning and context-aware speech generation. Direct control over specific speech attributes (emotion, style, fine-grained prosody) is limited. However, we are actively researching instruction-controllable expressive speech generation with fine-grained control, working towards a model that generates speech from human instructions!

### Q: Does VoxCPM support different hardware chips (e.g., Ascend 910B, XPU, NPU)?

**A:** We have not yet adapted VoxCPM for different hardware chips. Our main focus remains on developing new model capabilities and improving stability. We encourage you to check whether community developers have done similar work, and we warmly welcome contributions to such adaptations!

These features are under active development, and we look forward to sharing updates in future releases!
@@ -1,55 +0,0 @@
# 👩‍🍳 A Voice Chef's Guide

Welcome to the VoxCPM kitchen! Follow this recipe to cook up perfect generated speech. Let's begin.

---

## 🥚 Step 1: Prepare Your Base Ingredients (Content)

First, choose how you'd like to input your text:

### 1. Regular Text (Classic Mode)
- ✅ Keep "Text Normalization" ON. Type naturally (e.g., "Hello, world! 123"). The system will automatically process numbers, abbreviations, and punctuation using the WeTextProcessing library.

### 2. Phoneme Input (Native Mode)
- ❌ Turn "Text Normalization" OFF. Enter phoneme text like `{HH AH0 L OW1}` (EN) or `{ni3}{hao3}` (ZH) for precise pronunciation control. In this mode, VoxCPM also natively understands other complex non-normalized text, so try it out!
- **Phoneme Conversion**: For Chinese, phonemes are written as pinyin. For English, phonemes follow CMUDict. Please refer to the relevant documentation for more details.
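
The brace syntax can be produced with a couple of string helpers. This is a minimal sketch based only on the two examples above: the helper names are hypothetical, and the grouping rules (one brace pair per English word, one per Chinese syllable) are inferred from those examples rather than from a documented specification.

```python
def english_phonemes(arpabet):
    """Wrap one word's CMUDict-style phonemes in a single pair of braces."""
    return "{" + " ".join(arpabet) + "}"


def chinese_pinyin(syllables):
    """Wrap each tone-numbered pinyin syllable in its own pair of braces."""
    return "".join("{" + syllable + "}" for syllable in syllables)


print(english_phonemes(["HH", "AH0", "L", "OW1"]))  # {HH AH0 L OW1}
print(chinese_pinyin(["ni3", "hao3"]))              # {ni3}{hao3}
```

Remember to turn "Text Normalization" OFF before submitting text built this way, as described above.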

---

## 🍳 Step 2: Choose Your Flavor Profile (Voice Style)

This is the secret sauce that gives your audio its unique sound.

### 1. Cooking with a Prompt Speech (Following a Famous Recipe)
- A prompt speech provides the desired acoustic characteristics for VoxCPM. The speaker's timbre, speaking style, and even the background sounds and ambiance will be replicated.
- **For a Clean, Denoised Voice:**
  - ✅ Enable "Prompt Speech Enhancement". This acts like a noise filter, removing background hiss and rumble to give you a pure, clean voice clone. However, it limits the audio sampling rate to 16kHz, which caps the achievable cloning quality.
- **For High-Quality Audio Cloning (Up to 44.1kHz):**
  - ❌ Disable "Prompt Speech Enhancement" to preserve all original audio information, including background atmosphere, and to support cloning at sampling rates up to 44.1kHz.

### 2. Cooking au Naturel (Letting the Model Improvise)
- If no reference is provided, VoxCPM becomes a creative chef! It will infer a fitting speaking style from the text itself, thanks to the text understanding of its foundation model, MiniCPM-4.
- **Pro Tip**: Challenge VoxCPM with any text (poetry, song lyrics, dramatic monologues); it may deliver some interesting results!

---

## 🧂 Step 3: The Final Seasoning (Fine-Tuning Your Results)

You're ready to serve! But for master chefs who want to tweak the flavor, here are two key spices.

### CFG Value (How Closely to Follow the Recipe)
- **Default**: A great starting point.
- **Voice sounds strained or weird?** Lower this value. It tells the model to be more relaxed and improvisational, great for expressive prompts.
- **Need maximum clarity and adherence to the text?** Raise it slightly to keep the model on a tighter leash.
- **Short sentences?** Consider increasing the CFG value for better clarity and adherence.
- **Long texts?** Consider lowering the CFG value to improve stability and naturalness over extended passages.

### Inference Timesteps (Simmering Time: Quality vs. Speed)
- **Need a quick snack?** Use a lower number. Perfect for fast drafts and experiments.
- **Cooking a gourmet meal?** Use a higher number. This lets the model "simmer" longer, refining the audio for superior detail and naturalness.
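
The seasoning advice above can be written down as a small starting-point helper. This is only a sketch of the qualitative rules: the function name, the character thresholds, the step size, and the draft halving are all hypothetical choices, and the defaults (`cfg_value=2.0`, `inference_timesteps=10`) are those of the inference signature shown in this diff.

```python
DEFAULT_CFG = 2.0        # default cfg_value in the signature shown in this diff
DEFAULT_TIMESTEPS = 10   # default inference_timesteps in the same signature


def suggest_settings(text, draft=False, short_chars=20, long_chars=300, cfg_step=0.5):
    """Suggest starting values; thresholds and step size are illustrative assumptions."""
    length = len(text.strip())
    cfg = DEFAULT_CFG
    if length <= short_chars:
        cfg += cfg_step   # short sentences: raise CFG for clarity and adherence
    elif length >= long_chars:
        cfg -= cfg_step   # long texts: lower CFG for stability and naturalness
    # Fewer timesteps for quick drafts, the default otherwise.
    timesteps = DEFAULT_TIMESTEPS // 2 if draft else DEFAULT_TIMESTEPS
    return {"cfg_value": cfg, "inference_timesteps": timesteps}


print(suggest_settings("Hello"))
print(suggest_settings("word " * 100, draft=True))
```

Treat the returned values as a first guess and adjust by ear; the guide's advice is qualitative, so no particular numbers are prescribed.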

---

Happy creating! 🎉 Start with the default settings and tweak from there to suit your project. The kitchen is yours!

@@ -476,7 +476,7 @@ class VoxCPM2Model(nn.Module):
         retry_badcase_max_times: int = 3,
         retry_badcase_ratio_threshold: float = 6.0,
         streaming: bool = False,
-        streaming_prefix_len: int = 3,
+        streaming_prefix_len: int = 4,
     ) -> Generator[torch.Tensor, None, None]:
         if retry_badcase and streaming:
             warnings.warn("Retry on bad cases is not supported in streaming mode, setting retry_badcase=False.")
@@ -775,7 +775,7 @@ class VoxCPM2Model(nn.Module):
         retry_badcase_max_times: int = 3,
         retry_badcase_ratio_threshold: float = 6.0,
         streaming: bool = False,
-        streaming_prefix_len: int = 3,
+        streaming_prefix_len: int = 4,
     ) -> Generator[Tuple[torch.Tensor, torch.Tensor, Union[torch.Tensor, List[torch.Tensor]]], None, None]:
         """
         Generate audio using pre-built prompt cache.
@@ -964,7 +964,7 @@ class VoxCPM2Model(nn.Module):
         inference_timesteps: int = 10,
         cfg_value: float = 2.0,
         streaming: bool = False,
-        streaming_prefix_len: int = 3,
+        streaming_prefix_len: int = 4,
     ) -> Generator[Tuple[torch.Tensor, Union[torch.Tensor, List[torch.Tensor]]], None, None]:
         """Core inference method for audio generation.