2025-12-05 21:00:01 +08:00
# VoxCPM Fine-tuning Guide
This guide covers how to fine-tune VoxCPM models with two approaches: full fine-tuning and LoRA fine-tuning.
### 🎓 SFT (Supervised Fine-Tuning)
Full fine-tuning updates all model parameters. Suitable for:
- 📊 Large, specialized datasets
- 🔄 Cases where significant behavior changes are needed
### ⚡ LoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that:
- 🎯 Trains only a small number of additional parameters
- 💾 Significantly reduces memory requirements and training time
- 🔀 Supports multiple LoRA adapters with hot-swapping
## Table of Contents
- [Quick Start: WebUI](#quick-start-webui)
- [Data Preparation](#data-preparation)
- [Full Fine-tuning](#full-fine-tuning)
- [LoRA Fine-tuning](#lora-fine-tuning)
- [Inference](#inference)
- [LoRA Hot-swapping](#lora-hot-swapping)
- [FAQ](#faq)
---
## Quick Start: WebUI
For users who prefer a graphical interface, we provide `lora_ft_webui.py` - a comprehensive WebUI for training and inference:
### Launch WebUI
``` bash
python lora_ft_webui.py
```
Then open `http://localhost:7860` in your browser.
### Features
- **🚀 Training Tab**: Configure and start LoRA training with an intuitive interface
  - Set training parameters (learning rate, batch size, LoRA rank, etc.)
  - Monitor training progress in real-time
  - Resume training from existing checkpoints
- **🎵 Inference Tab**: Generate audio with trained models
  - Automatic base model loading from LoRA checkpoint config
  - Voice cloning with automatic ASR (reference text recognition)
  - Hot-swap between multiple LoRA models
  - Zero-shot TTS without reference audio
## Data Preparation
Training data should be prepared as a JSONL manifest file, with one sample per line:
``` jsonl
{"audio": "path/to/audio1.wav", "text": "Transcript of audio 1."}
{"audio": "path/to/audio2.wav", "text": "Transcript of audio 2."}
{"audio": "path/to/audio3.wav", "text": "Optional duration field.", "duration": 3.5}
{"audio": "path/to/audio4.wav", "text": "Optional dataset_id for multi-dataset.", "dataset_id": 1}
```
### Required Fields
| Field | Description |
|-------|-------------|
| `audio` | Path to audio file (absolute or relative) |
| `text` | Corresponding transcript |
### Optional Fields
| Field | Description |
|-------|-------------|
| `duration` | Audio duration in seconds (speeds up sample filtering) |
| `dataset_id` | Dataset ID for multi-dataset training (default: 0) |
### Requirements
- Audio format: WAV
- Sample rate: 16kHz for VoxCPM-0.5B, 44.1kHz for VoxCPM1.5
- Text: Transcript matching the audio content
See `examples/train_data_example.jsonl` for a complete example.
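
Before starting a long run, it can help to check the manifest against the requirements above. The following is a minimal sketch, not part of the VoxCPM codebase: `validate_manifest` and its `expected_sample_rate` parameter are hypothetical names, relative audio paths are assumed to resolve against the current working directory, and the sample-rate check relies on Python's standard `wave` module, which only reads PCM WAV files.

```python
import json
import wave
from pathlib import Path


def validate_manifest(manifest_path, expected_sample_rate=None):
    """Return a list of human-readable problems found in a JSONL manifest."""
    problems = []
    with Path(manifest_path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                sample = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(sample, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            # `audio` and `text` are the two required fields.
            missing = [key for key in ("audio", "text") if not sample.get(key)]
            if missing:
                problems.append(f"line {line_no}: missing or empty {missing}")
                continue
            # Assumption: relative paths are relative to the working directory.
            audio = Path(sample["audio"])
            if not audio.is_file():
                problems.append(f"line {line_no}: audio not found: {audio}")
                continue
            if expected_sample_rate is not None:
                try:
                    with wave.open(str(audio), "rb") as wav:
                        rate = wav.getframerate()
                except (wave.Error, EOFError):
                    problems.append(f"line {line_no}: unreadable WAV: {audio}")
                    continue
                if rate != expected_sample_rate:
                    problems.append(
                        f"line {line_no}: sample rate {rate} Hz, "
                        f"expected {expected_sample_rate} Hz"
                    )
    return problems
```

For example, `validate_manifest("train.jsonl", expected_sample_rate=44100)` would be the call for a VoxCPM1.5 dataset; an empty list means no problems were detected by these checks.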
---
## Full Fine-tuning
Full fine-tuning updates all model parameters. Suitable for large datasets or when significant behavior changes are needed.
### Configuration
Create `conf/voxcpm_v1.5/voxcpm_finetune_all.yaml`:
``` yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""
sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.00001  # Use smaller LR for full fine-tuning
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192
save_path: /path/to/checkpoints/finetune_all
tensorboard: /path/to/logs/finetune_all
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
```
### Training
``` bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_all.yaml
```
### Checkpoint Structure
Full fine-tuning saves a complete model directory that can be loaded directly:
```
checkpoints/finetune_all/
└── step_0002000/
    ├── model.safetensors         # Model weights (excluding audio_vae)
    ├── config.json               # Model config
    ├── audiovae.pth              # Audio VAE weights
    ├── tokenizer.json            # Tokenizer
    ├── tokenizer_config.json
    ├── special_tokens_map.json
    ├── optimizer.pth
    └── scheduler.pth
```
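
Because each save lands in its own `step_*` directory, picking the newest checkpoint for inference can be scripted. The sketch below is illustrative only: `latest_checkpoint` is a hypothetical helper, and it assumes every checkpoint directory follows the `step_<number>` naming shown above.

```python
import re
from pathlib import Path

# Matches directory names such as "step_0002000".
STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_checkpoint(save_path):
    """Return the step_* subdirectory with the highest step number, or None."""
    root = Path(save_path)
    if not root.is_dir():
        return None
    best = None  # (step, path) of the newest checkpoint seen so far
    for child in root.iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, child)
    return best[1] if best else None
```

The returned path can then be passed as `--ckpt_dir` to the inference script.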
---
## LoRA Fine-tuning
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that trains only a small number of additional parameters, significantly reducing memory requirements.
### Configuration
Create `conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml`:
``` yaml
pretrained_path: /path/to/VoxCPM1.5/
train_manifest: /path/to/train.jsonl
val_manifest: ""
sample_rate: 44100
batch_size: 16
grad_accum_steps: 1
num_workers: 2
num_iters: 2000
log_interval: 10
valid_interval: 1000
save_interval: 1000
learning_rate: 0.0001  # LoRA can use larger LR
weight_decay: 0.01
warmup_steps: 100
max_steps: 2000
max_batch_tokens: 8192
save_path: /path/to/checkpoints/finetune_lora
tensorboard: /path/to/logs/finetune_lora
lambdas:
  loss/diff: 1.0
  loss/stop: 1.0
# LoRA configuration
lora:
  enable_lm: true     # Apply LoRA to Language Model
  enable_dit: true    # Apply LoRA to Diffusion Transformer
  enable_proj: false  # Apply LoRA to projection layers (optional)
  r: 32               # LoRA rank (higher = more capacity)
  alpha: 16           # LoRA alpha, scaling = alpha / r
  dropout: 0.0
  # Target modules
  target_modules_lm: ["q_proj", "v_proj", "k_proj", "o_proj"]
  target_modules_dit: ["q_proj", "v_proj", "k_proj", "o_proj"]
# Distribution options (optional)
# hf_model_id: "openbmb/VoxCPM1.5"  # HuggingFace ID
# distribute: true                   # If true, save hf_model_id in lora_config.json
```
### LoRA Parameters
| Parameter | Description | Recommended |
|-----------|-------------|-------------|
| `enable_lm` | Apply LoRA to LM (language model) | `true` |
| `enable_dit` | Apply LoRA to DiT (diffusion model) | `true` (required for voice cloning) |
| `r` | LoRA rank (higher = more capacity) | 16-64 |
| `alpha` | Scaling factor, `scaling = alpha / r` | Usually `r/2` or `r` |
| `target_modules_*` | Layer names to add LoRA | attention layers |
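
To get a feel for how `r` and `alpha` interact, the arithmetic can be worked through directly. This is a rough sketch under the standard LoRA formulation, where each adapted linear layer gains a `lora_A` matrix of shape `r × in_features` and a `lora_B` matrix of shape `out_features × r`; the 1024 × 1024 layer size is a made-up example, not a VoxCPM dimension, and both function names are hypothetical.

```python
def lora_scaling(r, alpha):
    """Factor applied to the low-rank update: scaling = alpha / r."""
    return alpha / r


def lora_param_count(in_features, out_features, r):
    """Trainable parameters added to one linear layer (lora_A plus lora_B)."""
    return r * (in_features + out_features)


# Hypothetical 1024 x 1024 attention projection with the config values above.
r, alpha = 32, 16
added = lora_param_count(1024, 1024, r)
full = 1024 * 1024
print(f"scaling = {lora_scaling(r, alpha)}")             # scaling = 0.5
print(f"LoRA params per layer = {added}")                # 65536
print(f"fraction of the full layer = {added / full:.4f}")  # 0.0625
```

Doubling `r` doubles the added parameters and, if `alpha` is left unchanged, halves the scaling, which is why the two values are usually adjusted together.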
### Distribution Options (Optional)
| Parameter | Description | Default |
|-----------|-------------|---------|
| `hf_model_id` | HuggingFace model ID (e.g., `openbmb/VoxCPM1.5`) | `""` |
| `distribute` | If `true`, save `hf_model_id` as `base_model` in checkpoint; otherwise save local `pretrained_path` | `false` |
> **Note**: If `distribute: true`, `hf_model_id` is required.
### Training
``` bash
# Single GPU
python scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 \
    scripts/train_voxcpm_finetune.py --config_path conf/voxcpm_v1.5/voxcpm_finetune_lora.yaml
```
### Checkpoint Structure
LoRA training saves LoRA parameters and configuration:
```
checkpoints/finetune_lora/
└── step_0002000/
    ├── lora_weights.safetensors  # Only lora_A, lora_B parameters
    ├── lora_config.json          # LoRA config + base model path
    ├── optimizer.pth
    └── scheduler.pth
```
The `lora_config.json` contains:
``` json
{
  "base_model": "/path/to/VoxCPM1.5/",
  "lora_config": {
    "enable_lm": true,
    "enable_dit": true,
    "r": 32,
    "alpha": 16,
    ...
  }
}
```
The `base_model` field contains:
- Local path (default): when `distribute: false` or not set
- HuggingFace ID: when `distribute: true` (e.g., `"openbmb/VoxCPM1.5"`)
This allows loading LoRA checkpoints without the original training config file.
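
A checkpoint saved without `distribute: true` records a local path, which may not exist on another machine. The sketch below shows one way to check this before loading. `read_base_model` is a hypothetical helper, and treating any value that is not an existing directory as a HuggingFace ID is a simplifying assumption.

```python
import json
from pathlib import Path


def read_base_model(ckpt_dir):
    """Return (base_model, is_local) read from <ckpt_dir>/lora_config.json."""
    config_path = Path(ckpt_dir) / "lora_config.json"
    with config_path.open(encoding="utf-8") as f:
        info = json.load(f)
    base_model = info["base_model"]
    # Assumption: a value that is not an existing directory is a HuggingFace ID
    # (or a local path that is missing on this machine).
    is_local = Path(base_model).is_dir()
    return base_model, is_local
```

If `is_local` is `False` for a value that looks like a filesystem path, passing `--base_model` to the inference script is the way to point at a copy that exists locally.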
---
## Inference
### Full Fine-tuning Inference
The checkpoint directory is a complete model, so you can load it directly:
``` bash
python scripts/test_voxcpm_ft_infer.py \
--ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
--text "Hello, this is the fine-tuned model." \
--output output.wav
```
With voice cloning:
``` bash
python scripts/test_voxcpm_ft_infer.py \
--ckpt_dir /path/to/checkpoints/finetune_all/step_0002000 \
--text "This is voice cloning result." \
--prompt_audio /path/to/reference.wav \
--prompt_text "Reference audio transcript" \
--output cloned_output.wav
```
### LoRA Inference
LoRA inference only requires the checkpoint directory (base model path and LoRA config are read from `lora_config.json`):
``` bash
python scripts/test_voxcpm_lora_infer.py \
--lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
--text "Hello, this is LoRA fine-tuned result." \
--output lora_output.wav
```
With voice cloning:
``` bash
python scripts/test_voxcpm_lora_infer.py \
--lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
--text "This is voice cloning with LoRA." \
--prompt_audio /path/to/reference.wav \
--prompt_text "Reference audio transcript" \
--output cloned_output.wav
```
Override base model path (optional):
``` bash
python scripts/test_voxcpm_lora_infer.py \
--lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
--base_model /path/to/another/VoxCPM1.5 \
--text "Use different base model." \
--output output.wav
```
---
## LoRA Hot-swapping
LoRA supports dynamic loading, unloading, and switching at inference time without reloading the entire model.
### API Reference
``` python
from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# 1. Load model with LoRA structure and weights
lora_cfg = LoRAConfig(
    enable_lm=True,
    enable_dit=True,
    r=32,
    alpha=16,
    target_modules_lm=["q_proj", "v_proj", "k_proj", "o_proj"],
    target_modules_dit=["q_proj", "v_proj", "k_proj", "o_proj"],
)
model = VoxCPM.from_pretrained(
    hf_model_id="openbmb/VoxCPM1.5",  # or local path
    load_denoiser=False,  # Optional: disable denoiser for faster loading
    optimize=True,  # Enable torch.compile acceleration
    lora_config=lora_cfg,
    lora_weights_path="/path/to/lora_checkpoint",
)

# 2. Generate audio
audio = model.generate(
    text="Hello, this is LoRA fine-tuned result.",
    prompt_wav_path="/path/to/reference.wav",  # Optional: for voice cloning
    prompt_text="Reference audio transcript",  # Optional: for voice cloning
)

# 3. Disable LoRA (use base model only)
model.set_lora_enabled(False)

# 4. Re-enable LoRA
model.set_lora_enabled(True)

# 5. Unload LoRA (reset weights to zero)
model.unload_lora()

# 6. Hot-swap to another LoRA
loaded, skipped = model.load_lora("/path/to/another_lora_checkpoint")
print(f"Loaded {len(loaded)} params, skipped {len(skipped)}")

# 7. Get current LoRA weights
lora_state = model.get_lora_state_dict()
```
### Simplified Usage (Load from lora_config.json)
If your checkpoint contains `lora_config.json` (saved by the training script), you can load everything automatically:
``` python
import json

from voxcpm.core import VoxCPM
from voxcpm.model.voxcpm import LoRAConfig

# Load config from checkpoint
lora_ckpt_dir = "/path/to/checkpoints/finetune_lora/step_0002000"
with open(f"{lora_ckpt_dir}/lora_config.json") as f:
    lora_info = json.load(f)

base_model = lora_info["base_model"]
lora_cfg = LoRAConfig(**lora_info["lora_config"])

# Load model with LoRA
model = VoxCPM.from_pretrained(
    hf_model_id=base_model,
    lora_config=lora_cfg,
    lora_weights_path=lora_ckpt_dir,
)
```
Or use the test script directly:
``` bash
python scripts/test_voxcpm_lora_infer.py \
--lora_ckpt /path/to/checkpoints/finetune_lora/step_0002000 \
--text "Hello world"
```
### Method Reference
| Method | Description | torch.compile Compatible |
|--------|-------------|--------------------------|
| `load_lora(path)` | Load LoRA weights from file | ✅ |
| `set_lora_enabled(bool)` | Enable/disable LoRA | ✅ |
| `unload_lora()` | Reset LoRA weights to initial values | ✅ |
| `get_lora_state_dict()` | Get current LoRA weights | ✅ |
| `lora_enabled` | Property: check if LoRA is configured | ✅ |
---
## FAQ
### 1. Out of Memory (OOM)
- Increase `grad_accum_steps` (gradient accumulation)
- Decrease `batch_size`
- Use LoRA fine-tuning instead of full fine-tuning
- Decrease `max_batch_tokens` to filter long samples
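
When trading `batch_size` for `grad_accum_steps`, keeping their product constant leaves the number of samples per optimizer step unchanged. The sketch below illustrates the arithmetic. Both helper names are hypothetical, and it assumes `batch_size` is counted per GPU and that `max_batch_tokens` does not further shrink batches.

```python
def effective_batch_size(batch_size, grad_accum_steps, num_gpus=1):
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * grad_accum_steps * num_gpus


def rebalance(batch_size, grad_accum_steps, new_batch_size):
    """Return the grad_accum_steps that keeps the per-GPU product unchanged."""
    total = batch_size * grad_accum_steps
    if total % new_batch_size != 0:
        raise ValueError(f"{new_batch_size} does not divide {total} evenly")
    return total // new_batch_size


# Shrinking batch_size from 16 to 4 calls for 4 accumulation steps.
print(rebalance(batch_size=16, grad_accum_steps=1, new_batch_size=4))  # 4
```

Accumulating gradients lowers peak memory at the cost of more forward and backward passes per optimizer step.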
### 2. Poor LoRA Performance
- Increase `r` (LoRA rank)
- Adjust `alpha` (try `alpha = r/2` or `alpha = r`)
- Increase training steps
- Add more target modules
### 3. Training Not Converging
- Decrease `learning_rate`
- Increase `warmup_steps`
- Check data quality
### 4. LoRA Not Taking Effect at Inference
- Check that `lora_config.json` exists in the checkpoint directory
- Check `load_lora()` return value - `skipped_keys` should be empty
- Verify `set_lora_enabled(True)` is called
### 5. Checkpoint Loading Errors
- Full fine-tuning: checkpoint directory should contain `model.safetensors` (or `pytorch_model.bin`), `config.json`, `audiovae.pth`
- LoRA: checkpoint directory should contain:
  - `lora_weights.safetensors` (or `lora_weights.ckpt`) - LoRA weights
  - `lora_config.json` - LoRA config and base model path
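
The file lists above can be checked with a few lines of Python before attempting to load a checkpoint. This is a minimal sketch: `missing_checkpoint_files` and the `EXPECTED_FILES` table are hypothetical names that simply encode the file names listed in this FAQ entry.

```python
from pathlib import Path

# Each inner tuple lists interchangeable file names; one of them must exist.
EXPECTED_FILES = {
    "full": [
        ("model.safetensors", "pytorch_model.bin"),
        ("config.json",),
        ("audiovae.pth",),
    ],
    "lora": [
        ("lora_weights.safetensors", "lora_weights.ckpt"),
        ("lora_config.json",),
    ],
}


def missing_checkpoint_files(ckpt_dir, kind):
    """Return descriptions of expected files absent from ckpt_dir.

    kind is "full" for full fine-tuning checkpoints or "lora" for LoRA ones.
    """
    root = Path(ckpt_dir)
    missing = []
    for alternatives in EXPECTED_FILES[kind]:
        if not any((root / name).is_file() for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing
```

An empty list from `missing_checkpoint_files(ckpt_dir, "lora")` means the directory has the files the LoRA loader expects; it does not validate their contents.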