About: Fine-tuning lets you adapt Kitten TTS models to specific voices. It requires a dataset of clean audio recordings with matching transcripts.
Overview
Kitten TTS models can be fine-tuned on custom voice datasets. The process involves preparing audio-text pairs, selecting a base model (mini/micro/nano), and running training.
Dataset Preparation
Requirements
- 5-30 minutes of clean audio (single speaker)
- 16-24 kHz sample rate, mono WAV files
- Matching text transcripts
- Quiet recording environment (no background noise)
- Consistent microphone and recording setup
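The measurable requirements above (sample rate, channel count, total duration) can be checked before training. The following is a minimal sketch using only Python's standard `wave` module; the function names `inspect_wav` and `check_dataset_audio` are hypothetical helpers, not part of Kitten TTS, and the sketch assumes uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def inspect_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)
    return rate, channels, duration


def check_dataset_audio(wav_dir, min_rate=16000, max_rate=24000, min_minutes=5.0):
    """Collect human-readable problems for every .wav file in wav_dir.

    The thresholds mirror the requirements list; adjust them if your
    training setup expects something different.
    """
    problems = []
    total_seconds = 0.0
    for path in sorted(Path(wav_dir).glob("*.wav")):
        rate, channels, duration = inspect_wav(path)
        total_seconds += duration
        if not min_rate <= rate <= max_rate:
            problems.append(f"{path.name}: sample rate {rate} Hz is outside {min_rate}-{max_rate} Hz")
        if channels != 1:
            problems.append(f"{path.name}: {channels} channels, expected mono")
    total_minutes = total_seconds / 60.0
    if total_minutes < min_minutes:
        problems.append(f"total audio is {total_minutes:.1f} min, below the {min_minutes} min minimum")
    return problems, total_minutes
```

Calling `check_dataset_audio("dataset/wavs")` returns a list of problems and the total duration in minutes. Noise level and microphone consistency cannot be verified this way and still need a listening pass.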
Dataset Structure
dataset/
  metadata.csv    # file_name|text
  wavs/
    audio_001.wav
    audio_002.wav
    audio_003.wav
    ...
metadata.csv Format
audio_001|Welcome to my custom voice dataset.
audio_002|This audio should be clean and clear.
audio_003|Each file should have matching text.
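A quick consistency check between metadata.csv and the wavs/ folder catches missing files and malformed rows early. This sketch assumes, based on the example rows above, that the first column is the file name without the `.wav` extension; `load_metadata` is a hypothetical helper, not a Kitten TTS function.

```python
import csv
from pathlib import Path


def load_metadata(dataset_dir):
    """Read <dataset_dir>/metadata.csv and return (entries, problems).

    entries  -- list of (file_name, text) tuples for well-formed rows
    problems -- list of strings describing malformed rows or missing audio
    """
    dataset_dir = Path(dataset_dir)
    entries, problems = [], []
    with open(dataset_dir / "metadata.csv", newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks in the transcript as literal text.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 2 or not row[0].strip() or not row[1].strip():
                problems.append(f"line {line_no}: expected 'file_name|text'")
                continue
            name, text = row[0].strip(), row[1].strip()
            # Assumption: names in metadata.csv omit the .wav extension.
            if not (dataset_dir / "wavs" / f"{name}.wav").is_file():
                problems.append(f"line {line_no}: wavs/{name}.wav not found")
            entries.append((name, text))
    return entries, problems
```

Because the pipe character is the delimiter, a transcript that itself contains `|` shows up as a malformed row in this check.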
Base Model Selection
| Model | Params | VRAM Needed | Training Time | Best For |
|---|---|---|---|---|
| mini-0.8 | 80M | 4-8 GB | ~2-4 hours | Best quality results |
| micro-0.8 | 40M | 2-4 GB | ~1-2 hours | Good quality, faster |
| nano-0.8 | 15M | 1-2 GB | ~30-60 min | Quick experiments |
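The table can be read as a simple lookup from available GPU memory to the largest model that fits. The sketch below treats the lower bound of each VRAM range as the minimum requirement, which is an assumption; `largest_model_for` is a hypothetical helper.

```python
# (name, parameters, assumed minimum VRAM in GB), largest model first.
# The minimums are the lower bounds of the VRAM ranges in the table.
BASE_MODELS = [
    ("mini-0.8", "80M", 4.0),
    ("micro-0.8", "40M", 2.0),
    ("nano-0.8", "15M", 1.0),
]


def largest_model_for(vram_gb):
    """Return the name of the largest base model that fits, or None."""
    for name, _params, min_vram in BASE_MODELS:
        if vram_gb >= min_vram:
            return name
    return None
```

For example, `largest_model_for(3)` returns `"micro-0.8"`. Fitting in memory does not make a model the best choice; training time and dataset size matter as well.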
Recommendation: Start with micro-0.8 for your first fine-tuning experiments. It balances quality and training speed well.
Training Tips
Audio Quality
- Remove background noise with Audacity or similar tools
- Normalize volume across all samples
- Trim silence from beginning and end of each clip
- Keep clips between 2-15 seconds each
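Noise removal is best done in an audio editor, but trimming, normalization, and the clip-length check are easy to script. The sketch below works on a plain sequence of signed 16-bit samples; the silence threshold and target peak are illustrative values rather than recommendations from Kitten TTS, and all three function names are hypothetical.

```python
def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose absolute value is below threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def normalize_peak(samples, target_peak=29491):
    """Scale samples so the loudest one reaches target_peak (about 90% of 16-bit full scale)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)  # silent or empty clip: nothing to scale
    scale = target_peak / peak
    # Clamp to the signed 16-bit range in case of rounding at the extremes.
    return [max(-32768, min(32767, round(s * scale))) for s in samples]


def clip_length_ok(num_samples, sample_rate, min_s=2.0, max_s=15.0):
    """True if the clip duration falls within the 2-15 second guideline."""
    return min_s <= num_samples / sample_rate <= max_s
```

Peak normalization makes the loudest sample of each clip equal, which only approximates equal perceived loudness. A loudness-based tool may give more consistent results.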
Text Quality
- Ensure transcripts match audio exactly
- Use proper punctuation (., !, ?)
- Lowercase text for consistency
- Include a variety of sentence structures
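The mechanical parts of these tips (lowercasing, consistent whitespace, terminal punctuation) can be applied automatically. This is a minimal sketch; `normalize_transcript` is a hypothetical helper, and appending a period to unpunctuated lines is an assumption about what "proper punctuation" should mean for your data.

```python
import re


def normalize_transcript(text):
    """Lowercase, collapse whitespace, and make sure the line ends with ., ! or ?."""
    text = re.sub(r"\s+", " ", text).strip().lower()
    if text and text[-1] not in ".!?":
        text += "."  # assumption: unpunctuated lines are plain statements
    return text
```

For example, `normalize_transcript("  Welcome to  my dataset ")` returns `"welcome to my dataset."`. A script like this cannot confirm that the transcript matches the audio, so that still has to be checked by listening.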
Hyperparameter Tips
- Start with default learning rate
- Train for 10-50 epochs depending on dataset size
- Monitor loss curve for overfitting
- Save checkpoints every 5 epochs
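The last two tips can be expressed as small helpers around whatever training loop you use. The sketch below assumes you record one validation loss per epoch on clips held out from training, which the tips above do not spell out; `should_checkpoint` and `best_epoch_if_stalled` are hypothetical names, and the patience value is illustrative.

```python
def should_checkpoint(epoch, every=5):
    """True on epochs 5, 10, 15, ... when epochs are counted from 1."""
    return epoch % every == 0


def best_epoch_if_stalled(val_losses, patience=3):
    """Return the 1-based epoch with the lowest validation loss once that loss
    has failed to improve for `patience` consecutive epochs; otherwise None.

    A validation loss that stops improving while training continues is a
    common sign of overfitting.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None
```

If `best_epoch_if_stalled` returns an epoch, the nearest saved checkpoint at or before it is a reasonable candidate to evaluate first.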
Evaluation
After training, evaluate your model:
from kittentts import KittenTTS

# Load the fine-tuned model from its output directory
model = KittenTTS("./my-fine-tuned-model")

# Test with a variety of texts
test_texts = [
    "Hello, this is a test of my custom voice.",
    "The quick brown fox jumps over the lazy dog.",
    "I hope this voice sounds natural and clear.",
]

# Write one WAV file per sentence: test_0.wav, test_1.wav, test_2.wav
for i, text in enumerate(test_texts):
    model.generate_to_file(text, f"test_{i}.wav")
Common Issues
Robotic output: Usually caused by poor audio quality or insufficient data. Ensure clean recordings and at least 10 minutes of audio.
Overfitting: If the model only reproduces exact training phrases, reduce epochs or increase dataset size.
Pronunciation issues: Include more varied vocabulary in your training data to improve generalization.
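One rough way to judge vocabulary variety is to count the distinct words in your transcripts and list the ones that appear only once. The sketch below is a simple word-level heuristic, not a phonetic analysis; `vocabulary_stats` is a hypothetical helper, and its tokenizer handles only the letters a-z and apostrophes.

```python
import re
from collections import Counter


def vocabulary_stats(transcripts):
    """Summarize word variety across an iterable of transcript strings."""
    words = Counter()
    for text in transcripts:
        # Simple tokenizer: runs of letters and apostrophes, case-insensitive.
        words.update(re.findall(r"[a-z']+", text.lower()))
    return {
        "total_words": sum(words.values()),
        "unique_words": len(words),
        "seen_once": sorted(w for w, n in words.items() if n == 1),
    }
```

A low ratio of unique to total words suggests repetitive transcripts. Words that appear only once are candidates for additional recordings if the model mispronounces them.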