About: Fine-tuning adapts a Kitten TTS model to a specific voice. It requires a dataset of clean, single-speaker audio recordings with matching transcripts.

Overview

Kitten TTS models can be fine-tuned on custom voice datasets. The process involves preparing audio-text pairs, selecting a base model (mini/micro/nano), and running training.

Dataset Preparation

Requirements

  • 5-30 minutes of clean audio (single speaker)
  • 16-24 kHz sample rate, mono WAV files
  • Matching text transcripts
  • Quiet recording environment (no background noise)
  • Consistent microphone and recording setup

Dataset Structure

dataset/
  metadata.csv   # file_name|text
  wavs/
    audio_001.wav
    audio_002.wav
    audio_003.wav
    ...

metadata.csv Format

audio_001|Welcome to my custom voice dataset.
audio_002|This audio should be clean and clear.
audio_003|Each file should have matching text.
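
Before training, it can help to check that the dataset matches the layout and requirements above. The following is a minimal sketch using only the Python standard library. The function name validate_dataset is hypothetical, and the sketch assumes that metadata.csv omits the ".wav" extension (as in the sample rows) and that the WAV files are uncompressed PCM.

```python
import csv
import wave
from pathlib import Path


def validate_dataset(root, min_rate=16000, max_rate=24000):
    """Check metadata.csv rows against wavs/; return (problems, total_seconds)."""
    root = Path(root)
    problems = []
    total_seconds = 0.0
    with open(root / "metadata.csv", newline="", encoding="utf-8") as f:
        # Rows are pipe-delimited: file_name|text. Quoting is disabled so that
        # quotation marks inside transcripts are read literally.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 2 or not row[1].strip():
                problems.append(f"line {line_no}: expected file_name|text")
                continue
            name = row[0].strip()
            # Assumption: the ".wav" extension is omitted in metadata.csv.
            wav_path = root / "wavs" / f"{name}.wav"
            if not wav_path.is_file():
                problems.append(f"{name}: missing {wav_path}")
                continue
            try:
                with wave.open(str(wav_path), "rb") as wav:
                    if wav.getnchannels() != 1:
                        problems.append(f"{name}: not mono")
                    rate = wav.getframerate()
                    if not min_rate <= rate <= max_rate:
                        problems.append(f"{name}: sample rate {rate} Hz out of range")
                    total_seconds += wav.getnframes() / rate
            except wave.Error as exc:
                problems.append(f"{name}: unreadable WAV ({exc})")
    return problems, total_seconds
```

Calling validate_dataset("dataset") returns a list of human-readable problems and the total audio duration in seconds, which can be compared against the 5-30 minute guideline.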

Base Model Selection

Model      Params  VRAM Needed  Training Time  Best For
mini-0.8   80M     4-8 GB       ~2-4 hours     Best quality results
micro-0.8  40M     2-4 GB       ~1-2 hours     Good quality, faster
nano-0.8   15M     1-2 GB       ~30-60 min     Quick experiments

Recommendation: Start with micro-0.8 for your first fine-tuning experiments. It balances quality and training speed well.
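
As a rough illustration, the VRAM figures above can be turned into a small selection helper. This is only a sketch: the name pick_base_model is hypothetical, and it assumes you want to budget for the upper end of each VRAM range, which is a conservative reading of the figures rather than a documented rule.

```python
# VRAM ranges from the model comparison: (model, low_gb, high_gb),
# ordered from largest to smallest model.
BASE_MODELS = [
    ("mini-0.8", 4, 8),
    ("micro-0.8", 2, 4),
    ("nano-0.8", 1, 2),
]


def pick_base_model(vram_gb):
    """Return the largest model whose upper VRAM bound fits, or None."""
    for name, _low, high in BASE_MODELS:
        # Assumption: budgeting for the top of each range is the safe choice.
        if vram_gb >= high:
            return name
    return None
```

With 6 GB of VRAM this returns "micro-0.8". A less conservative variant could compare against the lower bound of each range instead.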

Training Tips

Audio Quality

  • Remove background noise with Audacity or similar tools
  • Normalize volume across all samples
  • Trim silence from beginning and end of each clip
  • Keep each clip between 2 and 15 seconds long
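
Silence trimming, volume normalization, and the clip-length check can also be scripted. The sketch below uses only the Python standard library and assumes 16-bit mono PCM WAV files. The function names, the silence threshold of 500, and the target peak of 29000 are illustrative assumptions, not values from Kitten TTS. Noise removal is not covered here.

```python
import array
import sys
import wave


def read_mono_16bit(path):
    """Read a 16-bit mono WAV file and return (samples, sample_rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        samples = array.array("h", wav.readframes(wav.getnframes()))
        rate = wav.getframerate()
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return samples, rate


def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples quieter than the threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return samples[:0]  # the whole clip is below the threshold
    return samples[loud[0]:loud[-1] + 1]


def normalize_peak(samples, target_peak=29000):
    """Scale samples so the loudest one reaches target_peak (at most 32767)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return samples
    gain = target_peak / peak
    return array.array("h", (int(round(s * gain)) for s in samples))


def clip_length_ok(samples, rate, min_s=2.0, max_s=15.0):
    """Return True if the clip lasts between min_s and max_s seconds."""
    return min_s <= len(samples) / rate <= max_s
```

A simple amplitude threshold works only on recordings that are already fairly quiet between phrases. For noisier material, a dedicated audio editor is likely the better tool.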

Text Quality

  • Ensure transcripts match audio exactly
  • Use proper punctuation (., !, ?)
  • Lowercase text for consistency
  • Include a variety of sentence structures
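
The mechanical parts of these tips (consistent casing and terminal punctuation) can be applied automatically. This is a minimal sketch with a hypothetical helper name. It assumes English text and that a missing final mark should default to a period. It cannot verify that a transcript actually matches its audio.

```python
import re


def normalize_transcript(text):
    """Lowercase, collapse whitespace, and ensure terminal punctuation."""
    text = re.sub(r"\s+", " ", text).strip().lower()
    if text and text[-1] not in ".!?":
        text += "."  # assumption: default to a period when no mark is present
    return text
```

For example, normalize_transcript("  Welcome   to my Dataset ") yields "welcome to my dataset.".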

Hyperparameter Tips

  • Start with the default learning rate
  • Train for 10-50 epochs, depending on dataset size
  • Monitor the loss curve for signs of overfitting
  • Save a checkpoint every 5 epochs
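
The checkpointing and overfitting tips can be sketched independently of any training framework. The helpers below are hypothetical and assume that you record a validation loss on held-out clips after every epoch. Neither the held-out split nor the patience value comes from the Kitten TTS tooling.

```python
def checkpoint_epochs(total_epochs, every=5):
    """Return the 1-based epochs at which to save a checkpoint."""
    return list(range(every, total_epochs + 1, every))


def best_epoch_before_overfit(val_losses, patience=3):
    """Return the 1-based epoch with the lowest validation loss once
    `patience` later epochs have failed to improve on it, else None."""
    best_epoch, best_loss = None, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif best_epoch is not None and epoch - best_epoch >= patience:
            return best_epoch  # loss has stopped improving: likely overfitting
    return None
```

If best_epoch_before_overfit returns an epoch, the checkpoint saved closest to it is a reasonable candidate to evaluate first.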

Evaluation

After training, evaluate your model:

from kittentts import KittenTTS

# Load fine-tuned model
model = KittenTTS("./my-fine-tuned-model")

# Test with various texts
test_texts = [
    "Hello, this is a test of my custom voice.",
    "The quick brown fox jumps over the lazy dog.",
    "I hope this voice sounds natural and clear."
]

# Write one WAV file per test sentence (test_0.wav, test_1.wav, ...)
for i, text in enumerate(test_texts):
    model.generate_to_file(text, f"test_{i}.wav")

Common Issues

Robotic output: Usually caused by poor audio quality or insufficient data. Use clean recordings and aim for at least 10 minutes of audio, even though 5 minutes is the stated minimum.
Overfitting: If the model only reproduces exact training phrases, reduce the number of epochs or increase the dataset size.
Pronunciation issues: Include more varied vocabulary in your training data to improve generalization.
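
One rough way to judge vocabulary variety is to count the distinct words in your transcripts. The sketch below is a hypothetical helper that assumes English text and a simple letters-and-apostrophes tokenizer. The counts are only a heuristic and say nothing about phonetic coverage.

```python
import re
from collections import Counter


def vocabulary_report(transcripts):
    """Count total words, unique words, and words that appear only once."""
    words = Counter()
    for text in transcripts:
        words.update(re.findall(r"[a-z']+", text.lower()))
    return {
        "total_words": sum(words.values()),
        "unique_words": len(words),
        "seen_once": sum(1 for count in words.values() if count == 1),
    }
```

A low ratio of unique words to total words suggests that the recordings repeat the same phrases and that more varied sentences may help.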