I Cloned My Own Voice for My Website

The intro video for this site shipped with a voiceover. It was articulate, warm, well-paced — and completely not me. It was a synthetic voice off the shelf, and the moment I heard it narrate my own words, it landed wrong. If the whole point of the site is me, building in public, the voice can’t be a stranger.

So the fix was obvious: use my own voice. The interesting part was everything that went wrong getting there.

Finding the sample

You can’t clone a voice you don’t have audio of. I had a better source than I expected: years of recorded updates — me, alone, talking to a camera for ten minutes at a stretch. A solo monologue is the ideal clone source. One speaker, no crosstalk, natural prosody, plenty of it.

The content of those recordings doesn’t matter, and that’s worth saying clearly: a voice clone copies timbre, not words. The reference clip is throwaway. The model never repeats what’s in it. What gets published is the new script in the cloned voice — so the source recording stays private and the output carries none of its content. I pulled about twenty seconds of clean speech, normalized the levels, and that was the raw material.

The obvious way, and why it broke

Many modern TTS engines offer two ways to clone a voice. One path conditions on the reference audio plus a transcript of it (in-context learning). The other extracts a speaker embedding, an “x-vector”, that captures timbre and discards the words entirely.

I registered my voice and used the in-context path, because it’s the default and it gives the best quality when the reference and the target are the same language. I asked it to say one sentence.

It produced two and a half minutes of audio.

Not my sentence stretched out — a runaway. The model latched onto the reference and never found the exit. I tried a shorter clip, a cleaner clip, a different segment. Same result every time: a clean reference, a short prompt, and a minute or more of unintelligible drift. Meanwhile the eleven voices already installed worked perfectly through the exact same path.

The actual fix

The tell was that the other path — the x-vector one — was stable. The blend feature, which works purely on speaker embeddings, produced a clean twelve-second clip of my voice on the first try.

The in-context path was destabilizing on the combination of my specific clip and the engine’s fixed reference transcript. The x-vector path sidesteps that pairing: it takes only the timbre and synthesizes the new text natively. So I routed my voice through it. One configuration flag, no model retraining: tag the reference as a different language than the target, which is exactly how the engine decides to use timbre-only mode for cross-lingual voices. My voice now rides the same rails as a voice cloned from a clip in another language.

Then the rule I should have started with: I ran the output back through speech-to-text. If the transcription comes back as my exact script, the clone is intelligible. It did. That round-trip — synthesize, then transcribe, then diff against the input — is the cheapest possible reality check for generated audio, and it would have caught the runaway in one step instead of five.
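A minimal sketch of the comparison step in that round-trip, assuming the transcript has already come back from the speech-to-text engine as a plain string. The function names and the 0.9 threshold are my own illustration, not part of any tool named here; a strict check would demand a score of exactly 1.0, and the looser default only allows for punctuation and small ASR spelling differences.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase, unify apostrophes, drop punctuation, split into words."""
    text = text.lower().replace("’", "'")
    return re.sub(r"[^a-z0-9' ]+", " ", text).split()


def roundtrip_score(script: str, transcript: str) -> float:
    """Word-level similarity in [0, 1] between the script and the transcript."""
    matcher = difflib.SequenceMatcher(None, normalize(script), normalize(transcript))
    return matcher.ratio()


def is_intelligible(script: str, transcript: str, threshold: float = 0.9) -> bool:
    """True when the transcript is close enough to the script (threshold is an assumption)."""
    return roundtrip_score(script, transcript) >= threshold
```

An empty transcript against a non-empty script scores 0.0, so a broken clone fails this check outright.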

The pipeline, concretely

Nothing here is exotic, and all of it is open source. The shape:

Pull the source with yt-dlp, which handles Loom, YouTube, and most other players: yt-dlp -x --audio-format wav <url>
Cut a clean ~20s window with ffmpeg (high-pass to kill rumble, loudness-normalize, downsample to the model’s rate), where source.wav stands for whatever file yt-dlp wrote: ffmpeg -ss 30 -t 22 -i source.wav -af "highpass=f=70,loudnorm=I=-18:TP=-1.5,aresample=24000" -ac 1 sample.wav
Register it with the TTS engine — a self-hosted Qwen3-TTS model behind a small OpenAI-shaped API. A voice is just an entry in a manifest:

{
  "wav": "voices/jon.wav",
  "name": "Jon Roosevelt",
  "ref_language": "ko"
}

That ref_language is the whole fix. The server picks the clone path with a one-liner — roughly needs_xvector = (target_language != reference_language). Tag the reference as a language other than the one you synthesize in, and it routes to the timbre-only x-vector prompt instead of the fragile in-context one. No retraining, no fork — one field.
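To make the routing concrete, here is a sketch of that decision in Python. It is not the server's actual code: pick_clone_path, the "xvector" and "in_context" labels, the "en" target code, and the fallback for a missing ref_language are all assumptions for illustration. Only the comparison itself comes from the one-liner above.

```python
import json

# Manifest entry copied from the example above.
MANIFEST_ENTRY = json.loads("""
{
  "wav": "voices/jon.wav",
  "name": "Jon Roosevelt",
  "ref_language": "ko"
}
""")


def pick_clone_path(entry: dict, target_language: str) -> str:
    """Return "xvector" for cross-lingual requests, "in_context" otherwise.

    Assumption: an entry without ref_language is treated as same-language.
    """
    ref_language = entry.get("ref_language", target_language)
    needs_xvector = target_language != ref_language
    return "xvector" if needs_xvector else "in_context"
```

With the reference tagged "ko", any English request takes the timbre-only path; the same entry asked to speak Korean would fall back to in-context conditioning.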

Synthesize with POST /v1/audio/speech {voice: "jon", input: "..."}.
Verify by transcribing the output with Whisper and diffing against the script. A runaway shows up as a 150-second clip; a broken clone shows up as an empty transcription. Both are caught before anything ships.
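The runaway case can be flagged before transcription even starts, just from clip length. A rough sketch, assuming the engine returns uncompressed PCM WAV (which the standard-library wave module can read). The one-second-per-word budget and five-second floor are guesses to tune, not measured values.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def looks_like_runaway(
    path: str,
    script: str,
    max_seconds_per_word: float = 1.0,  # assumed generous upper bound
    floor_seconds: float = 5.0,         # assumed minimum budget for short scripts
) -> bool:
    """True when the clip is far longer than the script could plausibly need."""
    words = len(script.split())
    budget = max(floor_seconds, words * max_seconds_per_word)
    return wav_duration_seconds(path) > budget
```

A one-sentence script that comes back as 150 seconds of audio trips this immediately, which leaves the slower transcription pass for clips that are at least the right length.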

The whole thing is wrapped in a small script now, so next time it’s one command. The video itself is composed in HyperFrames (HTML + GSAP timelines rendered to MP4), with the narration dropped in as the audio track.
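The wrapper itself isn't shown here, so what follows is only a sketch of how the first steps might be assembled in Python. The function names, the base URL in the usage note, and the default file names are hypothetical. The yt-dlp and ffmpeg flags are the ones listed above, with an explicit -i input added, and the request body carries only the two fields mentioned in the synthesize step.

```python
import json
import urllib.request


def ytdlp_command(url: str) -> list[str]:
    """Download a source and extract its audio track as WAV."""
    return ["yt-dlp", "-x", "--audio-format", "wav", url]


def ffmpeg_trim_command(
    src: str, dst: str = "sample.wav", start: int = 30, duration: int = 22
) -> list[str]:
    """Cut a window, high-pass, loudness-normalize, resample to 24 kHz mono."""
    filters = "highpass=f=70,loudnorm=I=-18:TP=-1.5,aresample=24000"
    return [
        "ffmpeg", "-ss", str(start), "-t", str(duration),
        "-i", src, "-af", filters, "-ac", "1", dst,
    ]


def speech_request(base_url: str, voice: str, text: str) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/audio/speech request."""
    body = json.dumps({"voice": voice, "input": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Each command list can be handed to subprocess.run(cmd, check=True), and the request to urllib.request.urlopen, for example speech_request("http://localhost:8000", "jon", script) against a locally hosted server. Keeping the builders separate from execution makes the pipeline easy to dry-run by printing the commands first.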

What I’d tell you

Three things, if you’re cloning a voice:

Timbre, not content. The reference clip is disposable and its words never surface. That makes the privacy story simple and the sourcing easy — any clean solo recording works.
Two clone paths, very different failure modes. In-context conditioning is higher fidelity and more fragile. X-vector embedding is timbre-only and rock-stable. If the fancy path runs away, drop to the embedding.
Verify generated audio by transcribing it. You can’t eyeball a waveform. Round-trip it through ASR and compare to the script. Intelligible is a property you can measure; “sounds right” is not.

The voice on this site is mine now. It took one good recording, one stubborn bug, and one boring verification step I should have run first.

Built on: Qwen3-TTS (Alibaba) for the voice model · yt-dlp and ffmpeg for sourcing · Whisper (OpenAI) for the verification round-trip · HyperFrames + GSAP for the video. All of it open or self-hostable — none of this needs a vendor.