What are some tips for improving inference on low-quality audio with SVC/RVC?
Please let me know if there’s a more suitable subreddit for this topic!
I trained a local voice model of my father, who recently passed away. He was a musician, and I discovered some songs he recorded in the 80s. Unfortunately, I only have a digitized copy from an audio cassette, though I might find an open-reel copy in the basement.
Because my dad was always writing new songs, I was able to assemble a training dataset of high-quality vocal stems at 96 kHz/24-bit: about 30 minutes of audio spanning 20 years. I also added a speech he gave, bringing the total to around 55 minutes.
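For concreteness, the kind of dataset prep I mean is roughly this (a minimal sketch, assuming librosa/soundfile; the 40 kHz target rate, 8-second clip length, and file paths are placeholders — check what your RVC fork actually expects):

```python
# Dataset-prep sketch: resample, trim silence, normalize, slice into clips.
import glob
import os

import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 40000      # common RVC training rate; verify for your fork
CLIP_SECONDS = 8       # arbitrary clip length for training segments
OUT_DIR = "dataset_clips"

os.makedirs(OUT_DIR, exist_ok=True)

for path in glob.glob("stems/*.wav"):
    # Load mono at the target rate; librosa resamples on load.
    y, sr = librosa.load(path, sr=TARGET_SR, mono=True)

    # Trim leading/trailing silence so clips are mostly voiced.
    y, _ = librosa.effects.trim(y, top_db=30)

    # Peak-normalize so heterogeneous sources sit at a similar level.
    peak = np.abs(y).max()
    if peak > 0:
        y = y / peak * 0.95

    # Slice into fixed-length clips and write 16-bit PCM WAVs.
    samples_per_clip = TARGET_SR * CLIP_SECONDS
    base = os.path.splitext(os.path.basename(path))[0]
    for i in range(0, len(y) - samples_per_clip + 1, samples_per_clip):
        clip = y[i:i + samples_per_clip]
        sf.write(os.path.join(OUT_DIR, f"{base}_{i // samples_per_clip:03d}.wav"),
                 clip, TARGET_SR, subtype="PCM_16")
```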
My plan was to use his voice model to restore the original vocals from the cassette and re-record the other instruments myself. I isolated the vocals using UVR (and also tried MDX23) and cleaned them up as much as possible; a sketch of the kind of cleanup chain I mean is below.
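Roughly: a high-pass to drop tape rumble, then gentle spectral gating (a sketch assuming the noisereduce package and scipy; the filenames and parameter values are placeholders, not tuned settings):

```python
# Cleanup sketch for the isolated cassette vocal.
import noisereduce as nr
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

y, sr = sf.read("uvr_vocal_stem.wav")     # hypothetical filename
if y.ndim > 1:
    y = y.mean(axis=1)                    # fold to mono

# High-pass at ~80 Hz to remove tape rumble below the vocal range.
sos = butter(4, 80, btype="highpass", fs=sr, output="sos")
y = sosfiltfilt(sos, y)

# Gentle spectral gating; aggressive settings smear consonants,
# which can worsen problems like "free" -> "fee" downstream.
y = nr.reduce_noise(y=y, sr=sr, stationary=False, prop_decrease=0.6)

sf.write("vocal_cleaned.wav", y, sr)
```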
While RVC produces decent results, it struggles with certain words, e.g., “free” becomes “fee.” SVC offers better tonality but introduces more artifacts and inconsistent pitch. Higher-quality samples from my own tracks don’t show these issues nearly as much, though RVC tends to blend the input voice with the model, producing a hybrid timbre rather than a pure model sound.
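For the pitch issue, one thing worth checking is whether the instability already exists in the cleaned source before conversion — if the f0 track of the cassette vocal is full of octave jumps, the converter will inherit them. A quick sketch using librosa’s pyin (the note range is a rough guess for a male vocal; adjust to taste):

```python
# f0-stability check: see whether pitch errors exist in the source itself.
import librosa
import numpy as np

y, sr = librosa.load("vocal_cleaned.wav", sr=None)

f0, voiced_flag, _ = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),   # rough male vocal range; adjust
    fmax=librosa.note_to_hz("C6"),
    sr=sr,
)

voiced = f0[voiced_flag]
print(f"voiced frames: {voiced_flag.mean():.1%}")
print(f"median f0: {np.nanmedian(voiced):.1f} Hz")

# Large frame-to-frame jumps suggest octave errors the converter will copy.
jumps = np.abs(np.diff(np.log2(voiced)))
print(f"frames with > half-octave jumps: {(jumps > 0.5).mean():.1%}")
```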
Happy to share sample audio if needed!
tl;dr: High-quality training data, low-quality inference audio, and the model is trained on the same voice that’s in the input audio.
Any tips for optimizing results, using lyrics/text alongside the audio, or training custom UVR/MDX models for better isolation?