What are some tips for improving inference on low-quality audio with SVC/RVC?
Please let me know if there’s a more suitable subreddit for this topic!
I trained a local voice model of my father, who recently passed away. He was a musician, and I discovered some songs he recorded in the 80s. Unfortunately, I only have a digitized copy from an audio cassette, though I might find an open-reel copy in the basement.
Because my dad was always writing new songs, I was able to assemble a training dataset of high-quality vocal stems at 96 kHz/24-bit: about 30 minutes of audio spanning 20 years. I also added a speech he gave, bringing the total to around 55 minutes.
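For concreteness, the kind of dataset prep I mean is roughly this (a minimal sketch, assuming librosa/soundfile; the 40 kHz target rate, 8-second clip length, and file paths are placeholders — check what your RVC fork actually expects):

```python
# Dataset-prep sketch: resample, trim silence, normalize, slice into clips.
import glob
import os

import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 40000      # common RVC training rate; verify for your fork
CLIP_SECONDS = 8       # arbitrary clip length for training segments
OUT_DIR = "dataset_clips"

os.makedirs(OUT_DIR, exist_ok=True)

for path in glob.glob("stems/*.wav"):
    # Load mono at the target rate; librosa resamples on load.
    y, sr = librosa.load(path, sr=TARGET_SR, mono=True)

    # Trim leading/trailing silence so clips are mostly voiced.
    y, _ = librosa.effects.trim(y, top_db=30)

    # Peak-normalize so heterogeneous sources sit at a similar level.
    peak = np.abs(y).max()
    if peak > 0:
        y = y / peak * 0.95

    # Slice into fixed-length clips and write 16-bit PCM WAVs.
    samples_per_clip = TARGET_SR * CLIP_SECONDS
    base = os.path.splitext(os.path.basename(path))[0]
    for i in range(0, len(y) - samples_per_clip + 1, samples_per_clip):
        clip = y[i:i + samples_per_clip]
        sf.write(os.path.join(OUT_DIR, f"{base}_{i // samples_per_clip:03d}.wav"),
                 clip, TARGET_SR, subtype="PCM_16")
```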
My plan was to use his voice model to restore the original vocals from the cassette and re-record the other instruments myself. I isolated the vocals using UVR (and also tried MDX23) and cleaned them up as much as possible; a sketch of the kind of cleanup chain I mean is below.
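Roughly: a high-pass to drop tape rumble, then gentle spectral gating (a sketch assuming the noisereduce package and scipy; the filenames and parameter values are placeholders, not tuned settings):

```python
# Cleanup sketch for the isolated cassette vocal.
import noisereduce as nr
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

y, sr = sf.read("uvr_vocal_stem.wav")     # hypothetical filename
if y.ndim > 1:
    y = y.mean(axis=1)                    # fold to mono

# High-pass at ~80 Hz to remove tape rumble below the vocal range.
sos = butter(4, 80, btype="highpass", fs=sr, output="sos")
y = sosfiltfilt(sos, y)

# Gentle spectral gating; aggressive settings smear consonants,
# which can worsen problems like "free" -> "fee" downstream.
y = nr.reduce_noise(y=y, sr=sr, stationary=False, prop_decrease=0.6)

sf.write("vocal_cleaned.wav", y, sr)
```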
While RVC produces decent results, it struggles with certain words, e.g., “free” becomes “fee.” SVC offers better tonality but introduces more artifacts and inconsistent pitch. Higher-quality samples from my own tracks don’t show these issues nearly as much, though RVC tends to blend the input voice with the model, producing a hybrid timbre rather than a pure model sound.
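For the pitch issue, one thing worth checking is whether the instability already exists in the cleaned source before conversion — if the f0 track of the cassette vocal is full of octave jumps, the converter will inherit them. A quick sketch using librosa’s pyin (the note range is a rough guess for a male vocal; adjust to taste):

```python
# f0-stability check: see whether pitch errors exist in the source itself.
import librosa
import numpy as np

y, sr = librosa.load("vocal_cleaned.wav", sr=None)

f0, voiced_flag, _ = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),   # rough male vocal range; adjust
    fmax=librosa.note_to_hz("C6"),
    sr=sr,
)

voiced = f0[voiced_flag]
print(f"voiced frames: {voiced_flag.mean():.1%}")
print(f"median f0: {np.nanmedian(voiced):.1f} Hz")

# Large frame-to-frame jumps suggest octave errors the converter will copy.
jumps = np.abs(np.diff(np.log2(voiced)))
print(f"frames with > half-octave jumps: {(jumps > 0.5).mean():.1%}")
```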
Happy to share sample audio if needed!
tl;dr: High-quality training data, low-quality inference audio, and the model is trained on the same voice that’s in the input audio.
Any tips for optimizing results, using lyrics/text alongside the audio, or training custom UVR/MDX models for better isolation?