
RVC WebUI (Retrieval-based Voice Conversion Web User Interface) install and usage tutorial

Updated: Dec 27, 2023



Some Links you might be interested in...



Required

Hugging Face Site to Download - Link

(You will need this to open the Zip) - Link

Pretrained Voice Models - Link


Optional

Translated version of the GitHub Repository - GitHub - Link

Google Translate Site - Google Translate

Colab Version of RVC - Link


DaVinci Resolve or Clipchamp - free editing software to combine vocal and instrumental files.


 


Let's install this thing!

Before we get started: This is an all-in-one installer, so no prerequisites are required. It comes with portable Python 3.10.6, ffmpeg, PyTorch, and a lot more already zipped up; you're basically just unzipping a complete product and then clicking the go-web batch file.


Step 1. Head on over to this Hugging Face site. If the link doesn't work, you will have to find it in this GitHub repository.

Step 2. Download the newest RVC files for your PC setup.

The choices are Nvidia or AMD_Intel.

Choose the newest version (I'm choosing RVC10006Nvidia.7z because I have an Nvidia card).


Note: The file is 5.64 GB and was added 27 days ago. When extracted it will be over 10 GB.


Time to Extract BUT...


The default Windows extractor won't handle the .sh files inside this compressed 7z archive. You will need to download and install either WinRAR or 7-Zip (both are FREE) to extract the file you downloaded without errors.


Click on the WinRAR or 7-Zip icon to go to its homepage and download the tool. I used WinRAR.


To install either one, just open the file you downloaded, click Next through everything, and leave the defaults. You won't need to open the program directly after installing.



Now We Can Get Down to Business


Right-click the RVC10006Nvidia.7z file you downloaded and extract it with WinRAR or 7-Zip.


If you don't get a WinRAR option, select "Open with >" and find WinRAR, or choose another app and navigate to the WinRAR application file. For me it was located in C:\Program Files\WinRAR.


I extracted mine into my Documents folder and it created a folder called RVC10006Nvidia for me.


This is what your new RVC WebUI folder will look like... and you're actually done with the install... sort of.


Now click on go-web (this is also what you will click on to start the app every time).

The first time you run this it will take 1-2 minutes because it downloads updates. In the future it will open in about 10 seconds.



When you see the words "Running on local URL: http://0.0.0.0:7897", your tool should be up and running.


By now the RVC WebUI should have popped up in your browser automatically (it's a Gradio UI) and looks like this:


If it didn't pop up automatically, you can hold Ctrl and click the address (or, as we'll see, the lack of an address) http://0.0.0.0:7897 in the console.


You can just type localhost:7897 into your browser's address bar once you see it is running on the local URL http://0.0.0.0:7897.



Note: Typing 0.0.0.0 into the address bar won't work because 0.0.0.0 isn't a destination address; it just means the server is listening on every interface. The loopback address is 127.0.0.1, so type localhost:7897 (which resolves to 127.0.0.1) and it should work.
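If you're curious why the console shows 0.0.0.0 but you browse to localhost, here's a minimal Gradio sketch of the same idea. This is not RVC's actual launch code; the echo function is just a stand-in, and only the port matches the tutorial.

```python
# A tiny Gradio app bound to 0.0.0.0: "listen on every network interface".
# From your own PC you reach it via localhost (127.0.0.1).
import gradio as gr

def echo(text):
    return text

demo = gr.Interface(fn=echo, inputs="text", outputs="text")

# Console prints "Running on local URL: http://0.0.0.0:7897";
# open http://localhost:7897 in your browser to reach it.
demo.launch(server_name="0.0.0.0", server_port=7897)
```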


You're now In!!! BUT Wait....it looks complicated


Let me show you the cheat sheet and take a closer look at the tool.




This is the Model Inference tab cheat sheet.


Step 1. Inferencing voice: choose a model. This will be a .pth file saved in your assets folder. I will talk more about this later.


Main Settings


Transpose is a value in semitones, from -12 (deepest voice) to +12 (highest voice), i.e. up to an octave in either direction. Play with this number to match the input song or audio file to the model voice you chose in step 1 (there's a small pitch-shift sketch after these settings if you want to hear what semitone shifts sound like).


Enter the path of the audio file: this is the source audio (a.k.a. the input file) whose tone and pitch will be copied. If it's a song, the model you chose will auto-tune and "sing" to try to match it. It works much better if your input is vocals only; the next part of this tutorial will teach you how to split vocals from music.


Path to the feature index (just leave this blank)


Auto-detect index path and select from dropdown: this is a .index file found in your logs folder. Make sure it matches the name of the model you chose in step 1, e.g. I chose Bruno Mars for my .pth model, so I need to choose Bruno Mars here.


Pitch extraction algorithm: the tool does an extremely good job of describing each option, but just choose rmvpe since it is the most efficient and has the best quality.
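If you want to hear what the Transpose slider's semitone values actually sound like before running a conversion, here's a quick librosa sketch. It is not part of RVC, "input.wav" is a placeholder for any vocal clip on your PC, and you'll need the librosa and soundfile packages installed.

```python
# Preview what transposing by various semitone amounts does to a clip.
import librosa
import soundfile as sf

y, sr = librosa.load("input.wav", sr=None)       # placeholder file name
for steps in (-12, -5, 5, 12):                   # same -12..+12 range as Transpose
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
    sf.write(f"preview_{steps:+d}_semitones.wav", shifted, sr)
```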



Once all your settings are in order, click Convert. This takes 5-10 seconds, or slightly longer, for a 3-4 minute audio file on an RTX 4070.



The output information should say Success at the top like it does in this picture.

The bottom right will have the audio output. Click play, and if you like the result, click the ellipsis button (three vertical dots) to download the file to your PC.


Before we go to the next section let's look at how to add a custom model:


Custom Model installs


Head over to Hugging Face again and type RVC in the search bar. You can also type RVC, a space, and the voice you're trying to find, e.g. RVC Bruno Mars (do not put a comma).

Here is an example of what that looks like and the result:


Once you click on the model it will bring you to its main page; click on the Files and versions sub-tab.




Alrighty!!! Now you can download the model. Not all repositories look the same, but you're looking for either an all-in-one zip file or a separate .index file and .pth file. Since these are both generically named "model", I will rename them to BrunoMars.index and BrunoMars.pth.


Now that you've downloaded these files you will have to move them into your RVC folder. BUT WAIT! They go into separate folders...



.pth files go into your RVC folder's assets folder, under the weights subfolder (see image above).


.index files go into your RVC folder's logs folder and, IMPORTANT, into a subfolder that you create with the exact same name as the .pth.


e.g. I have a logs > BrunoMars folder that holds BrunoMars.index, matching the BrunoMars.pth that lives in my assets > weights subfolder.


Note: when you train a voice, it will automatically put these files into the correct locations and create these folders for you; e.g. MyVoice_9 was automatically created when training finished.
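If you'd rather script the copy than drag files around, here's a small sketch of the placement described above. This is not an official RVC helper; the RVC root path, the Downloads location, and the "model.pth"/"model.index" filenames are all placeholders for your own setup.

```python
# Drop a downloaded model pair into the folders described above.
from pathlib import Path
import shutil

RVC_ROOT = Path(r"C:\Users\me\Documents\RVC10006Nvidia")   # placeholder path
DOWNLOADS = Path.home() / "Downloads"                      # placeholder path
MODEL_NAME = "BrunoMars"

# .pth goes under assets\weights
weights_dir = RVC_ROOT / "assets" / "weights"
weights_dir.mkdir(parents=True, exist_ok=True)
shutil.copy(DOWNLOADS / "model.pth", weights_dir / f"{MODEL_NAME}.pth")

# .index goes under logs\<ModelName>\, a folder you create with the same name
index_dir = RVC_ROOT / "logs" / MODEL_NAME
index_dir.mkdir(parents=True, exist_ok=True)
shutil.copy(DOWNLOADS / "model.index", index_dir / f"{MODEL_NAME}.index")

print("Done. Click 'Refresh voice list and index path' in the WebUI.")
```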


Ok, you now know how to add custom models. Let's dive into something else.


 


Let's separate the vocals from instrumentals on songs and remove reverb/echo/artifacting


This is a very important process when you're using vocals as input for model inferencing (voice cloning) or when you train a voice of your own. Getting clean vocals and removing echo is not optional if you want a good result.


Here is an overview of the section we will break down:



Ok, you're just looking at this to ensure you're on the same tab I'm on. Let's take a closer look now.




The first bar at the top left is something you will probably never use. It asks for a folder containing multiple audio files (an entire playlist) that you want to process at the same time. If you want to do that, add a folder path there.


You might have guessed what this is for: you can drag the song you want to process into this box, or click this section to browse to its path and upload it.



The first line is Model. I recommend using HP2, but experiment with the others. This is the model that splits your vocals and instrumentals.


Take note: you can copy and paste the path of a folder you created in place of the default text "opt" to choose where your output goes.


(Optional) It's not required, but I put an address in here so I can just delete the folder entirely when I'm done playing with the tool. This is what that looks like:



Let's take a look at the available models again:



The text above does a great job explaining these models but I'm going to break it down and oversimplify it for you.


HP2 is my go-to and works perfectly most of the time. It's the weakest but the most dependable and uses the fewest resources. It takes 5 seconds or less.


HP3 is a little more aggressive than HP2 and will sometimes remove too much music.


HP5 is more advanced but I haven't had luck with it.


onnx dereverb is great, BUT you can only use it on vocals that have already been separated from the instrumentals. The same goes for the next three DeEcho options. I recommend using one of these if you are going to re-upload the vocals to train a model; it lets you clean up the sample for good input.


So: HP2, HP3, and HP5 to split vocals and instrumentals, and the rest for cleaning up vocal-only clips. Keep in mind the dereverb and de-echo models require a lot of VRAM, so keep an eye on your console to make sure you're not running out of memory.


Next is the file format.


WAV is an uncompressed format and the biggest of the four, but also the most compatible.

FLAC is a lossless format that compresses the file to make it smaller, but it's not as widely compatible as the others.

MP3 is the standard lossy compression format for audio; quality is pretty good, but there is some quality loss in exchange for a smaller file.

M4A is the smallest and most compressed of the four, but potentially loses the most quality.


Your ear probably can't tell the difference between the four, but the training models will. I suggest going with .wav.
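If your source file isn't already a .wav, here's a quick conversion sketch using ffmpeg. The install notes say ffmpeg ships inside the RVC package; if you can't find its .exe, any ffmpeg on your PATH works. The filenames below are placeholders.

```python
# Convert an m4a (or mp3) to a stereo 44.1 kHz wav before feeding it to the tool.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "song.m4a", "-ar", "44100", "-ac", "2", "song.wav"],
    check=True,
)
```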



Click Convert to get the separation started. The output information will say "Success", and the audio output will go to the folders you specified in the previous steps.


I've actually never used the default folders but I "assume" it will go to the RVC temp folder.


 

The last thing I will show you is training


Here is an overview of the training




Step 1. Name the output file


The only mandatory thing you do in step 1 is give this output a name in the "Enter Experiment name:" section. I named this test10. Everything else can remain at the defaults.


Step 2a. Add your folder path.

Note: Singer ID didn't work at the time this tutorial was written.



Enter the path of the folder that holds all your training samples. These should be clean samples with no background noise. I recommend running the vocal/reverb removal step on a file before using it for training.


Once you're ready, click the BIG orange Preprocess Data button; your output info should end with "Suc. end preprocess." if everything went well.



Step 2b. Just click the orange button.



Make sure your settings look like mine: 0 for the video card (which will be your main card) and rmvpe_gpu, which is the most efficient way to train. Then click Feature extraction.


Note: here is roughly what the Chinese text translates to: "Select the pitch extraction algorithm: pm speeds things up for singing input; dio is fast for high-quality speech on a weaker CPU; harvest gives better quality but is slow; rmvpe has the best results and uses some CPU/GPU."


Your expected output should say "all-feature-done"; you might have to scroll down to see it.


Click inside the box and scroll down with the mouse wheel.





Step 3. Let's GO!



If you're tired of messing with settings, just click One-Click Training. In fact, I think you can click it right after you've given the output a name and the voice training folder.


Let's set up our training preferences.


An epoch is one full training cycle, i.e. one complete pass over your training data. If you set it to save every 50 epochs, it will save a checkpoint every 50 epochs during training. Here is what that looks like in the console.



As you can see, at 50 epochs there was a notice it saved a model and where it saved it.
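If the console output isn't clear, here's a bare-bones sketch of what that save interval means. This is not RVC's real training loop, and the checkpoint filename is just illustrative, not the exact pattern the tool uses.

```python
# One epoch = one full pass over your training clips. With "save every 50
# epochs", a checkpoint is written each time the counter hits a multiple of 50.
TOTAL_EPOCHS = 200   # whatever you set for total training epochs
SAVE_EVERY = 50      # the save frequency

for epoch in range(1, TOTAL_EPOCHS + 1):
    # ... one pass over all training samples would happen here ...
    if epoch % SAVE_EVERY == 0:
        print(f"epoch {epoch}: saving checkpoint (e.g. test10_e{epoch}.pth)")
```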


EPOCHS!!!!!!!!!!!!!!



This is the most important concept in training and will make or break your result. Epochs go from 1 to 1000.


I recommend starting with under 50 epochs for under 5 minutes of audio. THIS is just a guess based on the two models I've trained and what gave me the best results. The models I've trained with my voice used only 2-5 minutes of audio at 50-100 epochs. They are not the best clones; you will need more clips and more time to get better results.


This doesn't scale linearly, and the quality of the audio is also a factor. Some of the default models were trained on 50 hours of audio. Check out Hugging Face to see typical epoch counts and how many minutes or hours of audio people used to train their models.


If you go too high, you will hear artifacting and the voice will sound over-autotuned or super robotic. If you go too low, it won't capture enough of your accent or the nuance in your voice and it will end up monotone.


You will have to play with the epochs to find the right balance for the sample you gave it.




Next is batch size per GPU. As this tool gets updated it becomes better optimized, and most GPUs can accept larger batch sizes. As of now, my recommendation is 40 if you have 24 GB of VRAM, 30 for 12 GB, 20 for 8 GB, 10 for 6 GB, and sorry if you have 4 GB of VRAM, but in that case I wouldn't go over a batch size of 5.


If you get out-of-VRAM errors in your console, you will have to lower it. With my 4070 and its 12 GB of VRAM I was able to max it out at 40, but that wasn't the case last month.
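If you want a starting point to tweak from, here's the rule of thumb above as a tiny helper. These numbers are just my suggestions, not hard limits; drop the value if you see out-of-memory errors in the console.

```python
# Rough starting batch size based on the VRAM recommendations above.
def suggested_batch_size(vram_gb: float) -> int:
    if vram_gb >= 24:
        return 40
    if vram_gb >= 12:
        return 30
    if vram_gb >= 8:
        return 20
    if vram_gb >= 6:
        return 10
    return 5

print(suggested_batch_size(12))   # e.g. a 12 GB card: start around 30
```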



Once you've successfully trained your model, you can access it in the first tab, "Model Inference". If you don't see the model you trained, click "Refresh voice list and index path" and it should show up.


That's it!!! You should now know how to install this tool, voice clone and even make your clone sing, split a song into vocals and instrumentals, train your own custom model, and install custom models that other people have already trained.


Don't forget! Once you have installed the tool, you will click the go-web file to start it, and if it doesn't automatically pop up in your browser, you will have to type localhost:7897 into the address bar.



Someone recently brought it to my attention that they didn't know how to combine the instrumentals and vocals back into one clip.


Here are some links to some free editing software -

DaVinci Resolve or Clipchamp - free editing software to combine vocal and instrumental files.


Clipchamp now has a free Windows version and is the easiest way to edit. It also has some top-notch text-to-speech options, all for free.

DaVinci Resolve is a powerful video editor with some pretty advanced features like keyframing and green screening, and it is widely used by the filmmaking community. However, the tool is not very intuitive and seems a lot harder to use than it needs to be. Expect a learning curve.


The way to combine the files is simple and pretty universal regardless of which video editing tool you're using.

Upload the files you want to use (the video, vocals, and instrumentals), then drag them into the timeline separately on different layers. Here is an example of what that looks like:
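And if you just want the audio mixed back together without opening a video editor at all, a minimal pydub sketch does the job (pip install pydub; it needs ffmpeg available on your system). The filenames below are placeholders for whatever the vocal-separation and inference steps produced.

```python
# Layer the converted vocal over the instrumental and export one combined file.
from pydub import AudioSegment

vocals = AudioSegment.from_file("converted_vocals.wav")      # placeholder
instrumental = AudioSegment.from_file("instrumental.wav")    # placeholder

mix = instrumental.overlay(vocals)        # play both tracks at the same time
mix.export("combined_song.wav", format="wav")
```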


Have fun with it!

