mirror of https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI.git

Place docs by language

docs/en/Changelog_EN.md (new file, 100 lines)
@@ -0,0 +1,100 @@

### 2023-08-13
1-Regular bug fixes
- Change the minimum save frequency to 1 and the minimum total epoch number to 2
- Fix training errors when pretrained models are not used
- Clear GPU memory after vocal/accompaniment separation
- Change the faiss index save path from an absolute path to a relative path
- Support paths containing spaces (both the training-set path and the experiment name are supported, and errors will no longer be reported)
- The filelist no longer enforces UTF-8 encoding
- Fix the high CPU usage caused by faiss searching during real-time voice changing

2-Key updates
- Trained RMVPE, currently the strongest open-source vocal pitch extraction model, and use it for RVC training and offline/real-time inference, supporting PyTorch/ONNX/DirectML
- Support AMD and Intel graphics cards through PyTorch-DML

(1) Real-time voice changing, (2) inference, and (3) vocal/accompaniment separation are supported; (4) training is not currently supported and will fall back to CPU training; RMVPE GPU inference is supported via ONNX-DML


### 2023-06-18
- New pretrained v2 models: 32k and 48k
- Fix non-f0 model inference errors
- For training sets exceeding 1 hour, automatically run minibatch k-means to reduce the feature shape, so that index training, adding, and searching are much faster
- Provide a toy vocal2guitar HuggingFace space
- Automatically delete outlier short clips from the training set
- Add an ONNX export tab

Failed experiments:
- ~~Feature retrieval: add temporal feature retrieval: not effective~~
- ~~Feature retrieval: add PCAR dimensionality reduction: searching is even slower~~
- ~~Random data augmentation during training: not effective~~

Todo list:
- ~~Vocos-RVC (tiny vocoder): not effective~~
- ~~Crepe support for training: replaced by RMVPE~~
- ~~Half-precision crepe inference: replaced by RMVPE, and hard to achieve~~
- F0 editor support

### 2023-05-28
- Add a v2 Jupyter notebook and a Korean changelog; fix some environment requirements
- Add voiceless consonant and breath protection mode
- Support crepe-full pitch detection
- UVR5 vocal separation: support dereverberation and de-echo models
- Include the experiment name and version in the index filename
- Allow users to manually select the export format of output audio during batch voice conversion and UVR5 vocal separation
- v1 32k model training is no longer supported

### 2023-05-13
- Remove redundant code from the old runtime in the one-click package: lib.infer_pack and uvr5_pack
- Fix a pseudo-multiprocessing bug in training-set preprocessing
- Add median-filter radius adjustment for the harvest pitch recognition algorithm
- Support post-processing resampling when exporting audio
- The multiprocessing "n_cpu" setting for training now covers "data preprocessing and f0 extraction" instead of only "f0 extraction"
- Automatically detect index paths under the logs folder and provide them in a drop-down list
- Add a "Frequently Asked Questions and Answers" tab (you can also refer to the GitHub RVC wiki)
- During inference, the harvest pitch is cached when the same input audio path is used (purpose: with harvest pitch extraction, the entire pipeline goes through a long and repetitive pitch extraction process; without caching, users experimenting with different timbre, index, and pitch median-filter radius settings would face a very painful wait after the first inference)

### 2023-05-14
- Use the volume envelope of the input to mix with or replace the volume envelope of the output (this can alleviate the problem of "silent input producing low-amplitude output noise"). If the input audio has a high background-noise level, turning it on is not recommended; it is off by default (a value of 1 is treated as off)
- Support saving extracted small models at a specified frequency (useful if you want to compare performance at different epochs without saving all the large checkpoints and manually extracting small models via ckpt processing each time)
- Resolve the "connection error" issue caused by the server's global proxy by setting environment variables
- Support pretrained v2 models (currently only the 40k version is publicly available for testing, and the other two sampling rates have not been fully trained yet)
- Clip excessive volume exceeding 1 before inference
- Slightly adjusted the training-set preprocessing settings


#######################

History changelogs:

### 2023-04-09
- Fixed training parameters to improve GPU utilization: A100 increased from 25% to around 90%, V100 from 50% to around 90%, 2060S from 60% to around 85%, P40 from 25% to around 95%; training speed is significantly improved
- Changed parameter: total batch_size is now per-GPU batch_size
- Changed total_epoch: maximum limit increased from 100 to 1000; default increased from 10 to 20
- Fixed the issue of ckpt extraction recognizing pitch incorrectly, causing abnormal inference
- Fixed the issue of distributed training saving a ckpt for each rank
- Applied NaN feature filtering during feature extraction
- Fixed the issue of silent input/output producing random consonants or noise (old models need to be retrained with a new dataset)

### 2023-04-16 Update
- Added a local real-time voice-changing mini-GUI; start it by double-clicking go-realtime-gui.bat
- Applied filtering for frequency bands below 50 Hz during training and inference
- Lowered the minimum pitch-extraction bound of pyworld from the default 80 Hz to 50 Hz for training and inference, so that male low-pitched voices between 50 and 80 Hz are no longer muted
- WebUI supports changing languages according to the system locale (currently supporting en_US, ja_JP, zh_CN, zh_HK, zh_SG, zh_TW; defaults to en_US if the locale is not supported)
- Fixed recognition of some GPUs (e.g., V100-16G recognition failure, P4 recognition failure)

### 2023-04-28 Update
- Upgraded faiss index settings for faster speed and higher quality
- Removed the dependency on total_fea.npy; future model sharing will not require it as an input
- Unlocked restrictions for 16-series GPUs, providing 4 GB inference settings for GPUs with 4 GB of VRAM
- Fixed a bug in UVR5 vocal/accompaniment separation for certain audio formats
- The real-time voice-changing mini-GUI now supports non-40k and non-lazy pitch models

### Future Plans:
Features:
- Add an option to extract small models at every epoch save
- Add an option to export an additional mp3 to a specified path during inference
- Support a multi-speaker training tab (up to 4 speakers)

Base model:
- Collect breathing wav files to add to the training dataset, to fix the issue of distorted breath sounds
- We are currently training a base model with an extended singing dataset, which will be released in the future

docs/en/README.en.md (new file, 143 lines)
@@ -0,0 +1,143 @@

<div align="center">

<h1>Retrieval-based-Voice-Conversion-WebUI</h1>
An easy-to-use Voice Conversion framework based on VITS.<br><br>

[](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI)

<img src="https://counter.seku.su/cmoe?name=rvc&theme=r34" /><br>

[](https://colab.research.google.com/github/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/Retrieval_based_Voice_Conversion_WebUI.ipynb)
[](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/LICENSE)
[](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/)

[](https://discord.gg/HcsmBBGyVk)

</div>

------
[**Changelog**](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/docs/Changelog_EN.md) | [**FAQ (Frequently Asked Questions)**](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/wiki/FAQ-(Frequently-Asked-Questions))

[**English**](./README.en.md) | [**中文简体**](../README.md) | [**日本語**](./README.ja.md) | [**한국어**](./README.ko.md) ([**韓國語**](./README.ko.han.md)) | [**Türkçe**](./README.tr.md)

Check our [Demo Video](https://www.bilibili.com/video/BV1pm4y1z7Gm/) here!

Realtime Voice Conversion Software using RVC: [w-okada/voice-changer](https://github.com/w-okada/voice-changer)

> The dataset for the pretrained model uses nearly 50 hours of the high-quality open-source VCTK dataset.

> High-quality licensed song datasets will be added to the training set one after another for your use, without worrying about copyright infringement.

> Please look forward to the RVCv3 pretrained base model, which has more parameters, more training data, better results, unchanged inference speed, and requires less training data for fine-tuning.

## Summary
This repository has the following features:
+ Reduce tone leakage by replacing the source features with training-set features using top-1 retrieval;
+ Easy and fast training, even on relatively poor graphics cards;
+ Training with a small amount of data also yields relatively good results (>=10 min of low-noise speech is recommended);
+ Support model fusion to change timbres (using the ckpt processing tab -> ckpt merge);
+ Easy-to-use WebUI;
+ Use the UVR5 model to quickly separate vocals and instruments;
+ Use the most powerful high-pitch voice extraction algorithm [InterSpeech2023-RMVPE](#Credits) to prevent the muted-sound problem; it provides the best results (significantly) and is faster, with even lower resource consumption than Crepe_full;
+ AMD/Intel graphics card acceleration supported.

## Preparing the environment
The following commands need to be executed in an environment with Python 3.8 or higher.

(Windows/Linux)
First install the main dependencies through pip:
```bash
# Install PyTorch-related core dependencies, skip if already installed
# Reference: https://pytorch.org/get-started/locally/
pip install torch torchvision torchaudio

# For Windows + Nvidia Ampere architecture (RTX 30xx), you need to specify the CUDA version
# corresponding to PyTorch according to https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/21
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
```

Then you can use Poetry to install the other dependencies:
```bash
# Install the Poetry dependency-management tool, skip if already installed
# Reference: https://python-poetry.org/docs/#installation
curl -sSL https://install.python-poetry.org | python3 -

# Install the project dependencies
poetry install
```

You can also use pip to install them:
```bash
# for Nvidia graphics cards
pip install -r requirements.txt

# for AMD/Intel graphics cards
pip install -r requirements-dml.txt
```

------
Mac users can install dependencies via `run.sh`:
```bash
sh ./run.sh
```

## Preparation of other Pre-models
RVC requires other pre-models for inference and training.

You need to download them from our [Huggingface space](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/).

Here's a list of pre-models and other files that RVC needs:
```bash
hubert_base.pt

./pretrained

./uvr5_weights

# If you want to test the v2 version model (the v2 version model changes the input from the
# 256-dimensional features of 9-layer HuBERT + final_proj to the 768-dimensional features of
# 12-layer HuBERT, and adds 3 period discriminators), you will need to download these additional files:

./pretrained_v2

# If you are using Windows, you may also need these two files; skip if FFmpeg and FFprobe are installed
ffmpeg.exe

https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe

ffprobe.exe

https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe

# If you want to use the latest SOTA RMVPE vocal pitch extraction algorithm, download the RMVPE weights
# and place them in the RVC root directory:

https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.pt

# For AMD/Intel graphics card users, download:

https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/rmvpe.onnx

```
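If you prefer to script these downloads, the following is a minimal sketch using the `huggingface_hub` package (an optional extra, not part of RVC's listed requirements; the remote file names are taken from the list above and from the pretrained paths mentioned in the training tips):

```python
import os
import shutil

from huggingface_hub import hf_hub_download

REPO = "lj1995/VoiceConversionWebUI"

# (file in the HF repo, local destination); only a few entries are shown
FILES = [
    ("hubert_base.pt", "hubert_base.pt"),
    ("pretrained/f0G40k.pth", "pretrained/f0G40k.pth"),
    ("pretrained/f0D40k.pth", "pretrained/f0D40k.pth"),
    ("rmvpe.pt", "rmvpe.pt"),
]

for remote_name, local_path in FILES:
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    cached = hf_hub_download(repo_id=REPO, filename=remote_name)  # downloads into the HF cache
    shutil.copy(cached, local_path)                               # copy into the RVC directory layout
```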
Then use this command to start the WebUI:
```bash
python infer-web.py
```
If you are using Windows or macOS, you can download and extract `RVC-beta.7z` to use RVC directly, using `go-web.bat` on Windows or `sh ./run.sh` on macOS to start the WebUI.

## Credits
+ [ContentVec](https://github.com/auspicious3000/contentvec/)
+ [VITS](https://github.com/jaywalnut310/vits)
+ [HIFIGAN](https://github.com/jik876/hifi-gan)
+ [Gradio](https://github.com/gradio-app/gradio)
+ [FFmpeg](https://github.com/FFmpeg/FFmpeg)
+ [Ultimate Vocal Remover](https://github.com/Anjok07/ultimatevocalremovergui)
+ [audio-slicer](https://github.com/openvpi/audio-slicer)
+ [Vocal pitch extraction: RMVPE](https://github.com/Dream-High/RMVPE)
+ The pretrained model is trained and tested by [yxlllc](https://github.com/yxlllc/RMVPE) and [RVC-Boss](https://github.com/RVC-Boss).

## Thanks to all contributors for their efforts
<a href="https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/graphs/contributors" target="_blank">
  <img src="https://contrib.rocks/image?repo=RVC-Project/Retrieval-based-Voice-Conversion-WebUI" />
</a>

docs/en/faiss_tips_en.md (new file, 102 lines)
@@ -0,0 +1,102 @@

faiss tuning TIPS
==================

# about faiss
faiss is a library of neighborhood search for dense vectors, developed by Facebook Research, which efficiently implements many approximate neighborhood search methods.
Approximate neighborhood search finds similar vectors quickly while sacrificing some accuracy.

## faiss in RVC
In RVC, for the embedding of features converted by HuBERT, we search for embeddings similar to the embeddings generated from the training data and mix them, to achieve a conversion that is closer to the original speech. However, since this search takes time if performed naively, high-speed conversion is achieved by using approximate neighborhood search.

# implementation overview
In '/logs/your-experiment/3_feature256', where the model is located, the features extracted by HuBERT from each voice file are stored.
From here we read the npy files in order sorted by filename and concatenate the vectors to create big_npy. (This vector has shape [N, 256].)
After saving big_npy as /logs/your-experiment/total_fea.npy, we train the faiss index on it.

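A minimal sketch of this step is shown below; the paths follow the layout described above, while the index-factory string and the output filename are only illustrative, not the exact values RVC uses:

```python
import os

import numpy as np
import faiss

feature_dir = "logs/your-experiment/3_feature256"
npys = [
    np.load(os.path.join(feature_dir, name))
    for name in sorted(os.listdir(feature_dir))
    if name.endswith(".npy")
]
big_npy = np.concatenate(npys, axis=0).astype(np.float32)   # shape [N, 256]
np.save("logs/your-experiment/total_fea.npy", big_npy)

n_ivf = int(np.sqrt(big_npy.shape[0]))                       # illustrative; see the sizing rule below
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
index.train(big_npy)                                         # learn the IVF clustering
index.add(big_npy)                                           # add the vectors to the index
faiss.write_index(index, "logs/your-experiment/example.index")
```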
In this article, I will explain the meaning of these parameters.

# Explanation of the method
## index factory
An index factory is a unique faiss notation that expresses, as a string, a pipeline connecting multiple approximate neighborhood search methods.
This allows you to try various approximate neighborhood search methods simply by changing the index factory string.
In RVC it is used like this:

```python
index = faiss.index_factory(256, "IVF%s,Flat" % n_ivf)
```
Among the arguments of index_factory, the first is the number of dimensions of the vector, the second is the index factory string, and the third is the distance metric to use.

For more detailed notation, see
https://github.com/facebookresearch/faiss/wiki/The-index-factory

## index for distance
There are two typical metrics used for the similarity of embeddings:

- Euclidean distance (METRIC_L2)
- inner product (METRIC_INNER_PRODUCT)

Euclidean distance takes the squared difference in each dimension, sums the differences over all dimensions, and then takes the square root. This is the same as the distance in 2D and 3D that we use every day.
The inner product is not used as a similarity metric as-is; cosine similarity, which takes the inner product after normalizing by the L2 norm, is generally used instead.

Which is better depends on the case, but cosine similarity is often used for embeddings obtained by word2vec and for similar-image retrieval models trained with ArcFace. If you want to do L2 normalization on a vector X with numpy, you can do it with the following code, with eps small enough to avoid division by zero.

```python
X_normed = X / np.maximum(eps, np.linalg.norm(X, ord=2, axis=-1, keepdims=True))
```

Also, for the index factory, you can change the distance metric used for calculation by choosing the value passed as the third argument.

```python
index = faiss.index_factory(dimension, text, faiss.METRIC_INNER_PRODUCT)
```

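For example, a cosine-similarity index over the features could be built as follows (a sketch only; the index-factory call shown earlier relies on the default METRIC_L2, and `big_npy` refers to the feature matrix built in the implementation overview):

```python
import numpy as np
import faiss

eps = 1e-8
big_normed = big_npy / np.maximum(eps, np.linalg.norm(big_npy, ord=2, axis=-1, keepdims=True))

# With L2-normalized vectors, the inner product equals cosine similarity
index = faiss.index_factory(256, "IVF512,Flat", faiss.METRIC_INNER_PRODUCT)
index.train(big_normed.astype(np.float32))
index.add(big_normed.astype(np.float32))
```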
## IVF
IVF (Inverted File index) is an algorithm similar to the inverted index used in full-text search.
During training, the search targets are clustered with k-means, and Voronoi partitioning is performed using the cluster centers. Each data point is assigned to a cluster, so we create a dictionary that looks up data points from clusters.

For example, if clusters are assigned as follows

|index|Cluster|
|-----|-------|
|1|A|
|2|B|
|3|A|
|4|C|
|5|B|

The resulting inverted index looks like this:

|cluster|index|
|-------|-----|
|A|1, 3|
|B|2, 5|
|C|4|

When searching, we first search n_probe clusters among the clusters, and then calculate the distances to the data points belonging to each of those clusters.

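In faiss this corresponds to setting `nprobe` on the trained index before searching; in the sketch below, `query` is a placeholder for one or more 256-dimensional HuBERT frames, and `index`/`big_npy` come from the earlier sketch:

```python
import numpy as np

index.nprobe = 1                   # number of clusters to visit per query
k = 8                              # number of neighbors to return
query = big_npy[:1]                # placeholder query vector(s), shape [n, 256]
distances, ids = index.search(np.ascontiguousarray(query, dtype=np.float32), k)
```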
# recommended parameters
There are official guidelines on how to choose an index, so I will explain accordingly.
https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

For datasets below 1M vectors, 4bit-PQ is the most efficient method available in faiss as of April 2023.
Combining this with IVF, narrowing down the candidates with 4bit-PQ, and finally recalculating the distances with an accurate index can be described with the following index factory string.

```python
index = faiss.index_factory(256, "IVF1024,PQ128x4fs,RFlat")
```

## Recommended parameters for IVF
Consider the case of too many IVF clusters. For example, if coarse quantization by IVF uses as many clusters as there are data points, this is the same as a naive exhaustive search and is inefficient.
For 1M vectors or fewer, IVF values between 4*sqrt(N) and 16*sqrt(N) are recommended, where N is the number of data points.

Since the calculation time increases in proportion to n_probe, please balance it against accuracy and choose appropriately. Personally, I don't think RVC needs that much accuracy, so n_probe = 1 is fine.

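A sketch of applying this rule of thumb; the lower cap of 39 training points per cluster is an assumption added to avoid faiss warnings about too few training points per centroid:

```python
import numpy as np

N = big_npy.shape[0]
n_ivf = min(int(16 * np.sqrt(N)), N // 39)   # upper end of the 4*sqrt(N) ~ 16*sqrt(N) guideline, capped
index = faiss.index_factory(256, "IVF%s,PQ128x4fs,RFlat" % n_ivf)
```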
## FastScan
FastScan is a method that enables high-speed approximation of distances with product quantization by performing the computation in registers.
Product quantization clusters each group of d dimensions independently (usually d = 2) during training, calculates the distances between clusters in advance, and creates a lookup table. At prediction time, the distance of each dimension group can be calculated in O(1) by consulting the lookup table.
So the number you specify after PQ usually specifies half the dimension of the vector.

For a more detailed description of FastScan, please refer to the official documentation.
https://github.com/facebookresearch/faiss/wiki/Fast-accumulation-of-PQ-and-AQ-codes-(FastScan)

## RFlat
RFlat is an instruction to recalculate the rough distances computed by FastScan with the exact distance specified by the third argument of the index factory.
When retrieving k neighbors, k*k_factor points are recalculated.

docs/en/faq_en.md (new file, 104 lines)
@@ -0,0 +1,104 @@

## Q1: FFmpeg error / UTF-8 error
It is most likely not an FFmpeg issue, but rather an audio path issue;

FFmpeg may encounter an error when reading paths containing special characters like spaces and (), which may cause an FFmpeg error; and when the training set's audio contains Chinese paths, writing them into filelist.txt may cause a UTF-8 error.<br>

## Q2: Cannot find the index file after "One-click Training"
If it displays "Training is done. The program is closed," then the model has been trained successfully, and the subsequent errors are spurious;

The lack of an 'added' index file after one-click training may be because the training set is too large, which makes adding the index get stuck; this has been resolved by adding the index in batches, which solves the memory-overload problem. As a temporary workaround, try clicking the "Train Index" button again.<br>

## Q3: Cannot find the model in "Inferencing timbre" after training
Click "Refresh timbre list" and check again; if it is still not visible, check whether there were any errors during training, and send screenshots of the console, the web UI, and logs/experiment_name/*.log to the developers for further analysis.<br>

## Q4: How to share a model / How to use others' models?
The pth files stored in rvc_root/logs/experiment_name are not meant for sharing or inference; they store the experiment checkpoints for reproducibility and further training. The model to be shared is the 60+ MB pth file in the weights folder;

In the future, weights/exp_name.pth and logs/exp_name/added_xxx.index will be merged into a single weights/exp_name.zip file to eliminate the need for manual index input; so share the zip file, not the pth file, unless you want to continue training on a different machine;

Copying/sharing the several-hundred-MB pth files from the logs folder to the weights folder for forced inference may result in errors such as missing f0, tgt_sr, or other keys. You need to use the ckpt tab at the bottom to manually or automatically (if the information is found in logs/exp_name) select whether to include pitch information and the target audio sampling rate, and then extract the smaller model. After extraction, there will be a 60+ MB pth file in the weights folder, and you can refresh the voices to use it.<br>

## Q5: Connection Error
You may have closed the console (the black command-line window).<br>

## Q6: WebUI popup 'Expecting value: line 1 column 1 (char 0)'
Please disable the system LAN proxy/global proxy and then refresh.<br>

## Q7: How to train and infer without the WebUI?
Training script:<br>
You can run training in the WebUI first, and the command-line versions of dataset preprocessing and training will be displayed in the message window.<br>

Inference script:<br>
https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/myinfer.py<br>

e.g.<br>

runtime\python.exe myinfer.py 0 "E:\codes\py39\RVC-beta\todo-songs\1111.wav" "E:\codes\py39\logs\mi-test\added_IVF677_Flat_nprobe_7.index" harvest "test.wav" "weights/mi-test.pth" 0.6 cuda:0 True<br>

f0up_key=sys.argv[1]<br>
input_path=sys.argv[2]<br>
index_path=sys.argv[3]<br>
f0method=sys.argv[4]  # harvest or pm<br>
opt_path=sys.argv[5]<br>
model_path=sys.argv[6]<br>
index_rate=float(sys.argv[7])<br>
device=sys.argv[8]<br>
is_half=bool(sys.argv[9])<br>

## Q8: CUDA error / CUDA out of memory
There is a small chance that there is a problem with the CUDA configuration or that the device is not supported; more likely, there is not enough VRAM (out of memory).<br>

For training, reduce the batch size (if reducing it to 1 is still not enough, you may need to change the graphics card); for inference, adjust the x_pad, x_query, x_center, and x_max settings in the config.py file as needed. Cards with less than 4 GB of VRAM (e.g. 1060 (3G) and various 2 GB cards) can be given up on, while 4 GB VRAM cards still have a chance.<br>

## Q9: How many total_epoch are optimal?
If the training dataset's audio quality is poor and the noise floor is high, 20-30 epochs are sufficient. Setting it too high won't improve the audio quality of a low-quality training set.<br>

If the training-set audio quality is high, the noise floor is low, and there is sufficient duration, you can increase it. 200 is acceptable (since training is fast, and if you're able to prepare a high-quality training set, your GPU can likely handle a longer training run without issue).<br>

## Q10: How much training-set duration is needed?
A dataset of around 10 min to 50 min is recommended.<br>

With guaranteed high sound quality and a low noise floor, more can be added if the dataset's timbre is uniform.<br>

For a high-quality training set (clean + distinctive timbre), 5 min to 10 min is fine.<br>

There are some people who have trained successfully with 1 min to 2 min of data, but their success is not reproducible by others and is not very informative. <br>This requires that the training set has a very distinctive timbre (e.g. a high-frequency airy anime-girl voice) and that the audio quality is high;
Data of less than 1 min duration has not been used successfully so far, so it is not recommended.<br>

## Q11: What is the index rate for, and how do I adjust it?
If the tone quality of the pretrained model and inference source is higher than that of the training set, they can raise the tone quality of the inference result, but at the cost of a possible bias towards the tone of the underlying model/inference source rather than the tone of the training set, which is generally referred to as "tone leakage".<br>

The index rate is used to reduce/resolve the timbre-leakage problem. If the index rate is set to 1, theoretically there is no timbre leakage from the inference source and the timbre quality is more biased towards the training set. If the training set has lower sound quality than the inference source, then a higher index rate may reduce the sound quality. Turning it down to 0 removes the effect of using retrieval blending to protect the training-set timbre.<br>

If the training set has good audio quality and long duration, turn up total_epoch. When the resulting model is less likely to refer to the inference source and the pretrained underlying model, and there is little "tone leakage", the index_rate is not important and you can even skip creating/sharing the index file.<br>

## Q12: How to choose the GPU when inferring?
In the config.py file, select the card number after "device cuda:".<br>

The mapping between card numbers and graphics cards can be seen in the graphics-card information section of the training tab.<br>

## Q13: How to use a model saved in the middle of training?
Save it via model extraction at the bottom of the ckpt processing tab.

## Q14: File/memory error (when training)?
Too many processes and not enough memory. You can fix it by:

1. decreasing the input in the "Threads of CPU" field;

2. pre-cutting the training set into shorter audio files.

## Q15: How to continue training using more data

step1: put all the wav data into path2.

step2: exp_name2 + path2 -> process the dataset and extract features.

step3: copy the latest G and D files of exp_name1 (your previous experiment) into the exp_name2 folder.

step4: click "train the model", and it will continue training from the epoch where your previous experiment's model left off.

docs/en/training_tips_en.md (new file, 65 lines)
@@ -0,0 +1,65 @@

Instructions and tips for RVC training
======================================
This TIPS explains how model training is done.

# Training flow
I will explain along the steps in the training tab of the GUI.

## step1
Set the experiment name here.

You can also set here whether the model should take pitch into account.
If the model doesn't consider pitch, the model will be lighter, but it will not be suitable for singing.

Data for each experiment is placed in `/logs/your-experiment-name/`.

## step2a
Loads and preprocesses audio.

### load audio
If you specify a folder with audio, the audio files in that folder will be read automatically.
For example, if you specify `C:\Users\hoge\voices`, `C:\Users\hoge\voices\voice.mp3` will be loaded, but `C:\Users\hoge\voices\dir\voice.mp3` will not be loaded.

Since ffmpeg is used internally for reading audio, any extension supported by ffmpeg will be read automatically.
After converting to int16 with ffmpeg, the audio is converted to float32 and normalized between -1 and 1.

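A minimal sketch of this decode-and-normalize step; the 40 kHz target rate is only an example, the experiment's actual sampling rate is used in practice:

```python
import subprocess

import numpy as np

def load_audio(path: str, sr: int = 40000) -> np.ndarray:
    """Decode any ffmpeg-readable file to mono float32 samples in [-1, 1]."""
    cmd = [
        "ffmpeg", "-nostdin", "-i", path,
        "-f", "s16le", "-ac", "1", "-ar", str(sr), "-",
    ]
    raw = subprocess.run(cmd, capture_output=True, check=True).stdout
    pcm = np.frombuffer(raw, dtype=np.int16)          # int16 PCM from ffmpeg
    return pcm.astype(np.float32) / 32768.0           # float32 in [-1, 1]

sr = 40000
audio = load_audio("C:/Users/hoge/voices/voice.mp3", sr)
```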
### denoising
The audio is smoothed by scipy's filtfilt.

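For illustration, a zero-phase high-pass filter in that spirit might look like the following; the cutoff frequency and filter order are assumptions, not the project's exact settings:

```python
from scipy import signal

# 5th-order Butterworth high-pass at 48 Hz, applied forward and backward (zero phase)
b, a = signal.butter(N=5, Wn=48, btype="high", fs=sr)
audio = signal.filtfilt(b, a, audio)
```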
### Audio Split
First, the input audio is split by detecting silent parts that last longer than a certain period (max_sil_kept = 5 seconds?). After splitting the audio on silence, the audio is split every 4 seconds with an overlap of 0.3 seconds. For audio separated within 4 seconds, after normalizing the volume, the wav file is written to `/logs/your-experiment-name/0_gt_wavs` and then resampled to 16 kHz and written to `/logs/your-experiment-name/1_16k_wavs` as a wav file.

## step2b
### Extract pitch
Extract pitch information from the wav files. Extract the pitch information (= f0) using the method built into parselmouth or pyworld and save it in `/logs/your-experiment-name/2a_f0`. Then logarithmically convert the pitch information to an integer between 1 and 255 and save it in `/logs/your-experiment-name/2b-f0nsf`.

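A rough sketch of this step with pyworld follows; the f0 range and the mel-style constants of the log mapping are assumptions for illustration:

```python
import numpy as np
import pyworld

x = audio.astype(np.float64)
f0, t = pyworld.harvest(x, fs=sr, f0_floor=50.0, f0_ceil=1100.0)
f0 = pyworld.stonemask(x, f0, t, sr)                 # refine the raw f0 track

# Logarithmic (mel-style) mapping of f0 to integer bins in [1, 255]
f0_min, f0_max = 50.0, 1100.0
mel_min = 1127.0 * np.log(1.0 + f0_min / 700.0)
mel_max = 1127.0 * np.log(1.0 + f0_max / 700.0)
f0_mel = 1127.0 * np.log(1.0 + f0 / 700.0)
f0_coarse = np.rint(
    np.clip((f0_mel - mel_min) * 254.0 / (mel_max - mel_min) + 1.0, 1.0, 255.0)
).astype(np.int32)
```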
### Extract feature_print
Convert the wav files to embeddings in advance using HuBERT. Read the wav files saved in `/logs/your-experiment-name/1_16k_wavs`, convert each wav file to 256-dimensional features with HuBERT, and save them in npy format in `/logs/your-experiment-name/3_feature256`.

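Very roughly, this step looks like the sketch below, assuming fairseq and the `hubert_base.pt` checkpoint from the README; the exact calls used by the project may differ:

```python
import numpy as np
import torch
from fairseq import checkpoint_utils

models, _, _ = checkpoint_utils.load_model_ensemble_and_task(["hubert_base.pt"])
hubert = models[0].eval()

# 16 kHz mono audio, e.g. loaded with the load_audio() sketch above
audio_16k = load_audio("logs/your-experiment-name/1_16k_wavs/voice.wav", sr=16000)
wav16k = torch.from_numpy(audio_16k).float().unsqueeze(0)        # [1, T]
padding_mask = torch.zeros_like(wav16k, dtype=torch.bool)

with torch.no_grad():
    feats, _ = hubert.extract_features(
        source=wav16k, padding_mask=padding_mask, output_layer=9  # 9-layer features for v1 models
    )
    feats = hubert.final_proj(feats)                              # project to 256 dimensions
np.save("logs/your-experiment-name/3_feature256/voice.npy", feats.squeeze(0).numpy())
```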
## step3
Train the model.
### Glossary for Beginners
In deep learning, the dataset is divided and learning proceeds little by little. In one model update (step), batch_size data samples are retrieved, and prediction and error correction are performed. Doing this once over the whole dataset counts as one epoch.

Therefore, the learning time is the learning time per step x (the number of samples in the dataset / batch size) x the number of epochs. In general, a larger batch size makes learning more stable (learning time per step ÷ batch size becomes smaller) but uses more GPU memory. GPU RAM can be checked with the nvidia-smi command. Learning can be done in a shorter time by increasing the batch size as much as the machine in the execution environment allows.

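As a worked example of that relationship (all numbers are made up):

```python
dataset_size = 2000         # audio segments in the training set
batch_size = 8
epochs = 200

steps_per_epoch = dataset_size // batch_size   # 250 model updates per epoch
total_steps = steps_per_epoch * epochs         # 50,000 updates overall
```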
### Specify pretrained model
RVC starts training the model from pretrained weights instead of from scratch, so it can be trained with a small dataset.

By default

- If you consider pitch, it loads `rvc-location/pretrained/f0G40k.pth` and `rvc-location/pretrained/f0D40k.pth`.
- If you don't consider pitch, it loads `rvc-location/pretrained/G40k.pth` and `rvc-location/pretrained/D40k.pth`.

During training, model parameters are saved in `logs/your-experiment-name/G_{}.pth` and `logs/your-experiment-name/D_{}.pth` every save_every_epoch. By specifying these paths instead, you can restart training, or start training from model weights learned in a different experiment.

### learning index
RVC saves the HuBERT feature values used during training, and during inference it searches for feature values that are similar to those training features. To perform this search at high speed, the index is trained in advance.
For index training, we use the approximate neighborhood search library faiss. Read the feature values from `logs/your-experiment-name/3_feature256`, use them to train the index, and save it as `logs/your-experiment-name/added_XXX.index`.

(Since the 20230428 update, total_fea.npy is read from the index, and saving/specifying it separately is no longer necessary.)

### Button description
- Train model: after executing step2b, press this button to train the model.
- Train feature index: after training the model, perform index training.
- One-click training: step2b, model training, and feature-index training all at once.