This project aims to provide real-time deepfake capabilities for both voice and video in applications such as Skype, Zoom, and other video calling platforms. Using state-of-the-art audio and video deepfake models, it builds a complete real-time pipeline that converts your voice and face on the fly, designed for high-quality, low-latency performance and a seamless live interaction experience.
The system uses a client-server architecture:
- Audio Server (Diff-HierVC): Runs a Flask-based service that performs real-time voice conversion.
- Video Server (insightface + GFPGAN): Processes webcam frames to perform face detection, face swapping, and optional face enhancement.
A single GUI client manages both audio and video connections. From this GUI, you can:
- Connect/Disconnect to the audio server.
- Connect/Disconnect to the video server.
- Push-to-Talk or continuous audio streaming for real-time voice conversion.
- Configure audio chunk size, select vocoder (BigVGAN or HiFiGAN), and upload/update target voice references.
- Upload and update the source image for face swapping on the video server.
- Start/Stop the virtual camera stream for real-time face swapping.
- Adjust upscale factor for face enhancement.
- Disable/enable face enhancement.
Below is a screenshot of the single GUI client:
Note: The pre-trained models are too large for GitHub. Download them manually using the links below and place them into the specified directories.
- Python Version: Developed and tested on Python 3.10.16.
- Platform Compatibility:
  - Fully functional on Windows.
  - macOS is supported to some extent; however, the GUI client (which relies on `pyvirtualcam`) does not automatically create a virtual camera on newer macOS devices (M1/M2). See the Future Improvements section for more details.
- Dependencies are listed in the respective `requirements.txt` files for the audio and video components.
1. Clone the Repository & Install Dependencies:

   ```bash
   git clone https://github.com/ali-shariaty/Real-Time-Deepfake-Pipeline.git
   cd Real-Time-Deepfake-Pipeline/audio
   pip install -r requirements.txt
   ```

2. Download Pre-trained Models and place them according to the following directory layout:

   ```
   .
   ├── ckpt
   │   ├── config.json
   │   └── model_diffhier.pth
   ├── server.py
   ├── infer.sh
   └── vocoder
       ├── voc_hifigan.pth
       └── voc_bigvgan.pth
   ```

3. Start the Audio Server:

   ```bash
   python server.py
   ```

   By default, the server listens on port 5003.
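To quickly confirm the audio server is up before connecting the GUI client, a plain TCP reachability check is enough (a generic sketch, not part of the project; adjust host and port if you changed the defaults):

```python
import socket

# Attempt a TCP connection to the audio server's default port (5003).
# connect_ex returns 0 when something is listening on that port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2.0)
    status = s.connect_ex(("localhost", 5003))
    print("audio server reachable" if status == 0 else "audio server not reachable")
```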
4. VB-Audio Virtual Cable:
   - Install VB-CABLE.
   - In Windows Sound Settings, set `CABLE Output` as the default microphone.
   - In the calling app (Skype/Zoom), select `CABLE Output` as the microphone.
   - The GUI client will play converted audio to your default output device, which VB-CABLE routes into the calling app.
1. Install Dependencies:

   ```bash
   cd Real-Time-Deepfake-Pipeline/video
   pip install -r requirements.txt
   ```

2. Download Pre-trained Models and Place Them in the Correct Directory:
   - Face Swapping Model
   - GFPGAN Model v1.3 and/or GFPGAN Model v1.4
   - Move them into `Real-Time-Deepfake-Pipeline/video/models/`
3. Start the Video Server:

   ```bash
   python server.py
   ```

   Run with `--help` to see the available options for the video server. The source image, upscale factor, and face enhancement toggle can also be changed at runtime via the GUI client.
4. Set Up Virtual Camera in OBS:
   - Open OBS Studio.
   - Add your webcam as a source by going to `Sources -> Video Capture Device` and selecting your webcam.
   - Click Start Virtual Camera in the `Controls` panel.
   - Close OBS Studio after starting the virtual camera. This is necessary to avoid conflicts, as the webcam might otherwise be occupied by OBS while your program is running.

   This allows OBS's virtual camera to be used by the client later.
After both servers are running (locally or remotely), navigate to the root directory of the project and run:
```bash
python GUI-Client.py
```

The GUI includes:
- Server Connection section to connect to Audio/Video servers.
- Audio Controls to choose your audio device, push-to-talk, chunk size, vocoder type, etc.
- Target Audio Upload to select and upload new reference audio for voice conversion.
- Video Controls to start/stop video streaming, configure the source image for face swapping, and upscale factor for face enhancement.
- Virtual Camera Status to indicate whether the OBS virtual camera is active.
If you want to run the server and client on different devices (e.g. a remote SSH server for the server side and your local machine for the client), follow these steps:
1. Install and Configure the Server:
   Ensure that all server components (both audio and video) are installed and configured on your SSH server.

2. Set Up SSH Port Forwarding:
   Open separate terminal windows for port forwarding:
   - For Video:

     ```bash
     ssh -L 5558:localhost:5558 -L 5559:localhost:5559 -L 5560:localhost:5560 -p <SSH_PORT> <USERNAME>@<IP>
     ```

   - For Audio:

     ```bash
     ssh -L 5003:localhost:5003 -p <SSH_PORT> <USERNAME>@<IP>
     ```

   Adjust the port numbers, username, and IP address as needed.
3. Start the Servers on the SSH Server:
   - In the video terminal, navigate to the `video/` directory and run the video server:

     ```bash
     python server.py
     ```

   - In the audio terminal, navigate to the `audio/` directory and run the audio server:

     ```bash
     python server.py
     ```
4. GUI File Upload Functions:
   For functions such as `upload_target_audio` and `upload_video_source`:
   - Private Key Path (if required): If your SSH server uses key-based authentication, replace the placeholder (e.g., your private key for SSH) with the path to your actual private key.
   - SSH Port: Replace the placeholder port value with your actual SSH port.
   - Remote Destination: Update the remote path to include the proper username and target directory on your SSH server.
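For illustration, an upload along these lines could be implemented with `paramiko` (a hypothetical sketch, not the project's actual upload code; every value shown is a placeholder):

```python
import paramiko

def upload_file(local_path: str, remote_path: str) -> None:
    """Copy a local file to the SSH server over SFTP (illustrative only)."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(
        hostname="<IP>",                      # placeholder: server address
        port=22,                              # placeholder: your SSH port
        username="<USERNAME>",                # placeholder: your SSH user
        key_filename="/path/to/private_key",  # placeholder: private key path
    )
    try:
        sftp = client.open_sftp()
        sftp.put(local_path, remote_path)
        sftp.close()
    finally:
        client.close()
```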
5. Server Connection Command:
   For the function that establishes an SSH connection (e.g., `connect_ssh_for_video`):
   - SSH Credentials: Modify the SSH command to use your own private key (if required), port, username, and IP address.
   - Server Path: Ensure that the command navigates to the correct directory on your SSH server. Replace any example usernames (like "ali") with your actual SSH user ID.
6. Additional Configurations:
   If you need to adjust parameters such as source image, chunk size, or other settings in the GUI:
   - Function Parameters: Make sure the functions handling these parameters are updated with your SSH credentials (port, username, IP) and any necessary file paths or configuration values.
   - Modular Approach: It is recommended to centralize these settings so they can be easily changed across the application; one possible pattern is sketched below.
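For example, the SSH parameters could live in a single settings object that every function reads from (an illustrative sketch with placeholder values, not the project's current structure):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SSHSettings:
    """Single source of truth for SSH parameters used across the GUI."""
    host: str = "<IP>"                       # placeholder
    port: int = 22                           # placeholder: your SSH port
    username: str = "<USERNAME>"             # placeholder
    key_path: str = "/path/to/private_key"   # placeholder

SSH = SSHSettings()

# Functions such as the upload and connect helpers can then build their
# commands from one place, e.g. f"ssh -p {SSH.port} {SSH.username}@{SSH.host}",
# so changing a credential requires editing only this object.
```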
7. Run the Client Locally:
   Open a new terminal on your local device and execute:

   ```bash
   python GUI-Client.py
   ```

   The client will connect to the forwarded ports and communicate with the servers on the SSH server.
Note: Only the client (`GUI-Client.py`) needs to be executed on your local machine. All other files must be installed and run on the server.
1. Server Side
   - Launch the audio server (`server.py` in `audio/`).
   - Launch the video server (`server.py` in `video/`).
2. Client Side
   - Run `GUI-Client.py` from the project root.
   - Configure the Audio Server and Video Server addresses (local or remote via SSH forwarding).
   - In the GUI, adjust audio device, chunk size, vocoder, etc.
   - For video, upload or select a source image, set the upscale factor, and start the video stream.
3. Virtual Devices
   - VB-CABLE for audio input.
   - OBS Virtual Camera for video output.
4. Join a Call
   - In Skype/Zoom, select `CABLE Output` as your microphone.
   - In Skype/Zoom, select `OBS Virtual Camera` as your webcam.
   - Experience real-time deepfake voice and video.
- Chunk Size
  - Default: `16000` samples.
  - Decreasing the chunk size (e.g., `8000` or `4000`) reduces latency but may lower audio quality.
  - Increasing the chunk size improves audio quality but adds more delay.
  - This project also uses audio smoothing to reduce audible artifacts between chunks. See:

    ```python
    def smooth_audio_transition(prev_chunk, current_chunk, overlap_ratio=0.15): ...
    def play_audio_with_smoothing(data, buffer_size=3072, overlap_ratio=0.15): ...
    ```

  - `buffer_size` (e.g., `3072`) determines how many samples are processed at a time during playback.
  - `overlap_ratio` (e.g., `0.15`) controls how much of the chunk overlaps with the previous chunk.
  - A larger `overlap_ratio` can provide smoother transitions (fewer artifacts), but requires more compute and can introduce slight additional delay.
  - A smaller `overlap_ratio` is faster but may cause more abrupt transitions between chunks. A sketch of the crossfade idea follows this list.
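The bodies of the smoothing functions are elided above; a minimal sketch of how such an overlap crossfade could work (my own illustration under those signatures, not the project's implementation):

```python
import numpy as np

def crossfade(prev_chunk: np.ndarray, current_chunk: np.ndarray,
              overlap_ratio: float = 0.15) -> np.ndarray:
    """Blend the tail of the previous chunk into the head of the current one."""
    overlap = int(len(current_chunk) * overlap_ratio)
    if overlap == 0 or len(prev_chunk) < overlap:
        return current_chunk
    fade_in = np.linspace(0.0, 1.0, overlap)   # ramp the new chunk up
    fade_out = 1.0 - fade_in                   # ramp the old chunk down
    blended = prev_chunk[-overlap:] * fade_out + current_chunk[:overlap] * fade_in
    return np.concatenate([blended, current_chunk[overlap:]])
```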
- Vocoder
  - BigVGAN: Higher audio quality, slightly slower inference.
  - HiFiGAN: Faster inference, slightly lower fidelity.
- Diffusion Steps
  In the code, you may see:

  ```python
  def get_adaptive_diff_params(audio_length):
      return 15, 15
  ```

  which returns two parameters, `diffpitch_ts` and `diffvoice_ts`:
  - `diffpitch_ts`: the number of diffusion steps specifically for pitch. Higher values can lead to smoother pitch transformations but increase compute time.
  - `diffvoice_ts`: the number of diffusion steps for the voice timbre. Higher values can yield more accurate timbre changes, at the cost of extra latency.
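Despite its name, the function currently returns fixed step counts; one way to make it genuinely adaptive would be to scale the counts with chunk length (a hypothetical variant, assuming a 16 kHz sample rate, not the project's code):

```python
def get_adaptive_diff_params(audio_length: int, sample_rate: int = 16000):
    """Hypothetical: use fewer diffusion steps for short chunks to cut latency."""
    seconds = audio_length / sample_rate
    if seconds < 0.5:
        return 6, 6      # short chunk: prioritize latency over quality
    if seconds < 1.5:
        return 10, 10    # medium chunk: balanced setting
    return 15, 15        # long chunk: the current fixed default
```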
- Upscale Factor
  - Default: `0.4`.
  - Controls the scaling of the output frame before applying face enhancement.
  - Higher values (e.g., `1.0`) can produce sharper facial features, especially when using GFPGAN, but also increase computation time.
  - Lower values (e.g., `0.4`) reduce inference time with minimal perceived loss in quality, making them ideal for real-time applications.
  - Note: The original Deep-Live-Cam used a default of `1.0`, but we set `0.4` as a better trade-off between quality and performance.
- Resolution
  - Default: `(1280, 720)` (HD).
  - Set in `server.py` under the `process_frame` function:

    ```python
    def process_frame(frame, wrapper):
        frame_resized = cv2.resize(frame, (1280, 720), interpolation=cv2.INTER_AREA)
        processed_frame = wrapper.generate(frame_resized)
        return processed_frame
    ```

  - Our testing suggests:
    - HD (1280x720) offers the best balance of image quality and processing time (~0.40 s per frame with face enhancement).
    - Full HD (1920x1080) provides slightly better visual quality, but slows processing to ~0.70 s per frame.
    - Lower resolutions (e.g., 640x360) don't significantly reduce processing time (~0.30-0.40 s) and degrade quality due to insufficient model input size.
- Face Enhancement Models (GFPGAN)
  - GFPGANv1.3.pth (default):
    - Optimized for speed.
    - Delivers good quality for real-time applications.
  - GFPGANv1.4.pth:
    - Offers slightly better enhancement, especially around the eyes and mouth.
    - Slightly slower inference time.
  - Recommendation: Use v1.3 for faster, near real-time performance unless the highest quality is critical.
- Face Swapping Models (InsightFace inswapper)
  - inswapper_128_fp16.onnx (default):
    - Faster inference using half-precision (FP16), at slightly lower precision.
    - Less memory-intensive.
  - inswapper_128.onnx (FP32):
    - Higher precision, slightly better results.
    - Slower inference and more memory-intensive.
  - Recommendation: Stick with the FP16 model unless you're targeting maximum quality and latency isn't a concern.
- macOS Compatibility:
  - Manual Workflow:
    - Use the existing non-GUI command-line client for video (`video/client.py`). This client connects to the video server and displays processed frames in an application window.
    - Users then need to manually start OBS, add the application window as a source, and launch the virtual camera.
    - The audio part should be integrated into this workflow so that both audio and video can be handled manually via OBS on macOS.
  - Automated Workflow:
    - Alternatively, investigate and integrate an alternative to `pyvirtualcam` that works on macOS (especially on M1/M2 devices).
    - This would allow the virtual camera to be started automatically, enabling the normal GUI client (`GUI-Client.py`) to be used without any manual intervention in OBS. A possible fallback pattern is sketched after this list.
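One conceivable shape for such a fallback, trying `pyvirtualcam` first and dropping back to a plain preview window that OBS can capture (an illustrative sketch, not the project's code; it assumes frames arrive as RGB NumPy arrays):

```python
import numpy as np

def open_output(width: int = 1280, height: int = 720, fps: int = 30):
    """Return a send(frame) callable: virtual camera if available, else a window."""
    try:
        import pyvirtualcam
        cam = pyvirtualcam.Camera(width=width, height=height, fps=fps)

        def send(frame_rgb: np.ndarray) -> None:
            cam.send(frame_rgb)              # expects an RGB uint8 frame
            cam.sleep_until_next_frame()     # pace output to the target fps
    except Exception:
        import cv2

        def send(frame_rgb: np.ndarray) -> None:
            # No virtual camera backend: show a window for OBS to capture.
            cv2.imshow("Deepfake Output", cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2BGR))
            cv2.waitKey(1)
    return send
```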
- Model Enhancements:
  - Integrate and test the new inswapper-512-live model, which was announced on March 2, 2025 in the Top News section of insightface.
  - Evaluate and incorporate additional models for both face swapping and face enhancement to further improve quality or speed.
- Address Occlusion Limitations:
  - Tackle current limitations where the face swapper struggles with obstacles (e.g., objects partially occluding the face).
  - Investigate alternative models or enhancements that can better handle occlusions.
  - Benchmark overall inference time for the entire pipeline and optimize where possible.
- General Performance Improvements:
  - Optimizing Model and Vocoder Loading:
    - Use TorchScript or `torch.compile` to optimize model loading. Implement proper fallbacks if model compilation fails (see the sketch after this list).
    - Consistently apply Mixed Precision (as already implemented) to reduce memory usage and computation time.
  - Improving the Audio Pipeline:
    - Create an asynchronous endpoint (using frameworks such as Quart or FastAPI) to further reduce waiting times in the Flask server.
    - Analyze the F0 computation to determine if it can be accelerated further, potentially through vectorization or a GPU-based alternative.
  - Threading and Queue Management:
    - Consider optimizing multithreading by dynamically adjusting the number of workers based on current server load.
    - Expand caching strategies (currently using `lru_cache` and a simple dictionary) to include additional intermediate results for repeated computations.
  - Batch Processing:
    - If possible, process multiple audio chunks in batches to reduce the overhead associated with individual function calls.
  - Resource and Memory Optimization:
    - Focus on efficient memory management (e.g., explicitly freeing GPU memory after processing) and optimize padding strategies to save both compute power and memory.
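To illustrate the compile-with-fallback idea from the first sub-item (a generic sketch, not the project's code):

```python
import torch

def load_model_optimized(model: torch.nn.Module) -> torch.nn.Module:
    """Try to compile the model for faster inference; fall back gracefully."""
    model.eval()
    try:
        # torch.compile is available from PyTorch 2.0; support varies by platform.
        return torch.compile(model)
    except Exception as err:
        print(f"torch.compile failed ({err}); falling back to the eager model.")
        return model
```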
- User Interface Enhancements:
  - Improve the GUI for better usability and more configuration options.
  - Provide detailed error messages and logs for troubleshooting.
This project was created and is maintained by Ali Shariaty and Mert Arslan.
Feel free to reach out via email:
This project is licensed under the MIT License.
If you use this work in your research, please consider citing the following key papers:
```bibtex
@inproceedings{choi23d_interspeech,
  author={Ha-Yeong Choi and Sang-Hoon Lee and Seong-Whan Lee},
  title={{Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation}},
  year={2023},
  booktitle={Proc. INTERSPEECH 2023},
  pages={2283--2287},
  doi={10.21437/Interspeech.2023-817}
}

@inproceedings{wang2021gfpgan,
  author={Xintao Wang and Yu Li and Honglun Zhang and Ying Shan},
  title={Towards Real-World Blind Face Restoration with Generative Facial Prior},
  booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}

@inproceedings{ren2023pbidr,
  title={Facial Geometric Detail Recovery via Implicit Representation},
  author={Ren, Xingyu and Lattas, Alexandros and Gecer, Baris and Deng, Jiankang and Ma, Chao and Yang, Xiaokang},
  booktitle={2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)},
  year={2023}
}

@article{guo2021sample,
  title={Sample and Computation Redistribution for Efficient Face Detection},
  author={Guo, Jia and Deng, Jiankang and Lattas, Alexandros and Zafeiriou, Stefanos},
  journal={arXiv preprint arXiv:2105.04714},
  year={2021}
}
```

For a comprehensive list of InsightFace citations, please refer to the InsightFace repository's citation section.
- This project is based on Diff-HierVC, HiFiGAN, BigVGAN, Deep-Live-Cam, insightface and GFPGAN.
Enjoy real-time deepfake voice and video conversion!