Replies: 27 comments 1 reply
-
Good solution. I improved from 6 fps to 13-20 fps on my M4.
-
@VN-BugMaker Could you please share your settings and how you execute it?
-
This method causes my M1 Pro 32GB to crash and restart, and the time on the startup lock screen shows April 1st at 8:00.
-
@hdd99009, you probably need to set the execution threads lower then. I'm also working on a more stable version of this code that includes a rewrite of the models into native CoreML packages (it significantly lowers the latency too!). It would help if you could provide some debugging information. Also, for the best experience, combine this solution with the quick patch to use the GPU rather than the ANE: #1373 (comment)
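For reference, the gist of that patch (a sketch of the idea rather than the exact diff; the provider options here mirror the ones used in the `face_swapper.py` code later in this thread):

```python
# Configure onnxruntime's CoreMLExecutionProvider to run on CPU+GPU instead
# of letting CoreML schedule the model on the ANE.
import insightface

face_swapper = insightface.model_zoo.get_model(
    "models/inswapper_128.onnx",
    providers=[
        (
            "CoreMLExecutionProvider",
            {
                "ModelFormat": "MLProgram",
                "MLComputeUnits": "CPUAndGPU",
                "SpecializationStrategy": "FastPrediction",
                "AllowLowPrecisionAccumulationOnGPU": 1,
            },
        ),
        "CPUExecutionProvider",
    ],
)
```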
-
On Apple M4 Max, after adjusting … and starting the script via: … all bear the same results.
-
@visel, how long are you waiting, and what macOS version are you on? Have you tried opening FaceTime? It should take a bit longer to spin up than the unmodified app; how long does that take for you? Make sure you modified starting from the latest version of DLC in this repo. I came across a similar issue setting this up on a Windows machine; perhaps the changes I made there are also necessary for your setup. They shouldn't be on macOS, but maybe they are if you've waited more than twice what you usually wait.
-
@SanderGi I downloaded the latest version of Deep-Live-Cam and now it is no longer loading infinitely. However, my FPS didn't go up (16 cores: 12 performance and 4 efficiency, and 48 GB RAM). Furthermore, the more threads are assigned, the more "jittery" the live preview becomes, with frames overwriting each other of sorts.
-
@visel, ah, I've since fixed the jitter but forgot to update this issue. I've updated the code in the original comment. As for the FPS not going up, make sure you don't have the face enhancer enabled; the patch doesn't support that currently. It is more a proof of concept and thus hasn't been tuned to work for all devices/use cases. I'm working on a robust version using CoreML directly (and fine-tuning + quantizing a custom smaller model with a higher resolution, like inswapper-512). If you have time, you can help me test the following setup. It should improve latency and maybe slightly improve FPS, but not significantly:

Step 1) Replace your requirements.txt with this, recreate your virtual environment, and `pip install -r requirements.txt`.

Step 2) Create a file called `inswapper_coreml.py` inside the `models` folder and put the following code in it:

```python
import cv2
import numpy as np
from insightface.utils import face_align
import coremltools as ct
from coremltools.models import MLModel

# Preprocessing constants for the 128x128 inswapper model
input_mean = 0.0
input_std = 255.0
input_shape = [1, 3, 128, 128]
input_size = tuple(input_shape[2:4][::-1])  # (width, height) = (128, 128)


def coreml_load_face_swap_model(device="GPU"):
    # Load the CoreML package, pinning it to the requested compute units.
    # The fp32 package is used on CPU; the fp16 package everywhere else.
    model = MLModel(
        (
            "models/inswapper_128.mlpackage"
            if device == "CPU"
            else "models/inswapper_128_fp16.mlpackage"
        ),
        compute_units={
            "GPU": ct.ComputeUnit.CPU_AND_GPU,
            "CPU": ct.ComputeUnit.CPU_ONLY,
            "ANE": ct.ComputeUnit.CPU_AND_NE,
        }[device],
        optimization_hints={
            "specializationStrategy": ct.SpecializationStrategy.FastPrediction,
            "reshapeFrequency": ct.ReshapeFrequency.Infrequent,
        },
    )
    return model


def coreml_swap_face(model, target_image, target_face, source_face):
    # Align and crop the target face to the model's 128x128 input
    aimg, M = face_align.norm_crop2(target_image, target_face.kps, input_size[0])
    blob = cv2.dnn.blobFromImage(
        aimg,
        1.0 / input_std,
        input_size,
        (input_mean, input_mean, input_mean),
        swapRB=True,
    )
    # The source identity is encoded as the face recognition embedding
    latent = source_face.normed_embedding.reshape((1, -1))
    out = model.predict(
        {
            "target_blob": blob,
            "source_latent": latent,
        }
    )
    pred = next(iter(out.values()))
    img_fake = pred.transpose((0, 2, 3, 1))[0]
    bgr_fake = np.clip(255 * img_fake, 0, 255).astype(np.uint8)[:, :, ::-1]

    # Build a mask of where the swap changed the crop, zeroing a 2px border
    fake_diff = bgr_fake.astype(np.float32) - aimg.astype(np.float32)
    fake_diff = np.abs(fake_diff).mean(axis=2)
    fake_diff[:2, :] = 0
    fake_diff[-2:, :] = 0
    fake_diff[:, :2] = 0
    fake_diff[:, -2:] = 0

    # Warp the swapped crop and the masks back into the full frame
    IM = cv2.invertAffineTransform(M)
    img_white = np.full((aimg.shape[0], aimg.shape[1]), 255, dtype=np.float32)
    bgr_fake = cv2.warpAffine(
        bgr_fake,
        IM,
        (target_image.shape[1], target_image.shape[0]),
        borderValue=0.0,  # type: ignore
    )
    img_white = cv2.warpAffine(
        img_white,
        IM,
        (target_image.shape[1], target_image.shape[0]),
        borderValue=0.0,  # type: ignore
    )
    fake_diff = cv2.warpAffine(
        fake_diff,
        IM,
        (target_image.shape[1], target_image.shape[0]),
        borderValue=0.0,  # type: ignore
    )
    img_white[img_white > 20] = 255
    fthresh = 10
    fake_diff[fake_diff < fthresh] = 0
    fake_diff[fake_diff >= fthresh] = 255

    # Erode and blur the mask so the swapped face blends smoothly
    img_mask = img_white
    mask_h_inds, mask_w_inds = np.where(img_mask == 255)
    mask_h = np.max(mask_h_inds) - np.min(mask_h_inds)
    mask_w = np.max(mask_w_inds) - np.min(mask_w_inds)
    mask_size = int(np.sqrt(mask_h * mask_w))
    k = max(mask_size // 10, 10)
    kernel = np.ones((k, k), np.uint8)
    img_mask = cv2.erode(img_mask, kernel, iterations=1)
    kernel = np.ones((2, 2), np.uint8)
    fake_diff = cv2.dilate(fake_diff, kernel, iterations=1)
    k = max(mask_size // 20, 5)
    kernel_size = (k, k)
    blur_size = tuple(2 * i + 1 for i in kernel_size)
    img_mask = cv2.GaussianBlur(img_mask, blur_size, 0)
    k = 5
    kernel_size = (k, k)
    blur_size = tuple(2 * i + 1 for i in kernel_size)
    fake_diff = cv2.GaussianBlur(fake_diff, blur_size, 0)
    img_mask /= 255
    fake_diff /= 255

    # Alpha-blend the warped swap into the original frame
    img_mask = np.reshape(img_mask, [img_mask.shape[0], img_mask.shape[1], 1])
    fake_merged = img_mask * bgr_fake + (1 - img_mask) * target_image.astype(np.float32)
    fake_merged = fake_merged.astype(np.uint8)
    return fake_merged


if __name__ == "__main__":
    import time

    import insightface

    source_path = "thanh.png"
    target_path = "hulk.png"
    output_path = "models/test.png"
    source_image = cv2.imread(source_path)
    target_image = cv2.imread(target_path)
    model = coreml_load_face_swap_model()
    FACE_ANALYSER = insightface.app.FaceAnalysis(
        name="buffalo_l", providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
    )
    FACE_ANALYSER.prepare(ctx_id=0, det_size=(640, 640))

    #### WARMUP ####
    source_face = min(
        FACE_ANALYSER.get(source_image, max_num=1), key=lambda x: x.bbox[0]
    )
    target_face = min(
        FACE_ANALYSER.get(target_image, max_num=1), key=lambda x: x.bbox[0]
    )
    result = coreml_swap_face(model, target_image, target_face, source_face)
    ################

    start = time.perf_counter()
    source_face = min(
        FACE_ANALYSER.get(source_image, max_num=1), key=lambda x: x.bbox[0]
    )
    target_face = min(
        FACE_ANALYSER.get(target_image, max_num=1), key=lambda x: x.bbox[0]
    )
    found_faces = time.perf_counter()
    result = coreml_swap_face(model, target_image, target_face, source_face)
    end = time.perf_counter()
    print("Face Analyzer:", found_faces - start, "sec")
    print("Face Swapping:", end - found_faces, "sec")
    print("TOTAL TIME:", end - start, "sec")
    cv2.imwrite(output_path, result)
```

Step 3) Replace the `get_face_swapper` and `swap_face` methods in `modules/processors/frame/face_swapper.py` with this code:

```python
from models.inswapper_coreml import coreml_load_face_swap_model, coreml_swap_face

USE_COREML = False


def get_face_swapper() -> Any:
    global FACE_SWAPPER, USE_COREML
    with THREAD_LOCK:
        if FACE_SWAPPER is None:
            # Prefer the native CoreML package if it has been downloaded
            if os.path.exists(os.path.join(models_dir, "inswapper_128.mlpackage")):
                USE_COREML = True
                FACE_SWAPPER = coreml_load_face_swap_model("GPU")
            else:
                model_name = "inswapper_128.onnx"
                if "CUDAExecutionProvider" in modules.globals.execution_providers:
                    model_name = "inswapper_128_fp16.onnx"
                model_path = os.path.join(models_dir, model_name)
                FACE_SWAPPER = insightface.model_zoo.get_model(
                    model_path,
                    # providers=modules.globals.execution_providers,
                    providers=[
                        (
                            (
                                "CoreMLExecutionProvider",
                                {
                                    "ModelFormat": "MLProgram",
                                    "MLComputeUnits": "CPUAndGPU",
                                    "SpecializationStrategy": "FastPrediction",
                                    "AllowLowPrecisionAccumulationOnGPU": 1,
                                },
                            )
                            if p == "CoreMLExecutionProvider"
                            else p
                        )
                        for p in modules.globals.execution_providers
                    ],
                )
    return FACE_SWAPPER


def swap_face(source_face: Face, target_face: Face, temp_frame: Frame) -> Frame:
    face_swapper = get_face_swapper()

    # Apply the face swap
    if USE_COREML:
        swapped_frame = coreml_swap_face(
            face_swapper, temp_frame, target_face, source_face
        )
    else:
        swapped_frame = face_swapper.get(
            temp_frame, target_face, source_face, paste_back=True
        )

    if modules.globals.mouth_mask:
        # Create a mask for the target face
        face_mask = create_face_mask(target_face, temp_frame)

        # Create the mouth mask
        mouth_mask, mouth_cutout, mouth_box, lower_lip_polygon = (
            create_lower_mouth_mask(target_face, temp_frame)
        )

        # Apply the mouth area
        swapped_frame = apply_mouth_area(
            swapped_frame, mouth_cutout, mouth_box, face_mask, lower_lip_polygon
        )

        if modules.globals.show_mouth_mask_box:
            mouth_mask_data = (mouth_mask, mouth_cutout, mouth_box, lower_lip_polygon)
            swapped_frame = draw_mouth_mask_visualization(
                swapped_frame, target_face, mouth_mask_data
            )

    return swapped_frame
```

Step 4) Download `inswapper_128.mlpackage` or `inswapper_128_fp16.mlpackage` and place the unzipped directory in the `models` folder.

Step 5) Instead of the original content that you placed in `modules/ui.py`, use this:

```python
import numpy as np


def process_frame_pipeline(frame, source_image, frame_processors, width, height):
    temp_frame = frame.copy()
    if modules.globals.live_mirror:
        temp_frame = cv2.flip(temp_frame, 1)
    temp_frame = fit_image_to_size(
        temp_frame,
        width,
        height,
    )
    if not modules.globals.map_faces:
        for frame_processor in frame_processors:
            if frame_processor.NAME == "DLC.FACE-ENHANCER":
                if modules.globals.fp_ui["face_enhancer"]:
                    temp_frame = frame_processor.process_frame(None, temp_frame)
            else:
                temp_frame = frame_processor.process_frame(source_image, temp_frame)
    else:
        modules.globals.target_path = None
        for frame_processor in frame_processors:
            if frame_processor.NAME == "DLC.FACE-ENHANCER":
                if modules.globals.fp_ui["face_enhancer"]:
                    temp_frame = frame_processor.process_frame_v2(temp_frame)
            else:
                temp_frame = frame_processor.process_frame_v2(temp_frame)
    return temp_frame


def create_webcam_preview(camera_index: int):
    global preview_label, PREVIEW
    assert preview_label and PREVIEW and ROOT

    if not modules.globals.source_path:
        update_status("Please select a source image first")
        return

    cap = VideoCapturer(camera_index)
    if not cap.start(PREVIEW_DEFAULT_WIDTH, PREVIEW_DEFAULT_HEIGHT, 60):
        update_status("Failed to start camera")
        return

    preview_label.configure(width=PREVIEW_DEFAULT_WIDTH, height=PREVIEW_DEFAULT_HEIGHT)
    PREVIEW.deiconify()

    frame_processors = get_frame_processors_modules(modules.globals.frame_processors)
    source_face = get_one_face(cv2.imread(modules.globals.source_path))
    width, height = PREVIEW_DEFAULT_WIDTH, PREVIEW_DEFAULT_HEIGHT

    average_k = 20
    latencies_last_k = []
    computations_last_k = []
    times_last_k = []
    frames = 0
    fps_interval = 0.5
    fps = 0
    last_fps_update = 0

    # NUM_WORKERS: int = modules.globals.execution_threads  # type: ignore
    # frame_queue = queue.Queue(maxsize=1)
    # result_queue = queue.Queue(maxsize=NUM_WORKERS)
    # stop_event = threading.Event()

    # def worker(
    #     frame_queue: queue.Queue,
    #     result_queue: queue.Queue,
    #     stop_event: threading.Event,
    # ):
    #     while not stop_event.is_set():
    #         try:
    #             frame, start = frame_queue.get(timeout=0.1)
    #         except queue.Empty:
    #             continue
    #         before = time.perf_counter()
    #         processed = process_frame_pipeline(
    #             frame, source_face, frame_processors, width, height
    #         )
    #         computation_time = time.perf_counter() - before
    #         result_queue.put((processed, start, computation_time))

    # # Launch staggered workers
    # workers = [
    #     threading.Thread(
    #         target=worker,
    #         args=[
    #             frame_queue,
    #             result_queue,
    #             stop_event,
    #         ],
    #         daemon=True,
    #     )
    #     for _ in range(NUM_WORKERS)
    # ]
    # for w in workers:
    #     w.start()

    last_shown_time = 0
    previous_frame_time = time.perf_counter()
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # try:
        current_time = time.perf_counter()
        # frame_queue.put_nowait((frame, current_time))
        if modules.globals.show_fps:
            time_since_last_frame = current_time - previous_frame_time
            times_last_k.append(time_since_last_frame)
            times_last_k = times_last_k[-100:]
        # except queue.Full:
        #     pass  # drop frame if workers busy

        # try:
        #     temp_frame, start, computation_time = result_queue.get_nowait()
        start = current_time
        before = time.perf_counter()
        temp_frame = process_frame_pipeline(
            frame, source_face, frame_processors, width, height
        )
        computation_time = time.perf_counter() - before
        if start < last_shown_time:  # discard outdated frames
            continue
        last_shown_time = start
        if modules.globals.show_fps:
            latencies_last_k.append(time.perf_counter() - start)
            latencies_last_k = latencies_last_k[-average_k:]
            computations_last_k.append(computation_time)
            computations_last_k = computations_last_k[-average_k:]
        # except queue.Empty:
        #     continue

        if modules.globals.live_resizable:
            width, height = PREVIEW.winfo_width(), PREVIEW.winfo_height()

        # Show FPS, latency, and computation time
        if modules.globals.show_fps:
            frames += 1
            current_time = time.perf_counter()
            if current_time - last_fps_update >= fps_interval:
                fps = frames / (current_time - last_fps_update)
                frames = 0
                last_fps_update = current_time
            previous_frame_time = current_time
            cv2.putText(
                temp_frame,
                f"FPS: {fps:3.1f}, STD: {int(1000 * np.std(times_last_k))}ms, LAT: {int(1000 * sum(latencies_last_k) / average_k)}ms, CMP: {int(1000 * sum(computations_last_k) / average_k)}ms",
                (5, 15),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.5,
                (0, 255, 0),
                1,
            )

        # Render
        image = cv2.cvtColor(temp_frame, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        image = ImageOps.contain(
            image, (temp_frame.shape[1], temp_frame.shape[0]), Image.LANCZOS  # type: ignore
        )
        image = ctk.CTkImage(image, size=image.size)
        preview_label.configure(image=image)
        ROOT.update()

        if PREVIEW.state() == "withdrawn":
            break

    # stop_event.set()
    cap.release()
    PREVIEW.withdraw()
```
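One more note on Step 4: if the package doesn't behave once downloaded, a quick sanity check is to load it on the CPU and print its inputs/outputs (a sketch; the expected names `target_blob` and `source_latent` come from the `predict()` call in `inswapper_coreml.py`):

```python
# Verify the downloaded .mlpackage loads and exposes the inputs that
# coreml_swap_face expects.
import coremltools as ct
from coremltools.models import MLModel

model = MLModel("models/inswapper_128.mlpackage", compute_units=ct.ComputeUnit.CPU_ONLY)
spec = model.get_spec()
print("inputs: ", [inp.name for inp in spec.description.input])
print("outputs:", [out.name for out in spec.description.output])
```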
-
Just to give some feedback: thanks for this thread! This'll help a lot of Mac users 😃
-
@SanderGi just to verify: is your top comment updated? I just have to paste that code, right? I currently get 5 fps on an M4 with 24 GB RAM.
-
@saurabhthesuperhero, the top comment should be updated, and you should be able to just replace the relevant code with the new code. You can let me know if it doesn't work. You might find the later comment works better, but I haven't yet extensively tested it on different devices.
-
@SanderGi Quick update: meanwhile, I am going to try this models method once I have enough storage. I just want to confirm: we do have MPS-compatible models, right? Is this the same approach as used in the Draw Things app etc.? Or like in Stable Diffusion, where there are MLX models?
-
@saurabhthesuperhero the script doesn't yet support the face enhancer, so it makes sense that it would be slow. Regarding why the general speedup is so minuscule, there are a couple of factors. Most importantly, make sure you are combining it with the patch to use your GPU rather than the ANE: #1373 (comment). Secondly, make sure you play around with the execution threads to find the optimal number.

Regarding your model compatibility question, Apple has 3 main ways of running hardware-accelerated models (MLX, CoreML, and MPS) and two types of hardware acceleration (the ANE for lightweight ML tasks and the GPU for intense parallel computation). MLX targets the GPU but is meant for training models; it can be used for inference, since that is necessary for training, but that is not its main purpose. CoreML is specifically tuned for inference and can access both the ANE and the GPU. MPS stands for Metal Performance Shaders and is a low-level way of instructing the GPU specifically. It is useful for tuning general-purpose GPU tasks not supported by CoreML/MLX, as those are mainly for machine learning. Except in very niche cases, it is not better to implement ML pipelines with MPS. When I converted the models in this repo to run with PyTorch MPS, for instance, they ran slower than with CoreML/MLX (there are a number of reasons, among which is frequency throttling).

Now with this background we can answer your question: the original models in this repo are ONNX models, which means they have limited CoreML compatibility (some things will still run on the CPU), and they were not configured to run on the GPU but on the ANE (hence this fix: #1373 (comment)). The top comment does not change these models but instead runs them in parallel to increase ANE/GPU utilization while the ONNX runtime falls back to CPU. This means we can increase throughput (FPS) at the expense of more RAM utilization and slower latency. The model fix (#1495 (comment)) includes a model that is fully CoreML compatible, meaning it can take full advantage of the GPU without fallbacks to the CPU. This is lighter on RAM and should improve both FPS and latency. Currently it only includes one model, not the full pipeline of models involved in face swapping, so there is still plenty of room for improvement. I'm experimenting with an optimized end-to-end model, for instance, which should be much faster and utilize the GPU even better.
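If you're curious how much the compute-unit choice matters on your machine, you can time the converted package pinned to each unit (a sketch, reusing the paths and input names from `inswapper_coreml.py` above; the 512-dim embedding size is assumed from buffalo_l's recognition model):

```python
# Time the same CoreML package pinned to different compute units.
import time

import coremltools as ct
import numpy as np
from coremltools.models import MLModel

blob = np.zeros((1, 3, 128, 128), dtype=np.float32)  # dummy target crop
latent = np.zeros((1, 512), dtype=np.float32)        # dummy identity embedding
for name, units in [
    ("CPU only", ct.ComputeUnit.CPU_ONLY),
    ("CPU + GPU", ct.ComputeUnit.CPU_AND_GPU),
    ("CPU + ANE", ct.ComputeUnit.CPU_AND_NE),
]:
    model = MLModel("models/inswapper_128_fp16.mlpackage", compute_units=units)
    model.predict({"target_blob": blob, "source_latent": latent})  # warmup
    start = time.perf_counter()
    for _ in range(20):
        model.predict({"target_blob": blob, "source_latent": latent})
    print(f"{name}: {(time.perf_counter() - start) / 20 * 1000:.1f} ms per swap")
```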
-
Combine with patch? I did not understand this. I followed the whole comment, changing code as mentioned, and then ran the command. I also played with the threads, as the M4 has 10 execution threads (Mac mini base, 24 GB RAM); at 10 it was bad, though. Regarding the 2nd method, when you mentioned … PS. Huge thanks for explaining everything related to Apple Silicon; I used to think MPS and MLX were kind of the same.
-
Combine with patch as in, in addition to the top comment instructions, also follow the instructions here: #1373 (comment). So you'll end up replacing some code in two files.

When you have enough RAM available, using more or less isn't by itself a good or bad thing. If it is possible to use more RAM to get more computation done faster, and the RAM is available, then using more is a good thing. Otherwise, using as little RAM as possible is desirable because not only is there an overhead to using RAM (pretty negligible), but using less also leaves more resources to run other processes. When you use more RAM than is available and start having to use swap, then things become really slow and it is almost always worth it to reduce RAM usage. Since you have plenty of RAM, one thing to look into is combining the top comment and method two. This is not something I have a quick set of instructions ready for yet.
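If you want to check whether you're dipping into swap while experimenting with thread counts, a quick monitor along these lines works (a sketch; assumes `psutil` is installed via `pip install psutil`):

```python
# Print overall RAM and swap usage once per second while Deep-Live-Cam runs
# in another terminal. If swap grows while the app runs, lower the thread count.
import time

import psutil

while True:
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM used: {mem.used / 2**30:.1f} GiB, swap used: {swap.used / 2**30:.1f} GiB")
    time.sleep(1)
```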
-
@SanderGi Anyway, thanks.
-
Unfortunate. Definitely let me know if the second method yields better results if you end up trying it. Also, it seems that version 2.0c was released since I made the above patches. It includes some interpolation and other optimizations that the patches probably don't play nice with. What FPS do you get on f9270c5? And what FPS do you then get when modifying from that as a starting point?
-
Yes, the new version is released. In the new 2.0c version, as shared, I am getting 6 fps; before, I used to get around 4.5 to 5 fps. I tried your code in 2.0c only.
-
Right. My code here is not compatible with 2.0c. The optimizations that 2.0c introduced to go from 5 to 6 FPS on your device could definitely be adapted to work with the changes in this thread for the best of both worlds.
-
@SanderGi yes, right. Just want to confirm my GPU is being used. Is this expected? Because we are assuming, as a baseline, that the GPU is not being utilized, right?
-
@SanderGi - I very much appreciate all your hard work on this project. If you have a minute, could you help me locate `inswapper_128_fp16.mlpackage`? In reference to step no. 4 in your comment above, I was able to previously download `inswapper_128.mlpackage` from the link you provided, but I can't find `inswapper_128_fp16.mlpackage` anywhere, and when I try to run the program without it, I get the following error message:

FileNotFoundError: [Errno 2] No such file or directory: '/Users/xxxxxxxxxx/Deep-Live-Cam/models/inswapper_128_fp16.mlpackage'
-
@etanhanbaiki Have you tried downloading from the download link for inswapper_128_fp16.onnx?
-
@saurabhthesuperhero It is indeed expected that the GPU should be used. The base version will have some GPU utilization, but it should be very low because it has not been configured to fully utilize the Apple Silicon GPU. All the patches in this thread are aimed at improving performance through greater GPU utilization. Regarding 2.0c, I'll have to find the time to review the changes in detail (likely won't have time until December, unfortunately), but I believe it should be a straightforward matter of adapting these patches to also do interpolation between frames.

@etanhanbaiki Here's the link for the fp16 package.

@samundra Sorry for the confusion, this is specifically about the `.mlpackage`, not the `.onnx` file.
-
Please put this on the discussion moving forward.
-
@hacksider how is this completed?
-
@visel I won't tell you, BUT the details are in my last reply.
-
Moved to discussion :)
-
Since lots of people are running into low frame rates (1-3 FPS) when using the live webcam mode, I thought I'd document a simple solution for improving this. This is especially relevant for macOS, since GPU acceleration with MPS isn't fully supported, but it might also be relevant to some Nvidia GPU setups. It would definitely be useful if you have no GPU (or a very weak GPU) and have to run things on the CPU.
**Explanation of the problem and solution**
Essentially, the live webcam mode is slow because it does everything sequentially and mostly on the CPU, meaning it has to skip a lot of frames while it is calculating. The problem lies in using just one CPU core to first find the face, then align it, then swap it, on repeat. Most machines have multiple CPU cores, and for a pre-recorded video they can easily be utilized by computing multiple frames in parallel (which the app already does). However, when streaming in frames live from the webcam, we don't have access to future frames (after all, they haven't happened yet), so we can't start computing them in parallel. We could wait for some to arrive and then compute those in parallel, but that introduces a delay/latency that feels just as bad as a low frame rate.
The solution is pipelining (the same technique that your CPU already uses at the instruction level to optimize memory fetching, etc.!). In simplified terms, start finding the face in the first frame as soon as it arrives, then when the second frame arrives start finding the face in it while concurrently aligning the face from the first frame, then swap the face in the first frame while concurrently aligning the face in the second frame and finding a face in the third frame, and so on. That roughly looks like this:
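For example, with the three stages spread across three CPU cores (a rough reconstruction; the frame counts match the numbers discussed next):

```
time →       t1      t2      t3      t4      t5
sequential:  find1   align1  swap1   find2   align2    ≈ 1.67 frames done
pipelined:
  core A:    find1   find2   find3   find4   find5
  core B:            align1  align2  align3  align4
  core C:                    swap1   swap2   swap3     = 3 frames done
```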
In this simplified representation, the pipelined approach finishes 3 frames in the same time that the sequential approach finishes 1.67 frames, both with the same latency (the time between something happening in the camera feed and the corresponding change being rendered in the face-swapped stream). Of course, in reality there are many more tasks than just `find`, `align`, and `swap`, plus not all the tasks take the same amount of time. This means they won't fit as nicely between the available CPU cores. Moreover, the frames definitely won't be streaming in at a rate that lines up with when the CPU cores are done processing them, so frames must be skipped and distributed smartly across the available cores to produce an even stream that doesn't freeze and jump in time. This can be done by keeping track of the moving average of computation time per frame and distributing the CPU cores evenly across frames from that timespan.

To try this out for yourself, set up the repo following the manual installation, open `modules/ui.py`, and replace the `create_webcam_preview` function with the code that makes the live webcam mode use pipelining (see the sketch below).
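In outline, the pipelined loop looks like this (a minimal sketch, not the full drop-in replacement; `cap`, `process_frame`, and `show_frame` stand in for the real capture, frame-processing pipeline, and UI update, and the real code also tracks FPS and latency):

```python
# Sketch of the pipelined webcam loop: N worker threads process whole frames
# in parallel; the main loop feeds them the newest camera frame and displays
# results, dropping any that finish out of order.
import queue
import threading
import time


def run_pipelined(cap, process_frame, show_frame, num_workers=4):
    frame_queue = queue.Queue(maxsize=1)  # holds only the freshest frame
    result_queue = queue.Queue(maxsize=num_workers)
    stop_event = threading.Event()

    def worker():
        while not stop_event.is_set():
            try:
                frame, captured_at = frame_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            result_queue.put((process_frame(frame), captured_at))

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    last_shown = 0.0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        try:
            frame_queue.put_nowait((frame, time.perf_counter()))
        except queue.Full:
            pass  # all workers busy, so skip this frame
        try:
            processed, captured_at = result_queue.get_nowait()
        except queue.Empty:
            continue
        if captured_at < last_shown:
            continue  # discard frames that finished out of order
        last_shown = captured_at
        show_frame(processed)
    stop_event.set()
```

The key choices are the `maxsize=1` frame queue, so workers always grab the freshest camera frame, and the timestamp check, so frames that finish out of order are dropped rather than shown late.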
You'll need to experiment a bit with the number of threads you make available, depending on your system. On my M1 Pro with 10 CPU cores and 16 GB RAM, about 4 threads seems to be the sweet spot, so I run the app with `python run.py --execution-provider coreml --live-mirror --execution-threads 4`. Feel free to share what works with your setup to help others find a good number of threads.

Of course, this is a rather primitive fix. There are a number of ways it could/should be improved before being merged into the repo.
For one, the sequencing of frames is currently very primitive: it simply discards out-of-order frames, which is wasted work. For another, it only uses pipelining with either CPU cores or GPU cores, not both; a quick improvement would be combining both CPU and GPU cores. Finally, only FPS is improved for a smoother stream. The latency is not improved. For that, less work must be done per frame, either by reusing work (and doing quick estimates of some values from previous frames) or by making the work faster (e.g., quantization, parallelizing at the model graph level).