Replies: 27 comments 1 reply
-
Good solution. I improved from 6 fps to 13-20 fps on my M4.
-
@VN-BugMaker Could you please share your settings and how you execute it?
-
This method causes my M1 Pro 32GB to crash and restart, and the time on the startup lock screen shows April 1st at 8:00.
-
@hdd99009, you probably need to set the execution threads lower then. I'm also working on a more stable version of this code that includes a rewrite of the models into native CoreML packages (it significantly lowers the latency too!). It would help if you could provide some debugging information. Also, for the best experience, combine this solution with the quick patch to use the GPU rather than the ANE: #1373 (comment)
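For reference, the gist of that patch (a sketch of the idea rather than the exact diff; the provider options here mirror the ones used in the `face_swapper.py` code later in this thread):

```python
# Configure onnxruntime's CoreMLExecutionProvider to run on CPU+GPU instead
# of letting CoreML schedule the model on the ANE.
import insightface

face_swapper = insightface.model_zoo.get_model(
    "models/inswapper_128.onnx",
    providers=[
        (
            "CoreMLExecutionProvider",
            {
                "ModelFormat": "MLProgram",
                "MLComputeUnits": "CPUAndGPU",
                "SpecializationStrategy": "FastPrediction",
                "AllowLowPrecisionAccumulationOnGPU": 1,
            },
        ),
        "CPUExecutionProvider",
    ],
)
```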
-
On Apple M4 Max, after adjusting … and starting the script via: … all bear the same results.
-
@visel, how long are you waiting, and what macOS version are you on? Have you tried opening FaceTime? It should take a bit longer to spin up than the unmodified app; how long does that take for you? Make sure you modified starting from the latest version of DLC in this repo. I came across a similar issue setting this up on a Windows machine; perhaps the changes I made there are also necessary for your setup. They shouldn't be on macOS, but maybe they are if you've waited more than twice what you usually wait.
-
@SanderGi I downloaded the latest version of Deep-Live-Cam and now it is no longer loading infinitely. However, my FPS didn't go up (16 cores: 12 performance and 4 efficiency, and 48 GB RAM). Furthermore, the more threads are assigned, the more "jittery" the live preview becomes, with frames overwriting each other of sorts.
-
@visel, ah, I've since fixed the jitter but forgot to update this issue. I've updated the code in the original comment. As for the FPS not going up, make sure you don't have the face enhancer enabled; the patch doesn't support that currently. It is more a proof of concept and thus hasn't been tuned to work for all devices/use cases. I'm working on a robust version using CoreML directly (and fine-tuning + quantizing a custom smaller model with a higher resolution, like inswapper-512). If you have time, you can help me test the following setup. It should improve latency and maybe slightly improve FPS, but not significantly:

Step 1) Replace your requirements.txt with this, recreate your virtual environment, and `pip install -r requirements.txt`.

Step 2) Create a file called `inswapper_coreml.py` inside the `models` folder and put the following code in it:

```python
import cv2
import numpy as np
from insightface.utils import face_align
import coremltools as ct
from coremltools.models import MLModel

# Preprocessing constants for the 128x128 inswapper model
input_mean = 0.0
input_std = 255.0
input_shape = [1, 3, 128, 128]
input_size = tuple(input_shape[2:4][::-1])  # (width, height) = (128, 128)


def coreml_load_face_swap_model(device="GPU"):
    # Load the CoreML package, pinning it to the requested compute units.
    # The fp32 package is used on CPU; the fp16 package everywhere else.
    model = MLModel(
        (
            "models/inswapper_128.mlpackage"
            if device == "CPU"
            else "models/inswapper_128_fp16.mlpackage"
        ),
        compute_units={
            "GPU": ct.ComputeUnit.CPU_AND_GPU,
            "CPU": ct.ComputeUnit.CPU_ONLY,
            "ANE": ct.ComputeUnit.CPU_AND_NE,
        }[device],
        optimization_hints={
            "specializationStrategy": ct.SpecializationStrategy.FastPrediction,
            "reshapeFrequency": ct.ReshapeFrequency.Infrequent,
        },
    )
    return model


def coreml_swap_face(model, target_image, target_face, source_face):
    # Align and crop the target face to the model's 128x128 input
    aimg, M = face_align.norm_crop2(target_image, target_face.kps, input_size[0])
    blob = cv2.dnn.blobFromImage(
        aimg,
        1.0 / input_std,
        input_size,
        (input_mean, input_mean, input_mean),
        swapRB=True,
    )
    # The source identity is encoded as the face recognition embedding
    latent = source_face.normed_embedding.reshape((1, -1))
    out = model.predict(
        {
            "target_blob": blob,
            "source_latent": latent,
        }
    )
    pred = next(iter(out.values()))
    img_fake = pred.transpose((0, 2, 3, 1))[0]
    bgr_fake = np.clip(255 * img_fake, 0, 255).astype(np.uint8)[:, :, ::-1]

    # Build a mask of where the swap changed the crop, zeroing a 2px border
    fake_diff = bgr_fake.astype(np.float32) - aimg.astype(np.float32)
    fake_diff = np.abs(fake_diff).mean(axis=2)
    fake_diff[:2, :] = 0
    fake_diff[-2:, :] = 0
    fake_diff[:, :2] = 0
    fake_diff[:, -2:] = 0

    # Warp the swapped crop and the masks back into the full frame
    IM = cv2.invertAffineTransform(M)
    img_white = np.full((aimg.shape[0], aimg.shape[1]), 255, dtype=np.float32)
    bgr_fake = cv2.warpAffine(
        bgr_fake,
        IM,
        (target_image.shape[1], target_image.shape[0]),
        borderValue=0.0,  # type: ignore
    )
    img_white = cv2.warpAffine(
        img_white,
        IM,
        (target_image.shape[1], target_image.shape[0]),
        borderValue=0.0,  # type: ignore
    )
    fake_diff = cv2.warpAffine(
        fake_diff,
        IM,
        (target_image.shape[1], target_image.shape[0]),
        borderValue=0.0,  # type: ignore
    )
    img_white[img_white > 20] = 255
    fthresh = 10
    fake_diff[fake_diff < fthresh] = 0
    fake_diff[fake_diff >= fthresh] = 255

    # Erode and blur the mask so the swapped face blends smoothly
    img_mask = img_white
    mask_h_inds, mask_w_inds = np.where(img_mask == 255)
    mask_h = np.max(mask_h_inds) - np.min(mask_h_inds)
    mask_w = np.max(mask_w_inds) - np.min(mask_w_inds)
    mask_size = int(np.sqrt(mask_h * mask_w))
    k = max(mask_size // 10, 10)
    kernel = np.ones((k, k), np.uint8)
    img_mask = cv2.erode(img_mask, kernel, iterations=1)
    kernel = np.ones((2, 2), np.uint8)
    fake_diff = cv2.dilate(fake_diff, kernel, iterations=1)
    k = max(mask_size // 20, 5)
    kernel_size = (k, k)
    blur_size = tuple(2 * i + 1 for i in kernel_size)
    img_mask = cv2.GaussianBlur(img_mask, blur_size, 0)
    k = 5
    kernel_size = (k, k)
    blur_size = tuple(2 * i + 1 for i in kernel_size)
    fake_diff = cv2.GaussianBlur(fake_diff, blur_size, 0)
    img_mask /= 255
    fake_diff /= 255

    # Alpha-blend the warped swap into the original frame
    img_mask = np.reshape(img_mask, [img_mask.shape[0], img_mask.shape[1], 1])
    fake_merged = img_mask * bgr_fake + (1 - img_mask) * target_image.astype(np.float32)
    fake_merged = fake_merged.astype(np.uint8)
    return fake_merged


if __name__ == "__main__":
    import time

    import insightface

    source_path = "thanh.png"
    target_path = "hulk.png"
    output_path = "models/test.png"
    source_image = cv2.imread(source_path)
    target_image = cv2.imread(target_path)
    model = coreml_load_face_swap_model()
    FACE_ANALYSER = insightface.app.FaceAnalysis(
        name="buffalo_l", providers=["CoreMLExecutionProvider", "CPUExecutionProvider"]
    )
    FACE_ANALYSER.prepare(ctx_id=0, det_size=(640, 640))

    #### WARMUP ####
    source_face = min(
        FACE_ANALYSER.get(source_image, max_num=1), key=lambda x: x.bbox[0]
    )
    target_face = min(
        FACE_ANALYSER.get(target_image, max_num=1), key=lambda x: x.bbox[0]
    )
    result = coreml_swap_face(model, target_image, target_face, source_face)
    ################

    start = time.perf_counter()
    source_face = min(
        FACE_ANALYSER.get(source_image, max_num=1), key=lambda x: x.bbox[0]
    )
    target_face = min(
        FACE_ANALYSER.get(target_image, max_num=1), key=lambda x: x.bbox[0]
    )
    found_faces = time.perf_counter()
    result = coreml_swap_face(model, target_image, target_face, source_face)
    end = time.perf_counter()
    print("Face Analyzer:", found_faces - start, "sec")
    print("Face Swapping:", end - found_faces, "sec")
    print("TOTAL TIME:", end - start, "sec")
    cv2.imwrite(output_path, result)
```

Step 3) Replace the `get_face_swapper` and `swap_face` methods in `modules/processors/frame/face_swapper.py` with this code:

```python
from models.inswapper_coreml import coreml_load_face_swap_model, coreml_swap_face

USE_COREML = False


def get_face_swapper() -> Any:
    global FACE_SWAPPER, USE_COREML
    with THREAD_LOCK:
        if FACE_SWAPPER is None:
            # Prefer the native CoreML package if it has been downloaded
            if os.path.exists(os.path.join(models_dir, "inswapper_128.mlpackage")):
                USE_COREML = True
                FACE_SWAPPER = coreml_load_face_swap_model("GPU")
            else:
                model_name = "inswapper_128.onnx"
                if "CUDAExecutionProvider" in modules.globals.execution_providers:
                    model_name = "inswapper_128_fp16.onnx"
                model_path = os.path.join(models_dir, model_name)
                FACE_SWAPPER = insightface.model_zoo.get_model(
                    model_path,
                    # providers=modules.globals.execution_providers,
                    providers=[
                        (
                            (
                                "CoreMLExecutionProvider",
                                {
                                    "ModelFormat": "MLProgram",
                                    "MLComputeUnits": "CPUAndGPU",
                                    "SpecializationStrategy": "FastPrediction",
                                    "AllowLowPrecisionAccumulationOnGPU": 1,
                                },
                            )
                            if p == "CoreMLExecutionProvider"
                            else p
                        )
                        for p in modules.globals.execution_providers
                    ],
                )
    return FACE_SWAPPER


def swap_face(source_face: Face, target_face: Face, temp_frame: Frame) -> Frame:
    face_swapper = get_face_swapper()

    # Apply the face swap
    if USE_COREML:
        swapped_frame = coreml_swap_face(
            face_swapper, temp_frame, target_face, source_face
        )
    else:
        swapped_frame = face_swapper.get(
            temp_frame, target_face, source_face, paste_back=True
        )

    if modules.globals.mouth_mask:
        # Create a mask for the target face
        face_mask = create_face_mask(target_face, temp_frame)

        # Create the mouth mask
        mouth_mask, mouth_cutout, mouth_box, lower_lip_polygon = (
            create_lower_mouth_mask(target_face, temp_frame)
        )

        # Apply the mouth area
        swapped_frame = apply_mouth_area(
            swapped_frame, mouth_cutout, mouth_box, face_mask, lower_lip_polygon
        )

        if modules.globals.show_mouth_mask_box:
            mouth_mask_data = (mouth_mask, mouth_cutout, mouth_box, lower_lip_polygon)
            swapped_frame = draw_mouth_mask_visualization(
                swapped_frame, target_face, mouth_mask_data
            )

    return swapped_frame
```

Step 4) Download `inswapper_128.mlpackage` or `inswapper_128_fp16.mlpackage` and place the unzipped directory in the `models` folder.

Step 5) Instead of the original content that you placed in `modules/ui.py`, use this:

```python
import numpy as np


def process_frame_pipeline(frame, source_image, frame_processors, width, height):
    temp_frame = frame.copy()
    if modules.globals.live_mirror:
        temp_frame = cv2.flip(temp_frame, 1)
    temp_frame = fit_image_to_size(
        temp_frame,
        width,
        height,
    )
    if not modules.globals.map_faces:
        for frame_processor in frame_processors:
            if frame_processor.NAME == "DLC.FACE-ENHANCER":
                if modules.globals.fp_ui["face_enhancer"]:
                    temp_frame = frame_processor.process_frame(None, temp_frame)
            else:
                temp_frame = frame_processor.process_frame(source_image, temp_frame)
    else:
        modules.globals.target_path = None
        for frame_processor in frame_processors:
            if frame_processor.NAME == "DLC.FACE-ENHANCER":
                if modules.globals.fp_ui["face_enhancer"]:
                    temp_frame = frame_processor.process_frame_v2(temp_frame)
            else:
                temp_frame = frame_processor.process_frame_v2(temp_frame)
    return temp_frame


def create_webcam_preview(camera_index: int):
    global preview_label, PREVIEW
    assert preview_label and PREVIEW and ROOT

    if not modules.globals.source_path:
        update_status("Please select a source image first")
        return

    cap = VideoCapturer(camera_index)
    if not cap.start(PREVIEW_DEFAULT_WIDTH, PREVIEW_DEFAULT_HEIGHT, 60):
        update_status("Failed to start camera")
        return

    preview_label.configure(width=PREVIEW_DEFAULT_WIDTH, height=PREVIEW_DEFAULT_HEIGHT)
    PREVIEW.deiconify()

    frame_processors = get_frame_processors_modules(modules.globals.frame_processors)
    source_face = get_one_face(cv2.imread(modules.globals.source_path))
    width, height = PREVIEW_DEFAULT_WIDTH, PREVIEW_DEFAULT_HEIGHT

    average_k = 20
    latencies_last_k = []
    computations_last_k = []
    times_last_k = []
    frames = 0
    fps_interval = 0.5
    fps = 0
    last_fps_update = 0

    # NUM_WORKERS: int = modules.globals.execution_threads  # type: ignore
    # frame_queue = queue.Queue(maxsize=1)
    # result_queue = queue.Queue(maxsize=NUM_WORKERS)
    # stop_event = threading.Event()

    # def worker(
    #     frame_queue: queue.Queue,
    #     result_queue: queue.Queue,
    #     stop_event: threading.Event,
    # ):
    #     while not stop_event.is_set():
    #         try:
    #             frame, start = frame_queue.get(timeout=0.1)
    #         except queue.Empty:
    #             continue
    #         before = time.perf_counter()
    #         processed = process_frame_pipeline(
    #             frame, source_face, frame_processors, width, height
    #         )
    #         computation_time = time.perf_counter() - before
    #         result_queue.put((processed, start, computation_time))

    # # Launch staggered workers
    # workers = [
    #     threading.Thread(
    #         target=worker,
    #         args=[
    #             frame_queue,
    #             result_queue,
    #             stop_event,
    #         ],
    #         daemon=True,
    #     )
    #     for _ in range(NUM_WORKERS)
    # ]
    # for w in workers:
    #     w.start()

    last_shown_time = 0
    previous_frame_time = time.perf_counter()
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # try:
        current_time = time.perf_counter()
        # frame_queue.put_nowait((frame, current_time))
        if modules.globals.show_fps:
            time_since_last_frame = current_time - previous_frame_time
            times_last_k.append(time_since_last_frame)
            times_last_k = times_last_k[-100:]
        # except queue.Full:
        #     pass  # drop frame if workers busy

        # try:
        #     temp_frame, start, computation_time = result_queue.get_nowait()
        start = current_time
        before = time.perf_counter()
        temp_frame = process_frame_pipeline(
            frame, source_face, frame_processors, width, height
        )
        computation_time = time.perf_counter() - before
        if start < last_shown_time:  # discard outdated frames
            continue
        last_shown_time = start
        if modules.globals.show_fps:
            latencies_last_k.append(time.perf_counter() - start)
            latencies_last_k = latencies_last_k[-average_k:]
            computations_last_k.append(computation_time)
            computations_last_k = computations_last_k[-average_k:]
        # except queue.Empty:
        #     continue

        if modules.globals.live_resizable:
            width, height = PREVIEW.winfo_width(), PREVIEW.winfo_height()

        # Show FPS, latency, and computation time
        if modules.globals.show_fps:
            frames += 1
            current_time = time.perf_counter()
            if current_time - last_fps_update >= fps_interval:
                fps = frames / (current_time - last_fps_update)
                frames = 0
                last_fps_update = current_time
            previous_frame_time = current_time
            cv2.putText(
                temp_frame,
                f"FPS: {fps:3.1f}, STD: {int(1000 * np.std(times_last_k))}ms, LAT: {int(1000 * sum(latencies_last_k) / average_k)}ms, CMP: {int(1000 * sum(computations_last_k) / average_k)}ms",
                (5, 15),
                cv2.FONT_HERSHEY_SIMPLEX,
                0.5,
                (0, 255, 0),
                1,
            )

        # Render
        image = cv2.cvtColor(temp_frame, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
        image = ImageOps.contain(
            image, (temp_frame.shape[1], temp_frame.shape[0]), Image.LANCZOS  # type: ignore
        )
        image = ctk.CTkImage(image, size=image.size)
        preview_label.configure(image=image)
        ROOT.update()

        if PREVIEW.state() == "withdrawn":
            break

    # stop_event.set()
    cap.release()
    PREVIEW.withdraw()
```
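One more note on Step 4: if the package doesn't behave once downloaded, a quick sanity check is to load it on the CPU and print its inputs/outputs (a sketch; the expected names `target_blob` and `source_latent` come from the `predict()` call in `inswapper_coreml.py`):

```python
# Verify the downloaded .mlpackage loads and exposes the inputs that
# coreml_swap_face expects.
import coremltools as ct
from coremltools.models import MLModel

model = MLModel("models/inswapper_128.mlpackage", compute_units=ct.ComputeUnit.CPU_ONLY)
spec = model.get_spec()
print("inputs: ", [inp.name for inp in spec.description.input])
print("outputs:", [out.name for out in spec.description.output])
```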
-
Just to give some feedback: thanks for this thread! This'll help a lot of Mac users 😃
-
@SanderGi just to verify: is your top comment updated? I just have to paste that code, right? I currently get 5 fps on an M4 with 24 GB RAM.
-
@saurabhthesuperhero, the top comment should be updated, and you should be able to just replace the relevant code with the new code. You can let me know if it doesn't work. You might find the later comment works better, but I haven't yet extensively tested it on different devices.
-
@SanderGi Quick update: meanwhile, I am going to try this models method once I have enough storage. I just want to confirm: we do have MPS-compatible models, right? Is this the same approach as used in the Draw Things app etc.? Or like in Stable Diffusion, where there are MLX models?
-
@saurabhthesuperhero the script doesn't yet support the face enhancer, so it makes sense that it would be slow. Regarding why the general speedup is so minuscule, there are a couple of factors. Most importantly, make sure you are combining it with the patch to use your GPU rather than the ANE: #1373 (comment). Secondly, make sure you play around with the execution threads to find the optimal number.

Regarding your model compatibility question, Apple has 3 main ways of running hardware-accelerated models (MLX, CoreML, and MPS) and two types of hardware acceleration (the ANE for lightweight ML tasks and the GPU for intense parallel computation). MLX targets the GPU but is meant for training models; it can be used for inference, since that is necessary for training, but that is not its main purpose. CoreML is specifically tuned for inference and can access both the ANE and the GPU. MPS stands for Metal Performance Shaders and is a low-level way of instructing the GPU specifically. It is useful for tuning general-purpose GPU tasks not supported by CoreML/MLX, as those are mainly for machine learning. Except in very niche cases, it is not better to implement ML pipelines with MPS. When I converted the models in this repo to run with PyTorch MPS, for instance, they ran slower than with CoreML/MLX (there are a number of reasons, among which is frequency throttling).

Now with this background we can answer your question: the original models in this repo are ONNX models, which means they have limited CoreML compatibility (some things will still run on the CPU), and they were not configured to run on the GPU but on the ANE (hence this fix: #1373 (comment)). The top comment does not change these models but instead runs them in parallel to increase ANE/GPU utilization while the ONNX runtime falls back to CPU. This means we can increase throughput (FPS) at the expense of more RAM utilization and slower latency. The model fix (#1495 (comment)) includes a model that is fully CoreML compatible, meaning it can take full advantage of the GPU without fallbacks to the CPU. This is lighter on RAM and should improve both FPS and latency. Currently it only includes one model, not the full pipeline of models involved in face swapping, so there is still plenty of room for improvement. I'm experimenting with an optimized end-to-end model, for instance, which should be much faster and utilize the GPU even better.
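If you're curious how much the compute-unit choice matters on your machine, you can time the converted package pinned to each unit (a sketch, reusing the paths and input names from `inswapper_coreml.py` above; the 512-dim embedding size is assumed from buffalo_l's recognition model):

```python
# Time the same CoreML package pinned to different compute units.
import time

import coremltools as ct
import numpy as np
from coremltools.models import MLModel

blob = np.zeros((1, 3, 128, 128), dtype=np.float32)  # dummy target crop
latent = np.zeros((1, 512), dtype=np.float32)        # dummy identity embedding
for name, units in [
    ("CPU only", ct.ComputeUnit.CPU_ONLY),
    ("CPU + GPU", ct.ComputeUnit.CPU_AND_GPU),
    ("CPU + ANE", ct.ComputeUnit.CPU_AND_NE),
]:
    model = MLModel("models/inswapper_128_fp16.mlpackage", compute_units=units)
    model.predict({"target_blob": blob, "source_latent": latent})  # warmup
    start = time.perf_counter()
    for _ in range(20):
        model.predict({"target_blob": blob, "source_latent": latent})
    print(f"{name}: {(time.perf_counter() - start) / 20 * 1000:.1f} ms per swap")
```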
-
Combine with patch? I did not understand this. I followed the whole comment, changing code as mentioned, and then ran the command. I also played with the threads, as the M4 has 10 execution threads (Mac mini base, 24 GB RAM); at 10 it was bad, though. Regarding the 2nd method, when you mentioned … PS. Huge thanks for explaining everything related to Apple Silicon; I used to think MPS and MLX were kind of the same.
-
Combine with patch as in, in addition to the top comment instructions, also follow the instructions here: #1373 (comment). So you'll end up replacing some code in two files.

When you have enough RAM available, using more or less isn't by itself a good or bad thing. If it is possible to use more RAM to get more computation done faster, and the RAM is available, then using more is a good thing. Otherwise, using as little RAM as possible is desirable because not only is there an overhead to using RAM (pretty negligible), but using less also leaves more resources to run other processes. When you use more RAM than is available and start having to use swap, then things become really slow and it is almost always worth it to reduce RAM usage. Since you have plenty of RAM, one thing to look into is combining the top comment and method two. This is not something I have a quick set of instructions ready for yet.
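If you want to check whether you're dipping into swap while experimenting with thread counts, a quick monitor along these lines works (a sketch; assumes `psutil` is installed via `pip install psutil`):

```python
# Print overall RAM and swap usage once per second while Deep-Live-Cam runs
# in another terminal. If swap grows while the app runs, lower the thread count.
import time

import psutil

while True:
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM used: {mem.used / 2**30:.1f} GiB, swap used: {swap.used / 2**30:.1f} GiB")
    time.sleep(1)
```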
-
@SanderGi Anyway, thanks.
-
Unfortunate. Definitely let me know if the second method yields better results if you end up trying it. Also, it seems that version 2.0c was released since I made the above patches. It includes some interpolation and other optimizations that the patches probably don't play nice with. What FPS do you get on f9270c5? And what FPS do you then get when modifying from that as a starting point?
-
Yes, the new version is released. In the new 2.0c version, as shared, I am getting 6 fps; before, I used to get around 4.5 to 5 fps. I tried your code in 2.0c only.
-
Right. My code here is not compatible with 2.0c. The optimizations that 2.0c introduced to go from 5 to 6 FPS on your device could definitely be adapted to work with the changes in this thread for the best of both worlds.
-
@SanderGi yes, right. Just want to confirm my GPU is being used. Is this expected? Because we are assuming, as a baseline, that the GPU is not being utilized, right?
-
@SanderGi - I very much appreciate all your hard work on this project. If you have a minute, could you help me locate `inswapper_128_fp16.mlpackage`? In reference to step no. 4 in your comment above, I was able to previously download `inswapper_128.mlpackage` from the link you provided, but I can't find `inswapper_128_fp16.mlpackage` anywhere, and when I try to run the program without it, I get the following error message:

FileNotFoundError: [Errno 2] No such file or directory: '/Users/xxxxxxxxxx/Deep-Live-Cam/models/inswapper_128_fp16.mlpackage'
-
@etanhanbaiki Have you tried downloading from the download link for inswapper_128_fp16.onnx?
-
@saurabhthesuperhero It is indeed expected that the GPU should be used. The base version will have some GPU utilization, but it should be very low because it has not been configured to fully utilize the Apple Silicon GPU. All the patches in this thread are aimed at improving performance through greater GPU utilization. Regarding 2.0c, I'll have to find the time to review the changes in detail (likely won't have time until December, unfortunately), but I believe it should be a straightforward matter of adapting these patches to also do interpolation between frames.

@etanhanbaiki Here's the link for the fp16 package.

@samundra Sorry for the confusion, this is specifically about the `.mlpackage`, not the `.onnx` file.
-
Please put this on the discussion moving forward.
-
@hacksider how is this completed?
-
@visel I won't tell you, BUT the details are in my last reply.
-
Moved to discussion :)
-
Since lots of people are running into low frame rates (1-3 FPS) when using the live webcam mode, I thought I'd document a simple solution for improving this. This is especially relevant for macOS, since GPU acceleration with MPS isn't fully supported, but it might also be relevant to some Nvidia GPU setups. It would definitely be useful if you have no GPU (or a very weak GPU) and have to run things on the CPU.
**Explanation of the problem and solution**
Essentially, the live webcam mode is slow because it does everything sequentially and mostly on the CPU, meaning it has to skip a lot of frames while it is calculating. The problem lies in using just one CPU core to first find the face, then align it, then swap it, on repeat. Most machines have multiple CPU cores, and for a pre-recorded video they can easily be utilized by computing multiple frames in parallel (which the app already does). However, when streaming in frames live from the webcam, we don't have access to future frames (after all, they haven't happened yet), so we can't start computing them in parallel. We could wait for some to arrive and then compute those in parallel, but that introduces a delay/latency that feels just as bad as a low frame rate.
The solution is pipelining (the same technique that your CPU already uses at the instruction level to optimize memory fetching, etc.!). In simplified terms, start finding the face in the first frame as soon as it arrives, then when the second frame arrives start finding the face in it while concurrently aligning the face from the first frame, then swap the face in the first frame while concurrently aligning the face in the second frame and finding a face in the third frame, and so on. That roughly looks like this:
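For example, with the three stages spread across three CPU cores (a rough reconstruction; the frame counts match the numbers discussed next):

```
time →       t1      t2      t3      t4      t5
sequential:  find1   align1  swap1   find2   align2    ≈ 1.67 frames done
pipelined:
  core A:    find1   find2   find3   find4   find5
  core B:            align1  align2  align3  align4
  core C:                    swap1   swap2   swap3     = 3 frames done
```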
In this simplified representation, the pipelined approach finishes 3 frames in the same time that the sequential approach finishes 1.67 frames, both with the same latency (the time between something happening in the camera feed and the corresponding change being rendered in the face-swapped stream). Of course, in reality there are many more tasks than just `find`, `align`, and `swap`, plus not all the tasks take the same amount of time. This means they won't fit as nicely between the available CPU cores. Moreover, the frames definitely won't be streaming in at a rate that lines up with when the CPU cores are done processing them, so frames must be skipped and distributed smartly across the available cores to produce an even stream that doesn't freeze and jump in time. This can be done by keeping track of the moving average of computation time per frame and distributing the CPU cores evenly across frames from that timespan.

To try this out for yourself, set up the repo following the manual installation, open `modules/ui.py`, and replace the `create_webcam_preview` function with the code that makes the live webcam mode use pipelining (see the sketch below).
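In outline, the pipelined loop looks like this (a minimal sketch, not the full drop-in replacement; `cap`, `process_frame`, and `show_frame` stand in for the real capture, frame-processing pipeline, and UI update, and the real code also tracks FPS and latency):

```python
# Sketch of the pipelined webcam loop: N worker threads process whole frames
# in parallel; the main loop feeds them the newest camera frame and displays
# results, dropping any that finish out of order.
import queue
import threading
import time


def run_pipelined(cap, process_frame, show_frame, num_workers=4):
    frame_queue = queue.Queue(maxsize=1)  # holds only the freshest frame
    result_queue = queue.Queue(maxsize=num_workers)
    stop_event = threading.Event()

    def worker():
        while not stop_event.is_set():
            try:
                frame, captured_at = frame_queue.get(timeout=0.1)
            except queue.Empty:
                continue
            result_queue.put((process_frame(frame), captured_at))

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    last_shown = 0.0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        try:
            frame_queue.put_nowait((frame, time.perf_counter()))
        except queue.Full:
            pass  # all workers busy, so skip this frame
        try:
            processed, captured_at = result_queue.get_nowait()
        except queue.Empty:
            continue
        if captured_at < last_shown:
            continue  # discard frames that finished out of order
        last_shown = captured_at
        show_frame(processed)
    stop_event.set()
```

The key choices are the `maxsize=1` frame queue, so workers always grab the freshest camera frame, and the timestamp check, so frames that finish out of order are dropped rather than shown late.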
You'll need to experiment a bit with the number of threads you make available, depending on your system. On my M1 Pro with 10 CPU cores and 16 GB RAM, about 4 threads seems to be the sweet spot, so I run the app with `python run.py --execution-provider coreml --live-mirror --execution-threads 4`. Feel free to share what works with your setup to help others find a good number of threads.

Of course, this is a rather primitive fix. There are a number of ways it could/should be improved before being merged into the repo.
For one, the sequencing of frames is currently very primitive: it simply discards out-of-order frames, which is wasted work. For another, it only uses pipelining with either CPU cores or GPU cores, not both; a quick improvement would be combining both CPU and GPU cores. Finally, only FPS is improved for a smoother stream. The latency is not improved. For that, less work must be done per frame, either by reusing work (and doing quick estimates of some values from previous frames) or by making the work faster (e.g., quantization, parallelizing at the model graph level).