Description
As mentioned in #179, users need to serve multiple models. On a multi-GPU on-prem machine, I'd like to write a config file along the lines of:
```
CUDA_VISIBLE_DEVICES=0 MODEL=meta-llama/Llama-2-7b-chat-hf
CUDA_VISIBLE_DEVICES=1,2,3 MODEL=meta-llama/Llama-2-13b-chat-hf
```
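One possible implementation, as a minimal sketch: parse each config line into environment overrides and launch one worker process per entry, so each model is pinned to its own GPUs and the processes stay isolated. The `worker` module name and `launch_workers` helper below are hypothetical, just to illustrate the shape:

```python
import os
import subprocess

def launch_workers(config_path: str) -> list[subprocess.Popen]:
    """Launch one subprocess per config line, each pinned to its GPUs."""
    workers = []
    with open(config_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Each line is a list of KEY=VALUE pairs, e.g.
            # CUDA_VISIBLE_DEVICES=0 MODEL=meta-llama/Llama-2-7b-chat-hf
            env = dict(os.environ)
            for pair in line.split():
                key, _, value = pair.partition("=")
                env[key] = value
            # CUDA_VISIBLE_DEVICES in the child's env means each worker
            # only sees its own GPUs; "worker" is a hypothetical module.
            workers.append(subprocess.Popen(
                ["python", "-m", "worker", "--model", env["MODEL"]],
                env=env,
            ))
    return workers
```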
Then users should be able to specify `"model": "<either_model>"` in their requests.
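For example, a request might look like this (assuming an OpenAI-compatible chat completions endpoint; the host and port are illustrative):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        # The server would route this to the worker serving the 13B model.
        "model": "meta-llama/Llama-2-13b-chat-hf",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())
```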
I can start a PR if you want this feature. Let me know if you have any suggestions on the best way to load these models and keep them mostly separate from each other.