So here is my situation:
I have a somewhat heterogeneous Kubernetes cluster whose nodes have several different types of GPUs (let's assume types A and B). I would like to host some LLMs on that cluster and scale them depending on demand.
However, some of the models I want to provide need more memory than some of the GPUs offer (e.g. type A has only 48 GB while the model needs 60 GB), but those models can be sharded across multiple type-A GPUs, whereas they fit on a single GPU of type B (which provides 96 GB).
I could easily create two Deployments for such a model: one that requests multiple GPUs of type A, and another that requires only a single GPU of type B.
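For concreteness, a minimal sketch of what those two Deployments (plus a shared Service) might look like. The node label key `gpu.example.com/type`, the image name, and the port are placeholders; `nvidia.com/gpu` assumes the NVIDIA device plugin is installed, and your cluster's actual labels (e.g. from GPU Feature Discovery) will differ:

```yaml
# Same model, two Deployments, one per GPU type.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-model-x-gpu-a
spec:
  replicas: 1
  selector:
    matchLabels: {app: llm-model-x, gpu: type-a}
  template:
    metadata:
      labels: {app: llm-model-x, gpu: type-a}
    spec:
      nodeSelector:
        gpu.example.com/type: "A"     # assumed node label for the 48 GB cards
      containers:
      - name: server
        image: my-llm-server:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 2         # two 48 GB GPUs, model sharded across them
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-model-x-gpu-b
spec:
  replicas: 1
  selector:
    matchLabels: {app: llm-model-x, gpu: type-b}
  template:
    metadata:
      labels: {app: llm-model-x, gpu: type-b}
    spec:
      nodeSelector:
        gpu.example.com/type: "B"     # assumed node label for the 96 GB cards
      containers:
      - name: server
        image: my-llm-server:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1         # one 96 GB GPU fits the whole model
---
# A single Service can front both Deployments, since both carry app: llm-model-x,
# so clients don't care which GPU variant serves their request.
apiVersion: v1
kind: Service
metadata:
  name: llm-model-x
spec:
  selector:
    app: llm-model-x
  ports:
  - port: 8000
    targetPort: 8000
```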
And here is my problem/question:
Is there a way to make the autoscaler scale the total replica count across the combination of the type-A and type-B Deployments?
And/or does someone know whether there is a good way to achieve this, or whether this is simply out of scope for Kubernetes?