Commit 65b6c8c

Fix regression to Istio deployment caused by recent commits (vllm-project#558)
1 parent 3ba6641 commit 65b6c8c

File tree

6 files changed: +276 additions, −170 deletions

deploy/kubernetes/istio/README.md

Lines changed: 13 additions & 9 deletions
@@ -1,14 +1,14 @@
 # vLLM Semantic Router as ExtProc server for Istio Gateway

-This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vsr) with Istio Gateway on Kubernetes. Istio Gateway uses Envoy under the covers so it is possible to use vsr with it. There are multiple topologies possible to combine Istio Gateway with vsr. This document describes one of the common topologies.
+This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vsr) with Istio Gateway on Kubernetes. Istio Gateway uses Envoy under the covers, so it is possible to use vsr with it. However, Envoy-based gateways differ in how they process the ExtProc protocol, so the deployment described here differs from the vsr deployments alongside other Envoy-based gateways covered in the other guides in this repo. There are multiple possible architectures for combining Istio Gateway with vsr; this document describes one of them.

 ## Architecture Overview

 The deployment consists of:

-- **vLLM Semantic Router**: Provides intelligent request routing and classification
-- **Istio Gateway**: Istio Gateway that uses an Envoy proxy under the covers
-- **Gateway API Inference Extension**: Additional control and data plane for endpoint picking that can optionally attach to the same Istio gateway as vLLM Semantic Router.
+- **vLLM Semantic Router**: Provides intelligent request routing and processing decisions to Envoy-based gateways
+- **Istio Gateway**: Istio's implementation of the Kubernetes Gateway API that uses an Envoy proxy under the covers
+- **Gateway API Inference Extension**: Additional APIs that extend the Gateway API for inference via ExtProc servers
 - **Two instances of vLLM serving 1 model each**: Example backend LLMs for illustrating semantic routing in this topology

 ## Prerequisites
@@ -43,14 +43,18 @@ $ kubectl wait --for=condition=Ready nodes --all --timeout=300s

 ## Step 2: Deploy LLM models service

-As noted earlier in this exercise we deploy two LLMs viz. a llama3-8b model (meta-llama/Llama-3.1-8B-Instruct) and a phi4-mini model (microsoft/Phi-4-mini-instruct). In this exercise we chose to serve these models using two separate instances of the [vLLM inference server](https://docs.vllm.ai/en/latest/) running in the default namespace of the kubernetes cluster. For this exercise you may choose to use any inference server to serve these models but we have provided manifests to run these in vLLM containers as a reference.
+In this exercise we deploy two LLMs: a llama3-8b model (meta-llama/Llama-3.1-8B-Instruct) and a phi4-mini model (microsoft/Phi-4-mini-instruct). We serve these models using two separate instances of the [vLLM inference server](https://docs.vllm.ai/en/latest/) running in the default namespace of the Kubernetes cluster. You may choose any other inference engine as long as it exposes OpenAI API endpoints. First create a secret for your HuggingFace token, previously stored in the env variable HF_TOKEN, and then deploy the models as shown below.
+
+```bash
+kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN
+```

 ```bash
 # Create vLLM service running llama3-8b
 kubectl apply -f deploy/kubernetes/istio/vLlama3.yaml
 ```

-This may take several (10+) minutes the first time this is run to download the model up until the vLLM pod running this model is in READY state. Similarly also deploy the second LLM (phi4-mini) and wait for several minutes until the pod is in READY state..
+The first time this is run it may take several (10+) minutes to download the model before the vLLM pod running it reaches the READY state. Similarly deploy the second LLM (phi4-mini) and wait until its pod is also READY.

 ```bash
 # Create vLLM service running phi4-mini
@@ -76,9 +80,9 @@ llama-8b   ClusterIP   10.108.250.109   <none>
 phi4-mini   ClusterIP   10.97.252.33     <none>   80/TCP   9d
 ```

-## Step 3: Update vsr config if needed
+## Step 3: Update vsr config

-The file deploy/kubernetes/istio/config.yaml will get used to configure vsr when it is installed in the next step. The example config file provided already in this repo should work if you use the same LLMs as in this exercise but you can choose to play with this config to enable or disable individual vsr features. Ensure that your vllm_endpoints in the file match the IP/port of the LLM services you are running. It is usually good to start with basic features of vsr such as prompt classification and model routing before experimenting with other features as described elsewhere in the vsr documentation.
+The file deploy/kubernetes/istio/config.yaml will be used to configure vsr when it is installed in the next step. Ensure that the models in the config file match the models you are using, and that the vllm_endpoints in the file match the IP/port of the LLM Kubernetes services you are running. It is usually best to start with basic vsr features such as prompt classification and model routing before experimenting with other features such as PromptGuard or ToolCalling.

 ## Step 4: Deploy vLLM Semantic Router

@@ -99,7 +103,7 @@ kubectl get pods -n vllm-semantic-router-system

 We will use a recent build of Istio for this exercise so that we have the option of also using the v1.0.0 GA version of the Gateway API Inference Extension CRDs and EPP functionality.

-Follow the procedures described in the Gateway API [Inference Extensions documentation](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to deploy the 1.28 (or newer) version of Istio Gateway, the Kubernetes Gateway API CRDs and the Gateway API Inference Extension v1.0.0. Do not install any of the HTTPRoute resources from that guide however, just use it to deploy the Istio gateway and CRDs. If installed correctly you should see the api CRDs for gateway api and inference extension as well as pods running for the Istio gateway and Istiod using the commands shown below.
+Follow the procedures described in the Gateway API [Inference Extensions documentation](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to deploy the 1.28 (or newer) version of the Istio control plane, the Istio Gateway, the Kubernetes Gateway API CRDs and the Gateway API Inference Extension v1.0.0. Do not install any of the HTTPRoute resources from that guide, however; just use it to deploy the Istio gateway and CRDs. If installed correctly you should see the API CRDs for the Gateway API and Inference Extension, as well as pods running for the Istio gateway and Istiod, using the commands shown below.

 ```bash
 kubectl get crds | grep gateway
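The category-to-model routing that Step 3's config drives can be sketched roughly as follows. This is a hypothetical illustration, not vsr's actual implementation; the model names and scores are taken from the example config in this commit, and the `route` helper is invented for the sketch:

```python
# Hypothetical sketch, NOT vsr's actual code: for a classified category,
# pick the highest-scoring model from model_scores; fall back to
# default_model when the category is unknown.
CONFIG = {
    "default_model": "llama3-8b",
    "categories": {
        # (model, score) pairs from the example config in this commit
        "math": [("phi4-mini", 1.0)],
        "business": [("llama3-8b", 0.8), ("phi4-mini", 0.3)],
    },
}

def route(category: str) -> str:
    """Return the model that should serve a prompt of the given category."""
    scores = CONFIG["categories"].get(category)
    if not scores:
        return CONFIG["default_model"]  # no category match -> default model
    return max(scores, key=lambda pair: pair[1])[0]
```

With the scores above, a math prompt would be routed to phi4-mini while a business prompt stays on llama3-8b.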

deploy/kubernetes/istio/config.yaml

Lines changed: 84 additions & 28 deletions
@@ -1,15 +1,18 @@
 bert_model:
-  model_id: sentence-transformers/all-MiniLM-L12-v2
+  model_id: models/all-MiniLM-L12-v2
   threshold: 0.6
   use_cpu: true

 semantic_cache:
   enabled: false
-  backend_type: "memory" # Options: "memory" or "milvus"
+  backend_type: "memory" # Options: "memory" or "milvus"
   similarity_threshold: 0.8
-  max_entries: 1000 # Only applies to memory backend
+  max_entries: 1000 # Only applies to memory backend
   ttl_seconds: 3600
-  eviction_policy: "fifo"
+  eviction_policy: "fifo"
+  # Embedding model for semantic similarity matching
+  # Options: "bert" (fast, 384-dim), "qwen3" (high quality, 1024-dim, 32K context), "gemma" (balanced, 768-dim, 8K context)
+  embedding_model: "bert" # Default: BERT (fastest, lowest memory for Kubernetes)

 tools:
   enabled: false
@@ -19,7 +22,7 @@ tools:
   fallback_to_empty: true

 prompt_guard:
-  enabled: false
+  enabled: false # Global default - can be overridden per category with jailbreak_enabled
   use_modernbert: true
   model_id: "models/jailbreak_classifier_modernbert-base_model"
   threshold: 0.7
@@ -32,23 +35,25 @@ prompt_guard:
 # NOT supported: domain names (example.com), protocol prefixes (http://), paths (/api), ports in address (use 'port' field)
 vllm_endpoints:
   - name: "endpoint1"
-    address: "10.104.192.205" # IPv4 address - REQUIRED format
+    address: "10.98.150.102" # Static IPv4 of llama3-8b k8s service
     port: 80
     weight: 1
   - name: "endpoint2"
-    address: "10.99.27.202" # IPv4 address - REQUIRED format
+    address: "10.98.118.242" # Static IPv4 of phi4-mini k8s service
     port: 80
     weight: 1

 model_config:
   "llama3-8b":
+    # reasoning_family: "" # This model uses Qwen-3 reasoning syntax
     preferred_endpoints: ["endpoint1"]
     pii_policy:
-      allow_by_default: false
+      allow_by_default: true
   "phi4-mini":
+    # reasoning_family: "" # This model uses Qwen-3 reasoning syntax
     preferred_endpoints: ["endpoint2"]
     pii_policy:
-      allow_by_default: false
+      allow_by_default: true

 # Classifier configuration
 classifier:
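The address rule quoted in this hunk (a bare IPv4 address only; no domain names, protocol prefixes, paths, or inline ports) can be checked mechanically. A small illustrative validator, not part of vsr:

```python
import re

# Illustrative check (NOT vsr code) for the vllm_endpoints address rule:
# the value must be a bare dotted-quad IPv4 address, with the port given
# separately in the 'port' field.
_IPV4 = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$")

def is_valid_endpoint_address(address: str) -> bool:
    """True only for a bare IPv4 address like '10.98.150.102'."""
    m = _IPV4.match(address)
    # Reject anything with scheme/host/path/port, and octets above 255.
    return bool(m) and all(int(octet) <= 255 for octet in m.groups())
```

Running the service ClusterIPs from Step 2 through a check like this before installing vsr catches the most common misconfiguration (pasting a URL or `host:port` into the address field).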
@@ -68,83 +73,116 @@ classifier:
 # Categories with new use_reasoning field structure
 categories:
   - name: business
+    system_prompt: "You are a senior business consultant and strategic advisor with expertise in corporate strategy, operations management, financial analysis, marketing, and organizational development. Provide practical, actionable business advice backed by proven methodologies and industry best practices. Consider market dynamics, competitive landscape, and stakeholder interests in your recommendations."
+    # jailbreak_enabled: true # Optional: Override global jailbreak detection per category
+    # jailbreak_threshold: 0.8 # Optional: Override global jailbreak threshold per category
     model_scores:
-      - model: llama3-8b
+      - model: llama3-8b
         score: 0.8
-        use_reasoning: false # Business performs better without reasoning
-      - model: phi4-mini
+        use_reasoning: false # Business performs better without reasoning
+      - model: phi4-mini
         score: 0.3
-        use_reasoning: false # Business performs better without reasoning
+        use_reasoning: false # Business performs better without reasoning
   - name: law
+    system_prompt: "You are a knowledgeable legal expert with comprehensive understanding of legal principles, case law, statutory interpretation, and legal procedures across multiple jurisdictions. Provide accurate legal information and analysis while clearly stating that your responses are for informational purposes only and do not constitute legal advice. Always recommend consulting with qualified legal professionals for specific legal matters."
     model_scores:
       - model: llama3-8b
-        score: 0.7
+        score: 0.4
         use_reasoning: false
   - name: psychology
+    system_prompt: "You are a psychology expert with deep knowledge of cognitive processes, behavioral patterns, mental health, developmental psychology, social psychology, and therapeutic approaches. Provide evidence-based insights grounded in psychological research and theory. When discussing mental health topics, emphasize the importance of professional consultation and avoid providing diagnostic or therapeutic advice."
+    semantic_cache_enabled: true
+    semantic_cache_similarity_threshold: 0.92 # High threshold for psychology - sensitive to nuances
     model_scores:
       - model: llama3-8b
-        score: 0.7
+        score: 0.6
         use_reasoning: false
   - name: biology
+    system_prompt: "You are a biology expert with comprehensive knowledge spanning molecular biology, genetics, cell biology, ecology, evolution, anatomy, physiology, and biotechnology. Explain biological concepts with scientific accuracy, use appropriate terminology, and provide examples from current research. Connect biological principles to real-world applications and emphasize the interconnectedness of biological systems."
     model_scores:
       - model: llama3-8b
         score: 0.9
         use_reasoning: false
   - name: chemistry
+    system_prompt: "You are a chemistry expert specializing in chemical reactions, molecular structures, and laboratory techniques. Provide detailed, step-by-step explanations."
     model_scores:
       - model: llama3-8b
         score: 0.6
-        use_reasoning: false # Enable reasoning for complex chemistry
+        use_reasoning: false # Enable reasoning for complex chemistry
   - name: history
+    system_prompt: "You are a historian with expertise across different time periods and cultures. Provide accurate historical context and analysis."
     model_scores:
       - model: llama3-8b
         score: 0.7
         use_reasoning: false
   - name: other
+    system_prompt: "You are a helpful and knowledgeable assistant. Provide accurate, helpful responses across a wide range of topics."
+    semantic_cache_enabled: true
+    semantic_cache_similarity_threshold: 0.75 # Lower threshold for general chat - less sensitive
     model_scores:
       - model: llama3-8b
         score: 0.7
         use_reasoning: false
   - name: health
+    system_prompt: "You are a health and medical information expert with knowledge of anatomy, physiology, diseases, treatments, preventive care, nutrition, and wellness. Provide accurate, evidence-based health information while emphasizing that your responses are for educational purposes only and should never replace professional medical advice, diagnosis, or treatment. Always encourage users to consult healthcare professionals for medical concerns and emergencies."
+    semantic_cache_enabled: true
+    semantic_cache_similarity_threshold: 0.95 # High threshold for health - very sensitive to word changes
     model_scores:
       - model: llama3-8b
         score: 0.5
         use_reasoning: false
   - name: economics
+    system_prompt: "You are an economics expert with deep understanding of microeconomics, macroeconomics, econometrics, financial markets, monetary policy, fiscal policy, international trade, and economic theory. Analyze economic phenomena using established economic principles, provide data-driven insights, and explain complex economic concepts in accessible terms. Consider both theoretical frameworks and real-world applications in your responses."
     model_scores:
       - model: llama3-8b
-        score: 0.8
+        score: 1.0
         use_reasoning: false
   - name: math
+    system_prompt: "You are a mathematics expert. Provide step-by-step solutions, show your work clearly, and explain mathematical concepts in an understandable way."
     model_scores:
       - model: phi4-mini
-        score: 0.8
-        use_reasoning: false
-      - model: llama3-8b
-        score: 0.3
-        use_reasoning: false
+        score: 1.0
+        use_reasoning: false # Enable reasoning for complex math
   - name: physics
+    system_prompt: "You are a physics expert with deep understanding of physical laws and phenomena. Provide clear explanations with mathematical derivations when appropriate."
     model_scores:
       - model: llama3-8b
         score: 0.7
-        use_reasoning: false
+        use_reasoning: false # Enable reasoning for physics
   - name: computer science
+    system_prompt: "You are a computer science expert with knowledge of algorithms, data structures, programming languages, and software engineering. Provide clear, practical solutions with code examples when helpful."
     model_scores:
       - model: llama3-8b
-        score: 0.7
+        score: 0.6
         use_reasoning: false
   - name: philosophy
+    system_prompt: "You are a philosophy expert with comprehensive knowledge of philosophical traditions, ethical theories, logic, metaphysics, epistemology, political philosophy, and the history of philosophical thought. Engage with complex philosophical questions by presenting multiple perspectives, analyzing arguments rigorously, and encouraging critical thinking. Draw connections between philosophical concepts and contemporary issues while maintaining intellectual honesty about the complexity and ongoing nature of philosophical debates."
     model_scores:
       - model: llama3-8b
         score: 0.5
         use_reasoning: false
   - name: engineering
+    system_prompt: "You are an engineering expert with knowledge across multiple engineering disciplines including mechanical, electrical, civil, chemical, software, and systems engineering. Apply engineering principles, design methodologies, and problem-solving approaches to provide practical solutions. Consider safety, efficiency, sustainability, and cost-effectiveness in your recommendations. Use technical precision while explaining concepts clearly, and emphasize the importance of proper engineering practices and standards."
     model_scores:
       - model: llama3-8b
         score: 0.7
         use_reasoning: false

-default_model: llama3-8b
+default_model: "llama3-8b"
+
+# Auto model name for automatic model selection (optional)
+# This is the model name that clients should use to trigger automatic model selection
+# If not specified, defaults to "MoM" (Mixture of Models)
+# For backward compatibility, "auto" is always accepted as an alias
+# Example: auto_model_name: "MoM" # or any other name you prefer
+# auto_model_name: "MoM"
+
+# Include configured models in /v1/models list endpoint (optional, default: false)
+# When false (default): only the auto model name is returned in the /v1/models endpoint
+# When true: all models configured in model_config are also included in the /v1/models endpoint
+# This is useful for clients that need to discover all available models
+# Example: include_config_models_in_list: true
+# include_config_models_in_list: false

 # Reasoning family configurations
 reasoning_families:
@@ -166,6 +204,9 @@ reasoning_families:
 # Global default reasoning effort level
 default_reasoning_effort: high

+# Gateway route cache clearing
+clear_route_cache: true # Enable for some gateways such as Istio
+
 # API Configuration
 api:
   batch_classification:
@@ -177,8 +218,23 @@ api:
     detailed_goroutine_tracking: true
     high_resolution_timing: false
     sample_rate: 1.0
-    duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
+    duration_buckets:
+      [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
     size_buckets: [1, 2, 5, 10, 20, 50, 100, 200]

-# Gateway route cache clearing
-clear_route_cache: true # Enable for testing
+# Observability Configuration
+observability:
+  tracing:
+    enabled: true # Enable distributed tracing for docker-compose stack
+    provider: "opentelemetry" # Provider: opentelemetry, openinference, openllmetry
+    exporter:
+      type: "otlp" # Export spans to Jaeger (via OTLP gRPC)
+      endpoint: "jaeger:4317" # Jaeger collector inside compose network
+      insecure: true # Use insecure connection (no TLS)
+    sampling:
+      type: "always_on" # Sampling: always_on, always_off, probabilistic
+      rate: 1.0 # Sampling rate for probabilistic (0.0-1.0)
+    resource:
+      service_name: "vllm-semantic-router"
+      service_version: "v0.1.0"
+      deployment_environment: "development"
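The per-category semantic-cache thresholds added in this config (0.95 for health, 0.92 for psychology, 0.75 for other, with the global similarity_threshold of 0.8 as the fallback) imply a simple decision rule. A hypothetical sketch, not vsr's actual cache code:

```python
# Hypothetical sketch (NOT vsr code) of the per-category cache decision:
# a cached answer is reused only when the embedding similarity of the new
# prompt meets the category's threshold, falling back to the global
# similarity_threshold when the category does not override it.
GLOBAL_THRESHOLD = 0.8  # semantic_cache.similarity_threshold
CATEGORY_THRESHOLDS = {
    "health": 0.95,      # very sensitive to word changes
    "psychology": 0.92,  # sensitive to nuances
    "other": 0.75,       # general chat - less sensitive
}

def cache_hit(category: str, similarity: float) -> bool:
    """True if a cached response may be served for this prompt."""
    return similarity >= CATEGORY_THRESHOLDS.get(category, GLOBAL_THRESHOLD)
```

This is why a near-duplicate health question (similarity 0.93) still goes to the model, while the same similarity on a general-chat prompt would be served from cache.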

0 commit comments
