
Commit e9c12e3

Phi-4 basic inference with native/vllm (#1563)
1 parent 570afc2 commit e9c12e3

File tree: 3 files changed (+66, -2 lines)
README: 10 additions & 2 deletions
@@ -1,6 +1,9 @@
-# Phi-4-multimodal-instruct
+# **Phi-4-multimodal-instruct 5.6B**
 
-Configs for Phi-4-multimodal-instruct 5.6B model. See https://huggingface.co/microsoft/Phi-4-multimodal-instruct
+Configs for Phi-4-multimodal-instruct 5.6B model.
+🔗 **Reference:** [Phi-4-multimodal-instruct on Hugging Face](https://huggingface.co/microsoft/Phi-4-multimodal-instruct)
+
+---
 
 This is a multimodal model that combines text, visual, and audio inputs.
 It uses a "Mixture of LoRAs" approach, allowing you to plug in adapters for each
@@ -9,3 +12,8 @@ reading the following:
 
 - [Mixture-of-LoRAs](https://arxiv.org/abs/2403.03432)
 - [Phi-4 Multimodal Technical Report](https://arxiv.org/abs/2503.01743)
+
+⚠️ This model requires `flash attention 2`. Run the following if executing in a custom fashion:
+
+```sh
+pip install -U flash-attn --no-build-isolation
+```
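
Putting the pieces of this commit together, a minimal quick-start sketch for the native engine, using the install command above and the paths taken from the `infer.yaml` usage comment below:

```sh
# Flash Attention 2 build required by the model (see the README note above).
pip install -U flash-attn --no-build-isolation

# Interactive multimodal inference with the native-engine config added in this commit.
oumi infer -i -c configs/recipes/vision/phi4/inference/infer.yaml \
  --image "tests/testdata/images/the_great_wave_off_kanagawa.jpg"
```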
configs/recipes/vision/phi4/inference/infer.yaml: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Phi-4-multimodal-instruct 5.6B inference config.
#
# Requirements:
# - Run `pip install -U flash-attn --no-build-isolation`
#
# Usage:
#   oumi infer -i -c configs/recipes/vision/phi4/inference/infer.yaml \
#     --image "tests/testdata/images/the_great_wave_off_kanagawa.jpg"
#
# See Also:
# - Documentation: https://oumi.ai/docs/en/latest/user_guides/infer/infer.html
# - Config class: oumi.core.configs.InferenceConfig
# - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/inference_config.py
# - Other inference configs: configs/**/inference/

model:
  model_name: "microsoft/Phi-4-multimodal-instruct"
  torch_dtype_str: "bfloat16"
  model_max_length: 4096
  trust_remote_code: True
  attn_implementation: "flash_attention_2" # The model requires Flash Attention.

generation:
  max_new_tokens: 64
  batch_size: 1

engine: NATIVE
configs/recipes/vision/phi4/inference/vllm_infer.yaml: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Phi-4-multimodal-instruct 5.6B vLLM inference config.
#
# Requirements:
# - Run `pip install vllm`
# - Run `pip install -U flash-attn --no-build-isolation`
#
# Usage:
#   oumi infer -i -c configs/recipes/vision/phi4/inference/vllm_infer.yaml \
#     --image "tests/testdata/images/the_great_wave_off_kanagawa.jpg"
#
# See Also:
# - Documentation: https://oumi.ai/docs/en/latest/user_guides/infer/infer.html
# - Config class: oumi.core.configs.InferenceConfig
# - Config source: https://github.com/oumi-ai/oumi/blob/main/src/oumi/core/configs/inference_config.py
# - Other inference configs: configs/**/inference/

model:
  model_name: "microsoft/Phi-4-multimodal-instruct"
  torch_dtype_str: "bfloat16"
  model_max_length: 4096
  trust_remote_code: True
  attn_implementation: "flash_attention_2" # The model requires Flash Attention.

generation:
  max_new_tokens: 64
  batch_size: 1

engine: VLLM
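
The two configs share the same model and generation settings; they differ only in the `engine` field (NATIVE vs. VLLM) and the extra `pip install vllm` requirement. A minimal sketch of the equivalent vLLM run, using the commands and paths from the config's own comments:

```sh
# vLLM backend plus the Flash Attention 2 build the model requires.
pip install vllm
pip install -U flash-attn --no-build-isolation

# Interactive multimodal inference with the vLLM-engine config above.
oumi infer -i -c configs/recipes/vision/phi4/inference/vllm_infer.yaml \
  --image "tests/testdata/images/the_great_wave_off_kanagawa.jpg"
```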
