`--enforce-eager` option slows down token generation #5

## Description

The current script passes `--enforce-eager` when launching the vLLM server:

https://github.com/Project-MONAI/VLM-Surgical-Agent-Framework/blob/4fa04340248da6f8f913d35066b47abc5d1d51cf/scripts/run_vllm_server.sh#L16

However, this flag disables [CUDA graph acceleration](https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/). As a result, token generation on an A6000 (Holoscan IGX, arm64) is very slow (11.2 tokens/s).
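For anyone who wants to reproduce the measurement, here is a rough sketch of a throughput check. It assumes things the issue does not state: the server exposes the OpenAI-compatible endpoint at `localhost:8000`, `jq` and GNU `date` are available, and `MODEL` is a placeholder for whatever model the server was started with. The helper names `tokens_per_second` and `measure_once` are made up for illustration. The timing covers the whole request, including prompt processing, so it slightly underestimates the pure generation rate.

```shell
#!/usr/bin/env bash
# Rough throughput check against a running vLLM OpenAI-compatible server.
# Assumptions (not from the issue): server at localhost:8000, jq installed,
# GNU date available, MODEL set to the served model name.

# Compute tokens per second from a token count and an elapsed time in seconds.
tokens_per_second() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f\n", t / s }'
}

# Send one completion request and print the measured rate.
# The elapsed time includes prompt processing, not only decoding.
measure_once() {
  local url="${VLLM_URL:-http://localhost:8000/v1/completions}"
  local start end tokens
  start=$(date +%s.%N)
  tokens=$(curl -s "$url" -H 'Content-Type: application/json' \
    -d "{\"model\": \"${MODEL}\", \"prompt\": \"Describe a surgical tool.\", \"max_tokens\": 256}" \
    | jq '.usage.completion_tokens')
  end=$(date +%s.%N)
  tokens_per_second "$tokens" "$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')"
}
```

Running `MODEL=<served-model-name> measure_once` once with the flag and once without it should be enough to see whether the difference reproduces on other hardware.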

## Possible solution

We could document what the flag does and what it costs in performance.

Alternatively, removing the flag raises throughput roughly 4x on the same hardware (from 11.2 to 45.9 tokens/s).
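A middle ground would be to make eager mode opt-in, since CUDA graph capture costs some extra GPU memory and startup time and some users may still want to turn it off. The sketch below is only an illustration: `ENFORCE_EAGER` and `build_vllm_args` are hypothetical names that do not exist in `run_vllm_server.sh`, and the launch line in the comment assumes the `vllm serve` entry point with a placeholder `MODEL`.

```shell
#!/usr/bin/env bash
# Hypothetical opt-in toggle for eager mode.
# ENFORCE_EAGER and build_vllm_args are illustrative names, not part of the repo script.

# Print the optional vLLM flags, one per line.
# By default nothing is printed, so CUDA graphs stay enabled.
build_vllm_args() {
  local args=()
  if [ "${ENFORCE_EAGER:-0}" = "1" ]; then
    # Eager mode: skips CUDA graph capture at the cost of slower decoding.
    args+=(--enforce-eager)
  fi
  printf '%s\n' "${args[@]}"
}

# Example launch (MODEL is a placeholder for the served model):
#   vllm serve "$MODEL" $(build_vllm_args)
```

With this shape the default launch gets the faster path, and `ENFORCE_EAGER=1` restores the current behavior on memory-constrained setups.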

