|
| 1 | +This file provides an elastic proxy demo that supports elastic scaling of P/D (prefill/decode) instances based on a KV pool. |
| 2 | + |
| 3 | +We can launch multiple vLLM instances (for example, 2 for prefill and 2 for decode), and then |
| 4 | +launch this proxy demo with: |
| 5 | + |
| 6 | +```shell |
| 7 | +export ADMIN_API_KEY=YOUR_ADMIN_API_KEY |
| 8 | +python3 examples/elastic_scaling/elastic_proxy.py \ |
| 9 | + --model $model_name \ |
| 10 | + --prefill localhost:8100 localhost:8101 \ |
| 11 | + --decode localhost:8200 localhost:8201 \ |
| 12 | + --port 8000 |
| 13 | +``` |
| 14 | + |
| 15 | +After the proxy is deployed, the log shows: |
| 16 | +```text |
| 17 | +INFO: Started server process [xxxxx] |
| 18 | +INFO: Waiting for application startup. |
| 19 | +INFO: Application startup complete. |
| 20 | +INFO: Uvicorn running on http://0.0.0.0:xxxx |
| 21 | +``` |
| 22 | + |
| 23 | +### Supported API routes |
| 24 | +* `/v1/completions`: send a completion request and get the response. |
| 25 | +* `/v1/chat/completions`: send a chat completion request and get the response. |
| 26 | +* `/status`: get the counts and lists of the current prefill and decode nodes. |
| 27 | +* `/instances/add`: add a prefill or decode node to the list. |
| 28 | +* `/instances/remove`: remove a prefill or decode node from the list. |
| 29 | + |
| 30 | +Examples: |
| 31 | +#### send a request and get the response |
| 32 | +```shell |
| 33 | +# /v1/completions |
| 34 | +curl -X POST http://0.0.0.0:8000/v1/completions \ |
| 35 | +-H "Content-Type: application/json" \ |
| 36 | +-d '{"model": "'$model_name'", "max_tokens": 50, "prompt": "hello"}' |
| 37 | + |
| 38 | +# /v1/chat/completions |
| 39 | +curl -X POST http://0.0.0.0:8000/v1/chat/completions \ |
| 40 | +-H "Content-Type: application/json" \ |
| 41 | +-d '{"model": "'$model_name'", "max_tokens": 50, |
| 42 | + "messages": [{ |
| 43 | + "role": "user", |
| 44 | + "content": "hello" |
| 45 | + }]}' |
| 46 | +``` |
| 47 | + |
| 48 | +#### get server status |
| 49 | +```shell |
| 50 | +# /status |
| 51 | +curl -X POST http://0.0.0.0:8000/status |
| 52 | +``` |
| 53 | +The response: |
| 54 | +```text |
| 55 | +{"prefill_node_count":x,"decode_node_count":x,"prefill_nodes":[xx.xx.xx.xx:xxxx],"decode_nodes":[xx.xx.xx.xx:xxxx]} |
| 56 | +``` |
| 57 | + |
| 58 | +#### add nodes to the server |
| 59 | +```shell |
| 60 | +# /instances/add |
| 61 | +curl -X POST http://0.0.0.0:8000/instances/add \ |
| 62 | +-H "Content-Type: application/json" \ |
| 63 | +-H "X-Api-Key: YOUR_ADMIN_API_KEY" \ |
| 64 | +-d '{"type": "prefill", "instance": "0.0.0.0:8100"}' |
| 65 | +``` |
| 66 | +* Case 1: If the node is not available, the server waits for the node to become available: |
| 67 | +```text |
| 68 | +INFO: Verifying xx.xx.xx.xx:xxxx ... |
| 69 | +ERROR: Cannot connect to host xx.xx.xx.xx:xxxx ... |
| 70 | +INFO: Waiting for prefill_instance xx.xx.xx.xx:xxxx to start. |
| 71 | +INFO: Verifying xx.xx.xx.xx:xxxx ... |
| 72 | +... |
| 73 | +``` |
| 74 | +The response: |
| 75 | +```text |
| 76 | +{"message":"Waiting for prefill_instance xx.xx.xx.xx:xxxx to start."} |
| 77 | +``` |
| 78 | +* Case 2: If the node is available, the server tries to add it: |
| 79 | +```text |
| 80 | +INFO: Verifying xx.xx.xx.xx:xxxx ... |
| 81 | +INFO: Instance: xx.xx.xx.xx:xxxx could be added. |
| 82 | +INFO: Added xx.xx.xx.xx:xxxx to prefill_instances. prefill node counts: x, decode node counts: x |
| 83 | +``` |
| 84 | +If the node has already been added to the server: |
| 85 | +```text |
| 86 | +INFO: prefill_instance xx.xx.xx.xx:xxxx already exists. |
| 87 | +``` |
| 88 | +The response: |
| 89 | +```text |
| 90 | +{"message":"Added xx.xx.xx.xx:xxxx to prefill_instances."} |
| 91 | +``` |
| 92 | + |
| 93 | +#### remove nodes from the server |
| 94 | +```shell |
| 95 | +# /instances/remove |
| 96 | +curl -X POST http://0.0.0.0:8000/instances/remove \ |
| 97 | +-H "Content-Type: application/json" \ |
| 98 | +-H "X-Api-Key: YOUR_ADMIN_API_KEY" \ |
| 99 | +-d '{"type": "prefill", "instance": "0.0.0.0:8100"}' |
| 100 | +``` |
| 101 | +After the node is removed: |
| 102 | +```text |
| 103 | +INFO: Removed xx.xx.xx.xx:xxxx from prefill_instances. prefill node counts: x, decode node counts: x |
| 104 | +``` |
| 105 | +The response: |
| 106 | +```text |
| 107 | +{"message":"Removed xx.xx.xx.xx:xxxx from prefill_instances."} |
| 108 | +``` |
| 109 | + |
| 110 | +### Supported functions |
| 111 | + |
| 112 | +* Adding prefill or decode nodes at any time. |
| 113 | + - If a prefill or decode server is already deployed, the proxy can add it as a node when the proxy is deployed. |
| 114 | + - If a prefill or decode server is deployed after the proxy, it can use the `/instances/add` API to join the proxy server. The prefill or decode server sends a signal to the proxy server, and the proxy server keeps checking the node's status until the node is available. |
| 115 | +* Removing nodes in the following two situations: |
| 116 | + - A node is removed when its prefill or decode server has failed more than a certain number of times. |
| 117 | + - A node is deleted from the proxy server through the `/instances/remove` API. |
| 118 | +* Elastic scaling. |
| 119 | + - When the current node is unavailable, the proxy server schedules the request to the next available node. |
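The behaviors listed above (adding nodes, removing them after repeated failures, and falling back to the next available node) can be sketched as a small round-robin pool. This is an illustrative sketch only, not the code in `elastic_proxy.py`: the `NodePool` class, the `dispatch` helper, the `max_failures` threshold, and the use of `ConnectionError` to signal an unavailable node are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class NodePool:
    """Hypothetical round-robin pool that drops nodes after repeated failures."""
    max_failures: int = 3  # assumed threshold; the real value is not documented
    nodes: list = field(default_factory=list)
    failures: dict = field(default_factory=dict)
    _cursor: int = 0

    def add(self, node: str) -> bool:
        """Add a node; return False if it already exists."""
        if node in self.nodes:
            return False
        self.nodes.append(node)
        self.failures[node] = 0
        return True

    def remove(self, node: str) -> bool:
        """Remove a node; return False if it is not in the pool."""
        if node not in self.nodes:
            return False
        self.nodes.remove(node)
        self.failures.pop(node, None)
        return True

    def report_failure(self, node: str) -> None:
        """Count a failure and drop the node once it exceeds max_failures."""
        if node not in self.failures:
            return
        self.failures[node] += 1
        if self.failures[node] > self.max_failures:
            self.remove(node)

    def report_success(self, node: str) -> None:
        """Reset the failure counter after a successful request."""
        if node in self.failures:
            self.failures[node] = 0

    def next_node(self) -> Optional[str]:
        """Return the next node in round-robin order, or None if the pool is empty."""
        if not self.nodes:
            return None
        node = self.nodes[self._cursor % len(self.nodes)]
        self._cursor += 1
        return node


def dispatch(pool: NodePool, send: Callable[[str], str]) -> str:
    """Try nodes in turn; `send` is expected to raise ConnectionError on failure."""
    for _ in range(len(pool.nodes)):
        node = pool.next_node()
        if node is None:
            break
        try:
            result = send(node)
        except ConnectionError:
            pool.report_failure(node)
            continue
        pool.report_success(node)
        return result
    raise RuntimeError("no available node")
```

In this sketch a node stays in the pool until its consecutive failures exceed `max_failures`, so a brief outage only causes requests to skip to the next node rather than removing the node outright.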