7 changes: 7 additions & 0 deletions README.md
@@ -34,6 +34,13 @@ Easy Model Deployer is a lightweight tool designed to simplify deploying **Open-So
- Different instance types (CPU/GPU/AWS Inferentia)
- Convenient integration (OpenAI Compatible API, LangChain client, etc.)

## Supported Models
<details>
<summary>DeepSeek Reasoning Model</summary>
<a href="https://github.com/deepseek-ai/DeepSeek-R1"><strong>DeepSeek-R1-Distill-Qwen-14B</strong></a>

</details>

## 🔧 Get Started

### Installation
377 changes: 377 additions & 0 deletions docs/en/api.md
@@ -0,0 +1,377 @@
# API Documentation

> **Getting Started**: To obtain the base URL and API key for your deployed models, run `emd status` in your terminal. The command will display a table with your deployed models and their details, including a link to retrieve the API key from AWS Secrets Manager. The base URL is shown at the bottom of the output.
>
> Example output:
> ```
> Models
> ┌────────────────────────┬───────────────────────────────────────────────────────────────────────┐
> │ Model ID │ Qwen2.5-0.5B-Instruct/dev │
> │ Status │ CREATE_COMPLETE │
> │ Service Type │ Amazon SageMaker AI Real-time inference with OpenAI Compatible API │
> │ Instance Type │ ml.g5.2xlarge │
> │ Create Time │ 2025-05-08 12:27:05 UTC │
> │ Query Model API Key │ https://console.aws.amazon.com/secretsmanager/secret?name=EMD-APIKey- │
> │ │ Secrets&region=us-east-1 │
> │ SageMakerEndpointName │ EMD-Model-qwen2-5-0-5b-instruct-endpoint │
> └────────────────────────┴───────────────────────────────────────────────────────────────────────┘
>
> Base URL
> http://your-emd-endpoint.region.elb.amazonaws.com/v1
> ```
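>
> To fetch the API key from the command line instead of following the console link, you can use the AWS CLI. A minimal sketch, assuming your AWS credentials are configured and using the secret name and region from the example output above:
>
> ```bash
> aws secretsmanager get-secret-value \
>   --secret-id EMD-APIKey-Secrets \
>   --region us-east-1 \
>   --query SecretString \
>   --output text
> ```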

## List Models

Returns a list of available models.

**Endpoint:** `GET /v1/models`

**Curl Example:**
```bash
curl https://BASE_URL/v1/models
```

**Python Example:**
```python
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",  # No API key is needed for listing models, but the client requires a value
    base_url="https://BASE_URL"
)

# List available models
models = client.models.list()
for model in models.data:
    print(model.id)
```

## Chat Completions

Create a model response for a conversation.

**Endpoint:** `POST /v1/chat/completions`

**Parameters:**

- `model` (required): ID of the model to use (e.g., "Qwen2.5-7B-Instruct/dev", "Llama-3.3-70B-Instruct/dev")
- `messages` (required): Array of message objects with `role` and `content`
- `temperature`: Sampling temperature (0-2, default: 1)
- `top_p`: Nucleus sampling parameter (0-1, default: 1)
- `n`: Number of chat completion choices to generate (default: 1)
- `stream`: Whether to stream partial progress (default: false)
- `stop`: Sequences where the API will stop generating
- `max_tokens`: Maximum number of tokens to generate
- `presence_penalty`: Penalty for new tokens based on presence (-2.0 to 2.0)
- `frequency_penalty`: Penalty for new tokens based on frequency (-2.0 to 2.0)
- `function_call`: Controls how the model responds to function calls
- `functions`: List of functions the model may generate JSON inputs for (see the function calling example at the end of this section)

**Curl Example:**
```bash
curl https://BASE_URL/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct/dev",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "temperature": 0.7
  }'
```

**Python Example:**
```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://BASE_URL"
)

# Create a chat completion
response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct/dev",  # Model ID with tag
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ],
    temperature=0.7,
    stream=False
)

# Print the response
print(response.choices[0].message.content)
```

**Streaming Example:**
```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://BASE_URL"
)

# Create a streaming chat completion
stream = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct/dev",  # Model ID with tag
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about AI."}
    ],
    stream=True
)

# Process the stream
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
print()
```
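
The `functions` and `function_call` parameters follow the legacy OpenAI function-calling convention. The sketch below assumes the deployed model supports function calling; the `get_weather` function and its schema are illustrative, not part of EMD.

**Function Calling Example:**
```python
import json
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://BASE_URL"
)

# An illustrative function schema the model may generate JSON arguments for
functions = [
    {
        "name": "get_weather",  # hypothetical function, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
]

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct/dev",  # Model ID with tag
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    functions=functions
)

# If the model chose to call the function, its name and arguments are returned
call = response.choices[0].message.function_call
if call is not None:
    print(call.name, json.loads(call.arguments))
```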

## Embeddings

Get vector representations of text.

**Endpoint:** `POST /v1/embeddings`

**Parameters:**

- `model` (required): ID of the model to use (e.g., "bge-m3/dev")
- `input` (required): Input text to embed or array of texts
- `user`: A unique identifier for the end-user

**Curl Example:**
```bash
curl https://BASE_URL/v1/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-m3/dev",
    "input": "The food was delicious and the waiter..."
  }'
```

**Python Example:**
```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://BASE_URL"
)

# Get embeddings for a single text
response = client.embeddings.create(
    model="bge-m3/dev",  # Embedding model ID with tag
    input="The food was delicious and the service was excellent."
)

# Print the embedding vector
print(response.data[0].embedding)

# Get embeddings for multiple texts
response = client.embeddings.create(
    model="bge-m3/dev",  # Embedding model ID with tag
    input=[
        "The food was delicious and the service was excellent.",
        "The restaurant was very expensive and the food was mediocre."
    ]
)

# Print the number of embeddings
print(f"Generated {len(response.data)} embeddings")
```
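
Embedding vectors are typically compared with cosine similarity. A short sketch that reuses the `response` object from the multi-text example above (plain Python, no extra dependencies):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vec_a = response.data[0].embedding
vec_b = response.data[1].embedding
print(f"Similarity: {cosine_similarity(vec_a, vec_b):.4f}")
```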

## Rerank

Rerank a list of documents based on their relevance to a query.

**Endpoint:** `POST /v1/rerank`

**Parameters:**

- `model` (required): ID of the model to use (e.g., "bge-reranker-v2-m3/dev")
- `query` (required): The search query
- `documents` (required): List of documents to rerank
- `max_rerank`: Maximum number of documents to rerank (default: all)
- `return_metadata`: Whether to return metadata (default: false)

**Curl Example:**
```bash
curl https://BASE_URL/v1/rerank \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-reranker-v2-m3/dev",
    "query": "What is the capital of France?",
    "documents": [
      "Paris is the capital of France.",
      "Berlin is the capital of Germany.",
      "London is the capital of England."
    ]
  }'
```

**Python Example:**

The OpenAI Python SDK does not provide a rerank method, so this example calls the endpoint directly with `requests`. The response is assumed to carry a `data` list whose entries expose `document` and `relevance_score` fields:

```python
import requests

url = "https://BASE_URL/v1/rerank"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}
payload = {
    "model": "bge-reranker-v2-m3/dev",  # Reranking model ID with tag
    "query": "What is the capital of France?",
    "documents": [
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
        "London is the capital of England."
    ],
    "max_rerank": 3
}

response = requests.post(url, headers=headers, json=payload)

# Print the reranked documents with their relevance scores
for result in response.json()["data"]:
    print(f"Document: {result['document']}")
    print(f"Relevance Score: {result['relevance_score']}")
    print("---")
```

## Invocations

General-purpose endpoint for model invocations.

**Endpoint:** `POST /v1/invocations`

**Parameters:**

- `model` (required): ID of the model to use
- `input`: Input data for the model
- `parameters`: Additional parameters for the model

**Curl Example:**
```bash
curl https://BASE_URL/v1/invocations \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct/dev",
    "input": {
      "query": "What is machine learning?"
    },
    "parameters": {
      "max_tokens": 100
    }
  }'
```

**Python Example:**
```python
import requests
import json

# Set up the API endpoint and headers
url = "https://BASE_URL/v1/invocations"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

# Prepare the payload
payload = {
    "model": "Qwen2.5-7B-Instruct/dev",  # Model ID with tag
    "input": {
        "query": "What is machine learning?"
    },
    "parameters": {
        "max_tokens": 100
    }
}

# Make the API call
response = requests.post(url, headers=headers, data=json.dumps(payload))

# Print the response
print(response.json())
```

## Vision Models

Process images along with text prompts.

**Endpoint:** `POST /v1/chat/completions`

**Parameters:**
Same as Chat Completions, but with messages that include image content.
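
**Curl Example:**

A curl equivalent of the Python example below; a sketch assuming a Unix shell with a `base64` utility that supports `-w0` (GNU coreutils):

```bash
# Encode a local image as base64 and send it inline as a data URL
IMAGE_B64=$(base64 -w0 path/to/your/image.jpg)

curl https://BASE_URL/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-VL-7B-Instruct/dev",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,'"$IMAGE_B64"'"}
          }
        ]
      }
    ]
  }'
```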

**Python Example:**
```python
from openai import OpenAI
import base64

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "path/to/your/image.jpg"
base64_image = encode_image(image_path)

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://BASE_URL"
)

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct/dev",  # Vision model ID with tag
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)
```

## Audio Transcription

Transcribe audio files to text.

**Endpoint:** `POST /v1/audio/transcriptions`
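
**Curl Example:**

The endpoint accepts a multipart form upload in the OpenAI style; a minimal sketch under that assumption:

```bash
curl https://BASE_URL/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F model="whisper-large-v3/dev" \
  -F file="@path/to/audio.mp3"
```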

**Python Example:**
```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://BASE_URL"
)

audio_file_path = "path/to/audio.mp3"
with open(audio_file_path, "rb") as audio_file:
    response = client.audio.transcriptions.create(
        model="whisper-large-v3/dev",  # ASR model ID with tag
        file=audio_file
    )

print(response.text)  # Transcribed text
```