Hi,

Some of the models on Hugging Face show support for `create_chat_completion`, but this plugin currently seems to support only simple inference. Will Chat Completion be supported in a future version?

For reference, here is the example from the model card at https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF:
```python
from llama_cpp import Llama

# Set n_gpu_layers to the number of layers to offload to GPU.
# Set it to 0 if no GPU acceleration is available on your system.
llm = Llama(
    model_path="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # Download the model file first
    n_ctx=2048,       # The max sequence length to use; longer sequences require more resources
    n_threads=8,      # The number of CPU threads to use; tailor to your system
    n_gpu_layers=35,  # The number of layers to offload to GPU, if GPU acceleration is available
)

# Simple inference example, filling in the model card's prompt template
system_message = "You are a story writing assistant."
prompt = "Write a story about llamas."
output = llm(
    f"<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>",
    max_tokens=512,   # Generate up to 512 tokens
    stop=["</s>"],    # Example stop token - not necessarily correct for this specific model! Please check before using.
    echo=True,        # Whether to echo the prompt in the output
)
```
```python
# Chat Completion API
llm = Llama(
    model_path="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    chat_format="llama-2",  # Set chat_format according to the model you are using
)
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {"role": "user", "content": "Write a story about llamas."},
    ]
)
```
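In case it helps with the request, this is how I consume the chat response in my own scripts (a minimal sketch reusing the `llm` above; it assumes the response and the `stream=True` chunks follow the OpenAI-compatible format that llama-cpp-python returns):

```python
messages = [
    {"role": "system", "content": "You are a story writing assistant."},
    {"role": "user", "content": "Write a story about llamas."},
]

# Non-streaming: the reply text sits in an OpenAI-style response dict.
response = llm.create_chat_completion(messages=messages)
print(response["choices"][0]["message"]["content"])

# Streaming: with stream=True the call yields chunks whose "delta" dicts
# carry the incremental content (the first chunk holds only the role).
for chunk in llm.create_chat_completion(messages=messages, stream=True):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
```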