This sample application demonstrates an end-to-end Retrieval Augmented Generation application in Vespa, leveraging AWS Bedrock-hosted models.
This sample application focuses on the generation part of RAG, and builds upon the MS Marco passage ranking sample application. Please refer to that sample application for details on more advanced forms of retrieval, such as vector search and cross-encoder re-ranking. The generation steps in this sample application happen after retrieval, so the techniques there can easily be used in this application as well. For the purposes of this sample application, we will use a simple example of hybrid search and ranking to demonstrate Vespa capabilities.
For more details on using retrieval augmented generation in Vespa, please refer to the RAG in Vespa documentation page. For more on the general use of LLMs in Vespa, please refer to LLMs in Vespa.
This integration relies on the ability to invoke LLM endpoints with an OpenAI chat completions API from Vespa. At the time of writing, the only AWS Bedrock models that can be invoked with the OpenAI chat completions API are the OpenAI models gpt-oss-20b and gpt-oss-120b.
If you want to use another model, an alternative is to expose an OpenAI chat completions endpoint through a Bedrock access gateway. The same integration instructions apply after creating the endpoint.
Model availability may vary by region. The Bedrock runtime endpoint has the following format:
https://bedrock-runtime.{region}.amazonaws.com
You may want to co-locate your model endpoint with the AWS region where Vespa is deployed.
The default Vespa zone for this application is the dev environment in the aws-us-east-1 region.
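For that region, the Bedrock runtime endpoint used in the examples below is:
https://bedrock-runtime.us-east-1.amazonaws.com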
Create an AWS Bedrock API key.
You can test your endpoint with curl:
export AWS_BEARER_TOKEN_BEDROCK=ABSKQmVk....
curl -X POST https://bedrock-runtime.us-east-1.amazonaws.com/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AWS_BEARER_TOKEN_BEDROCK" \
-d '{
"model": "openai.gpt-oss-20b-1:0",
"messages": [
{
"role": "user",
"content": "Hello! How are you today?"
}
]
}'
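A successful call returns a standard OpenAI chat completion response. Trimmed down to the essential fields (the exact metadata and generated content will vary), the shape is roughly:
{
  "object": "chat.completion",
  "choices": [
    {
      "message": { "role": "assistant", "content": "Hello! I'm doing well, thank you. How can I help you today?" }
    }
  ]
}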
Once this test completes successfully, you can proceed to the next step.
The following is a quick-start recipe that uses a tiny slice of the MS Marco passage ranking dataset to showcase a RAG pattern leveraging AWS Bedrock models.
Please refer to the MS Marco passage ranking sample application for instructions on downloading the entire dataset.
In the following we will deploy the sample application to Vespa Cloud.
Make sure the Vespa CLI is installed, and update to the newest version:
$ brew install vespa-cli
Download this sample application:
$ vespa clone examples/aws-simple-rag bedrock-rag && cd bedrock-rag
Deploy the sample application to Vespa Cloud. Note that this application can fit within the free quota, so it is free to try.
In the following section, we will set the Vespa CLI target to the cloud. Make sure you have created a tenant at console.vespa-cloud.com. Make a note of the tenant's name; it will be used in the next steps. For more information, see the Vespa Cloud getting started guide.
Add your AWS Bedrock API key to the Vespa secret store as described in Secret Management. Unless you already have one, create a new vault, and add your AWS Bedrock API key as a secret.
The services.xml file must refer to the newly added secret in the secret store.
Replace <my-vault-name> and <my-secret-name> below with your own values:
<secrets>
    <bedrock-api-key vault="<my-vault-name>" name="<my-secret-name>"/>
</secrets>
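The secret is then referenced by the OpenAI client component in services.xml. The sketch below follows the client configuration described in the LLMs in Vespa documentation; the exact component id and config fields used by this sample may differ, so treat it as a sketch and compare with the services.xml that ships with the application:
<!-- Sketch: an OpenAI-compatible client pointed at the Bedrock endpoint.
     Field names follow the Vespa LLM client documentation; verify against this sample's services.xml. -->
<component id="openai" class="ai.vespa.llm.clients.OpenAI">
    <config name="ai.vespa.llm.clients.llm-client">
        <!-- Name of the secret defined in the secrets element above -->
        <apiKeySecretName>bedrock-api-key</apiKeySecretName>
        <!-- Assumed OpenAI-compatible Bedrock endpoint; adjust the region as needed -->
        <endpoint>https://bedrock-runtime.us-east-1.amazonaws.com/openai/v1</endpoint>
    </config>
</component>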
Configure the Vespa CLI. Replace tenant-name below with your tenant name.
We use the application name aws-app here, but you are free to choose your own
application name:
$ vespa config set target cloud
$ vespa config set application tenant-name.aws-app
Log in and add your public certificates to the application for data plane access:
$ vespa auth login
$ vespa auth cert
Grant the application access to the secret. The application must exist before access can be granted in the Vespa Cloud Console. The easiest way to create it is to deploy, which auto-creates the application; this first deployment will fail:
$ vespa deploy --wait 900
[09:47:43] warning Deployment failed: Invalid application: Vault 'my_vault' does not exist,
or application does not have access to it
At this point, open the console (the link looks like https://console.vespa-cloud.com/tenant/mytenant/account/secrets) and grant the application access to the vault.
Deploy the application again. It can take some time for all nodes to be provisioned:
$ vespa deploy --wait 900
The application should now be deployed!
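Optionally, verify that the application endpoint is up before feeding:
$ vespa status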
Let's feed the documents:
$ vespa feed ext/docs.jsonl
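Each line in ext/docs.jsonl is a Vespa document operation in JSON format. As an illustration only, a line might look roughly like the following; the document id namespace and field names here are assumptions, so inspect ext/docs.jsonl and the passage schema for the actual fields:
{"put": "id:msmarco:passage::0", "fields": {"id": 0, "text": "The Manhattan Project was a research and development undertaking during World War II ..."}}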
Run a query first to check the retrieval:
$ vespa query \
'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
'query=What is the Manhattan Project' \
'input.query(e)=embed(@query)' \
hits=3 \
language=en \
ranking=hybrid
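The hybrid rank profile used above combines lexical and vector signals. Conceptually, it looks something like the sketch below; the actual ranking expression, field names, and embedding dimensions are defined in the sample's passage schema, so treat all of them as assumptions:
rank-profile hybrid {
    inputs {
        query(e) tensor<float>(x[384])   # dimension is an assumption; it must match the embedding field
    }
    first-phase {
        # assumed combination of a lexical score (bm25) and semantic closeness to the query embedding
        expression: bm25(text) + closeness(field, embedding)
    }
}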
To test generation using the OpenAI client, post a query that runs the bedrock search chain:
$ vespa query \
'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
'query=What is the Manhattan Project' \
'input.query(e)=embed(@query)' \
hits=3 \
language=en \
ranking=hybrid \
searchChain=bedrock \
format=sse \
traceLevel=1 \
timeout=60s
Here, we specifically set the search chain to bedrock. This calls the RAGSearcher, which is set up to use the OpenAI client, as we are leveraging the AWS Bedrock OpenAI chat completions API endpoint. Note that this requires the AWS Bedrock API key. We also add a timeout, as token generation can take some time.
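For reference, the bedrock search chain wires the RAGSearcher to the OpenAI client component in services.xml, roughly along the lines of the sketch below (based on the RAG in Vespa documentation; the exact ids and configuration used by this sample may differ):
<!-- Sketch: a search chain that runs retrieval, then calls the LLM client for generation -->
<chain id="bedrock" inherits="vespa">
    <searcher id="ai.vespa.search.llm.RAGSearcher">
        <config name="ai.vespa.search.llm.llm-searcher">
            <!-- id of the OpenAI client component configured with the Bedrock endpoint and API key -->
            <providerId>openai</providerId>
        </config>
    </searcher>
</chain>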
You can also specify a structured output format for the LLM. In the example below, we provide a JSON schema to force the LLM to return the answer in 3 different formats:
- answer-short: a short answer to the question
- answer-short-french: a translation of the short answer in French
- answer-short-eli5: an explanation of the answer as if the user was 5 years old
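For readability, this is the JSON schema passed in the llm.json_schema parameter below, unescaped and pretty-printed:
{
  "type": "object",
  "properties": {
    "answer-short": { "type": "string" },
    "answer-short-french": { "type": "string", "description": "exact translation of short answer in French language" },
    "answer-short-eli5": { "type": "string", "description": "explain the answer like I am 5 years old" }
  },
  "required": ["answer-short", "answer-short-french", "answer-short-eli5"],
  "additionalProperties": false
}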
$ vespa query \
'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
'query=What is the Manhattan Project' \
'input.query(e)=embed(@query)' \
hits=3 \
language=en \
ranking=hybrid \
searchChain=bedrock \
format=sse \
llm.json_schema="{\"type\":\"object\",\"properties\":{\"answer-short\":{\"type\":\"string\"},\"answer-short-french\":{\"type\":\"string\",\"description\":\"exact translation of short answer in French language\"},\"answer-short-eli5\":{\"type\":\"string\",\"description\":\"explain the answer like I am 5 years old\"}},\"required\":[\"answer-short\",\"answer-short-french\",\"answer-short-eli5\"],\"additionalProperties\":false}" \
traceLevel=1 \
timeout=60s
The llm.json_schema parameter specifies the expected output of the LLM, defined in JSON Schema format, which lets you describe the structure the generated answer must follow.
The query parameters used above are:
- query: the query used both for retrieval and the prompt question.
- hits: the number of hits that Vespa should return in the retrieval stage.
- searchChain: the search chain set up in services.xml that calls the generative process.
- format: sets the format to server-sent events, which will stream the tokens as they are generated.
- traceLevel: outputs some debug information, such as the actual prompt that was sent to the LLM and token timing.
For more information on how to customize the prompt, please refer to the RAG in Vespa documentation.
To remove the application from Vespa Cloud:
$ vespa destroy
