This sample application demonstrates an end-to-end Retrieval Augmented Generation application in Vespa, leveraging AWS Bedrock-hosted models.
This sample application focuses on the generation part of RAG, and builds upon the MS Marco passage ranking sample application. Please refer to that sample application for details on more advanced forms of retrieval, such as vector search and cross-encoder re-ranking. The generation steps in this sample application happen after retrieval, so the techniques there can easily be used in this application as well. For the purposes of this sample application, we will use a simple example of hybrid search and ranking to demonstrate Vespa capabilities.
For more details on using retrieval augmented generation in Vespa, please refer to the RAG in Vespa documentation page. For more on the general use of LLMs in Vespa, please refer to LLMs in Vespa.
This integration relies on the ability to invoke LLM endpoints with an OpenAI chat completions API from Vespa. At the time of writing, the only AWS Bedrock models that can be invoked with the OpenAI chat completions API are the OpenAI models gpt-oss-20b and gpt-oss-120b.
If you want to use another model, an alternative is to expose an OpenAI chat completions endpoint through a Bedrock access gateway. The same integration instructions apply after creating the endpoint.
Model availability may vary by region. The Bedrock runtime endpoint has the following format:
https://bedrock-runtime.{region}.amazonaws.com
You may want to co-locate your model endpoint with the AWS region where Vespa is deployed.
The default Vespa zone for this application is the dev environment in the aws-us-east-1 region.
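For that region, the Bedrock runtime endpoint used in the examples below is:
https://bedrock-runtime.us-east-1.amazonaws.com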
Create an AWS Bedrock API key.
You can test your endpoint with curl:
export AWS_BEARER_TOKEN_BEDROCK=ABSKQmVk....
curl -X POST https://bedrock-runtime.us-east-1.amazonaws.com/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $AWS_BEARER_TOKEN_BEDROCK" \
-d '{
"model": "openai.gpt-oss-20b-1:0",
"messages": [
{
"role": "user",
"content": "Hello! How are you today?"
}
]
}'
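A successful call returns a standard OpenAI chat completion response. Trimmed down to the essential fields (the exact metadata and generated content will vary), the shape is roughly:
{
  "object": "chat.completion",
  "choices": [
    {
      "message": { "role": "assistant", "content": "Hello! I'm doing well, thank you. How can I help you today?" }
    }
  ]
}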
Once this test completes successfully, you can proceed to the next step.
The following is a quick-start recipe that uses a tiny slice of the MS Marco passage ranking dataset to showcase a RAG pattern leveraging AWS Bedrock models.
Please refer to the MS Marco passage ranking sample application for instructions on downloading the entire dataset.
In the following we will deploy the sample application to Vespa Cloud.
Make sure the Vespa CLI is installed, and update to the newest version:
$ brew install vespa-cli
Download this sample application:
$ vespa clone examples/aws-simple-rag bedrock-rag && cd bedrock-rag
Deploy the sample application to Vespa Cloud. Note that this application can fit within the free quota, so it is free to try.
In the following section, we will set the Vespa CLI target to the cloud. Make sure you have created a tenant at console.vespa-cloud.com. Make a note of the tenant's name; it will be used in the next steps. For more information, see the Vespa Cloud getting started guide.
Add your AWS Bedrock API key to the Vespa secret store as described in Secret Management. Unless you already have one, create a new vault, and add your AWS Bedrock API key as a secret.
The services.xml file must refer to the newly added secret in the secret store.
Replace <my-vault-name> and <my-secret-name> below with your own values:
<secrets>
    <bedrock-api-key vault="<my-vault-name>" name="<my-secret-name>"/>
</secrets>
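The secret is then referenced by the OpenAI client component in services.xml. The sketch below follows the client configuration described in the LLMs in Vespa documentation; the exact component id and config fields used by this sample may differ, so treat it as a sketch and compare with the services.xml that ships with the application:
<!-- Sketch: an OpenAI-compatible client pointed at the Bedrock endpoint.
     Field names follow the Vespa LLM client documentation; verify against this sample's services.xml. -->
<component id="openai" class="ai.vespa.llm.clients.OpenAI">
    <config name="ai.vespa.llm.clients.llm-client">
        <!-- Name of the secret defined in the secrets element above -->
        <apiKeySecretName>bedrock-api-key</apiKeySecretName>
        <!-- Assumed OpenAI-compatible Bedrock endpoint; adjust the region as needed -->
        <endpoint>https://bedrock-runtime.us-east-1.amazonaws.com/openai/v1</endpoint>
    </config>
</component>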
Configure the Vespa CLI. Replace tenant-name below with your tenant name.
We use the application name aws-app here, but you are free to choose your own
application name:
$ vespa config set target cloud
$ vespa config set application tenant-name.aws-app
Log in and add your public certificates to the application for data plane access:
$ vespa auth login
$ vespa auth cert
Grant the application access to the secret. The application must exist before access can be granted in the Vespa Cloud Console. The easiest way to create it is to deploy, which auto-creates the application; this first deployment will fail:
$ vespa deploy --wait 900
[09:47:43] warning Deployment failed: Invalid application: Vault 'my_vault' does not exist,
or application does not have access to it
At this point, open the console (the link looks like https://console.vespa-cloud.com/tenant/mytenant/account/secrets) and grant the application access to the vault.
Deploy the application again. It can take some time for all nodes to be provisioned:
$ vespa deploy --wait 900
The application should now be deployed!
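Optionally, verify that the application endpoint is up before feeding:
$ vespa status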
Let's feed the documents:
$ vespa feed ext/docs.jsonl
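Each line in ext/docs.jsonl is a Vespa document operation in JSON format. As an illustration only, a line might look roughly like the following; the document id namespace and field names here are assumptions, so inspect ext/docs.jsonl and the passage schema for the actual fields:
{"put": "id:msmarco:passage::0", "fields": {"id": 0, "text": "The Manhattan Project was a research and development undertaking during World War II ..."}}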
Run a query first to check the retrieval:
$ vespa query \
'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
'query=What is the Manhattan Project' \
'input.query(e)=embed(@query)' \
hits=3 \
language=en \
ranking=hybrid
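The hybrid rank profile used above combines lexical and vector signals. Conceptually, it looks something like the sketch below; the actual ranking expression, field names, and embedding dimensions are defined in the sample's passage schema, so treat all of them as assumptions:
rank-profile hybrid {
    inputs {
        query(e) tensor<float>(x[384])   # dimension is an assumption; it must match the embedding field
    }
    first-phase {
        # assumed combination of a lexical score (bm25) and semantic closeness to the query embedding
        expression: bm25(text) + closeness(field, embedding)
    }
}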
To test generation using the OpenAI client, post a query that runs the bedrock search chain:
$ vespa query \
'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
'query=What is the Manhattan Project' \
'input.query(e)=embed(@query)' \
hits=3 \
language=en \
ranking=hybrid \
searchChain=bedrock \
format=sse \
traceLevel=1 \
timeout=60s
Here, we specifically set the search chain to bedrock. This calls the RAGSearcher, which is set up to use the OpenAI client, as we are leveraging the AWS Bedrock OpenAI chat completions API endpoint. Note that this requires the AWS Bedrock API key. We also add a timeout, as token generation can take some time.
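For reference, the bedrock search chain wires the RAGSearcher to the OpenAI client component in services.xml, roughly along the lines of the sketch below (based on the RAG in Vespa documentation; the exact ids and configuration used by this sample may differ):
<!-- Sketch: a search chain that runs retrieval, then calls the LLM client for generation -->
<chain id="bedrock" inherits="vespa">
    <searcher id="ai.vespa.search.llm.RAGSearcher">
        <config name="ai.vespa.search.llm.llm-searcher">
            <!-- id of the OpenAI client component configured with the Bedrock endpoint and API key -->
            <providerId>openai</providerId>
        </config>
    </searcher>
</chain>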
You can also specify a structured output format for the LLM. In the example below, we provide a JSON schema to force the LLM to return the answer in 3 different formats:
- answer-short: a short answer to the question
- answer-short-french: a translation of the short answer in French
- answer-short-eli5: an explanation of the answer as if the user was 5 years old
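For readability, this is the JSON schema passed in the llm.json_schema parameter below, unescaped and pretty-printed:
{
  "type": "object",
  "properties": {
    "answer-short": { "type": "string" },
    "answer-short-french": { "type": "string", "description": "exact translation of short answer in French language" },
    "answer-short-eli5": { "type": "string", "description": "explain the answer like I am 5 years old" }
  },
  "required": ["answer-short", "answer-short-french", "answer-short-eli5"],
  "additionalProperties": false
}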
$ vespa query \
'yql=select * from passage where ({targetHits:10}userInput(@query)) or ({targetHits:10}nearestNeighbor(embedding,e))' \
'query=What is the Manhattan Project' \
'input.query(e)=embed(@query)' \
hits=3 \
language=en \
ranking=hybrid \
searchChain=bedrock \
format=sse \
llm.json_schema="{\"type\":\"object\",\"properties\":{\"answer-short\":{\"type\":\"string\"},\"answer-short-french\":{\"type\":\"string\",\"description\":\"exact translation of short answer in French language\"},\"answer-short-eli5\":{\"type\":\"string\",\"description\":\"explain the answer like I am 5 years old\"}},\"required\":[\"answer-short\",\"answer-short-french\",\"answer-short-eli5\"],\"additionalProperties\":false}" \
traceLevel=1 \
timeout=60s
The llm.json_schema parameter specifies the expected output of the LLM, defined in JSON Schema format, which lets you describe the structure the generated answer must follow.
The query parameters used above are:
- query: the query used both for retrieval and the prompt question.
- hits: the number of hits that Vespa should return in the retrieval stage.
- searchChain: the search chain set up in services.xml that calls the generative process.
- format: sets the format to server-sent events, which will stream the tokens as they are generated.
- traceLevel: outputs some debug information, such as the actual prompt that was sent to the LLM and token timing.
For more information on how to customize the prompt, please refer to the RAG in Vespa documentation.
To remove the application from Vespa Cloud:
$ vespa destroy
