Commit 232e371

COH-31963 - Create example for using Vectors in Python Client
1 parent f4f98bc commit 232e371

5 files changed: +201 -1 lines changed

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ repos:
       - id: end-of-file-fixer
       - id: check-yaml
       - id: check-added-large-files
+        exclude: \.json.gzip
 
   - repo: https://github.com/PyCQA/flake8
     rev: 7.1.1
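
As a rough illustration of what the new `exclude` pattern matches, the sketch below applies it with Python's `re.search`, on the assumption that pre-commit treats `exclude` as a Python regular expression searched against each file path. The sample paths are hypothetical. Note that the second dot in the pattern is unescaped, so it matches any character, not only a literal dot.

```python
import re

# The pattern added to .pre-commit-config.yaml in this commit.
pattern = re.compile(r"\.json.gzip")

# Hypothetical paths, made up for illustration.
paths = [
    "examples/movies.json.gzip",
    "examples/vector_search.py",
    "data.jsonxgzip",  # matches too, because the second "." is a wildcard
]

# Keep only the paths the pattern would exclude from the hook.
excluded = [p for p in paths if pattern.search(p)]
print(excluded)
```

A stricter pattern such as `\.json\.gzip$` would match only the intended suffix; the commit itself uses the looser form shown above.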

examples/README.md

Lines changed: 9 additions & 0 deletions
@@ -13,10 +13,19 @@ Be sure a Coherence gRPC proxy is available for the examples to work against.
 docker run -d -p 1408:1408 ghcr.io/oracle/coherence-ce:22.06.11
 ```
 
+> [!NOTE]
+> The Coherence AI vector search example (`vector_search.py`) requires the `light-embed` package, which the example code uses to load the `onnx-models/all-MiniLM-L6-v2-onnx` model for generating text embeddings.
+>
+> ```bash
+> python3 -m pip install light-embed
+> ```
+
+
 ### The Examples
 * basics.py - basic CRUD operations
 * python_object_keys_and_values.py - shows how to use standard Python objects as keys or values of a cache
 * filters.py - using filters to filter results
 * processors.py - using entry processors to mutate cache entries on the server without get/put
 * aggregators.py - using entry aggregators to query a subset of entries to produce a result
 * events.py - demonstrates cache lifecycle and cache entry events
+* vector_search.py - shows how to use some of the Coherence AI features to store vectors and perform a k-nearest neighbors (k-nn) search on those vectors.

examples/movies.json.gzip

629 KB
Binary file not shown.
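
The data file itself is not shown in the diff. As a minimal sketch of the format that the example's `load()` method reads (a gzip-compressed JSON list of movie objects), the following writes a tiny file and reads it back. The file name `sample_movies.json.gzip` and the single record are hypothetical and only borrow field names from the schema documented in `vector_search.py`.

```python
import gzip
import json
import os
import tempfile

# Hypothetical record; the field names follow the schema in vector_search.py.
sample_movies = [
    {
        "title": "Example Movie",
        "plot": "A short plot.",
        "fullplot": "A longer description of the plot.",
        "cast": ["Actor One", "Actor Two"],
        "genres": ["Sci-Fi"],
        "runtime": 120,
    }
]

path = os.path.join(tempfile.mkdtemp(), "sample_movies.json.gzip")

# Write the list as gzip-compressed UTF-8 JSON text.
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(sample_movies, f)

# Read it back in text mode, the same way the example opens its data file.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded[0]["title"])
```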

examples/vector_search.py

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
+# Copyright (c) 2025, Oracle and/or its affiliates.
+# Licensed under the Universal Permissive License v 1.0 as shown at
+# https://oss.oracle.com/licenses/upl.
+
+import asyncio
+import gzip
+import json
+from typing import Final, List
+
+from light_embed import TextEmbedding
+
+from coherence import NamedMap, Session
+from coherence.ai import FloatVector, QueryResult, SimilaritySearch, Vectors
+from coherence.extractor import Extractors, ValueExtractor
+from coherence.filter import Filter, Filters
+
+"""This example shows how to use some of the Coherence AI features to store
+vectors and perform a k-nearest neighbors (k-nn) search on those vectors to
+find matches for search text.
+
+Coherence includes an implementation of the HNSW index which can be used to
+index vectors to improve search times.
+
+Coherence is only a vector store, so in order to actually create vectors from
+text snippets this example uses the `light-embed` package to integrate with a
+model and produce vector embeddings from text.
+
+This example shows just some basic usages of vectors in Coherence including
+using Coherence HNSW indexes. It has not been optimised at all for speed of
+loading vector data or searches.
+
+Coherence Vectors
+=================
+
+The Coherence Python client can handle a few different types of vector;
+this example uses the FloatVector type.
+
+Just like any other data type in Coherence, vectors are stored in normal
+Coherence caches. The vector may be stored as the actual cache value,
+or it may be in a field of another type that is the cache value. Vector data
+is then loaded into Coherence the same way that any other data is loaded
+using the NamedMap API.
+
+Movie Database
+==============
+
+This example is going to build a small database of movies. The database is
+small because the data used is stored in the source repository along with the
+code. The same techniques could be used to load any of the freely available
+much larger JSON datasets with the required field names.
+
+The Data Model
+==============
+
+This example is not going to use any specialized classes to store the data in
+the cache. The dataset is a JSON file and the example will use Coherence JSON
+support to read and store the data.
+
+The schema of the JSON movie data looks like this:
+
++--------------------+-------------------------------------------------------+
+| Field Name         | Description                                           |
++====================+=======================================================+
+| title              | The title of the movie                                |
++--------------------+-------------------------------------------------------+
+| plot               | A short summary of the plot of the movie              |
++--------------------+-------------------------------------------------------+
+| fullplot           | A longer summary of the plot of the movie             |
++--------------------+-------------------------------------------------------+
+| cast               | A list of the names of the actors in the movie        |
++--------------------+-------------------------------------------------------+
+| genres             | A list of string values representing the different    |
+|                    | genres the movie belongs to                           |
++--------------------+-------------------------------------------------------+
+| runtime            | How long the movie runs for in minutes                |
++--------------------+-------------------------------------------------------+
+| poster             | A link to the poster for the movie                    |
++--------------------+-------------------------------------------------------+
+| languages          | A list of string values representing the different    |
+|                    | languages for the movie                               |
++--------------------+-------------------------------------------------------+
+| directors          | A list of the names of the directors of the movie     |
++--------------------+-------------------------------------------------------+
+| writers            | A list of the names of the writers of the movie       |
++--------------------+-------------------------------------------------------+
+
+This example uses the fullplot to create the vector embeddings for each
+movie. Other fields can be used by normal Coherence filters to further narrow
+down vector searches.
+
+"""
+
+
+class MovieRepository:
+    """This class represents the repository of movies. It contains all the
+    code to load and search movie data."""
+
+    MODEL_NAME: Final[str] = "onnx-models/all-MiniLM-L6-v2-onnx"
+    """
+    The ONNX-ported version of the sentence-transformers/all-MiniLM-L6-v2
+    model, used for generating text embeddings.
+    See https://huggingface.co/onnx-models/all-MiniLM-L6-v2-onnx
+    """
+
+    VECTOR_FIELD: Final[str] = "embeddings"
+    """The name of the field in the JSON containing the embeddings."""
+
+    VALUE_EXTRACTOR: Final[ValueExtractor] = Extractors.extract(VECTOR_FIELD)
+    """The ValueExtractor to extract the embeddings vector from the JSON."""
+
+    def __init__(self, movies: NamedMap) -> None:
+        """
+        Creates an instance of the MovieRepository
+
+        :param movies: The Coherence NamedMap used as the cache to store the
+                       movie data.
+
+        """
+        self.movies = movies
+        self.model = TextEmbedding(self.MODEL_NAME)  # embedding model to generate embeddings
+
+    async def load(self, filename: str) -> None:
+        """
+        Loads the movie data into the NamedMap using the specified gzip file
+
+        :param filename: Name of the movies JSON gzip file
+        :return: None
+        """
+        try:
+            with gzip.open(filename, "rt", encoding="utf-8") as f:
+                # the JSON data should be a JSON list of movie objects in the
+                # format described above.
+                data = json.load(f)
+        except FileNotFoundError:
+            print("Error: The file was not found.")
+            return  # nothing was read, so there is nothing to store
+        except Exception as e:
+            print(f"An unexpected error occurred: {e}")
+            return  # nothing was read, so there is nothing to store
+
+        # the "with" block above has already closed the file at this point
+        for movie in data:
+            title: str = movie.get("title")
+            plot: str = movie.get("fullplot")
+            key: str = title
+            vector: FloatVector = self.vectorize(plot)
+            movie[self.VECTOR_FIELD] = vector
+            await self.movies.put(key, movie)
+
+    def vectorize(self, input_string: str) -> FloatVector:
+        embeddings: List[float] = self.model.encode(input_string).tolist()
+        return FloatVector(Vectors.normalize(embeddings))
+
+    async def search(self, search_text: str, count: int, filter: Filter = Filters.always()) -> List[QueryResult]:
+        vector: FloatVector = self.vectorize(search_text)
+        search: SimilaritySearch = SimilaritySearch(self.VALUE_EXTRACTOR, vector, count)
+        return await self.movies.aggregate(search, filter=filter)
+
+
+MOVIE_JSON_FILENAME: Final[str] = "movies.json.gzip"
+
+
+async def do_run() -> None:
+
+    session: Session = await Session.create()
+    movie_db: NamedMap[str, dict] = await session.get_map("movies")
+    try:
+        movies_repo = MovieRepository(movie_db)
+
+        await movies_repo.load(MOVIE_JSON_FILENAME)
+        results = await movies_repo.search("star travel and space ships", 5)
+        for e in results:
+            print(f"key = {e.key}, distance = {e.distance}, plot = {e.value.get('plot')}")
+
+        cast_extractor = Extractors.extract("cast")
+        filter = Filters.contains(cast_extractor, "Harrison Ford")
+        results = await movies_repo.search("star travel and space ships", 5, filter)
+        for e in results:
+            print(f"key = {e.key}, distance = {e.distance}, plot = {e.value.get('plot')}")
+
+    finally:
+        await movie_db.truncate()
+        await session.close()
+
+
+asyncio.run(do_run())
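
To make the search step more concrete, here is a minimal pure-Python sketch of a brute-force k-nearest-neighbours search over normalized vectors. It does not use the Coherence API or an HNSW index, and it assumes cosine distance as the similarity measure, which this commit does not state for `SimilaritySearch`. The helper names `normalize`, `cosine_distance` and `knn`, and the three-dimensional toy vectors, are made up for illustration.

```python
import heapq
import math
from typing import Dict, List, Tuple


def normalize(v: List[float]) -> List[float]:
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0.0 else [x / norm for x in v]


def cosine_distance(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity; for unit vectors the similarity is the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def knn(query: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k keys closest to the query, nearest first."""
    q = normalize(query)
    scored = ((cosine_distance(q, normalize(v)), key) for key, v in vectors.items())
    return [(key, dist) for dist, key in heapq.nsmallest(k, scored)]


# Toy "embeddings" standing in for the model output.
toy = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 1.0, 0.0]}
results = knn([1.0, 0.0, 0.0], toy, 2)
for key, dist in results:
    print(f"key = {key}, distance = {dist:.4f}")
```

A brute-force scan like this compares the query against every stored vector, which is the cost an index such as HNSW is meant to reduce on larger datasets.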

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ pymitter = ">=0.4,<1.1"
 typing-extensions = ">=4.11,<4.14"
 types-protobuf = "5.29.1.20250403"
 pympler = "1.1"
-numpy = "2.0.2"
+numpy = "1.26.4"
 
 [tool.poetry.dev-dependencies]
 pytest = "~8.3"
