docs/_posts/2021-07-30-how-does-elastiknn-work.md: 4 additions & 4 deletions
@@ -43,8 +43,8 @@ The name is a combination of _Elastic_ and _KNN_ (K-Nearest Neighbors).
 The full list of features (copied from the home page) is as follows:
 
 - Datatypes to efficiently store dense and sparse numerical vectors in Elasticsearch documents, including multiple vectors per document.
-- Exact nearest neighbor queries for five similarity functions: [L1](https://en.wikipedia.org/wiki/Taxicab_geometry), [L2](https://en.wikipedia.org/wiki/Euclidean_distance), [Cosine](https://en.wikipedia.org/wiki/Cosine_similarity), [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index), and [Hamming](https://en.wikipedia.org/wiki/Hamming_distance).
-- Approximate queries using [Locality Sensitive Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) for L2, Cosine, Jaccard, and Hamming similarity.
+- Exact nearest neighbor queries for six similarity functions: [L1](https://en.wikipedia.org/wiki/Taxicab_geometry), [L2](https://en.wikipedia.org/wiki/Euclidean_distance), [Cosine](https://en.wikipedia.org/wiki/Cosine_similarity), [Dot](https://en.wikipedia.org/wiki/Dot_product), [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index), and [Hamming](https://en.wikipedia.org/wiki/Hamming_distance).
+- Approximate queries using [Locality Sensitive Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) for L2, Cosine, Dot, Jaccard, and Hamming similarity.
 - Integration of nearest neighbor queries with standard Elasticsearch queries.
 - Incremental index updates: start with any number of vectors and incrementally create/update/delete more without ever re-building the entire index.
 - Implementation based on standard Elasticsearch and Lucene primitives, entirely in the JVM. Indexing and querying scale horizontally with Elasticsearch.
@@ -88,13 +88,13 @@ So Java is used for all the CPU-bound LSH models and Lucene abstractions, and Sc
 
 Elasticsearch requires non-negative scores, with higher scores indicating higher relevance.
-Elastiknn supports five vector similarity functions (L1, L2, Cosine, Jaccard, and Hamming).
+Elastiknn supports six vector similarity functions (L1, L2, Cosine, Dot, Jaccard, and Hamming).
 Three of these are problematic with respect to this scoring requirement.
 
 Specifically, L1 and L2 are generally defined as _distance_ functions, rather than similarity functions,
 which means that higher relevance (i.e., lower distance) yields _lower_ scores.
 Cosine similarity is defined over $$[-1, 1]$$, and we can't have negative scores.
+Dot similarity is likewise defined over $$[-1, 1]$$ for vectors with a magnitude of 1, in which case it is equivalent to Cosine similarity, and we can't have negative scores.
 To work around this, Elastiknn applies simple transformations to produce L1, L2, and Cosine _similarity_ in accordance with the Elasticsearch requirements.
 The exact transformations are documented [on the API page](/api/#similarity-scoring).
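
The non-negative-score requirement discussed in this hunk can be illustrated with a small sketch. The shift-by-one transform below is only an assumed illustration of how a value in $$[-1, 1]$$ can be mapped to a non-negative score; the exact transformations Elastiknn applies are documented on the API page.

```python
import math

def normalize(v):
    """Scale v to unit magnitude so its dot products fall in [-1, 1]."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def nonnegative_score(u, v):
    """Shift a dot/cosine value from [-1, 1] into [0, 2].
    Illustrative only: not necessarily the exact transformation
    Elastiknn applies."""
    return 1.0 + dot(normalize(u), normalize(v))
```

Identical directions score 2, opposite directions score 0, and orthogonal vectors score 1, so higher relevance always yields a higher, non-negative score.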
docs/pages/api.md: 67 additions & 8 deletions
@@ -292,6 +292,30 @@ PUT /my-index/_mapping
     }
   }
 }
 ```
+
+### Dot LSH Mapping
+
+Uses the [Random Projection algorithm](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection)
+to hash and store dense float vectors such that they support approximate Dot similarity queries.
+Equivalent to Cosine similarity if the vectors are normalized.
+
+The implementation is influenced by Chapter 3 of [Mining Massive Datasets](http://www.mmds.org/).
+
+```json
+PUT /my-index/_mapping
+{
+  "properties": {
+    "my_vec": {
+      "type": "elastiknn_dense_float_vector", # 1
+      "elastiknn": {
+        "dims": 100,                          # 2
+        "model": "lsh",                       # 3
+        "similarity": "dot",                  # 4
+        "L": 99,                              # 5
+        "k": 1                                # 6
+      }
+    }
+  }
+}
+```
 
 |#|Description|
 |:--|:--|
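
The random projection scheme referenced in the added mapping can be sketched as follows. This is an illustrative Python sketch, not Elastiknn's JVM implementation; it builds `L` hash tables of `k` random hyperplanes each, mirroring the mapping's `L` and `k` parameters.

```python
import random

def make_tables(dims, L, k, seed=0):
    """Build L hash tables, each holding k random Gaussian hyperplanes."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dims)] for _ in range(k)]
            for _ in range(L)]

def hash_vector(v, tables):
    """One k-bit signature per table: the sign of v's projection onto
    each hyperplane. Vectors with high dot/cosine similarity tend to
    land on the same side of most hyperplanes, so they tend to share
    signatures."""
    return [tuple(1 if sum(a * b for a, b in zip(v, plane)) >= 0 else 0
                  for plane in planes)
            for planes in tables]
```

Note that scaling a vector by a positive constant never changes the sign of any projection, which is why this scheme is equivalent for Dot and Cosine once vectors are normalized.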
@@ -425,7 +449,7 @@ GET /my-index/_search
 ### Compatibility of Vector Types and Similarities
 
 Jaccard and Hamming similarity only work with sparse bool vectors.
-Cosine,[^note-angular-cosine] L1, and L2 similarity only work with dense float vectors.
+Cosine,[^note-angular-cosine] Dot,[^note-dot-product] L1, and L2 similarity only work with dense float vectors.
 The following documentation assume this restriction is known.
 
 These restrictions aren't inherent to the types and algorithms, i.e., you could in theory run cosine similarity on sparse vectors.
@@ -446,9 +470,12 @@ The exact transformations are described below.
+Dot similarity will produce negative scores if the vectors are not normalized.
 
 If you're using the `elastiknn_nearest_neighbors` query with other queries, and the score values are inconvenient (e.g. huge values like 1e6), consider wrapping the query in a [Script Score Query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html), where you can access and transform the `_score` value.
 
 ### Query Vector
@@ -621,6 +648,36 @@ GET /my-index/_search
 |5|Number of candidates per segment. See the section on LSH Search Strategy.|
 |6|Set to true to use the more-like-this heuristic to pick a subset of hashes. Generally faster but still experimental.|
 
+### Dot LSH Query
+
+Retrieve dense float vectors based on approximate Dot similarity.[^note-dot-product]
+
+```json
+GET /my-index/_search
+{
+  "query": {
+    "elastiknn_nearest_neighbors": {
+      "field": "my_vec",               # 1
+      "vec": {                         # 2
+        "values": [0.1, 0.2, 0.3, ...]
+      },
+      "model": "lsh",                  # 3
+      "similarity": "dot",             # 4
+      "candidates": 50                 # 5
+    }
+  }
+}
+```
+
+|#|Description|
+|:--|:--|
+|1|Indexed field. Must use `lsh` mapping model with `dot`[^note-dot-product] similarity.|
+|2|Query vector. Must be literal dense float or a pointer to an indexed dense float vector.|
+|3|Model name.|
+|4|Similarity function.|
+|5|Number of candidates per segment. See the section on LSH Search Strategy.|
+
 ### L1 LSH Query
 
 Not yet implemented.
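
Because the added `dot` similarity assumes normalized vectors, a client would typically normalize before indexing or querying. A minimal sketch, with a hypothetical `dot_query` helper that only assembles the query body shown in the added hunk (no Elasticsearch client involved):

```python
import math

def normalize(values):
    """Return values scaled to unit magnitude, as dot similarity expects."""
    norm = math.sqrt(sum(x * x for x in values))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in values]

def dot_query(field, values, candidates=50):
    """Assemble an elastiknn_nearest_neighbors dot query body,
    normalizing the query vector first. Illustrative helper, not
    part of Elastiknn's API."""
    return {
        "query": {
            "elastiknn_nearest_neighbors": {
                "field": field,
                "vec": {"values": normalize(values)},
                "model": "lsh",
                "similarity": "dot",
                "candidates": candidates,
            }
        }
    }
```

The resulting dictionary can be serialized to JSON and sent as the body of `GET /my-index/_search`.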
@@ -707,12 +764,13 @@ The similarity functions are abbreviated (J: Jaccard, H: Hamming, C: Cosine,[^no
+|Exact (i.e. no model specified) |✔ (C, D, L1, L2) |x |x |x |x |
+|Cosine LSH |✔ (C, D, L1, L2) |✔ |✔ |x |x |
+|Dot LSH |✔ (C, D, L1, L2) |✔ |✔ |x |x |
+|L2 LSH |✔ (C, D, L1, L2) |x |x |✔ |x |
+|Permutation LSH |✔ (C, D, L1, L2) |x |x |x |✔ |
 
 ### Running Nearest Neighbors Query on a Filtered Subset of Documents
@@ -860,4 +918,5 @@ PUT /my-index
 
 See the [create index documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html) for more details.
 
-[^note-angular-cosine]: Cosine similarity used to be (incorrectly) called "angular" similarity. All references to "angular" were renamed to "Cosine" in 7.13.3.2. You can still use "angular" in the JSON/HTTP API; it will convert to "cosine" internally.
+[^note-angular-cosine]: Cosine similarity used to be (incorrectly) called "angular" similarity. All references to "angular" were renamed to "Cosine" in 7.13.3.2. You can still use "angular" in the JSON/HTTP API; it will convert to "cosine" internally.
+[^note-dot-product]: Dot similarity is intended for use with normalized vectors v, i.e., ||v|| == 1.
docs/pages/index.md: 2 additions & 2 deletions
@@ -15,8 +15,8 @@ This enables users to combine traditional queries (e.g., "some product") with ve
 ## Features
 
 - Datatypes to efficiently store dense and sparse numerical vectors in Elasticsearch documents, including multiple vectors per document.
-- Exact nearest neighbor queries for five similarity functions: [L1](https://en.wikipedia.org/wiki/Taxicab_geometry), [L2](https://en.wikipedia.org/wiki/Euclidean_distance), [Cosine](https://en.wikipedia.org/wiki/Cosine_similarity), [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index), and [Hamming](https://en.wikipedia.org/wiki/Hamming_distance).
-- Approximate queries using [Locality Sensitive Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) for L2, Cosine, Jaccard, and Hamming similarity.
+- Exact nearest neighbor queries for six similarity functions: [L1](https://en.wikipedia.org/wiki/Taxicab_geometry), [L2](https://en.wikipedia.org/wiki/Euclidean_distance), [Cosine](https://en.wikipedia.org/wiki/Cosine_similarity), [Dot](https://en.wikipedia.org/wiki/Dot_product) (for normalized vectors), [Jaccard](https://en.wikipedia.org/wiki/Jaccard_index), and [Hamming](https://en.wikipedia.org/wiki/Hamming_distance).
+- Approximate queries using [Locality Sensitive Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) for L2, Cosine, Dot, Jaccard, and Hamming similarity.
 - Integration of nearest neighbor queries with standard Elasticsearch queries.
 - Incremental index updates. Start with 1 vector or 1 million vectors and then create/update/delete documents and vectors without ever re-building the entire index.
 - Implementation based on standard Elasticsearch and Lucene primitives, entirely in the JVM. Indexing and querying scale horizontally with Elasticsearch.