global semantic search: neural sparse search #10696
base: feature/global-semantic-search-neural-sparse
❌ Empty Changelog Section: The Changelog section in your PR description is empty. Please add a valid changelog entry or entries. If you did add a changelog entry, check to make sure that it was not accidentally included inside the comment block in the Changelog section.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@                              Coverage Diff                               @@
##  feature/global-semantic-search-neural-sparse   #10696      +/-   ##
================================================================================
- Coverage     60.25%    52.77%     -7.49%
================================================================================
  Files          4385      4099       -286
  Lines        116753    112834      -3919
  Branches      19010     18387       -623
================================================================================
- Hits          70346     59543     -10803
- Misses        41568     48981      +7413
+ Partials       4839      4310       -529
Flags with carried forward coverage won't be shown.
…se-search Signed-off-by: Zhenxing Shen <[email protected]>
Signed-off-by: Zhenxing Shen <[email protected]>
);

const debouncedSearch = useMemo(() => {
  return debounce(search, 500); // 300ms delay, adjust as needed
Nit: the comment needs updating (it says 300ms, but the delay passed to debounce is 500).
Thanks for reminding!
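One way to keep the delay and its comment from drifting apart is to name the delay once. The sketch below is only an illustration: `SEARCH_DEBOUNCE_MS` and the hand-rolled `debounce` are hypothetical stand-ins (the PR presumably imports `debounce` from a utility library), and the React `useMemo` wrapper is left out so the snippet runs on its own.

```typescript
// Hypothetical constant: a single source of truth for the delay.
const SEARCH_DEBOUNCE_MS = 500;

type Debounced<A extends unknown[]> = ((...args: A) => void) & { cancel: () => void };

// Minimal trailing-edge debounce (illustrative stand-in, not the PR's import).
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number
): Debounced<A> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const run = (...args: A): void => {
    // Restart the countdown on every call; only the last call fires.
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => {
      timer = undefined;
      fn(...args);
    }, waitMs);
  };

  const cancel = (): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = undefined;
  };

  return Object.assign(run, { cancel });
}

// Hypothetical search callback, used only to show the call shape.
const exampleSearch = (query: string): void => {
  console.log('searching for', query);
};
const exampleDebouncedSearch = debounce(exampleSearch, SEARCH_DEBOUNCE_MS);
```

With the constant in place, the inline comment no longer has to restate the number.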
const nlpBertTokenizer = new BertWordPieceTokenizer({ vocabContent: Object.keys(this.vocab) });
const tokenizedResult = nlpBertTokenizer.tokenizeSentence(query);
const tokensArray = tokenizedResult.tokens;
console.log('Tokenization: ', tokensArray);

const queryVec = this.buildQueryVector(tokensArray);
console.log('Non-zero query dimensions count: ', Object.keys(queryVec).length);
console.log('Non-zero query vector: ', queryVec);
Neural sparse search supports natural-language queries. You can search with an existing model_id.
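For reference, a query of that kind might look like the sketch below. The `neural_sparse` clause with `query_text` and `model_id` follows the OpenSearch query DSL; the field name `passage_embedding`, the `<model_id>` placeholder, and the helper `buildNeuralSparseQuery` are invented here for illustration only.

```typescript
// Shape of a neural_sparse request body keyed by the sparse-vector field name.
interface NeuralSparseQueryBody {
  query: {
    neural_sparse: Record<string, { query_text: string; model_id: string }>;
  };
}

// Hypothetical helper: builds the body for a natural-language query.
function buildNeuralSparseQuery(
  field: string,
  queryText: string,
  modelId: string
): NeuralSparseQueryBody {
  return {
    query: {
      neural_sparse: {
        [field]: { query_text: queryText, model_id: modelId },
      },
    },
  };
}

// 'passage_embedding' and '<model_id>' are placeholders, not values from this PR.
const exampleQueryBody = buildNeuralSparseQuery(
  'passage_embedding',
  'visualize key metrics',
  '<model_id>'
);
console.log(JSON.stringify(exampleQueryBody, null, 2));
```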
Cool! We will try this way in the future.
Maybe you can provide some context on why you're doing tokenization here? At first glance, you're doing tokenization to obtain the sparse query vector with IDF value. This is the doc-only search mode: https://docs.opensearch.org/latest/query-dsl/specialized/neural-sparse/
Yes, you are right. We want to use doc-only neural sparse search to achieve semantic search in the frontend without relying on a backend service. We generate the document vectors in advance and store them on the frontend side. At query time, we tokenize the query to obtain a sparse query vector weighted by IDF values. Relevance is then calculated as the dot product between the query vector and each document vector.
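The scoring step described here can be sketched as follows. This is an illustration under assumptions, not the PR's code: the `SparseVector` type, the `idf` table, `dotProduct`, and `rank` are hypothetical, and tokenization is assumed to have already happened. Only the name `buildQueryVector` is taken from the diff; its body here is a guess.

```typescript
// Token -> weight; tokens that are absent have an implicit weight of zero.
type SparseVector = Map<string, number>;

// Give each known query token its IDF weight; unknown tokens are dropped
// and repeated tokens are counted once.
function buildQueryVector(tokens: string[], idf: Map<string, number>): SparseVector {
  const vec: SparseVector = new Map();
  for (const token of tokens) {
    const weight = idf.get(token);
    if (weight !== undefined) vec.set(token, weight);
  }
  return vec;
}

// Sparse dot product: only tokens present in the query contribute.
function dotProduct(query: SparseVector, doc: SparseVector): number {
  let score = 0;
  query.forEach((weight, token) => {
    score += weight * (doc.get(token) ?? 0);
  });
  return score;
}

// Score every document and return the matches, best first.
function rank(
  query: SparseVector,
  docs: Map<string, SparseVector>
): Array<{ id: string; score: number }> {
  const hits: Array<{ id: string; score: number }> = [];
  docs.forEach((vec, id) => {
    const score = dotProduct(query, vec);
    if (score > 0) hits.push({ id, score });
  });
  return hits.sort((a, b) => b.score - a.score);
}

// Toy data, invented for the example.
const exampleIdf = new Map([['discover', 2], ['dashboard', 1.5]]);
const exampleDocs = new Map<string, SparseVector>([
  ['discover', new Map([['discover', 1], ['data', 0.5]])],
  ['dashboards', new Map([['dashboard', 2]])],
]);
const exampleQuery = buildQueryVector(['discover', 'data'], exampleIdf);
console.log(rank(exampleQuery, exampleDocs)); // only the 'discover' document scores above zero
```

Because both vectors are sparse, the loop touches only the query's non-zero dimensions, which keeps scoring cheap enough to run in the browser.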
Doc-only queries also support natural-language queries.
I see, this way we don't need tokenization anymore.
});

try {
  console.log('All links:', allSearchAbleLinks);
Nit: AllSearchableLinks.
Thanks!
core.application.register({
  id: PLUGIN_ID,
  title: 'Discover',
  description: 'Analyze your data in OpenSearch and visualize key metrics.',
Just out of curiosity, would semantic search return more accurate results if each application had a more explanatory and detailed description?
Yes, it will improve the relevance. We can enrich the descriptions in the future.
  ...workspaceOptionalAttributesSchema,
});

let jsonData: {
Is it safe to make this a global variable in Node.js? Would it be better to create a class, make jsonData a private field of that class, and provide dedicated methods to manipulate jsonData?
Yes, this is a good concern. We should make it private.
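One possible shape for that refactor is sketched below. The names `DocVectorStore`, `DocVectors`, and `getAll` are hypothetical, the `DocVectors` shape is a guess because the diff does not show the full type of `jsonData`, and the loader is injected so the example stays self-contained. In the PR it would presumably wrap the existing `readFile` call.

```typescript
// Assumed shape: document id -> sparse vector (token -> weight).
type DocVectors = Record<string, Record<string, number>>;

class DocVectorStore {
  // Cached promise, so concurrent callers share a single load.
  private cache?: Promise<DocVectors>;

  constructor(private readonly loader: () => Promise<DocVectors>) {}

  // The only way to reach the data; nothing outside the class can reassign it.
  getAll(): Promise<DocVectors> {
    if (!this.cache) {
      this.cache = this.loader().catch((err) => {
        // Drop a failed load so the next call can retry.
        this.cache = undefined;
        throw err;
      });
    }
    return this.cache;
  }
}

// Usage with an in-memory loader standing in for the file read.
const exampleStore = new DocVectorStore(async () => ({
  discover: { discover: 1.0, data: 0.5 },
}));
exampleStore.getAll().then((vectors) => {
  console.log('Loaded documents:', Object.keys(vectors));
});
```

Caching the promise rather than the parsed value means two requests that arrive before the first load finishes still trigger only one read.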
if (!jsonData) {
  const filePath = path.join(__dirname, 'doc_vectors.json');
  const data = await readFile(filePath, 'utf8');
Would be nice to have error handling for file reading here I guess.
We already have a try/catch here, I think it will handle the error message.
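If finer-grained handling were wanted later, the read and the parse could be separated so each failure gets its own message. The following is a sketch under assumptions: `parseDocVectors` and `loadDocVectors` are hypothetical helpers, and the expected JSON shape (a plain object) is a guess, since the diff does not show it.

```typescript
import { readFile } from 'fs/promises';

// Parse and minimally validate the file contents.
function parseDocVectors(raw: string): Record<string, unknown> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error(`doc vectors file is not valid JSON: ${(err as Error).message}`);
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('doc vectors file must contain a JSON object');
  }
  return parsed as Record<string, unknown>;
}

// Read the file, reporting I/O failures separately from parse failures.
async function loadDocVectors(filePath: string): Promise<Record<string, unknown>> {
  let raw: string;
  try {
    raw = await readFile(filePath, 'utf8');
  } catch (err) {
    throw new Error(`cannot read ${filePath}: ${(err as Error).message}`);
  }
  return parseDocVectors(raw);
}
```

An outer try/catch still catches both errors; the difference is only that the message says which step failed.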
Description
Neural Sparse Search offers a lightweight yet effective approach for semantic search by representing text as sparse vectors where most elements are zero. This method bridges the gap between traditional keyword matching and dense neural embeddings.
Neural Sparse Search works in two phases:
Issues Resolved
Screenshot
Testing the changes
Changelog
Check List
yarn test:jest
yarn test:jest_integration