global semantic search: PLM #10695
base: feature/global-semantic-search-PLM
Conversation
❌ Empty Changelog SectionThe Changelog section in your PR description is empty. Please add a valid changelog entry or entries. If you did add a changelog entry, check to make sure that it was not accidentally included inside the comment block in the Changelog section. |
Codecov Report: ❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##    feature/global-semantic-search-PLM   #10695     +/- ##
======================================================================
- Coverage      60.25%    60.17%    -0.08%
======================================================================
  Files           4385      4381        -4
  Lines         116753    116581      -172
  Branches       19010     18994       -16
======================================================================
- Hits           70346     70154      -192
- Misses         41568     41638       +70
+ Partials        4839      4789       -50
Signed-off-by: Zhenxing Shen <[email protected]>
);

const debouncedSearch = useMemo(() => {
  return debounce(search, 500); // 500ms delay, adjust as needed
Nit: will 500ms give better UI performance than 300ms?
It gives the user more time to finish typing before the semantic search fires.
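For readers following the thread, here is a minimal sketch of the debounce pattern under discussion, assuming a React component and lodash's debounce. The hook name is illustrative (the diff wires this up inline), and 500 is the value from the diff:

```ts
import { useEffect, useMemo } from 'react';
import debounce from 'lodash/debounce';

// Hypothetical hook; the diff creates the debounced function inline instead.
function useDebouncedSearch(search: (query: string) => void) {
  // Recreate the debounced function only when `search` changes.
  const debouncedSearch = useMemo(() => debounce(search, 500), [search]);

  useEffect(() => {
    // Cancel any pending invocation on unmount so a late search
    // can't update an unmounted component.
    return () => debouncedSearch.cancel();
  }, [debouncedSearch]);

  return debouncedSearch;
}
```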
}

// Generate embeddings for links
const linkEmbeddings = await Promise.all(
How long does it take to generate embeddings for the links? Maybe we can store this info in memory.
It takes about a few hundred ms. Yes, it's better to store them in memory since they are fixed data.
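A minimal sketch of that caching idea, assuming the link set is fixed for the lifetime of the page. Here embed() stands in for whatever embedding call the PR uses, and all names are illustrative:

```ts
// Stand-in for the PR's embedding call (e.g. the transformers.js pipeline).
declare function embed(text: string): Promise<number[]>;

// Caching the promise (rather than the resolved array) also deduplicates
// concurrent callers while the first computation is still in flight.
let linkEmbeddingsCache: Promise<number[][]> | null = null;

function getLinkEmbeddings(links: string[]): Promise<number[][]> {
  if (linkEmbeddingsCache === null) {
    // Computed once (roughly hundreds of ms), then reused for every search.
    linkEmbeddingsCache = Promise.all(links.map((link) => embed(link)));
  }
  return linkEmbeddingsCache;
}
```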
Signed-off-by: Zhenxing Shen <[email protected]>
Can you provide more context on why you're introducing a tokenizer in this PR? ml-commons already has two tokenizers. FYI: opensearch-project/ml-commons#3708.
If you really think the tokenizer needs to be added, can you make it a backend change in the ml-commons repo?
It's a frontend solution to navigate the user to a different page/link; it isn't associated with any particular backend.
Oh I see. Just out of curiosity, could this tokenizer also benefit back-end users?
Description
To capture the semantic information of the user input and the existing search results, we use pre-trained language models (PLM) to generate corresponding embeddings (vectors). By comparing the distance between embeddings, we can find the most relevant results.
Specifically, we use transformers.js to run a pre-trained model directly in the browser. It provides a pipeline API, which simplifies complex tasks by abstracting away tokenization (converting raw text into the numerical IDs the model understands), preprocessing (adding special tokens, padding, and truncation), model inference (executing the ONNX model via ONNX Runtime), and post-processing (extracting the relevant output, such as pooling the embeddings). Furthermore, it provides helpers like cos_sim to compute the cosine similarity between two embeddings.
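A minimal sketch of that flow with transformers.js; the model name and sample strings are illustrative assumptions, not necessarily what this PR ships:

```ts
import { pipeline, cos_sim } from '@xenova/transformers';

// pipeline() hides tokenization, ONNX Runtime inference, and pooling behind
// a single call; the model is downloaded and cached on first use.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embed(text: string): Promise<number[]> {
  // Mean-pool the token embeddings and L2-normalize, so cosine similarity
  // reduces to a dot product.
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array);
}

const query = await embed('how do I create a visualization');
const link = await embed('Create a new visualization');

// cos_sim returns a score in [-1, 1]; higher means more semantically similar.
console.log(cos_sim(query, link));
```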
Issues Resolved
Screenshot
Testing the changes
Changelog
Check List
yarn test:jest
yarn test:jest_integration