
Conversation

@s-zx s-zx (Contributor) commented Oct 9, 2025

Description

To capture the semantic information of the user's input and of the existing search results, we use pre-trained language models (PLMs) to generate corresponding embeddings (vectors). By comparing the distances between these embeddings, we can identify the most semantically relevant results.

Specifically, we use Transformers.js to run pre-trained models directly in the browser. It provides a pipeline API, which simplifies complex tasks by abstracting away the intricacies of tokenization (converting raw text into numerical IDs the model understands), preprocessing (adding special tokens, padding, and truncation), model inference (executing the ONNX model via ONNX Runtime), and post-processing (extracting the relevant output, such as pooling the embeddings). Furthermore, it provides helpers like cos_sim to compute the cosine similarity between two embeddings.
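Under the hood, cos_sim is plain cosine similarity over two embedding vectors. A minimal sketch of the computation (illustrative only; in practice use the helper exported by Transformers.js):

```javascript
// Cosine similarity between two embedding vectors:
// dot(a, b) / (||a|| * ||b||). Returns a value in [-1, 1];
// closer to 1 means the texts are more semantically similar.
function cosSim(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

If the feature-extraction pipeline is invoked with normalized mean pooling, the embeddings are unit vectors and this reduces to a plain dot product.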

Issues Resolved

Screenshot

Testing the changes

Changelog

Check List

  • All tests pass
    • yarn test:jest
    • yarn test:jest_integration
  • New functionality includes testing.
  • New functionality has been documented.
  • Update CHANGELOG.md
  • Commits are signed per the DCO using --signoff

@github-actions github-actions bot commented Oct 9, 2025

❌ Empty Changelog Section

The Changelog section in your PR description is empty. Please add a valid changelog entry or entries. If you did add a changelog entry, check to make sure that it was not accidentally included inside the comment block in the Changelog section.

@s-zx s-zx closed this Oct 9, 2025
@s-zx s-zx reopened this Oct 9, 2025

@codecov codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 65.38462% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.17%. Comparing base (dfc3f48) to head (086983a).

Files with missing lines Patch % Lines
src/core/public/chrome/utils.ts 52.94% 7 Missing and 1 partial ⚠️
src/core/public/chrome/chrome_service.tsx 0.00% 1 Missing ⚠️
Additional details and impacted files
@@                          Coverage Diff                           @@
##           feature/global-semantic-search-PLM   #10695      +/-   ##
======================================================================
- Coverage                               60.25%   60.17%   -0.08%     
======================================================================
  Files                                    4385     4381       -4     
  Lines                                  116753   116581     -172     
  Branches                                19010    18994      -16     
======================================================================
- Hits                                    70346    70154     -192     
- Misses                                  41568    41638      +70     
+ Partials                                 4839     4789      -50     
Flag Coverage Δ
Linux_1 ?
Linux_2 ?
Linux_3 38.41% <0.00%> (-0.02%) ⬇️
Linux_4 32.59% <0.00%> (-0.01%) ⬇️
Windows_1 26.22% <38.46%> (-0.45%) ⬇️
Windows_2 38.75% <62.50%> (-0.05%) ⬇️
Windows_3 38.42% <0.00%> (-0.01%) ⬇️
Windows_4 32.59% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@s-zx s-zx changed the title from "PLM Semantic search" to "global semantic search: PLM" Oct 13, 2025

);

const debouncedSearch = useMemo(() => {
return debounce(search, 500); // 500ms delay, adjust as needed
Collaborator

Nit: Will 500 ms give better UI performance than 300 ms?

Contributor Author

It gives the user more time to finish typing before the semantic search runs.
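For context, the debounce pattern under discussion can be sketched as follows (an illustrative re-implementation; the PR presumably uses an existing debounce utility such as lodash's):

```javascript
// Returns a wrapper that postpones calls to `fn` until `delay` ms
// have elapsed since the last invocation, so the (relatively
// expensive) semantic search only fires once the user pauses typing.
function debounce(fn, delay) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delay);
  };
}
```

With a 500 ms delay, several keystrokes in quick succession trigger a single search 500 ms after the last one.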

}

// Generate embeddings for links
const linkEmbeddings = await Promise.all(
Collaborator

How long does it take to generate embeddings for the links? Maybe we can store this info in memory.

Contributor Author

It takes a few hundred milliseconds. Yes, it's better to store them in memory since these are fixed data.
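The caching idea from this thread can be sketched roughly like this (hypothetical helper names; getEmbedding and embed are not from the PR):

```javascript
// Module-level cache: link titles are fixed at runtime, so each
// embedding only needs to be computed once and can be reused
// across searches.
const embeddingCache = new Map();

// `embed` is whatever async function produces an embedding for a
// string (e.g. a Transformers.js feature-extraction pipeline).
async function getEmbedding(text, embed) {
  if (!embeddingCache.has(text)) {
    embeddingCache.set(text, await embed(text));
  }
  return embeddingCache.get(text);
}
```

Caching the promise instead of the resolved value (set before awaiting) would additionally deduplicate concurrent requests for the same text.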

s-zx added 2 commits October 14, 2025 15:37
Signed-off-by: Zhenxing Shen <[email protected]>
Signed-off-by: Zhenxing Shen <[email protected]>
@yuye-aws yuye-aws (Member) left a comment

Can you provide more context on why you're introducing a tokenizer in this PR? ml-commons already has two tokenizers. FYI: opensearch-project/ml-commons#3708.

If you really think the tokenizer needs to be added, can you make it a backend change in the ml-commons repo?

@xluo-aws (Member)

Can you provide more context on why you're introducing a tokenizer in this PR? ml-commons already has two tokenizers. FYI: opensearch-project/ml-commons#3708.

If you really think the tokenizer needs to be added, can you make it a backend change in the ml-commons repo?

It's a frontend solution to navigate the user to a different page/link; it's not associated with any particular backend.

@yuye-aws (Member)

It's a frontend solution to navigate the user to a different page/link; it's not associated with any particular backend.

Oh I see. Just out of curiosity, could this tokenizer also benefit back-end users?
