
Conversation

@s-zx s-zx (Contributor) commented Oct 9, 2025

Description

To capture the semantic information of the user's input and of the existing search results, we use pre-trained language models (PLMs) to generate corresponding embeddings (vectors). By comparing the distances between these embeddings, we can identify the most semantically relevant results.

Specifically, we use Transformers.js to run pre-trained models directly in the browser. It provides a pipeline API, which simplifies complex tasks by abstracting away the intricacies of tokenization (converting raw text into numerical IDs the model understands), preprocessing (adding special tokens, padding, and truncation), model inference (executing the ONNX model via ONNX Runtime), and post-processing (extracting the relevant output, such as pooling the embeddings). Furthermore, it provides helpers like cos_sim to compute the cosine similarity between two embeddings.
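Under the hood, cos_sim is plain cosine similarity over two embedding vectors. A minimal sketch of the computation (illustrative only; in practice use the helper exported by Transformers.js):

```javascript
// Cosine similarity between two embedding vectors:
// dot(a, b) / (||a|| * ||b||). Returns a value in [-1, 1];
// closer to 1 means the texts are more semantically similar.
function cosSim(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

If the feature-extraction pipeline is invoked with normalized mean pooling, the embeddings are unit vectors and this reduces to a plain dot product.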

Issues Resolved

Screenshot

Testing the changes

Changelog

Check List

  • All tests pass
    • yarn test:jest
    • yarn test:jest_integration
  • New functionality includes testing.
  • New functionality has been documented.
  • Update CHANGELOG.md
  • Commits are signed per the DCO using --signoff

@github-actions github-actions bot commented Oct 9, 2025

❌ Empty Changelog Section

The Changelog section in your PR description is empty. Please add a valid changelog entry or entries. If you did add a changelog entry, check to make sure that it was not accidentally included inside the comment block in the Changelog section.

@s-zx s-zx closed this Oct 9, 2025
@s-zx s-zx reopened this Oct 9, 2025

@codecov codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 65.38462% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.17%. Comparing base (dfc3f48) to head (086983a).

Files with missing lines Patch % Lines
src/core/public/chrome/utils.ts 52.94% 7 Missing and 1 partial ⚠️
src/core/public/chrome/chrome_service.tsx 0.00% 1 Missing ⚠️
Additional details and impacted files
@@                          Coverage Diff                           @@
##           feature/global-semantic-search-PLM   #10695      +/-   ##
======================================================================
- Coverage                               60.25%   60.17%   -0.08%     
======================================================================
  Files                                    4385     4381       -4     
  Lines                                  116753   116581     -172     
  Branches                                19010    18994      -16     
======================================================================
- Hits                                    70346    70154     -192     
- Misses                                  41568    41638      +70     
+ Partials                                 4839     4789      -50     
Flag Coverage Δ
Linux_1 ?
Linux_2 ?
Linux_3 38.41% <0.00%> (-0.02%) ⬇️
Linux_4 32.59% <0.00%> (-0.01%) ⬇️
Windows_1 26.22% <38.46%> (-0.45%) ⬇️
Windows_2 38.75% <62.50%> (-0.05%) ⬇️
Windows_3 38.42% <0.00%> (-0.01%) ⬇️
Windows_4 32.59% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@s-zx s-zx changed the title from "PLM Semantic search" to "global semantic search: PLM" Oct 13, 2025

);

const debouncedSearch = useMemo(() => {
return debounce(search, 500); // 500ms delay, adjust as needed
Collaborator

Nit: Will 500 ms give better UI performance than 300 ms?

Contributor Author

It gives the user more time to finish typing before the semantic search runs.
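For context, the debounce pattern under discussion can be sketched as follows (an illustrative re-implementation; the PR presumably uses an existing debounce utility such as lodash's):

```javascript
// Returns a wrapper that postpones calls to `fn` until `delay` ms
// have elapsed since the last invocation, so the (relatively
// expensive) semantic search only fires once the user pauses typing.
function debounce(fn, delay) {
  let timer = null;
  return (...args) => {
    clearTimeout(timer);
    timer = setTimeout(() => fn(...args), delay);
  };
}
```

With a 500 ms delay, several keystrokes in quick succession trigger a single search 500 ms after the last one.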

}

// Generate embeddings for links
const linkEmbeddings = await Promise.all(
Collaborator

How long does it take to generate embeddings for the links? Maybe we can store this info in memory.

Contributor Author

It takes a few hundred milliseconds. Yes, it's better to store them in memory since these are fixed data.
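The caching idea from this thread can be sketched roughly like this (hypothetical helper names; getEmbedding and embed are not from the PR):

```javascript
// Module-level cache: link titles are fixed at runtime, so each
// embedding only needs to be computed once and can be reused
// across searches.
const embeddingCache = new Map();

// `embed` is whatever async function produces an embedding for a
// string (e.g. a Transformers.js feature-extraction pipeline).
async function getEmbedding(text, embed) {
  if (!embeddingCache.has(text)) {
    embeddingCache.set(text, await embed(text));
  }
  return embeddingCache.get(text);
}
```

Caching the promise instead of the resolved value (set before awaiting) would additionally deduplicate concurrent requests for the same text.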

s-zx added 2 commits October 14, 2025 15:37
Signed-off-by: Zhenxing Shen <[email protected]>
Signed-off-by: Zhenxing Shen <[email protected]>
@yuye-aws yuye-aws (Member) left a comment

Can you provide more context on why you're introducing a tokenizer in this PR? ml-commons already has two tokenizers. FYI: opensearch-project/ml-commons#3708.

If you really think the tokenizer needs to be added, can you make it a backend change in the ml-commons repo?

@xluo-aws (Member)

Can you provide more context on why you're introducing a tokenizer in this PR? ml-commons already has two tokenizers. FYI: opensearch-project/ml-commons#3708.

If you really think the tokenizer needs to be added, can you make it a backend change in the ml-commons repo?

It's a frontend solution to navigate the user to a different page/link; it's not associated with any particular backend.

@yuye-aws (Member)

It's a frontend solution to navigate the user to a different page/link; it's not associated with any particular backend.

Oh I see. Just out of curiosity, could this tokenizer also benefit back-end users?
