@s-zx s-zx commented Oct 9, 2025

Description

Neural Sparse Search offers a lightweight yet effective approach for semantic search by representing text as sparse vectors where most elements are zero. This method bridges the gap between traditional keyword matching and dense neural embeddings.

Neural Sparse Search works in two phases:

  1. Document Processing: Documents are tokenized and converted into sparse vector representations where only meaningful tokens have non-zero values.
  2. Query Processing: User queries undergo the same tokenization process, creating sparse vectors that can be efficiently compared with document vectors.
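The sparse representation described above can be sketched in TypeScript. This is a minimal illustration, assuming a sparse vector is stored as a map from token to weight so that zero entries are simply absent. The names `SparseVector` and `toSparseVector` and the sample weights are hypothetical and not taken from the PR.

```typescript
// Assumed representation: only non-zero dimensions are stored.
type SparseVector = Record<string, number>;

// Hypothetical helper: keep only tokens that have a non-zero weight in the table.
function toSparseVector(tokens: string[], weights: Record<string, number>): SparseVector {
  const vec: SparseVector = {};
  for (const token of tokens) {
    // hasOwnProperty avoids picking up inherited keys such as "constructor".
    if (Object.prototype.hasOwnProperty.call(weights, token) && weights[token] !== 0) {
      vec[token] = weights[token];
    }
  }
  return vec;
}

// Made-up weights for illustration only.
const sampleWeights: Record<string, number> = { dashboard: 1.8, visualize: 1.2, the: 0 };
const exampleVector = toSparseVector(['the', 'dashboard', 'visualize', 'unknown'], sampleWeights);
```

Both documents and queries can go through the same helper, which is what makes the two vectors directly comparable.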

Issues Resolved

Screenshot

Testing the changes

Changelog

Check List

  • All tests pass
    • yarn test:jest
    • yarn test:jest_integration
  • New functionality includes testing.
  • New functionality has been documented.
  • Update CHANGELOG.md
  • Commits are signed per the DCO using --signoff

Contributor

github-actions bot commented Oct 9, 2025

❌ Empty Changelog Section

The Changelog section in your PR description is empty. Please add a valid changelog entry or entries. If you did add a changelog entry, check to make sure that it was not accidentally included inside the comment block in the Changelog section.

codecov bot commented Oct 9, 2025

Codecov Report

❌ Patch coverage is 17.80822% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.77%. Comparing base (dfc3f48) to head (91d6063).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/plugins/workspace/server/sparse_search.ts | 0.00% | 32 Missing ⚠️ |
| src/plugins/workspace/server/routes/index.ts | 5.88% | 16 Missing ⚠️ |
| src/core/public/chrome/utils.ts | 73.33% | 4 Missing ⚠️ |
| ...core/public/chrome/ui/header/header_search_bar.tsx | 0.00% | 3 Missing ⚠️ |
| ...c/chrome/ui/global_search/search_pages_command.tsx | 0.00% | 2 Missing ⚠️ |
| src/core/public/chrome/chrome_service.tsx | 0.00% | 1 Missing ⚠️ |
| src/core/public/core_system.ts | 0.00% | 1 Missing ⚠️ |
| .../components/global_search/search_pages_command.tsx | 50.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (dfc3f48) and HEAD (91d6063).

HEAD has 4 uploads less than BASE

| Flag | BASE (dfc3f48) | HEAD (91d6063) |
|---|---|---|
| Linux_2 | 1 | 0 |
| Linux_1 | 1 | 0 |
| Linux_3 | 1 | 0 |
| Windows_2 | 1 | 0 |

Additional details and impacted files
@@                               Coverage Diff                                @@
##           feature/global-semantic-search-neural-sparse   #10696      +/-   ##
================================================================================
- Coverage                                         60.25%   52.77%   -7.49%     
================================================================================
  Files                                              4385     4099     -286     
  Lines                                            116753   112834    -3919     
  Branches                                          19010    18387     -623     
================================================================================
- Hits                                              70346    59543   -10803     
- Misses                                            41568    48981    +7413     
+ Partials                                           4839     4310     -529     
| Flag | Coverage Δ |
|---|---|
| Linux_1 | ? |
| Linux_2 | ? |
| Linux_3 | ? |
| Linux_4 | 32.59% <0.00%> (-0.01%) ⬇️ |
| Windows_1 | 26.63% <17.80%> (-0.04%) ⬇️ |
| Windows_2 | ? |
| Windows_3 | 38.42% <0.00%> (-0.01%) ⬇️ |
| Windows_4 | 32.59% <0.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

@s-zx s-zx changed the title from "Sparse search" to "global semantic search: neural sparse" on Oct 13, 2025
@s-zx s-zx changed the title from "global semantic search: neural sparse" to "global semantic search: neural sparse search" on Oct 13, 2025

Signed-off-by: Zhenxing Shen <[email protected]>
);

const debouncedSearch = useMemo(() => {
return debounce(search, 500); // 300ms delay, adjust as needed

Contributor

Nit: this comment needs to be updated.

Contributor Author

Thanks for the reminder!

Comment on lines +74 to +81
const nlpBertTokenizer = new BertWordPieceTokenizer({ vocabContent: Object.keys(this.vocab) });
const tokenizedResult = nlpBertTokenizer.tokenizeSentence(query);
const tokensArray = tokenizedResult.tokens;
console.log('Tokenization: ', tokensArray);

const queryVec = this.buildQueryVector(tokensArray);
console.log('Non-zero query dimensions count: ', Object.keys(queryVec).length);
console.log('Non-zero query vector: ', queryVec);

Member

Neural sparse search supports natural language queries. You can search with an existing model_id.

Contributor Author

Cool! We will try this way in the future.

Member

@yuye-aws yuye-aws Oct 14, 2025

Maybe you can provide some context on why you're doing tokenization here? At first glance, you're tokenizing the query to obtain the sparse query vector with IDF values. This is the doc-only search mode: https://docs.opensearch.org/latest/query-dsl/specialized/neural-sparse/

Contributor Author

Yes, you are right. We want to use doc-only neural sparse search to achieve semantic search in the frontend without relying on a backend service. We generate the document vectors in advance and store them on the frontend side. Then we tokenize the query to obtain the sparse query vector with IDF values. After that, relevance is calculated as the dot product between the query vector and each document vector.
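The scoring step described in this reply can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: vectors are assumed to be token-to-weight maps, and `buildIdfQueryVector`, `dotProduct`, `rankDocuments`, and the sample data are hypothetical names and values.

```typescript
// Assumed representation: token -> weight, with zero entries omitted.
type SparseVector = Record<string, number>;

// Hypothetical helper: in doc-only mode the query weight of a token is its
// precomputed IDF value; tokens without an IDF entry are skipped.
function buildIdfQueryVector(tokens: string[], idf: Record<string, number>): SparseVector {
  const vec: SparseVector = {};
  for (const token of tokens) {
    if (Object.prototype.hasOwnProperty.call(idf, token)) vec[token] = idf[token];
  }
  return vec;
}

// Dot product over the dimensions present in both vectors.
function dotProduct(query: SparseVector, doc: SparseVector): number {
  let score = 0;
  for (const token of Object.keys(query)) {
    if (Object.prototype.hasOwnProperty.call(doc, token)) score += query[token] * doc[token];
  }
  return score;
}

// Score every document and sort by descending relevance.
function rankDocuments(
  query: SparseVector,
  docs: Record<string, SparseVector>
): Array<{ id: string; score: number }> {
  return Object.keys(docs)
    .map((id) => ({ id, score: dotProduct(query, docs[id]) }))
    .sort((a, b) => b.score - a.score);
}

// Made-up sample data for illustration only.
const sampleIdf: Record<string, number> = { log: 2, search: 1 };
const sampleDocs: Record<string, SparseVector> = {
  discover: { log: 0.5, search: 1.0 },
  dashboards: { chart: 2.0, search: 0.5 },
};
const ranked = rankDocuments(buildIdfQueryVector(['log', 'search'], sampleIdf), sampleDocs);
```

Because only the query's non-zero dimensions are visited, the cost per document grows with the number of query tokens rather than with the vocabulary size.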

Member

Doc-only queries support natural language queries.

Contributor Author

I see, this way we don't need tokenization anymore.
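For reference, the reviewer's suggestion of passing natural language text with a `model_id` corresponds to a `neural_sparse` query body along these lines. This is a hedged sketch: it assumes the query is sent to an OpenSearch cluster (unlike the precomputed-vector approach discussed above), and the helper name, the field name `description_embedding`, and the model ID placeholder are hypothetical.

```typescript
// Shape of a neural_sparse query that takes raw text plus a model ID.
interface NeuralSparseQueryBody {
  query: {
    neural_sparse: Record<string, { query_text: string; model_id: string }>;
  };
}

// Hypothetical helper: builds the search request body for one sparse vector field.
function buildNeuralSparseQuery(
  field: string,
  queryText: string,
  modelId: string
): NeuralSparseQueryBody {
  return {
    query: {
      neural_sparse: {
        [field]: { query_text: queryText, model_id: modelId },
      },
    },
  };
}

// Placeholder field name and model ID; substitute values from your own cluster.
const searchBody = buildNeuralSparseQuery('description_embedding', 'analyze logs', '<model_id>');
```

With this form, tokenization and weighting happen on the cluster side, so the client only sends the text.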

});

try {
console.log('All links:', allSearchAbleLinks);

Contributor

@FriedhelmWS FriedhelmWS Oct 14, 2025

Nit: AllSearchableLinks.

Contributor Author

Thanks!

core.application.register({
id: PLUGIN_ID,
title: 'Discover',
description: 'Analyze your data in OpenSearch and visualize key metrics.',

Contributor

Just out of curiosity, would semantic search be more accurate if each application had a more explanatory and detailed description?

Contributor Author

Yes, it will improve the relevance. We can enrich them in the future.

...workspaceOptionalAttributesSchema,
});

let jsonData: {

Contributor

Is it safe to make this a global variable in Node.js? Would it be better to create a class, make jsonData a private field of that class, and provide dedicated methods to manipulate it?

Contributor Author

Yes, this is a good concern. We should make it private.
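One way the suggestion in this thread could look is sketched below. It is an assumption-laden illustration rather than the PR's implementation: the class name `DocVectorStore`, the `DocVectors` shape, and the injected `loader` function are all hypothetical, and the real loader would presumably wrap the `readFile` and `JSON.parse` calls.

```typescript
// Assumed shape of the parsed JSON: document id -> sparse vector.
type DocVectors = Record<string, Record<string, number>>;

class DocVectorStore {
  // Private cache replaces the module-level variable.
  private cached: Promise<DocVectors> | undefined;

  // The loader is injected so the class does not depend on a specific file path.
  constructor(private readonly loader: () => Promise<DocVectors>) {}

  // Loads the data on first use and reuses the same promise afterwards.
  get(): Promise<DocVectors> {
    if (this.cached === undefined) {
      this.cached = this.loader().catch((err) => {
        // Drop the failed attempt so a later call can retry.
        this.cached = undefined;
        throw err;
      });
    }
    return this.cached;
  }
}

// Illustration with an in-memory loader and made-up data.
let loadCount = 0;
const store = new DocVectorStore(() => {
  loadCount += 1;
  return Promise.resolve({ discover: { log: 0.5 } });
});
const firstRequest = store.get();
const secondRequest = store.get();
```

Caching the promise rather than the parsed value means concurrent requests share a single load instead of each reading the file.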


if (!jsonData) {
const filePath = path.join(__dirname, 'doc_vectors.json');
const data = await readFile(filePath, 'utf8');

Contributor

Would be nice to have error handling for file reading here I guess.

Contributor Author

We already have a try/catch here; I think it will handle the error.
