Add solr #336

cwaddingham · 2025-09-18T15:43:38Z

Problem

We need to add first‐class Solr support to VSB while ensuring robustness, configurability, and code quality.

- Solr queries were returning zero recall due to metadata flattening and type mismatches.
- Long‐running test runs (100k queries) needed a way to be limited for quick iteration.
- Solr core creation/deletion and index schema setup were brittle, leading to 503 errors and stale cores.
- Performance of Solr k-NN queries was unacceptably slow out-of-the-box.
- The codebase had accumulated linter violations and inconsistent formatting.

Solution

Integrate Solr as a supported backend
- Implemented SolrClient & SolrNamespace in vsb/databases/solr/solr.py with robust retry, backoff, schema management, core lifecycle, and metadata flattening that promotes fields to top-level.
- Added --query-limit CLI option (vsb/cmdline_args.py) and plumbed it through workload builders (locustfile.py, ParquetWorkload) to allow small test runs.
- Cast both actual & expected IDs to strings in the search client and metrics (vsb/metrics.py) to guarantee correct recall.
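
The ID casting described above can be illustrated with a small sketch. It is not the code from vsb/metrics.py; the function name `recall` and the treatment of an empty ground truth are assumptions made for illustration. It shows why comparing Solr's string IDs against integer ground-truth IDs yields zero recall unless both sides are cast to `str`.

```python
from typing import Iterable


def recall(actual_ids: Iterable, expected_ids: Iterable) -> float:
    """Fraction of expected IDs present in the results, compared as strings."""
    expected = {str(i) for i in expected_ids}
    if not expected:
        return 1.0  # assumption: an empty ground truth counts as perfect recall
    actual = {str(i) for i in actual_ids}
    return len(actual & expected) / len(expected)


# Solr returns string IDs, while the ground truth may hold integers.
solr_ids = ["1", "2", "7"]
ground_truth = [1, 2, 3]

# Without casting, the sets never intersect because "1" != 1.
uncast_overlap = len(set(solr_ids) & set(ground_truth))
print(uncast_overlap, recall(solr_ids, ground_truth))
```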
Resilient index creation & cleanup
- Enhanced core unload/create logic with deleteIndex, deleteDataDir, deleteInstanceDir, proper polling, and Docker volume adjustments to avoid stale data.
- Wrapped all errors in StopUser for clean Locust shutdown.
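
As a rough illustration of the hard unload mentioned above, the following sketch builds a CoreAdmin `UNLOAD` request that also removes the index, data directory, and instance directory. The helper name `unload_core_url`, the base URL, and the core name are hypothetical; the PR's actual implementation in solr.py may differ.

```python
from urllib.parse import urlencode


def unload_core_url(base_url: str, core: str) -> str:
    """Build a CoreAdmin UNLOAD request that also removes on-disk state."""
    params = {
        "action": "UNLOAD",
        "core": core,
        "deleteIndex": "true",
        "deleteDataDir": "true",
        "deleteInstanceDir": "true",
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/admin/cores?{urlencode(params)}"


# Hypothetical local Solr instance and core name.
url = unload_core_url("http://localhost:8983/solr/", "vsb_test")
print(url)
```

Removing all three on-disk locations is what prevents a later core creation from picking up stale data.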
Performance tuning
- Exposed Solr JVM heap (SOLR_JAVA_MEM), G1GC flags, ulimits, CPU/memory quotas in docker/solr/docker-compose.yml.
- Mounted custom solrconfig.xml & managed-schema.xml with tuned filterCache, queryResultCache, docValuesCache, thread pools, and HNSW parameters (ef, beamWidth) via bind mounts.
Code quality & infrastructure
- Ran Black (v24.4.2) across the repo and fixed all formatting violations.
- Updated tests in tests/integration for Solr mirroring the existing common test framework.
- Added documentation to README and vsb/databases/solr/README.md on configset seeding and Docker usage.
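
One way the `--query-limit` option from the solution above could cap a query stream is sketched below. Only the flag name comes from this PR; the helper `limit_queries`, the "no limit when unset" behavior, and the stand-in query stream are assumptions, and the real plumbing through locustfile.py and ParquetWorkload is not shown here.

```python
import argparse
from itertools import islice
from typing import Iterable, Iterator, Optional


def limit_queries(queries: Iterable, query_limit: Optional[int]) -> Iterator:
    """Yield at most query_limit items; None or 0 means no limit (assumed)."""
    if not query_limit:
        return iter(queries)
    return islice(queries, query_limit)


parser = argparse.ArgumentParser()
parser.add_argument(
    "--query-limit", type=int, default=None, help="Maximum number of queries to run"
)
args = parser.parse_args(["--query-limit", "100"])

queries = range(100_000)  # stand-in for the workload's query stream
limited = list(limit_queries(queries, args.query_limit))
print(len(limited))
```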

Type of Change

- New feature (adds Solr integration and related improvements)
- Bug fix (metadata flattening, ID type casting, core teardown)
- Infrastructure change (Docker Compose, solrconfig.xml, cache tuning)
- Documentation update (README additions, configset instructions)

Test Plan

Unit & integration tests
- `pytest tests/unit` and `pytest tests/integration --db solr` pass with `--query-limit` set and recall > 0.
Manual Solr validation
- Use curl to confirm k-NN queries with ef, fq parameters return expected docs.
Locust workload
- Run locust -f vsb/locustfile.py --query-limit 100 --database solr and verify correct op/s and recall metrics.
Performance check
- Benchmark before/after cache and JVM tuning; ensure p95 latency drops under target (e.g. <500 ms).
Linter & formatting
- black . --check returns no changes.
Docker restart
- Confirm Solr container picks up custom solrconfig.xml without deleting /var/solr/data.
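
For the manual validation step, the request a curl command would send can be sketched as follows. It uses Solr's `{!knn}` query parser against the `values` vector field named elsewhere in this PR. The helper name, host, core name, and the `tags_s` filter field are hypothetical, and the `ef` parameter is left out because its exact spelling in the PR's requests is not shown here.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def knn_select_params(
    vector: Sequence[float], top_k: int, fq: Optional[str] = None
) -> dict:
    """Parameters for a /select request using Solr's {!knn} query parser."""
    vec = "[" + ", ".join(repr(float(x)) for x in vector) + "]"
    params = {
        "q": f"{{!knn f=values topK={top_k}}}{vec}",
        "fl": "id,score",  # return IDs and scores only
        "rows": str(top_k),
    }
    if fq:
        params["fq"] = fq
    return params


params = knn_select_params([0.1, 0.2, 0.3], top_k=10, fq="tags_s:news")
# Hypothetical host and core name; paste the printed URL into curl.
print("http://localhost:8983/solr/vsb_test/select?" + urlencode(params))
```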

- Added Solr to the list of supported databases in the main README.
- Implemented Solr-specific command-line arguments in `cmdline_args.py` for configuring Solr URL, index name, and index configuration.
- Updated the `Database` enum in `vsb/databases/__init__.py` to include Solr and return the `SolrDB` class.
- Created necessary files and logic to support Solr database operations, including index creation and management.
- Ensured compatibility with existing VSB workflows and command-line interface.

…esume

- Add persistent `requests.Session` with `HTTPAdapter` + `Retry`; centralize HTTP helpers.
- Ensure schema on startup:
  - add/replace `knn_vector` fieldType with correct `vectorDimension` and `similarityFunction`
  - ensure `id` + `values` fields
  - add typed, multiValued dynamic fields (`*_s`, `*_i`, `*_f`, `*_b`)
- Core lifecycle:
  - `core_exists`, `_wait_for_core_loaded`, `_recreate_core`, `_unload_hard`
  - `create_index` now cleanly creates core and refuses to recreate if it already exists
- Ingest robustness:
  - `_filter_existing_ids` to skip already-present docs (RTG via `/select`)
  - `_normalize_docs` + type inference + field auto-creation
  - batched add with retries and `commitWithin=60000`
  - `delete_all`, `delete_index`, and explicit final `commit()`
- Query/filters:
  - `_to_solr_fq` builds typed `fq` from dicts/lists/bools/nums
  - `search` defaults `ef=200`, returns `id,score` only, supports dict `fq`
- API/behavior changes in wrappers:
  - `SolrClient.__init__` now `core=<name>`, keyword-only; supports `start_from`, `overwrite`, retry settings
  - `SolrNamespace.insert_batch` respects `skip_populate`, passes `start_from`/`overwrite`; `query` returns list of IDs directly
  - `SolrDB` passes args by name; handles `overwrite` vs resume; lowers `max_batch_size` to 100; commits after finalize

BREAKING CHANGES:

- `SolrClient.__init__` signature changed: use `core=<name>` and keyword args; `index_name` removed.
- `create_index` errors if core already exists (use `overwrite`/drop if you need a fresh core).
- `search` no longer returns `metadata` in `fl` (now `id,score`); update callers if they relied on metadata.
- `SolrDB` expects new config keys: `start_from`, `solr_max_retries`, `solr_retry_delay`.
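
To make the typed `fq` construction more concrete, here is a minimal sketch of how a filter dict might be turned into Solr filter clauses. It is not the `_to_solr_fq` from this PR: the function names, the quoting rules, the OR semantics for lists, and the example field names are all assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def _quote(value: str) -> str:
    """Quote a string term, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def _term(value: Any) -> str:
    # bool must be tested before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    return _quote(str(value))


def to_solr_fq(filters: Dict[str, Any]) -> List[str]:
    """Return one fq clause per field; list values become an OR group."""
    clauses = []
    for field, value in filters.items():
        if isinstance(value, (list, tuple)):
            if not value:
                continue  # assumption: an empty list adds no constraint
            terms = " OR ".join(_term(v) for v in value)
            clauses.append(f"{field}:({terms})")
        else:
            clauses.append(f"{field}:{_term(value)}")
    return clauses


# Hypothetical fields using the typed dynamic-field suffixes listed above.
fq = to_solr_fq({"tags_s": ["news", "sports"], "year_i": 2024, "public_b": True})
print(fq)
```

Each returned clause can be sent as a separate `fq` parameter, which Solr intersects.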

gitguardian · 2025-09-18T15:44:36Z

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 20844949 | Triggered | Generic Password | `3c4f073` | tests/integration/test_solr.py |

🛠 Guidelines to remediate hardcoded secrets

- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn here the best practices.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future, consider:

- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

vkrishna1084 · 2025-09-19T21:02:49Z

reviewed and merging the changes

cwaddingham added 17 commits September 8, 2025 17:06

Added two utilities for handling bulk import into Pinecone. One to co…

e2571b3

…nvert existing parquet files to the format Pinecone requires; the other to start the import itself.

Corrected distance and dimensions for YFCC dataset.

61ebdfd

Added additional libraries to pyproject.toml.

eec3bdc

Added additional command line arguments for Solr, specifically around…

463ac73

… continuing a workload that failed.

Additional error logging on Pinecone index creation.

75d3793

Corrected linter errors.

6a76055

Updated formatting to clear linter errors.

a864f06

Updated test files to resolve linter errors.

e1cc061

- Updated docker-compose.yaml to set reasonable minimums for system r…

8b3ed85

…esources - Added skeleton configsets for Solr to ensure custom settings for cache are used

Added new command line option to limit how many queries are run (usef…

f61e583

…ul for testing).

- Updated README.md with instructions on using the Docker container

b96f0b1

- Revised the solr.py module to better handle queries with Solr tuning

Added support for query limits and ensured proper closing of database…

d19ce3b

… handles.

Ensured vector IDs are compared with ground truth consistently by cas…

32121a6

…ting all to str.

Updates to allow for query limits from the command line.

1482ec2

Updates to comply with black linter report.

53028d0

cwaddingham requested a review from vkrishna1084 September 18, 2025 15:43

cwaddingham added 2 commits September 18, 2025 08:45

Regenerate poetry.lock after pyproject.toml changes

9ad8863

Changed hard coded user/password for Solr test to use environment var…

63224ca

…iables.

vkrishna1084 approved these changes Sep 19, 2025

cwaddingham added 2 commits September 19, 2025 14:24

Added yfcc-test as a workload to resolve some Python import errors du…

66a786b

…ring PR merge.

Added spawn-test-solr as a test to resolve some Python import errors …

ccdc5bb

…during PR merge.
