
@cwaddingham

Problem

We need to add first‐class Solr support to VSB while ensuring robustness, configurability, and code quality.

  • Solr queries were returning zero recall due to metadata flattening and type mismatches.
  • Long‐running test runs (100k queries) needed a way to be limited for quick iteration.
  • Solr core creation/deletion and index schema setup were brittle, leading to 503 errors and stale cores.
  • Solr k-NN queries were unacceptably slow out of the box.
  • The codebase had accumulated linter violations and inconsistent formatting.
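
To illustrate the type-mismatch failure mode behind the zero-recall symptom, here is a minimal sketch. It is not the code in vsb/metrics.py; the function names, the sample IDs, and the choice to treat an empty expected set as recall 1.0 are all assumptions made for illustration.

```python
def naive_recall(actual_ids, expected_ids):
    """Recall computed without normalising ID types (illustrative only)."""
    expected = set(expected_ids)
    if not expected:
        return 1.0  # assumption: nothing expected counts as full recall
    return len(set(actual_ids) & expected) / len(expected)


def recall(actual_ids, expected_ids):
    """Recall with both sides cast to str before comparison."""
    expected = {str(i) for i in expected_ids}
    if not expected:
        return 1.0  # assumption: nothing expected counts as full recall
    actual = {str(i) for i in actual_ids}
    return len(actual & expected) / len(expected)


# Hypothetical data: the search backend returns string ids,
# while the ground truth stores integers.
returned = ["7", "12", "31"]
ground_truth = [7, 12, 99]

print(naive_recall(returned, ground_truth))  # "7" != 7, so nothing matches
print(recall(returned, ground_truth))        # two of three expected ids match
```

Casting at the comparison boundary keeps the fix local: neither the dataset loader nor the Solr client needs to agree on a single ID type.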

Solution

  • Integrate Solr as a supported backend
    • Implemented SolrClient & SolrNamespace in vsb/databases/solr/solr.py with robust retry, backoff, schema management, core lifecycle, and metadata flattening that promotes fields to top-level.
    • Added --query-limit CLI option (vsb/cmdline_args.py) and plumbed it through workload builders (locustfile.py, ParquetWorkload) to allow small test runs.
    • Cast both actual & expected IDs to strings in the search client and metrics (vsb/metrics.py) to guarantee correct recall.
  • Resilient index creation & cleanup
    • Enhanced core unload/create logic with deleteIndex, deleteDataDir, deleteInstanceDir, proper polling, and Docker volume adjustments to avoid stale data.
    • Wrapped all errors in StopUser for clean Locust shutdown.
  • Performance tuning
    • Exposed Solr JVM heap (SOLR_JAVA_MEM), G1GC flags, ulimits, CPU/memory quotas in docker/solr/docker-compose.yml.
    • Mounted custom solrconfig.xml & managed-schema.xml with tuned filterCache, queryResultCache, docValuesCache, thread pools, and HNSW parameters (ef, beamWidth) via bind mounts.
  • Code quality & infrastructure
    • Ran Black (v24.4.2) across the repo, fixed all formatting violations.
    • Updated tests in tests/integration for Solr mirroring the existing common test framework.
    • Added documentation to README and vsb/databases/solr/README.md on configset seeding and Docker usage.
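
As a rough sketch of what "metadata flattening that promotes fields to top-level" can look like, the snippet below merges a record's metadata keys into the document body next to `id` and `values`. The record layout, the function name `flatten_doc`, and the decision to raise on a key collision are assumptions, not the behaviour of vsb/databases/solr/solr.py.

```python
def flatten_doc(record):
    """Return a flat document with metadata keys promoted to top level.

    `record` is assumed to look like
    {"id": ..., "values": [...], "metadata": {...}}.
    """
    doc = {"id": str(record["id"]), "values": list(record["values"])}
    for key, value in (record.get("metadata") or {}).items():
        if key in doc:
            # Assumption: refuse to overwrite the reserved fields.
            raise ValueError(f"metadata key {key!r} collides with a reserved field")
        doc[key] = value
    return doc


# Hypothetical record for illustration.
example = {
    "id": 3,
    "values": [0.1, 0.2],
    "metadata": {"color": "red", "tags": ["a", "b"]},
}
print(flatten_doc(example))
```

With the fields at top level, a filter such as `color:"red"` can be written directly against the field name rather than against a nested metadata blob.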

Type of Change

  • New feature (adds Solr integration and related improvements)
  • Bug fix (metadata flattening, ID type casting, core teardown)
  • Infrastructure change (Docker Compose, solrconfig.xml, cache tuning)
  • Documentation update (README additions, configset instructions)

Test Plan

  1. Unit & integration tests
    • pytest tests/unit and pytest tests/integration --db solr pass with query-limit and recall > 0.
  2. Manual Solr validation
    • Use curl to confirm k-NN queries with ef, fq parameters return expected docs.
  3. Locust workload
    • Run locust -f vsb/locustfile.py --query-limit 100 --database solr and verify correct op/s and recall metrics.
  4. Performance check
    • Benchmark before/after cache and JVM tuning; ensure p95 latency drops under target (e.g. <500 ms).
  5. Linter & formatting
    • black . --check returns no changes.
  6. Docker restart
    • Confirm Solr container picks up custom solrconfig.xml without deleting /var/solr/data.
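
For the manual validation step, the request parameters can be assembled in Python before being handed to curl or an HTTP client. This is a sketch under assumptions: it uses Solr's `{!knn f=... topK=...}` query parser syntax with the `values` vector field named in this PR, the helper name `knn_params` is made up, and the `ef` parameter is left out because how this PR passes it is not shown here.

```python
from urllib.parse import urlencode


def knn_params(vector, top_k=10, field="values", fq=None):
    """Build /select parameters for a Solr k-NN query (illustrative helper)."""
    vec = "[" + ", ".join(repr(float(x)) for x in vector) + "]"
    params = {
        "q": f"{{!knn f={field} topK={top_k}}}{vec}",
        "fl": "id,score",  # only ids and scores are needed to check recall
        "rows": top_k,
    }
    if fq:
        params["fq"] = list(fq)  # one fq clause per entry
    return params


params = knn_params([0.5, 1], top_k=3, fq=['color:"red"'])
# URL-encoded form, usable as the query string of a /select request.
print(urlencode(params, doseq=True))
```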

- Added Solr to the list of supported databases in the main README.
- Implemented Solr-specific command-line arguments in `cmdline_args.py` for configuring Solr URL, index name, and index configuration.
- Updated the `Database` enum in `vsb/databases/__init__.py` to include Solr and return the `SolrDB` class.
- Created necessary files and logic to support Solr database operations, including index creation and management.
- Ensured compatibility with existing VSB workflows and command-line interface.
…nvert existing parquet files to the format Pinecone requires; the other to start the import itself.
…esume

- Add persistent `requests.Session` with `HTTPAdapter` + `Retry`; centralize HTTP helpers.
- Ensure schema on startup:
  - add/replace `knn_vector` fieldType with correct `vectorDimension` and `similarityFunction`
  - ensure `id` + `values` fields
  - add typed, multiValued dynamic fields (`*_s`, `*_i`, `*_f`, `*_b`)
- Core lifecycle:
  - `core_exists`, `_wait_for_core_loaded`, `_recreate_core`, `_unload_hard`
  - `create_index` now cleanly creates core and refuses to recreate if it already exists
- Ingest robustness:
  - `_filter_existing_ids` to skip already-present docs (RTG via `/select`)
  - `_normalize_docs` + type inference + field auto-creation
  - batched add with retries and `commitWithin=60000`
  - `delete_all`, `delete_index`, and explicit final `commit()`
- Query/filters:
  - `_to_solr_fq` builds typed `fq` from dicts/lists/bools/nums
  - `search` defaults `ef=200`, returns `id,score` only, supports dict `fq`
- API/behavior changes in wrappers:
  - `SolrClient.__init__` now `core=<name>`, keyword-only; supports `start_from`, `overwrite`, retry settings
  - `SolrNamespace.insert_batch` respects `skip_populate`, passes `start_from`/`overwrite`; `query` returns list of IDs directly
  - `SolrDB` passes args by name; handles `overwrite` vs resume; lowers `max_batch_size` to 100; commits after finalize
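
The filter translation described above ("builds typed `fq` from dicts/lists/bools/nums") might look roughly like the sketch below. It is not the PR's `_to_solr_fq`; the name `to_solr_fq_sketch`, the OR semantics for lists, and the quoting rules are assumptions chosen to keep the example small.

```python
def _term(value):
    """Render one value as a Solr query term."""
    if isinstance(value, bool):  # bool must be checked before int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Strings are quoted, with backslashes and double quotes escaped.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_solr_fq_sketch(filters):
    """Turn {"field": value_or_list} into a list of fq clauses."""
    clauses = []
    for field, value in filters.items():
        if isinstance(value, (list, tuple)):
            # Assumption: a list means "match any of these values".
            terms = " OR ".join(_term(v) for v in value)
            clauses.append(f"{field}:({terms})")
        else:
            clauses.append(f"{field}:{_term(value)}")
    return clauses


print(to_solr_fq_sketch({"color": "red", "year": 2020, "active": True, "tags": ["a", "b"]}))
```

Emitting one clause per field, instead of a single joined string, lets each clause be sent as its own `fq` parameter.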

BREAKING CHANGES:
- `SolrClient.__init__` signature changed: use `core=<name>` and keyword args; `index_name` removed.
- `create_index` errors if core already exists (use `overwrite`/drop if you need a fresh core).
- `search` no longer returns `metadata` in `fl` (now `id,score`); update callers if they relied on metadata.
- `SolrDB` expects new config keys: `start_from`, `solr_max_retries`, `solr_retry_delay`.
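
For callers adapting to the `id,score`-only result shape, the following sketch shows one way to reduce a Solr JSON response to a plain list of IDs. The helper name `ids_from_response` and the sample payload are hypothetical; the `response.docs` layout is the standard Solr JSON response structure.

```python
def ids_from_response(payload):
    """Extract document ids, in ranked order, from a Solr /select JSON body."""
    docs = payload.get("response", {}).get("docs", [])
    return [str(doc["id"]) for doc in docs]


# Hypothetical response body with fl=id,score.
sample = {
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "12", "score": 0.93},
            {"id": "7", "score": 0.88},
        ],
    },
}
print(ids_from_response(sample))
```

Callers that previously read metadata from search results would need a separate lookup by ID, since the documents no longer carry those fields.
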
…esources

- Added skeleton configsets for Solr so that the custom cache settings are actually applied
- Revised the solr.py module so queries work better with the tuned Solr configuration
@gitguardian

gitguardian bot commented Sep 18, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 20844949 | Triggered | Generic Password | 3c4f073 | tests/integration/test_solr.py |
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@vkrishna1084
Contributor

Reviewed; merging the changes.
