Add solr #336
base: main
- Added Solr to the list of supported databases in the main README.
- Implemented Solr-specific command-line arguments in `cmdline_args.py` for configuring Solr URL, index name, and index configuration.
- Updated the `Database` enum in `vsb/databases/__init__.py` to include Solr and return the `SolrDB` class.
- Created necessary files and logic to support Solr database operations, including index creation and management.
- Ensured compatibility with existing VSB workflows and command-line interface.
…nvert existing parquet files to the format Pinecone requires; the other to start the import itself.
… continuing a workload that failed.
…esume
- Add persistent `requests.Session` with `HTTPAdapter` + `Retry`; centralize HTTP helpers.
- Ensure schema on startup:
  - add/replace `knn_vector` fieldType with correct `vectorDimension` and `similarityFunction`
  - ensure `id` + `values` fields
  - add typed, multiValued dynamic fields (`*_s`, `*_i`, `*_f`, `*_b`)
- Core lifecycle:
  - `core_exists`, `_wait_for_core_loaded`, `_recreate_core`, `_unload_hard`
  - `create_index` now cleanly creates core and refuses to recreate if it already exists
- Ingest robustness:
  - `_filter_existing_ids` to skip already-present docs (RTG via `/select`)
  - `_normalize_docs` + type inference + field auto-creation
  - batched add with retries and `commitWithin=60000`
  - `delete_all`, `delete_index`, and explicit final `commit()`
- Query/filters:
  - `_to_solr_fq` builds typed `fq` from dicts/lists/bools/nums
  - `search` defaults `ef=200`, returns `id,score` only, supports dict `fq`
- API/behavior changes in wrappers:
  - `SolrClient.__init__` now `core=<name>`, keyword-only; supports `start_from`, `overwrite`, retry settings
  - `SolrNamespace.insert_batch` respects `skip_populate`, passes `start_from`/`overwrite`; `query` returns list of IDs directly
  - `SolrDB` passes args by name; handles `overwrite` vs resume; lowers `max_batch_size` to 100; commits after finalize

BREAKING CHANGES:
- `SolrClient.__init__` signature changed: use `core=<name>` and keyword args; `index_name` removed.
- `create_index` errors if core already exists (use `overwrite`/drop if you need a fresh core).
- `search` no longer returns `metadata` in `fl` (now `id,score`); update callers if they relied on metadata.
- `SolrDB` expects new config keys: `start_from`, `solr_max_retries`, `solr_retry_delay`.
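To make the `_normalize_docs` / type-inference item concrete, here is a minimal sketch of how metadata might be mapped onto the typed, multiValued dynamic fields (`*_s`, `*_i`, `*_f`, `*_b`) listed above. It is not the code from this PR: the function names `dynamic_field_name` and `normalize_doc`, the choice to infer the type from the first element of a list, and the fallback to string are all assumptions made for illustration.

```python
from typing import Any


def dynamic_field_name(key: str, value: Any) -> str:
    """Map a metadata key to a typed dynamic-field name such as `tags_s`.

    Hypothetical helper: the type is inferred from the first element when
    the value is a non-empty list or tuple.
    """
    sample = value[0] if isinstance(value, (list, tuple)) and value else value
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(sample, bool):
        suffix = "b"
    elif isinstance(sample, int):
        suffix = "i"
    elif isinstance(sample, float):
        suffix = "f"
    else:
        suffix = "s"  # assumption: anything else is stored as a string
    return f"{key}_{suffix}"


def normalize_doc(
    doc_id: str, values: list[float], metadata: dict[str, Any]
) -> dict[str, Any]:
    """Promote metadata to top-level, multi-valued, typed fields."""
    doc: dict[str, Any] = {"id": doc_id, "values": values}
    for key, value in metadata.items():
        if value is None:
            continue  # nothing to index for this key
        items = list(value) if isinstance(value, (list, tuple)) else [value]
        if not items:
            continue  # skip empty lists; no type can be inferred
        name = dynamic_field_name(key, items)
        if name.endswith("_s"):
            items = [str(v) for v in items]
        doc[name] = items
    return doc
```

Wrapping scalars in a one-element list keeps every promoted field consistent with a multiValued schema, so a later document that carries several values for the same key does not need a different field.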
…esources - Added skeleton configsets for Solr to ensure custom settings for cache are used
- Revised the solr.py module to better handle queries with Solr tuning
| GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
|---|---|---|---|---|---|
| 20844949 | Triggered | Generic Password | 3c4f073 | tests/integration/test_solr.py | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn here the best practices.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and to ease remediation.
🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Reviewed and merging the changes.
Problem
We need to add first-class Solr support to VSB while ensuring robustness, configurability, and code quality.
Solution
- `SolrClient` & `SolrNamespace` in `vsb/databases/solr/solr.py` with robust retry, backoff, schema management, core lifecycle, and metadata flattening that promotes fields to top-level.
- `--query-limit` CLI option (`vsb/cmdline_args.py`) and plumbed it through workload builders (`locustfile.py`, `ParquetWorkload`) to allow small test runs.
- … (`vsb/metrics.py`) to guarantee correct recall.
- … `deleteIndex`, `deleteDataDir`, `deleteInstanceDir`, proper polling, and Docker volume adjustments to avoid stale data.
- … `StopUser` for clean Locust shutdown.
- … `SOLR_JAVA_MEM`), G1GC flags, ulimits, CPU/memory quotas in `docker/solr/docker-compose.yml`.
- `solrconfig.xml` & `managed-schema.xml` with tuned `filterCache`, `queryResultCache`, `docValuesCache`, thread pools, and HNSW parameters (`ef`, `beamWidth`) via bind mounts.
- `tests/integration` for Solr mirroring the existing common test framework.
- `vsb/databases/solr/README.md` on configset seeding and Docker usage.

Type of Change
Test Plan
- `pytest tests/unit` and `pytest tests/integration --db solr` pass with query-limit and recall > 0.
- … `ef`, `fq` parameters return expected docs.
- … `locust -f vsb/locustfile.py --query-limit 100 --database solr` and verify correct op/s and recall metrics.
- `black . --check` returns no changes.
- … `solrconfig.xml` without deleting `/var/solr/data`.
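For the `fq` check in the test plan, it may help to see the kind of dict-to-`fq` translation the commit notes attribute to `_to_solr_fq`. The following is a minimal sketch rather than the PR's implementation: `to_solr_fq` and `_term` are hypothetical names, the field names in the usage example are invented, and quoting every string value as a phrase and joining list values with `OR` are assumptions about the intended semantics.

```python
from typing import Any


def _term(value: Any) -> str:
    """Render one value as a Solr query term (hypothetical helper)."""
    # bool first: bool is a subclass of int, and Solr expects true/false.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Strings are emitted as quoted phrases; escape backslashes and quotes.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_solr_fq(filters: dict[str, Any]) -> list[str]:
    """Build one `fq` clause per field; list values become an OR group."""
    clauses: list[str] = []
    for field, value in filters.items():
        if isinstance(value, (list, tuple)):
            if not value:
                continue  # an empty group would be an invalid clause
            terms = " OR ".join(_term(v) for v in value)
            clauses.append(f"{field}:({terms})")
        else:
            clauses.append(f"{field}:{_term(value)}")
    return clauses


if __name__ == "__main__":
    # Each returned string would be sent as a separate fq parameter.
    for clause in to_solr_fq({"tags_s": ["a", "b"], "ok_b": True, "year_i": 2020}):
        print(clause)
```

Returning one clause per field, instead of a single combined string, lets each clause be sent as its own `fq` parameter, which Solr intersects and can cache independently in the `filterCache`.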