# Fixed embedding issue (8192 token limit) + allow to configure persistent local storage for chromadb #242
## Conversation
**Summary of Changes**

Hello @bmigette, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request significantly enhances the `FinancialSituationMemory` class with text chunking for long inputs and optional persistent storage.
**Code Review**
This pull request introduces significant improvements by adding text chunking to handle large inputs and enabling persistent storage for FinancialSituationMemory. The changes are well-structured and include comprehensive tests. My review focuses on improving code clarity, robustness, and maintainability by suggesting the use of more idiomatic library features, better exception handling, and code style enhancements. I've also noted the addition of documentation files that might be redundant.
```python
        try:
            self.situation_collection = self.chroma_client.get_collection(name=collection_name)
        except:
            self.situation_collection = self.chroma_client.create_collection(name=collection_name)
```
This `try`/`except` block uses a bare `except:`, which is dangerous as it catches system-exiting exceptions like `SystemExit` and `KeyboardInterrupt`. This logic can be made safer and more concise by using the idiomatic `get_or_create_collection` method provided by ChromaDB.
Suggested change:

```python
self.situation_collection = self.chroma_client.get_or_create_collection(name=collection_name)
```

Another hunk from the diff quotes the chunk-embedding loop:

```python
                response = self.client.embeddings.create(
                    model=self.embedding, input=chunk
                )
                chunk_embeddings.append(response.data[0].embedding)
            except Exception as e:
```
```python
        except Exception:
            # Fallback: try without settings if there are compatibility issues
            self.chroma_client = chromadb.PersistentClient(path=persistent_dir)
```
Catching a broad `Exception` can hide unexpected errors and make debugging harder. While the fallback is intended for compatibility issues, it would be more robust to at least log the caught exception so it's not silently ignored.
Suggested change:

```diff
-        except Exception:
-            # Fallback: try without settings if there are compatibility issues
-            self.chroma_client = chromadb.PersistentClient(path=persistent_dir)
+        except Exception as e:
+            # Fallback: try without settings if there are compatibility issues
+            print(f"Warning: Failed to initialize PersistentClient with settings, falling back. Error: {e}")
+            self.chroma_client = chromadb.PersistentClient(path=persistent_dir)
```
| """ | ||
| # text-embedding-3-small has a max context length of 8192 tokens | ||
| # Conservative estimate: ~3 characters per token for safety margin | ||
| max_chars = 24000 # ~8000 tokens * 3 chars/token |
The value 24000 is a "magic number". It's better to define it as a constant at the class or module level with a descriptive name (e.g., `_MAX_EMBEDDING_CHARS`). This improves readability and makes it easier to change if the model limits are updated in the future. This value is also used to calculate `chunk_size` on line 85.
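A minimal sketch of that refactor; the constant name is the reviewer's suggestion and the surrounding method shape is assumed, not taken from the PR:

```python
# Conservative character budget for a single embedding call:
# text-embedding-3-small accepts 8,192 tokens; ~3 chars/token leaves headroom.
_MAX_EMBEDDING_CHARS = 24000  # ~8000 tokens * 3 chars/token


class FinancialSituationMemory:
    def get_embedding(self, text):
        # Reuse the named constant everywhere the limit matters,
        # including the chunk_size calculation mentioned above.
        if len(text) <= _MAX_EMBEDDING_CHARS:
            ...
```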
```python
            return [response.data[0].embedding]

        # Text is too long, use RecursiveCharacterTextSplitter
        print(f"Text length {len(text)} exceeds limit, splitting into chunks for embedding")
```
Using `print` for logging is not ideal for production code. It's better to use Python's `logging` module, which provides more control over log levels, formatting, and output streams (e.g., writing to a file or stderr). The PR description mentions this was a conscious choice to simplify, but it's a key point for maintainability. Similar `print` statements are on lines 92 and 103.
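A sketch of the `logging`-based alternative the review suggests; the logger name and level are illustrative:

```python
import logging

# Module-level logger; applications configure handlers and levels centrally.
logger = logging.getLogger(__name__)

text = "...some long financial analysis..."
# Instead of: print(f"Text length {len(text)} exceeds limit, ...")
logger.info(
    "Text length %d exceeds limit, splitting into chunks for embedding",
    len(text),
)
```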
```python
            situation_embeddings = self.get_embedding(situation)

            # Add each chunk as a separate document
            for chunk_idx, embedding in enumerate(situation_embeddings):
```
```diff
@@ -0,0 +1,325 @@
+# Pull Request: Add Chunking and Persistent Storage to FinancialSituationMemory
```
This file, along with `CHANGES_SUMMARY.md`, appears to be a copy of the pull request description. While detailed documentation is great, checking it into the repository for each PR can lead to a lot of one-off documentation files that may become outdated. Consider whether this information would be better placed in a more central location, such as a project wiki, a `CHANGELOG.md`, or the code's docstrings, to make it more maintainable in the long run.
---

# Pull Request: Add Chunking and Persistent Storage to FinancialSituationMemory

## Overview

This PR adds two critical improvements to the `FinancialSituationMemory` class:

- Automatic chunking of long texts via `RecursiveCharacterTextSplitter`
- Optional persistent local storage for ChromaDB

## Motivation
### Problem 1: Long Text Handling

The OpenAI embedding model `text-embedding-3-small` has a maximum context length of 8,192 tokens. When financial analysis texts exceed this limit (e.g., comprehensive market analyses, long research reports), the embedding API fails with an error. Before/after behavior is sketched below.
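The original before/after snippets did not survive extraction; a minimal sketch of the behavior, assuming an OpenAI Python client (`client`) and a `memory` instance of the enhanced class:

```python
# Before: the full text went to the API in one call; inputs past 8,192
# tokens fail with a context-length error from the embeddings endpoint.
response = client.embeddings.create(
    model="text-embedding-3-small", input=very_long_text
)

# After: get_embedding() chunks the text first and returns one embedding
# per chunk, so no single request exceeds the model's context length.
embeddings = memory.get_embedding(very_long_text)  # list of one or more vectors
```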
### Problem 2: Memory Persistence

The original implementation used ChromaDB's in-memory client, meaning all memories were lost when the process ended. For production trading systems, persisting historical knowledge is essential. Before/after is sketched below.
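A sketch of the client change, based on the `chromadb.PersistentClient(path=persistent_dir)` call quoted in the review excerpts; the in-memory form is ChromaDB's standard API, assumed here:

```python
import chromadb

# Before: in-memory client; every collection vanishes when the process exits
chroma_client = chromadb.Client()

# After: persistent client; collections are written to disk and survive restarts
chroma_client = chromadb.PersistentClient(path="./chroma_db")
```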
## Changes Made

### 1. Updated `requirements.txt`

Added `langchain` for `RecursiveCharacterTextSplitter`:

```diff
 typing-extensions
+langchain
 langchain-openai
 langchain-experimental
```

### 2. Enhanced `FinancialSituationMemory.__init__()`

Added parameters:

- `symbol` (optional): for symbol-specific collection naming
- `persistent_dir` (optional): path for persistent storage

Key improvements:
- Uses a persistent ChromaDB client when `persistent_dir` is provided; falls back to the in-memory client otherwise

### 3. Reimplemented `get_embedding()`

Return type changed: `get_embedding()` now returns a list of embeddings (one per chunk) instead of a single embedding.

Key features:

- `RecursiveCharacterTextSplitter` for intelligent chunking

Algorithm (sketched below):
### 4. Updated `add_situations()`

Changed to handle chunked embeddings (sketched below):
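A sketch of the chunk-aware flow, assembled from the excerpt quoted in the review above; the ID scheme and metadata key are assumptions:

```python
def add_situations(self, situations_and_advice):
    for i, (situation, recommendation) in enumerate(situations_and_advice):
        situation_embeddings = self.get_embedding(situation)
        # Add each chunk's embedding as its own document, but store the FULL
        # situation text each time, so any chunk match returns the whole memory.
        for chunk_idx, embedding in enumerate(situation_embeddings):
            self.situation_collection.add(
                documents=[situation],
                metadatas=[{"recommendation": recommendation}],
                embeddings=[embedding],
                ids=[f"situation_{i}_chunk_{chunk_idx}"],
            )
```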
**Benefit:** Even if a query matches only one chunk of a long situation, the full situation is returned.
### 5. Enhanced `get_memories()`

Added embedding averaging for multi-chunk queries (sketched below):
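A sketch of the averaging step; the surrounding query code is assumed from ChromaDB's standard API, not quoted from the PR:

```python
def get_memories(self, current_situation, n_matches=1):
    query_embeddings = self.get_embedding(current_situation)
    # Multi-chunk query: average the per-chunk vectors element-wise so the
    # whole query is represented by a single embedding.
    if len(query_embeddings) > 1:
        query_embedding = [
            sum(dim_values) / len(query_embeddings)
            for dim_values in zip(*query_embeddings)
        ]
    else:
        query_embedding = query_embeddings[0]

    return self.situation_collection.query(
        query_embeddings=[query_embedding],
        n_results=n_matches,
        include=["metadatas", "documents", "distances"],
    )
```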
**Benefit:** Long queries are represented by their average embedding, improving semantic search accuracy.
## Backward Compatibility

✅ Fully backward compatible - existing code continues to work without changes. New features are opt-in via optional parameters:
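A sketch, assuming the two-argument constructor in use today (collection name and config):

```python
# Existing call sites keep working: both new parameters default to None,
# preserving the original in-memory behavior.
memory = FinancialSituationMemory("bull_memory", config)
```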
## Testing

Created a comprehensive test suite in `test_memory_chunking.py`. To run the tests:
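The original run command was stripped in extraction; presumably the suite is executed directly:

```bash
python test_memory_chunking.py
```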
## Benefits

### For Users

- Long analysis texts no longer fail to embed; they are chunked automatically
- Memories can persist across process restarts

### For Developers

- Fully backward compatible; all new behavior is opt-in
- Covered by the new `test_memory_chunking.py` suite
## Performance Impact

- Memory usage: unchanged for short texts; long texts hold one embedding per chunk
- Processing time: negligible for short texts; linear increase with chunk count for long texts
- Storage: in-memory by default; on disk under `persistent_dir` when persistence is enabled
## Migration Guide

### For Existing Users

No changes required! Your existing code continues to work.

### For New Features

Add persistent storage or symbol-specific collections via the new constructor parameters (sketched below):
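A sketch of the two opt-ins; parameter names come from this description, while the positional arguments are assumed as above:

```python
# Persistent storage: the collection is stored under ./chroma_db
memory = FinancialSituationMemory("bull_memory", config, persistent_dir="./chroma_db")

# Symbol-specific collection: keeps memories for AAPL separate
memory = FinancialSituationMemory("bull_memory", config, symbol="AAPL")
```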
## Example Usage

Usage before (old API) and after (new API, with long texts) is sketched below.
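A minimal sketch of the end-to-end flow, assuming the method signatures described above; the texts and advice strings are illustrative:

```python
memory = FinancialSituationMemory(
    "bull_memory", config, symbol="AAPL", persistent_dir="./chroma_db"
)

# Old API: only short situations could be embedded.
memory.add_situations([("Short market note.", "Buy the dip.")])

# New API: a long analysis is chunked transparently before embedding.
long_analysis = "Comprehensive market analysis... " * 2000  # far over 24,000 chars
memory.add_situations([(long_analysis, "Hold; reassess after earnings.")])

# Long queries work too: their chunk embeddings are averaged for the search.
matches = memory.get_memories(long_analysis, n_matches=1)
```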
## Files Changed

- `requirements.txt`: added the `langchain` dependency
- `FinancialSituationMemory`: chunking and persistent-storage implementation
- `test_memory_chunking.py`: new test suite

## Code Quality

## Checklist

## Future Enhancements

Potential follow-ups (not in this PR):
Questions & Answers
Q: Why return a list instead of single embedding?
A: Consistency - both short and long texts now return lists. Makes API more predictable and easier to handle.
Q: Why average embeddings for multi-chunk queries?
A: Common approach in semantic search - represents overall meaning while avoiding bias toward any single chunk.
Q: Is persistent_dir required?
A: No - optional parameter. Defaults to in-memory storage for backward compatibility.
Q: Can I migrate existing in-memory data to persistent storage?
A: Not directly - would need to re-add situations with persistent_dir specified.
Q: Performance impact?
A: Negligible for short texts. Linear increase for long texts based on chunk count.
## Acknowledgments

This implementation is based on patterns from:

---

Ready for Review ✅

Please let me know if you have any questions or suggestions!