# Simple Ingestion Workflows

This module provides Temporal workflows for ingesting documents into the vector database (Qdrant) using the TC Hivemind backend.

## Available Workflows

### 1. VectorIngestionWorkflow

A workflow for processing single document ingestion requests.

**Usage:**
```python
from hivemind_etl.simple_ingestion.schema import IngestionRequest
from temporalio.client import Client

# Create a single ingestion request
request = IngestionRequest(
    communityId="my_community",
    platformId="my_platform",
    text="Document content here...",
    metadata={
        "title": "Document Title",
        "author": "Author Name"
    }
)

# Execute the workflow and wait for it to complete
client = await Client.connect("localhost:7233")
await client.execute_workflow(
    "VectorIngestionWorkflow",
    request,
    id="single-ingestion-123",
    task_queue="hivemind-etl"
)
```
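
If the caller should not block until ingestion finishes, the same request can be started fire-and-forget with `start_workflow`, which returns a handle that can be awaited later:

```python
# Start without blocking; the handle can be awaited later if needed
handle = await client.start_workflow(
    "VectorIngestionWorkflow",
    request,
    id="single-ingestion-123",
    task_queue="hivemind-etl",
)
await handle.result()  # optional: wait for completion
```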

### 2. BatchVectorIngestionWorkflow

A workflow for processing multiple document ingestion requests in parallel batches for improved efficiency.

**Key Features:**
- **Automatic Chunking**: Large batches are automatically split into smaller parallel chunks
- **Parallel Processing**: Multiple `process_documents_batch` activities run simultaneously
- **Configurable Batch Size**: Control the size of each processing chunk (default: 10 documents)
- **Same Collection**: All documents in a batch request must belong to the same community and collection
- **Error Handling**: Same retry policy as the single-document workflow, with a longer timeout for batch processing

**Usage:**
```python
from hivemind_etl.simple_ingestion.schema import BatchIngestionRequest, BatchDocument
from temporalio.client import Client

# Create a batch ingestion request
batch_request = BatchIngestionRequest(
    communityId="my_community",
    platformId="my_platform",
    collectionName="optional_custom_collection",  # Optional
    document=[
        BatchDocument(
            docId="doc_1",
            text="First document content...",
            metadata={"title": "Document 1"},
            excludedEmbedMetadataKeys=["some_key"],
            excludedLlmMetadataKeys=["other_key"]
        ),
        BatchDocument(
            docId="doc_2",
            text="Second document content...",
            metadata={"title": "Document 2"}
        ),
        # ... more documents
    ]
)

# Execute the batch workflow. Multiple workflow arguments (here the
# optional batch_size, default 10) must be passed via `args` in the
# Temporal Python SDK.
client = await Client.connect("localhost:7233")
await client.execute_workflow(
    "BatchVectorIngestionWorkflow",
    args=[batch_request, 10],
    id="batch-ingestion-123",
    task_queue="hivemind-etl"
)
```

## Schema Reference

### IngestionRequest (Single Document)

```python
class IngestionRequest(BaseModel):
    communityId: str                           # Community identifier
    platformId: str                            # Platform identifier
    text: str                                  # Document text content
    metadata: dict                             # Document metadata
    docId: str = str(uuid4())                  # Unique document ID (auto-generated)
    excludedEmbedMetadataKeys: list[str] = []  # Keys to exclude from embedding
    excludedLlmMetadataKeys: list[str] = []    # Keys to exclude from LLM processing
    collectionName: str | None = None          # Optional custom collection name
```
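
One caveat with the schema as written: in Pydantic, a plain `str(uuid4())` default is evaluated once when the class is defined, so every request that omits `docId` would share the same ID. A `default_factory` variant generates a fresh ID per instance (a sketch, not necessarily the shipped code):

```python
from uuid import uuid4

from pydantic import BaseModel, Field

class IngestionRequest(BaseModel):
    communityId: str
    platformId: str
    text: str
    metadata: dict
    # default_factory is called per instance, so each request
    # gets its own generated document ID
    docId: str = Field(default_factory=lambda: str(uuid4()))
```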

### BatchIngestionRequest (Multiple Documents)

```python
class BatchIngestionRequest(BaseModel):
    communityId: str                   # Community identifier
    platformId: str                    # Platform identifier
    collectionName: str | None = None  # Optional custom collection name
    document: list[BatchDocument]      # List of documents to process

class BatchDocument(BaseModel):
    docId: str                                 # Unique document ID
    text: str                                  # Document text content
    metadata: dict                             # Document metadata
    excludedEmbedMetadataKeys: list[str] = []  # Keys to exclude from embedding
    excludedLlmMetadataKeys: list[str] = []    # Keys to exclude from LLM processing
```
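
Because `document` is a plain list field, batches can be assembled programmatically. A small sketch using the schema above (the raw `texts` input is illustrative):

```python
texts = ["First document content...", "Second document content..."]

batch = BatchIngestionRequest(
    communityId="my_community",
    platformId="my_platform",
    document=[
        BatchDocument(docId=f"doc_{i}", text=text, metadata={"index": i})
        for i, text in enumerate(texts)
    ],
)
```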

## Collection Naming

- **Default**: `{communityId}_{platformId}`
- **Custom**: `{communityId}_{collectionName}` (when `collectionName` is provided)

The collection name is constructed automatically by the `CustomIngestionPipeline`.
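
The rule is simple enough to express directly. A sketch of it (the helper name is illustrative; the real logic lives inside `CustomIngestionPipeline`):

```python
def resolve_collection_name(
    community_id: str, platform_id: str, collection_name: str | None = None
) -> str:
    # A custom collection name wins; otherwise fall back to the platform ID
    suffix = collection_name if collection_name is not None else platform_id
    return f"{community_id}_{suffix}"

assert resolve_collection_name("my_community", "my_platform") == "my_community_my_platform"
assert resolve_collection_name("my_community", "my_platform", "docs") == "my_community_docs"
```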

## Performance Considerations

### When to Use Batch vs Single Workflows

**Use BatchVectorIngestionWorkflow when:**
- Processing multiple documents from the same community/collection
- Bulk importing large datasets
- You have 10+ documents to process together
- You want to maximize throughput with parallel processing

**Use VectorIngestionWorkflow when:**
- Processing single documents in real-time
- Documents arrive individually
- You need immediate processing
- Simple use cases with occasional documents

### Batch Processing Optimizations

The batch workflow automatically optimizes performance by:

1. **Parallel Chunking**: Large batches are split into smaller chunks that process simultaneously
2. **Configurable Batch Size**: Tune chunk size based on your system resources (default: 10)
3. **Pipeline Reuse**: One `CustomIngestionPipeline` instance per chunk
4. **Bulk Operations**: All documents in a chunk are processed together
5. **Concurrent Execution**: Multiple chunks run in parallel via `asyncio.gather()` (see the sketch after this list)
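
A minimal sketch of that chunk-and-gather pattern, as seen from inside the workflow (the activity name and timeout follow the sections above; the shipped workflow code may differ):

```python
import asyncio
from datetime import timedelta

from temporalio import workflow

async def run_in_chunks(documents: list, batch_size: int = 10) -> None:
    # Split the batch into chunks of `batch_size` documents
    chunks = [
        documents[i : i + batch_size]
        for i in range(0, len(documents), batch_size)
    ]
    # Launch one process_documents_batch activity per chunk and
    # wait for all of them concurrently
    await asyncio.gather(*(
        workflow.execute_activity(
            "process_documents_batch",
            chunk,
            start_to_close_timeout=timedelta(minutes=10),
        )
        for chunk in chunks
    ))
```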

## Error Handling

Both workflows implement the same retry policy:
- **Initial retry interval**: 1 second
- **Maximum retry interval**: 1 minute
- **Maximum attempts**: 3
- **Timeout**: 5 minutes (single), 10 minutes (batch)
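
Expressed with the Temporal Python SDK types, those settings look roughly like this (a sketch; the constant names are illustrative):

```python
from datetime import timedelta

from temporalio.common import RetryPolicy

RETRY_POLICY = RetryPolicy(
    initial_interval=timedelta(seconds=1),  # first retry after 1 second
    maximum_interval=timedelta(minutes=1),  # back-off capped at 1 minute
    maximum_attempts=3,                     # give up after the third attempt
)

SINGLE_TIMEOUT = timedelta(minutes=5)   # single-document workflow
BATCH_TIMEOUT = timedelta(minutes=10)   # batch workflow
```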

## Testing

Use the provided test script to verify functionality:

```bash
python test_batch_workflow.py
```

The test script demonstrates:
- Batch processing with multiple documents
- Mixed collection handling
- Comparison between single and batch workflows

## Integration

Both workflows are automatically registered in the Temporal worker through `registry.py`. Ensure your worker includes:

```python
from temporalio.worker import Worker

from registry import WORKFLOWS, ACTIVITIES

# Worker setup includes both workflows and activities
worker = Worker(
    client=client,  # a connected temporalio Client
    task_queue="hivemind-etl",
    workflows=WORKFLOWS,
    activities=ACTIVITIES
)
```
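
The worker then needs an event loop to poll the task queue. A minimal runnable entry point (a sketch, assuming the same server address as the examples above):

```python
import asyncio

from temporalio.client import Client
from temporalio.worker import Worker

from registry import WORKFLOWS, ACTIVITIES

async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client=client,
        task_queue="hivemind-etl",
        workflows=WORKFLOWS,
        activities=ACTIVITIES,
    )
    # Poll the task queue until the process is stopped
    await worker.run()

if __name__ == "__main__":
    asyncio.run(main())
```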