Changes from 3 commits
1 change: 1 addition & 0 deletions ai/.python-version
@@ -0,0 +1 @@
3.9.15
12 changes: 12 additions & 0 deletions ai/README.md
@@ -13,6 +13,18 @@ The server is built with FastAPI. To start the server by running `uvicorn main:a
Swagger documentation: /docs
Chat endpoint: /chat

The storage context is loaded from S3, so the `main.py` script needs to know where to find it and how to authenticate.

- Auth:
  IRSA (IAM Roles for Service Accounts) should work; otherwise, set the standard AWS environment variables:
  - `AWS_ACCESS_KEY_ID`
  - `AWS_SECRET_ACCESS_KEY`
- Path:
  The script reads the S3 path from `PLURAL_AI_INDEX_S3_PATH`, in the format `<bucket-name>/<path>`.
  It defaults to `plural-assets/dagster/plural-ai/vector_store_index`.

To be safe, set `AWS_DEFAULT_REGION` to the region of the bucket.
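
As a rough illustration of the configuration described above, the sketch below resolves the index location from the environment and reports which static credential variables are unset. The helper names `resolve_index_path` and `missing_static_credentials` are hypothetical and not part of `main.py`; the split into bucket and prefix is only an assumption about how one might validate the `<bucket-name>/<path>` format before starting the server.

```python
import os
from typing import List, Mapping, Optional, Tuple

# Default value as stated in this README.
DEFAULT_INDEX_PATH = "plural-assets/dagster/plural-ai/vector_store_index"

# Standard AWS variables used when IRSA is not available.
STATIC_CREDENTIAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def resolve_index_path(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Hypothetical helper: split PLURAL_AI_INDEX_S3_PATH into (bucket, prefix)."""
    env = os.environ if env is None else env
    raw = env.get("PLURAL_AI_INDEX_S3_PATH", DEFAULT_INDEX_PATH).strip("/")
    bucket, sep, prefix = raw.partition("/")
    if not bucket or not sep or not prefix:
        raise ValueError(f"expected '<bucket-name>/<path>', got {raw!r}")
    return bucket, prefix


def missing_static_credentials(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Hypothetical helper: list the static credential variables that are unset.

    An empty result means both variables are present. A non-empty result is
    not necessarily an error, since IRSA may supply credentials instead.
    """
    env = os.environ if env is None else env
    return [name for name in STATIC_CREDENTIAL_VARS if not env.get(name)]


if __name__ == "__main__":
    bucket, prefix = resolve_index_path()
    print(f"bucket={bucket} prefix={prefix}")
    print("unset static credentials:", missing_static_credentials())
```

Running a check like this before `uvicorn` starts can surface a malformed path early, rather than as a failure while the index is being loaded.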

## Running scraper.py

The scraper currently incorporates three datasources:
7 changes: 6 additions & 1 deletion ai/main.py
@@ -4,6 +4,7 @@
from llama_index import StorageContext, load_index_from_storage, ServiceContext, set_global_service_context
from llama_index.indices.postprocessor import SentenceEmbeddingOptimizer
from llama_index.embeddings import OpenAIEmbedding
from s3fs import S3FileSystem

from pydantic import BaseModel

@@ -22,7 +23,11 @@ class QueryResponse(BaseModel):
service_context = ServiceContext.from_defaults(embed_model=embed_model)
set_global_service_context(service_context)

-storage_context = StorageContext.from_defaults(persist_dir="./storage")
+storage_context = StorageContext.from_defaults(
+    # persist_dir format: "<bucket-name>/<path>"
+    persist_dir=os.getenv("PLURAL_AI_INDEX_S3_PATH", "plural-assets/dagster/plural-ai/vector_store_index"),
+    fs=S3FileSystem()
+)
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(
node_postprocessors=[SentenceEmbeddingOptimizer(percentile_cutoff=0.5)],
3 changes: 2 additions & 1 deletion ai/requirements.txt
@@ -56,4 +56,5 @@ yarl==1.9.2
python-graphql-client
nltk
config
-html2text
+html2text
+s3fs
2 changes: 1 addition & 1 deletion ai/scraper.py
@@ -117,4 +117,4 @@ def scrape_discord():
index = VectorStoreIndex.from_documents(list(chain))
index.storage_context.persist()

-print("persisted new vector index")
+print("persisted new vector index")