Tanvir/add websites routes #4498
asyncio.create_task(
    _crawl_website_job(
        job_id=job_id,
        domain=domain,
        config=body,
        db=db,
    )
)
Bug: _crawl_website_job receives a request-scoped db session that will be closed before its database operations are executed.
Severity: CRITICAL | Confidence: 1.00
🔍 Detailed Analysis
The _crawl_website_job function receives a request-scoped db: AsyncSession parameter. Since this function is executed as an asyncio.create_task(), the db session will be closed by the time any database operations within _crawl_website_job are attempted. This will result in a "Session is closed" error when the currently commented-out database logic is implemented, as the session will no longer be active.
💡 Suggested Fix
Modify _crawl_website_job to not accept db: AsyncSession as a parameter. Instead, it should create its own AsyncSession using async_session_maker() within the function's scope for all database interactions.
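A minimal sketch of that fix, assuming async_session_maker is importable from fai.db (the review names only the factory, not its module) and leaving the crawl and persistence logic as placeholders:
from fai.db import async_session_maker  # assumed import path; the review names only the factory


async def _crawl_website_job(job_id: str, domain: str, config) -> None:
    # Own the session for the lifetime of this task instead of borrowing the
    # request-scoped one, which FastAPI closes once the HTTP response is sent.
    async with async_session_maker() as db:
        try:
            # ... crawl `domain` using `config` and persist results via `db` ...
            await db.commit()
        except Exception:
            await db.rollback()
            raise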
🤖 Prompt for AI Agent
Fix this bug. In servers/fai/src/fai/routes/website.py at lines 81-88: The
`_crawl_website_job` function receives a request-scoped `db: AsyncSession` parameter.
Since this function is executed as an `asyncio.create_task()`, the `db` session will be
closed by the time any database operations within `_crawl_website_job` are attempted.
This will result in a "Session is closed" error when the currently commented-out
database logic is implemented, as the session will no longer be active.
        config=body,
        db=db,
    )
)
Bug: Background task _crawl_website_job receives a request-scoped db session that will be closed before use, causing runtime errors.
Severity: CRITICAL | Confidence: 1.00
🔍 Detailed Analysis
The asyncio.create_task call in index_website passes a request-scoped db (AsyncSession) to the background task _crawl_website_job. When the HTTP request completes, FastAPI's dependency system cleans up and closes this session. If _crawl_website_job attempts database operations using this closed session, it will result in a runtime error. This issue is currently latent because database operations within _crawl_website_job are commented out.
💡 Suggested Fix
Modify _crawl_website_job to create its own AsyncSession using async_session_maker() or use job_manager.execute_job() which handles session lifecycle.
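The caller-side shape of that change, sketched against the snippet above: the request-scoped session is simply no longer handed to the task.
asyncio.create_task(
    _crawl_website_job(
        job_id=job_id,
        domain=domain,
        config=body,
        # no db argument: the task opens and closes its own session
    )
)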
🤖 Prompt for AI Agent
Fix this bug. In servers/fai/src/fai/routes/website.py at line 88: The
`asyncio.create_task` call in `index_website` passes a request-scoped `db`
(AsyncSession) to the background task `_crawl_website_job`. When the HTTP request
completes, FastAPI's dependency system cleans up and closes this session. If
`_crawl_website_job` attempts database operations using this closed session, it will
result in a runtime error. This issue is currently latent because database operations
within `_crawl_website_job` are commented out.
job_id = await job_manager.create_job(db)

# TODO: Start the crawling job similar to index_website
# asyncio.create_task(_crawl_website_job(...))
Bug: reindex_website creates a job but fails to launch the background task, leaving the job stuck in PENDING status.
Severity: CRITICAL | Confidence: 1.00
🔍 Detailed Analysis
The reindex_website endpoint creates a job using job_manager.create_job but fails to launch the corresponding background task _crawl_website_job because the asyncio.create_task call is commented out. This leaves the created job stuck indefinitely in PENDING status, as no background worker exists to process it, thus breaking the API contract for job tracking.
💡 Suggested Fix
Uncomment asyncio.create_task(_crawl_website_job(...)) in reindex_website to ensure the job is properly launched and processed.
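A sketch of that launch, mirroring index_website and assuming _crawl_website_job has been reworked to open its own session (a later comment on this PR argues the original crawl configuration should be reused rather than a base_url-only request):
job_id = await job_manager.create_job(db)

# Fire-and-forget, as in index_website, so the job can move past PENDING.
asyncio.create_task(
    _crawl_website_job(
        job_id=job_id,
        domain=domain,
        config=IndexWebsiteRequest(base_url=body.base_url),
    )
)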
🤖 Prompt for AI Agent
Fix this bug. In servers/fai/src/fai/routes/website.py at line 297: The
`reindex_website` endpoint creates a job using `job_manager.create_job` but fails to
launch the corresponding background task `_crawl_website_job` because the
`asyncio.create_task` call is commented out. This leaves the created job stuck
indefinitely in `PENDING` status, as no background worker exists to process it, thus
breaking the API contract for job tracking.
Additional Comments:
servers/fai/alembic/env.py (line 24):
WebsiteDb model is missing from the Alembic environment configuration. While the manual migration file exists and will work, Alembic won't be aware of this model, which can cause issues with future auto-migrations and model tracking.
📝 Patch Details
diff --git a/servers/fai/alembic/env.py b/servers/fai/alembic/env.py
index 9e0cc00b1..84f1a91bc 100644
--- a/servers/fai/alembic/env.py
+++ b/servers/fai/alembic/env.py
@@ -8,6 +8,7 @@ from fai.db import (
Base,
engine,
)
+from fai.models.db.conversation_report_db import ConversationReportDb # noqa: F401
from fai.models.db.discord_integration_db import DiscordIntegrationDb # noqa: F401
from fai.models.db.discord_message_cache_db import DiscordMessageCacheDb # noqa: F401
from fai.models.db.document_db import DocumentDb # noqa: F401
@@ -22,6 +23,7 @@ from fai.models.db.slack_context_db import SlackContextDb # noqa: F401
from fai.models.db.slack_integration_db import SlackIntegrationDb # noqa: F401
from fai.models.db.slack_message_cache_db import SlackMessageCacheDb # noqa: F401
from fai.models.db.slack_message_classification_db import SlackMessageClassificationDb # noqa: F401
+from fai.models.db.website_db import WebsiteDb # noqa: F401
# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
diff --git a/servers/fai/src/utils/init_db.py b/servers/fai/src/utils/init_db.py
index e14391bc6..705647a29 100644
--- a/servers/fai/src/utils/init_db.py
+++ b/servers/fai/src/utils/init_db.py
@@ -18,6 +18,7 @@ from fai.models.db.slack_context_db import SlackContextDb # noqa: F401
from fai.models.db.slack_integration_db import SlackIntegrationDb # noqa: F401
from fai.models.db.slack_message_cache_db import SlackMessageCacheDb # noqa: F401
from fai.models.db.slack_message_classification_db import SlackMessageClassificationDb # noqa: F401
+from fai.models.db.website_db import WebsiteDb # noqa: F401
async def init() -> None:
Analysis
Missing model imports in Alembic environment configuration break auto-migration detection
What fails: Alembic's env.py is missing imports for WebsiteDb and ConversationReportDb models, preventing auto-migration detection from tracking these tables
How to reproduce:
cd servers/fai
# Run alembic revision --autogenerate (would miss WebsiteDb/ConversationReportDb changes)
Result: Alembic's Base.metadata.tables won't include the websites or conversation_reports tables, causing auto-migration detection to either:
- Try to recreate existing tables
- Miss schema changes to these models in future migrations
- Generate incorrect migration diffs
Expected: All active database models should be imported in alembic/env.py to ensure complete metadata registration, matching the pattern used for 13 other models (DiscordIntegrationDb, DocumentDb, etc.)
Root cause: WebsiteDb has manual migration but wasn't added to env.py imports. ConversationReportDb exists only in init_db.py imports but not in Alembic configuration.
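A quick way to see the effect in a REPL (a sketch; the websites table name and import paths are taken from the patch and analysis above):
from fai.db import Base

print("websites" in Base.metadata.tables)       # False: the model was never imported, so Alembic cannot see the table

from fai.models.db.website_db import WebsiteDb  # noqa: F401  importing the model registers it on Base.metadata

print("websites" in Base.metadata.tables)       # True once the import is in place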
    job_id,
    index_source.id,
    domain,
    IndexWebsiteRequest(base_url=body.base_url),
The reindex_website endpoint creates a minimal IndexWebsiteRequest with only the base_url, discarding all original crawl configuration parameters that should be preserved from the existing index source.
📝 Patch Details
diff --git a/servers/fai/src/fai/routes/website.py b/servers/fai/src/fai/routes/website.py
index 5b74d9631..17f4c6e7f 100644
--- a/servers/fai/src/fai/routes/website.py
+++ b/servers/fai/src/fai/routes/website.py
@@ -432,6 +432,12 @@ async def reindex_website(
await db.commit()
# Start the crawling job
+ # Preserve original crawl configuration if index source exists
+ if index_source and hasattr(index_source, 'config') and index_source.config:
+ crawl_config = IndexWebsiteRequest(**index_source.config)
+ else:
+ crawl_config = IndexWebsiteRequest(base_url=body.base_url)
+
asyncio.create_task(
job_manager.execute_job(
job_id,
@@ -439,7 +445,7 @@ async def reindex_website(
job_id,
index_source.id,
domain,
- IndexWebsiteRequest(base_url=body.base_url),
+ crawl_config,
db,
)
)
Analysis
Configuration loss in reindex_website endpoint when preserving crawl parameters
What fails: reindex_website() in servers/fai/src/fai/routes/website.py creates a minimal IndexWebsiteRequest(base_url=body.base_url) at line 442, losing the original crawl configuration stored in index_source.config
How to reproduce:
- Index a website with custom parameters: POST /sources/website/example.com/index with chunk_size=500, delay=2.0, max_pages=100
- Call the reindex endpoint: POST /sources/website/example.com/reindex with the same base_url
- Original config (chunk_size=500, delay=2.0, max_pages=100) gets replaced with defaults (chunk_size=1000, delay=1.0, max_pages=null)
Result: Reindex crawl uses different parameters than original indexing, creating inconsistent chunks and crawl behavior
Expected: Reindex should preserve original crawl configuration from index_source.config for consistent re-crawling with identical parameters as initial indexing
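For illustration, a hedged reconstruction of the request model (field names and defaults are taken from the reproduction steps above; the actual model in servers/fai may differ), showing how rebuilding it from base_url alone silently resets everything else:
from pydantic import BaseModel


class IndexWebsiteRequest(BaseModel):
    base_url: str
    chunk_size: int = 1000      # defaults as cited in the reproduction steps
    delay: float = 1.0
    max_pages: int | None = None


original = IndexWebsiteRequest(base_url="https://example.com", chunk_size=500, delay=2.0, max_pages=100)
reindexed = IndexWebsiteRequest(base_url=original.base_url)
print(reindexed.chunk_size, reindexed.delay, reindexed.max_pages)  # 1000 1.0 None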
Fixes FER-
Short description of the changes made
Adds routes for website datasources (not implemented yet).
How has this PR been tested?
Ran locally and verified that I can hit the endpoints.