Index and query prs #40

Open
nick1udwig wants to merge 10 commits into similigh:main from nick1udwig:index-and-query-prs

Conversation

@nick1udwig nick1udwig commented Feb 12, 2026

Description

Index and query PRs, not just issues

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 📚 Documentation update
  • 🔧 Configuration/build change
  • ♻️ Refactoring (no functional changes)
  • 🧪 Test update

Related Issues

Fixes #38

Changes Made

  • adds PR indexing
  • adds PR querying (against PRs and issues)

Testing

  • I have run go build ./... successfully
  • I have run go test ./... successfully
  • I have run go vet ./... successfully
  • I have tested the changes locally

(I've tested in nick1udwig#1 with OpenAI models)

Screenshots (if applicable)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

Additional Notes

Summary by CodeRabbit

  • New Features

    • OpenAI added as an alternate AI provider
    • Pull request indexing with a dedicated PR collection
    • New PR-duplicate detection command
  • Improvements

    • API key handling clarified: at least one provider key required and provider selection behavior documented
    • Embedding configuration enhanced with model and dimension support
    • Indexing and CLI extended to include PRs and PR-specific options
  • Documentation

    • Updated setup guides, README, and workflow examples for dual-provider and PR workflows

@nick1udwig nick1udwig requested a review from Kavirubc as a code owner February 12, 2026 23:15
coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

Adds multi-provider AI support (Gemini/OpenAI), PR indexing and duplicate-detection features, config and CLI extensions for PRs, Qdrant PR collection support, GitHub PR helpers, provider-agnostic embedder/LLM refactors, and documentation/workflow updates.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Config & Env: .env.sample, .gitignore, .simili.yaml, internal/core/config/config.go | Added QDRANT_PR_COLLECTION and pr_collection wiring, clarified GEMINI/OPENAI key comments, tightened .gitignore path, updated default embedding model/dimensions, and added PRCollection with derived-default logic. |
| Docs & Examples: DOCS/..., README.md, cmd/simili-web/README.md, DOCS/examples/... | Documented dual-provider AI keys (GEMINI_API_KEY/OPENAI_API_KEY), made Gemini optional, added OPENAI_API_KEY to example workflows/envs, and extended README with PR indexing/pr-duplicate CLI usage and setup steps. |
| Embedder & LLM (multi-provider): internal/integrations/gemini/provider.go, internal/integrations/gemini/embedder.go, internal/integrations/gemini/llm.go, internal/integrations/gemini/prompts.go | Introduced Provider resolution (Gemini/OpenAI), refactored Embedder and LLMClient for multi-provider support (provider-specific clients, dynamic model/dimensions), added PR-duplicate prompt builder (duplicated implementation present). Review provider resolution, OpenAI HTTP paths, and duplicated prompt. |
| GitHub PR integration: internal/integrations/github/client.go, cmd/simili/commands/pr_support.go | Added GitHub methods to get/list PRs and PR files, plus helpers to resolve PR collection names, list PR file paths, build PR metadata text, and extract linked issue refs (regex). Check pagination and regex edge cases. |
| CLI: indexing & processing: cmd/simili/commands/index.go, cmd/simili/commands/process.go, cmd/simili/commands/batch.go, cmd/simili/commands/learn.go | Added --include-prs / --pr-collection, PR processing path and Job.IsPullRequest, removed hard GEMINI env fallback in favor of cfg-driven API key/model, and threaded dynamic embedding dimensions into collection creation. Verify collection creation and PR routing. |
| New PR command & tests: cmd/simili/commands/pr_duplicate.go, cmd/simili/commands/pr_duplicate_test.go | New pr-duplicate command implementing candidate search + optional LLM analysis and unit tests for linked-issue extraction and candidate building. Review error paths and test coverage. |
| Vector DB prep & usage: internal/steps/vectordb_prep.go | VectorDBPrep now holds embedder reference and can override collection dimension using embedder-reported dimensions. |
| Web app wiring: cmd/simili-web/main.go | Switched to config-driven embedder/LLM initialization (uses cfg.Embedding.APIKey/model) and adjusted init logging/messages. |
| Misc (workflows & examples): DOCS/examples/.../workflow.yml, DOCS/examples/.../simili.yaml, DOCS/examples/multi-repo/* | Updated example workflows to include OPENAI_API_KEY in secrets/env and simili config examples to show model/dimensions and new defaults. |

Sequence Diagram(s)

sequenceDiagram
    actor User as User
    participant CLI as CLI (pr-duplicate)
    participant GH as GitHub API
    participant Embed as Embedder (Gemini/OpenAI)
    participant Qdrant as Qdrant
    participant LLM as LLM Client (Gemini/OpenAI)
    participant Out as Output

    User->>CLI: run pr-duplicate --repo org/repo --number N
    CLI->>GH: GetPullRequest(org,repo,N) / ListPullRequestFiles
    GH-->>CLI: PR metadata + file list
    CLI->>Embed: build embedding vector (PR text)
    Embed-->>CLI: embedding vector
    CLI->>Qdrant: search issues collection (and PR collection if present)
    Qdrant-->>CLI: candidate list
    CLI->>LLM: DetectPRDuplicate(PR + candidates)
    LLM-->>CLI: structured duplicate result
    CLI->>Out: render text or JSON
    Out-->>User: display findings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

Suggested labels

enhancement

Poem

🐰 I hopped through code with keys in paw,

Gemini and OpenAI—both I saw,
PRs now indexed, duplicates near,
Config and docs made bright and clear,
A happy rabbit cheers—deploy with a cheer! 🥕

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Index and query prs' clearly summarizes the primary change: adding PR indexing and querying capabilities alongside the existing issue support. |
| Linked Issues check | ✅ Passed | The PR implements all core requirements from issue #38: adds PR indexing/querying, extends duplicate detection and cross-repo search to PRs, and enables intelligent PR routing via the new pr-duplicate command. |
| Out of Scope Changes check | ✅ Passed | All changes directly support PR support: PR indexing logic, PR-specific configuration (pr_collection), dual AI provider support (enabling PR features), GitHub client PR methods, and helper utilities, all tied to the #38 objectives. |
| Merge Conflict Detection | ✅ Passed | ✅ No merge conflicts detected when merging into main |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/simili-web/main.go (1)

112-147: ⚠️ Potential issue | 🟠 Major

Add env fallback for embedding/LLM API keys when no config file is present.

With no config file, cfg.Embedding.APIKey stays empty and the web UI will fail even if GEMINI_API_KEY or OPENAI_API_KEY is set in the environment.

🔧 Suggested fix (env fallback)
 import (
 	"context"
 	"embed"
 	"encoding/json"
 	"fmt"
 	"io/fs"
 	"log"
 	"net/http"
 	"os"
+	"strings"
@@
 func initDependencies(cfg *config.Config) (*pipeline.Dependencies, error) {
 	deps := &pipeline.Dependencies{
 		DryRun: true, // Always dry-run for web UI
 	}
 
-	// Embedder (Gemini/OpenAI auto-selected by available keys)
-	embedder, err := gemini.NewEmbedder(cfg.Embedding.APIKey, cfg.Embedding.Model)
+	apiKey := strings.TrimSpace(cfg.Embedding.APIKey)
+	if apiKey == "" {
+		if k := strings.TrimSpace(os.Getenv("GEMINI_API_KEY")); k != "" {
+			apiKey = k
+		} else if k := strings.TrimSpace(os.Getenv("OPENAI_API_KEY")); k != "" {
+			apiKey = k
+		}
+	}
+
+	// Embedder (Gemini/OpenAI auto-selected by available keys)
+	embedder, err := gemini.NewEmbedder(apiKey, cfg.Embedding.Model)
 	if err != nil {
 		return nil, fmt.Errorf("failed to init embedder: %w", err)
 	}
 	deps.Embedder = embedder
@@
-	// LLM Client
-	llm, err := gemini.NewLLMClient(cfg.Embedding.APIKey)
+	// LLM Client
+	llm, err := gemini.NewLLMClient(apiKey)
 	if err != nil {
 		return nil, fmt.Errorf("failed to init LLM: %w", err)
 	}
 	deps.LLMClient = llm
🤖 Fix all issues with AI agents
In @.env.sample:
- Around line 7-12: Reorder the variables in .env.sample so QDRANT_COLLECTION
appears before QDRANT_URL to satisfy dotenv-linter key order (move the line
"QDRANT_COLLECTION=your-issues-collection" above
"QDRANT_URL=https://your-cluster.qdrant.io:6333"), or alternatively update the
dotenv-linter configuration to allow the current ordering if you prefer to keep
the existing layout; ensure the change preserves comments and optional
QDRANT_PR_COLLECTION lines.

In @.simili.yaml:
- Around line 7-10: The dimensions value in .simili.yaml (model:
gemini-embedding-001) is set to 1536 which conflicts with the codebase default
Embedding.Dimensions (768) and Gemini’s API default (3072) unless
output_dimensionality is set; either change dimensions to 768 to match the
codebase default or ensure the Gemini client initialization explicitly sets
output_dimensionality: 1536 so the API returns 1536-d embeddings (check places
that reference Embedding.Dimensions and the Gemini call in
internal/steps/vectordb_prep.go and the config default in
internal/core/config/config.go to keep the config and actual embedder output in
sync).

In `@cmd/simili/commands/batch.go`:
- Around line 333-339: The current code hard-fails when gemini.NewLLMClient
returns an error; change it to allow LLM initialization to fail gracefully by
only assigning deps.LLMClient and printing the verbose success message when err
== nil, and otherwise (when err != nil) do not return the error — optionally
emit a verbose warning about the LLM init failure; specifically, update the
gemini.NewLLMClient call handling so that deps.LLMClient remains nil on error
(the Dependencies.Close() already tolerates nil) and downstream LLM-dependent
steps remain skipped for non-LLM presets.

In `@internal/integrations/gemini/embedder.go`:
- Around line 116-147: embedOpenAI mutates e.dimensions unsafely while
Dimensions() reads it concurrently; add synchronization to avoid data races by
protecting dimensions with a mutex or atomic: add a sync.RWMutex (e.g., dimsMu)
to the Embedder struct, set e.dimensions under dimsMu.Lock() in embedOpenAI (or
initialize once if immutable) and read it under dimsMu.RLock() in Dimensions();
ensure all reads/writes of e.dimensions use the same lock to eliminate the race.

In `@internal/integrations/github/client.go`:
- Around line 157-168: ListPullRequestFiles currently returns only the first
page; change it to paginate internally like ListIssueEvents so callers don’t
miss files. Initialize opts if nil (PerPage:100), loop calling
c.client.PullRequests.ListFiles(ctx, org, repo, number, opts), append returned
[]*github.CommitFile to a single slice, update opts.Page using resp.NextPage (or
increment) until there are no more pages, and return the aggregated files and
the last *github.Response; preserve the existing error wrapping when a call
fails. Reference: ListPullRequestFiles, ListIssueEvents, and
listAllPullRequestFilePaths.
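
The pagination loop requested for ListPullRequestFiles can be illustrated without the GitHub client. The sketch below is only an assumption-laden stand-in: `page`, `fetchFunc`, `listAllFiles`, and `fakeFetch` are hypothetical names, and the `NextPage` field merely mirrors the convention (0 means no further pages) that the prompt refers to as resp.NextPage.

```go
package main

import "fmt"

// page mimics one paginated response: the items on this page plus the number
// of the next page, where 0 means there are no more pages.
type page struct {
	Files    []string
	NextPage int
}

// fetchFunc is a hypothetical single-page fetcher (one API call per invocation).
type fetchFunc func(pageNum, perPage int) (page, error)

// listAllFiles keeps requesting pages until NextPage is 0 and returns the
// aggregated result, wrapping any error from an individual call.
func listAllFiles(fetch fetchFunc, perPage int) ([]string, error) {
	var all []string
	for p := 1; p != 0; {
		res, err := fetch(p, perPage)
		if err != nil {
			return nil, fmt.Errorf("failed to list PR files (page %d): %w", p, err)
		}
		all = append(all, res.Files...)
		p = res.NextPage
	}
	return all, nil
}

// fakeFetch serves an in-memory slice page by page so the loop can be run offline.
func fakeFetch(files []string) fetchFunc {
	return func(pageNum, perPage int) (page, error) {
		start := (pageNum - 1) * perPage
		if start >= len(files) {
			return page{}, nil
		}
		end := start + perPage
		next := pageNum + 1
		if end >= len(files) {
			end, next = len(files), 0
		}
		return page{Files: files[start:end], NextPage: next}, nil
	}
}

func main() {
	files := []string{"a.go", "b.go", "c.go", "d.go", "e.go"}
	all, err := listAllFiles(fakeFetch(files), 2)
	fmt.Println(len(all), err) // prints: 5 <nil>
}
```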
🧹 Nitpick comments (7)
cmd/simili/commands/pr_support.go (2)

17-17: Regex may miss some GitHub linking keywords.

The pattern handles close/closed/closes, fix/fixed/fixes/fixe, and resolve/resolved/resolves. Note that fixe is matched but this is uncommon. GitHub also supports additional keywords like closing which this pattern won't catch, though the current set covers the most common cases.

Also applies to: 103-128
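
For reference, a keyword matcher of this kind might look like the sketch below. It is not the pattern from pr_support.go: `linkedIssueRe` and `extractLinkedIssues` are hypothetical names, and the keyword set is an assumption limited to close/closes/closed, fix/fixes/fixed, and resolve/resolves/resolved, so it deliberately does not match `fixe` or `closing`.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// linkedIssueRe matches "<keyword> #<number>" case-insensitively, with an
// optional colon after the keyword. Illustrative only.
var linkedIssueRe = regexp.MustCompile(`(?i)\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?):?\s+#(\d+)\b`)

// extractLinkedIssues returns the distinct issue numbers referenced with a
// closing keyword, sorted in ascending order.
func extractLinkedIssues(body string) []int {
	seen := map[int]bool{}
	var out []int
	for _, m := range linkedIssueRe.FindAllStringSubmatch(body, -1) {
		n, err := strconv.Atoi(m[1])
		if err != nil || seen[n] {
			continue
		}
		seen[n] = true
		out = append(out, n)
	}
	sort.Ints(out)
	return out
}

func main() {
	body := "Fixes #38, also closes #12. See prefix #5 and closing #7."
	fmt.Println(extractLinkedIssues(body)) // prints: [12 38]
}
```

The leading `\b` keeps words such as "prefix" from matching on their "fix" suffix; adding further keywords is a matter of extending the alternation.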


76-78: Chained getters are nil-safe in go-github v60—defensive nil checks are optional.

In go-github v60, the GetUser(), GetBase(), and GetHead() accessor methods are designed to be nil-safe on their receivers and in chains. For example, if pr.GetUser() returns nil, calling GetLogin() on that nil pointer still returns an empty string rather than panicking. This is by design: the generated getters in github-accessors.go handle nil receivers gracefully.

The code as written is safe and does not require defensive nil checks. If you prefer explicit nil checks for code clarity, the suggested refactor is valid but not necessary.
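
The nil-receiver pattern described above can be reproduced in a few lines. The `User` and `PullRequest` types here are simplified stand-ins written for this sketch, not go-github's own structs; they only imitate the shape of its generated accessors.

```go
package main

import "fmt"

// User and PullRequest are minimal stand-ins with pointer fields, similar in
// shape to generated API structs.
type User struct{ Login *string }

type PullRequest struct{ User *User }

// GetLogin returns the login, or "" when the receiver or the field is nil.
func (u *User) GetLogin() string {
	if u == nil || u.Login == nil {
		return ""
	}
	return *u.Login
}

// GetUser returns the user, or nil when the receiver is nil.
func (p *PullRequest) GetUser() *User {
	if p == nil {
		return nil
	}
	return p.User
}

func main() {
	var pr *PullRequest
	// Calling methods through a nil pointer is fine as long as each method
	// checks its receiver before dereferencing it.
	fmt.Printf("%q\n", pr.GetUser().GetLogin()) // prints ""

	login := "octocat"
	pr = &PullRequest{User: &User{Login: &login}}
	fmt.Printf("%q\n", pr.GetUser().GetLogin()) // prints "octocat"
}
```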

DOCS/examples/multi-repo/shared-workflow.yml (1)

11-19: Consider documenting the "at least one required" constraint.

Both GEMINI_API_KEY and OPENAI_API_KEY are marked as required: false, but the application requires at least one. While runtime validation handles this, adding a comment in the workflow clarifying this constraint would improve maintainability.

📝 Suggested comment addition
     secrets:
+      # At least one of GEMINI_API_KEY or OPENAI_API_KEY must be provided
       GEMINI_API_KEY:
         required: false
       OPENAI_API_KEY:
         required: false
DOCS/single-repo-setup.md (1)

30-48: Mention PR workflows in Step 3/CLI backfill.
Since PR support was added, call out PR-triggered workflows and the PR indexing flags so users discover the feature.

✍️ Suggested doc tweak
-Create a GitHub Actions workflow file (e.g., `.github/workflows/simili.yml`) to trigger the bot on issue events.
+Create a GitHub Actions workflow file (e.g., `.github/workflows/simili.yml`) to trigger the bot on issue events
+(and `pull_request` events if you want PR workflows).

 ...
-2.  **Index Issues**:
+2.  **Index Issues**:
     ```bash
     gh simili index --repo owner/repo --config .github/simili.yaml
     ```
+   **Optional: Index PRs too**:
+   ```bash
+   gh simili index --repo owner/repo --config .github/simili.yaml --include-prs --pr-collection <name>
+   ```
cmd/simili/commands/pr_duplicate.go (1)

212-260: Align Qdrant env overrides with other commands.
Other commands honor QDRANT_URL/QDRANT_API_KEY. Adding the same fallback here improves CLI consistency.

🔧 Suggested adjustment
-	qdrantClient, err := qdrant.NewClient(cfg.Qdrant.URL, cfg.Qdrant.APIKey)
+	qURL := cfg.Qdrant.URL
+	if val := os.Getenv("QDRANT_URL"); val != "" && (qURL == "" || qURL == "localhost:6334") {
+		qURL = val
+	}
+	if qURL == "" {
+		qURL = "localhost:6334"
+	}
+
+	qKey := cfg.Qdrant.APIKey
+	if val := os.Getenv("QDRANT_API_KEY"); val != "" && qKey == "" {
+		qKey = val
+	}
+
+	qdrantClient, err := qdrant.NewClient(qURL, qKey)
internal/integrations/gemini/llm.go (1)

168-174: Close OpenAI idle connections in Close().

Right now only the Gemini client is closed. For long‑lived processes using OpenAI, consider closing idle HTTP connections to avoid resource leaks.

♻️ Proposed fix
 func (l *LLMClient) Close() error {
-	if l.gemini != nil {
-		return l.gemini.Close()
-	}
-	return nil
+	var err error
+	if l.gemini != nil {
+		err = l.gemini.Close()
+	}
+	if l.openAI != nil {
+		l.openAI.CloseIdleConnections()
+	}
+	return err
 }
internal/integrations/gemini/embedder.go (1)

68-74: Close OpenAI idle connections in Close().

Right now only the Gemini client is closed. For long‑lived processes using OpenAI, consider closing idle HTTP connections to avoid resource leaks.

♻️ Proposed fix
 func (e *Embedder) Close() error {
-	if e.gemini != nil {
-		return e.gemini.Close()
-	}
-	return nil
+	var err error
+	if e.gemini != nil {
+		err = e.gemini.Close()
+	}
+	if e.openAI != nil {
+		e.openAI.CloseIdleConnections()
+	}
+	return err
 }

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
DOCS/single-repo-setup.md (1)

36-48: ⚠️ Potential issue | 🟡 Minor

Update the CLI index command documentation to mention PR indexing capability.

The CLI index command supports PR indexing via the --include-prs flag and a dedicated --pr-collection parameter, but the documentation only shows basic issue indexing. Add an example or mention of these flags so users are aware they can also index pull requests.

Kavirubc (Member) commented Feb 13, 2026

Hey @nick1udwig — really appreciate this contribution! The work here is solid and both features are things we want in 0.2.0.

However, this PR bundles two independent features that we'd like to track and merge separately:

  1. OpenAI provider support → tracked in #42 ([0.2.0v][Feature] OpenAI provider support for embeddings and LLM)
  2. PR indexing + pr-duplicate CLI → tracked in #43 ([0.2.0v][Feature] PR indexing with dedicated collection and pr-duplicate CLI)

Could you split this into two PRs targeting those issues? It'll make each easier to review, test independently, and revert if needed.

If you'd like to be assigned to either issue, please drop a comment on #42 or #43 and we'll get you added.


Heads-up: codebase has changed since this PR was opened

Two PRs have landed on main since you branched — you'll want to rebase and reconcile the following:

PR #39 — feat: support pull request events in triage pipeline

  • pull_request and pr_comment event types added to the runtime pipeline
  • enrichIssueFromGitHubEvent + populateIssuePayload extracted in cmd/simili/commands/process.go
  • Pipeline guards added in transfer_check, llm_router, action_executor, pending_action_scheduler to skip transfer/routing for PR events
  • Workflow examples updated: pull_request trigger added, closed/deleted removed from issues trigger, pull-requests: write permission added
  • "Similar Issues" renamed to "Similar Threads" in response_builder.go
  • action.yml renamed to "Issue & PR Intelligence"

PR #41 — fix: PR triage review feedback

  • synchronize removed from pull_request workflow trigger (fires only on opened, edited, reopened)
  • Similar Threads table column header corrected from Issue to Thread
  • command_handler.go comment reworded for accuracy

The main files likely to conflict with your branch are DOCS/examples/*/workflow.yml, internal/steps/response_builder.go, and action.yml. Let us know if you need any help with the rebase.

Thanks again for the effort on this — looking forward to getting both features in!

Development

Successfully merging this pull request may close these issues.

Add Pull Request support

2 participants