Index and query prs by nick1udwig · Pull Request #40 · similigh/simili-bot

nick1udwig · 2026-02-12T23:15:38Z

Description

Index and query PRs, not just issues

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to change)
📚 Documentation update
🔧 Configuration/build change
♻️ Refactoring (no functional changes)
🧪 Test update

Related Issues

Fixes #38

Changes Made

adds PR indexing
adds PR querying (against PRs and issues)

Testing

I have run go build ./... successfully
I have run go test ./... successfully
I have run go vet ./... successfully
I have tested the changes locally

(I've tested in nick1udwig#1 with OpenAI models)

Screenshots (if applicable)

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
New and existing unit tests pass locally with my changes

Additional Notes

Summary by CodeRabbit

New Features
- OpenAI added as an alternate AI provider
- Pull request indexing with a dedicated PR collection
- New PR-duplicate detection command
Improvements
- API key handling clarified: at least one provider key required and provider selection behavior documented
- Embedding configuration enhanced with model and dimension support
- Indexing and CLI extended to include PRs and PR-specific options
Documentation
- Updated setup guides, README, and workflow examples for dual-provider and PR workflows

coderabbitai · 2026-02-12T23:15:55Z

📝 Walkthrough

Walkthrough

Adds multi-provider AI support (Gemini/OpenAI), PR indexing and duplicate-detection features, config and CLI extensions for PRs, Qdrant PR collection support, GitHub PR helpers, provider-agnostic embedder/LLM refactors, and documentation/workflow updates.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Config & Env: `.env.sample`, `.gitignore`, `.simili.yaml`, `internal/core/config/config.go` | Added `QDRANT_PR_COLLECTION` and `pr_collection` wiring, clarified GEMINI/OPENAI key comments, tightened `.gitignore` path, updated default embedding model/dimensions, and added `PRCollection` with derived-default logic. |
| Docs & Examples: `DOCS/...`, `README.md`, `cmd/simili-web/README.md`, `DOCS/examples/...` | Documented dual-provider AI keys (GEMINI_API_KEY/OPENAI_API_KEY), made Gemini optional, added OPENAI_API_KEY to example workflows/envs, and extended README with PR indexing/pr-duplicate CLI usage and setup steps. |
| Embedder & LLM (multi-provider): `internal/integrations/gemini/provider.go`, `internal/integrations/gemini/embedder.go`, `internal/integrations/gemini/llm.go`, `internal/integrations/gemini/prompts.go` | Introduced `Provider` resolution (Gemini/OpenAI), refactored Embedder and LLMClient for multi-provider support (provider-specific clients, dynamic model/dimensions), added PR-duplicate prompt builder (duplicated implementation present). Review provider resolution, OpenAI HTTP paths, and duplicated prompt. |
| GitHub PR integration: `internal/integrations/github/client.go`, `cmd/simili/commands/pr_support.go` | Added GitHub methods to get/list PRs and PR files, plus helpers to resolve PR collection names, list PR file paths, build PR metadata text, and extract linked issue refs (regex). Check pagination and regex edge cases. |
| CLI: indexing & processing: `cmd/simili/commands/index.go`, `cmd/simili/commands/process.go`, `cmd/simili/commands/batch.go`, `cmd/simili/commands/learn.go` | Added `--include-prs` / `--pr-collection`, PR processing path and Job.IsPullRequest, removed hard GEMINI env fallback in favor of cfg-driven API key/model, and threaded dynamic embedding dimensions into collection creation. Verify collection creation and PR routing. |
| New PR command & tests: `cmd/simili/commands/pr_duplicate.go`, `cmd/simili/commands/pr_duplicate_test.go` | New `pr-duplicate` command implementing candidate search + optional LLM analysis and unit tests for linked-issue extraction and candidate building. Review error paths and test coverage. |
| Vector DB prep & usage: `internal/steps/vectordb_prep.go` | VectorDBPrep now holds embedder reference and can override collection dimension using embedder-reported dimensions. |
| Web app wiring: `cmd/simili-web/main.go` | Switched to config-driven embedder/LLM initialization (uses `cfg.Embedding.APIKey`/model) and adjusted init logging/messages. |
| Misc (workflows & examples): `DOCS/examples/.../workflow.yml`, `DOCS/examples/.../simili.yaml`, `DOCS/examples/multi-repo/*` | Updated example workflows to include `OPENAI_API_KEY` in secrets/env and simili config examples to show `model`/`dimensions` and new defaults. |

Sequence Diagram(s)

sequenceDiagram
    actor User as User
    participant CLI as CLI (pr-duplicate)
    participant GH as GitHub API
    participant Embed as Embedder (Gemini/OpenAI)
    participant Qdrant as Qdrant
    participant LLM as LLM Client (Gemini/OpenAI)
    participant Out as Output

    User->>CLI: run pr-duplicate --repo org/repo --number N
    CLI->>GH: GetPullRequest(org,repo,N) / ListPullRequestFiles
    GH-->>CLI: PR metadata + file list
    CLI->>Embed: build embedding vector (PR text)
    Embed-->>CLI: embedding vector
    CLI->>Qdrant: search issues collection (and PR collection if present)
    Qdrant-->>CLI: candidate list
    CLI->>LLM: DetectPRDuplicate(PR + candidates)
    LLM-->>CLI: structured duplicate result
    CLI->>Out: render text or JSON
    Out-->>User: display findings
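
The candidate-search step in the diagram (search the issues collection, and the PR collection if one is configured) could look roughly like the sketch below. It is only an illustration: `Candidate`, `Searcher`, `findCandidates`, and the in-memory `fakeSearcher` are hypothetical names, not types from this PR or from the Qdrant client.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical search hit; the PR's real types may differ.
type Candidate struct {
	Collection string
	Number     int
	Score      float64
}

// Searcher is a hypothetical abstraction over a vector store.
type Searcher interface {
	Search(collection string, vector []float32, limit int) ([]Candidate, error)
}

// findCandidates searches the issues collection, then the PR collection
// when prColl is non-empty, and returns the top `limit` hits by score.
func findCandidates(s Searcher, vec []float32, issuesColl, prColl string, limit int) ([]Candidate, error) {
	issues, err := s.Search(issuesColl, vec, limit)
	if err != nil {
		return nil, fmt.Errorf("search %s: %w", issuesColl, err)
	}
	// Copy so that appending never writes into the searcher's own slice.
	out := append([]Candidate(nil), issues...)
	if prColl != "" {
		prs, err := s.Search(prColl, vec, limit)
		if err != nil {
			return nil, fmt.Errorf("search %s: %w", prColl, err)
		}
		out = append(out, prs...)
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	if len(out) > limit {
		out = out[:limit]
	}
	return out, nil
}

// fakeSearcher is an in-memory stand-in used only for this example.
type fakeSearcher map[string][]Candidate

func (f fakeSearcher) Search(c string, _ []float32, limit int) ([]Candidate, error) {
	res := f[c]
	if len(res) > limit {
		res = res[:limit]
	}
	return res, nil
}

func main() {
	s := fakeSearcher{
		"issues": {{"issues", 38, 0.9}, {"issues", 12, 0.5}},
		"prs":    {{"prs", 39, 0.7}},
	}
	got, _ := findCandidates(s, nil, "issues", "prs", 2)
	for _, c := range got {
		fmt.Println(c.Collection, c.Number)
	}
}
```

Passing an empty `prColl` skips the second search, which mirrors the "if present" branch in the diagram.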

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

feat: add batch CLI and web UI for issue analysis #36: Overlaps in embedding/provider selection and duplicate-detection code paths; similar LLM/embedder refactors and prompt/duplicate logic.

Suggested labels

enhancement

Poem

🐰 I hopped through code with keys in paw,

Gemini and OpenAI—both I saw,
PRs now indexed, duplicates near,
Config and docs made bright and clear,
A happy rabbit cheers—deploy with a cheer! 🥕

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Index and query prs' clearly summarizes the primary change: adding PR indexing and querying capabilities alongside the existing issue support. |
| Linked Issues check | ✅ Passed | The PR implements all core requirements from issue `#38`: adds PR indexing/querying, extends duplicate detection and cross-repo search to PRs, and enables intelligent PR routing via the new pr-duplicate command. |
| Out of Scope Changes check | ✅ Passed | All changes directly support PR support: PR indexing logic, PR-specific configuration (pr_collection), dual AI provider support (enabling PR features), GitHub client PR methods, and helper utilities, all tied to the `#38` objectives. |
| Merge Conflict Detection | ✅ Passed | ✅ No merge conflicts detected when merging into `main` |

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cmd/simili-web/main.go (1)

112-147: ⚠️ Potential issue | 🟠 Major

Add env fallback for embedding/LLM API keys when no config file is present.

With no config file, cfg.Embedding.APIKey stays empty and the web UI will fail even if GEMINI_API_KEY or OPENAI_API_KEY is set in the environment.

🔧 Suggested fix (env fallback)

 import (
 	"context"
 	"embed"
 	"encoding/json"
 	"fmt"
 	"io/fs"
 	"log"
 	"net/http"
 	"os"
+	"strings"
@@
 func initDependencies(cfg *config.Config) (*pipeline.Dependencies, error) {
 	deps := &pipeline.Dependencies{
 		DryRun: true, // Always dry-run for web UI
 	}
 
-	// Embedder (Gemini/OpenAI auto-selected by available keys)
-	embedder, err := gemini.NewEmbedder(cfg.Embedding.APIKey, cfg.Embedding.Model)
+	apiKey := strings.TrimSpace(cfg.Embedding.APIKey)
+	if apiKey == "" {
+		if k := strings.TrimSpace(os.Getenv("GEMINI_API_KEY")); k != "" {
+			apiKey = k
+		} else if k := strings.TrimSpace(os.Getenv("OPENAI_API_KEY")); k != "" {
+			apiKey = k
+		}
+	}
+
+	// Embedder (Gemini/OpenAI auto-selected by available keys)
+	embedder, err := gemini.NewEmbedder(apiKey, cfg.Embedding.Model)
 	if err != nil {
 		return nil, fmt.Errorf("failed to init embedder: %w", err)
 	}
 	deps.Embedder = embedder
@@
-	// LLM Client
-	llm, err := gemini.NewLLMClient(cfg.Embedding.APIKey)
+	// LLM Client
+	llm, err := gemini.NewLLMClient(apiKey)
 	if err != nil {
 		return nil, fmt.Errorf("failed to init LLM: %w", err)
 	}
 	deps.LLMClient = llm

🤖 Fix all issues with AI agents

In @.env.sample:
- Around line 7-12: Reorder the variables in .env.sample so QDRANT_COLLECTION
appears before QDRANT_URL to satisfy dotenv-linter key order (move the line
"QDRANT_COLLECTION=your-issues-collection" above
"QDRANT_URL=https://your-cluster.qdrant.io:6333"), or alternatively update the
dotenv-linter configuration to allow the current ordering if you prefer to keep
the existing layout; ensure the change preserves comments and optional
QDRANT_PR_COLLECTION lines.

In @.simili.yaml:
- Around line 7-10: The dimensions value in .simili.yaml (model:
gemini-embedding-001) is set to 1536 which conflicts with the codebase default
Embedding.Dimensions (768) and Gemini’s API default (3072) unless
output_dimensionality is set; either change dimensions to 768 to match the
codebase default or ensure the Gemini client initialization explicitly sets
output_dimensionality: 1536 so the API returns 1536-d embeddings (check places
that reference Embedding.Dimensions and the Gemini call in
internal/steps/vectordb_prep.go and the config default in
internal/core/config/config.go to keep the config and actual embedder output in
sync).

In `@cmd/simili/commands/batch.go`:
- Around line 333-339: The current code hard-fails when gemini.NewLLMClient
returns an error; change it to allow LLM initialization to fail gracefully by
only assigning deps.LLMClient and printing the verbose success message when err
== nil, and otherwise (when err != nil) do not return the error — optionally
emit a verbose warning about the LLM init failure; specifically, update the
gemini.NewLLMClient call handling so that deps.LLMClient remains nil on error
(the Dependencies.Close() already tolerates nil) and downstream LLM-dependent
steps remain skipped for non-LLM presets.

In `@internal/integrations/gemini/embedder.go`:
- Around line 116-147: embedOpenAI mutates e.dimensions unsafely while
Dimensions() reads it concurrently; add synchronization to avoid data races by
protecting dimensions with a mutex or atomic: add a sync.RWMutex (e.g., dimsMu)
to the Embedder struct, set e.dimensions under dimsMu.Lock() in embedOpenAI (or
initialize once if immutable) and read it under dimsMu.RLock() in Dimensions();
ensure all reads/writes of e.dimensions use the same lock to eliminate the race.

In `@internal/integrations/github/client.go`:
- Around line 157-168: ListPullRequestFiles currently returns only the first
page; change it to paginate internally like ListIssueEvents so callers don’t
miss files. Initialize opts if nil (PerPage:100), loop calling
c.client.PullRequests.ListFiles(ctx, org, repo, number, opts), append returned
[]*github.CommitFile to a single slice, update opts.Page using resp.NextPage (or
increment) until there are no more pages, and return the aggregated files and
the last *github.Response; preserve the existing error wrapping when a call
fails. Reference: ListPullRequestFiles, ListIssueEvents, and
listAllPullRequestFilePaths.
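
The `embedder.go` item above asks for a lock around every read and write of the dimensions field. A minimal sketch of that pattern follows, assuming a simplified `Embedder` with only the relevant fields; `setDimensions` is a hypothetical helper name and the real struct in the PR has more state.

```go
package main

import (
	"fmt"
	"sync"
)

// Embedder is a simplified stand-in for the PR's embedder; only the
// dimension bookkeeping is shown.
type Embedder struct {
	dimsMu     sync.RWMutex // guards dimensions
	dimensions int
}

// setDimensions records the vector size reported by the provider.
// (Hypothetical helper; the review suggests doing this inside embedOpenAI.)
func (e *Embedder) setDimensions(n int) {
	e.dimsMu.Lock()
	defer e.dimsMu.Unlock()
	e.dimensions = n
}

// Dimensions returns the current vector size under a read lock.
func (e *Embedder) Dimensions() int {
	e.dimsMu.RLock()
	defer e.dimsMu.RUnlock()
	return e.dimensions
}

func main() {
	e := &Embedder{dimensions: 768}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			e.setDimensions(1536) // concurrent writers
			_ = e.Dimensions()    // concurrent readers
		}()
	}
	wg.Wait()
	fmt.Println(e.Dimensions())
}
```

Running a program like this with `go run -race` is one way to confirm that no unsynchronized access remains.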

🧹 Nitpick comments (7)

cmd/simili/commands/pr_support.go (2)

17-17: Regex may miss some GitHub linking keywords.

The pattern handles close/closed/closes, fix/fixed/fixes/fixe, and resolve/resolved/resolves. Note that fixe is matched but this is uncommon. GitHub also supports additional keywords like closing which this pattern won't catch, though the current set covers the most common cases.

Also applies to: 103-128
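
For illustration, a keyword-based extractor along these lines might look like the sketch below. It is not the regex from `pr_support.go`: the pattern, `linkedIssueRe`, and `extractLinkedIssues` are assumptions, and the pattern covers only the close/fix/resolve families named above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// linkedIssueRe matches a closing keyword (case-insensitive), an optional
// colon, whitespace, and then "#<number>". Illustrative pattern only.
var linkedIssueRe = regexp.MustCompile(`(?i)\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s*:?\s+#(\d+)`)

// extractLinkedIssues returns the distinct issue numbers referenced with a
// closing keyword, in order of first appearance.
func extractLinkedIssues(body string) []int {
	seen := map[int]bool{}
	var out []int
	for _, m := range linkedIssueRe.FindAllStringSubmatch(body, -1) {
		n, err := strconv.Atoi(m[1])
		if err != nil || seen[n] {
			continue // skip unparsable numbers and duplicates
		}
		seen[n] = true
		out = append(out, n)
	}
	return out
}

func main() {
	fmt.Println(extractLinkedIssues("Fixes #38, closes #12 and resolves #38"))
	// prints: [38 12]
}
```

The leading `\b` keeps words such as "prefix" from matching as "fix", which is one of the edge cases worth covering in tests.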

76-78: Chained getters are nil-safe in go-github v60—defensive nil checks are optional.

In go-github v60, the GetUser(), GetBase(), and GetHead() accessor methods are designed to be nil-safe on their receivers and in chains. For example, if pr.GetUser() returns nil, calling GetLogin() on that nil pointer still returns an empty string rather than panicking. This is by design: the generated getters in github-accessors.go handle nil receivers gracefully.

The code as written is safe and does not require defensive nil checks. If you prefer explicit nil checks for code clarity, the suggested refactor is valid but not necessary.
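
The nil-safe chaining described here can be reproduced with a few hand-written accessors. The types below only mimic the shape of the generated getters; they are not go-github's own definitions.

```go
package main

import "fmt"

// User and PullRequest are simplified stand-ins for the go-github types.
type User struct{ Login *string }

// GetLogin returns "" when the receiver or the field is nil.
func (u *User) GetLogin() string {
	if u == nil || u.Login == nil {
		return ""
	}
	return *u.Login
}

type PullRequest struct{ User *User }

// GetUser returns nil when the receiver is nil, so the chain can continue.
func (p *PullRequest) GetUser() *User {
	if p == nil {
		return nil
	}
	return p.User
}

func main() {
	var pr *PullRequest // nil pull request
	fmt.Printf("%q\n", pr.GetUser().GetLogin())

	login := "nick1udwig"
	pr = &PullRequest{User: &User{Login: &login}}
	fmt.Printf("%q\n", pr.GetUser().GetLogin())
}
```

Calling a pointer-receiver method on a nil pointer is legal in Go as long as the method checks the receiver before dereferencing it, which is why the chain does not panic.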

DOCS/examples/multi-repo/shared-workflow.yml (1)

11-19: Consider documenting the "at least one required" constraint.

Both GEMINI_API_KEY and OPENAI_API_KEY are marked as required: false, but the application requires at least one. While runtime validation handles this, adding a comment in the workflow clarifying this constraint would improve maintainability.

📝 Suggested comment addition

     secrets:
+      # At least one of GEMINI_API_KEY or OPENAI_API_KEY must be provided
       GEMINI_API_KEY:
         required: false
       OPENAI_API_KEY:
         required: false

DOCS/single-repo-setup.md (1)

30-48: Mention PR workflows in Step 3/CLI backfill.

Since PR support was added, call out PR-triggered workflows and the PR indexing flags so users discover the feature.

✍️ Suggested doc tweak

-Create a GitHub Actions workflow file (e.g., `.github/workflows/simili.yml`) to trigger the bot on issue events.
+Create a GitHub Actions workflow file (e.g., `.github/workflows/simili.yml`) to trigger the bot on issue events
+(and `pull_request` events if you want PR workflows).

 ...
-2.  **Index Issues**:
+2.  **Index Issues**:
     ```bash
     gh simili index --repo owner/repo --config .github/simili.yaml
     ```
+   **Optional: Index PRs too**:
+   ```bash
+   gh simili index --repo owner/repo --config .github/simili.yaml --include-prs --pr-collection <name>
+   ```

cmd/simili/commands/pr_duplicate.go (1)

212-260: Align Qdrant env overrides with other commands.

Other commands honor QDRANT_URL/QDRANT_API_KEY. Adding the same fallback here improves CLI consistency.

🔧 Suggested adjustment

-	qdrantClient, err := qdrant.NewClient(cfg.Qdrant.URL, cfg.Qdrant.APIKey)
+	qURL := cfg.Qdrant.URL
+	if val := os.Getenv("QDRANT_URL"); val != "" && (qURL == "" || qURL == "localhost:6334") {
+		qURL = val
+	}
+	if qURL == "" {
+		qURL = "localhost:6334"
+	}
+
+	qKey := cfg.Qdrant.APIKey
+	if val := os.Getenv("QDRANT_API_KEY"); val != "" && qKey == "" {
+		qKey = val
+	}
+
+	qdrantClient, err := qdrant.NewClient(qURL, qKey)

internal/integrations/gemini/llm.go (1)

168-174: Close OpenAI idle connections in Close().

Right now only the Gemini client is closed. For long-lived processes using OpenAI, consider closing idle HTTP connections to avoid resource leaks.

♻️ Proposed fix

 func (l *LLMClient) Close() error {
-	if l.gemini != nil {
-		return l.gemini.Close()
-	}
-	return nil
+	var err error
+	if l.gemini != nil {
+		err = l.gemini.Close()
+	}
+	if l.openAI != nil {
+		l.openAI.CloseIdleConnections()
+	}
+	return err
 }

internal/integrations/gemini/embedder.go (1)

68-74: Close OpenAI idle connections in Close().

Right now only the Gemini client is closed. For long-lived processes using OpenAI, consider closing idle HTTP connections to avoid resource leaks.

♻️ Proposed fix

 func (e *Embedder) Close() error {
-	if e.gemini != nil {
-		return e.gemini.Close()
-	}
-	return nil
+	var err error
+	if e.gemini != nil {
+		err = e.gemini.Close()
+	}
+	if e.openAI != nil {
+		e.openAI.CloseIdleConnections()
+	}
+	return err
 }

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

DOCS/single-repo-setup.md (1)

36-48: ⚠️ Potential issue | 🟡 Minor

Update the CLI index command documentation to mention PR indexing capability.

The CLI index command supports PR indexing via the --include-prs flag and a dedicated --pr-collection parameter, but the documentation only shows basic issue indexing. Add an example or mention of these flags so users are aware they can also index pull requests.

Kavirubc · 2026-02-13T04:19:24Z

Hey @nick1udwig — really appreciate this contribution! The work here is solid and both features are things we want in 0.2.0.

However, this PR bundles two independent features that we'd like to track and merge separately:

OpenAI provider support → tracked in [0.2.0v][Feature] OpenAI provider support for embeddings and LLM #42
PR indexing + pr-duplicate CLI → tracked in [0.2.0v][Feature] PR indexing with dedicated collection and pr-duplicate CLI #43

Could you split this into two PRs targeting those issues? It'll make each easier to review, test independently, and revert if needed.

If you'd like to be assigned to either issue, please drop a comment on #42 or #43 and we'll get you added.

Heads-up: codebase has changed since this PR was opened

Two PRs have landed on main since you branched — you'll want to rebase and reconcile the following:

PR #39 — feat: support pull request events in triage pipeline

pull_request and pr_comment event types added to the runtime pipeline
enrichIssueFromGitHubEvent + populateIssuePayload extracted in cmd/simili/commands/process.go
Pipeline guards added in transfer_check, llm_router, action_executor, pending_action_scheduler to skip transfer/routing for PR events
Workflow examples updated: pull_request trigger added, closed/deleted removed from issues trigger, pull-requests: write permission added
"Similar Issues" renamed to "Similar Threads" in response_builder.go
action.yml renamed to "Issue & PR Intelligence"

PR #41 — fix: PR triage review feedback

synchronize removed from pull_request workflow trigger (fires only on opened, edited, reopened)
Similar Threads table column header corrected from Issue to Thread
command_handler.go comment reworded for accuracy

The main files likely to conflict with your branch are DOCS/examples/*/workflow.yml, internal/steps/response_builder.go, and action.yml. Let us know if you need any help with the rebase.

Thanks again for the effort on this — looking forward to getting both features in!

nick1udwig added 3 commits February 12, 2026 14:39

add (provisional) indexing and querying of prs

624e76d

split up runPRDuplicate

5b043e2

add openai support

cfa5a63

nick1udwig requested a review from Kavirubc as a code owner February 12, 2026 23:15

Merge pull request #1 from nick1udwig/add-openai-support

8968259

add openai support

coderabbitai bot reviewed Feb 12, 2026

mahsumaktas mentioned this pull request Feb 12, 2026

feat: support pull request events in triage pipeline #39

Merged

nick1udwig added 5 commits February 12, 2026 16:46

chore: reorder qdrant env vars in sample

06cecd2

chore: align default simili embedding config to 768

4720fd2

fix: make batch LLM initialization non-fatal

43d97cc

fix: make embedder dimensions updates race-safe

bfa41d0

fix: paginate pull request file listing in github client

1773bb9

Kavirubc assigned nick1udwig Feb 13, 2026

Merge branch 'main' into index-and-query-prs

b314f1c

coderabbitai bot reviewed Feb 13, 2026

This was referenced Feb 13, 2026

[0.2.0v][Feature] OpenAI provider support for embeddings and LLM #42

Open

[0.2.0v][Feature] PR indexing with dedicated collection and pr-duplicate CLI #43

Open

nick1udwig wants to merge 10 commits into similigh:main from nick1udwig:index-and-query-prs
