Skip to content

Conversation

@ptelang
Copy link

@ptelang ptelang commented Jan 16, 2026

Summary

This PR introduces the EmbeddingServer Custom Resource Definition (CRD) for the ToolHive Kubernetes Operator, enabling deployment and management of HuggingFace Text Embeddings Inference (TEI) servers.

Key Features

  • New CRD: EmbeddingServer for deploying HuggingFace Text Embeddings Inference servers in Kubernetes
  • Model Management: Supports downloading and caching embedding models from HuggingFace
  • Persistent Storage: Optional PVC support for model caching to speed up restarts after initial download
  • Resource Configuration: Full control over deployment resources, environment variables, and pod templates
  • Group Integration: Can be associated with MCPGroups for organization and management
  • Status Management: Comprehensive condition-based status tracking:
    • ModelReady: Indicates whether the embedding model is downloaded and ready
    • VolumeReady: Indicates whether the PVC for model caching is ready
    • ImageValidated: Validates the container image
    • PodTemplateValid: Validates the pod template configuration

Components Added

  • CRD Definition: cmd/thv-operator/api/v1alpha1/embeddingserver_types.go
  • Controller: cmd/thv-operator/controllers/embeddingserver_controller.go
  • Unit Tests: Comprehensive test coverage for the controller
  • Example Configurations:
    • basic-embedding.yaml: Minimal configuration with default settings
    • embedding-with-cache.yaml: Configuration with persistent model cache (10Gi PVC)
    • embedding-advanced.yaml: Advanced setup with custom resources and environment variables
  • Documentation: README with detailed examples and usage instructions

Test plan

  • Unit tests for EmbeddingServer controller
  • Example YAML configurations validated
  • E2E operator tests (Chainsaw) for EmbeddingServer CRD
  • Manual testing of embedding server deployment in Kubernetes cluster
  • Verification of model caching with persistent volumes

Large PR Justification

  • Multiple related changes that would break if separated

Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Jan 16, 2026
@ptelang ptelang requested a review from jerm-dro January 16, 2026 18:02
@ptelang ptelang force-pushed the add-embedding-engine branch from 97bc58c to 8d13462 Compare January 16, 2026 18:04
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 16, 2026
@ptelang ptelang force-pushed the add-embedding-engine branch from 8d13462 to 6203f70 Compare January 16, 2026 18:08
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 16, 2026
@github-actions github-actions bot dismissed their stale review January 16, 2026 18:09

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 16, 2026
Copy link
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly looks good, but I have a few blocking changes:

  • Let's use a Secret to plumb huggingface tokens into the EmbeddingServer
  • Let's treat the server as a StatefulSet rather than attaching PVCs to a deployment

@ChrisJBurns
Copy link
Collaborator

@ptelang Thanks for this. Is there any way we can split this into multiple PRs? 4k diffs is quite a large change and I feel folks will miss stuff in the cracks due to trying to review the entire thing. I've also noticed that the controller has 1k lines of code but only 300 lines of unit tests. Is there a way we can increase the test coverage to ensure the code does what we want?

@ChrisJBurns
Copy link
Collaborator

@claude Can you please review this PR using the necessary subagents in this repository. Leverage your knowledge of software engineering best practices (such as SOLID / DRY etc) and Kubernetes controller in order to support your review.

@claude
Copy link
Contributor

claude bot commented Jan 19, 2026

Claude finished @ChrisJBurns's task in 5m 24s —— View job


PR Review Complete

I've conducted a comprehensive review of the EmbeddingServer CRD implementation using the kubernetes-expert and code-reviewer specialized agents. Here are the findings:

Tasks

  • Review CRD design and Kubernetes patterns (kubernetes-expert)
  • Review code quality and ToolHive best practices (code-reviewer)
  • Analyze test coverage and quality
  • Consolidate findings and provide comprehensive feedback

Critical Issues (Blocking)

1. Security: EnvVar Type Lacks Secret Support ⛔

Location: cmd/thv-operator/api/v1alpha1/mcpserver_types.go:225-234

The custom EnvVar type only supports plain string values:

type EnvVar struct {
    Name  string `json:"name"`
    Value string `json:"value"`  // ⚠️ No support for secrets!
}

Problem: HuggingFace models often require authentication tokens (especially for private models or to avoid rate limits). Storing these as plain text environment variables is a security vulnerability - tokens will be exposed in Git commits, logs, and kubectl describe output.

This aligns with @jerm-dro's blocking review comment about using Secrets for HF tokens.

Recommendation: Extend the EnvVar type to support ValueFrom.SecretKeyRef like Kubernetes native envvars. Your codebase already has the SecretKeyRef type in MCPExternalAuthConfig that can be reused:

type EnvVar struct {
    Name string `json:"name"`
    // Value is the plain text value (mutually exclusive with ValueFrom)
    // +optional
    Value string `json:"value,omitempty"`
    // ValueFrom is a reference to a secret (mutually exclusive with Value)
    // +optional
    ValueFrom *EnvVarSource `json:"valueFrom,omitempty"`
}

type EnvVarSource struct {
    // SecretKeyRef selects a key from a secret
    // +optional
    SecretKeyRef *SecretKeyRef `json:"secretKeyRef,omitempty"`
}

Then update the controller to validate secret references exist and add RBAC markers:

//+kubebuilder:rbac:groups="",resources=secrets,verbs=get;list;watch

2. Kubernetes: Deployment + PVC Scaling Issue ⛔

Location: cmd/thv-operator/controllers/embeddingserver_controller.go:740-752

The controller creates a Deployment with a single PVC that all replicas try to mount:

podTemplate.Spec.Volumes = []corev1.Volume{
    {
        Name: "model-cache",
        VolumeSource: corev1.VolumeSource{
            PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
                ClaimName: pvcName,  // ⚠️ Single PVC for all replicas
            },
        },
    },
}

This validates @jerm-dro's blocking review comment about converting to StatefulSet.

Problem: When replicas > 1:

  • With ReadWriteOnce (default): Only the first pod can mount the PVC. Additional pods get stuck in Pending state with Multi-Attach error
  • With ReadWriteMany: Works but requires specific storage classes (NFS, CephFS) that aren't available in all environments

Evidence: The CRD defaults to ReadWriteOnce (cmd/thv-operator/api/v1alpha1/embeddingserver_types.go:119), which will fail when users try to scale.

Why StatefulSet Solves This:

  • Uses volumeClaimTemplates - each pod gets its own PVC automatically
  • Each replica (embedding-0, embedding-1, embedding-2) gets its own storage (model-cache-embedding-0, model-cache-embedding-1, etc.)
  • Works perfectly with ReadWriteOnce (the most common storage type)
  • Provides stable pod identity and ordered operations

Recommendation: Use StatefulSet when modelCache.enabled=true and replicas > 1. Your MCPServer controller already uses StatefulSet (see cmd/thv-operator/controllers/mcpserver_controller.go:1384-1395), so this would align with existing patterns.

Trade-off: Each replica will download and cache the model separately (N × storage cost), but this is the correct architecture for truly independent replicas. If shared caching is critical, you'd need a more complex sidecar pattern.


3. Missing Secret Reference Validation ⛔

Location: cmd/thv-operator/controllers/embeddingserver_controller.go:649-664

After fixing issue #1 above, the controller will need to validate that referenced secrets actually exist before creating the Deployment. Otherwise, pods will fail with CreateContainerConfigError.

See cmd/thv-operator/controllers/virtualmcpserver_deployment.go:599-611 for the validation pattern used in VirtualMCPServer.


Major Issues

4. Code Quality: Silent Error Drops

Location: cmd/thv-operator/controllers/embeddingserver_controller.go:554-556, 593-595, 884-886

Multiple places silently drop errors:

if err := ctrl.SetControllerReference(embedding, pvc, r.Scheme); err != nil {
    return nil  // ❌ Error silently dropped
}

Impact: If SetControllerReference fails, the PVC/Deployment/Service is created without an owner reference, breaking cascade deletion when the EmbeddingServer is deleted.

Recommendation: At minimum, log the error. Better yet, return it to the caller.


5. Architecture: Multiple Status Updates Per Reconciliation

Location: Throughout embeddingserver_controller.go (lines 329, 369, 402-404, 997)

The controller calls r.Status().Update() multiple times per reconciliation loop, which:

  • Creates race conditions when status changes between reads
  • Increases API server load
  • Can cause reconciliation thrashing

Recommendation: Adopt the StatusCollector pattern used in MCPRegistry controller (see cmd/thv-operator/controllers/mcpregistry_controller.go:162-164):

statusCollector := embeddingserverstatus.NewCollector(embedding)
// Make all status changes through collector
statusCollector.SetPhase(...)
statusCollector.SetCondition(...)
// Single batch update at end
statusCollector.Apply(ctx, r.Client)

6. Test Coverage: Missing Critical Path Tests

Metrics:

  • Controller code: 1,031 lines
  • Test code: 316 lines (30.7% ratio)
  • Tests cover: Only helper methods (GetPort(), GetReplicas(), labelsForEmbedding())
  • NOT tested: Reconciliation logic, image validation, PVC creation, deployment drift detection, finalizer cleanup

This addresses @ChrisJBurns's comment about 1k lines of controller code with only 300 lines of tests.

Critical paths lacking unit tests:

  • Reconcile() method
  • validateImage() - what happens when image is invalid?
  • ensurePVC() - PVC creation and binding checks
  • deploymentNeedsUpdate() - drift detection logic
  • handleDeletion() - finalizer cleanup

E2E tests exist (basic, with-cache, lifecycle, multi-tenancy), which is good, but unit tests are needed for:

  • Fast feedback during development
  • Testing error conditions that are hard to reproduce in e2e
  • Validating edge cases

Recommendation: Add unit tests using mocked Kubernetes client. Example:

func TestEmbeddingServerReconciler_ValidateImage_Invalid(t *testing.T) {
    // Test invalid image rejection and status update
}

func TestEmbeddingServerReconciler_DeploymentNeedsUpdate_ImageChange(t *testing.T) {
    // Test deployment drift detection
}

Code Quality Issues

7. DRY Violations

PVC Name Calculation (lines 452, 512, 741):

pvcName := fmt.Sprintf("%s-model-cache", embedding.Name)

Repeated 3 times. Extract to method: getPVCName(embedding).

Environment Variable Logic (lines 650-663 vs 935-942):
Building env vars is duplicated in buildEnvVars() and deploymentNeedsUpdate(). Extract to shared method.

Status Update Pattern (lines 369-372, 402-404, 417-419, etc.):
The pattern if statusErr := r.Status().Update(...) { log.Error(...) } appears 5+ times in validateImage() alone.


8. CLAUDE.md Violation: Method Organization

Issue: Public methods should be at the top of the file, private methods at the bottom (per /home/runner/work/toolhive/toolhive/CLAUDE.md).

Current order:

  • Line 62: Reconcile (public) ✓
  • Lines 127-1022: Private helper methods
  • Line 1024: SetupWithManager (public) ❌ Should be near top

Recommendation: Move SetupWithManager to line ~123 (right after Reconcile).


9. Inconsistent ObservedGeneration

Some conditions set ObservedGeneration (line 465), others don't (lines 372, 404, etc.). This field should always be set to indicate which spec version the condition applies to.


Minor Issues

10. ModelCache Enabled Field Design

Location: cmd/thv-operator/api/v1alpha1/embeddingserver_types.go:98-120

type ModelCacheConfig struct {
    Enabled bool `json:"enabled,omitempty"`  // ⚠️ omitempty with bool
    // +kubebuilder:default=true

Using omitempty with booleans means false values are omitted from JSON, which can cause confusion when the default is true. Users can't explicitly disable caching without removing the entire modelCache section.

Recommendation: Either remove omitempty or use *bool for tri-state logic.


11. Unused Context Parameters

Lines 562, 851 use _ context.Context for unused parameters. Either remove them or add a comment explaining they're reserved for future use.


Positive Observations ✅

Despite the issues, there are many good practices:

  1. ✅ Early validation with fail-fast pattern
  2. ✅ Proper use of finalizers for cleanup
  3. ✅ Structured reconciliation with helper methods
  4. ✅ Image validation integration
  5. ✅ PodTemplateSpec validation using shared builder pattern
  6. Excellent E2E test coverage (basic, with-cache, lifecycle, multi-tenancy scenarios)
  7. ✅ Proper label management with standard Kubernetes labels
  8. ✅ Conflict resolution logic (updateDeploymentWithRetry)
  9. ✅ Resource ownership with SetControllerReference
  10. ✅ Phase-based status for user visibility

Summary and Recommendations

Addressing Existing Review Comments

@jerm-dro's blocking comments:

  1. Use Secret for HF tokens: Valid and critical - see Issue fix(typo): corrects readme #1 above
  2. Convert to StatefulSet: Valid and blocking - see Issue Do we want the container monitor? #2 above
  3. Log ignored errors: Valid - see embeddingserver_controller.go:137

@ChrisJBurns's questions:

  1. Splitting the PR: This is feasible but would require careful sequencing:

    • PR1: CRD types only (270 lines)
    • PR2: Controller (1031 lines) + tests
    • PR3: E2E tests and examples

    However, given the blocking issues identified above, I'd recommend fixing the critical issues first before splitting, as the StatefulSet change will touch both CRD and controller.

  2. Test coverage: Confirmed - controller has 1k lines but only 300 lines of tests covering trivial helpers. Needs unit tests for reconciliation logic.

Priority Recommendations

P0 (Must Fix Before Merge):

  1. Add SecretKeyRef support to EnvVar type for secure token handling
  2. Convert to StatefulSet when modelCache.enabled=true and replicas > 1
  3. Add secret reference validation in controller
  4. Fix silent error drops (log at minimum)
  5. Add RBAC markers for secret access

P1 (Should Fix Soon):
6. Add unit tests for critical reconciliation paths
7. Implement StatusCollector pattern for atomic status updates
8. Fix method organization (public methods at top)
9. Extract duplicated code (PVC name, env var logic)

P2 (Nice to Have):
10. Fix ModelCache.Enabled omitempty issue
11. Make ObservedGeneration setting consistent
12. Remove unused context parameters

Overall Assessment

Functionality: ✅ The controller works for single-replica scenarios and includes excellent E2E tests.

Production Readiness: ⚠️ Blocked by critical issues:

  • Security vulnerability (plain text tokens)
  • Scaling failure (Deployment + PVC)
  • Missing unit tests for critical paths

The implementation shows good understanding of Kubernetes operator patterns and follows most ToolHive conventions. With the recommended fixes, this will be a solid addition to the operator.


@ptelang
Copy link
Author

ptelang commented Jan 20, 2026

@ChrisJBurns

Is there any way we can split this into multiple PRs? 4k diffs is quite a large change and I feel folks will miss stuff in the cracks due to trying to review the entire thing.

I will split the PRs tomorrow.

I've also noticed that the controller has 1k lines of code but only 300 lines of unit tests. Is there a way we can increase the test coverage to ensure the code does what we want?

Good point. I added more unit tests.

@ptelang ptelang force-pushed the add-embedding-engine branch 2 times, most recently from d549ddb to e5af0d2 Compare January 20, 2026 14:09
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 20, 2026
@ptelang ptelang force-pushed the add-embedding-engine branch from e5af0d2 to 149c710 Compare January 20, 2026 14:14
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 20, 2026
@ptelang ptelang force-pushed the add-embedding-engine branch 2 times, most recently from bda5c39 to 03231c5 Compare January 20, 2026 15:06
@github-actions github-actions bot removed the size/XL Extra large PR: 1000+ lines changed label Jan 20, 2026
@ptelang ptelang force-pushed the add-embedding-engine branch from d19f9c5 to 931ad7c Compare January 20, 2026 21:41
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 20, 2026
Signed-off-by: Jeremy Drouillard <jeremy@stacklok.com>
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 20, 2026
@ptelang ptelang force-pushed the add-embedding-engine branch from 5b15f05 to 05e1f4f Compare January 21, 2026 14:13
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 21, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 21, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 21, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 21, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 21, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants