Request Type
Feature Request
Problem Description
Cortex can run analyzers and responders (collectively, neurons, if I'm using the term properly) as subprocesses (ProcessJobRunnerSrv) or using Docker (DockerJobRunnerSrv). When processes are used, all the neuron code ends up in the same filesystem, process, and network namespace as Cortex itself. When Docker is used, both Cortex itself and each neuron's code can run in their own containers. This is a maintainability triumph.
But in order to do this, Cortex's container has to have access to the Docker socket, and it has to share a directory with the neuron containers it runs. (With the DockerJobRunnerSrv, this filesystem sharing happens via access to a directory in the host OS's filesystem.) Granting access to the Docker socket is not a security best practice, because it is equivalent to local root; it also hinders scalability and flexibility in running neurons, and it ties Cortex specifically to Docker rather than to any software that can run containers.
Kubernetes
Kubernetes offers APIs for running containers which are scalable to multi-node clusters, securable, and supported by multiple implementations. Some live natively in public clouds (EKS, AKS, GKE, etc.), some are trivial single-node clusters for development (minikube, KIND), and some are lightweight but production-grade (k3s, MicroK8s). Kubernetes has extension points for storage, networking, and container running, so these functions can be provided by plugins.
The net effect is that while there is a dizzying array of choices about how to set up a Kubernetes cluster, applications consuming the Kubernetes APIs don't need to make those choices, only to signify what they need. The cluster will make the right thing appear, subject to the choices of its operators. And the people using the cluster need not be the same people as the operators: public clouds support quick deployments with but few questions, I've heard.
Jobs
One of the patterns Kubernetes supports for using containers to get work done is the Job. (I'll capitalize it here to avoid confusion with Cortex jobs.) You create a Job with a Pod spec (which in turn specifies some volumes and some containers), and the Job will create and run the Pod, retrying until it succeeds, subject to limits on time, number of tries, rate limits, and the like. Upon succeeding it remains until deleted, or until a configured timeout expires, etc.
Running a Job with the Kubernetes API would be not unlike running a container with the Docker API, except that it would be done using a different client, and filesystem sharing would be accomplished in a different way.
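To make that concrete: assuming the fabric8 client were chosen, building and submitting such a Job might look roughly like the sketch below. The object name, Job name, image, and limits are placeholders of mine, and the exact builder and DSL calls differ a little between client versions.

```scala
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder
import io.fabric8.kubernetes.client.DefaultKubernetesClient

// Sketch only: build a Job whose Pod runs one container, with retry and
// clean-up limits, then submit it to the cluster.
object JobSketch extends App {
  val client = new DefaultKubernetesClient() // kubeconfig or in-cluster config

  val job = new JobBuilder()
    .withNewMetadata()
      .withName("cortex-neuron-example")        // placeholder name
    .endMetadata()
    .withNewSpec()
      .withBackoffLimit(3)                      // retry at most 3 times
      .withActiveDeadlineSeconds(600L)          // give up after 10 minutes
      .withTtlSecondsAfterFinished(3600)        // garbage-collect an hour after it finishes
      .withNewTemplate()
        .withNewSpec()
          .withRestartPolicy("Never")
          .addNewContainer()
            .withName("neuron")
            .withImage("example/neuron:latest") // placeholder analyzer/responder image
          .endContainer()
        .endSpec()
      .endTemplate()
    .endSpec()
    .build()

  client.batch().v1().jobs().inNamespace("cortex").createOrReplace(job)
  client.close()
}
```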
Sharing files with a job
With the Docker job runner, a directory on the host is specified; it's mounted into the Cortex container; and when Cortex creates a neuron container, it's mounted into that neuron container too. This implicitly assumes that the Cortex container and the neuron container run on the same host and that they can access the same filesystem at the same time, and it relies on that host directory for persistent storage. None of these assumptions necessarily holds under Kubernetes.
Under Kubernetes, a PersistentVolumeClaim can be created, which backs a volume that can be mounted into a container (as signified in the spec for the container). That claim can have ReadWriteMany as its accessModes setting, which signifies to the cluster a requirement that multiple nodes should be able to read and write files in the PersistentVolume which the cluster provides to satisfy the claim. On a trivial, single-node cluster, the persistent volume can be a hostPath: the same way everything happens now with Docker, but more complicated. But different clusters can provide other kinds of volumes to satisfy such a claim: self-hosted clusters may use Longhorn or Rook to provide redundant, fault-tolerant storage; or public clouds may provide volumes of other types they have devised themselves (Elastic Filesystem, Azure Shared Disks, etc). The PersistentVolumeClaim doesn't care.
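Expressed with the fabric8 builders (again assuming that client), such a claim would look roughly like this. The claim name, size, and storage class are placeholders, and in practice the claim would probably be created by whatever deploys Cortex rather than by Cortex itself.

```scala
import io.fabric8.kubernetes.api.model.{PersistentVolumeClaimBuilder, Quantity}

object JobStorage {
  // Sketch only: a ReadWriteMany claim for the shared job base directory.
  val jobsClaim = new PersistentVolumeClaimBuilder()
    .withNewMetadata()
      .withName("cortex-jobs")              // placeholder claim name
    .endMetadata()
    .withNewSpec()
      .withAccessModes("ReadWriteMany")     // several nodes may read and write at once
      .withStorageClassName("longhorn")     // placeholder; whatever the cluster provides
      .withNewResources()
        .addToRequests("storage", new Quantity("10Gi"))
      .endResources()
    .endSpec()
    .build()
}
```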
So the Cortex container is created with a volume, backed by a ReadWriteMany PersistentVolumeClaim, mounted as its job base directory. When Cortex runs a neuron, it creates a directory for the job and then creates a Job whose Pod mounts job-specific subPaths of that same PersistentVolumeClaim as the /job/input (readOnly: true) and /job/output directories. How the files actually get shared is up to the cluster. The Job can see and use only the input and output directories belonging to the Cortex job it serves. When the Job finishes, the output files are visible in the Cortex container under the job base directory, as with the other job-running methods.
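The volume wiring for a single Cortex job might then look roughly like the sketch below (still assuming fabric8). The function and resource names are hypothetical; only the /job/input and /job/output paths, the readOnly flag, and the job-specific subPaths come from the description above.

```scala
import io.fabric8.kubernetes.api.model.batch.v1.{Job, JobBuilder}

object NeuronJobs {
  // Sketch only: one Job per Cortex job, mounting job-specific subPaths of the
  // shared claim as /job/input (read-only) and /job/output.
  def neuronJob(jobId: String, image: String, claimName: String): Job =
    new JobBuilder()
      .withNewMetadata()
        .withName(s"neuron-$jobId")
      .endMetadata()
      .withNewSpec()
        .withBackoffLimit(1)
        .withNewTemplate()
          .withNewSpec()
            .withRestartPolicy("Never")
            .addNewVolume()
              .withName("job-directory")
              .withNewPersistentVolumeClaim()
                .withClaimName(claimName)            // the ReadWriteMany claim above
              .endPersistentVolumeClaim()
            .endVolume()
            .addNewContainer()
              .withName("neuron")
              .withImage(image)
              .addNewVolumeMount()
                .withName("job-directory")
                .withMountPath("/job/input")
                .withSubPath(s"$jobId/input")        // only this job's input...
                .withReadOnly(true)
              .endVolumeMount()
              .addNewVolumeMount()
                .withName("job-directory")
                .withMountPath("/job/output")
                .withSubPath(s"$jobId/output")       // ...and output are visible
              .endVolumeMount()
            .endContainer()
          .endSpec()
        .endTemplate()
      .endSpec()
      .build()
}
```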
How to implement
- Choose a Kubernetes client.
  - skuber exists specifically for Scala, but appears to have been last updated in September 2020, and is not automatically generated from the Kubernetes API definition, so it takes manual work to make updates.
  - io.fabric8:kubernetes-client is generally for Java, and it's automatically generated, with the last update this month.
- Write a KubernetesJobRunnerSrv, which takes information about a persistent volume claim passed in, and uses it to create a Kubernetes Job for a Cortex job, hewing closely to the DockerJobRunnerSrv (see the sketch after this list).
- Follow dependencies and add code as necessary until Cortex can be configured to run jobs this way, and can run the jobs.
- Document several use cases.
  - The simplest way to run everything on a single machine.
  - A simple self-hosted setup.
- Write a Helm chart or Operator, which will make Cortex deployment quick and easy, given a cluster.
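As a rough idea of the shape such a service could take, here is a hypothetical skeleton only, assuming the fabric8 client and reusing the neuronJob builder sketched above. It is not the actual DockerJobRunnerSrv interface, and the namespace and claim name would come from Cortex configuration rather than the defaults shown here.

```scala
import java.util.concurrent.TimeUnit

import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClient}

// Hypothetical skeleton: create a Job for a Cortex job, wait for it, clean up.
class KubernetesJobRunnerSrv(
    client: KubernetesClient = new DefaultKubernetesClient(), // kubeconfig or in-cluster config
    namespace: String = "cortex",      // would come from Cortex configuration
    claimName: String = "cortex-jobs"  // the ReadWriteMany claim, also from configuration
) {

  /** Run one neuron image for one Cortex job and block until it finishes or times out. */
  def run(jobId: String, image: String, timeoutSeconds: Long): Unit = {
    val jobs = client.batch().v1().jobs().inNamespace(namespace)

    // Build and submit the Job with the /job/input and /job/output subPath mounts,
    // using the neuronJob() builder sketched earlier in this issue.
    jobs.createOrReplace(NeuronJobs.neuronJob(jobId, image, claimName))

    // Wait for the Job to report success, then delete it; Cortex then finds the
    // output files under its job base directory, as with the other runners.
    jobs.withName(s"neuron-$jobId").waitUntilCondition(
      j => Option(j.getStatus).exists(s => Option(s.getSucceeded).exists(_ >= 1)),
      timeoutSeconds, TimeUnit.SECONDS
    )
    jobs.withName(s"neuron-$jobId").delete()
  }
}
```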