In early development · building in the open

The open source control plane for AI inference

Install Modelplane in your own environment, and it operates your GPU clusters across cloud, neocloud, and on-premise as one inference fleet: provisioning clusters, placing models, autoscaling replicas, caching weights, and routing through a single OpenAI-compatible endpoint. It runs any model on any serving engine on any infrastructure, all under your control.

Get started → · View on GitHub

ModelDeployment · deepseek-r1 · POST · SGLang · prefill / decode · 8× B200
ModelDeployment · llama-4-70b · POST · vLLM · tensor parallel · 4× H100
ModelDeployment · qwen3-235b · POST · TRT-LLM · data / expert · 8× H200

reconciling: provisioning · scheduling · autoscaling · routing · caching · policy · governance · compliance

InferenceCluster · GCP · Cloud · us-central1 · 256× TPU v6e · 8× H100
InferenceCluster · CoreWeave · Neocloud · gpu-east · 72× GB200 · 8× H200
InferenceCluster · DGX · On-prem · dc-1 · 32× H100 · 8× A100

The inference ecosystem. Under one control plane.

Modelplane doesn’t replace the inference ecosystem; it orchestrates it across three layers: the models you run, the engines that serve them, and the infrastructure underneath, spanning accelerators and providers. It composes what your teams already choose and integrates new pieces as they emerge.

orchestrates: composes · provisions · schedules · autoscales · routes · caches

Models

open weights & custom

Llama

Qwen

DeepSeek

Mistral

gpt-oss

Gemma

+ any open-weight model

Serving

inference engines

vLLM

SGLang

TensorRT-LLM

TGI

llama.cpp

LMDeploy

+ any engine

Infrastructure

providers & accelerators

Providers

AWS

GCP

Azure

CoreWeave

Lambda

on-prem

+ any Kubernetes

Accelerators

NVIDIA

AMD

Google TPU

AWS Trainium

Intel Gaudi

+ any accelerator

Advanced serving. From single GPU to frontier.

Modelplane matches each model’s requirements and serving topology to the hardware available, using expressive CEL selectors and composable API shapes. Topology is declared as shape, so Modelplane can place anything from a single GPU to multi-node, disaggregated frontier serving, and can take on new parallelism strategies as they emerge.

tensor parallel

Split each layer across GPUs in a node for low-latency single-model serving.

pipeline parallel

Stage a model across nodes so very large models fit beyond a single box.

data / expert

Replicate workers, or shard experts across them for MoE throughput.

prefill / decode

Disaggregate prefill and decode onto separate pools for frontier serving.

+ emerging topology

Described as shape, so future parallelism strategies just work.
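To make "topology as shape" concrete, here is a minimal Python sketch. It is not Modelplane's API: the `Shape` class and its field names are hypothetical, and it assumes the simplification that a tensor-parallel group stays inside one node while each pipeline stage takes a node of its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Hypothetical topology shape; names are illustrative, not Modelplane's schema."""
    tensor: int = 1    # GPUs each layer is split across, inside one node
    pipeline: int = 1  # pipeline stages, assumed to take one node each
    data: int = 1      # full copies of the model serving in parallel

    def gpus(self) -> int:
        # Each replica needs tensor * pipeline GPUs; data multiplies replicas.
        return self.tensor * self.pipeline * self.data

    def nodes(self) -> int:
        # One node per pipeline stage, per replica (simplifying assumption).
        return self.pipeline * self.data

    def fits(self, gpus_per_node: int, free_nodes: int) -> bool:
        # The tensor-parallel group must fit in a node, and enough nodes must be free.
        return self.tensor <= gpus_per_node and self.nodes() <= free_nodes


single_gpu = Shape()
multi_node = Shape(tensor=8, pipeline=2)
```

Under these assumptions, the same three numbers describe a single-GPU deployment and a two-node, sixteen-GPU one, which is why a placement check can stay the same as the shapes grow.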

A resource API for inference. Serving two roles.

Modelplane defines a flexible API for inference. Each role owns its own resources: developers declare model deployments and expose one service across regions, clouds, and managed vendors, while platform teams declare the fleet of clusters, accelerators, and gateways underneath.

Development & ML teams

Define model deployments: the model, the engine and its configuration, serving topology, hardware request, region, and environment. Then expose them as one service, weighted across regions, clouds, and managed vendors.

kind: ModelService
routing: weighted, openai

kind: ModelDeployment
model: llama-4-70b
cluster: aws-us-east

kind: ModelDeployment
model: llama-4-70b
cluster: gcp-eu-west

kind: ModelEndpoint
target: vendor-api
type: managed

Platform teams

Declare the fleet: a gateway over clusters across clouds and regions, each with its own hardware classes and node pools. Set the capacity, accelerators, policy, and cost controls the whole fleet runs within.

kind: InferenceGateway
routes: all endpoints

kind: InferenceCluster
pools: h200, h100

kind: InferenceCluster
pools: tpu-v6e, a100

kind: InferenceCluster
pools: h100, l40s

Capabilities built for the fleet. Not just the cluster.

01 / Provisioning

Provision the fleet, or bring your own

Provision inference clusters on AWS, GCP, and Azure, or bring your own on any Kubernetes. Each gets hardware classes, node pools, an inference gateway, and the full serving stack, installed and continuously reconciled.

Provisioning

Provision · GKE / EKS
Bring your own · any K8s

Modelplane installs & reconciles

InferenceCluster · ● reconciled
classes: h200-8x, h100-8x · node pools · gateway

✓ GPU operator & drivers
✓ Serving engines
✓ Inference gateway

02 / Scheduling

One global pool of capacity

Modelplane treats every cluster, cloud, and region as one global pool. A fleet scheduler places each model's replicas where its requirements match a cluster's capabilities, then hands off to the cluster's own scheduler and DRA.

Two-level scheduling

fleet scheduler · one global pool · tracks requirements ↔ capabilities
→ places replicas · aws-us-east · gcp-eu-west · azure-us2
→ cluster scheduler · DRA · bound
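The fleet-level half of this two-level scheduling can be sketched in a few lines of Python. This is an illustration only, not Modelplane's scheduler: the `Cluster` fields, the `place` function, and the capacities are hypothetical, and a plain Python predicate stands in for a CEL selector.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Cluster:
    # Hypothetical capability record for one cluster in the fleet.
    name: str
    accelerator: str
    gpus_per_node: int
    free_nodes: int


# Stand-in for a CEL selector: a predicate over a cluster's capabilities.
Selector = Callable[[Cluster], bool]


def place(replicas: int, selector: Selector, fleet: list[Cluster]) -> dict[str, int]:
    """Spread replicas (one node each) over matching clusters, emptiest first."""
    free = {c.name: c.free_nodes for c in fleet if selector(c)}
    placement: dict[str, int] = {}
    for _ in range(replicas):
        candidates = sorted(name for name, nodes in free.items() if nodes > 0)
        if not candidates:
            raise RuntimeError("not enough matching capacity in the fleet")
        # Most free nodes wins; ties go to the alphabetically first cluster.
        best = max(candidates, key=free.__getitem__)
        free[best] -= 1
        placement[best] = placement.get(best, 0) + 1
    return placement


fleet = [
    Cluster("aws-us-east", "H100", 8, free_nodes=2),
    Cluster("gcp-eu-west", "H100", 8, free_nodes=1),
    Cluster("azure-us2", "A100", 8, free_nodes=4),
]


def wants_h100(c: Cluster) -> bool:
    return c.accelerator == "H100" and c.gpus_per_node >= 8


result = place(3, wants_h100, fleet)
```

In this sketch the fleet level only decides how many replicas each cluster receives. Binding a replica to specific devices is left to the cluster's own scheduler and DRA, as described above.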

03 / Autoscaling

Scale replicas across clouds and regions

Every model exposes the standard Kubernetes scale subresource, so its replicas scale out across clusters, clouds, and regions, whether driven manually or by autoscalers such as HPA and KEDA.

Scale-to-zero is on the roadmap.

Autoscaling

load · spec.replicas 6

GCP · eu-west
AWS · us-east
Azure · us-east2

min 1 · max 8 replicas
scale subresource · HPA / KEDA
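For a sense of how an autoscaler arrives at a number like `spec.replicas 6`, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × metric / target), clamped to the min and max. The function name and the metric values are hypothetical, and HPA's tolerance band and stabilization window are left out.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """HPA-style replica count: scale by the metric ratio, then clamp."""
    if target <= 0:
        raise ValueError("target must be positive")
    raw = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical load: 4 replicas averaging 90 in-flight requests each,
# against a target of 60 per replica.
scaled_up = desired_replicas(current=4, metric=90, target=60)
```

Whatever computes the number, it is written to the scale subresource as a single replica count.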

04 / Routing

One service, many replicas and endpoints

A model service is one stable, OpenAI-compatible endpoint over many replicas and model endpoints. Weighted routing spreads traffic across replicas for canary and A/B rollouts, and a managed endpoint can take a weighted share too.

One service, many endpoints

ModelService · prod-llama · ● one OpenAI-compatible endpoint
ModelEndpoint · replica · aws-us-east
ModelEndpoint · replica · gcp-eu-west
ModelEndpoint · managed · vendor
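Weighted routing over replicas and a managed endpoint can be illustrated with smooth weighted round-robin, a common technique for weighted load balancing. This is a sketch of the idea, not Modelplane's routing algorithm, and the weights are hypothetical.

```python
from collections import Counter


def smooth_weighted(weights: dict[str, int], n: int) -> list[str]:
    """Pick n endpoints so each gets a share of traffic proportional to its weight."""
    total = sum(weights.values())
    current = {name: 0 for name in weights}
    picks: list[str] = []
    for _ in range(n):
        # Every endpoint earns credit equal to its weight on each request...
        for name, weight in weights.items():
            current[name] += weight
        # ...and the endpoint holding the most credit serves the request.
        chosen = max(current, key=current.__getitem__)
        current[chosen] -= total
        picks.append(chosen)
    return picks


# Hypothetical split: two self-hosted replicas plus a 10% share for a managed vendor.
weights = {"aws-us-east": 5, "gcp-eu-west": 4, "vendor-api": 1}
picks = smooth_weighted(weights, 10)
share = Counter(picks)
```

Over any window of ten requests in this example, each endpoint is picked exactly as many times as its weight, and the picks are interleaved rather than sent in bursts. That suits a canary, which is just an endpoint with a small weight.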

Genuinely open. Community driven.

Modelplane is Apache 2.0 and open source end to end. It runs entirely in your own infrastructure and depends on nothing outside it, so no vendor can throttle, restrict, or revoke access.

It’s built by the team behind Crossplane, the open source control plane trusted at Apple, JPMC, Nike, Elastic, Grafana, and MongoDB, and it’s headed for a neutral open source foundation. It’s built in the open: star the project, join the community, and help shape where it goes.

★ Star on GitHub · Join the community
