In early development · building in the open
Install Modelplane in your own environment, and it operates your GPU clusters across cloud, neocloud, and on-premises environments as one inference fleet: provisioning clusters, placing models, autoscaling replicas, caching weights, and routing through a single OpenAI-compatible endpoint. It runs any model on any serving engine on any infrastructure, all under your control.
Modelplane doesn’t replace the inference ecosystem; it orchestrates it across three layers: the models you run, the engines that serve them, and the infrastructure underneath, spanning accelerators and providers. It composes what your teams already choose and integrates new pieces as they emerge.
orchestrates
Models
open weights & custom
Serving
inference engines
Infrastructure
providers & accelerators
Providers
Accelerators
Modelplane matches each model’s requirements and serving topology to the available hardware, using expressive CEL (Common Expression Language) selectors and composable API shapes. Because topology is declared as shape, it can place anything from a single-GPU deployment to multi-node, disaggregated frontier serving, and it can accommodate new parallelism strategies as they emerge.
tensor parallel
Split each layer across GPUs in a node for low-latency single-model serving.
pipeline parallel
Stage a model across nodes so very large models fit beyond a single box.
data / expert
Replicate workers, or shard experts across them for MoE throughput.
prefill / decode
Disaggregate prefill and decode onto separate pools for frontier serving.
+ emerging topology
Described as shape, so future parallelism strategies just work.
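As a rough illustration of how a topology declared as shape could translate into a hardware request, the Python sketch below multiplies the parallelism degrees to get a GPU count, then filters hardware classes with a plain predicate standing in for a CEL selector. The `Shape` and `HardwareClass` types, their field names, and the `l40s-4x` class are hypothetical and are not Modelplane's API; the sketch also assumes tensor parallelism stays within one node and each pipeline stage occupies one node.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Shape:
    """Hypothetical topology shape; field names are illustrative only."""
    tensor_parallel: int = 1    # GPUs per pipeline stage, all within one node
    pipeline_parallel: int = 1  # number of stages, one node per stage
    data_parallel: int = 1      # independent copies of the whole model

    def gpus_per_node(self) -> int:
        return self.tensor_parallel

    def nodes(self) -> int:
        return self.pipeline_parallel * self.data_parallel

    def total_gpus(self) -> int:
        return self.gpus_per_node() * self.nodes()


@dataclass(frozen=True)
class HardwareClass:
    """Hypothetical description of one node type in a cluster."""
    name: str
    accelerator: str
    gpus_per_node: int


def matching_classes(
    shape: Shape,
    classes: List[HardwareClass],
    selector: Callable[[HardwareClass], bool],
) -> List[str]:
    """Names of classes that pass the selector and fit one stage of the shape.

    The selector is an ordinary Python predicate used here in place of a
    CEL expression.
    """
    return [
        c.name
        for c in classes
        if selector(c) and c.gpus_per_node >= shape.gpus_per_node()
    ]


if __name__ == "__main__":
    classes = [
        HardwareClass("h200-8x", "h200", 8),
        HardwareClass("h100-8x", "h100", 8),
        HardwareClass("l40s-4x", "l40s", 4),  # made-up class for contrast
    ]
    # Tensor parallel across 8 GPUs in a node, pipelined over 2 nodes.
    shape = Shape(tensor_parallel=8, pipeline_parallel=2)
    print(shape.total_gpus(), shape.nodes())  # 16 2
    print(matching_classes(shape, classes,
                           lambda c: c.accelerator in {"h100", "h200"}))
```

A new parallelism strategy would, in this toy model, be one more field on `Shape` and one more factor in the arithmetic, which is the intuition behind describing topology as shape.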
Modelplane defines a flexible API for inference. Each role owns its own resources: developers declare model deployments and expose one service across regions, clouds, and managed vendors, while platform teams declare the fleet of clusters, accelerators, and gateways underneath.
Development & ML teams
Define model deployments: the model, the engine and its configuration, serving topology, hardware request, region, and environment. Then expose them as one service, weighted across regions, clouds, and managed vendors.
kind: ModelService
name: prod-llama
routing: weighted, openai
kind: ModelDeployment
model: llama-4-70b
cluster: aws-us-east
kind: ModelDeployment
model: llama-4-70b
cluster: gcp-eu-west
kind: ModelEndpoint
target: vendor-api
type: managed
Platform teams
Declare the fleet: a gateway over clusters across clouds and regions, each with its own hardware classes and node pools. Set the capacity, accelerators, policy, and cost controls the whole fleet runs within.
kind: InferenceGateway
name: prod-gateway
routes: all endpoints
kind: InferenceCluster
name: aws-us-east
pools: h200, h100
kind: InferenceCluster
name: gcp-eu-west
pools: tpu-v6e, a100
kind: InferenceCluster
name: onprem-dc1
pools: h100, l40s
01 / Provisioning
Provision inference clusters on AWS, GCP, and Azure, or bring your own on any Kubernetes. Each gets hardware classes, node pools, an inference gateway, and the full serving stack, installed and continuously reconciled.
Provisioning
classes: h200-8x, h100-8x · node pools · gateway
✓ GPU operator & drivers
✓ Serving engines
✓ Inference gateway
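Continuous reconciliation generally means comparing declared state with observed state and acting on the difference. The Python sketch below shows that idea for a cluster's serving stack; the component keys, version strings, and the `reconcile` function are hypothetical and do not describe Modelplane's actual controllers.

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, str],
              observed: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the actions needed to move `observed` toward `desired`.

    Both arguments map a component name to a version string.
    """
    actions: List[Tuple[str, str]] = []
    for name, version in sorted(desired.items()):
        if name not in observed:
            actions.append(("install", name))
        elif observed[name] != version:
            actions.append(("upgrade", name))
    for name in sorted(observed):
        if name not in desired:
            actions.append(("remove", name))
    return actions


if __name__ == "__main__":
    # Hypothetical component names and versions.
    desired = {
        "gpu-operator": "v2",
        "serving-engines": "v1",
        "inference-gateway": "v1",
    }
    observed = {"gpu-operator": "v1", "inference-gateway": "v1"}
    for action in reconcile(desired, observed):
        print(action)
```

Running the same function repeatedly until it returns an empty list is the essence of a reconcile loop: drift in either direction produces a new action on the next pass.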
02 / Scheduling
Modelplane treats every cluster, cloud, and region as one global pool. A fleet scheduler places each model’s replicas where its requirements match a cluster’s capabilities, then hands off to the cluster’s own scheduler and Kubernetes Dynamic Resource Allocation (DRA).
Two-level scheduling
fleet scheduler
one global pool
tracks requirements
↔ capabilities
places replicas
cluster scheduler
DRA
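The fleet-level half of two-level scheduling can be pictured as a matching problem between requirements and capabilities. The Python sketch below uses a simple greedy policy (send each replica to the cluster with the most free matching GPUs); the policy, the `Cluster` and `Requirement` types, and the GPU counts are assumptions made for illustration, not Modelplane's scheduler. Node-level placement, which the text leaves to each cluster's own scheduler, is out of scope here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Cluster:
    """Hypothetical capability record: free GPUs per accelerator type."""
    name: str
    free_gpus: Dict[str, int]


@dataclass(frozen=True)
class Requirement:
    """Hypothetical model requirement."""
    accelerators: Tuple[str, ...]  # acceptable accelerator types
    gpus_per_replica: int


def place_replicas(req: Requirement, replicas: int,
                   clusters: List[Cluster]) -> Dict[str, int]:
    """Greedily assign replicas to clusters; returns cluster name -> count.

    Mutates `free_gpus` on the clusters it uses. Replicas that fit nowhere
    are left unplaced, so the counts may sum to less than `replicas`.
    """
    placement: Dict[str, int] = {}
    for _ in range(replicas):
        best: Optional[Tuple[int, Cluster, str]] = None
        for cluster in clusters:
            for accel in req.accelerators:
                free = cluster.free_gpus.get(accel, 0)
                fits = free >= req.gpus_per_replica
                if fits and (best is None or free > best[0]):
                    best = (free, cluster, accel)
        if best is None:
            break  # no remaining capacity anywhere in the fleet
        _, cluster, accel = best
        cluster.free_gpus[accel] -= req.gpus_per_replica
        placement[cluster.name] = placement.get(cluster.name, 0) + 1
    return placement


if __name__ == "__main__":
    fleet = [
        Cluster("aws-us-east", {"h200": 16, "h100": 8}),
        Cluster("gcp-eu-west", {"a100": 8}),
        Cluster("onprem-dc1", {"h100": 16, "l40s": 8}),
    ]
    req = Requirement(accelerators=("h100", "h200"), gpus_per_replica=8)
    print(place_replicas(req, 3, fleet))
```

A real fleet scheduler would weigh more than free capacity (cost, region, cached weights, policy), but the requirement-to-capability match stays the core step.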
03 / Autoscaling
Every model exposes the standard Kubernetes scale subresource, so its replicas can scale out across clusters, clouds, and regions, driven by hand or by the Horizontal Pod Autoscaler (HPA) or KEDA.
Scale-to-zero is on the roadmap.
Autoscaling
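For readers unfamiliar with how HPA arrives at a replica count, the Python sketch below reproduces the core rule from the Kubernetes documentation: desired replicas equal the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance of 1. The function name, the example metric, and the minimum and maximum bounds are illustrative assumptions; which metrics Modelplane exposes is not stated in the text above.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """HPA-style calculation: ceil(current * observed / target), clamped."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: do nothing
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical metric: in-flight requests per replica, target 60.
    print(desired_replicas(4, 90, 60))  # 6
    print(desired_replicas(4, 62, 60))  # 4 (within tolerance)
    print(desired_replicas(4, 30, 60))  # 2
```

The default `min_replicas=1` in this sketch reflects that scale-to-zero is not yet available.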
04 / Routing
A model service is one stable, OpenAI-compatible endpoint over many replicas and model endpoints. Weighted routing spreads traffic across replicas for canary and A/B rollouts, and a managed endpoint can take a weighted share too.
One service, many endpoints
ModelService · prod-llama
● one OpenAI-compatible endpoint
ModelEndpoint
replica · aws-us-east
ModelEndpoint
replica · gcp-eu-west
ModelEndpoint
managed · vendor
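Weighted routing of this kind can be sketched as weighted random selection over a service's endpoints. In the Python example below, the endpoint names come from the diagram above, but the weights and the `pick_endpoint` function are hypothetical; how Modelplane's gateway actually selects a backend is not specified here.

```python
import random
from collections import Counter
from typing import Dict, Optional


def pick_endpoint(weights: Dict[str, int],
                  rng: Optional[random.Random] = None) -> str:
    """Choose one endpoint with probability proportional to its weight.

    Endpoints with a weight of zero or less are never selected.
    """
    chooser = rng if rng is not None else random
    live = {name: w for name, w in weights.items() if w > 0}
    if not live:
        raise ValueError("no endpoint has a positive weight")
    names = list(live)
    return chooser.choices(names, weights=[live[n] for n in names], k=1)[0]


if __name__ == "__main__":
    # Hypothetical split: two self-hosted replicas plus a small share
    # sent to a managed vendor endpoint.
    weights = {"aws-us-east": 70, "gcp-eu-west": 25, "vendor-api": 5}
    rng = random.Random(0)  # seeded so repeated runs give the same sample
    sample = Counter(pick_endpoint(weights, rng) for _ in range(10_000))
    for name, count in sample.most_common():
        print(f"{name}: {count / 10_000:.1%}")
```

Shifting a canary from 5% to 50% is then a matter of changing weights, with no change to the endpoint clients call.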
Modelplane is Apache 2.0 and open source end to end. It runs entirely in your own infrastructure and depends on nothing outside it, so no vendor can throttle, restrict, or revoke access.
It’s built by the team behind Crossplane, the open source control plane trusted at Apple, JPMC, Nike, Elastic, Grafana, and MongoDB, and it’s headed for a neutral open source foundation. Built in the open: star the project, join the community, and help shape its direction.