Kubernetes Gateway API Inference Extension

This task describes how to configure Istio to use the Kubernetes Gateway API Inference Extension. The Gateway API Inference Extension aims to improve and standardize routing to self-hosted AI models in Kubernetes. It builds on the Kubernetes Gateway API CRDs and leverages Envoy’s External Processing (ext_proc) filter to extend any Gateway into an inference gateway.

API Resources

The Gateway API Inference Extension introduces two API types that address the unique challenges of routing traffic to inference workloads:

InferencePool represents a collection of backends for an inference workload and contains a reference to an associated endpoint picker service. Envoy’s ext_proc filter sends each incoming request to the endpoint picker, which makes an informed routing decision and selects an optimal backend in the pool.

InferenceObjective lets you specify the serving objectives, such as request priority, of the requests associated with it.

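For illustration only, a minimal InferenceObjective might look like the sketch below. This assumes the extension’s v1alpha2 API group and uses placeholder values; consult the Inference Extension API reference for the exact schema of your version:

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceObjective
    metadata:
      name: example-objective            # hypothetical name
      namespace: inference-model-server
    spec:
      # Assumed semantics: higher priority values indicate more important requests.
      priority: 10
      poolRef:
        name: inference-model-server-pool
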
Setup

  1. As the Gateway API is a prerequisite for the Inference Extension APIs, install both the Gateway API and Gateway API Inference Extension CRDs if they are not already present:

    $ kubectl get crd gateways.gateway.networking.k8s.io &> /dev/null || \
      { kubectl kustomize "github.com/kubernetes-sigs/gateway-api/config/crd?ref=v1.4.0" | kubectl apply -f -; }
    $ kubectl get crd inferencepools.inference.networking.k8s.io &> /dev/null || \
      { kubectl kustomize "github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.0.1" | kubectl apply -f -; }
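
     You can confirm that both sets of CRDs are installed:

    $ kubectl get crds | grep -e gateway.networking -e inference.networking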
  2. Install Istio using the minimal profile:

    $ istioctl install --set profile=minimal --set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true -y
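
     Before continuing, you can confirm that istiod is running and that the inference extension flag was applied to its environment:

    $ kubectl get pods -n istio-system
    $ kubectl get deploy istiod -n istio-system -o jsonpath='{.spec.template.spec.containers[0].env}'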

Configuring an InferencePool

For a detailed guide on setting up a local test environment, see the Gateway API Inference Extension documentation.

In this example, we will deploy an inference model service using a vLLM simulator, and use an InferencePool and the endpoint picker to route requests to individual backends.

  1. Deploy a basic vLLM simulator to act as our inference workload, along with the essential Gateway API resources:

    $ kubectl create namespace istio-ingress
    $ kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Namespace
    metadata:
      name: inference-model-server
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: inference-model-server-deployment
      namespace: inference-model-server
      labels:
        app: inference-model-server
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: inference-model-server
      template:
        metadata:
          labels:
            app: inference-model-server
        spec:
          containers:
          - name: vllm-sim
            image: ghcr.io/llm-d/llm-d-inference-sim:v0.7.1
            imagePullPolicy: Always
            args:
            - --model
            - meta-llama/Llama-3.1-8B-Instruct
            - --port
            - "8000"
            - --max-loras
            - "2"
            - --lora-modules
            - '{"name": "reviews-1"}'
            ports:
            - containerPort: 8000
              name: http
              protocol: TCP
            env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            resources:
              requests:
                cpu: 20m
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: gateway
      namespace: istio-ingress
    spec:
      gatewayClassName: istio
      listeners:
      - name: http
        port: 80
        protocol: HTTP
        allowedRoutes:
          namespaces:
            from: All
          kinds:
          - group: gateway.networking.k8s.io
            kind: HTTPRoute
    ---
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: httproute-for-inferencepool
      namespace: inference-model-server
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: gateway
        namespace: istio-ingress
        sectionName: http
      rules:
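      # Traffic matching the rule below is routed to the InferencePool rather
      # than a regular Service; the endpoint picker selects the backend pod.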
      - backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: inference-model-server-pool
        matches:
        - path:
            type: PathPrefix
            value: /v1/completions
    EOF
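
     Wait for the model server pods and the automatically deployed gateway to become ready:

    $ kubectl wait --for=condition=ready pod -l app=inference-model-server -n inference-model-server --timeout=120s
    $ kubectl get pods -n istio-ingress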
  2. Deploy the endpoint picker service and create an InferencePool:

    $ kubectl apply -f - <<EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: inference-endpoint-picker
      namespace: inference-model-server
      labels:
        app: inference-epp
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: inference-epp
      template:
        metadata:
          labels:
            app: inference-epp
        spec:
          containers:
          - name: epp
            image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:v20251119-2aaf2a6
            imagePullPolicy: Always
            args:
            - --pool-name
            - "inference-model-server-pool"
            - --pool-namespace
            - "inference-model-server"
            - --v
            - "4"
            - --zap-encoder
            - "json"
            - "--config-file"
            - "/config/default-plugins.yaml"
            ports:
            - containerPort: 9002
            - containerPort: 9003
            - name: metrics
              containerPort: 9090
            livenessProbe:
              grpc:
                port: 9003
                service: inference-extension
              initialDelaySeconds: 5
              periodSeconds: 10
            readinessProbe:
              grpc:
                port: 9003
                service: inference-extension
              initialDelaySeconds: 5
              periodSeconds: 10
            volumeMounts:
            - name: plugins-config-volume
              mountPath: "/config"
          volumes:
          - name: plugins-config-volume
            configMap:
              name: plugins-config
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: endpoint-picker-svc
      namespace: inference-model-server
    spec:
      selector:
        app: inference-epp
      ports:
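        # ext_proc is a gRPC protocol, so the gateway must speak HTTP/2 to the
        # endpoint picker; appProtocol: http2 signals this.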
        - protocol: TCP
          port: 9002
          targetPort: 9002
          appProtocol: http2
      type: ClusterIP
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: plugins-config
      namespace: inference-model-server
    data:
      default-plugins.yaml: |
        apiVersion: inference.networking.x-k8s.io/v1alpha1
        kind: EndpointPickerConfig
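        # Each scorer ranks candidate endpoints (request queue depth, KV-cache
        # utilization, and prefix-cache affinity); the scheduling profile
        # combines their scores using the weights below.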
        plugins:
        - type: queue-scorer
        - type: kv-cache-utilization-scorer
        - type: prefix-cache-scorer
        schedulingProfiles:
        - name: default
          plugins:
          - pluginRef: queue-scorer
            weight: 2
          - pluginRef: kv-cache-utilization-scorer
            weight: 2
          - pluginRef: prefix-cache-scorer
            weight: 3
    ---
    # A DestinationRule is required to enable TLS between the gateway and
    # the endpoint picker.
    apiVersion: networking.istio.io/v1
    kind: DestinationRule
    metadata:
      name: endpoint-picker-tls
      namespace: inference-model-server
    spec:
      host: endpoint-picker-svc
      trafficPolicy:
        tls:
          mode: SIMPLE
          insecureSkipVerify: true
    ---
    apiVersion: inference.networking.k8s.io/v1
    kind: InferencePool
    metadata:
      name: inference-model-server-pool
      namespace: inference-model-server
    spec:
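      # Select the vLLM simulator pods and point the gateway at the endpoint
      # picker service that makes per-request routing decisions.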
      selector:
        matchLabels:
          app: inference-model-server
      targetPorts:
        - number: 8000
      endpointPickerRef:
        name: endpoint-picker-svc
        port:
          number: 9002
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: inference-model-reader
      namespace: inference-model-server
    rules:
    - apiGroups: ["inference.networking.k8s.io"]
      resources: ["inferencepools"]
      verbs: ["get", "list", "watch"]
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "list", "watch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: epp-to-inference-model-reader
      namespace: inference-model-server
    subjects:
    - kind: ServiceAccount
      name: default
      namespace: inference-model-server
    roleRef:
      kind: Role
      name: inference-model-reader
      apiGroup: rbac.authorization.k8s.io
    EOF
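
     Verify that the endpoint picker is ready and that the InferencePool was created:

    $ kubectl wait --for=condition=ready pod -l app=inference-epp -n inference-model-server --timeout=120s
    $ kubectl get inferencepools.inference.networking.k8s.io -n inference-model-server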
  3. Set the Ingress Host environment variable:

    $ kubectl wait -n istio-ingress --for=condition=programmed gateways.gateway.networking.k8s.io gateway
    $ export INGRESS_HOST=$(kubectl get gateways.gateway.networking.k8s.io gateway -n istio-ingress -ojsonpath='{.status.addresses[0].value}')
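
     If the gateway is not assigned an address (for example, on a local cluster without load balancer support), you can port-forward instead. Istio names the automatically deployed service after the Gateway resource, so this sketch assumes it is called gateway-istio:

    $ kubectl port-forward -n istio-ingress svc/gateway-istio 8080:80 &
    $ export INGRESS_HOST=localhost:8080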
  4. Send an inference request using curl. You should see a successful response from the backend model server:

    $ curl -s -i "http://$INGRESS_HOST/v1/completions" -d '{"model": "reviews-1", "prompt": "What do reviewers think about The Comedy of Errors?", "max_tokens": 100, "temperature": 0}'
    ...
    HTTP/1.1 200 OK
    ...
    server: istio-envoy
    ...
    {"choices":[{"finish_reason":"stop","index":0,"text":"Testing@, #testing 1$ ,2%,3^, [4"}],"created":1770406965,"id":"cmpl-5e508481-7c11-53e8-9587-972a3704724e","kv_transfer_params":null,"model":"reviews-1","object":"text_completion","usage":{"completion_tokens":16,"prompt_tokens":10,"total_tokens":26}}

Cleanup

  1. Remove deployments and Gateway API resources:

    $ kubectl delete deployment inference-model-server-deployment inference-endpoint-picker -n inference-model-server
    $ kubectl delete httproute httproute-for-inferencepool -n inference-model-server
    $ kubectl delete inferencepool inference-model-server-pool -n inference-model-server
    $ kubectl delete gateways.gateway.networking.k8s.io gateway -n istio-ingress
    $ kubectl delete ns istio-ingress inference-model-server
  2. Uninstall Istio:

    $ istioctl uninstall -y --purge
    $ kubectl delete ns istio-system
  3. Remove the Gateway API and Gateway API Inference Extension CRDs if they are no longer needed:

    $ kubectl kustomize "github.com/kubernetes-sigs/gateway-api/config/crd?ref=v1.4.0" | kubectl delete -f -
    $ kubectl kustomize "github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd?ref=v1.0.1" | kubectl delete -f -