# Model Serving
Serve your trained models for production inference.

## Chat Interface
The simplest way to test and interact with models is the Chat UI at http://localhost:7860/inference. It lets you load any local or Hub model for interactive testing.
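As a sketch, assuming the `autotrain` CLI from the `autotrain-advanced` package is installed, the app (including the Chat UI) is launched like this:

```shell
# Install the CLI (assumption: the autotrain-advanced package
# provides the `autotrain` command)
pip install autotrain-advanced

# Launch the web app; the Chat UI is then reachable at
# http://localhost:7860/inference
autotrain app
```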
### Custom Port
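Assuming the app accepts the same `--port` flag documented for the API server below:

```shell
# Run the app on port 8080 instead of the default 7860
autotrain app --port 8080
```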
### Custom Host
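Assuming a matching `--host` flag, binding to all interfaces makes the app reachable from other machines:

```shell
# Bind to all interfaces instead of the default 127.0.0.1
autotrain app --host 0.0.0.0
```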
## API Server
The API server is a training runner, not an inference server. It exposes minimal endpoints for health checks while running training jobs.

### Start API Server
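A minimal invocation, assuming the CLI exposes an `api` subcommand matching the parameters listed below:

```shell
# Start the API server (training runner) with default settings
autotrain api
```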
The server listens on http://127.0.0.1:7860 by default.
### Parameters
| Parameter | Description | Default |
|---|---|---|
| `--port` | Port to run the API on | `7860` |
| `--host` | Host to bind to | `127.0.0.1` |
| `--task` | Task to run (optional) | `None` |
### Custom Port/Host
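Combining the two flags from the table above:

```shell
# Run the API on a non-default port, exposed on all interfaces
autotrain api --port 8000 --host 0.0.0.0
```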
### Environment Variables
The API server reads configuration from environment variables:

| Variable | Description |
|---|---|
| `HF_TOKEN` | Hugging Face token for authentication |
| `AUTOTRAIN_USERNAME` | Username for training |
| `PROJECT_NAME` | Name of the project |
| `TASK_ID` | Task identifier |
| `PARAMS` | Training parameters (JSON) |
| `DATA_PATH` | Path to training data |
| `MODEL` | Model to use |
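For example, a run can be configured entirely through the environment before starting the server. Every value below is a placeholder:

```shell
export HF_TOKEN=hf_xxx                  # placeholder token
export AUTOTRAIN_USERNAME=your-username
export PROJECT_NAME=my-project
export TASK_ID=1                        # placeholder task identifier
export PARAMS='{"epochs": 3}'           # JSON-encoded training parameters
export DATA_PATH=/data/train.csv
export MODEL=bert-base-uncased

autotrain api
```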
### Endpoints
| Endpoint | Description |
|---|---|
| `GET /` | Returns training status message |
| `GET /health` | Health check (returns "OK") |
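A quick probe of the health endpoint, assuming the default address:

```shell
curl http://127.0.0.1:7860/health
# Response body: OK
```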
The API server automatically shuts down when no training jobs are active. For production inference, use vLLM or TGI instead.
## Production Deployment

### Using vLLM
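vLLM ships an OpenAI-compatible HTTP server. A minimal launch might look like this (the model name is a placeholder; requires `pip install vllm`):

```shell
# Serve a model behind vLLM's OpenAI-compatible API on port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```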
vLLM is well suited to production-grade serving with high throughput.

### Using Text Generation Inference (TGI)
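TGI is typically run via its official container. A sketch, with the model ID as a placeholder and a GPU-equipped host assumed:

```shell
# Run TGI; the container serves on port 80 internally,
# mapped here to 8080 on the host
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```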
### OpenAI-Compatible API
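As a sketch, a chat completion request against a local vLLM server (port 8000 and model name are assumptions):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```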
Both vLLM and TGI expose OpenAI-compatible endpoints, so existing OpenAI client code can point at them unchanged.

## Docker Deployment
### Dockerfile Example
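A minimal sketch, assuming the app is served with the `autotrain` CLI from `autotrain-advanced`:

```dockerfile
FROM python:3.10-slim

RUN pip install --no-cache-dir autotrain-advanced

EXPOSE 7860

# Bind to 0.0.0.0 so the server is reachable from outside the container
CMD ["autotrain", "app", "--host", "0.0.0.0", "--port", "7860"]
```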
### With GPU
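Building and running the image with GPU access (the image name is a placeholder; the host needs the NVIDIA Container Toolkit):

```shell
docker build -t autotrain-serve .
docker run --gpus all -p 7860:7860 autotrain-serve
```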
## Load Testing

### Using hey
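A sketch using `hey` against an OpenAI-compatible endpoint; the URL, model name, and payload are assumptions:

```shell
# 200 requests, 10 concurrent, POSTing a small chat payload
hey -n 200 -c 10 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hi"}]}' \
  http://localhost:8000/v1/chat/completions
```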
### Using locust
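A headless locust run against the same server; `locustfile.py` must define the request behavior and is not shown here:

```shell
# 50 users, spawning 10 per second, for 2 minutes
locust -f locustfile.py --host http://localhost:8000 \
  --headless -u 50 -r 10 -t 2m
```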
## Monitoring

### Prometheus Metrics
If you are serving with vLLM or TGI, Prometheus metrics are available at `/metrics`.
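Scraping them by hand is a quick sanity check (the port is an assumption):

```shell
# Fetch Prometheus metrics from a local vLLM/TGI server
curl -s http://localhost:8000/metrics
```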
### Logging
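When serving from a container, logs can be followed and persisted with standard Docker tooling; the container name below is a placeholder:

```shell
# Stream server logs and keep a copy on disk
docker logs -f tgi-server 2>&1 | tee server.log
```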
## Next Steps
- Benchmarking: Measure model performance
- Chat Interface: Interactive testing