# Model Serving
Serve your trained models for production inference.

## Chat Interface
The simplest way to test and interact with models is the Chat UI at http://localhost:7860/inference. It lets you load any local or Hub model for interactive testing.
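As a sketch, assuming the `autotrain` CLI from the `autotrain-advanced` package is installed, the app (including the Chat UI) is launched like this:

```shell
# Install the CLI (assumption: the autotrain-advanced package
# provides the `autotrain` command)
pip install autotrain-advanced

# Launch the web app; the Chat UI is then reachable at
# http://localhost:7860/inference
autotrain app
```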
### Custom Port
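Assuming the app accepts the same `--port` flag documented for the API server below:

```shell
# Run the app on port 8080 instead of the default 7860
autotrain app --port 8080
```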
### Custom Host
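Assuming a matching `--host` flag, binding to all interfaces makes the app reachable from other machines:

```shell
# Bind to all interfaces instead of the default 127.0.0.1
autotrain app --host 0.0.0.0
```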
## API Server
The API server is a training runner, not an inference server. It exposes minimal endpoints for health checks while running training jobs.

### Start API Server
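A minimal invocation, assuming the CLI exposes an `api` subcommand matching the parameters listed below:

```shell
# Start the API server (training runner) with default settings
autotrain api
```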
The server listens on http://127.0.0.1:7860 by default.
### Parameters
| Parameter | Description | Default |
|---|---|---|
| `--port` | Port to run the API on | `7860` |
| `--host` | Host to bind to | `127.0.0.1` |
| `--task` | Task to run (optional) | `None` |
### Custom Port/Host
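Combining the two flags from the table above:

```shell
# Run the API on a non-default port, exposed on all interfaces
autotrain api --port 8000 --host 0.0.0.0
```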
### Environment Variables
The API server reads configuration from environment variables:

| Variable | Description |
|---|---|
| `HF_TOKEN` | Hugging Face token for authentication |
| `AUTOTRAIN_USERNAME` | Username for training |
| `PROJECT_NAME` | Name of the project |
| `TASK_ID` | Task identifier |
| `PARAMS` | Training parameters (JSON) |
| `DATA_PATH` | Path to training data |
| `MODEL` | Model to use |
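For example, a run can be configured entirely through the environment before starting the server. Every value below is a placeholder:

```shell
export HF_TOKEN=hf_xxx                  # placeholder token
export AUTOTRAIN_USERNAME=your-username
export PROJECT_NAME=my-project
export TASK_ID=1                        # placeholder task identifier
export PARAMS='{"epochs": 3}'           # JSON-encoded training parameters
export DATA_PATH=/data/train.csv
export MODEL=bert-base-uncased

autotrain api
```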
### Endpoints
| Endpoint | Description |
|---|---|
| `GET /` | Returns training status message |
| `GET /health` | Health check (returns "OK") |
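A quick probe of the health endpoint, assuming the default address:

```shell
curl http://127.0.0.1:7860/health
# Response body: OK
```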
The API server automatically shuts down when no training jobs are active. For production inference, use vLLM or TGI instead.
## Production Deployment

### Using vLLM
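vLLM ships an OpenAI-compatible HTTP server. A minimal launch might look like this (the model name is a placeholder; requires `pip install vllm`):

```shell
# Serve a model behind vLLM's OpenAI-compatible API on port 8000
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```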
vLLM is well suited to production-grade serving with high throughput.

### Using Text Generation Inference (TGI)
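TGI is typically run via its official container. A sketch, with the model ID as a placeholder and a GPU-equipped host assumed:

```shell
# Run TGI; the container serves on port 80 internally,
# mapped here to 8080 on the host
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct
```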
### OpenAI-Compatible API
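As a sketch, a chat completion request against a local vLLM server (port 8000 and model name are assumptions):

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```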
Both vLLM and TGI expose OpenAI-compatible endpoints, so existing OpenAI client code can point at them unchanged.

## Docker Deployment
### Dockerfile Example
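A minimal sketch, assuming the app is served with the `autotrain` CLI from `autotrain-advanced`:

```dockerfile
FROM python:3.10-slim

RUN pip install --no-cache-dir autotrain-advanced

EXPOSE 7860

# Bind to 0.0.0.0 so the server is reachable from outside the container
CMD ["autotrain", "app", "--host", "0.0.0.0", "--port", "7860"]
```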
### With GPU
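Building and running the image with GPU access (the image name is a placeholder; the host needs the NVIDIA Container Toolkit):

```shell
docker build -t autotrain-serve .
docker run --gpus all -p 7860:7860 autotrain-serve
```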
## Load Testing

### Using hey
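A sketch using `hey` against an OpenAI-compatible endpoint; the URL, model name, and payload are assumptions:

```shell
# 200 requests, 10 concurrent, POSTing a small chat payload
hey -n 200 -c 10 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hi"}]}' \
  http://localhost:8000/v1/chat/completions
```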
### Using locust
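A headless locust run against the same server; `locustfile.py` must define the request behavior and is not shown here:

```shell
# 50 users, spawning 10 per second, for 2 minutes
locust -f locustfile.py --host http://localhost:8000 \
  --headless -u 50 -r 10 -t 2m
```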
## Monitoring

### Prometheus Metrics
If you are serving with vLLM or TGI, Prometheus metrics are available at `/metrics`.
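Scraping them by hand is a quick sanity check (the port is an assumption):

```shell
# Fetch Prometheus metrics from a local vLLM/TGI server
curl -s http://localhost:8000/metrics
```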
### Logging
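When serving from a container, logs can be followed and persisted with standard Docker tooling; the container name below is a placeholder:

```shell
# Stream server logs and keep a copy on disk
docker logs -f tgi-server 2>&1 | tee server.log
```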
## Next Steps
- Benchmarking: Measure model performance
- Chat Interface: Interactive testing