Token Factory bills inference in two complementary modes. This page covers the pricing models, how tokens are counted on the per-token mode, and how workspace quotas behave.
We don't pin per-token prices in these docs because they move faster than the docs do. Open the Model Library and click any model entry to see its current per-token rates and tier eligibility.
#Pricing models
- Per-token (Token-based pricing): pay-as-you-go per 1,000 input + output tokens, priced by model. Best fit for variable traffic and quick prototyping. Available now via /v1/chat/completions and /v1/embeddings.
- Per-GPU-hour (GPU hourly pricing): reserved capacity billed by GPU-hour, for Dedicated Endpoints. Coming soon; contact sales if you're interested in dedicated capacity.
The rest of this page covers the per-token mode. Per-GPU-hour pricing details will land alongside Dedicated Endpoints.
#What's a token
A token is roughly a piece of a word — about 4 characters or 0.75 English words on average. As a rough rule of thumb, "Hello, world!" lands around 4 tokens on a typical OSS chat tokenizer (Qwen-family and OpenAI-shape vocabs sit in that range). Different languages and code tokenize differently; for non-English text or heavily formatted code, expect more tokens per character. The exact tokenization depends on the model. The response always tells you the count used via usage.prompt_tokens and usage.completion_tokens.
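As a minimal sketch of reading those counts back, the snippet below parses a response body and pulls out the two usage fields. The sample body is hypothetical and only assumes the OpenAI-compatible shape implied by the field names above; the model name and numbers are placeholders.

```python
import json

# Hypothetical response body; only the "usage" field names come from this page.
raw = """{
  "model": "example-model",
  "usage": {"prompt_tokens": 12, "completion_tokens": 30}
}"""


def token_counts(response: dict) -> tuple[int, int]:
    """Return (prompt_tokens, completion_tokens) from a parsed response body."""
    usage = response.get("usage", {})
    return usage.get("prompt_tokens", 0), usage.get("completion_tokens", 0)


prompt_tokens, completion_tokens = token_counts(json.loads(raw))
print(prompt_tokens, completion_tokens)  # 12 30
```

Logging both numbers per request gives you the inputs needed for any later cost calculation.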
#How to estimate cost before calling
For a pre-call estimate, run your prompt through a tokenizer locally. The tiktoken library covers OpenAI-shape tokenizers — most OSS chat models in the same vocab family produce counts within a few percent of the actual billed total.
This is an estimate — actual billing always uses the model's own tokenizer. If the served model ships its own tokenizer (e.g. a Qwen variant), expect small drift from the tiktoken count.
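A pre-call estimate along these lines might look like the sketch below. It assumes the third-party tiktoken package is installed and uses the cl100k_base encoding as a stand-in vocabulary, which is an illustrative choice rather than the tokenizer of any particular served model. If tiktoken or its vocabulary file is unavailable, the function falls back to the rough 4-characters-per-token rule.

```python
import math


def estimate_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Estimate a token count locally; billing still uses the model's own tokenizer."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # tiktoken is missing or could not load its vocabulary:
        # fall back to the ~4 characters per token rule of thumb.
        return math.ceil(len(text) / 4)


print(estimate_tokens("Hello, world!"))
```

For a chat request, run the estimate over everything you send (system prompt, history, tool definitions), not only the latest user message.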
#How spend is calculated (per-token)
Per-request cost follows one formula:
cost = (input_tokens × input_rate + output_tokens × output_rate) / 1,000
Rates are quoted per 1,000 tokens and listed by model on the dashboard catalog page. Input is everything you send: system prompt, conversation history, the new user message, tool definitions, schemas. Output is what the model generates. Cached prompts (when supported) bill input at a reduced rate; check the model's pricing entry.
#Quotas
Quotas are per workspace and monthly. Three resource buckets are tracked today, with a fourth on the roadmap:
| Resource | Counts | Reset |
|---|---|---|
| Tokens | Sum of prompt + completion tokens across all per-token API calls. | Monthly, on the workspace reset date. |
| Images | Image-generation requests via the Playground proxy. | Monthly, same date. |
| Audio | Audio-speech requests via the Playground proxy. | Monthly, same date. |
| Video (coming soon) | Reserved; not enforced yet. | N/A |
Each bucket exposes the same shape on the dashboard: limit, used, remaining, used_percent, the reset date, a warning threshold (≥ 80%), and an exceeded flag (≥ 100%).
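To make the relationship between those fields concrete, here is a small sketch that derives them from limit and used. The class name is hypothetical, and two behaviors are assumptions rather than documented facts: remaining is clamped at zero, and a zero limit reports 0% used.

```python
from dataclasses import dataclass

WARNING_THRESHOLD = 80.0    # percent, from the dashboard description
EXCEEDED_THRESHOLD = 100.0  # percent


@dataclass
class BucketStatus:
    limit: int
    used: int

    @property
    def remaining(self) -> int:
        # Assumption: never report a negative remainder.
        return max(self.limit - self.used, 0)

    @property
    def used_percent(self) -> float:
        # Assumption: treat a zero limit as 0% used to avoid dividing by zero.
        return 100.0 * self.used / self.limit if self.limit else 0.0

    @property
    def warning(self) -> bool:
        return self.used_percent >= WARNING_THRESHOLD

    @property
    def exceeded(self) -> bool:
        return self.used_percent >= EXCEEDED_THRESHOLD


tokens = BucketStatus(limit=1_000_000, used=800_000)
print(tokens.used_percent, tokens.warning, tokens.exceeded)  # 80.0 True False
```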
#Enforcement vs. tracking-only
A workspace-level enforcement flag controls behavior at the limit. With enforcement enabled (the default), requests over a bucket's limit are rejected. With enforcement disabled, the dashboard switches to a Tracking only banner and shows usage counts without blocking traffic — useful for audit and pre-production workspaces.
#What happens at the limit
When enforcement is enabled and a bucket hits its monthly limit, further calls in that bucket are rejected by the gateway. Watch the workspace Billing page for the used_percent and warning signals well before the hard cap.
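The decision described above reduces to a two-condition check, sketched below. The function name is hypothetical and this is not the gateway's actual code; it also assumes a request is rejected once usage reaches the limit (matching the ≥ 100% exceeded flag). The error status and body returned by the gateway are not specified on this page, so they are left out.

```python
def should_reject(used: int, limit: int, enforcement_enabled: bool = True) -> bool:
    """Return True when a new call in this bucket would be rejected."""
    exceeded = used >= limit  # assumption: mirrors the >= 100% exceeded flag
    return enforcement_enabled and exceeded


print(should_reject(used=1_000_000, limit=1_000_000))                             # True
print(should_reject(used=1_000_000, limit=1_000_000, enforcement_enabled=False))  # False
print(should_reject(used=850_000, limit=1_000_000))                               # False
```

With enforcement disabled the same usage is still counted, which is why the second call returns False instead of skipping the bucket entirely.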
Per-key controls (scopes, RPM/TPM, expiration) are not implemented today — see authentication → workspace-level controls.
#Reducing spend
- Use the smallest model that does the job. Profile against your real workload before climbing the catalog; premium-by-default burns spend you didn't need.
- Cap max_tokens on every request. Runaway loops are real.
- Batch embeddings: one call with 100 inputs beats 100 calls with 1.
- Cache embeddings at ingest time: re-embedding the same document at query time is waste.
- Trim system prompts: they're paid for every turn.
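Three of the tips above (capping max_tokens, batching embeddings, caching by content) can be sketched as plain request-building helpers. All function names and model names here are hypothetical, the payload fields assume the OpenAI-compatible request shape suggested by the endpoint paths, and the batch size of 100 is taken from the bullet as an illustration, not as a documented service limit. No network calls are made.

```python
import hashlib
from typing import Iterator


def batched(items: list[str], size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embedding_payloads(texts: list[str], model: str, size: int = 100) -> list[dict]:
    """One /v1/embeddings request body per batch instead of one per text."""
    return [{"model": model, "input": batch} for batch in batched(texts, size)]


def chat_payload(model: str, messages: list[dict], max_tokens: int = 512) -> dict:
    """A /v1/chat/completions request body that always carries a max_tokens cap."""
    return {"model": model, "messages": messages, "max_tokens": max_tokens}


def cache_key(text: str) -> str:
    """Stable key for an embedding cache, so identical text is embedded once."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


docs = [f"document {i}" for i in range(250)]
payloads = embedding_payloads(docs, model="example-embedding-model")
print(len(payloads))  # 3 requests instead of 250
```

Storing vectors under cache_key(text) at ingest time lets query-time code look up an existing embedding before deciding to pay for a new one.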
For ongoing visibility, watch your spend in real time via the Observability metrics dashboard.