Tokens, pricing & quotas · Token Factory Docs

Token Factory bills inference in two complementary modes. This page covers the pricing models, how tokens are counted on the per-token mode, and how workspace quotas behave.

Current rates live on the dashboard

We don't pin per-token rates in these docs because they change faster than the docs do. Open the Model Library and click any model entry to see its current per-token rates and tier eligibility.

#Pricing models

Per-token (Token-based pricing) — pay-as-you-go per 1,000 input + output tokens, priced by model. Best fit for variable traffic and quick prototyping. Available now via /v1/chat/completions and /v1/embeddings.
Per-GPU-hour (GPU hourly pricing) — reserved capacity billed by GPU-hour, for Dedicated Endpoints. Coming soon — contact sales if you're interested in dedicated capacity.

The rest of this page covers the per-token mode. Per-GPU-hour pricing details will land alongside Dedicated Endpoints.

#What's a token

A token is roughly a piece of a word — about 4 characters or 0.75 English words on average. As a rough rule of thumb, "Hello, world!" lands around 4 tokens on a typical OSS chat tokenizer (Qwen-family and OpenAI-shape vocabs sit in that range). Different languages and code tokenize differently; for non-English text or heavily formatted code, expect more tokens per character. The exact tokenization depends on the model. The response always tells you the count used via usage.prompt_tokens and usage.completion_tokens.

#How to estimate cost before calling

For a pre-call estimate, run your prompt through a tokenizer locally. The tiktoken library covers OpenAI-shape tokenizers; for most OSS chat models in the same vocab family, its counts land within a few percent of the actual billed total.

This is an estimate — actual billing always uses the model's own tokenizer. If the served model ships its own tokenizer (e.g. a Qwen variant), expect small drift from the tiktoken count.
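A minimal sketch of a pre-call estimate, under stated assumptions: it tries tiktoken with the `cl100k_base` encoding (an assumed stand-in, not necessarily the served model's tokenizer) and falls back to the rough 4-characters-per-token rule from this page when tiktoken is not usable. The function names are hypothetical.

```python
import math


def estimate_tokens_heuristic(text: str) -> int:
    """Rough estimate from the ~4 characters per token rule of thumb.

    Rounds up so the estimate errs on the high side.
    """
    return math.ceil(len(text) / 4)


def estimate_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken when possible, else use the heuristic.

    The encoding name is an assumption; billing always uses the served
    model's own tokenizer, so treat the result as approximate.
    """
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # Covers a missing package, a failed vocab download, or text that
        # the encoder refuses; the heuristic is good enough for budgeting.
        return estimate_tokens_heuristic(text)
```

Compare the estimate with usage.prompt_tokens from a real response once, then apply that ratio as a safety margin for the model you actually call.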

#How spend is calculated (per-token)

Per-request cost follows one formula:

cost = (input tokens / 1,000) × input rate + (output tokens / 1,000) × output rate

Rates are quoted per 1,000 tokens and listed by model on the dashboard catalog page. Input is everything you send: system prompt, conversation history, the new user message, tool definitions, schemas. Output is what the model generates. Cached prompts (when supported) bill input at a reduced rate — check the model's pricing entry.
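As a worked illustration of the per-request calculation, here is a small sketch. The function name, its parameters, and the rates in the usage note are hypothetical placeholders, not real Token Factory prices; the token counts would come from usage.prompt_tokens and usage.completion_tokens. The rate unit is a parameter so you can match whatever unit the dashboard quotes.

```python
from decimal import Decimal


def request_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_rate: str,
    output_rate: str,
    tokens_per_rate_unit: int = 1000,
) -> Decimal:
    """Cost of one request given per-unit input and output rates.

    Rates are passed as strings so Decimal parses them exactly.
    Cached-prompt discounts are not modeled here.
    """
    unit = Decimal(tokens_per_rate_unit)
    input_cost = Decimal(prompt_tokens) / unit * Decimal(input_rate)
    output_cost = Decimal(completion_tokens) / unit * Decimal(output_rate)
    return input_cost + output_cost
```

For example, with placeholder rates of 0.50 (input) and 1.50 (output) per 1,000 tokens, `request_cost(1200, 300, "0.50", "1.50")` works out to 1.2 × 0.50 + 0.3 × 1.50 = 1.05.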

#Quotas

Quotas are per workspace and monthly. Three resource buckets are tracked today, with a fourth on the roadmap:

Resource	What counts	Reset
Tokens	Sum of prompt + completion tokens across all per-token API calls.	Monthly, on the workspace reset date.
Images	Image-generation requests via the Playground proxy.	Monthly, same date.
Audio	Audio-speech requests via the Playground proxy.	Monthly, same date.
Video (coming soon)	Reserved; not enforced yet.	—

Each bucket exposes the same shape on the dashboard: limit, used, remaining, used_percent, the reset date, a warning threshold (≥ 80%), and an exceeded flag (≥ 100%).
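To make the bucket shape concrete, the sketch below derives the fields from limit and used. The class is hypothetical and only mirrors the field names listed above; clamping remaining at zero and requiring a positive limit are assumptions, not documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketStatus:
    """Hypothetical mirror of one quota bucket as shown on the dashboard."""

    limit: int
    used: int

    def __post_init__(self) -> None:
        if self.limit <= 0:
            raise ValueError("limit must be positive")

    @property
    def remaining(self) -> int:
        # Assumption: remaining never goes negative once the limit is passed.
        return max(self.limit - self.used, 0)

    @property
    def used_percent(self) -> float:
        return 100.0 * self.used / self.limit

    @property
    def warning(self) -> bool:
        # Warning threshold from this page: at or above 80% used.
        return self.used_percent >= 80

    @property
    def exceeded(self) -> bool:
        # Exceeded flag from this page: at or above 100% used.
        return self.used_percent >= 100
```

A bucket with a limit of 1,000 and 800 used reports 200 remaining, the warning flag set, and the exceeded flag clear.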

#Enforcement vs. tracking-only

A workspace-level enforcement flag controls behavior at the limit. With enforcement enabled (the default), requests over a bucket's limit are rejected. With enforcement disabled, the dashboard switches to a Tracking only banner and shows usage counts without blocking traffic — useful for audit and pre-production workspaces.

#What happens at the limit

When enforcement is enabled and a bucket hits its monthly limit, further calls in that bucket are rejected by the gateway. Watch the workspace Billing page for the used_percent and warning signals well before the hard cap.
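The interaction between the enforcement flag and the limit can be sketched as a single decision function. This is an illustration of the behavior described above, not the gateway's actual code; the function name is hypothetical, and treating "at the limit" as used >= limit is an assumption based on the exceeded flag (≥ 100%).

```python
def is_rejected(used: int, limit: int, enforcement_enabled: bool = True) -> bool:
    """Return True when a further call in this bucket would be rejected.

    With enforcement disabled (tracking only), usage is still counted
    but traffic is never blocked.
    """
    exceeded = used >= limit
    return enforcement_enabled and exceeded
```

Under this reading, a workspace at 100% of its token bucket is blocked by default and let through when enforcement is disabled.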

Per-key controls (scopes, RPM/TPM, expiration) are not implemented today — see authentication → workspace-level controls.

#Reducing spend

Use the smallest model that does the job. Profile against your real workload before moving up the catalog; defaulting to a premium model burns budget you didn't need to spend.
Cap max_tokens on every request. Runaway loops are real.
Batch embeddings — one call with 100 inputs beats 100 calls with 1.
Cache embeddings at ingest time — re-embedding the same document at query time is waste.
Trim system prompts — they're paid for every turn.
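The batching and caching tips above can be combined in one small helper. This is a sketch under assumptions: `embed_batch` is a hypothetical callable you supply that sends one list of inputs to /v1/embeddings and returns one vector per input, the batch size of 100 just echoes the example above, and the in-memory dict stands in for whatever store you use at ingest time.

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


class EmbeddingCache:
    """Embed each distinct text once, in batches, and reuse the result."""

    def __init__(
        self,
        embed_batch: Callable[[List[str]], List[Vector]],
        batch_size: int = 100,
    ) -> None:
        self._embed_batch = embed_batch  # hypothetical: one API call per list
        self._batch_size = batch_size
        self._store: Dict[str, Vector] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Content hash, so identical documents share one cache entry.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_many(self, texts: Sequence[str]) -> List[Vector]:
        # Deduplicate while keeping order, and skip anything already cached.
        missing = list(
            dict.fromkeys(t for t in texts if self._key(t) not in self._store)
        )
        for chunk in batched(missing, self._batch_size):
            for text, vector in zip(chunk, self._embed_batch(chunk)):
                self._store[self._key(text)] = vector
        return [self._store[self._key(t)] for t in texts]
```

With this shape, 250 new documents cost three embedding calls instead of 250, and asking for the same documents again costs none.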

For ongoing visibility, watch your spend in real time via the Observability metrics dashboard.

#What next

Observability metrics

Track spend, latency, and request counts in real time.

Errors & status codes

Every error code with cause and fix.