Every request you send to GPT-4, Claude, or Gemini is measured and billed in tokens — not characters, not words. Understanding how tokenization works is essential for controlling costs, fitting prompts within context windows, and building efficient LLM applications.
What Is a Token?
A token is a chunk of text as seen by the model's tokenizer. Most modern tokenizers use Byte-Pair Encoding (BPE) or a close variant, splitting text into the subword units that appeared most frequently in the training data.
Rules of thumb (exact splits vary by tokenizer):
"Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens
A common long word like "authentication" → 1–2 tokens
A rare word like "supercalifragilistic" → 6+ tokens
In English, 1 token ≈ 4 characters ≈ ¾ of a word
Tokenizers by Model Family
| Model Family | Tokenizer | Vocab Size |
|---|---|---|
| GPT-4 / GPT-3.5 | cl100k_base | 100,277 |
| GPT-4o | o200k_base | ~200k |
| Claude (all) | Anthropic's own BPE | ~100k |
| Gemini | SentencePiece | ~256k |
| LLaMA 2 / Mistral | SentencePiece (BPE) | 32,000 |
| LLaMA 3 | tiktoken-style BPE | ~128k |
Different tokenizers count the same text differently. A 1,000-token GPT-4 prompt may be 1,050 tokens with Claude.
Context Windows in 2025
| Model | Context Window | Notes |
|---|---|---|
| GPT-4o | 128,000 tokens | Standard version |
| GPT-4 Turbo | 128,000 tokens | |
| Claude 3.5 Sonnet | 200,000 tokens | |
| Gemini 1.5 Pro | 1,000,000 tokens | |
| Gemini 1.5 Flash | 1,000,000 tokens | |
| LLaMA 3.1 | 128,000 tokens | |
How API Costs Work
You pay for input tokens (your prompt + system prompt + conversation history) and output tokens (the model's response). Per token, output is typically 3–5× more expensive, as the pricing table below shows.
Total cost = (input_tokens × input_price) + (output_tokens × output_price)
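The formula can be wrapped in a small helper. A sketch, using the per-1M-token prices from the table below (prices drift over time, so check your provider's current price sheet):

```python
# Approximate per-1M-token prices (USD); verify against current provider pricing.
PRICES = {
    "gpt-4o":      {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request: input and output billed separately."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 2,000-token prompt with a 500-token reply on GPT-4o:
print(f"${request_cost('gpt-4o', 2000, 500):.4f}")  # → $0.0175
```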
Approximate pricing (per 1M tokens):
| Model | Input | Output |
|---|---|---|
| GPT-4o | $5 | $15 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3 | $15 |
| Gemini 1.5 Pro | $3.50 | $10.50 |
| Gemini 1.5 Flash | $0.075 | $0.30 |
Counting Tokens Before Sending
Use the OpenAI tiktoken library:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Hello, how are you today?"
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
For real-time token counting without code, use our Token Counter tool — paste any text and instantly see token counts for GPT-4, Claude, and Gemini, plus estimated API costs.
Strategies to Reduce Token Usage
1. Use a Smaller Model for Simple Tasks
GPT-4o mini costs roughly 33× less than GPT-4o on input tokens ($0.15 vs. $5 per 1M) and 25× less on output. For classification, extraction, or summarization-style tasks, smaller models often match quality at a fraction of the cost.
2. Compress System Prompts
The full system prompt is sent with every API call. A 500-token system prompt across 10,000 daily requests adds 5M tokens/day.
Before:
You are a helpful assistant that answers questions in a friendly,
professional, and concise manner. Always be respectful. Never
use profanity. Format your responses clearly.
After (same core instructions, roughly a third of the tokens):
Answer concisely, professionally. No profanity. Format clearly.
3. Manage Conversation History
In chat applications, the entire conversation history is sent with each message. Cap history to the last N messages or summarize older turns:
MAX_HISTORY = 10
messages = [system_message] + conversation_history[-MAX_HISTORY:]
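The cap-plus-summary approach can be sketched as follows. Here `summarize_turns` is a placeholder: in a real application it would call a cheap model to compress the older turns into one short message.

```python
MAX_HISTORY = 10

def summarize_turns(turns):
    # Placeholder: in practice, ask a cheap model to compress these turns.
    return {"role": "system", "content": f"[Summary of {len(turns)} earlier messages]"}

def build_messages(system_message, history):
    """Keep the last MAX_HISTORY turns verbatim; fold older turns into a summary."""
    if len(history) <= MAX_HISTORY:
        return [system_message] + history
    older, recent = history[:-MAX_HISTORY], history[-MAX_HISTORY:]
    return [system_message, summarize_turns(older)] + recent

history = [{"role": "user", "content": f"msg {i}"} for i in range(25)]
msgs = build_messages({"role": "system", "content": "Answer concisely."}, history)
print(len(msgs))  # system + summary + 10 recent = 12
```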
4. Use RAG Instead of Large Context
Rather than stuffing an entire document into context, use Retrieval-Augmented Generation (RAG) to fetch only the relevant chunks (typically 512–1024 tokens each).
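A toy illustration of the retrieval step, using word-overlap scoring as a stand-in for a real embedding model (production RAG systems score chunks by embedding similarity, not shared words):

```python
def score(query: str, chunk: str) -> int:
    # Toy relevance metric: count shared lowercase words.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return only the k most relevant chunks instead of the whole document."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

chunks = [
    "Refunds are processed within 5 business days.",
    "Our office is located in Berlin.",
    "To request a refund, email support with your order number.",
]
print(retrieve("how do I get a refund", chunks, k=1))
```

Only the retrieved chunks (typically 512–1024 tokens each) go into the prompt, so input size stays flat no matter how large the document collection grows.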
5. Request Structured Output
JSON mode or function calling often produces shorter, more predictable outputs than prose explanations.
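With the OpenAI Chat Completions API, for example, JSON mode is a single request parameter. A sketch of the request payload (the actual API call is commented out since it needs a key; note the messages must mention "JSON" for JSON mode to be accepted):

```python
request = {
    "model": "gpt-4o-mini",
    "response_format": {"type": "json_object"},  # forces syntactically valid JSON
    "messages": [
        {"role": "system", "content": 'Extract sentiment as JSON: {"sentiment": ...}'},
        {"role": "user", "content": "I love this product!"},
    ],
}
# client = openai.OpenAI()
# resp = client.chat.completions.create(**request)
print(request["response_format"]["type"])  # → json_object
```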
6. Batch Where Possible
For non-interactive tasks, use batch APIs (OpenAI Batch API, Anthropic Message Batches) at a 50% discount.
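With the OpenAI Batch API, requests are written as one JSON object per line (JSONL) and uploaded as a file before the batch job is created. A sketch of building that input:

```python
import json

prompts = ["Summarize: report A", "Summarize: report B"]

lines = []
for i, prompt in enumerate(prompts):
    lines.append(json.dumps({
        "custom_id": f"task-{i}",          # used to match results back to requests
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
        },
    }))

batch_input = "\n".join(lines)  # write to a .jsonl file, upload, then create the batch
print(len(batch_input.splitlines()))  # → 2
```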
Token Budget Planning
For production LLM applications, calculate your monthly token budget:
Monthly tokens = daily_requests × (avg_input_tokens + avg_output_tokens) × 30
Monthly cost = monthly_tokens / 1,000,000 × price_per_million
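The two formulas above as a quick sketch, with illustrative traffic numbers and GPT-4o mini prices plugged in:

```python
def monthly_cost(daily_requests, avg_input, avg_output,
                 input_price_per_m, output_price_per_m, days=30):
    """Estimated monthly spend in USD, billing input and output separately."""
    in_tokens = daily_requests * avg_input * days
    out_tokens = daily_requests * avg_output * days
    return (in_tokens * input_price_per_m + out_tokens * output_price_per_m) / 1_000_000

# 5,000 requests/day, 800 input + 300 output tokens each, at GPT-4o mini prices:
print(f"${monthly_cost(5000, 800, 300, 0.15, 0.60):.2f}")  # → $45.00
```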
Use our Token Counter to measure your typical prompt sizes and estimate costs before scaling.
