Context Window
The maximum number of tokens a model can process in a single request — including both the input (your prompt) and the output (the model's response). A larger context window lets the model process longer documents, more conversation history, or entire codebases at once. For example, Google Gemini 2.5 Flash supports 1 million tokens (roughly 750,000 words), while many older free models are limited to 4K–32K tokens. For coding with Claude Code or Cursor, 128K is the practical minimum — enough for a medium-sized codebase plus the diff you're working on.
Chain-of-Thought (CoT)
A prompting or training technique where the model breaks complex problems into intermediate reasoning steps before giving a final answer. For math, logic, and multi-step reasoning, CoT dramatically improves accuracy. Some models — like DeepSeek-R1 — are explicitly trained to show their thinking, while others require a system prompt like "think step by step" to activate CoT behavior. CoT consumes more output tokens (the "thinking" tokens), so check the model's max output limit when using reasoning models.
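As a concrete sketch, activating CoT with a system prompt over an OpenAI-compatible API can look like this; the endpoint and model id below are illustrative placeholders, not recommendations:

```python
from openai import OpenAI

# Illustrative endpoint and model id: swap in any OpenAI-compatible provider.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # example model id; check your provider's list
    messages=[
        {"role": "system", "content": "Think step by step. Show your reasoning, then give the final answer on its own line."},
        {"role": "user", "content": "A train leaves at 9:40 and arrives at 12:05. How long is the trip?"},
    ],
)
print(resp.choices[0].message.content)  # intermediate steps, then the answer
```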
OpenAI-Compatible API
An API endpoint that accepts HTTP requests in the same format as OpenAI's /v1/chat/completions endpoint. This means you can use the standard OpenAI Python (openai) or JavaScript SDK to call models from completely different providers — just by changing the base_url parameter. NVIDIA NIM, Groq, Mistral AI, Cerebras, and OpenRouter all offer OpenAI-compatible endpoints. Most AI coding tools (Cursor, Codex, OpenCode) also speak this format, which is why they work with free backends.
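For example, a minimal sketch of calling NVIDIA NIM with the standard openai Python SDK; the model id is an example, so check the provider's catalog for current ids:

```python
from openai import OpenAI

# Same SDK you would use for OpenAI itself: only base_url and api_key change.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NVIDIA NIM endpoint
    api_key="YOUR_NVIDIA_KEY",
)

resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # example model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```

Point base_url at Groq, Mistral, Cerebras, or OpenRouter instead and the same code keeps working.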
Fill-in-the-Middle (FIM)
A training objective where the model learns to generate text that fits between a given prefix and suffix — the core capability behind inline code completion in IDEs. Instead of "continue this text" (autoregressive), FIM says "here's the code before and after the cursor — fill the gap."
Codestral and DeepSeek Coder variants are trained with FIM. Not all coding models support it; if you need inline completions (not chat-based coding), look for models that explicitly mention FIM in their documentation.
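As a hedged sketch, a FIM request against Mistral's Codestral route might look like the following; the path, field names, and response shape are assumptions based on Mistral's documented FIM endpoint, so verify them against the current docs:

```python
import requests

# Sketch of a fill-in-the-middle request; verify the route and fields before relying on them.
resp = requests.post(
    "https://api.mistral.ai/v1/fim/completions",
    headers={"Authorization": "Bearer YOUR_MISTRAL_KEY"},
    json={
        "model": "codestral-latest",            # example model id
        "prompt": "def fibonacci(n):\n    ",    # code before the cursor
        "suffix": "\n\nprint(fibonacci(10))",   # code after the cursor
        "max_tokens": 64,
    },
)
print(resp.json())  # the text for the gap is returned in the response choices
```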
Rate Limit (RPM / RPD / TPM)
The maximum number of API calls or tokens a provider allows within a time window on the free tier. Rate limits are expressed as RPM (requests per minute — e.g., NVIDIA NIM's 40 RPM), RPD (requests per day — e.g., Groq's 14,400 RPD), or TPM (tokens per minute — e.g., Mistral's 500,000 TPM). For solo developers, 30 RPM is usually sufficient (one request every 2 seconds). For apps with multiple users, look for providers with high daily caps or no daily limit.
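A common client-side pattern is to back off and retry when the provider answers with HTTP 429; a minimal sketch using the openai SDK (endpoint and model id are illustrative):

```python
import time

import openai
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

def chat_with_backoff(messages, retries=5):
    """Retry on 429 with exponential backoff so bursts stay under the free-tier RPM."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",  # example model id
                messages=messages,
            )
        except openai.RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
    raise RuntimeError("still rate-limited after retries")
```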
Compare all provider rate limits →

Tokens
The basic unit of text that LLMs process — roughly equivalent to a word fragment. In English, one token is about 4 characters or 0.75 words on average. A 1,000-word article is approximately 1,300 tokens. Most APIs charge by token count (even free tiers count tokens for rate limiting). Context windows, max output, and rate limits are all measured in tokens. Tools like tiktoken (OpenAI) can count tokens in your text before sending it to the API.
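For example, with tiktoken (non-OpenAI models use their own tokenizers, so treat these counts as approximate for them):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by many OpenAI models
text = "Tokens are word fragments, not whole words."
tokens = enc.encode(text)
print(len(tokens))                          # token count, useful for budgeting
print([enc.decode([t]) for t in tokens])    # see how the text splits into fragments
```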
Multimodal
A model that accepts input beyond just text — images, audio, and/or video. Gemini 2.5 Flash is the most capable free multimodal model, accepting text, image, audio, and video in a single prompt. Most free models are text-only. On freellm.net, each model card shows modality tags (TEXT, IMG, AUD, VID) so you can quickly filter for the input type you need.
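As an illustration, sending an image in the OpenAI-compatible message format might look like this; the base_url here is assumed to be Google's OpenAI-compatibility endpoint for Gemini and the model id is an example, so verify both in their docs:

```python
import base64

from openai import OpenAI

# Assumed endpoint: Google's OpenAI-compatibility layer for the Gemini API.
client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="YOUR_GEMINI_KEY",
)

with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gemini-2.5-flash",  # example model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```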
Browse vision-capable models →

Embedding
A numerical vector representation of text, used for semantic search, clustering, and RAG (retrieval-augmented generation). Embedding models convert text into fixed-size arrays of numbers (e.g., 768 or 1024 dimensions). Similar texts produce similar vectors, enabling "find documents like this one" queries. Free embedding models are available from Cohere (free trial), NVIDIA NIM, and Cloudflare Workers AI. They are not chat models — they output vectors, not text.
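A sketch of generating embeddings and comparing them with cosine similarity; the model id below is hypothetical, so substitute a real one from your provider's catalog:

```python
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="YOUR_KEY")

resp = client.embeddings.create(
    model="example/embedding-model",  # hypothetical model id
    input=["how to reset my password", "password recovery steps", "best pizza in town"],
)
vecs = [d.embedding for d in resp.data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

print(cosine(vecs[0], vecs[1]))  # high: the first two texts are related
print(cosine(vecs[0], vecs[2]))  # low: pizza is unrelated to passwords
```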
Mixture of Experts (MoE)
A model architecture where only a subset of the total parameters (the "experts") are activated for each token, rather than all parameters firing every time. This allows a model to have a very large total parameter count for broad knowledge while keeping inference cost manageable. For example, Qwen3.5 397B has 397 billion total parameters but only activates 17 billion per token. MoE models typically appear as "XB AYB" — like "122B A10B" meaning 122B total, 10B active.
RAG (Retrieval-Augmented Generation)
A technique that combines a retrieval system (semantic search over your documents) with an LLM to ground responses in your specific data. Instead of the model relying purely on its training data, RAG first finds relevant documents, then feeds them to the model as context. This reduces hallucination and lets the model answer questions about your private data. RAG requires: (1) an embedding model to vectorize your documents, (2) a vector database to store them, and (3) a chat model to synthesize the answer.
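Putting the three pieces together, a toy end-to-end sketch might look like this; the model ids are hypothetical and a plain Python list stands in for the vector database:

```python
from openai import OpenAI

client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key="YOUR_KEY")
EMBED_MODEL = "example/embedding-model"  # hypothetical model ids
CHAT_MODEL = "example/chat-model"

docs = [
    "Refunds are processed within 5 business days.",
    "Support is available Mon-Fri, 9am-5pm.",
    "Premium plans include phone support.",
]

def embed(texts):
    return [d.embedding for d in client.embeddings.create(model=EMBED_MODEL, input=texts).data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))

doc_vecs = embed(docs)  # (1) embed documents; (2) a real system stores these in a vector DB
question = "How long do refunds take?"
qv = embed([question])[0]
best = max(range(len(docs)), key=lambda i: cosine(doc_vecs[i], qv))  # retrieve the closest doc

resp = client.chat.completions.create(  # (3) ground the chat model in the retrieved context
    model=CHAT_MODEL,
    messages=[
        {"role": "system", "content": f"Answer using only this context: {docs[best]}"},
        {"role": "user", "content": question},
    ],
)
print(resp.choices[0].message.content)
```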
Free embedding models for RAG →

Base URL
The API endpoint address where your HTTP requests are sent. Every LLM provider has a base URL — for example, NVIDIA NIM uses https://integrate.api.nvidia.com/v1 and Groq uses https://api.groq.com/openai/v1. When configuring a coding tool (Claude Code, Cursor, Codex), you set the tool's base URL to point at a free provider instead of the default paid one (e.g., api.openai.com). The base URL is the key to using free backends. Find the base URL for every model on its detail page or use our config generator.
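In practice, switching providers reduces to a lookup table; a small illustrative sketch (the two URLs are the ones quoted above):

```python
from openai import OpenAI

# Base URLs for two providers mentioned above; add others from their docs.
PROVIDERS = {
    "nvidia": "https://integrate.api.nvidia.com/v1",
    "groq": "https://api.groq.com/openai/v1",
}

def client_for(name: str, api_key: str) -> OpenAI:
    # The calling code never changes: only the base URL does.
    return OpenAI(base_url=PROVIDERS[name], api_key=api_key)
```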
API Key
A secret token that authenticates your requests to an LLM provider. You get an API key by signing up on the provider's website — most issue one instantly after email registration. Each model page on freellm.net links directly to the provider's API key signup page. Never share your API key publicly (in git repos, screenshots, or client-side code); if leaked, rotate it immediately via the provider's dashboard. Most free providers allow key rotation without losing access.
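The simplest way to keep the key out of source control is to read it from an environment variable, for example:

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],  # set in your shell or a .env file, never committed
)
```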
Parameters
The numerical weights that define a neural network, often used as a rough measure of model capability. More parameters generally mean more knowledge and reasoning ability, but also higher inference cost and latency. Free models range from 1B (lightweight, fast) to 480B (frontier capability). The number is expressed in billions: "Llama 3.3 70B" has 70 billion parameters. MoE models report both total and active parameter counts. Parameter count alone doesn't determine quality — architecture, training data, and post-training matter equally.
Max Output Tokens
The maximum number of tokens the model can generate in a single response. For coding tasks, 8K output is the practical minimum (a full source file). 16K+ lets the model generate entire modules at once. For reasoning tasks with CoT, allow extra output tokens for the model's "thinking" before the final answer. Note that the context window is shared between input and output, so a 128K context window with 8K max output means up to 120K tokens available for your prompt.
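That budget arithmetic as a quick sketch; capping output per request uses the standard max_tokens parameter of OpenAI-compatible APIs:

```python
CONTEXT_WINDOW = 128_000  # total tokens shared by input and output
MAX_OUTPUT = 8_000        # reserved for the model's response

prompt_budget = CONTEXT_WINDOW - MAX_OUTPUT
print(prompt_budget)      # 120,000 tokens left for your prompt

# Per request, the output cap is set with max_tokens:
# client.chat.completions.create(model=..., messages=..., max_tokens=MAX_OUTPUT)
```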
Tool Calling / Function Calling
The ability for an LLM to output structured instructions (usually JSON) that an external system can execute — like calling an API, running a database query, or reading a file. This is essential for agentic workflows: "search my codebase for all uses of function X" requires the model to call a search tool, not just generate text. Most OpenAI-compatible endpoints support tool calling if the underlying model does. For coding agents (Claude Code, Cursor Agent), tool calling is a must-have feature.
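A minimal sketch of declaring a tool in the OpenAI-compatible format; the search_codebase tool is hypothetical, and your own code is responsible for executing it and sending the result back to the model:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "search_codebase",  # hypothetical tool implemented by your application
        "description": "Find usages of a symbol in the repository.",
        "parameters": {
            "type": "object",
            "properties": {"symbol": {"type": "string"}},
            "required": ["symbol"],
        },
    },
}]

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # example id; the model must support tool calling
    messages=[{"role": "user", "content": "Where is parse_config used?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:  # the model chose to call the tool rather than answer in prose
    call = msg.tool_calls[0]
    print(call.function.name, call.function.arguments)  # arguments is a JSON string
```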
Inference
The process of running a trained model to generate output from your input — essentially "using" the model rather than training it. When you send a prompt to an API and get a response, that's inference. Different providers optimize inference differently: Groq uses custom LPU chips for speed, NVIDIA NIM uses GPU clusters, and Cloudflare distributes inference across its edge network. Inference speed is measured in tokens per second (tok/s).
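You can estimate tok/s yourself from the usage field that OpenAI-compatible APIs return; a rough sketch (endpoint and model id are illustrative, and the number includes network latency):

```python
import time

from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

start = time.time()
resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # example model id
    messages=[{"role": "user", "content": "Explain inference in two sentences."}],
)
elapsed = time.time() - start

# Rough throughput: output tokens divided by wall-clock time.
print(f"{resp.usage.completion_tokens / elapsed:.1f} tok/s")
```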
System Prompt
A special message at the start of a conversation that sets the model's behavior, tone, and constraints — without being part of the user conversation itself. For example: "You are a senior Python developer. Answer with code examples. Never guess — say you don't know if unsure." System prompts are supported by all major LLM APIs and are critical for building chatbots, coding agents, and any application where consistent behavior matters.
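In the OpenAI-compatible message format, the system prompt is simply the first message, with role "system":

```python
messages = [
    # The system message shapes behavior but is not part of the user conversation.
    {"role": "system", "content": "You are a senior Python developer. Answer with code examples. Never guess: say you don't know if unsure."},
    {"role": "user", "content": "How do I read a file line by line?"},
]
# Pass `messages` to any chat completions call, as in the earlier examples.
```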
Hallucination
When an LLM generates content that sounds plausible but is factually incorrect — like citing a non-existent API function or inventing a statistic. Hallucination is an inherent limitation of LLMs; they predict tokens based on patterns, not verified facts. Mitigation strategies: (1) use RAG to ground outputs in real documents, (2) set a low temperature for factual tasks, (3) instruct the model via system prompt to express uncertainty, (4) use tool calling so the model can verify claims by running actual code or search queries.
Streaming (SSE)
A response mode where the model sends tokens one at a time as they're generated (via Server-Sent Events), instead of waiting for the full response to complete. Streaming creates a more responsive UX — users see text appearing in real-time, like a person typing. All OpenAI-compatible APIs support streaming via stream: true in the request body. Most free providers support streaming; it doesn't cost extra and usually works by default in tools like Cursor and Claude Code.
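With the openai SDK, streaming is one flag plus a loop; a minimal sketch (endpoint and model id are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # example model id
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True,  # sends stream: true, so tokens arrive as SSE chunks
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # text appears as it is generated
```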
Vector Database
A specialized database that stores and searches embedding vectors — essential for RAG systems. Instead of keyword matching, a vector database finds documents whose embeddings are closest to your query embedding (cosine similarity). Popular options include Pinecone, Weaviate, Qdrant, and pgvector (PostgreSQL extension). Many have free tiers. After embedding your documents with a free embedding model, store the vectors in a vector DB and query them at runtime to provide relevant context to your chat model.
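The core query is easy to see in miniature; a toy in-memory stand-in for a vector DB (real systems add indexing, persistence, and filtering):

```python
import numpy as np

class ToyVectorStore:
    """In-memory sketch of a vector DB: stores vectors, returns nearest texts by cosine."""

    def __init__(self):
        self.vectors, self.texts = [], []

    def add(self, vector, text):
        self.vectors.append(np.asarray(vector, dtype=float))
        self.texts.append(text)

    def query(self, vector, k=3):
        m = np.vstack(self.vectors)
        q = np.asarray(vector, dtype=float)
        sims = (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
        return [self.texts[i] for i in np.argsort(sims)[::-1][:k]]  # top-k most similar
```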
Temperature
A parameter (0.0 to 2.0) that controls output randomness. Low temperature (0.0–0.3) makes the model deterministic — good for factual Q&A and code generation where you want consistent answers. High temperature (0.7–1.5) increases randomness — good for creative writing and brainstorming. Setting temperature to 0 does not guarantee identical outputs across calls (due to GPU floating-point variance), but it makes outputs highly reproducible. Most free APIs default to 1.0.
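Temperature is a per-request parameter; a brief sketch (endpoint and model id are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # example model id
    messages=[{"role": "user", "content": "Suggest a name for a caching library."}],
    temperature=0.2,  # low: consistent answers for code and factual tasks
)
# The same call with temperature=1.2 gives noticeably different names run to run.
print(resp.choices[0].message.content)
```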
Fine-tuning
Additional training on top of a pre-trained base model using your own dataset — teaching the model your specific style, domain knowledge, or output format. Fine-tuning requires: (1) a labeled dataset (typically 50–1,000+ examples), (2) a provider that supports fine-tuning (most free tiers do not), and (3) GPU compute for the training run. Most free users don't fine-tune; instead, they use system prompts, few-shot examples (2–5 examples in the prompt), and RAG to achieve similar results without training.
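For instance, a few-shot prompt that teaches an output format with no training at all; pass these messages to any chat completions call:

```python
messages = [
    {"role": "system", "content": "Classify the sentiment of each review as positive or negative."},
    # 2-5 worked examples demonstrate the desired format.
    {"role": "user", "content": "Great battery life!"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Screen cracked after a week."},
    {"role": "assistant", "content": "negative"},
    # The real query: the model imitates the demonstrated pattern.
    {"role": "user", "content": "Fast shipping and works perfectly."},
]
```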