AI Primer Part 2: Providers, Models, Runtimes, and Harnesses

In the first AI primer, I introduced models, prompts, context, tools, agents, and control loops. Since then, I’ve noticed that a few more terms keep getting confused.
Someone says, “I use Claude,” but do they mean the website claude.ai, Claude Code, the API, or one of Anthropic’s models?
Someone else says, “We built a harness,” when they mean they wrote an AGENTS.md file.
If you’ve heard these terms thrown around but aren’t sure what exactly they mean, or if you have used these words but are wondering if everyone understands them in the same way, this post is for you (to share).
Welcome to part two of the glossary.
Providers And Model Families
A provider is the company or organization developing a model or giving you access to it.
OpenAI, Anthropic, Google, Meta, DeepSeek, and others are examples of model providers. They develop model families, but they may also operate APIs, host inference, and build products on top of their models.
The model ecosystem is global. If we only learn the naming conventions used by some US companies, we will have a distorted view of the available models and how providers package them.
“Provider” can also be ambiguous when the company that created a model, the company hosting it, and the company that built the application you are using aren’t the same company.
A model family or series is a related collection of models. Within that family, the provider may publish multiple generations, capability tiers, sizes, or specialized variants.
Here is a useful naming hierarchy:
- Provider: the organization developing or serving the model
- Family or series: a related collection of models
- Generation: a major version within the family
- Tier or member: a provider-defined position around capability, speed, and efficiency
- Size: usually a parameter count, especially for open-weight models
- Specialization: coding, reasoning, vision, embeddings, safety, or another task
- Mode: a configuration used while running a model, not necessarily a different model
OpenAI’s GPT-5.6 series is a clean current example:
OpenAI
└── GPT
└── GPT-5.6
├── Sol
├── Terra
└── Luna
In that name, GPT is the broader lineage, 5.6 identifies the generation, and Sol, Terra, and Luna identify capability tiers. OpenAI describes Sol as its flagship tier, Terra as the balanced tier for everyday work, and Luna as the faster, lighter tier.
That does not mean Luna is “a small Sol,” or that Sol is only a coding model. These are members of a series positioned around different tradeoffs.
This general idea appears across providers:
| Provider and family | Higher-capability tier | Balanced tier | Faster or lighter tier |
|---|---|---|---|
| OpenAI GPT-5.6 | Sol | Terra | Luna |
| Anthropic Claude | Opus | Sonnet | Haiku |
| Google Gemini | Pro | — | Flash |
This table is neither exhaustive nor a conversion chart. Sol is not exactly equivalent to Opus. Luna is not exactly equivalent to Haiku. Pro and Flash do not create a perfect three-tier match for either family.
These names are maps created by each provider to help users choose among capability, latency, and cost. They are not standardized technical measurements, and they do not tell us the parameter count.
Other providers mix generations, capability labels, and specializations differently. DeepSeek has used V series names for general foundation models and R series names for reasoning models; its newer V4 API family includes Pro and Flash members. Z.ai’s GLM family uses generational names alongside labels such as Air, Flash, vision variants, and explicit parameter sizes. These systems rhyme with the US-provider examples, but they are not obligated to fit the same three boxes.
A model tier is a product category, not a unit of intelligence.
This distinction matters because model generations also improve. A newer balanced model may outperform an older flagship model. The name tells you how the provider positions that model within its current family; it does not establish a permanent rank across every model ever released.
Model Sizes And Specializations
Open-weight models often use a more literal naming convention:
gpt-oss-20b
gpt-oss-120b
Llama 8B
Llama 70B
Gemma 4B
Gemma 12B
The B means billions of parameters.
Parameters are the learned numerical values in a model. Parameter count gives us a rough sense of scale and can help estimate the memory and hardware needed to run a model.
It is not a universal capability score.
A 70-billion-parameter model is not automatically better for every task than a smaller model. Architecture, training data, post-training, quantization, tool-use training, and specialization all affect what a model can do.
Some models also use a mixture-of-experts architecture. Those models may publish a large total parameter count while activating only a smaller portion for each token. If you see both total parameters and active parameters, that is why.
Then there are specialized models:
- A coding model is optimized for understanding, generating, and working with code.
- A reasoning model is trained for multi-step problem solving and may expose settings that trade additional inference computation for deeper reasoning.
- A vision-language model can interpret images, video, or other visual input.
- An image-generation model produces images from text or other inputs.
- An embedding model turns input into vectors for search, retrieval, clustering, and comparison.
- A safety model may classify input or output against a policy.
Size, tier, and specialization answer different questions:
Size: How large is this model?
Tier: Where does the provider position it?
Specialization: What kind of work is it optimized for?
A specialized model may be available in several sizes. A capability tier may support several modes. A product may automatically choose among multiple models without showing you which one it selected.
That is why the model name matters, but it still does not tell you the whole system.
Closed, Open-Weight, And Open-Source Models
People often contrast “commercial models” with “open-source models.” That framing is misleading because open-weight models can absolutely be used commercially, and closed models are not the only models sold through commercial services.
The more useful distinction is about what has been released and how you can access it.
Closed Or Proprietary Models
With a closed or proprietary model, the provider does not distribute the weights. You use the model through a product or managed API.
The provider or an authorized hosting partner operates the infrastructure, while the model provider updates the model and controls how it may be served. You are buying access to inference rather than receiving the model itself.
Open-Weight Models
With an open-weight model, the learned weights are available for you to download and run under a particular license.
OpenAI’s gpt-oss, Google’s Gemma, Meta’s Llama family, DeepSeek’s model families, and Z.ai’s GLM releases are prominent examples of models distributed with weights.
Having the weights means you may be able to:
- run the model on your own hardware
- host it in your own cloud environment
- fine-tune or adapt it
- use a third-party inference provider
- control more of the data and deployment boundary
But “open weight” does not mean:
- free to operate
- free of license conditions
- small enough to run on your laptop
- fully reproducible from the original training data
- automatically open source under every definition
You still need compute, storage, an inference runtime, and people who know how to operate it. You also need to read the license for the specific model instead of assuming the word open grants unlimited rights.
Open-Source AI
Open-source AI is a stricter and still-evolving term.
Under the Open Source Initiative’s Open Source AI Definition, access to weights alone is not enough. The freedoms and materials needed to use, study, modify, and share the system also involve code and information about the data used to derive the model.
Not everyone uses the term that strictly, which is exactly why open weight is usually the clearer phrase when all we know is that the weights are available.
“Open weight” describes access to the weights. It does not mean noncommercial, unrestricted, or free to run.
Models, Applications, And Agentic Products
Now we can deal with the brand-name problem.
| Provider | Model family | User-facing application | Agentic product or harness | Open-weight family |
|---|---|---|---|---|
| OpenAI | GPT | ChatGPT | Codex | gpt-oss |
| Anthropic | Claude | Claude | Claude Code | — |
| Gemini | Gemini | Gemini CLI and related agent tooling | Gemma | |
| Meta | Llama | Meta AI | Product-dependent | Llama |
| DeepSeek | DeepSeek V and R series | DeepSeek | — | DeepSeek V and R releases |
| Z.ai | GLM | Z.ai | ZCode and AutoGLM | GLM open models |
ChatGPT is not a model. It is a product through which people interact with models, tools, memory, search, file handling, and other capabilities.
Claude is an overloaded name. It can refer to Anthropic’s model family or the Claude application. Claude Code is a different agentic product built for working with code, files, terminals, and development tools.
Gemini is also overloaded. It can mean Google’s model family or the Gemini application. Google’s Gemma family is a separate collection of open models.
Codex and Claude Code are not merely model names. They are agentic products that combine models with tools, execution environments, permissions, context management, and user interfaces.
DeepSeek also spans multiple layers. DeepSeek is the provider, names such as DeepSeek-V and DeepSeek-R refer to model series, and DeepSeek’s website and app are user-facing products. Its API names and modes do not always match the exact names of the downloadable model releases.
Z.ai is the provider and product brand; GLM is the model family. GLM contains multiple generations, sizes, and specializations, including general text, coding-oriented, reasoning, vision, image, and speech models. This is another good reminder that a provider’s catalog is usually a tree, not a single model.
The dashes in the table only mean I am not assigning an example in that category here. They do not claim that a provider has no agent products or integrations.
The underlying model can change without the product becoming a different kind of thing. ChatGPT can offer several models. A coding agent can route different tasks to different models. A harness can even support models from multiple providers.
The application is how you interact with the system. The model is one component inside the system.
Runtime, Harness, And Instruction Scaffold
This is where I want to be painfully specific because these terms are increasingly used as if they were interchangeable.
They are not.
The industry does not have one universally enforced vocabulary, so these are the working definitions I personally find most useful:
An inference runtime executes the model. A harness orchestrates the agent. An instruction scaffold guides it.
Inference Service
An inference service is an API or endpoint through which an application requests output from a model.
The service provides access to the model. It is not the model itself, and it is not necessarily the inference runtime executing the model.
When you use a managed API, the provider or an authorized hosting partner operates the infrastructure and hides most of the underlying inference runtime from you. When you host an open-weight model yourself, you may expose a similar service using an inference runtime or serving framework such as llama.cpp, vLLM, or SGLang. Tools such as Ollama and LM Studio make this easier by packaging local model management, compatible runtimes, and an API server into a more accessible experience.
For example, calling Anthropic’s Claude API means consuming an inference service while Anthropic operates the underlying inference runtime. Serving an open-weight model through vLLM means operating both the inference runtime and the inference service yourself.
Inference Runtime
An inference runtime is the machinery that loads and executes model weights. It handles the computation required to transform model input into generated output.
An inference runtime answers questions like:
Where and how does the model execute?
Which software loads and runs the model weights?
How is the computation scheduled across the available hardware?
Tools such as llama.cpp, vLLM, and SGLang provide this execution machinery. The inference runtime may sit behind a managed inference service or one that you operate yourself.
Harness
An agent harness is the software system wrapped around the model that makes the model useful as an agent.
The exact boundary varies by product, but a harness commonly provides:
- the reason, act, observe, repeat control loop
- context assembly and delivery
- tool definitions and tool execution
- filesystem and workspace access
- sandboxes and permission boundaries
- session state and memory
- context compaction
- retries, continuation, and recovery
- planning and verification loops
- human approvals and guardrails
- logs, traces, and observability
The model can propose an action. The harness decides how that proposal becomes an actual action in an actual environment.
The Databricks explanation of an agent harness describes this as the infrastructure that connects a model to tools, memory, workspaces, execution environments, and guardrails. LangChain’s anatomy of a harness emphasizes filesystems, bash, sandboxes, context management, durable state, and verification.
That is a lot more than a prompt file.
A harness answers:
What can the agent see?
What can it do?
How are actions executed?
How does work continue across steps or sessions?
What boundaries are enforced?
How is progress verified?
Codex and Claude Code are recognizable examples of agentic products with harnesses. In casual conversation, people sometimes call the entire product “the harness.” That is reasonable as long as we understand they mean the operational system, not just the model and not just its instructions.
Instruction Scaffold
An instruction scaffold is the organized guidance supplied to the agent through the harness.
It can include:
- system and developer instructions
AGENTS.mdCLAUDE.md- nested project instruction files
- skills and playbooks
- repository conventions
- workflow-specific rules
- reusable task templates
The scaffold helps answer:
How should the agent behave here?
What project knowledge should it follow?
Which workflow applies to this task?
Here is the distinction that keeps getting blurred:
An instruction scaffold is content consumed and delivered by a harness. It is not the harness.
If I change AGENTS.md, I have changed the agent’s guidance.
If I change the harness, I may have changed how instructions are discovered, which tools are available, how permission checks work, when context is compacted, how tool calls are executed, whether failed work is retried, or how results are verified.
Those are different kinds of changes.
The same instruction scaffold can be adapted for multiple harnesses. The same harness can load completely different instruction scaffolds for different projects. A directory full of Markdown files does not become a harness just because an agent reads it.
In fact, the harness decides whether, when, and how those files reach the model. LangChain’s article gives us a useful concrete example: AGENTS.md is stored in the filesystem and injected into context by harness machinery. The file contains guidance; the harness finds and delivers it.
| Concept | What it is | Primary responsibility | Examples |
|---|---|---|---|
| Model | The neural network | Transform input into generated output | GPT, Claude, Gemini, Llama |
| Inference service | API or service endpoint | Provide access to model inference | OpenAI API, Claude API, Gemini API |
| Inference runtime | Model execution machinery | Load and run model weights | llama.cpp, vLLM, SGLang |
| Harness | Operational software around the model | Context, tools, loops, state, permissions, verification | Codex, Claude Code |
| Instruction scaffold | Organized guidance | Tell the harness-driven agent how to work | AGENTS.md, CLAUDE.md, skills |
Final Mental Model
Here is the compact version:
Think of the model as the brain providing capability.
The inference service provides access.
The inference runtime executes the model.
The harness orchestrates the agent.
The instruction scaffold guides it.
If we can keep those five things separate, we can finally discuss whether we need a better model, a better inference service, a better inference runtime, a better harness, or simply better instructions.
Those are five different problems. We should stop giving them the same name.
Further Reading
- Previewing the GPT-5.6 series
- Anthropic model overview
- Google Gemini models
- Google Gemma models
- DeepSeek API documentation and model updates
- DeepSeek open model repositories
- Z.ai model overview
- Z.ai GLM open model repositories
- OpenAI open-weight models
- Open Source AI Definition
- Harnesses in AI: A Deep Dive — Tejas Kumar
- What is an AI Agent Harness? — Databricks
- The Anatomy of an Agent Harness — LangChain
- Harness design for long-running application development — Anthropic
- Build Agents That Run for Hours — Anthropic workshop