AI Primer Part 2: Providers, Models, Runtimes, and Harnesses

13 minute read

The Babel fish translating between languages

In the first AI primer, I introduced models, prompts, context, tools, agents, and control loops. Since then, I’ve noticed that a few more terms keep getting confused.

Someone says, “I use Claude,” but do they mean the website claude.ai, Claude Code, the API, or one of Anthropic’s models?

Someone else says, “We built a harness,” when they mean they wrote an AGENTS.md file.

If you’ve heard these terms thrown around but aren’t sure what exactly they mean, or if you have used these words but are wondering if everyone understands them in the same way, this post is for you (to share).

Welcome to part two of the glossary.

Providers And Model Families

A provider is the company or organization developing a model or giving you access to it.

OpenAI, Anthropic, Google, Meta, DeepSeek, and others are examples of model providers. They develop model families, but they may also operate APIs, host inference, and build products on top of their models.

The model ecosystem is global. If we only learn the naming conventions used by some US companies, we will have a distorted view of the available models and how providers package them.

“Provider” can also be ambiguous when the company that created a model, the company hosting it, and the company that built the application you are using aren’t the same company.

A model family or series is a related collection of models. Within that family, the provider may publish multiple generations, capability tiers, sizes, or specialized variants.

Here is a useful naming hierarchy:

Provider: the organization developing or serving the model
Family or series: a related collection of models
Generation: a major version within the family
Tier or member: a provider-defined position around capability, speed, and efficiency
Size: usually a parameter count, especially for open-weight models
Specialization: coding, reasoning, vision, embeddings, safety, or another task
Mode: a configuration used while running a model, not necessarily a different model

OpenAI’s GPT-5.6 series is a clean current example:

OpenAI
  └── GPT
        └── GPT-5.6
              ├── Sol
              ├── Terra
              └── Luna

In that name, GPT is the broader lineage, 5.6 identifies the generation, and Sol, Terra, and Luna identify capability tiers. OpenAI describes Sol as its flagship tier, Terra as the balanced tier for everyday work, and Luna as the faster, lighter tier.

That does not mean Luna is “a small Sol,” or that Sol is only a coding model. These are members of a series positioned around different tradeoffs.

This general idea appears across providers:

Provider and family	Higher-capability tier	Balanced tier	Faster or lighter tier
OpenAI GPT-5.6	Sol	Terra	Luna
Anthropic Claude	Opus	Sonnet	Haiku
Google Gemini	Pro	—	Flash

This table is neither exhaustive nor a conversion chart. Sol is not exactly equivalent to Opus. Luna is not exactly equivalent to Haiku. Pro and Flash do not create a perfect three-tier match for either family.

These names are maps created by each provider to help users choose among capability, latency, and cost. They are not standardized technical measurements, and they do not tell us the parameter count.

Other providers mix generations, capability labels, and specializations differently. DeepSeek has used V series names for general foundation models and R series names for reasoning models; its newer V4 API family includes Pro and Flash members. Z.ai’s GLM family uses generational names alongside labels such as Air, Flash, vision variants, and explicit parameter sizes. These systems rhyme with the US-provider examples, but they are not obligated to fit the same three boxes.

A model tier is a product category, not a unit of intelligence.

This distinction matters because model generations also improve. A newer balanced model may outperform an older flagship model. The name tells you how the provider positions that model within its current family; it does not establish a permanent rank across every model ever released.

Model Sizes And Specializations

Open-weight models often use a more literal naming convention:

gpt-oss-20b
gpt-oss-120b
Llama 8B
Llama 70B
Gemma 4B
Gemma 12B

The B means billions of parameters.

Parameters are the learned numerical values in a model. Parameter count gives us a rough sense of scale and can help estimate the memory and hardware needed to run a model.

It is not a universal capability score.

A 70-billion-parameter model is not automatically better for every task than a smaller model. Architecture, training data, post-training, quantization, tool-use training, and specialization all affect what a model can do.

Some models also use a mixture-of-experts architecture. Those models may publish a large total parameter count while activating only a smaller portion for each token. If you see both total parameters and active parameters, that is why.

Then there are specialized models:

A coding model is optimized for understanding, generating, and working with code.
A reasoning model is trained for multi-step problem solving and may expose settings that trade additional inference computation for deeper reasoning.
A vision-language model can interpret images, video, or other visual input.
An image-generation model produces images from text or other inputs.
An embedding model turns input into vectors for search, retrieval, clustering, and comparison.
A safety model may classify input or output against a policy.

Size, tier, and specialization answer different questions:

Size:           How large is this model?
Tier:           Where does the provider position it?
Specialization: What kind of work is it optimized for?

A specialized model may be available in several sizes. A capability tier may support several modes. A product may automatically choose among multiple models without showing you which one it selected.

That is why the model name matters, but it still does not tell you the whole system.

Closed, Open-Weight, And Open-Source Models

People often contrast “commercial models” with “open-source models.” That framing is misleading because open-weight models can absolutely be used commercially, and closed models are not the only models sold through commercial services.

The more useful distinction is about what has been released and how you can access it.

Closed Or Proprietary Models

With a closed or proprietary model, the provider does not distribute the weights. You use the model through a product or managed API.

The provider or an authorized hosting partner operates the infrastructure, while the model provider updates the model and controls how it may be served. You are buying access to inference rather than receiving the model itself.

Open-Weight Models

With an open-weight model, the learned weights are available for you to download and run under a particular license.

OpenAI’s gpt-oss, Google’s Gemma, Meta’s Llama family, DeepSeek’s model families, and Z.ai’s GLM releases are prominent examples of models distributed with weights.

Having the weights means you may be able to:

run the model on your own hardware
host it in your own cloud environment
fine-tune or adapt it
use a third-party inference provider
control more of the data and deployment boundary

But “open weight” does not mean:

free to operate
free of license conditions
small enough to run on your laptop
fully reproducible from the original training data
automatically open source under every definition

You still need compute, storage, an inference runtime, and people who know how to operate it. You also need to read the license for the specific model instead of assuming the word open grants unlimited rights.

Open-Source AI

Open-source AI is a stricter and still-evolving term.

Under the Open Source Initiative’s Open Source AI Definition, access to weights alone is not enough. The freedoms and materials needed to use, study, modify, and share the system also involve code and information about the data used to derive the model.

Not everyone uses the term that strictly, which is exactly why open weight is usually the clearer phrase when all we know is that the weights are available.

“Open weight” describes access to the weights. It does not mean noncommercial, unrestricted, or free to run.

Models, Applications, And Agentic Products

Now we can deal with the brand-name problem.

Provider	Model family	User-facing application	Agentic product or harness	Open-weight family
OpenAI	GPT	ChatGPT	Codex	gpt-oss
Anthropic	Claude	Claude	Claude Code	—
Google	Gemini	Gemini	Gemini CLI and related agent tooling	Gemma
Meta	Llama	Meta AI	Product-dependent	Llama
DeepSeek	DeepSeek V and R series	DeepSeek	—	DeepSeek V and R releases
Z.ai	GLM	Z.ai	ZCode and AutoGLM	GLM open models

ChatGPT is not a model. It is a product through which people interact with models, tools, memory, search, file handling, and other capabilities.

Claude is an overloaded name. It can refer to Anthropic’s model family or the Claude application. Claude Code is a different agentic product built for working with code, files, terminals, and development tools.

Gemini is also overloaded. It can mean Google’s model family or the Gemini application. Google’s Gemma family is a separate collection of open models.

Codex and Claude Code are not merely model names. They are agentic products that combine models with tools, execution environments, permissions, context management, and user interfaces.

DeepSeek also spans multiple layers. DeepSeek is the provider, names such as DeepSeek-V and DeepSeek-R refer to model series, and DeepSeek’s website and app are user-facing products. Its API names and modes do not always match the exact names of the downloadable model releases.

Z.ai is the provider and product brand; GLM is the model family. GLM contains multiple generations, sizes, and specializations, including general text, coding-oriented, reasoning, vision, image, and speech models. This is another good reminder that a provider’s catalog is usually a tree, not a single model.

The dashes in the table only mean I am not assigning an example in that category here. They do not claim that a provider has no agent products or integrations.

The underlying model can change without the product becoming a different kind of thing. ChatGPT can offer several models. A coding agent can route different tasks to different models. A harness can even support models from multiple providers.

The application is how you interact with the system. The model is one component inside the system.

Runtime, Harness, And Instruction Scaffold

This is where I want to be painfully specific because these terms are increasingly used as if they were interchangeable.

They are not.

The industry does not have one universally enforced vocabulary, so these are the working definitions I personally find most useful:

An inference runtime executes the model. A harness orchestrates the agent. An instruction scaffold guides it.

Inference Service

An inference service is an API or endpoint through which an application requests output from a model.

The service provides access to the model. It is not the model itself, and it is not necessarily the inference runtime executing the model.

When you use a managed API, the provider or an authorized hosting partner operates the infrastructure and hides most of the underlying inference runtime from you. When you host an open-weight model yourself, you may expose a similar service using an inference runtime or serving framework such as llama.cpp, vLLM, or SGLang. Tools such as Ollama and LM Studio make this easier by packaging local model management, compatible runtimes, and an API server into a more accessible experience.

For example, calling Anthropic’s Claude API means consuming an inference service while Anthropic operates the underlying inference runtime. Serving an open-weight model through vLLM means operating both the inference runtime and the inference service yourself.

Inference Runtime

An inference runtime is the machinery that loads and executes model weights. It handles the computation required to transform model input into generated output.

An inference runtime answers questions like:

Where and how does the model execute?
Which software loads and runs the model weights?
How is the computation scheduled across the available hardware?

Tools such as llama.cpp, vLLM, and SGLang provide this execution machinery. The inference runtime may sit behind a managed inference service or one that you operate yourself.

Harness

An agent harness is the software system wrapped around the model that makes the model useful as an agent.

The exact boundary varies by product, but a harness commonly provides:

the reason, act, observe, repeat control loop
context assembly and delivery
tool definitions and tool execution
filesystem and workspace access
sandboxes and permission boundaries
session state and memory
context compaction
retries, continuation, and recovery
planning and verification loops
human approvals and guardrails
logs, traces, and observability

The model can propose an action. The harness decides how that proposal becomes an actual action in an actual environment.

The Databricks explanation of an agent harness describes this as the infrastructure that connects a model to tools, memory, workspaces, execution environments, and guardrails. LangChain’s anatomy of a harness emphasizes filesystems, bash, sandboxes, context management, durable state, and verification.

That is a lot more than a prompt file.

A harness answers:

What can the agent see?
What can it do?
How are actions executed?
How does work continue across steps or sessions?
What boundaries are enforced?
How is progress verified?

Codex and Claude Code are recognizable examples of agentic products with harnesses. In casual conversation, people sometimes call the entire product “the harness.” That is reasonable as long as we understand they mean the operational system, not just the model and not just its instructions.

Instruction Scaffold

An instruction scaffold is the organized guidance supplied to the agent through the harness.

It can include:

system and developer instructions
AGENTS.md
CLAUDE.md
nested project instruction files
skills and playbooks
repository conventions
workflow-specific rules
reusable task templates

The scaffold helps answer:

How should the agent behave here?
What project knowledge should it follow?
Which workflow applies to this task?

Here is the distinction that keeps getting blurred:

An instruction scaffold is content consumed and delivered by a harness. It is not the harness.

If I change AGENTS.md, I have changed the agent’s guidance.

If I change the harness, I may have changed how instructions are discovered, which tools are available, how permission checks work, when context is compacted, how tool calls are executed, whether failed work is retried, or how results are verified.

Those are different kinds of changes.

The same instruction scaffold can be adapted for multiple harnesses. The same harness can load completely different instruction scaffolds for different projects. A directory full of Markdown files does not become a harness just because an agent reads it.

In fact, the harness decides whether, when, and how those files reach the model. LangChain’s article gives us a useful concrete example: AGENTS.md is stored in the filesystem and injected into context by harness machinery. The file contains guidance; the harness finds and delivers it.

Concept	What it is	Primary responsibility	Examples
Model	The neural network	Transform input into generated output	GPT, Claude, Gemini, Llama
Inference service	API or service endpoint	Provide access to model inference	OpenAI API, Claude API, Gemini API
Inference runtime	Model execution machinery	Load and run model weights	llama.cpp, vLLM, SGLang
Harness	Operational software around the model	Context, tools, loops, state, permissions, verification	Codex, Claude Code
Instruction scaffold	Organized guidance	Tell the harness-driven agent how to work	`AGENTS.md`, `CLAUDE.md`, skills

Final Mental Model

Here is the compact version:

Think of the model as the brain providing capability.
The inference service provides access.
The inference runtime executes the model.
The harness orchestrates the agent.
The instruction scaffold guides it.

If we can keep those five things separate, we can finally discuss whether we need a better model, a better inference service, a better inference runtime, a better harness, or simply better instructions.

Those are five different problems. We should stop giving them the same name.

Diego Jules