CraftUp · 2026
Built by PMs for PMs. Use this glossary to align your team on decision-critical concepts—context, retrieval, agents, evals, and safety—so you can ship reliable AI features faster.
Context
Context engineering: Deliberately shaping what the model sees (ordering, framing, and scoping inputs) to drive reliable, on-brand responses.
Context window: The maximum token length a model can attend to at once across input and output.
Few-shot examples: Concrete input-output pairs included in the prompt to teach the model the desired style, structure, or reasoning without training.
Prompt engineering: Designing and testing instructions, examples, and constraints so an LLM produces outputs that meet product requirements.
Prompt library: A governed collection of reusable, versioned prompts and context blocks that teams can consume safely.
Prompt template: A parameterized prompt pattern that inserts dynamic data while preserving structure, tone, and constraints.
Structured output: Requiring the model to return JSON or another strict schema so downstream systems can parse results reliably (see the sketch after this section).
System prompt: The always-on instruction block that sets persona, guardrails, and priorities for every model call in your product.
Token budget: The maximum tokens you allocate per request across prompt, tools, and output to control latency and cost.
Tool instructions: Explicit guidance given to a model about when and how to call tools or APIs, including constraints and safety rules.
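To make a few of these concrete, here is a minimal Python sketch that combines a system prompt, few-shot examples, a prompt template, and structured-output validation. The `call_model` stub, the field names, and the Acme scenario are illustrative assumptions, not a specific vendor API.

```python
import json

# Hypothetical always-on system prompt: persona, guardrails, priorities.
SYSTEM_PROMPT = (
    "You are a concise support assistant for Acme. "
    "Answer only from the provided context and reply in JSON."
)

# Few-shot examples: input-output pairs that teach format without training.
FEW_SHOT = [
    {"ticket": "App crashes on login", "reply": {"category": "bug", "urgency": "high"}},
    {"ticket": "How do I export my data?", "reply": {"category": "question", "urgency": "low"}},
]

# A parameterized prompt template: dynamic data slots into a fixed structure.
TEMPLATE = "Classify this support ticket as JSON with keys 'category' and 'urgency':\n{ticket}"

def build_prompt(ticket: str) -> list[dict]:
    """Assemble messages: system prompt, few-shot pairs, then the live request."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for ex in FEW_SHOT:
        messages.append({"role": "user", "content": TEMPLATE.format(ticket=ex["ticket"])})
        messages.append({"role": "assistant", "content": json.dumps(ex["reply"])})
    messages.append({"role": "user", "content": TEMPLATE.format(ticket=ticket)})
    return messages

def call_model(messages: list[dict]) -> str:
    """Stand-in for a real model call; returns a canned JSON string."""
    return '{"category": "bug", "urgency": "medium"}'

def parse_structured_output(raw: str) -> dict:
    """Enforce the schema so downstream code can rely on the shape."""
    data = json.loads(raw)
    assert set(data) == {"category", "urgency"}, "unexpected keys"
    assert data["urgency"] in {"low", "medium", "high"}, "invalid urgency"
    return data

if __name__ == "__main__":
    reply = call_model(build_prompt("Sync fails behind a proxy"))
    print(parse_structured_output(reply))
```

The schema check at the boundary is what lets downstream code trust the model's reply; everything above it is just disciplined prompt assembly.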
Retrieval
Agent memory: The mechanisms an agent uses to remember and reuse past interactions or facts across turns and sessions.
Chunking strategy: How you split documents or histories into pieces for indexing so retrieval balances relevance, completeness, and speed.
Citations: Showing which sources support an answer, with links or identifiers users can verify.
Context compression: Reducing and reshaping context (summaries, salience scoring, deduplication) so key facts fit within token and latency budgets.
Data lineage: Tracking the origin, transformations, and permissions of data used for training, retrieval, or responses.
Embeddings: Vector representations of text or data that capture semantic meaning, enabling similarity search, clustering, and ranking.
Grounded responses: Responses that are explicitly supported by retrieved or verifiable sources, reducing hallucination risk.
Knowledge freshness: Keeping the information an AI feature relies on up to date, and detecting when stale data harms quality.
Re-ranking: A second-pass model or heuristic that orders retrieved items by relevance before feeding them to the LLM.
Retrieval-augmented generation (RAG): Pairing LLM reasoning with external retrieval so responses cite up-to-date, relevant sources instead of relying on model memory (the sketch after this section walks through the flow).
Vector database: A storage and query engine optimized for vector similarity search, often combined with metadata filtering and hybrid search.
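A minimal retrieval sketch, assuming a toy in-memory corpus: real systems use a trained embedding model and a vector database, but the flow (embed, retrieve top-k, re-rank, build a grounded prompt with citation ids) is the same. The bag-of-words "embedding", the `kb-*` ids, and the refund corpus are stand-ins.

```python
import math
from collections import Counter

# Toy corpus standing in for chunked documents; ids double as citation keys.
CHUNKS = {
    "kb-101": "Refunds are processed within 5 business days of approval.",
    "kb-102": "Enterprise plans include SSO and audit logs.",
    "kb-103": "Refund requests older than 90 days need manager approval.",
}

def embed(text: str) -> Counter:
    """Toy 'embedding': bag-of-words counts. Real systems use a trained embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Vector-database stand-in: score every chunk and keep the top k."""
    q = embed(query)
    scored = [(cid, cosine(q, embed(text))) for cid, text in CHUNKS.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

def rerank(query: str, hits):
    """Second-pass ordering; a keyword-overlap heuristic stands in for a reranker model."""
    q_terms = set(query.lower().split())
    return sorted(hits, key=lambda h: len(q_terms & set(CHUNKS[h[0]].lower().split())), reverse=True)

def build_grounded_prompt(query: str) -> str:
    """Assemble a prompt that forces the answer to cite retrieved sources by id."""
    hits = rerank(query, retrieve(query))
    context = "\n".join(f"[{cid}] {CHUNKS[cid]}" for cid, _ in hits)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

if __name__ == "__main__":
    print(build_grounded_prompt("How long do refunds take?"))
```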
Agents
Agent orchestration: Coordinating how agents, tools, and models are invoked, sequenced, and supervised within a product.
Agentic workflow: A product flow where an agent chains reasoning, tool use, and checkpoints to achieve a user goal with minimal hand-holding.
AI agent: A system where an LLM plans and executes actions toward a goal using tools, memory, and feedback loops.
Function calling: An API pattern where the LLM returns a structured call to a specified function, often validated and executed by your code (see the sketch after this section).
Long-running agents: Infrastructure to run, monitor, and resume agents that operate over minutes to hours with checkpoints and persistence.
Model Context Protocol (MCP): A protocol for connecting models to external tools and data sources in a standardized, secure way.
Multi-agent system: A setup where multiple specialized agents collaborate or compete to solve a task, often with coordination rules.
Planner-executor pattern: Splitting an agent into a planning component that outlines steps and an executor that performs them, often with feedback.
Reflection: A pattern where the model critiques or scores its own output (or an agent's step) before finalizing or retrying.
Tool calling: Allowing a model to invoke predefined functions or APIs with structured arguments during its reasoning loop.
Tool reliability: How often model-invoked tools succeed, how they fail, and how gracefully the system recovers.
Tool schema design: Crafting clear input/output definitions for tools exposed to the model to ensure safe, correct, and efficient calls.
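The sketch below shows the tool-calling loop in miniature: a tool schema, a simulated model turn that returns a structured call, validation against the schema, and execution. `get_order_status`, the registry shape, and the canned model reply are hypothetical; in a real product this wires into a provider's function-calling API and the model decides when to call.

```python
import json

# Tool schema exposed to the model: clear name, purpose, and typed arguments.
ORDER_STATUS_TOOL = {
    "name": "get_order_status",
    "description": "Look up the shipping status of an order by id.",
    "parameters": {"order_id": {"type": "string", "required": True}},
}

def get_order_status(order_id: str) -> dict:
    """The real implementation the model is allowed to trigger (stubbed here)."""
    return {"order_id": order_id, "status": "shipped"}

TOOL_REGISTRY = {"get_order_status": (ORDER_STATUS_TOOL, get_order_status)}

def fake_model_turn(user_message: str) -> str:
    """Stand-in for the LLM: returns a structured tool call as JSON text."""
    return json.dumps({"tool": "get_order_status", "arguments": {"order_id": "A-1234"}})

def validate_call(call: dict) -> None:
    """Guard the boundary: malformed or unexpected arguments never execute."""
    schema, _ = TOOL_REGISTRY[call["tool"]]
    params = schema["parameters"]
    for name, spec in params.items():
        if spec.get("required") and name not in call["arguments"]:
            raise ValueError(f"missing argument: {name}")
    unknown = set(call["arguments"]) - set(params)
    if unknown:
        raise ValueError(f"unexpected arguments: {unknown}")

def run_tool_call(raw: str) -> dict:
    """Parse the model's structured call, validate it, then execute the mapped function."""
    call = json.loads(raw)
    if call["tool"] not in TOOL_REGISTRY:
        raise ValueError(f"unknown tool: {call['tool']}")
    validate_call(call)
    _, fn = TOOL_REGISTRY[call["tool"]]
    return fn(**call["arguments"])

if __name__ == "__main__":
    print(run_tool_call(fake_model_turn("Where is my order A-1234?")))
```

The validation step is where tool reliability is won or lost: unknown tools and bad arguments fail fast and visibly instead of silently corrupting an agent run.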
Evals
Cost per task: Total variable cost (tokens, tool calls, infra) to complete a user task with your AI feature.
Eval pipeline: A repeatable pipeline that scores model or agent outputs against test cases and business metrics before and after changes (see the sketch after this section).
Golden dataset: A curated collection of test cases with trusted answers used to judge model quality over time.
Hallucination rate: How often a model produces unsupported or incorrect facts relative to total responses.
Latency budget: The maximum response time you can spend across model calls, tools, and orchestration while meeting UX and business goals.
LLM-as-judge: Using a model to score another model's outputs against criteria, often faster and cheaper than human labeling.
Offline evals: Quality tests run on recorded or synthetic data without live users, giving fast, safe feedback on changes.
Online evals: Live experiments that measure model changes with real user traffic, often via A/B tests or shadow deployments.
Regression testing: Running automated checks to ensure a change doesn't reintroduce past bugs or quality drops in model behavior.
Task success rate: The percentage of user or agent tasks completed correctly without human rework or retries.
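A compact offline-eval sketch, assuming a tiny golden dataset and an exact-match judge: it scores a stubbed model, reports task success rate, and gates shipping on a baseline to catch regressions. An LLM-as-judge would replace the `judge` function when outputs are free-form; the dataset, baseline number, and intent labels are illustrative.

```python
# Hypothetical golden dataset: inputs with trusted expected answers.
GOLDEN_SET = [
    {"input": "Reset my password", "expected_intent": "account_recovery"},
    {"input": "Cancel my subscription", "expected_intent": "cancellation"},
    {"input": "Where is my invoice?", "expected_intent": "billing"},
]

def model_under_test(text: str) -> str:
    """Stand-in for the feature being evaluated (e.g., an intent-classification prompt)."""
    canned = {"Reset my password": "account_recovery", "Cancel my subscription": "cancellation"}
    return canned.get(text, "other")

def judge(predicted: str, expected: str) -> bool:
    """Exact match here; an LLM-as-judge would grade free-form answers against a rubric."""
    return predicted == expected

def run_offline_eval() -> float:
    """Score every golden case and return task success rate (0.0 to 1.0)."""
    passed = sum(
        judge(model_under_test(case["input"]), case["expected_intent"]) for case in GOLDEN_SET
    )
    return passed / len(GOLDEN_SET)

BASELINE_SUCCESS_RATE = 0.66  # the last shipped version's score (illustrative)

if __name__ == "__main__":
    rate = run_offline_eval()
    print(f"task success rate: {rate:.2f}")
    # Regression gate: block the change if quality drops below the baseline.
    if rate < BASELINE_SUCCESS_RATE:
        raise SystemExit("regression detected: do not ship")
```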
Safety
Content moderation: Screening inputs and outputs for toxicity, abuse, violence, or other policy-violating content.
Data leakage: Sensitive information being exposed to unauthorized users or external systems through model inputs, outputs, or logs.
Guardrails: Policies and technical controls that constrain what an AI can say or do, preventing harmful or out-of-scope behavior.
Jailbreak: A crafted input designed to bypass safety constraints and make the model produce disallowed content or actions.
PII redaction: Detecting and removing personally identifiable information from inputs, outputs, or stored data to prevent exposure (see the sketch after this section).
Prompt injection: An attempt by a user, or by content in a retrieved document, to override system instructions and make the model act outside intended bounds.
Red teaming: Systematic attempts to break or exploit an AI system to uncover safety and security weaknesses before attackers do.
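A simple redaction sketch, assuming regex patterns are enough for a demo: it strips emails, phone numbers, and card-like digit runs before anything is logged or sent onward. Production systems typically pair patterns like these with a trained PII detector and review workflows; the patterns and placeholders below are illustrative, not exhaustive.

```python
import re

# Illustrative patterns only; real deployments tune these and add a learned detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "CARD": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII with typed placeholders before storage or prompting."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def safe_log(user_message: str) -> None:
    """Logs (and model prompts) should only ever see the redacted form."""
    print(redact(user_message))

if __name__ == "__main__":
    safe_log("Call me at +1 415 555 0100 or email jane.doe@example.com")
```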
Go deeper with our courses, resources, and blog. Start learning free in the app—no fluff, just practical AI product skills.