A working threat model for private AI environments
Generic security checklists do not produce useful design. A real threat model names risks the system has to answer for, the design choices that address each, and how to verify the mitigation actually holds.
What a threat model is for
A threat model is not a compliance artifact. It is the document that says, in plain language, what could go wrong with this system, what the design does about each scenario, and how the team would know if the mitigation failed. Without it, the architecture is impossible to review and the operating team has no way to prioritize their attention.
For private AI, the threat model has to cover risks that do not show up in conventional application security work. Prompt injection, retrieval leakage, model and prompt drift, egress through tool integrations, and privileged misuse are all AI-specific or AI-amplified. A SOC 2 control matrix does not catch them. They need to be named explicitly and mitigated by design choices, not by hoping the model behaves.
Risks every private AI threat model should address
Prompt injection via retrieved content is the most-discussed risk and the easiest to underestimate. An attacker plants instructions in a document the system will later retrieve — a vendor invoice, a customer support ticket, a public-facing page that gets indexed. The model sees the injected instructions as legitimate input. The mitigation is layered: trusted-source policies for retrieval, system prompts that explicitly bound instruction-following, output filters tied to your access policy, and adversarial prompts in your evaluation suite.
Retrieval leakage across access boundaries is more subtle. A user receives summarized content drawn from documents they would not be allowed to read directly. The retrieval layer fell through to a broader corpus because the access design ended at the application surface and did not extend to the index. The mitigation is per-user retrieval filters, document-level ACLs enforced in the vector store itself, and explicit rules against falling through to documents the requesting user has no permission to see.
Sensitive content captured in logs is a recurring incident pattern. Queries and retrieved snippets containing regulated material end up in telemetry, monitoring tools, vendor dashboards, or third-party log aggregators. The mitigation is redaction at the logging boundary, tiered logging policies that distinguish what is safe to capture from what must be excluded, retention limits, and an audit of every system in the telemetry path.
Model and prompt drift is the slowest-moving risk and the easiest to ignore. A model upgrade or a prompt edit subtly changes the system's behavior. Accuracy, safety, or compliance posture degrades in ways that do not surface until a user complains. The mitigation is a versioned prompt library, an evaluation suite run on every change with regression criteria, and a documented rollback path that the team has actually exercised at least once.
Egress to unmanaged endpoints is the architectural failure that scales fastest. A retrieval source, plugin, or tool integration calls out to a third party that was never on the approved data-flow list. The mitigation is egress allow-listing, network segmentation around the model runtime, and a written inventory of every outbound connection the system can make.
Privileged misuse is the risk that gets the least attention in design and the most attention in incident reviews. An administrator or maintainer queries the system in ways that bypass end-user controls or exfiltrate content. The mitigation is admin separation, audit logging of privileged actions, two-person rules for sensitive changes, and explicit acceptable-use documentation that names what privileged operators may and may not do.
How to verify the mitigations actually hold
A threat model with no verification plan is a wish list. Each named risk should have at least one way to be tested in the evaluation suite or in a periodic red-team exercise. For prompt injection, this is adversarial prompts in regression tests. For retrieval leakage, it is access-boundary tests run as a non-privileged user. For drift, it is the evaluation suite itself.
The threat model should also have a "how would we know" section for each risk — a signal the operating team can monitor that tells them something is going wrong. Without this, the team is reduced to hoping nothing bad happens, which is not an operating posture.
A useful threat model is short, specific, written down, and reviewed when the system changes. The point is not exhaustive coverage of every theoretical risk — it is honest accounting of the risks this system has to answer for and the verifiable design choices that address them.
