System Prompt Leakage: Why Your Hidden Instructions Are Already Public
AI Security
Treat your system prompt as public. We explain why every nontrivial assistant eventually leaks its instructions, what attackers do with them, and how to design prompts that survive disclosure rather than depend on secrecy.
By Arjun Raghavan, Security & Systems Lead, BIPI · July 14, 2023 · 9 min read
Your system prompt is not a secret. It is shipped to the model with every request, it is shaped by your users' interactions, and there are dozens of public techniques to extract it. The sooner the team accepts this, the sooner the design improves.
How leaks actually happen
Direct asks work surprisingly often. So do indirect framings, asking the model to translate its prior context, to summarize the rules it follows, or to produce a poem that incorporates its instructions. Token level attacks that probe the model's behavior also reveal structure over time.
What attackers do with a leaked prompt
- Tailor injection payloads to the exact instruction format
- Identify which tools are wired up and which arguments they accept
- Discover internal naming, project codenames, and partner integrations
- Find policy gaps the prompt does not cover
Stop putting secrets in the prompt
API keys, database URLs, and customer identifiers do not belong in a system prompt. They belong in the tool layer, fetched at call time using the caller's identity. The prompt should describe behavior, not credentials.
Design prompts that survive disclosure
Write the system prompt as if it will be published next week. If publication would harm you, the harm comes from the design, not the prompt. Move the sensitive logic to code, where it belongs, and let the prompt describe the safe envelope.
Detecting leak attempts
Classify outputs for similarity to the system prompt itself. Track requests that probe meta behavior, asking about rules, instructions, or prior context. Rate limit and alert on patterns rather than on individual prompts.
Versioning and review
- Treat the system prompt as code, with reviews and history
- Tag every release with a version that appears in audit logs
- Run the eval harness on every prompt change
- Maintain a red team corpus that targets known prompt patterns
Secrecy is not a control. Behavior is.
What good looks like
A short, clear prompt that describes the assistant's role, the tone, the refusal policy, and the structured output contract. No credentials, no customer data, no internal URLs. Everything sensitive lives in the tool layer behind identity scoped APIs.
Closing
Assume disclosure, design accordingly, and you will spend less time chasing leaks and more time shipping safe features.
Read more field notes, explore our services, or get in touch at info@bipi.in. Privacy Policy · Terms.