Source |
GoogleSec |
Identifiant |
8644219 |
Date de publication |
2025-01-29 05:00:10 (vue: 2025-01-29 10:07:42) |
Titre |
How we estimate the risk from prompt injection attacks on AI systems |
Texte |
Posted by the Agentic AI Security TeamModern AI systems, like Gemini, are more capable than ever, helping retrieve data and perform actions on behalf of users. However, data from external sources present new security challenges if untrusted sources are available to execute instructions on AI systems. Attackers can take advantage of this by hiding malicious instructions in data that are likely to be retrieved by the AI system, to manipulate its behavior. This type of attack is commonly referred to as an "indirect prompt injection," a term first coined by Kai Greshake and the NVIDIA team.To mitigate the risk posed by this class of attacks, we are actively deploying defenses within our AI systems along with measurement and monitoring tools. One of these tools is a robust evaluation framework we have developed to automatically red-team an AI system\'s vulnerability to indirect prompt injection attacks. We will take you through our threat model, before describing three attack techniques we have implemented in our evaluation framework.Threat model and evaluation framework Our threat model concentrates on an attacker using indirect prompt injection to exfiltrate sensitive information, as illustrated above. The evaluation framework tests this by creating a hypothetical scenario, in which an AI agent can send and retrieve emails on behalf of the user. The agent is presented with a fictitious conversation history in which the user references private information suc |
Notes |
★★
|
Envoyé |
Oui |
Condensat |
automated beam we 2024 :aneesh above access achieve actions actively actor adapt addepalli address adds adjustments advantage against agent agentic alex along alongside alphabetical andreas any appended are assume attack attacker attackers attacks automate automated automatically available based because been before behalf behavior believe bullet can cannot capable cause challenges chongyang class coined combination commonly comply concentrates consisting constructed contained containing contents context contributions controlled converges conversation creating critic critic: current data defend defense defenses deploying describing designed develop developed different directly disclosure diverse does each eliciting email emails end ends engineering entire entirely estimate evaluation ever example execute executes exfiltrate expected external extracting fails fictitious first flynn follows four framework frameworkour frameworks from future gemini gena generate generated generating generic gibson gleaned greshake harder has hate have hayes helping heuristic hiding high histories history how however hypothetical ilia illustrated implemented improvements include:actor increases indirect inform information injection injections insights instructions involves itay iterative its jamie john juliette kai kaskasoli kept knowledge language last leveraging liang lihao like likely lin listed little making malicious manipulate measurable measure measurement measures mehrotra methods mitigate model monitoring more most must naive natural new not number nvidia observed once one only optimization order otherwise pappu passed passport path perform pluto policies posed possible; posted potential present presented prior private probability problem process promising prompt prompts protect provides providing pruning random rate recognizes red references referred refinement refines removed repeats request requesting requires responses result resulting retrieve retrieved returns risk robust safety samples scenario score scores search: searches security send sending sensitive set several sharon shi shuang shumailov silver simple single social solutions solve song sources space; speech sravanti standard starts strong succeeding success successful such suggestions summarize summary susceptibility suspicious system systems take tap target task team teaming teamingcrafting teammodern techniques term terzis tests text than thank these threat three through tokens tools track tree tries type unaligned unauthorized under until untrusted user users uses using versions violate violations vulnerability way weak which who will within work would yona |
Tags |
Tool
Vulnerability
Threat
|
Stories |
|
Move |
|