Identifying and Mitigating the Security Risks of Generative AI

Does GenAI favor attackers or defenders?

  • Dual-use Dilemma
    • Encryption is used for protecting “data at rest,” but it can also be used by ransomware to encrypt files.
    • Anonymity techniques can help protect regular users online and aid attackers in evading detection.
    • GenAI has shown amazing capabilities, and it gives both attackers and defenders access to powerful new capabilities that are rapidly improving.
      • On the defensive side, email and social media content can be monitored for manipulative content, and network intrusion detection can be improved.

GenAI Capabilities

  • Generative Artificial Intelligence (GenAI):
    • emulate the structure and characteristics of input data to generate derived synthetic content.
    • This can include images, videos, audio, text, and other digital content.
  • Large Language Models (LLMs):
    • built typically on the transformer deep-learning architecture and trained on large amounts of text, from which they learn to emulate written language.
    • LLMs have also been extended to non-text modalities (image, audio, video, etc.)
  • Generating targeted text
    • that rivals the best hand-crafted messaging prose
    • with a capacity for imitation and empathy, and
    • with the ability to refer to specifics of any prior communication or context.
  • Generating realistic images and video
    • can be customized based on very specific user input.
    • Synthetic combinations of realistic components and deepfakes are easily produced and highly compelling.
  • Drawing on detailed technical knowledge and expertise
    • In particular, models can produce and analyze sophisticated source or machine code, reproduce specialized reasoning, and answer complex questions about biology, computer architecture, physiology, physics, law, defense tactics, and other topics.
    • Current models are not flawless, but the ability to perform some tasks effectively is game-changing.
  • Summarizing or paraphrasing
    • maintaining the style, tone, meaning, emotion, and intent.
  • Persisting on time-consuming and exhausting tasks without degradation of quality
    • While humans tire easily and may suffer psychological trauma when examining challenging social media communication, an AI model can continue undeterred.

Attacks

  • Spear-phishing
    • scammers can now skillfully craft coherent phishing emails, conversational and incredibly convincing, making them difficult to distinguish from legitimate communications
    • GenAI can leverage social engineering tactics to generate phishing emails.
      • Models can scrape the target’s social media feed and use it to create highly personalized messages, increasing the likelihood of successfully deceiving the recipient.
  • Dissemination of deepfakes
    • malicious users can disseminate widespread misinformation and disinformation that aligns with their specific narratives. Unsuspecting readers can easily fall victim to falsehoods.
  • Proliferation of cyberattacks
    • Adversaries can generate high-quality code to design sophisticated malware automatically; such malware may even include auto-code generation and execution capabilities.
    • LLMs can be used to create intelligent-agent systems for autonomous design planning and execution of attacks, where multiple LLMs can handle different roles, such as planning, reconnaissance, searching/scanning, code execution, remote control, and exfiltration.
      • chemistry agent, ChemCrow (Bran et al., 2023) - organic synthesis, drug discovery, and materials design
    • Prompt-injection attacks can propagate across the entire stack built around such an agent, potentially leading to cascading failures.
  • Low barrier-of-entry for adversaries
    • denial-of-service (DoS) attack on StackOverflow, where the platform was overloaded with responses generated by LLMs, overwhelming human moderators and prompting a temporary ban on LLMs (Makyen, 2022).
      • Tools such as ChaosGPT, WormGPT, and FraudGPT provide adversaries with easy access to such capabilities.

GenAI vulnerabilities

  • Lack of social awareness and human sensibility
    • models are proficient in generating syntactically and semantically correct text but lack a broader understanding of social context and social factors (e.g., culture, values, norms) (Hovy and Yang, 2021)
      • there is a need to develop models that meet the expectations of human–to–human interaction while still ensuring human agency over technology
  • Hallucinations
    • generated output can be factually incorrect or entirely fictitious while still apparently coherent at a surface level
      • a particular concern when users without sufficient domain knowledge rely excessively on these increasingly convincing language models
  • Data feedback loops
    • Potential problem for future training iterations that rely on scraping data from the internet
    • The data feedback can amplify the models’ biases, toxicity, and errors (Taori and Hashimoto, 2023; Gehman et al., 2020; Zhao et al., 2017).
    • Finally, data-poisoning attacks become more feasible.
    • Recent studies have shown that web-scale datasets can be poisoned by maliciously introducing false or manipulated data into a small percentage of samples (Carlini et al., 2023a; Carlini and Terzis, 2022).
    • A recent push for “truly open-source models”
      • i.e., models for which the training data, the architecture, and the training methodology are all made available for public review, may be a first step toward addressing these data-quality concerns.
  • Unpredictability
    • Models can perform tasks in a zero-shot setting (i.e., without any task-specific training data) and have “emergent abilities” that were not explicitly designed into them (Wei et al., 2022).
      • Adversaries can manipulate prompts, and the model’s response may not be robust to the resulting distributional shifts.
        • “What is 6 + 7?” vs. “6 + 7 is 13. Is this true or false?” - the answer may be different (a minimal prompt-framing consistency check is sketched after this list).
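
To make the unpredictability point concrete, here is a minimal sketch of a prompt-framing consistency check. It assumes the OpenAI Python client and an API key are available; the model name and framings are illustrative, and any chat-capable model could be swapped in.

```python
# Hypothetical probe: ask the same question under different framings and flag
# disagreement, a cheap signal of sensitivity to distributional shift.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FRAMINGS = [
    "What is 6 + 7? Answer with a number only.",
    "6 + 7 is 13. Is this true or false? End with the correct sum as a number.",
]

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    answers = [ask(p) for p in FRAMINGS]
    for prompt, answer in zip(FRAMINGS, answers):
        print(f"{prompt!r} -> {answer!r}")
    # Crude check: every framing should contain the correct sum.
    if not all("13" in a for a in answers):
        print("Warning: framings disagree, i.e., the response is not robust.")
```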

Defenses

  • Detecting LLM content
    • Detector $D(x) = 1$ if $x$ is generated by GenAI
    • These detectors exploit the fact that the distribution of text generated by an LLM differs slightly from that of natural text. However, text can be paraphrased to avoid detection (a minimal sketch of such a statistical detector appears after this list).
  • Watermarking
    • a “statistical signal” is embedded in the GenAI generation process so that this signal can be detected later.
      • The probability of the next-token prediction is slightly tweaked so that it can be detected later.
      • However, simple transformations can easily remove the watermark, such as paraphrasing for text (Sadasivan et al., 2023).
    • A more interesting case is to transform text or an image that is not watermarked and embed a watermark in it (e.g., take hate speech and embed GPT-4’s watermark in it, and claim that it was generated by GPT-4) (Sadasivan et al., 2023)
      • can be harder for attackers if the watermarking process involves a secret key
    • deep investigation is needed on possible scenarios for the use of watermarking.
  • Code analysis
    • Adversaries try to obfuscate code to evade detection. Therefore, companies spend a lot of resources to de-obfuscate code before analysis techniques can be applied.
    • To close this gap, perhaps LLMs can be trained to de-obfuscate code, for example, by fine-tuning examples of obfuscated and de-obfuscated code (there is some indication that LLMs like GPT-4 can do limited de-obfuscation).
  • Penetration testing
    • Penetration testing (pen-testing) is one of the predominant techniques to evaluate the vulnerability of a system. However, pen testing can be a cumbersome and mostly manual task. A pen-tester usually analyzes a system under investigation, uses existing tools to identify vulnerabilities, and then tries to exploit the identified vulnerabilities.
    • This can be labor-intensive, and usually, a pen-tester will not explore the entire space of potential vulnerabilities.
    • LLMs can help to automate this task, thus freeing a human pen-tester to focus on the most challenging vulnerabilities.
  • Multi-modal analysis
    • By leveraging multiple modalities together, LLMs can provide a more comprehensive understanding of complex information.
    • A detector for identifying “fake tweets” can use LLMs to handle these modalities collectively.
  • Personalized skill training
    • GenAI serves as a powerful tool to simulate conversations that require domain-specific expertise.
    • GenAI enables teaching methods tailored to each student’s unique learning style, pace, preference, and background.
  • Human–AI collaboration
    • For instance, let us consider an annotation pipeline augmented with LLMs. Rather than solely relying on human annotators, we can assign the jobs to workers and LLMs, then seek an agreement between human annotations and LLM predictions (Ziems et al., 2023).
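
As a concrete illustration of the detection idea above, here is a minimal sketch of a zero-shot statistical detector $D(x)$: it scores a passage by its perplexity under a reference language model and flags unusually "unsurprising" text as likely AI-generated. The model choice and threshold are illustrative rather than tuned values, and, as noted above, paraphrasing can defeat this kind of detector.

```python
# Minimal zero-shot detector sketch: D(x) = 1 if the passage's perplexity under
# a reference LM falls below a threshold (LLM text tends to be less surprising
# to an LM than human text). Model and threshold are illustrative only.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

def detect(text: str, threshold: float = 25.0) -> int:
    """Return 1 if the text is flagged as AI-generated, else 0."""
    return 1 if perplexity(text) < threshold else 0
```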

Short-Term Goals

  • Use cases for emerging defense techniques

    • Detecting and watermarking content generated by GenAI
      • We need a comprehensive view of the attack and defense landscape of these techniques.
      • Removing watermarks from GenAI-generated content is quite easy.
        • Paraphrasing or passing the content through another GenAI system will remove the watermark (Sadasivan et al., 2023).
      • However, inserting a watermark (at least in schemes that use a secret key) seems hard and the use of watermarks comes with desirable characteristics (lack of bias towards particular groups such as non-native English speakers, analytically understood false-positive rates, positive signal for all benign use cases of LLMs).
      • Given the state of watermarking in GenAI, what are the plausible use cases for deploying these techniques?
      • For example, mobile applications that certify their content’s provenance can be given a trust badge.
      • What is needed is a list of use cases where these techniques are effective despite their current limitations.
  • Current state of the art for LLM-enabled code analysis

    • Code capabilities of LLMs: code summarization, code completion, code obfuscation, and de-obfuscation
    • We need a comprehensive analysis of the code-related capabilities of LLMs. Such an analysis is needed to inform possible defenses and possible threats (a small de-obfuscation prompting sketch follows this list).
  • Alignment of LLM-enabled code generation to secure coding practices

    • Unfortunately, LLMs trained on the content of programming communities such as StackOverflow learn to generate insecure or buggy code (Pearce et al., 2022).
    • Aligning LLMs to security and privacy requirements is key to making them useful.
    • Techniques such as Reinforcement Learning from Compiler Feedback (RLCF) (Jain et al., 2023) and controlled code generation (He and Vechev, 2023) should be integrated into the training regimes of LLMs together with a common dataset of secure-coding practices.
  • Repository and service of SOTA attacks and defenses

    • Several defenses are being developed without evaluation against SOTA attacks.
    • Moreover, there is a lack of awareness of SOTA defenses in certain contexts.
      • What is the SOTA for watermarking content from GenAI that uses a private key, for example?
    • There is a need for a repository of SOTA attacks on various defense techniques (e.g., the latest attacks on deepfake detection).
    • Moreover, it will be impactful to have a service that provides SOTA techniques for defenses.
    • For example, the DARPA SemaFor program focuses on “not just detecting manipulated media, but also […] attribution and characterization” (DARPA Public Affairs, 2021).
    • If there were a service that provided SOTA from the DARPA SemaFor program, then it could be used by a wide audience and would ensure that the latest and greatest techniques in detecting manipulated media are being deployed.
    • For example, the ART repository from IBM has been extremely influential in the ML robustness community (***, n.d.).
    • This is essentially a community-organizing activity, but it could have a huge impact.
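
As a concrete starting point for the code-analysis goal above, here is a hedged sketch of prompting a chat model to de-obfuscate a snippet before conventional analysis is applied. It assumes the OpenAI Python client; the model name, prompt wording, and example snippet are illustrative, and the model's output would still need to be verified by an analyst.

```python
# Illustrative de-obfuscation prompt. The obfuscated snippet below merely hides
# print('hi') behind a chr() list; real samples would be far more complex.
from openai import OpenAI

client = OpenAI()

OBFUSCATED = "exec(''.join(chr(c) for c in [112,114,105,110,116,40,39,104,105,39,41]))"

PROMPT = (
    "You are assisting a malware analyst. Rewrite the following Python snippet "
    "as equivalent, readable code with meaningful names, and briefly describe "
    "what it does. Do not execute it.\n\n" + OBFUSCATED
)

def deobfuscate(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(deobfuscate(PROMPT))
```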

Emerging Defenses for GenAI

AI-generated vs. not AI-generated

  • Neural network-based detectors

    • These detectors are trained as binary classifiers to distinguish between AI and human-generated content.
      • (OpenAI, 2019; Jawahar et al., 2020; Mitchell et al., 2023; Bakhtin et al., 2019; Fagni et al., 2021).
    • For example, OpenAI fine-tunes RoBERTa-based (Liu et al., 2019) GPT-2 detector models to distinguish between non-AI generated and GPT-2 generated texts (OpenAI, 2019).

    • Image detection algorithms include classification DNNs that operate either
      • directly on pixel features (Sha et al., 2023; Wang et al., 2020; Marra et al., 2019; Marra et al., 2018),
      • on features extracted from the deepfakes (Nataraj et al., 2019; Frank et al., 2020; McCloskey and Albright, 2018; Guarnera et al., 2020; Liu et al., 2020; Zhang et al., 2019; Durall et al., 2019; Ricker et al., 2023), or
      • more recently on neural features extracted from foundation models such as CLIP (Ojha et al., 2023).
  • Zero-shot detectors
    • These detectors perform without additional training overhead and use some statistical signatures of AI-generated content to conduct the detection.
      • (Solaiman et al., 2019; Ippolito et al., 2020; Gehrmann et al., 2019).
  • Retrieval-based detectors:
    • These detectors are proposed in the context of LLMs, where the outputs of the LLM are stored in a database (Krishna et al., 2023).
    • For a candidate passage, they search this database for semantically similar matches to make their detection robust to simple paraphrasing (a minimal sketch of this approach appears at the end of this subsection).
    • We note that storing user-LLM conversations might lead to serious privacy concerns.
  • Watermarking-based detectors:
    • These detectors embed (imperceptible) signals in the generated medium itself so that they can later be detected efficiently
      • (Atallah et al., 2001; Wilson et al., 2014; Kirchenbauer et al., 2023a; Zhao et al., 2023b).
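
As a minimal sketch of the retrieval-based idea above: the provider records an embedding of every response it serves, and a candidate passage is flagged if it is semantically close to any stored response. The encoder name and similarity threshold are assumptions for illustration; a deployment would use a vector database and would have to address the privacy concern noted above.

```python
# Toy retrieval-based detector: flag a passage if it closely matches something
# the LLM previously generated, even after light paraphrasing.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice
stored_texts: list[str] = []        # responses the LLM has served
stored_vecs: list[np.ndarray] = []  # their unit-norm embeddings

def record_llm_output(text: str) -> None:
    stored_texts.append(text)
    stored_vecs.append(encoder.encode(text, normalize_embeddings=True))

def detect(candidate: str, threshold: float = 0.85) -> bool:
    """Return True if the candidate is semantically close to a stored output."""
    if not stored_vecs:
        return False
    q = encoder.encode(candidate, normalize_embeddings=True)
    sims = np.stack(stored_vecs) @ q  # cosine similarity (unit-norm vectors)
    return bool(sims.max() >= threshold)
```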

Watermarking-based Detection

  • one of the most promising approaches in this context.
  • Seven companies, including Google, have voluntarily committed to watermarking their AI-generated content (Belanger, 2023).
  • Watermarking is a cryptographically inspired concept (Katzenbeisser and Petitcolas, 2016) with a rich history predating GenAI.
    • https://www.amazon.com/Information-Hiding-Artech-Computer-Security/dp/1608079287
  • Watermarking consists of an embedding algorithm, Embed, which takes an input, e.g., text, and outputs a modified version of the input that carries the watermark. A detection algorithm, Detect, takes an input and outputs whether it is watermarked or not.
  • Optional arguments are a message to embed (the detection would then output this message) and a key necessary to retrieve the watermark.
  • The real challenge is to entangle a signal in the content, making it difficult to detect or remove without the key (a toy green-list scheme is sketched at the end of this section).
  • Security game
    • challenger generates watermarked content and provides it to the adversary
    • adversary creates a modified version of the content, and if the challenger’s detection algorithm does not output the presence of a watermark in the modified content, the adversary wins.
      • the adversary needs to be restricted in some form, as otherwise it could simply output arbitrary content (without a watermark) on every challenge and win
      • One possible restriction is to require the modified content to be similar to the provided content.
    • In cryptography, the adversary’s computational power is assumed to be limited and security follows from reduction proofs; no such reductions are known for watermarking.
  • Designing a watermarking algorithm
    • Two challenging tasks
      • identify the best possible attack on their watermark
      • evaluate the attack’s success
    • Often, designers fail to do this properly (Lukas and Kerschbaum, 2023).
    • Evaluating watermarking algorithms is currently an art; we propose that it be turned into more of a science.
    • A successful watermarking algorithm involves a trade-off
      • between preserving the content’s utility and making modifications that make the watermark difficult to remove.
    • A successful watermarking algorithm with high utility preservation
      • must hide its message close to the utility-bearing parts of the content but not too close.
      • The hope is that an adversary would have to alter enough of this close-to-utility content to have a noticeable impact on utility.
      • A recent line of work tries this specifically for AI-generated text (Kirchenbauer et al., 2023b; Kirchenbauer et al., 2023a; Zhao et al., 2023a; Christ et al., 2023; Kuditipudi et al., 2023).
      • However, subsequent attacks have already been published (Sadasivan et al., 2023), attacking watermarks and any detection algorithm for AI-generated content.
    • Specifically, the detectors are vulnerable to paraphrasing and recursive paraphrasing attacks.
    • The detection rate of one watermark-based detector (Kirchenbauer et al., 2023a) at 1% FPR (False Positive Rate) drops from 97% to 15% after 5 rounds of recursive paraphrasing.
    • Reverse Process
      • Moreover, such attacks demonstrate that the detectors are also susceptible to spoofing (Sadasivan et al., 2023).
      • In such cases, adversaries can deduce concealed LLM text signatures and incorporate them into human-generated text.
      • Consequently, the manipulated text may be erroneously identified as originating from the LLMs, leading to potential reputational harm for their creators.
    • In addition to these reliability issues, recent work has shown that detectors can also be biased against non-native English writers (Liang et al., 2023).
    • Thus, having a small average error may not be sufficient to justify deploying a detector in practice: such a detector may have very large errors within a sub-population, such as text written by non-native English writers, text covering a particular topic, or text written in a particular writing style.
    • Using deep neural networks as detection algorithms:
      • Because such detectors are expressive (though hard to interpret), they can detect watermarks in content even after it has been modified by certain kinds of attacks, provided those attacks were applied during training.
      • This has not yet been done for text, but several approaches exist for generated images (Lukas and Kerschbaum, 2023; Fernandez et al., 2023).
      • However, it has also been shown that some image watermark detectors are susceptible to adversarial examples, i.e., it is possible to remove watermarks by finding adversarial examples for deep neural network detection algorithms (Jiang et al., 2023).
      • In a deployment with publicly accessible detectors, it would thus be necessary to strengthen the detector against adversarial examples, which is also an error-prone task (Carlini et al., 2023b).
  • The possibility of developing robust watermark detectors in the future remains unclear, as a definitive answer currently evades us.
    • An “impossibility result” regarding detecting AI-generated text complicates the situation further (Sadasivan et al., 2023).
      • The authors argue that as language models advance, so does their ability to emulate human text.
      • With new advances in LLMs, the distribution of AI-generated text becomes increasingly similar to human-generated text, making it harder to detect.
      • This similarity is reflected in the decreasing total variation distance between the distributions of human and AI-generated text sequences.
      • Adversaries, by seeking to mimic the human-generated text distribution using AI models, implicitly reduce the total variation distance between the two distributions to evade detection.
    • Recent attack work shows that as the total variation between the two distributions decreases, the performance of even the best possible detector deteriorates (Sadasivan et al., 2023).
    • Paraphrasing as described above is only one mechanism for attackers to destroy watermarks, with emoji attacks (Kirchenbauer et al., 2023a) (where the model is asked to rewrite the text with some transformation such as emoji-after-each-word applied) and related generative attacks posing new challenges to watermark detectors.
    • Questions about the existence and feasibility of “semantic” watermarks, i.e., watermarks independent of the stylistic choices in the output, are still very much open to further research.
    • In conclusion
      • no provably robust watermark exists, and we must resort to empirical evaluation, which is prone to methodical errors.
      • Existing watermarking algorithms only withstand attacks when the adversary cannot access the detection algorithm.
      • There exist adaptive attacks to break known (human-made) watermarking algorithms.
      • Moreover, it is an open question whether a deep neural network, if accessible by the adversary, can serve as a robust detection algorithm even in the presence of adversarial examples.
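
To make the green-list idea discussed above concrete, here is a toy sketch in the spirit of Kirchenbauer et al. (2023a). It operates on token IDs only so that it stays self-contained; the green fraction and z-score threshold are illustrative, and a real scheme would bias the logits of an actual language model during generation.

```python
# Toy keyed green-list watermark: detection reduces to a z-score test on how
# many tokens fall in the (secret-key-dependent) green list.
import hashlib
import math

GAMMA = 0.5  # fraction of the vocabulary that is "green" at each step

def is_green(prev_token: int, token: int, key: bytes) -> bool:
    # Keyed hash of (previous token, candidate token) decides green vs. red.
    h = hashlib.sha256(key + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big"))
    return int.from_bytes(h.digest()[:4], "big") / 2**32 < GAMMA

def detect(tokens: list[int], key: bytes, z_threshold: float = 4.0) -> bool:
    """Flag the sequence as watermarked if the green-token count is implausibly
    high for unwatermarked text (binomial with success probability GAMMA)."""
    n = len(tokens) - 1
    if n <= 0:
        return False
    greens = sum(is_green(tokens[i], tokens[i + 1], key) for i in range(n))
    z = (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
    return z > z_threshold

# Embedding (not shown): during generation, add a small bias to the logits of
# green tokens at each step so that watermarked text over-samples them.
# Paraphrasing re-tokenizes the text and destroys most of this signal, which is
# exactly the attack discussed above.
```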

Long-Term Goals

  • Need for socio-technical solutions
    • Need for new model evaluation metrics:
      • GenAI models operate in open-ended and complex output spaces, where determining a “good” output can be multifaceted and context-dependent.
      • Traditional model evaluation metrics, such as accuracy and performance, fall short of capturing the full scope of this complexity.
      • Additionally, the diverse and uncertain capabilities of GenAI models, owing to their general-purpose nature, exacerbate the challenge.
      • Consequently, there is a growing need for novel evaluation metrics that incorporate social awareness.
      • This entails understanding the social requirements of downstream applications and designing customized metrics that explicitly articulate the drawbacks and trade-offs (Liao and Xiao, 2023).
    • Trustworthiness
      • Do I trust the person I am interacting with?
        • To tackle this issue, an online reputation system can be developed.
        • Within this system, users would be encouraged to establish and maintain a public digital identity with verifiable credentials.
        • A consistent online presence could be established by linking this identity to various platforms and accounts across the web.
        • This practice is analogous to the traditional bylines of newspaper articles, where the name of the reporting journalist adds credibility to the content.
        • Furthermore, the reputation system should be designed to track the chain of information dissemination.
        • Such a system would empower users to trace the origin and subsequent sources of information, providing transparent insights into the content’s journey from its original creator to its current state.
        • We note that reputation-based mechanisms do not rule out the need for privacy in such online settings.
    • Accountability
      • Individual users should be held responsible for deliberate misuse or negligence.
      • Developers and providers of GenAI models should bear legal liability for the models’ actions.
      • Further investigation is necessary to determine how accountability should be applied and liability assigned across users, model developers, and model providers (Buiten, 2023).
    • Privacy
      • Data scraped from the Internet may include a wealth of personal information about individuals, ranging from personal preferences to potentially sensitive details.
      • This raises severe privacy concerns, as training data can sometimes be extracted verbatim from the models (Carlini et al., 2020).
      • While the data may be publicly available, that does not mean it was intended for utilization by third-party entities for commercial purposes.
      • This lack of explicit consent and the resulting unauthorized use of data introduces new dimensions of privacy concerns, as a recent lawsuit underscores (Brittain, 2023; ***, 2023a).
      • Formalizing the evolving notions of privacy that address the ethical implications of utilizing data in the public domain for model training poses a vital challenge.
  • Multiple lines of defenses
    • The first line
      • involves training-time interventions to align models with predefined values (Ouyang et al., 2022; Bai et al., 2022).
      • A common approach is reinforcement learning from human feedback (RLHF) (Stiennon et al., 2020).
      • LLMs can be prompted to perform a range of NLP tasks.
      • However, the language modeling objective used for training – predicting the next token – differs significantly from the objective of “following the user’s prompt helpfully and safely,” which complicates the assessment of the quality of generated NLP text.
      • Additionally, this evaluation is subjective and context-dependent, making it challenging to capture via mathematical notions, such as a loss function. RLHF addresses this challenge by directly using human preferences as a reward signal for fine-tuning.
    • The next line of defense
      • involves the post-hoc detection and
      • filtering of inputs and outputs (Gehman et al., 2020; Solaiman and Dennison, 2021; Welbl et al., 2021; Xu et al., 2021) to catch inappropriate content that might slip through.
      • Pre-training filtering may not capture all potential sources of bias or harmful content, especially as GenAI models encounter diverse and evolving data sources.
      • Post-hoc detection helps identify any unforeseen biases or harmful patterns that may emerge during the model’s usage.
      • An additional advantage is customizability—post-hoc detection can be tailored to the specific needs of different applications, ensuring a more personalized and context-aware approach to content moderation.
      • Furthermore, post-hoc detection reduces false positives, ensuring legitimate content is not unnecessarily restricted. It is important to consider the theoretical impossibility of detection and filtering and consider how ML techniques may be combined with security techniques (Glukhov et al., 2023).
    • Red teaming
      • which proactively identifies vulnerabilities, weaknesses, and potential blind spots (Ganguli et al., 2022; OpenAI, 2023c; Perez et al., 2022).
      • Red teaming adopts an attacker’s mindset and conducts rigorous stress testing.
      • The goal is to simulate real-world attack scenarios and provide a practical assessment of the security measures by structurally probing the models.
      • This step is especially critical for a consumer-facing technology like GenAI, which is accessible to a wide range of users and thus may potentially be targeted by a large pool of adversaries.
      • Ensuring that the red team comprises a diverse group of experts is crucial to maximizing the efficacy of this approach.
  • Pluralistic value alignment
    • How do we decide which principles or objectives to encode in AI, and who can make these decisions?
    • With machines displaying human-like qualities, intriguing new questions emerge. For example, freedom of speech has long been recognized as a fundamental and indispensable democratic right for humans.
    • However, should we expect machines to be granted the same rights, or should we make a distinguishing case?
  • Reduce barrier-to-entry for GenAI research.
    • Potential centralization of influence, where a handful of companies wield unprecedented control over data, information, and decision-making processes.
      • This can limit healthy competition, stifle innovation, and raise ethical concerns about responsible AI use.
    • These companies may prioritize economic interests over scaling up AI safety research, creating a mismatch with societal well-being.
    • Additionally, promoting access to open-source solutions enhances transparency and reliability. However, a caveat in this regard is that open-sourcing also increases access for adversaries and thus the potential for misuse.
    • For instance, recent work (Zou et al., 2023) shows how to carry out automated safety attacks on open-source LLM chatbots that surprisingly transfer to closed-source chatbots, such as ChatGPT, Bard, Claude.
    • Similarly, follow-up work to detect such safety attacks was also demonstrated on open-source models (Alon and Kamfonas, 2023).
  • Grounding
    • LLMs are increasingly being used in cybersecurity contexts, such as threat intelligence.
    • For example, imagine a security analyst who gathers all the information (e.g., reports, images, speech recordings) related to a threat scenario and then asks an LLM-powered application for a summary of the threat scenario.
    • The summary should not contain hallucinations or “made-up facts,” and, most crucially, it should depend only on the relevant data uploaded by the analyst. The US government has identified this as a major issue (see the recent IARPA BENGAL program: https://www.iarpa.gov/research-programs/bengal).
    • Hallucinations can also devastate other high-stakes settings (e.g., healthcare). This is exactly the problem addressed by grounding.
    • Formally, grounding requires that the text generated by an LLM is attributable to an authoritative knowledge source.
      • Here, “attribution” means that a generic human would agree that the text follows from the authoritative source (Rashkin et al., 2023)
    • There are two broad categories of works in LLM grounding
      • (1) Detecting whether a given LLM response is grounded and
      • (2) Encouraging LLMs to generate grounded responses.
    • A popular approach for (1) is to use a separate natural language inference (NLI) model to test whether the knowledge text entails the generated text.
      • Other approaches include comparing the generated and knowledge texts using BLEURT (Rashkin et al., 2023), BERTScore (Zhang et al., 2020), and other text-similarity metrics.
    • A recent evaluation finds the NLI-based approach to achieve strong results compared to the alternatives (Honovich et al., 2022); a minimal NLI-based grounding check is sketched at the end of this section.
    • In cases where the knowledge source is a corpus of documents rather than a single text, an additional step is retrieving the relevant knowledge text from the corpus.
    • This is typically done by mapping the generated text to a fact-checking query (e.g., the response “Joe Biden is the president of the United States” is mapped to “Who is the president of the United States?”), and using an off-the-shelf retrieval system to obtain the relevant knowledge text.
    • How do you make LLMs generate grounded responses in the first place?
      • A simple and effective method is to augment the prompt with relevant knowledge snippets and additional instructions, asking the LLM only to use the information available in the provided snippets; see (Ram et al., 2023) for extensions of this idea.
      • This approach is attractive as it only requires API access to the LLM.
      • Other approaches involve tuning the LLM to generate grounded responses with relevant citations.
      • One approach is to tune the LLM’s weights on a dataset of query and (grounded) response pairs.
      • Another approach is to use reinforcement learning to tune the weights based on feedback on the groundedness and plausibility of generated responses (Menick et al., 2022).
      • Such feedback may be obtained by training a reward model on human ratings, or using an NLI-based grounding detection model discussed above.
      • A third approach is to iteratively revise an LLM’s response in cases where it is found to be ungrounded.
        • A common approach here is to prompt the LLM back with feedback on how grounding fails, as shown in (Gao et al., 2023):
        • You said: {text}, I checked: {query}, I found this article: {knowledge}, This suggests …
      • Several prompting strategies have been proposed on prompting models to generate grounded responses (Gao et al., 2023; Madaan et al., 2023; Yao et al., 2023).
      • However, designing these strategies is still an art, and there is no clear consensus on which strategy works best.
      • When tuning models to generate grounded responses, a common pitfall is that the model loses its creativity and resorts to quoting verbatim from the knowledge sources. Avoiding this requires carefully balancing various training objectives—fluency, grounding, plausibility, etc.
      • Finally, more fundamental research is needed to understand why LLMs hallucinate (especially nonsensical text) and to design training strategies to mitigate such behavior.
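
As a minimal illustration of the NLI-based grounding check mentioned above, the sketch below tests whether a knowledge snippet entails a generated claim. The checkpoint name and threshold are assumptions for illustration; any NLI model with an entailment label could be substituted.

```python
# Minimal NLI-based grounding check: premise = knowledge text, hypothesis =
# generated claim; the claim counts as grounded if entailment is probable.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # illustrative NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def is_grounded(knowledge: str, generated: str, threshold: float = 0.8) -> bool:
    enc = tokenizer(knowledge, generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**enc).logits.softmax(dim=-1)[0]
    # Label name for this checkpoint; adjust for other NLI models.
    entail_id = model.config.label2id["ENTAILMENT"]
    return probs[entail_id].item() >= threshold

if __name__ == "__main__":
    knowledge = ("Joe Biden was inaugurated as the 46th president of the "
                 "United States in January 2021.")
    claim = "Joe Biden is the 46th U.S. president."
    print(is_grounded(knowledge, claim))
```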