GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Carnegie Mellon University, Northeastern University

An illustration of GenAudit's user interface and sample predictions. The reference document (a clinical discharge note) is on the left, and the generated text to be fact-checked is on the right (such text can be produced by querying any LLM, but was entered manually here for ease of illustration). Spans in the text that are not supported or are contradicted by the reference are highlighted in red, with suggested replacements in green. As the user moves to any line in the generated text, evidence found for all facts in that line is highlighted using blue links.

Overview

LLMs can generate factually incorrect statements even when given access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit, a tool to assist fact-checking of LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in the outputs of 8 different LLMs when summarizing documents from diverse domains. We provide the tool and fact-checking models for public use.

Quickstart

GenAudit is available to install via PyPI. You can get it up and running with the following shell commands. If you have multiple GPUs, you can spawn more than one fact-checking process in the backend, which lets multiple sentences be fact-checked in parallel.


  pip install genaudit

  # if you have a single GPU (the QA and fact-checking models share it)
  python -m genaudit.launch --port <port-value> --qa-model hf:mistralai/Mistral-7B-Instruct-v0.1 \
    --factcheck-model hf:kundank/genaudit-usb-flanul2 --num-factcheck-processes 1 \
    --use-single-gpu --qa-quantize 4bit

  # if you have N GPUs: one hosts the QA model, so use N-1 fact-checking processes
  python -m genaudit.launch --port <port-value> --qa-model hf:mistralai/Mistral-7B-Instruct-v0.1 \
    --factcheck-model hf:kundank/genaudit-usb-flanul2 --num-factcheck-processes <N-1> \
    --qa-quantize 4bit
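
Once launched, the web interface should be reachable in your browser at the port you specified (on a local machine, that would be http://localhost:<port-value>).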


Fact-checking Models

We release a series of fine-tuned LLMs to serve as the tool's backend. They come in different sizes, so you can choose the best one for your available GPU memory and latency budget. The model name is passed via the --factcheck-model argument. The following models are currently available on Hugging Face:


The fine-tuned FlanUL2 model should perform the best.

API Usage

from genaudit import FactChecker

# Load a fine-tuned fact-checking model from Hugging Face.
fc = FactChecker("hf:kundank/genaudit-usb-flanul2")

# Reference document to check against.
ref = '''Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania. \
The institution was formed by a merger of Carnegie Institute of Technology and Mellon Institute \
of Industrial Research in 1967. In the 1990s and into the 2000s, Carnegie Mellon solidified its \
status among American universities, consistently ranking in the top 25. In 2018, Carnegie Mellon's \
Tepper School of Business placed 12th in an annual ranking of U.S. business schools by Bloomberg Businessweek.'''

# Generated text to fact-check; it contains an invented institute and a wrong ranking.
gen = "CMU is a top-ranked university located in Pittsburgh. It was formed by merging Carnegie Institute \
of Technology, Mellon Institute of Industrial Research, and the Cranberry Lemon Institute. Its business school \
was ranked 15th in the US by Bloomberg Businessweek."

# Check each sentence of the generated text against the reference.
fc.check(reference=ref, claim=gen)

######################
####### Output #######
######################
{'reference_sents': ['Carnegie Mellon University (CMU) is a private research university in Pittsburgh, Pennsylvania.',
  'The institution was formed by a merger of Carnegie Institute of Technology and Mellon Institute of Industrial Research in 1967.',
  'In the 1990s and into the 2000s, Carnegie Mellon solidified its status among American universities, consistently ranking in the top 25.',
  "In 2018, Carnegie Mellon's Tepper School of Business placed 12th in an annual ranking of U.S. business schools by Bloomberg Businessweek."],
 'claim_sents': [{'evidence_labels': [0, 2],
   'todelete_spans': [],
   'replacement_strings': [],
   'txt': 'CMU is a top-ranked university located in Pittsburgh.',
   'success': True,
   'edited_txt': 'CMU is a top-ranked university located in Pittsburgh.'},
  {'evidence_labels': [1],
   'todelete_spans': [[57, 58], [98, 133]],
   'replacement_strings': [' and', ''],
   'txt': 'It was formed by merging Carnegie Institute of Technology, Mellon Institute of Industrial Research, and the Cranberry Lemon Institute.',
   'success': True,
   'edited_txt': 'It was formed by merging Carnegie Institute of Technology and Mellon Institute of Industrial Research.'},
  {'evidence_labels': [3],
   'todelete_spans': [[30, 35]],
   'replacement_strings': [' 12th'],
   'txt': 'Its business school was ranked 15th in the US by Bloomberg Businessweek.',
   'success': True,
   'edited_txt': 'Its business school was ranked 12th in the US by Bloomberg Businessweek.'}]}
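
The suggested corrections are returned as character-level spans. Below is a minimal sketch of how one might apply todelete_spans and replacement_strings to reconstruct edited_txt, assuming each span is a [start, end) character offset into txt (consistent with the example output above). The apply_edits helper is illustrative, not part of the genaudit API:

def apply_edits(txt, todelete_spans, replacement_strings):
    # Apply replacements right-to-left so earlier character offsets stay valid.
    for (start, end), repl in sorted(zip(todelete_spans, replacement_strings), reverse=True):
        txt = txt[:start] + repl + txt[end:]
    return txt

result = fc.check(reference=ref, claim=gen)
for sent in result['claim_sents']:
    # evidence_labels are indices into reference_sents
    evidence = [result['reference_sents'][i] for i in sent['evidence_labels']]
    if sent['success'] and sent['todelete_spans']:
        fixed = apply_edits(sent['txt'], sent['todelete_spans'], sent['replacement_strings'])
        assert fixed == sent['edited_txt']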

Performance

Results from human evaluation of GenAudit predictions (using the fine-tuned Flan-UL2 backend) on LLM-generated summaries of documents from different datasets. Please refer to the paper for more details.


BibTeX

@article{krishna2024genaudit,
  title={GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence},
  author={Krishna, Kundan and Ramprasad, Sanjana and Gupta, Prakhar and Wallace, Byron C and Lipton, Zachary C and Bigham, Jeffrey P},
  journal={arXiv preprint arXiv:2402.12566},
  year={2024}
}

Acknowledgements

We thank Saurabh Garg for insightful discussions and suggestions on the work, and Adithya Pratapa and Maneesh Bilalpur for help with testing the software.