GenAudit is a tool that helps find factual errors (hallucinations) in model-generated text by checking it against a reference document.
The tool suggests edits to the text to fix factual errors.
At the same time, it highlights evidence from the reference that supports both the correct facts in the text and any suggested replacements.
In this demo, we took documents from several datasets and used LLMs to generate summaries for them.
We then used GenAudit to surface relevant evidence and suggest edits that fix factual errors.
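To make the workflow concrete, here is a minimal sketch of the kind of per-claim output such a fact-checking pass produces: each summary sentence is paired with evidence spans in the reference and, when the claim is wrong, a suggested replacement. The class and field names here are illustrative assumptions, not GenAudit's actual API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical data shapes for an audit result; names are illustrative,
# not the tool's real interface.

@dataclass
class EvidenceSpan:
    start: int  # character offset into the reference document
    end: int    # end offset (exclusive)

@dataclass
class ClaimAudit:
    claim: str                              # a sentence from the generated summary
    evidence: list                          # EvidenceSpan list supporting (or refuting) the claim
    suggested_edit: Optional[str] = None    # replacement text, if the claim needs fixing

reference = "The meeting was held on Tuesday in Berlin."

# A summary sentence with a factual error ("Monday"), its supporting
# evidence in the reference, and the suggested corrected sentence.
audit = ClaimAudit(
    claim="The meeting was held on Monday in Berlin.",
    evidence=[EvidenceSpan(start=0, end=42)],
    suggested_edit="The meeting was held on Tuesday in Berlin.",
)

# Highlighted evidence is recovered by slicing the reference document.
highlights = [reference[s.start:s.end] for s in audit.evidence]
```

A real system would additionally score each evidence span and leave `suggested_edit` unset for claims the reference already supports.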
The documents used in this demo are taken from the following datasets:
XSUM (news articles) - https://huggingface.co/datasets/EdinburghNLP/xsum
ACIBENCH (clinical visit transcripts) - https://github.com/wyim/aci-bench
REDDIT-TIFU (social media posts) - https://huggingface.co/datasets/reddit_tifu
The models used to generate the summaries are:
Llama-7B - https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
Llama-70B - https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
Mistral-7B - https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
Falcon-7B - https://huggingface.co/tiiuae/falcon-7b-instruct
Flan-UL2 - https://huggingface.co/google/flan-ul2
Gemini-pro - https://blog.google/technology/ai/google-gemini-ai/
GPT-3.5-turbo - https://platform.openai.com/docs/models/gpt-3-5-turbo (version gpt-3.5-turbo-16k-0613)
GPT-4 - https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo (gpt-4-0613)