arXiv:2305.14251 (cs)
[Submitted on 23 May 2023 (v1), last revised 11 Oct 2023 (this version, v2)]
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
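The metric described above reduces to a simple ratio: split a generation into atomic facts, judge each one against a knowledge source, and report the supported fraction. The sketch below illustrates that computation only; the fact splitter and verifier are hypothetical stand-ins, not the actual `factscore` package API.

```python
# Minimal sketch of the FActScore idea: the score of a generation is the
# fraction of its atomic facts that a knowledge source supports.
# `is_supported` stands in for the paper's retrieval + LM verifier.
from typing import Callable, List

def factscore(
    atomic_facts: List[str],
    is_supported: Callable[[str], bool],
) -> float:
    """Return the fraction of atomic facts judged supported."""
    if not atomic_facts:
        return 0.0  # convention for an empty generation
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)

# Toy example with hand-labeled judgments in place of retrieval.
facts = [
    "Marie Curie was born in Warsaw.",    # supported
    "Marie Curie won two Nobel Prizes.",  # supported
    "Marie Curie was born in 1900.",      # unsupported
]
labels = {facts[0]: True, facts[1]: True, facts[2]: False}
score = factscore(facts, labels.get)  # 2 of 3 supported, i.e. 0.666...
```

In the paper this per-generation score is averaged over many generations (e.g. the reported 58% for ChatGPT on people biographies); the verifier itself is retrieval over the knowledge source plus a strong LM, which is what the automated estimator replaces human judgment with.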
Comments: 25 pages; 7 figures. Published as a main conference paper at EMNLP 2023. Code available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2305.14251 [cs.CL]
(or arXiv:2305.14251v2 [cs.CL] for this version)
|
||
|
||
https://doi.org/10.48550/arXiv.2305.14251
Submission history
From: Sewon Min
[v1] Tue, 23 May 2023 17:06:00 UTC (2,490 KB)
[v2] Wed, 11 Oct 2023 05:27:50 UTC (2,491 KB)