arXiv:2305.14251 (cs)
[Submitted on 23 May 2023 (v1), last revised 11 Oct 2023 (this version, v2)]
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.
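At its core, the metric described in the abstract reduces to a supported-fact ratio: split a generation into atomic facts, judge each against the knowledge source, and report the supported fraction. A minimal sketch of that computation (this is not the authors' released `factscore` package; the example facts and support judgments below are hypothetical placeholders standing in for the retrieval-plus-LM verification step):

```python
def factscore(support_labels):
    """FActScore of one generation: the fraction of its atomic facts
    judged supported by a reliable knowledge source.

    `support_labels` is a list of booleans, one per atomic fact,
    as produced by some upstream verification step.
    """
    if not support_labels:
        return 0.0
    return sum(support_labels) / len(support_labels)

# Hypothetical example: a generated biography split into four
# atomic facts, three of which verification marked as supported.
labels = [True, True, False, True]
print(f"FActScore: {factscore(labels):.0%}")  # -> FActScore: 75%
```

In the paper's pipeline the boolean judgments come from retrieving passages from the knowledge source and asking a strong LM whether each atomic fact is supported; the final aggregation is this simple ratio, which is what makes the score fine-grained compared to a single binary factuality judgment.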
Comments: 25 pages; 7 figures. Published as a main conference paper at EMNLP 2023. Code available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2305.14251 [cs.CL]
(or arXiv:2305.14251v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2305.14251
Submission history
From: Sewon Min
[v1] Tue, 23 May 2023 17:06:00 UTC (2,490 KB)
[v2] Wed, 11 Oct 2023 05:27:50 UTC (2,491 KB)