Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language Models
Journal of Chemical Information and Modeling, Article, 2025
DOI: 10.1021/acs.jcim.5c00612
Mukherjee, Arpan (a); Giri, Deepesh (b); Rajan, Krishna (a)
(a) Department of Materials Design and Innovation, University at Buffalo, Buffalo, 142601660, NY, United States
Citations: 1 (70th percentile)
Field-Weighted Citation Impact (FWCI): 0.69
Abstract
Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge and yet often generate plausible but incorrect (“hallucinated”) information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ELMo with the contextual depth of GPT. We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure-matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark data sets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. 
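The intersection step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the synonym table, the function names (normalize, high_confidence, precision_recall), and the solvent lists are all hypothetical, chosen only to show how intersecting two normalized extraction outputs and scoring them against a manually verified reference set might look.

```python
# Hypothetical sketch: intersect the normalized outputs of two extraction
# methods and score the result against a manually verified reference set.
# The synonym table and all solvent lists are made-up illustration data.

SYNONYMS = {
    "dmf": "n,n-dimethylformamide",
    "dimethylformamide": "n,n-dimethylformamide",
    "dmso": "dimethyl sulfoxide",
    "gbl": "gamma-butyrolactone",
}


def normalize(name: str) -> str:
    """Lowercase, collapse whitespace, and map known synonyms to one label."""
    key = " ".join(name.lower().split())
    return SYNONYMS.get(key, key)


def high_confidence(method_a, method_b):
    """Return the solvents that both methods report, after normalization."""
    return {normalize(x) for x in method_a} & {normalize(x) for x in method_b}


def precision_recall(predicted, reference):
    """Precision and recall of a predicted set against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    extractive = ["DMF", "DMSO", "toluene"]  # e.g. output of the first method
    generative = ["dimethylformamide", "Dimethyl sulfoxide", "GBL", "ethanol"]
    reference = {normalize(x) for x in ["DMF", "DMSO", "GBL"]}

    both = high_confidence(extractive, generative)
    either = {normalize(x) for x in extractive + generative}
    print("intersection:", precision_recall(both, reference))
    print("union:", precision_recall(either, reference))
```

With these made-up lists the intersection keeps two solvents (precision 1.0, recall 2/3), while the union keeps five (precision 0.6, recall 1.0), which mirrors the precision-for-recall trade-off the abstract describes.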
This approach increases precision at the expense of recall, a trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount. As a case study, the curated data set is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche data sets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated data sets. © 2025 The Authors. Published by American Chemical Society
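The abstract does not say how the Shannon entropy is computed, so the following Python sketch rests on an assumption: that the entropy is taken over the classifier's predicted class probabilities, H = -sum_i p_i log2(p_i). The function names, the 0.9-bit threshold, and the placeholder probabilities are hypothetical and are not results from the paper.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Shannon entropy H = -sum(p * log(p)) of a discrete distribution."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (the limit of p*log(p) is 0).
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def flag_uncertain(predictions, threshold=0.9):
    """Return names whose binary-prediction entropy meets the threshold.

    `predictions` maps a solvent name to the predicted probability of the
    positive class; `threshold` (in bits) is an arbitrary illustrative cut.
    """
    flagged = []
    for name, p_positive in predictions.items():
        entropy = shannon_entropy([p_positive, 1.0 - p_positive])
        if entropy >= threshold:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Placeholder names and probabilities, not values from the paper.
    example = {"solvent_A": 0.98, "solvent_B": 0.55, "solvent_C": 0.10}
    for name, p in example.items():
        print(name, round(shannon_entropy([p, 1.0 - p]), 3))
    print("needs further validation:", flag_uncertain(example))
```

For a binary prediction the entropy is largest (1 bit) at a probability of 0.5 and falls to 0 as the prediction approaches 0 or 1, so a high value marks a solvent the model cannot confidently classify.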
Indexed keywords
MeSH
Calcium Compounds; Oxides; Solvents; Titanium; Uncertainty
Engineering controlled terms
Data accuracy; Data consistency; Data curation; Data reliability; Deep learning; Extraction; Forecasting; Knowledge management; Learning systems; Perovskite; Solvents; Uncertainty analysis
EMTREE drug terms
calcium derivative; oxide; perovskite; solvent; titanium
Engineering uncontrolled terms
American Chemical Society; Automated data; Contextual modeling; Data curation; Data quality; Data set; Excel; Language model; Normalisation; Uncertainty
EMTREE medical terms
chemistry; synthesis; uncertainty
Engineering main heading
Economic and social effects
Reaxys Chemistry database information
Reaxys is designed to support chemistry researchers at every stage with the ability to investigate chemistry-related research topics in peer-reviewed literature, patents, and substance databases. Reaxys retrieves substances, substance properties, and reaction and synthesis data.
Substances
4-butanolide
Chemicals and CAS Registry Numbers
Unique identifiers assigned by the Chemical Abstracts Service (CAS) to ensure accurate identification and tracking of chemicals across scientific literature.
oxide 16833-27-5
perovskite 12194-71-7, 61027-03-0
titanium 7440-32-6
Calcium Compounds
Funding details
Details about financial support for research, including funding sources and grant numbers as provided in academic publications.
Funding sponsor | Funding number | Acronym
University at Buffalo | - | UB
Col-laboratory for a Regenerative Economy | - | CoRE center
National Science Foundation | 2315307 | NSF
National Science Foundation | - | NSF
Funding text
The authors acknowledge support from NSF Award No. 2315307: NSF Engines Development Award and the Col-laboratory for a Regenerative Economy (CoRE center) in the Department of Materials Design and Innovation - University at Buffalo.
Corresponding authors
Corresponding author: K. Rajan
Affiliation: Department of Materials Design and Innovation, University at Buffalo, Buffalo, 142601660, NY, United States
Email address: krajan3@buffalo.edu
© Copyright 2025 Elsevier B.V., All rights reserved.