Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language Models

Journal of Chemical Information and Modeling, Article, 2025
DOI: 10.1021/acs.jcim.5c00612
  • Department of Materials Design and Innovation, University at Buffalo, Buffalo, 14260−1660, NY, United States
Citations: 1 (70th percentile)
FWCI: 0.69

Abstract

Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge and yet often generate plausible but incorrect (“hallucinated”) information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ELMo with the contextual depth of GPT. We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure-matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark data sets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. 
This approach increases precision at the expense of recall, a trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount. As a case study, the curated data set is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche data sets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated data sets.

© 2025 The Authors. Published by American Chemical Society
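
The intersection step, the hold-out evaluation, and the entropy-based confidence measure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the lowercase/whitespace normalization rule, the base-2 logarithm, and the solvent lists are all assumptions made for the example, and none of the values are results from the paper.

```python
import math


def normalize(name: str) -> str:
    """Hypothetical normalization rule: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def intersect_solvents(extractive, generative):
    """Keep only solvents reported by both methods, after normalization."""
    return {normalize(s) for s in extractive} & {normalize(s) for s in generative}


def precision_recall(predicted, reference):
    """Precision and recall of a predicted set against a manually verified set."""
    predicted = {normalize(s) for s in predicted}
    reference = {normalize(s) for s in reference}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def shannon_entropy(probs):
    """Shannon entropy (in bits, an assumed unit) of a class probability vector."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    # Illustrative inputs only; not data from the paper.
    bert_elmo_output = ["DMF", "DMSO", "Ethanol"]
    gpt_output = ["dmf", "dmso ", "toluene"]
    held_out_reference = ["dmf", "dmso", "ethanol", "gbl"]

    consensus = intersect_solvents(bert_elmo_output, gpt_output)
    print(sorted(consensus))
    print(precision_recall(consensus, held_out_reference))

    # A near-uniform prediction has high entropy (low confidence);
    # a peaked prediction has low entropy (high confidence).
    print(shannon_entropy([0.5, 0.5]))
    print(shannon_entropy([0.99, 0.01]))
```

In this toy example the intersection drops every solvent that only one method reports, which is why precision can rise while recall falls, matching the trade-off the abstract describes.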

Indexed keywords

MeSH

Calcium Compounds; Oxides; Solvents; Titanium; Uncertainty

Engineering controlled terms

Data accuracy; Data consistency; Data curation; Data reliability; Deep learning; Extraction; Forecasting; Knowledge management; Learning systems; Perovskite; Solvents; Uncertainty analysis

EMTREE drug terms

calcium derivative; oxide; perovskite; solvent; titanium

Engineering uncontrolled terms

American Chemical Society; Automated data; Contextual modeling; Data curation; Data quality; Data set; Excel; Language model; Normalisation; Uncertainty

EMTREE medical terms

chemistry; synthesis; uncertainty

Engineering main heading

Economic and social effects

Chemicals and CAS Registry Numbers

Unique identifiers assigned by the Chemical Abstracts Service (CAS) to ensure accurate identification and tracking of chemicals across scientific literature.

oxide: 16833-27-5
perovskite: 12194-71-7, 61027-03-0
titanium: 7440-32-6
Calcium Compounds

Funding details

Details about financial support for research, including funding sources and grant numbers as provided in academic publications.

Funding sponsor                             Funding number    Acronym
University at Buffalo                                         UB
CoRE center
Collaboratory for a Regenerative Economy
National Science Foundation                 2315307           NSF
National Science Foundation                                   NSF

Funding text

The authors acknowledge support from NSF Award No. 2315307 (NSF Engines Development Award) and from the Collaboratory for a Regenerative Economy (CoRE center) in the Department of Materials Design and Innovation, University at Buffalo.

Corresponding authors

Corresponding author: K. Rajan
Affiliation: Department of Materials Design and Innovation, University at Buffalo, Buffalo, 14260−1660, NY, United States
Email address: krajan3@buffalo.edu

© Copyright 2025 Elsevier B.V., All rights reserved.