Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskites via Language Models
- Department of Materials Design and Innovation, University at Buffalo, Buffalo, 14260-1660, NY, United States
Abstract
Automated data curation for niche scientific topics, where data quality and contextual accuracy are paramount, poses significant challenges. Bidirectional contextual models such as BERT and ELMo excel in contextual understanding and determinism. However, they are constrained by their narrower training corpora and inability to synthesize information across fragmented or sparse contexts. Conversely, autoregressive generative models like GPT can synthesize dispersed information by leveraging broader contextual knowledge and yet often generate plausible but incorrect (“hallucinated”) information. To address these complementary limitations, we propose an ensemble approach that combines the deterministic precision of BERT/ELMo with the contextual depth of GPT. We have developed a hierarchical knowledge extraction framework to identify perovskites and their associated solvents in perovskite synthesis, progressing from broad topics to narrower details using two complementary methods. The first method leverages deterministic models like BERT/ELMo for precise entity extraction, while the second employs GPT for broader contextual synthesis and generalization. Outputs from both methods are validated through structure-matching and entity normalization, ensuring consistency and traceability. In the absence of benchmark data sets for this domain, we hold out a subset of papers for manual verification to serve as a reference set for tuning the rules for entity normalization. This enables quantitative evaluation of model precision, recall, and structural adherence while also providing a grounded estimate of model confidence. By intersecting the outputs from both methods, we generate a list of solvents with maximum confidence, combining precision with contextual depth to ensure accuracy and reliability. 
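The intersection step and its evaluation against a manually verified reference set can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `normalize` helper, its synonym table, and the example solvent lists are hypothetical and are not taken from the authors' pipeline.

```python
def normalize(name: str) -> str:
    """Hypothetical entity normalization: lowercase, collapse whitespace, map synonyms."""
    synonyms = {
        "dmf": "n,n-dimethylformamide",
        "dimethylformamide": "n,n-dimethylformamide",
        "dmso": "dimethyl sulfoxide",
    }
    key = " ".join(name.lower().split())
    return synonyms.get(key, key)


def precision_recall(predicted: set, reference: set) -> tuple:
    """Precision and recall of a predicted solvent set against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Illustrative (made-up) outputs of the two extraction methods for one paper.
bert_out = ["DMF", "DMSO", "Toluene"]
gpt_out = ["dimethylformamide", "dmso", "ethanol", "chlorobenzene"]

# Illustrative (made-up) manually verified reference set, already normalized.
reference = {
    "n,n-dimethylformamide",
    "dimethyl sulfoxide",
    "toluene",
    "chlorobenzene",
}

bert_set = {normalize(s) for s in bert_out}
gpt_set = {normalize(s) for s in gpt_out}

consensus = bert_set & gpt_set  # solvents both methods agree on
union = bert_set | gpt_set      # solvents either method reports

print("consensus:", precision_recall(consensus, reference))
print("union:    ", precision_recall(union, reference))
```

In this toy case the consensus set scores higher on precision and lower on recall than the union, since the one spurious entry (ethanol) is reported by only one method.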
This approach increases precision at the expense of recall, a trade-off we accept given that, in high-trust scientific applications, minimizing hallucinations is often more critical than achieving full coverage, especially when downstream reliability is paramount. As a case study, the curated data set is used to predict the endocrine-disrupting (ED) potential of solvents with a pretrained deep learning model. Recognizing that machine learning models may not be trained on niche data sets such as perovskite-related solvents, we have quantified epistemic uncertainty using Shannon entropy. This measure evaluates the confidence of the ML model predictions, independent of uncertainties in the NLP-based data curation process, and identifies high-risk solvents requiring further validation. Additionally, the manual verification pipeline addresses ethical considerations around trust, structure, and transparency in AI-curated data sets.
© 2025 The Authors. Published by American Chemical Society
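As a rough illustration of entropy-based screening, the sketch below computes the Shannon entropy of a binary prediction and flags solvents whose entropy exceeds a cutoff. It assumes the classifier returns one probability of ED activity per solvent; the solvent names, probabilities, and the `THRESHOLD` value are hypothetical, and the abstract does not specify how the authors derive their probabilities or set a cutoff.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully certain prediction carries no entropy
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Hypothetical predicted probabilities of ED activity (not real model output).
predictions = {
    "solvent_A": 0.97,
    "solvent_B": 0.55,
    "solvent_C": 0.08,
}

THRESHOLD = 0.8  # assumed cutoff in bits; would need tuning in practice

# Solvents whose predictions are too uncertain to trust without follow-up.
flagged = [
    name for name, p in predictions.items() if binary_entropy(p) > THRESHOLD
]

for name, p in predictions.items():
    print(f"{name}: p={p:.2f}, entropy={binary_entropy(p):.3f} bits")
print("needs further validation:", flagged)
```

Entropy peaks at 1 bit when the predicted probability is 0.5 and falls to 0 as the prediction approaches either class, so only `solvent_B` is flagged here.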
Indexed keywords
- MeSH
Calcium Compounds; Oxides; Solvents; Titanium; Uncertainty
- Engineering controlled terms
Data accuracy; Data consistency; Data curation; Data reliability; Deep learning; Extraction; Forecasting; Knowledge management; Learning systems; Perovskite; Solvents; Uncertainty analysis
- EMTREE drug terms
calcium derivative; oxide; perovskite; solvent; titanium
- Engineering uncontrolled terms
American Chemical Society; Automated data; Contextual modeling; Data curation; Data quality; Data set; Excel; Language model; Normalisation; Uncertainty
- EMTREE medical terms
chemistry; synthesis; uncertainty
- Engineering main heading
Economic and social effects
Reaxys Chemistry database information
Chemicals and CAS Registry Numbers
Unique identifiers assigned by the Chemical Abstracts Service (CAS) to ensure accurate identification and tracking of chemicals across scientific literature.
| Chemical | CAS Registry Number |
|---|---|
| oxide | 16833-27-5 |
| perovskite | 12194-71-7, 61027-03-0 |
| titanium | 7440-32-6 |
| Calcium Compounds | |
Funding details
Details about financial support for research, including funding sources and grant numbers as provided in academic publications.
| Funding sponsor | Funding number | Acronym |
|---|---|---|
| University at Buffalo | | UB |
| CoRE center | | |
| Collaboratory for a Regenerative Economy | | |
| National Science Foundation | 2315307 | NSF |
Funding text
The authors acknowledge support from NSF Award No. 2315307: NSF Engines Development Award and the Collaboratory for a Regenerative Economy (CoRE center) in the Department of Materials Design and Innovation - University at Buffalo.
Corresponding authors
| Corresponding author | K. Rajan |
| Affiliation | Department of Materials Design and Innovation, University at Buffalo, Buffalo, 14260-1660, NY, United States |
| Email address | krajan3@buffalo.edu |
© Copyright 2025 Elsevier B.V., All rights reserved.