---
tags:
- Webscraping
- NLP
- FactChecking
- ContextChecking
- ExplanationGeneration
- AES
---

You can combine web scraping with NLP by building a small “fact‑checking” pipeline: extract a claim, scrape relevant pages, then use retrieval and natural‑language inference models to decide whether the web evidence supports or contradicts the claim.[aclanthology+1](https://aclanthology.org/2025.knowledgenlp-1.26/)

## Overall idea

- Turn the text you want to verify into one or more clear, atomic claims (e.g. “X happened on date Y”).[mbzuai+1](https://mbzuai.ac.ae/news/new-resources-for-fact-checking-llms-presented-at-emnlp/)
- Use web scraping or search APIs to collect pages from trusted sources related to the claim.
- Use NLP models to rank, filter, and compare those pages against the claim, then output a verdict such as supported / contradicted / not enough info.[sciencedirect+1](https://www.sciencedirect.com/science/article/abs/pii/S0952197625002842)

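The three stages above can be sketched as composable functions. Everything here is an illustrative placeholder (the names `extract_claims`, `gather_evidence`, `judge`, and `Verdict` are not from any library); each stub would be backed by the real components described in the steps below.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    claim: str
    label: str                      # "supported" | "contradicted" | "nei"
    evidence: list = field(default_factory=list)

def extract_claims(text: str) -> list[str]:
    # Placeholder: a real system uses a claim-extraction model here (Step 1).
    return [s.strip() for s in text.split(".") if s.strip()]

def gather_evidence(claim: str) -> list[str]:
    # Placeholder: a real system scrapes search results here (Steps 2-3).
    return []

def judge(claim: str, evidence: list[str]) -> Verdict:
    # Placeholder: a real system runs NLI over (evidence, claim) pairs (Step 4).
    label = "nei" if not evidence else "supported"
    return Verdict(claim, label, evidence)

def fact_check(text: str) -> list[Verdict]:
    # One verdict per extracted claim.
    return [judge(c, gather_evidence(c)) for c in extract_claims(text)]
```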
## Step 1: Extract and normalize claims

Use NLP to make the input “checkable”:

- Sentence splitting and claim extraction to break long text into short, single‑fact claims.[aclanthology+1](https://aclanthology.org/2025.emnlp-main.1615/)
- Optionally generate clarifying questions about the claim (who, when, where) to guide retrieval and disambiguation.[aclanthology+1](https://aclanthology.org/2025.knowledgenlp-1.26/)

Useful keywords/approaches: “atomic fact decomposition”, “checkworthy claim detection”, and “JEDI fact decomposition for NLI”.[github+1](https://github.com/Cartus/Automated-Fact-Checking-Resources)

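A minimal sketch of the splitting-and-filtering step. A real pipeline would use a claim-extraction model instead of this heuristic; the regex and the 4-word threshold below are illustrative choices, not from any of the cited systems.

```python
import re

def split_into_claims(text: str) -> list[str]:
    """Naive claim extraction: split on sentence boundaries and keep
    declarative sentences long enough to state a checkable fact."""
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    claims = []
    for s in sentences:
        s = s.strip()
        # Skip questions and fragments too short to be a single-fact claim.
        if s and not s.endswith("?") and len(s.split()) >= 4:
            claims.append(s)
    return claims
```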
## Step 2: Web scraping for evidence

Use web scraping or search to gather potential evidence:

- Formulate search queries from the claim (or the generated questions), then fetch top results from selected domains (news, Wikipedia, official sites). FIRE and other systems iteratively refine queries until they get good evidence.[aclanthology+1](https://aclanthology.org/2025.findings-naacl.158.pdf)
- Parse the pages with BeautifulSoup, Firecrawl, or similar; DEFAME is one example that combines automated scraping with fact‑checking.[edam+1](https://edam.org.tr/Uploads/Yukleme_Resim/pdf-28-08-2023-23-40-14.pdf)

Key points:

- Restrict to trusted / whitelisted domains to reduce noise.
- Store page title, URL, and cleaned text paragraphs for later scoring.

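A stdlib-only sketch of the cleaning and whitelisting step. In practice you would fetch pages with `requests` and parse with BeautifulSoup or Firecrawl; this version uses `html.parser` just to show the title/paragraph extraction, and the domain whitelist is illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

TRUSTED_DOMAINS = {"en.wikipedia.org", "www.reuters.com"}  # illustrative whitelist

def is_trusted(url: str) -> bool:
    # Keep only evidence pages from whitelisted hosts.
    return urlparse(url).netloc in TRUSTED_DOMAINS

class PageExtractor(HTMLParser):
    """Minimal extractor for <title> text and <p> paragraphs."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buf).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._tag = None

def extract_page(html: str):
    parser = PageExtractor()
    parser.feed(html)
    return parser.title, parser.paragraphs
```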
## Step 3: Retrieve and rank relevant snippets

Rather than feeding whole pages to the model, use retrieval to find the most relevant passages:

- Split pages into passages (e.g. 2–3 sentences) and compute embeddings (e.g. sentence‑transformers) to rank them by similarity to the claim.[acm+1](https://dl.acm.org/doi/10.1145/3477495.3531827)
- Generative retrieval (GERE) and dual‑stage BM25 + dense retrieval (as in Fathom) are standard patterns you can replicate.[aclanthology+1](https://aclanthology.org/anthology-files/pdf/fever/2025.fever-1.20.pdf)

This gives you a small set of top‑k evidence snippets that actually talk about the claim.

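A deliberately naive bag-of-words sketch of the ranking step. Swapping the token-count vectors for sentence-transformers embeddings (or BM25 followed by dense re-ranking) keeps the same shape: vectorize, score against the claim, take top-k.

```python
import math
from collections import Counter

def tokenize(text: str) -> Counter:
    # Bag-of-words vector; a real pipeline would use dense embeddings here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_passages(claim: str, passages: list[str], k: int = 3) -> list[str]:
    # Rank all candidate passages by similarity to the claim, keep the top k.
    query = tokenize(claim)
    ranked = sorted(passages, key=lambda p: cosine(query, tokenize(p)), reverse=True)
    return ranked[:k]
```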
## Step 4: Compare claim vs evidence with NLP

Use natural‑language inference (NLI) or fact‑checking models:

- Treat the claim as the hypothesis and each evidence snippet as the premise, and run an NLI model to classify each pair as supported / contradicted / unrelated.[arxiv+1](https://arxiv.org/html/2407.18367v1)
- Aggregate scores across snippets to produce a final label such as supported, contradicted, or not enough info (NEI), similar to FEVER‑style fact‑checking.[arxiv+1](https://arxiv.org/abs/2110.14532)

Patterns you can adapt:

- VERITAS‑NLI: scrapes news articles and uses NLI models to verify headlines in real time.[sciencedirect](https://www.sciencedirect.com/science/article/abs/pii/S0952197625002842)
- ClaimCheck and FIRE: decompose claims, retrieve web evidence, then use smaller LMs/NLI models to derive a verdict.[aclanthology+1](https://aclanthology.org/2025.findings-naacl.158.pdf)

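Assuming an NLI model has already labeled each (evidence, claim) pair, the aggregation into a FEVER-style verdict might look like this. The label names, confidence values, and the 0.7 threshold are illustrative; only the aggregation logic is shown, not the model call.

```python
def aggregate_verdict(pair_labels: list[tuple[str, float]],
                      threshold: float = 0.7) -> str:
    """Aggregate per-snippet NLI results into a final verdict.

    `pair_labels` holds (label, confidence) for each (premise=evidence,
    hypothesis=claim) pair, as produced by an NLI classifier.
    """
    # Sum confidence mass for each side of the question.
    support = sum(c for label, c in pair_labels if label == "entailment")
    contra = sum(c for label, c in pair_labels if label == "contradiction")
    # If neither side accumulates enough confidence, abstain (NEI).
    if max(support, contra) < threshold:
        return "not enough info"
    return "supported" if support > contra else "contradicted"
```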
## Step 5: Produce explanation and handle uncertainty

For practical use, you should return more than just a label:

- Return a short natural‑language explanation plus top evidence snippets and URLs (e.g. “According to [source], X happened in 2022, not 2020”). Systems like FacTeR‑Check and explanation‑oriented fact‑checking pipelines use this pattern.[sciencedirect+1](https://www.sciencedirect.com/science/article/pii/S0952197624016506)
- If evidence is weak or conflicting, mark the result as “uncertain / needs human review” instead of forcing a yes/no answer.[arxiv+1](https://arxiv.org/html/2412.15189v3)

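A sketch of the user-facing output, assuming the verdict and the ranked (snippet, URL) evidence come from the earlier steps. The field names and template are illustrative; real systems often generate the explanation with a language model conditioned on the evidence.

```python
def explain(claim: str, verdict: str, evidence: list[tuple[str, str]]) -> dict:
    """Build a user-facing result: label, short explanation, and sources.

    `evidence` holds (snippet, url) pairs already ranked by relevance.
    """
    # Weak or missing evidence: abstain rather than force a yes/no answer.
    if verdict == "not enough info" or not evidence:
        return {
            "claim": claim,
            "label": "uncertain / needs human review",
            "explanation": "Evidence was weak or conflicting.",
            "sources": [url for _, url in evidence],
        }
    # Otherwise cite the top-ranked snippet and list all source URLs.
    snippet, url = evidence[0]
    return {
        "claim": claim,
        "label": verdict,
        "explanation": f'According to {url}: "{snippet}"',
        "sources": [url for _, url in evidence],
    }
```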
## References

1. [https://aclanthology.org/2025.knowledgenlp-1.26/](https://aclanthology.org/2025.knowledgenlp-1.26/)
2. [https://www.sciencedirect.com/science/article/abs/pii/S0952197625002842](https://www.sciencedirect.com/science/article/abs/pii/S0952197625002842)
3. [https://mbzuai.ac.ae/news/new-resources-for-fact-checking-llms-presented-at-emnlp/](https://mbzuai.ac.ae/news/new-resources-for-fact-checking-llms-presented-at-emnlp/)
4. [https://aclanthology.org/2025.emnlp-main.1615/](https://aclanthology.org/2025.emnlp-main.1615/)
5. [https://github.com/Cartus/Automated-Fact-Checking-Resources](https://github.com/Cartus/Automated-Fact-Checking-Resources)
6. [https://aclanthology.org/2025.findings-naacl.158.pdf](https://aclanthology.org/2025.findings-naacl.158.pdf)
7. [https://aclanthology.org/anthology-files/pdf/fever/2025.fever-1.20.pdf](https://aclanthology.org/anthology-files/pdf/fever/2025.fever-1.20.pdf)
8. [https://edam.org.tr/Uploads/Yukleme_Resim/pdf-28-08-2023-23-40-14.pdf](https://edam.org.tr/Uploads/Yukleme_Resim/pdf-28-08-2023-23-40-14.pdf)
9. [https://github.com/multimodal-ai-lab/DEFAME](https://github.com/multimodal-ai-lab/DEFAME)
10. [https://dl.acm.org/doi/10.1145/3477495.3531827](https://dl.acm.org/doi/10.1145/3477495.3531827)
11. [https://arxiv.org/html/2407.18367v1](https://arxiv.org/html/2407.18367v1)
12. [https://arxiv.org/abs/2110.14532](https://arxiv.org/abs/2110.14532)
13. [https://www.sciencedirect.com/science/article/pii/S0952197624016506](https://www.sciencedirect.com/science/article/pii/S0952197624016506)
14. [https://arxiv.org/html/2412.15189v3](https://arxiv.org/html/2412.15189v3)
15. [https://www.biz4group.com/blog/developing-ai-automated-fact-checking-system](https://www.biz4group.com/blog/developing-ai-automated-fact-checking-system)
16. [https://download.hrz.tu-darmstadt.de/pub/FB20/Dekanat/Publikationen/AIPHES/Andreas_Hanselowski_NIPS-WPOC-2017.pdf](https://download.hrz.tu-darmstadt.de/pub/FB20/Dekanat/Publikationen/AIPHES/Andreas_Hanselowski_NIPS-WPOC-2017.pdf)
17. [https://journals.itb.ac.id/index.php/jictra/article/view/24157](https://journals.itb.ac.id/index.php/jictra/article/view/24157)
18. [https://search.gesis.org/research_data/SDN-10.7802-2469](https://search.gesis.org/research_data/SDN-10.7802-2469)
19. [https://arxiv.org/abs/2504.18376](https://arxiv.org/abs/2504.18376)
20. [https://ui.adsabs.harvard.edu/abs/arXiv:2409.00061](https://ui.adsabs.harvard.edu/abs/arXiv:2409.00061)