Date of Award
11-2025
Degree Type
Thesis
Degree Name
Master of Science (MS)
College
College of Computing and Digital Media
First Advisor
Roselyne R.T. Tchoua
Abstract
The scientific literature continues to expand rapidly, making manual extraction of structured scientific facts increasingly impractical. Traditional Machine Learning and Natural Language Processing (NLP) pipelines require large expert-annotated datasets, which are costly to produce. More recent Large Language Models (LLMs) struggle with long-context scientific reasoning, hallucination, and entity linking. This thesis investigates ELSIE-Blob, a domain-aware preprocessing method that segments scientific articles into compact text “blobs” containing the components of an entity relation (here, polymer names, melting point indicators, and numerical values). We test whether blob-based input allows lightweight LLMs that run on consumer hardware to extract polymer–melting point (polymer–Tm) pairs accurately without training data. Experiments on 50 polymer science research articles show that ELSIE-Blob reduces the text to ~2% of the original content while increasing information density. With blob inputs, open-source models (Mistral-7B, Llama-3-8B, and Phi-3-3B) achieve up to 73.13% blob-level accuracy and extract up to 49 polymer–Tm pairs without supervision. Without blobs, no model successfully extracts structured relations from full-text articles. This work demonstrates that “less is more” for LLM-based scientific Information Extraction (IE) and highlights the role of domain-aware preprocessing in reducing hallucination, attention dilution, and computational cost.
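As a rough illustration of the blob idea described in the abstract, the Python sketch below keeps only short sentence windows that contain all three relation components: a polymer name, a melting point indicator, and a numerical temperature value. The regular expressions, the `extract_blobs` name, and the `window` parameter are assumptions made for this example; the abstract does not specify the actual ELSIE-Blob rules, and this is not the thesis's implementation.

```python
import re

# Hypothetical cue patterns (assumptions for illustration only).
# Melting point indicator: the phrase, or the symbol Tm / T_m.
TM_CUE = re.compile(r"(?i:melting (?:point|temperature))|\bT_?m\b")
# Numerical value followed by a temperature unit (°C or K).
VALUE = re.compile(r"-?\d+(?:\.\d+)?\s*(?:°\s*C|K)\b")
# Candidate polymer name: a word starting with "poly", excluding the
# generic words "polymer", "polymers", "polymerization", and so on.
POLYMER = re.compile(r"\b[Pp]oly(?!mer)[\w()\-]+")


def extract_blobs(text, window=1):
    """Return sentence windows that contain all three relation components.

    A window is centred on each sentence that mentions a melting point
    indicator and spans `window` sentences on either side. Overlapping
    windows are not merged in this simplified sketch.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    blobs = []
    for i, sentence in enumerate(sentences):
        if not TM_CUE.search(sentence):
            continue
        start = max(0, i - window)
        blob = " ".join(sentences[start:i + window + 1])
        # Keep the window only if a polymer name and a value are also present.
        if POLYMER.search(blob) and VALUE.search(blob):
            blobs.append(blob)
    return blobs


def retained_fraction(text, blobs):
    """Fraction of the original characters kept across all blobs."""
    return sum(len(b) for b in blobs) / len(text) if text else 0.0
```

In this sketch, each returned blob would be passed to the language model in place of the full article, and `retained_fraction` gives a simple character-level measure of how much of the original text is kept.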
Copyright
Copyright © 2026 Sameer Shaik
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Recommended Citation
Shaik, Sameer, "WHEN IT COMES TO SCIENTIFIC INFORMATION EXTRACTION AND LLMS, LESS IS MORE" (2025). Theses and Dissertations from DePaul University. 53.
https://via.library.depaul.edu/theses-dissertations/53