Theses and Dissertations from DePaul University

Date of Award

11-2025

Degree Type

Thesis

Degree Name

Master of Science (MS)

College

College of Computing and Digital Media

First Advisor

Roselyne R.T. Tchoua

Abstract

The scientific literature continues to expand rapidly, making manual extraction of structured scientific facts increasingly impractical. Traditional Machine Learning and Natural Language Processing (NLP) pipelines require large expert-annotated datasets, which are costly to produce. Newer Large Language Models (LLMs) struggle with long-context scientific reasoning, hallucination, and entity linking. This thesis investigates ELSIE-Blob, a domain-aware preprocessing method that segments scientific articles into compact text “blobs” containing the components of an entity relation (here, polymer names, melting point indicators, and numerical values). We test whether blob-based input allows lightweight LLMs that run on consumer hardware to extract polymer–melting point (polymer–Tm) pairs accurately without training data. Experiments on 50 polymer science research articles show that ELSIE-Blob reduces the text to ~2% of the original content while improving information density. With blob inputs, open-source models (Mistral-7B, Llama-3-8B, and Phi-3-3B) achieve up to 73.13% blob-level accuracy and extract up to 49 polymer–Tm pairs without supervision. Without blobs, no model successfully extracted structured relations from full-text articles. This work demonstrates that “less is more” for LLM-based scientific Information Extraction (IE) and highlights the role of domain-aware preprocessing in reducing hallucination, attention dilution, and computational cost.
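
To make the blob idea concrete, the following is a minimal illustrative sketch and not the ELSIE-Blob implementation, whose rules the abstract does not specify. It assumes a blob can be approximated as a short run of sentences in which a polymer-name cue, a melting point indicator, and a temperature value co-occur. The regular expressions, the `window` parameter, and the function name `extract_blobs` are all hypothetical choices made for illustration.

```python
import re

# Hypothetical cue patterns (assumptions, not taken from the thesis).
POLYMER = re.compile(r"\b[Pp]oly(?!mer)[\w()\-]+|\b(?:PET|PLA|PCL|PEO)\b")
TM_CUE = re.compile(r"\bTm\b|\b[Mm]elting (?:point|temperature)\b")
VALUE = re.compile(r"-?\d+(?:\.\d+)?\s?(?:°\s?C|K)\b")


def extract_blobs(text, window=2):
    """Return short sentence runs that contain all three cue types."""
    # Naive sentence splitting on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    blobs = []
    i = 0
    while i < len(sentences):
        # Try the smallest span first, then grow up to `window` sentences.
        for size in range(1, window + 1):
            candidate = " ".join(sentences[i:i + size])
            if all(p.search(candidate) for p in (POLYMER, TM_CUE, VALUE)):
                blobs.append(candidate)
                i += size  # skip past the sentences already used
                break
        else:
            i += 1  # no blob starts here; move on
    return blobs


if __name__ == "__main__":
    article = (
        "Plastics are widely used in packaging. "
        "Poly(ethylene terephthalate) was prepared by a standard route. "
        "Its melting point was 265 °C. "
        "The samples were stored in the dark."
    )
    blobs = extract_blobs(article)
    kept = sum(len(b) for b in blobs) / len(article)
    print(blobs)
    print(f"fraction of text kept: {kept:.2f}")
```

Each blob found this way would be passed to the LLM in place of the full article. A real pipeline would need far more robust polymer-name recognition than these toy patterns provide.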
