College of Computing and Digital Media Dissertations

Date of Award

Summer 7-20-2023

Degree Type


Degree Name

Master of Science (MS)


School of Computing

First Advisor

Tanu Malik, PhD

Second Advisor

Ashish Gehani, PhD

Third Advisor

Alexander Rasin, PhD


Reproducibility of applications is paramount in scenarios such as collaborative work and software testing. Containers offer an easy way to address reproducibility by packaging an application's software and data dependencies into one executable unit that can be run repeatedly in different environments. As container use has grown in both industry and academia, research has examined the provisioning and storage cost of containers and has shown that container deployments often include unnecessary software packages. Current methods for optimizing container size prune unnecessary data at the granularity of files and thus make a binary decision for each file: keep it or remove it. We show that such methods do not translate efficiently to scientific data files, where only a subset of the data may be accessed across several files. In this thesis, we propose a method that addresses the problem at the granularity of bytes. Instead of tracking which files are accessed, we track which portions of files are accessed, recorded as file offsets. This I/O lineage allows us to package only the relevant parts of data files, significantly reducing the storage and sharing cost of containers. Results show the generality of our method across different data formats and a reduction approximately equal to the amount of data fetched into memory.
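
The idea of recording accessed file offsets rather than accessed files can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how reads are captured, so the sketch assumes a simple wrapper around a Python file object. The names `TrackedReader`, `merge_ranges`, and `extract_accessed` are hypothetical and exist only for illustration.

```python
import io


class TrackedReader:
    """Hypothetical wrapper: records every byte range read from a binary file."""

    def __init__(self, raw):
        self.raw = raw
        self.ranges = []  # half-open (start, end) byte ranges that were read

    def seek(self, offset, whence=io.SEEK_SET):
        return self.raw.seek(offset, whence)

    def read(self, size=-1):
        start = self.raw.tell()
        data = self.raw.read(size)
        if data:
            # Record the offsets actually returned, not the requested size.
            self.ranges.append((start, start + len(data)))
        return data


def merge_ranges(ranges):
    """Coalesce overlapping or adjacent half-open ranges into a sorted list."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def extract_accessed(raw, ranges):
    """Return {offset: bytes} containing only the accessed parts of the file."""
    chunks = {}
    for start, end in merge_ranges(ranges):
        raw.seek(start)
        chunks[start] = raw.read(end - start)
    return chunks


# Usage: a 100-byte in-memory "data file" of which only 14 bytes are touched.
data = io.BytesIO(bytes(range(100)))
reader = TrackedReader(data)
reader.seek(10)
reader.read(5)   # bytes 10..14
reader.seek(12)
reader.read(8)   # bytes 12..19, overlapping the first read
reader.seek(90)
reader.read(4)   # bytes 90..93

kept = extract_accessed(data, reader.ranges)
kept_bytes = sum(len(chunk) for chunk in kept.values())
```

In this toy run the packaged payload holds 14 of the 100 bytes, whereas a file-granularity method would have to keep or drop the whole file. Restoring a file that the application can reopen at the original offsets is a separate problem that the sketch does not attempt.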


