Date of Award
Summer 7-20-2023
Degree Type
Thesis
Degree Name
Master of Science (MS)
School
School of Computing
First Advisor
Tanu Malik, PhD
Second Advisor
Ashish Gehani, PhD
Third Advisor
Alexander Rasin, PhD
Abstract
Reproducibility of applications is paramount in several scenarios, such as collaborative work and software testing. Containers provide an easy way of addressing reproducibility by packaging the application's software and data dependencies into one executable unit, which can be executed multiple times in different environments. With the increased use of containers in industry as well as academia, current research has examined the provisioning and storage cost of containers and has shown that container deployments often include unnecessary software packages. Current methods to optimize container size prune unnecessary data at the granularity of files and thus make a binary keep-or-discard decision for each file. We show that such methods do not translate efficiently to scientific data files, where only a subset of the data may be accessed across several files. In this thesis, we propose a method that addresses this problem at the granularity of bytes. Instead of keeping track of which files are accessed, we keep track of the portions of files accessed, in the form of file offsets. This I/O lineage allows us to package only the relevant parts of data files, significantly reducing the storage and sharing cost of containers. Results show that our method generalizes across different data formats and reduces the packaged data to approximately the amount of data fetched into memory.
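The idea of tracking accessed file offsets rather than whole files can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the class `OffsetLineage`, its methods, and the file name `data.bin` are hypothetical names introduced here, and the interposition layer that would observe the application's reads is assumed to exist and is not shown. The sketch only records (offset, length) pairs per file and merges them into the byte ranges that would need to be packaged.

```python
from collections import defaultdict


class OffsetLineage:
    """Hypothetical record of which byte ranges of each file were read."""

    def __init__(self):
        # path -> list of half-open (start, end) byte ranges, unmerged
        self._ranges = defaultdict(list)

    def record_read(self, path, offset, length):
        """Log one read of `length` bytes starting at `offset`."""
        if length > 0:
            self._ranges[path].append((offset, offset + length))

    def merged(self, path):
        """Return the sorted, non-overlapping ranges accessed in `path`."""
        result = []
        for start, end in sorted(self._ranges.get(path, [])):
            if result and start <= result[-1][1]:
                # Overlaps or touches the previous range: extend it.
                result[-1] = (result[-1][0], max(result[-1][1], end))
            else:
                result.append((start, end))
        return result

    def bytes_accessed(self, path):
        """Total number of distinct bytes read from `path`."""
        return sum(end - start for start, end in self.merged(path))


lineage = OffsetLineage()
lineage.record_read("data.bin", 0, 100)    # bytes 0..99
lineage.record_read("data.bin", 50, 100)   # bytes 50..149, overlaps the first
lineage.record_read("data.bin", 400, 50)   # bytes 400..449
print(lineage.merged("data.bin"))
print(lineage.bytes_accessed("data.bin"))
```

In this toy run, only the merged ranges would need to be retained when packaging `data.bin`; a file-granularity approach would have to keep the whole file as soon as any byte of it was read.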
Recommended Citation
Tikmany, Rohan, "Interposition based container optimization for data intensive applications" (2023). College of Computing and Digital Media Dissertations. 53.
https://via.library.depaul.edu/cdm_etd/53
Included in
Databases and Information Systems Commons, Numerical Analysis and Scientific Computing Commons