\chapter{Introduction}
The amount of stored data has grown exponentially during the last decade: it reached 33 zettabytes in 2018 and is predicted by the IDC to grow to 175 zettabytes by 2025 \cite{Reinsel2018}. This growth leads to new challenges for data processing tasks in applications such as database systems \cite{Abadi2006} or Machine Learning \cite{Elgohary2016}. Compressing the data and thereby reducing its physical size has therefore become a crucial step for query execution and analytical tasks. In database systems, column-organized data is usually encoded as sequences of integers \cite{Stonebraker}. Hence, query processing operates directly on these integer sequences. Compressing the integer values prior to query processing leads to better performance in terms of compression rate and ratio \cite{Woltmann2021}.

Although a large variety of lightweight integer compression algorithms exists, there is no single best one \cite{Damme2017, Damme2019}, as they all behave differently depending on the input data and the hardware properties. Additionally, integer compression algorithms can be initialized with certain parameters, such as the input format, the output format, or the maximum bit width of the data being compressed. This parameterization also influences the behavior of the algorithm. Selecting the best combination of compression algorithm and parameters therefore requires a suitable selection strategy. Several approaches have been proposed to solve this task. On the one hand, rule-based strategies traverse a decision tree that leads to the best-fitting algorithm \cite{Abadi2006}. Cost-based approaches, on the other hand, define a cost function for each algorithm and choose the one with the lowest cost \cite{Damme2019}.
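To illustrate how a cost-based strategy interacts with a parameterization such as the maximum bit width, consider the following minimal sketch. The cost model (fixed-width bit packing with 32-bit exceptions for values that do not fit) is purely hypothetical and not taken from any of the cited systems; it only shows the principle of choosing the candidate with the lowest estimated cost:

```python
# Hypothetical cost model: every value is stored with `bit_width` bits;
# values that do not fit are stored as 32-bit exceptions plus the marker.
def bitpack_cost(data, bit_width):
    cost = 0
    for v in data:
        cost += bit_width if v < (1 << bit_width) else 32 + bit_width
    return cost

# Cost-based selection: evaluate the cost function for each candidate
# parameterization and pick the cheapest one.
def select_parameterization(data, candidates):
    return min(candidates, key=lambda bw: bitpack_cost(data, bw))

data = [3, 7, 1, 500, 2, 6]  # mostly small values, one outlier
best = select_parameterization(data, candidates=[4, 8, 16])
# A small bit width wins here: one 36-bit exception is cheaper than
# widening every value to 16 bits.
```

The same argmin structure applies when the candidates are (algorithm, parameter) pairs rather than bit widths alone.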
Strategies based on Machine Learning do not require any knowledge about the algorithm behavior, since the algorithm is treated as a black box: the Machine Learning model learns the behavior from training data. For the algorithm selection task, Machine Learning approaches have been shown to yield the best results compared to cost-based and rule-based approaches \cite{Woltmann2021}.
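The black-box idea can be sketched with a toy learned selector. The features (mean value and maximum bit width of the data) and the training examples below are entirely synthetic and chosen for illustration; a 1-nearest-neighbour model stands in for the actual model of \cite{Woltmann2021}:

```python
import math

# Synthetic training data: (feature vector, best algorithm) pairs.
# The algorithm names are illustrative placeholders.
training = [
    ((5.0, 4), "bitpacking"),
    ((6.0, 4), "bitpacking"),
    ((900.0, 12), "delta+bitpacking"),
    ((1000.0, 12), "delta+bitpacking"),
]

def features(data):
    # Hypothetical feature vector: mean value and max bit width.
    return (sum(data) / len(data), max(data).bit_length())

def predict(data):
    # 1-nearest neighbour: pick the label of the closest training example.
    f = features(data)
    return min(training, key=lambda ex: math.dist(f, ex[0]))[1]

predict([4, 5, 6, 7])  # -> "bitpacking"
```

The point is that no cost function or decision rule is written by hand; the mapping from data characteristics to the best algorithm is learned entirely from examples.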
However, none of these strategies considers the best-fitting parameterization.
In this thesis, we propose an extension of the \emph{Learned Selection Strategy for Lightweight Integer Compression Algorithms} \cite{Woltmann2021} that additionally considers the best-fitting algorithm parameterization. For this, we describe the steps necessary to generate representative training data, derive features from it, create and tune the Machine Learning model, and evaluate the approach against a baseline. We show that a strategy which considers both the algorithm and its parameterization leads to better compression results than using the simplest algorithm with parameters covering the largest range of data.
The remainder of this thesis is structured in five chapters. Firstly, we describe general concepts of lightweight integer compression, algorithm selection, and Machine Learning in Chapter 2. Subsequently, we present the concepts and methods used for the data generation process, feature engineering, and hyperparameter tuning of the Machine Learning models in Chapter 3. In Chapter 4, we describe the specific implementation by presenting pipelines for the processes of data generation and validation. We evaluate the approach in Chapter 5, on the one hand using real-world data and on the other hand by comparing it against a baseline strategy. Finally, we conclude the thesis and give an outlook in Chapter 6.