Compressing Relations and Indexes

Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft

Summarized by Eyal Shani

 

1. Introduction

Compressing relational databases is an important issue today. Performance, scalability, and administration are among the problems encountered in large databases such as data warehouses. Compression is a good way to deal with these problems: good compression reduces both I/O and the size of the data files, which leads to faster performance, better scalability, and easier administration.

This paper describes a new compression algorithm that differs from earlier algorithms in that it can decompress individual tuples and individual fields, rather than a full page at a time. This characteristic offers significant improvements in cases where a page is stored in the buffer pool and repeatedly probed for a few tuples: the page can be kept in compressed form in memory, and retrieving an individual tuple is extremely fast. The algorithm also supports very fast decompression, which is important when compared with I/O. Because decompression is much faster than I/O, performance improves overall: compressed pages require less I/O, and the decompression cost does not get in the way.

The algorithm achieves compression on numeric fields only; low- and medium-cardinality fields of other types can be handled by mapping their values to a small range of integers. The algorithm is therefore effective for records with many low- to medium-cardinality fields, such as the fact tables in decision support systems.

 

2. Compressing a Relation.

The algorithm exploits information that is common to the tuples on a page: it uses the range of values in each field to compress the data. This common information is called the frame of reference. Since compression is applied at the page level, the range of values covers only the data on that specific page. The degree of compression obtained by this page-level technique therefore depends greatly on the range of values in each field for the set of tuples stored on a page.

Consider a set S of points or rectangles, and collect from S the minima and maxima along every dimension. These minima and maxima provide a frame of reference F in which all the points/rectangles lie. For instance, if

S = {(511,1001),(517,1007),(514,1031)}

Then

F = [511,517] * [1001,1031]

The set of points S would then be represented using the following bit strings:

S = {(000,00000),(110,00110),(011,11110)}

 

Sometimes, it is sufficient to represent a point or rectangle by a bounding rectangle, e.g., in the index levels of an R-tree. In this case, we can reduce the number of bits required as much as we want by trading off precision. The idea is that we can use bits to represent equally spaced numbers within the frame of reference, thereby creating a uniform 'grid' whose coarseness depends on the number of bits used to represent 'cuts' along each dimension.

3. Compressing an Indexing Structure.

The compression technique works well for both R-trees and B-trees.

Compression within R-tree internal nodes: since the internal nodes of an R-tree simply store (hyper-rectangle, page pointer) pairs, the technique is used to compress the hyper-rectangle in each pair. As a result, the header of the internal node must now include the frame of reference for all the hyper-rectangles on the page.

Compression within R-tree leaves: the leaves store (point location, value) pairs, and the point location can be compressed by treating each dimension as a field.

Compression within B-trees: the technique retains the benefits of optimizations such as storing (key, rid-list) pairs and recognizing "runs" of rids. The key, separated into fields, can be compressed.

The paper also gives an algorithm that builds an index structure containing the data records in the leaf pages so as to achieve better compression. The closer the points stored on a page are to one another, the better the compression.

 

4. Performance Evaluation.

The paper tests the effectiveness of the compression technique in a variety of situations. I include here only one diagram, which I believe gives the best overall view.

The diagram shows the compression achieved with the sales data set (a common data warehouse data set). It shows that the compression depends heavily on the cardinality of the data and on the number of dimensions. With up to 6 dimensions, the number of tuples also visibly affects the compression.

5. Remarks.

As you can see, this algorithm is quite naive. I do not see how a relational database could apply it in the file management layer, as proposed in the paper; a DBA must control which tables and indexes are compressed and which are not. Tables with non-numeric or high-cardinality fields will suffer badly under this algorithm. Overall, though, it is an interesting idea: a good algorithm that enables fast decompression of individual tuples and fields can greatly improve the performance of an RDBMS.