Selectivity Estimation in Spatial Databases

Swarup Acharya, Viswanath Poosala, Sridhar Ramaswamy

Summarized by: Eyal Flato

Introduction

The article describes a method to approximate the result of a selectivity query on geographic databases.

Spatial databases contain geographic information that consists of points, segments, polygons, polylines, etc. Such data is used mainly in Geographic Information Systems (GIS).

GISs are of great interest to both the scientific and commercial communities. There are several products available for maintaining geographic information (ESRI’s ARC-INFO, Intergraph’s MGE), and support for spatial data exists in the leading databases (Oracle, Informix).

The selectivity problem is a range query for which one needs to know the number of elements in the dataset that lie in a given range. In the spatial version of the query, the input is a 2-dimensional range and the result is the number of 2-dimensional elements that intersect this range. Since the data may be very large and traversing it may take a long time, an estimate of this number should be obtained quickly.

Selectivity estimation is useful for query optimizers, where an estimate of the result size is needed so the optimizer can run the query in the most efficient way. It can also be used to give the user feedback about the size of a query's result.

There are several approaches for answering selectivity estimation queries on relational databases. The main idea behind these approaches is to hold a compact summary of the data (such as a histogram or a random sample) and answer the selectivity query on the summary. These methods do not apply directly to spatial databases. Some differences can be identified between regular data and spatial data: (i) the elements may differ in shape and size; (ii) the distribution of frequencies does not vary dramatically (not many spatial elements cover the same point); (iii) the values (locations in space) are spread non-uniformly.

The problem is simplified as follows: the elements and the query are axis-parallel rectangles in the plane. The selectivity of a query rectangle is the number of elements from the database that intersect it.
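For concreteness, the exact (non-approximate) version of this query can be sketched as a linear scan; the rectangle representation and function names below are illustrative, not from the paper:

```python
def intersects(a, b):
    """Closed intersection test for axis-parallel rectangles (x1, y1, x2, y2)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def exact_selectivity(rects, query):
    """Exact selectivity: count the rectangles intersecting the query.
    Linear in the dataset size, which is what estimation tries to avoid."""
    return sum(intersects(r, query) for r in rects)
```

This is the baseline that all the approximation techniques below are measured against.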

Simple techniques for selectivity estimation

Answering the selectivity problem exactly requires either scanning the input dataset or using an index. Both methods are too time-consuming.

**Uniformity assumption**: We assume that the elements of the dataset are of the same size and uniformly distributed in the input space. This assumption gives us the first (and simplest) estimation technique: return the ratio between the area of the query rectangle and the area of the input data, scaled by the total number of elements.
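Under these assumptions the estimate is a one-line computation, roughly as follows (a sketch; the function name is mine):

```python
def uniform_estimate(query_area, data_area, num_elements):
    """Selectivity under the uniformity assumption: the fraction of the
    data area covered by the query, scaled by the element count."""
    return num_elements * (query_area / data_area)
```

For example, a query covering a quarter of the data area is estimated to select a quarter of the elements, regardless of where the elements actually are.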

The next techniques use grouping. The input space is partitioned into regions. Each region holds the number of elements intersecting it and the average width and height of an element. For a query rectangle, the number of elements in the regions that are fully contained in the query rectangle is calculated exactly; for the remaining regions of the partition that intersect the query rectangle, we assume uniformity. In each case we are given a number K that limits the number of groups the partitioning may create.

**Equi-area partitioning** - partition the input data area into regions with equal (or nearly equal) areas. At each step, the partition algorithm divides the largest region into two by a vertical or horizontal line, so after K − 1 steps we get K regions. The objective of this method is to minimize the maximal error, knowing that applying the uniformity assumption to a large area results in a large error. The equi-area method does not consider skew in the distribution of the elements within the created regions.
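A minimal sketch of equi-area splitting, assuming (my choice, not stated in the summary) that each region is halved along its longer side:

```python
import heapq

def equi_area_partition(x1, y1, x2, y2, k):
    """Repeatedly halve the region of largest area until k regions exist.
    A max-heap (negated areas) always yields the largest region."""
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    heap = [(-area((x1, y1, x2, y2)), (x1, y1, x2, y2))]
    while len(heap) < k:
        _, (a, b, c, d) = heapq.heappop(heap)
        if c - a >= d - b:               # split along the longer dimension
            m = (a + c) / 2
            parts = [(a, b, m, d), (m, b, c, d)]
        else:
            m = (b + d) / 2
            parts = [(a, b, c, m), (a, m, c, d)]
        for p in parts:
            heapq.heappush(heap, (-area(p), p))
    return [r for _, r in heap]
```

Note that the elements themselves never influence the split, which is exactly the weakness mentioned above.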

**Equi-count partitioning** - similar to equi-area partitioning, but here the groups contain the same (or nearly the same) number of elements. The partitioning is done in the same way, except that at each step the region with the largest number of elements is divided into two. The equi-count method generates more regions where there are many elements, even when this is unnecessary because the elements there are distributed uniformly.
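Treating elements as points for simplicity, equi-count splitting might look like this (the median-split rule and the point simplification are my assumptions):

```python
import statistics

def equi_count_partition(points, k):
    """Repeatedly split the group with the most points at the median
    coordinate along its wider axis, until k groups exist."""
    groups = [points]
    while len(groups) < k:
        g = max(groups, key=len)
        groups.remove(g)
        xs = [p[0] for p in g]
        ys = [p[1] for p in g]
        axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
        med = statistics.median(p[axis] for p in g)
        left = [p for p in g if p[axis] <= med]
        right = [p for p in g if p[axis] > med]
        if not right:                    # degenerate axis: split arbitrarily
            left, right = g[: len(g) // 2], g[len(g) // 2 :]
        groups += [left, right]
    return groups
```

Unlike equi-area, the data drives the splits here, so dense areas end up with many small groups.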

**R-tree index based partitioning** - construct a tree data structure in which each level introduces more detail about the data. The groups are partitioned so as to minimize the area of the regions, their overlap, etc. The R-tree method creates many non-uniform regions.

The first two techniques require the whole dataset to be in memory while partitioning. The last method takes O(N log N) time to construct the tree.

**Min-Skew** algorithm

This algorithm is a new approach, presented by the authors, that tries to deal with the problems of the previous methods: accuracy and construction time.

The algorithm consists of two main steps:

- Compact approximation of the input dataset: the input space is divided into a grid of a given density, and the number of elements intersecting each grid cell is counted.
- Greedy partitioning over the grid constructed in the first step. The objective function is the **spatial skew**: the variance of the grid-cell values within each region, weighted and summed over all regions of the partition. The input area is partitioned into regions greedily so as to minimize this objective; at each step, the region whose split reduces the objective function the most is divided into two.
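The two steps above can be sketched as follows, using the sum of squared deviations of cell counts as the (unnormalized) skew of a region; details such as the exact weighting are simplified relative to the paper:

```python
def skew(grid, r):
    """Sum of squared deviations of cell counts in the half-open
    region r = (row0, col0, row1, col1) of the grid."""
    cells = [grid[i][j] for i in range(r[0], r[2]) for j in range(r[1], r[3])]
    mean = sum(cells) / len(cells)
    return sum((v - mean) ** 2 for v in cells)

def best_split(grid, r):
    """Best horizontal or vertical cut of r, minimizing the resulting skew."""
    best = (skew(grid, r), None)                 # no cut if nothing improves
    for i in range(r[0] + 1, r[2]):              # horizontal cuts
        s = skew(grid, (r[0], r[1], i, r[3])) + skew(grid, (i, r[1], r[2], r[3]))
        if s < best[0]:
            best = (s, ((r[0], r[1], i, r[3]), (i, r[1], r[2], r[3])))
    for j in range(r[1] + 1, r[3]):              # vertical cuts
        s = skew(grid, (r[0], r[1], r[2], j)) + skew(grid, (r[0], j, r[2], r[3]))
        if s < best[0]:
            best = (s, ((r[0], r[1], r[2], j), (r[0], j, r[2], r[3])))
    return best

def min_skew(grid, k):
    """Greedily split until k regions exist, each step taking the split
    that reduces the total skew the most."""
    regions = [(0, 0, len(grid), len(grid[0]))]
    for _ in range(k - 1):
        gains = []
        for idx, r in enumerate(regions):
            s1, parts = best_split(grid, r)
            if parts is not None:
                gains.append((skew(grid, r) - s1, idx, parts))
        if not gains:
            break
        _, idx, parts = max(gains)
        regions[idx:idx + 1] = list(parts)       # replace region by its halves
    return regions
```

On a grid with a dense left half and an empty right half, for instance, the first split separates the two halves, driving the skew of both resulting regions to zero.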

Experimental results

Tests were performed on two types of datasets: synthetic data and real-life data (representing a road map). The techniques studied were equi-area, equi-count, R-tree, and the min-skew technique presented in the paper. In addition, a random sampling method, a fractal-based parametric technique, and the simple uniformity assumption over the whole input were tested for comparison. For each technique, several queries were performed. The quality of a technique for a specific set of queries is given by its **average relative error** (relative to the exact answer, which was computed offline); the lower the relative error, the better the approximation technique.

The parameters that were tested:

- Query size - the average size of a query (between 5% and 25% of the input dataset area)
- Space allocated for the summary data (number of groups or number of samples)

The tests show that the fractal technique, the uniformity assumption, and the random sampling method result in the worst relative errors.

We can also see from the experiments that, for all the techniques, the relative error drops as the query size grows. This is because large queries fully contain more regions, for which the answer is exact (in the grouping techniques).

The min-skew algorithm gives better results than the other grouping techniques (equi-area, equi-count and R-tree).

Testing the effect of the size allocated for the summary, we can see that all algorithms give better results as the allocated memory grows. For a small number of allowed groups (50), the results of the min-skew algorithm were much better than those of the other techniques.

Concentrating on the min-skew algorithm, the authors also tested the effect of changing the grid density used in its first step. Here the results were not intuitive: beyond a certain density, refining the grid further does not improve the results of the algorithm, and in some cases (synthetic input data) the results got worse as the grid density grew. The reason is that for large queries a denser grid may not improve the approximation, and the cells of a dense grid can become smaller than the dataset elements, so that an element may be counted more than once.

Trying to overcome this phenomenon, the authors apply progressive refinement of the grid in the first step of the min-skew algorithm: start with a coarse grid, create part of the allowed groups of the summary, and then refine the grid and generate more groups. The number of refinements is predetermined. Testing this improvement shows that one or two refinements give the best results, but these are still worse than the relative error achieved by selecting the most appropriate grid density in advance.

Testing the construction times showed that all techniques except min-skew take significantly more time as the data size increases.

Summary and personal view

The paper presents the min-skew algorithm, which is shown empirically to be better than the other techniques, and is also easy to implement and understand. The paper does not report results for larger datasets (in the tests, 40,000 elements was the largest amount).

I think that the min-skew technique can easily be extended to support non-rectangular data and queries.

On the other hand, this technique has a problem: elements may be counted more than once if they intersect more than one cell of the grid. This is one of the reasons the relative error increases as the grid becomes denser.

The first step of the min-skew algorithm (the compact approximation) could use other types of partitioning (for example, a kd-tree). This could reduce the approximation error caused by the inaccurate counting of elements in the grid cells.

Links

PowerPoint presentation (in Hebrew)

Selectivity Estimation in Spatial Databases (PostScript file)