Mingke Erin Li

GIS Basics: Index

Good to begin well, better to end well.

– June K. Robinson

I set up a mind map and wrote detailed documentation to help myself organize, understand, and memorize knowledge and concepts regarding GIS when preparing for my PhD candidacy exam.

In this post, 1D and 2D indices are discussed.

I mentioned that a pure file organization strategy is not sufficient to optimize the data retrieval operation in most cases, so an index is needed. An index is an auxiliary structure to speed the searching and retrieval of records in data files. The specific structure of an index is designed based on a particular application or the nature of the data file per se. Here I start with the 1D index and will expand the topic to the 2D index and spatial index which are one of the most significant parts of spatial data management.

1D index

A 1D index is used to linearly index records. A single-level index itself is an ordered file whose records have two fields: index field and pointer field. An index field contains the ordered values of the indexing field in the data file, and a pointer field contains the address of the disk block storing records in the data file. Each access to a disk block on the secondary storage or a disk page on the primary storage counts for one I/O which is the basic measurement of data access efficiency. When the data file is big enough and the corresponding index file is too big to fit in a single disk block, the index file will be placed on multiple disk blocks. In this case, the index may be itself indexed and so on recursively to achieve better overall efficiency, which is called a multi-level index. Although an index file is essentially an ordered file, it has two primary differences: the ordering can be operated on multiple file fields, and some indices manage to insert and delete records in a data file efficiently. B-tree and its variants are good examples of achieving efficient insertion and deletion operations.

B-tree

As a typical tree structure, B-tree has three types of nodes: root, internal nodes, and leaf nodes. A node is corresponding to a disk block on the secondary storage or a disk page on the primary storage. Each node contains a list of increasing index fields and a list of pointers to the data records. Internal nodes have an extra list of pointers to the immediate descendants of this node. Another understanding could be that each node has a list of index fields or entries, and each entry contains a key-pointer pair where keys are in increasing order, and pointers can either point to a child node or a data record depending on if it is a leaf node. A B-tree should be defined in the manner of the allowed max number of index fields of each node and the max number of immediate descendants of each internal node. This is called the fan-out ratio or the order of the B-tree. The structure of the B-tree has three important characteristics: 1) It is fully balanced, which means the depth of each leaf node is the same. 2) Each node is at least half-full, e.g., if a B-tree is defined to have at most 4 index fields then each of its nodes should have 2, 3, or 4 index fields. 3) The ancestor node defines the ranges of index fields of its descendants.

2-3 tree

2-3 tree is a special case of B-tree where the fan-out ratio equals 3. Thus, in a 2-3 tree, each node has no more than 2 index fields and each internal node has no more than 3 immediate descendants. 2-3 tree keeps all properties of a B-tree.

Three operations and their complexity are introduced here: search, insertion, and deletion. A search operation starts from the root node, traces along with the descendants, finds the exact match index field and retrieves the records of the data file via the pointer stored in the exactly matched index field. There is no restructuring of the index structure during the search operation. An insertion operation searches the place to insert the index field. If there is an empty space under a node, then a simple insertion will be conducted, and no restructure of the index happens. If the node has an overflow, then the node will be split into two equal parts and the middle value will be promoted to the parent level. If the root node comes across an overflow, then the same mechanism will be applied and the middle value will be promoted to the new root node. The deletion operation also first searches the target index field. If the node maintains half-full after the deletion of the index field, then a simple deletion will be carried out, and no restructuring is needed. If the node has an underflow, then the nodes will be merged appropriately to make sure the whole tree is balanced. This process is inverse to the insertion restructuring. The time complexity of search, insertion, and deletion operations is O(logn) where n is the number of records in a data file to be indexed. This means that the processing time is proportional to the logarithm of the collection size in the worst case.

B+ tree

B+ tree contains pointers to records only in the leaf nodes, and all keys/index values are contained at leaf nodes. Therefore, the internal nodes have a simpler structure and take less storage space compared to a B-tree. Leaf nodes are stored as a structurally linked list in a B+ tree. Other than this, the insertion and deletion operations are easier in the B+ tree.

2D index

It should be noted that computer storage is essentially one dimensional, so what we can do is to unravel the higher-dimensional space into one-dimensional indices. First, I focus on the regular square tessellation of the Euclidean space, or a raster space, and introduce several 2D ordering methods and raster storing or compressing methods.

2D ordering

Common 2D ordering mechanisms include row ordering, row prime ordering, cantor diagonal ordering, spiral ordering, Morton ordering or z-ordering, and Peano-Hilbert ordering or Hilbert ordering. Among these, z-ordering and Hilbert ordering are essentially discretized fractals that are space-filling, better at preserving nearness, and can be constructed at arbitrary levels of detail. Preserving nearness means that cells that are closer to each other in space have closer indices values. These 2D ordering mechanisms can be applied in the raster compressing approaches introduced below.

Raster storing and compressing method

The run-length encoding counts the number of consecutive cells with the same value along a certain movement trace. One can move along the row-ordering trace, z-ordering trace, or the other ordering trace, and get completely different encoding results. The chain encoding is usually used to encode the boundary of a region in the raster space, which defines 8 possible movement directions with 8 different codes, 0-7 for example, and follow the boundary clockwise or counterclockwise. In this case, it repeats the direction codes when moving in the same direction. A more compact way to encode region boundaries is to combine chain encoding and run-length encoding. Another encoding method is block encoding using various sizes of square blocks to encode a region, which is a generalization of run-length encoding in two dimensions. The last method for raster encoding or compressing is the region quadtree. The structure under the hood is a tree, where each non-leaf node has exactly four descendants, namely four quadrants. The order of quadrants at the same level in a quadtree (or the visiting order of the squares in the raster space) can depend on different 2D ordering methods. Take a quadtree of the depth of d, then each leaf quadrant can be labeled with a string of size <= d. With the Hilbert ordering, the order of the leaves corresponds to a left to right scan of the leaves, and the size of the label is the depth of the leaf in the tree, which is not equal for all leaves. The label of an internal node is the prefix of its descendants. Each leaf quadrant can have associated attributes or point to records of more detailed information in a data file. To construct a quadtree, one starts from the root which is the whole raster, subdivide it into four equal-sized squares, and if any squares are not homogeneous then subdivide them recursively until each descendant is homogeneous. A quadtree is not balanced and is deeper over a heterogeneous area. Some operations can be realized based on the quadtree structure, including common ones like union, intersection, difference, and complement, and the labeling and counting of the number of connected components in an areal spatial object. The region quadtree is useful in some cases because it takes advantage of the 2D nature of the data and has a variable adaptive resolution in raster space. However, there are a few situations when the region quadtree is not efficient: 1) Translating or rotating objects in the raster space, will lead to a completely different quadtree structure. 2) Raster data is dynamic and needs frequent updates. 3) Raster data is highly inhomogeneous. 4) Raster is largely composed of rasterized point or linear features.

Spatial index for vector data

I separate the vector data structures used by a spatial access method or SAM as an individual topic. The SAM is meaningful to accelerate data access on the secondary storage and facilitate the complex geometric algorithms especially when data collection size is large. The expectation of SAM is on both time complexity and space complexity. The time complexity here is measured as the number of I/O which is corresponding to the CPU transfer time mentioned above. The space complexity should be comparable to the indexed data file, which means it should be the order of the number of pages the data file occupied. The SAM should have dynamicity to support the insertion and deletion of records and adapt to the indexed data file’s growth and shrink. In the 2D context, SAM can be either space-driven or data-driven. A space-driven SAM partition the searching space or embedding space into sub-space usually is rectangle in shape, and independent of the distribution of spatial objects. The partition can happen on a plane space or a curved space such as a sphere, and these space-driven structures are always with redundancy; we will revisit it in detail later. On the contrary, a data-driven SAM partitions the spatial objects by adapting to their distribution in the embedding space. The data-driven SAM is always overlapping. The spatial objects in a data file to be indexed can be points, lines, and polygons. Thus, the following introduced spatial indices are specifically designed for points, lines, and polygons, respectively. Particularly, indices designed for polygons are always based on the minimum bounding box or MBB, where a query has the first filter step to test against the MBB and the second refinement step to test against the actual geometries. The design of SAM can only accelerate the first filter step whose time complexity would be O(logn), and the time complexity of the second refinement step is constant due to the sequential search.

Fixed grid

A fixed grid is a space-driven SAM designed initially for point objects, where the searching space is sub-divided into a 2D array of equal-sized cells. The cell or bucket is the basic element in the fixed grid structure, where each bucket is associated with a disk page or a disk block physically storing the points in the data file. The resolution or the spatial size of a bucket depends on the total number of points in the data file to be indexed and the page capacity of the system hardware. One page is usually 1k to 4k of storage capacity. The fixed grid index is essentially a 2D array directory in which each item corresponding to each bucket contains the address of the corresponding page.

The common operations on a fixed grid structure are inserting points, deleting points, point query, and window query. Deleting a point is straight forward, all need is to find out which bucket has the target point and do deletion – a single I/O is needed if the indexing directory is in the main memory. Inserting a point is a similar process unless a cell overflow happens after the insertion. In this case, an overflow page will be created and chained to the initial page. The point query operation then needs q I/O where the point is on the qth page in the overflow chain. In the cases of no overflow, the point query just needs one single I/O. Window query of points computes the set of buckets overlapping with the query window and accesses each associated page. Thus, the number of I/O is proportional to the query window area.

2D ordering introduced before can be applied in the fixed grid, for example, the Morton ordering, so that points close to each other in space can be further close to each other on disk. Fixed grid index is at least better than the occasion where no index is applied. However, as the name suggests it is not flexible and the problems exist when the points’ spatial distribution is highly skewed, or the dataset size varies significantly with repeated insertions of new points and ends up with a skewed distribution. A highly skewed distribution will result in a long overflow page chain.

Grid file (dynamic grid)

A grid file is a dynamic grid structure that is an extension of the fixed grid. The main difference is that the grid file is a 2D array of different sizes of cells that can dynamically adapt to the point distribution. Same as the fixed grid, a 2D array directory is used to store the addresses of pages associated with buckets. A bucket can be divided if it overflows. To do so, the vertical or horizontal subdividing lines can be put at arbitrary positions depending on the points’ location. Buckets can be ‘merged’ if their merging results in a rectangle, where these two adjacent buckets can reference the same page. In the grid file, linear arrays are needed to describe the partition of the coordinate axis into intervals, which are called scales.

Inserting a point into the grid file has three cases: no cell splits, the cell splits without directory splits, and the cell splits with directory splits. If the point is in a cell that associates with a page that is not full, then no cell splits happen. If the point is in a non-full cell that associates with a full-page because more than one cell corresponds to the same page, then a new page will be allocated. If the point is in a full cell (of course the page is full then), a cell split will be needed. Point query and window point query are similar in the fixed grid. Compared to a fixed grid structure, the grid file benefits the point search which needs only two disk accesses and adapts to points’ distribution due to its dynamic structure. The problem of the grid file would be in the case that the dataset is large, and the directory can no longer fit in the main memory. I think this is a common problem for other spatial indices.

Although the grid structure is initially designed for point objects, both the fixed grid and grid file can be used for polygon objects (MBB) as well. Each MBB is assigned to its overlapped cells, and duplication occurs when the MBB contains a cell and the MBB intersects the cells. If an MBB is contained in a cell, then no duplication will appear. The insertion and deletion operations are the same for points although there would be more cell splits because of the duplication of objects. Two more operations are discussed here: point-in-polygon query and window query of polygons. To do a point-in-polygon query, one needs to first test which cell contains the point and read the corresponding disk page, then sequentially test which entry (MBB) in the disk page contains the point, finally test against the actual geometry which is called refinement. The whole process can be viewed as a constant I/O time. A window-query of polygons has two steps. First, test and find out all cells overlapping the query window and read the corresponding disk pages. Second, scan each page to return all MBB then remove the duplications by sorting. The number of I/O of this process is proportional to the window area. The main disadvantages of managing polygons with grid structures are object duplications in neighboring cells and the expensive duplicate removing process. Of course, if the data file is large and the corresponding index directory is too large to fit in the main memory, then more complexity will occur.

Point quadtree

Point quadtree is another space-driven SAM designed for point objects, which shares a lot of common characteristics with the region quadtree. It has a root point which is the first point to be inserted in the embedding space, and each of the non-leaf nodes has four immediate descendants. This is an unbalanced tree as a region quadtree. The difference to the region quadtree is that the position of each partition into quadrants is centered on a data point so that the quadrants are not equal in size and shape. Each data point is a non-leaf node and corresponds to a data record, which contains the coordinate in two fields, the descendants in four fields, and other non-spatial attributes in the other fields. The insertion sequence plays an important role in constructing the point quadtree as different sequences result in completely different quadtree structures, and therefore this structure is not suitable for dynamic geospatial data. The build time for a point quadtree is O(N log N) where n is the number of points indexed by the quadtree. The time complexity of a point query is O(logn). Another problem of point quadtree other than dynamic data management issue is that when applying the structure in higher dimensions, the number of descendants will exponentially increase, for example, in 3D space a point can have 8 descendants (2k).

2D tree

2D tree or generally kD tree resolves the problems of point quadtree where the number of descendants exponentially increases in higher dimensional space at the expense of deeper tree levels. It has a root point which is the first point to be inserted in the embedding space, and each of the non-leaf nodes has two immediate descendants no matter the space dimensions. This is an unbalanced binary tree. The construction of the 2D tree is as follows: at the even depths (root level is 0), partition the space along the y axis direction and compare the x coordinate of the newly insert point to its parent to decide whether it is a left descendant or a right descendant; at the odd level, partition the space along the x-axis direction and compare the y coordinate of the newly insert point to its parent. The left descendant is always less than the right descendent at each level. The two descendants of each non-leaf node are not equal in size. Each non-leaf node corresponds to a data record, which contains the coordinate in two fields, the descendants in two fields, and other non-spatial attributes in the other fields. Similar to the point quadtree, the insertion sequence of points matters, so it is not good for dynamic geospatial data either. Here two query operations are introduced for 2D tree structure: point query and window query.

To query a point, one starts from the root, compares the x coordinate to the non-leaf node at even levels and y coordinates at odd levels, recursively until the exact match is found. To query points falling in a window or a range (defined by two coordinate pairs marking the extreme x and y coordinates), one starts from the root, tests if the node point falls in the window, and compares x or y coordinate to the two window corners at even or odd levels, recursively until the deepest tree level is reached.

PM quadtree

PM quadtree is a space-driven SAM designed for planar network objects, which is a variant of region quadtree and has three types: PM1 quadtree, PM2 quadtree, and PM3 quadtree. Here I only focus on the PM1 quadtree. As the region quadtree, a PM1 quadtree has a root, leaf nodes, and non-leaf nodes which have four equal immediate descendants. In a PM1 quadtree, vertices and edges are separated into distinct leaf nodes and are allowed with different field structures. There are three constraints for a PM1 quadtree: 1) leaf node contains at most one vertex of the network; 2) leaf node containing one vertex cannot contain part of an edge that is not incident with the vertex, and 3) leaf node containing no vertex can only contain one part of an edge. The insertion operation in a PM1 quadtree searches for the appropriate place to insert parts of the edge, and edges are successively clipped by the sub-quadrant boundaries in the end. Deletion operation in a PM1 quadtree is a similar process that inverses the insertion. PM2 quadtree and PM3 quadtree improves PM1 quadtree in the manner of smaller tree structure although have more complex associated records. Specifically, a PM2 quadtree has a more stable structure under translations or rotations of the planar graph.

Strip tree

Strip tree is a data-driven SAM designed for curve and line objects. The root of the strip tree is the minimum bounding box of the entire line or curve, while different from the one used in R-tree, the sides are parallel to the line joining the endpoints of the curve instead of parallel to the x and y-axis. The minimum bounding box ensures that the dead zone is the smallest. Each non-leaf node of the strip tree has two unequal descendants, where partition happens at the point where the curve touches the minimum bounding box. The binary tree grows recursively until the width of the resulted strip is less than a threshold value, and these strips are the leaf nodes. Quite a lot of operations are practical with the strip tree structure, including computing the length of the curve, testing proximity of a point, displaying curve at different resolutions, testing if two curves intersect, computing union of two strip trees, computing area of a closed curve, testing point-in-area, and intersecting a curve with an area.

BSP tree

Binary space partitioning or BSP tree is a space-driven SAM that hierarchically structures sets of directed line segments. It is a binary, unbalanced tree. The structure depends on the order of insertion of the segments.

Quadtree

A quadtree is a space-driven SAM that can be used for areal objects represented by MBB. It has been introduced above, which has a tree root, non-leaf nodes having exactly 4 equal descendants, and leaf nodes associated with disk pages. To construct a quadtree for areal objects, one starts by dividing the searching space into 4 quadrants, and assign MBB into a quadrant that completely contains the MBB or several quadrants that overlaps the MBB. Space is recursively divided until the number of MBB overlapping each quadrant at the leaf node is less than the page capacity. The insertion operation is relatively simple: one can follow as many paths as there are leaves overlapping the MBB to be inserted and read the associated page. If the disk page is not full, then simply add the new entry to the page. If the disk page is full, then divide the page into 4 quadrants and add the new entry to each of the new quadrants.

Point-in-polygon query and window query are supported by this structure. A point-in-polygon query needs to follow the path from the root to a leaf and choose among 4 quadrants that contain the target point at each level. The found leaf is read and scanned to find out which MBB contains the point. Of course, a refinement step should be followed to test against the actual geometry. The window query needs to find out all MBB that intersect the query window. As introduced in the region quadtree part, the nodes can be labeled by 2D ordering methods, for example, the Hilbert ordering. The order of the leaves corresponds to a left to right scan of the leaves, and the size of the label is the depth of the leaf in the tree, which is not equal to all leaves. The label of an internal node is the prefix of its descendants. There are a few drawbacks of using a quadtree to index areal objects. First, the fan-out ratio is fixed to 4 which occupies a small part of a page. Second, the query time is related to the tree depth due to the unbalanced tree structure, so the performance can be degraded when the tree depth is large and even worse for dynamic data. Third, as in the grid structure, objects are likely to be duplicated in neighboring quadrants.

Linear quadtree

A linear quadtree is largely based on the quadtree, which is also a space-driven SAM that is used for areal objects represented by MBB. The core idea is after labeling the quadtree leaves, we define the lexicographic order of these labels, such as 020<03<301<31, thus we can use a B+ tree to index these labels corresponding to the leaf nodes. The application of B+ helps to map from 2D partitioning of the searching space into a 1D list of quadrants. The advantage is that the B+ tree is always balanced, and the structure is dynamic and efficient in storage and response time, although duplicated MBB may be stored in different B+ tree entries. The commercial DBMS using a B+ tree to index their non-spatial data can extend their system to handle spatial data in this way.

Three operations in linear quadtree are introduced here. First, to insert an areal object (MBB) can come across either no split of the quadtree when the entry number does not meet the page capacity or a split of the quadtree. The new entry can be simply added to the corresponding quadrant page in the former case, and the old leaf node in the B+ tree needs to be deleted and four new leaf nodes will be added in the latter case. The point-in-polygon query has three steps: 1) compute the point label whose length is the same as the tree depth, 2) retrieve the corresponding quadtree leaf in the B+ tree, where the quadtree leaf label should be either the same is the point’s label of the prefix of the point’s label, and 3) access and scan the leaf node to find out which MBB contains the target point. Another refinement step should be followed. The whole process needs (d+1) of I/O where d is the depth of the quadtree. The window query steps are as follows: 1) We assume the window is defined by its NW and SE corner vertices, the first step is to do a point query for these two corner points as the first two steps in the point query operation. This gives us two quadrants’ labels marking an interval on the B+ tree. 2) Perform a range query with this interval on the B+ tree. This step returns all quadrants within this label interval, which is all quadrants from the lower bound to the upper bound via a row ordering. 3) For each returned quadrant, test if it overlaps the window, and if it does then test each of its MBB to find out those who overlap the query window. This whole process has 2d I/O for step 1, d I/O for reaching lower bound leaf in step 2, and another as many as I/O to scan the chained B+ tree leaves depending on the window size.

Z-ordering tree

A Z-ordering tree is a space-driven SAM used for areal objects. The biggest difference compared to the linear quadtree above is that the z-ordering tree decomposes each areal object into a quadtree instead of working with MBB, which leads to a raster approximation in the end. The idea is, at each decomposition level, check if each piece of the target object overlaps all leaf nodes of a decomposed quadrant, if not another decomposition will be performed recursively until the deepest tree level is met. The resulting quadrants are labeled by the B+ tree as in a linear quadtree. The point-in-polygon query and the window query are similar to those in a linear quadtree. However, compared to a linear quadtree, the B+ tree of a z-ordering tree may have larger depth and resulting in more I/O, this is because the B+ tree of a linear quadtree has their leaf nodes containing several MBB as long as the number is not over the page capacity, while the B+ tree of a z-ordering tree has their leaf nodes contain only one object identifier. There are three occasions of duplications in the z-ordering tree: the object’s identifier stored as many times as there are cells in its approximation, the same labels may be duplicated for different identifiers where several objects share the same cell at the same level, and some objects may overlap the same cell but at different decomposition levels. This means that the keys of leaf nodes may be duplicated.

There are two variants for the z-ordering tree. The first variant is simply to decompose the object into elementary cells without combing it into a larger quadrant, similar to a raster. In this case, the redundancy is the worst although the point and window queries are simple. The reason is, take the window query as an example, in a linear quadtree one needs to find out the corresponding quadtree leaves in the B+ tree after computing the window corners’ labels whose length are the same as the tree depth. This step can be skipped in the case of a raster variant of the z-ordering tree because the labels of the point will always be the same as those of the quadtree leaves. Another variant of the z-ordering tree is to assign objects to the smallest quadrants instead of to all of the approximation quadrants. This is good in the manner of no object duplications and thus less deep B+ tree, although the small ratio between the object size and the assigned quadrant size can be a problem.

R tree

R tree is a data-driven SAM that is designed for areal objects. It is essentially a multidimensional extension of the B tree which is always balanced and at least half full for each node. The tree root has at least two entries, and internal nodes and leaf nodes represent directory rectangles. Each node corresponds to a disk page. The directory rectangle at an internal node stores a list of node entries which are collections of lower-level directory rectangles and a list of page addresses. The maximum entry number of a node depends on the entry size and the page size. The directory rectangle at a leaf node stores a list of leaf entries that are actual MBB to be indexed. Along with the leaf entries, a list of object addresses that are inside of pages is stored at each leaf node.

To initialize an R tree, two algorithms are considered here. The first one is a quadratic split algorithm. For example, we have M objects (MBB) to be indexed in total, this algorithm starts by picking two far-away seeds whose dead space is maximum. Now we have two groups to enrich. For each of the remaining entries, pick one leading to the maximum difference in two groups’ expansions measured as the enlarged dead space. Once picked up the entry, add it to the closest group which would have minimum expansion after having the new entry. If two groups are equally close to the new entry, then choose the group with the smallest area or fewest elements. This process is quadratic in M. Another one is called a linear algorithm, which picks two entries as two seeds whose distance along one axis is maximum, then assign each of the remaining entries to the group having the minimum expansion. This process is relatively fast at the sacrifice of more overlapping between the directory rectangles.

To insert a new object into an R tree, one starts by traversing the tree top-down until reaches a leaf node whose directory rectangle contains/overlaps the target MBB. If the leaf node is not full, then enlarge the leaf node’s directory rectangle to include the new MBB, and update the node entry stored in its parent node. If the parent node’s directory rectangle is enlarged as well then one needs to update the parent’s entry stored in the upper level and may need to recursively update up to the root. If the leaf node is full, then a split is needed. One needs to update the old leaf node and its parent node, and insert the new entry to a new leaf node. If the old leaf node’s parent node is full, then a node split is needed again and may need to recursively split up to the root. The deletion operation in an R tree needs three steps: 1) find the leaf node containing the entry, 2) remove the entry from the leaf, 3) reorganize the R tree if needed. The point-in-polygon query in an R tree begins by traversing the tree from the root, return all leaves whose directory rectangles contain the target point. It should be noted that because directory rectangles may have overlaps, multiple leaf nodes can be returned at this step. Then for each returned directory rectangle, sequentially scan the leaf entries and keep those containing the target point. The window query in an R tree is conducted similarly: start from traversing the tree from the root, return all leaves whose directory rectangles intersect the query window. For each returned ones, sequentially scan the leaf entries and keep those intersecting the query window.

The obvious benefit of the R tree is no object duplication due to its data-driven nature, and the tree structure adapts to the skewness of the data distribution. Because of the similarity to the B tree, accessing an MBB has time and space efficiency in the R tree. The main drawbacks are the poor spatial organization due to the random insertions, and the directory rectangles may overlap each other which results in inefficient point and window queries.

R* tree

R* tree has the same structure as the R tree, and two improvements compared to the R tree include an improved split algorithm and a force reinsertion strategy when doing insertion operations. The initialization of an R* tree is as follows. To determine the best split line, it assumes that the split is only to be performed along one axis each time and explore all possible distributions of objects above or below the split line. The best split line is the one leading to the minimum sum perimeters of nodes’ directory rectangles. Given the split line, one chooses the distribution with the minimum overlaps between directory rectangles. If two distributions have the same overlaps, then go with the distribution with the minimum area. This algorithm leads to fewer overlaps between nodes, less area covered by the nodes, and less perimeter of a node’s directory rectangle. The insertion operation adopted the force reinsertion strategy, which allows reinserting some entries whenever a node overflows so that it is likely to avoid the frequent split and local reorganization in an R tree, although a split has to be done in an R* tree when detecting the same node overflows a second time. Both the improved split algorithm and the force reinsertion strategy result in a better organization of nodes than an R tree and consequently lead to improved query operation performance.

R+ tree

R+ tree is another variant of R tree that aims to remove overlaps between directory rectangles at each level at the sacrifice of duplicated object storage. It has the same structure as the R tree, while an internal leaf contains all rectangles in the subtree, and a directory rectangle needs to be assigned to all leaf nodes it overlaps. The benefit of this structure is zero-overlapping between nodes which leads to optimized point query. The drawback is the duplicated directory rectangles. Other than this, it cannot guarantee an optimized storage utilization so that it is larger than the R tree created for the same dataset. The construction and maintenance are more complex as well. Here I briefly go through the insertion operation in an R+ tree. A directory rectangle is to be inserted in all nodes the directory rectangles of which it overlaps, and keep zero overlaps among directory rectangles at the same level. This is achieved by clipping the target MBB and assign each of the pieces to the corresponding node. There are occasions when the directory rectangles inserted with the new entry need to be enlarged without overlapping. In the cases of potential overlaps in this step, some reinsertion is needed. Finally, a node has to split if it overflows. This complicates the algorithm and deteriorates the space utilization.

Packed R tree

Packed R tree is an improved R tree suitable for static data without frequent updates. The idea is to sort the base data according to their locations and build the index bottom-up, which will eventually maximize the space utilization although overlapping between nodes is possible. The sort-tile-recursive algorithm is used. First, sort all leaf nodes’ directory rectangles according to the x coordinate of their centroids into groups. The number of groups is the square root of the total object over node capacity. Then for each group, sort all directory rectangles on the y coordinate of their centroid. Finally, take the set of the directory rectangle of leaves and construct recursively the upper levels with the same algorithm.

Main references

Rigaux et al., 2002. Spatial databases: with application to GIS, San Francisco: Morgan Kaufmann Publishers.

Stefanakis, E., 2014. Geographic Databases and Information Systems, North Charleston, SC: CreateSpace Independent Publishing Platform.

Stefanakis, E., 2015. Web Mapping and Geospatial Web Services: An Introduction, North Charleston, SC: CreateSpace Independent Publishing Platform.

Tomlin, C.D., 1990. Geographic information systems and cartographic modeling, Englewood Cliffs, N.J.: Prentice-Hall.

Worboys, M. and Duckham, M., 2004. GIS: a computing perspective, Boca Raton: CRC Press.

Supplement course notes

UNB GGE 4423 Advanced GIS taught by Dr. Emmanuel Stefanakis

UofC ENGO 645 Spatial Databases and Data Mining taught by Dr. Emmanuel Stefanakis