Mingke Erin Li

GIS Basics: Data Manipulation

Good to begin well, better to end well.

– June K. Robinson

I set up a mind map and wrote detailed documentation to help myself organize, understand, and memorize knowledge and concepts regarding GIS when preparing for my PhD candidacy exam.

This is the fourth part of the doc: Data Manipulation.

This is another huge topic. All geometric measurements and spatial analyses can be viewed as manipulations of spatial data. I will go through the basics of algorithms, computational geometric analysis, geodatabase operations, data model conversion, network analysis, map algebra, terrain-related analysis, geostatistics, and spatial data mining.

Basics of algorithms

In short, an algorithm is a sequence of operations to be performed on data organized in a data structure, and the choice of data structure has an important impact on how well the algorithm can be optimized. An algorithm can also be viewed as an abstraction of the program to be run on a physical machine. It has three components: a model of computation, a language, and a performance evaluation. The model of computation defines the data types and the operations available. For example, the random access machine (RAM) model assumes that accessing any unit of data takes constant time and that the basic algebraic operations are available. The language used to express an algorithm is usually straightforward, such as C or some pseudo-language. Performance is usually evaluated by time complexity and space complexity, each a function of the input size that considers only the cost of the key operations; the real computing time also depends on the implementation and the machine. Complexity normally means the worst-case complexity, an upper bound denoted by big O, because the average-case complexity is hard to evaluate. Common complexity classes are O(1) constant complexity, O(log n) logarithmic (sub-linear) complexity, O(n) linear complexity, O(n log n) quasi-linear complexity, O(n^a) polynomial complexity, and O(a^n) exponential complexity. Optimality refers to the lower bound over all candidate algorithms for a given task, denoted by omega (Ω). It can be evaluated through the output size and can sometimes be established by transformation of problems, or problem reduction. There are three common strategies for designing a good algorithm: incremental algorithms, the divide-and-conquer strategy, and the sweep-line strategy.

Incremental algorithms

The incremental algorithm takes a small subset of the input that is solvable and then adds the remaining input elements one by one while maintaining the solution at each step. A classic application is the convex-hull problem, which builds the smallest convex polygon containing all input points. The optimized algorithm starts by sorting the points by their x coordinate and builds a triangle from the three leftmost points. Then, for each of the remaining points from left to right, it finds the two vertices of the current hull whose connections to the new point form the two tangent lines of the polygon, and updates the hull accordingly. The complexity of the algorithm is O(n log n), where n is the total number of points.
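
As a rough illustration, here is a minimal Python sketch in the spirit of the incremental approach (it is the monotone-chain variant: sort by x, then grow the lower and upper chains one point at a time while removing non-convex turns). Function and variable names are my own choices, not taken from the references.

def cross(o, a, b):
    # z-component of the cross product (OA x OB); > 0 means a left turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    # Incremental construction after sorting by x (then y); O(n log n) overall.
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                      # build lower hull, left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build upper hull, right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # concatenate, dropping duplicate endpoints

print(convex_hull([(0, 0), (2, 1), (1, 1), (2, 2), (0, 2), (1, 3)]))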

Divide and conquer strategy

The divide-and-conquer strategy employs top-down and bottom-up steps: the input is divided recursively until the sub-problems are small enough to be solved directly, and the partial solutions are then merged recursively. A representative example is the half-plane intersection problem. The algorithm recursively merges the intersections of half-planes from the bottom level of the recursion upward until the final intersection polygon is obtained. The complexity of the algorithm is O(n log n), where n is the total number of half-planes.

Another example of this strategy is the external sorting algorithm, used as pre-processing when the input is too big to fit in main memory. In the sort step, the input array is recursively halved until pairs of elements can be trivially sorted, which yields a binary tree. In the merge step, the sorted sub-arrays are recursively merged bottom-up. Note that, in this context, each node of the binary tree manipulates data in chunks of pages rather than one record at a time. The complexity of the algorithm is O(n log_m n), where m is the number of pages that fit in main memory and n is the total number of records to be sorted.
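
A minimal in-memory Python sketch of the same divide-and-conquer idea (a true external sort would read and write pages of records instead of individual elements; names are illustrative only):

def merge_sort(records):
    # Divide: recursively split until sub-arrays are trivially sorted.
    if len(records) <= 1:
        return records
    mid = len(records) // 2
    left = merge_sort(records[:mid])
    right = merge_sort(records[mid:])
    # Conquer: merge the two sorted halves (bottom-up along the recursion tree).
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 7, 3]))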

Sweep line strategy

The sweep-line strategy decomposes the geometric input into vertical strips and sweeps a vertical line over the input from left to right, stopping at the strip boundaries while maintaining and updating the needed information. Take rectangle intersection as an example. As the vertical line sweeps from left to right, all rectangles intersecting it are stored in the sweep-line active list, and an intersection between two such rectangles can be detected from the overlap of their projections on the y-axis. The line stops whenever it encounters the start or the end of a rectangle; these x positions are stored in the event list. The sweep continues until the rightmost boundary is reached. The complexity of the algorithm is O(n log n + k), where n is the total number of rectangles and k is the number of intersections.
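
Here is a simplified Python sketch of this idea: events are the left and right x-boundaries, an active list holds the rectangles currently cut by the sweep line, and y-projection overlap is checked against those active rectangles. A production version would use a balanced structure (e.g., an interval tree) for the active list; a plain set is used here, so the check against active rectangles is linear. All names are my own.

def report_rect_intersections(rects):
    # rects: list of (xmin, ymin, xmax, ymax); returns index pairs that intersect.
    events = []                                   # (x, is_right_edge, rect_index)
    for i, (xmin, ymin, xmax, ymax) in enumerate(rects):
        events.append((xmin, 0, i))               # left boundary: rectangle enters
        events.append((xmax, 1, i))               # right boundary: rectangle leaves
    events.sort()
    active, hits = set(), []
    for x, is_right, i in events:
        if is_right:
            active.discard(i)
        else:
            for j in active:                      # y-projection overlap test
                if rects[i][1] <= rects[j][3] and rects[j][1] <= rects[i][3]:
                    hits.append((min(i, j), max(i, j)))
            active.add(i)
    return hits

print(report_rect_intersections([(0, 0, 2, 2), (1, 1, 3, 3), (4, 0, 5, 1)]))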

Geodatabase operations

The operations in a geodatabase include basic operations such as create, edit, update, and delete, which are largely covered in the physical design of a DBMS when discussing the various indices. In this section, I focus on query operations, especially location-based queries and overlap operations, spatial joins, and the query execution plan.

Query based on attributes

Queries based on attributes rely largely on relational algebra, which includes union, difference, product, project, selection, restrict, intersection, divide, and join. One can create a non-spatial index, a B-tree for example, on the identifier field of a relation to speed up queries. If the relations have no indices built, a join can follow a sort/merge join algorithm whose time complexity is O(n log_m n), where n is the size of the dataset and m is the size of the main memory in pages. Alternatively, a hash-join algorithm can be used for non-indexed relations.

Point collinearity

To test whether three points are collinear, one can compute the area of the triangle formed by the three points, given their coordinate pairs. If the area is zero, the three points are collinear.

Point on segment

To test whether a point lies on a segment, one first tests whether the three points (the tested point and the two endpoints of the segment) are collinear. If they are, one then tests whether the x and y coordinates of the test point fall within the intervals bounded by the x and y coordinates of the segment endpoints. If they do, the point is on the segment.
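
The two tests above reduce to a signed-area (cross-product) computation plus an interval check; a minimal Python sketch follows, with a small tolerance for floating-point input (names and the tolerance value are my own choices).

EPS = 1e-9

def signed_area2(a, b, c):
    # Twice the signed area of triangle abc; zero means the points are collinear.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def collinear(a, b, c):
    return abs(signed_area2(a, b, c)) < EPS

def point_on_segment(p, a, b):
    # Collinearity first, then check p lies within the x/y intervals of segment ab.
    if not collinear(a, b, p):
        return False
    return (min(a[0], b[0]) - EPS <= p[0] <= max(a[0], b[0]) + EPS and
            min(a[1], b[1]) - EPS <= p[1] <= max(a[1], b[1]) + EPS)

print(point_on_segment((1, 1), (0, 0), (2, 2)))   # True
print(point_on_segment((3, 3), (0, 0), (2, 2)))   # False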

Point in polygon

There are two algorithms commonly used to test whether a point is in a polygon. The first is the semi-line algorithm: draw a semi-line (ray) starting from the test point; if the number of polygon edges it crosses is even, the point is outside the polygon, and if the number is odd, the point is inside. A simple choice is a right-directed horizontal ray from the point. Three tricks handle the special cases: scan the edges beforehand to test whether the point lies on the boundary, do not count edges collinear with the semi-line, and count an edge as a crossing edge only if at least one of its endpoints lies strictly above the semi-line. The complexity of the algorithm is O(n), where n is the total number of polygon edges.

Another algorithm is the winding number algorithm, which imagines an observer moving counter-clockwise along the polygon boundary while always facing the test point. The angle turned at each move is calculated with trigonometry and the angles are summed at the end. If the observer turns through one complete circle, the point is inside the polygon; otherwise it is outside. The total angle turned is used to test for the complete circle. The complexity of the algorithm is O(n), where n is the total number of polygon edges, although the use of trigonometry reduces the computational efficiency.
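
A minimal Python sketch of the semi-line (ray-casting) test: a horizontal ray to the right of the point, counting crossings with the polygon edges. The half-open comparison on y implicitly applies the "count an edge only if one endpoint is strictly above the ray" rule; boundary points are not treated specially here, so a separate point-on-segment check would be needed for them. Names are illustrative.

def point_in_polygon(p, polygon):
    # polygon: list of (x, y) vertices in order; returns True if p is inside.
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge crosses the horizontal ray iff its endpoints straddle y,
        # and the crossing point lies strictly to the right of p.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), square))   # True
print(point_in_polygon((5, 2), square))   # False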

The intersection operation can be a Boolean test that returns true or false, or it can compute the actual intersection geometries. This applies to intersection operations on segments, polylines, and polygons.

Segment intersections

The Boolean test for segment intersection can be optimized with the plane-sweep algorithm, which uses the sweep-line strategy introduced above. We assume there is no vertical segment in the segment set to be tested. Sweep a vertical line from left to right and maintain an active list containing the segments that intersect the sweep line, ordered by the y coordinates of their intersections with it. The event list stores the x coordinates of the segment endpoints. Only consecutive segments in the active list need to be tested for intersection, and whether two segments intersect can be checked from the coordinates of their four endpoints. The active list should support efficient insertion and deletion and can be implemented as a tree structure. The complexity of the algorithm is O(n log n), where n is the total number of segments to be tested.
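
Checking whether two segments intersect from their four endpoints can be done with orientation (signed-area) tests; a minimal Python sketch, ignoring the degenerate collinear-overlap case for brevity (names are my own):

def orient(a, b, c):
    # Sign of the signed area of triangle abc: >0 left turn, <0 right turn, 0 collinear.
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def segments_intersect(p1, p2, q1, q2):
    # Proper intersection: the endpoints of each segment lie on opposite sides of the other.
    return (orient(p1, p2, q1) * orient(p1, p2, q2) < 0 and
            orient(q1, q2, p1) * orient(q1, q2, p2) < 0)

print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))   # True
print(segments_intersect((0, 0), (1, 1), (2, 2), (3, 3)))   # False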

To compute the intersection points of a segment set, two methods can be used. The first is an extension of the plane-sweep algorithm above that keeps track of each intersection between segments; its complexity is O((n + k) log n), where n is the total number of segments to be tested and k is the number of intersection points. The second method is the red-blue intersection. It assumes two sets of segments with no intersections within either set. The idea is to maintain one active list for each of the red and blue segment sets; the active lists change only when the right or the left endpoint of a segment is encountered, so the event list can be managed as a simple static array of the 2n sorted endpoints. The complexity of the algorithm is O(n log n + k), where n is the total number of segments to be tested and k is the number of intersection points.

Polyline intersections

Computing the intersection points between two polylines can be done with the same red-blue intersection algorithm for segments, without modification, because the two polylines can be viewed as the red and blue segment sets with no intersections within either set. The complexity of the algorithm is O(n log n + k), where n is the total number of segments and k is the number of intersection points.

Polygon intersections

The Boolean test for polygon intersection checks whether any edge of one polygon intersects any edge of the other, which can be done with the segment intersection algorithms introduced above. If no edges intersect, one still needs to test whether one polygon contains the other by testing whether all vertices of one polygon fall inside the other. The complexity of the algorithm is O(n log n), where n is the total number of polygon edges.

The approach used to compute the intersection geometry of two polygons depends on whether the polygons are convex. For two intersecting convex polygons, one performs a synchronized scan of the two boundaries; a test of whether one polygon contains the other is also needed at the end. The complexity of the algorithm is O(n), where n is the total number of edges of the two polygons. For two intersecting simple polygons, four steps are needed: classification of the vertices, finding the intersection points, edge fragment selection, and extraction of the resulting polygons. The complexity of the algorithm is O(n² + k²), where n is the total number of edges and k is the number of intersections.

If one needs to test or compute the pairwise intersections of a set of polygons, the polygons can first be represented by their minimum bounding rectangles, whose sides are parallel to the x and y axes. The task is then divided into computing the pairwise intersections of the minimum bounding rectangles, followed by a refinement step that tests against the actual polygon geometry using the algorithms above. The first sub-task can be done with the sweep-line strategy in these steps: 1) sort the segments of the rectangles on their xmin coordinate, 2) sweep a vertical line from left to right and stop at each event, namely the start or end of a horizontal segment, 3) store each horizontal segment met in the active list, and 4) perform a range query q(ymin, ymax) on the active list whenever a vertical segment s(ymin, ymax) is met, and report all horizontal segments that intersect s. The complexity of the algorithm is O(n log n + k), where n is the total number of segments and k is the number of intersections. If the polygon dataset is too big to fit in main memory, a distribution-sweeping-based algorithm can be used, in which an external sort is performed first, followed by a plane-sweep algorithm; its complexity is O(n log_m n + k), where n is the total number of segments, m is the number of pages, and k is the number of intersections. To compute the intersection rectangles themselves, the algorithms above need to be modified to maintain two active lists, one for the 'blue' rectangles intersecting the strip and one for the 'red' rectangles intersecting the strip. The complexity remains O(n log_m n + k).

Windowing

Windowing is a Boolean test of geometries (points, lines, polygons) against a search window. One first scans the edges of the polygon (or the segments of the polyline) to be tested and checks whether each of them intersects the window edges. At the end, one also tests whether the point/polyline/polygon is contained in the search window. The complexity of the algorithm is O(n), where n is the total number of edges or segments tested.

Clipping

Clipping computes the intersection geometries identified by the windowing operation. The idea is to consider each edge of the clipping rectangle as a half-space and to clip the polygon (or line segment) against each of these four half-spaces, one at a time. The complexity of the algorithm is O(n), where n is the total number of edges or segments tested.

Query based on attributes + location

For a loosely coupled system, a query based on both attributes and location must be run on the two systems separately and the results combined at the end. For an integrated system, in contrast, the query can be executed much more efficiently.

Spatial join

The spatial join between relations R1 and R2 constructs the pairs of tuples from R1 and R2 whose spatial attributes satisfy a spatial predicate; usually the predicate is overlap. A spatial join generally has a filter step and a refinement step. The refinement step, namely testing or computing the overlap between two geometries, has been discussed above. Here we focus on the filter step, which joins the two datasets and selects the candidate pairs of geometries to be passed to the refinement step. There are four scenarios: both datasets are indexed with a linear structure (e.g., z-ordering), both datasets are indexed by R-trees, only one of the datasets is indexed, and neither dataset has an index.

Spatial join with z-ordering index

We assume the two datasets are indexed with the z-ordering method and have the same tree depth. To join the two datasets, the lists of their leaf entries are merged first, and a pair of entries from the two lists is kept as a candidate if one key is a prefix of the other. After this, the candidate pairs are sorted and duplicates removed, which takes O(n log n) time.

Spatial join with R trees

We assume the two datasets are indexed with R-trees of the same depth. The first method of spatial join is a depth-first traversal of both R-trees, which requires a nested loop over the node entries. To avoid this inefficient nested loop, the second method restricts the search space by applying the nested loop only to entries within the intersection of the two directory rectangles. To avoid the nested loop altogether, the third method applies a sweep-line technique to the entries of the two directory rectangles. If the two R-trees have different depths, a window query is needed once the leaf level of the shallower tree is reached.

Spatial join with a single index

If only one dataset is indexed, an indexed nested-loop strategy is used. The algorithm scans the non-indexed dataset and, for each of its records, sends the index of the other dataset a window query with the record's MBB as the search window.

Spatial join without indices

If neither dataset has an index, a spatial hash join can be carried out. The first step partitions the first dataset into several buckets, where each MBB is assigned to a single bucket and the bucket extent is the MBB of all the MBBs inside it. The second step assigns each MBB of the second dataset to every bucket created in the first step whose extent overlaps it. The third step joins each bucket of the first dataset with the corresponding bucket of the second dataset having the same extent. This process can be further sped up by a plane-sweep algorithm or an internal data structure.

Query execution plan

A query execution follows this flow: 1) Syntactic analysis. The user query is checked against the grammar rules by a parser and against the schema information by a type checker; the output of the syntactic analysis is an algebraic tree. 2) Semantic analysis. An optimizer finds the best way to execute the query by applying rewrite rules and selecting the best access path; the output of the semantic analysis is the query execution plan. 3) Evaluation. This step is done by the query processor. The query execution plan is usually a tree that is the inflated equivalent of the algebraic expression. Typically, a sequential spatial join or a multi-way spatial join needs a query execution plan.

Data model conversion

Under this topic, I consider only the two most commonly used data models, raster and vector, although quite a few other data models exist. The topic is closely related to digital image processing, and I will leave most of the algorithmic details there. A brief introduction follows.

Vectorization

Vectorization generally has five steps. The first step is random noise removal. Depending on the noise type, this can be done by low-pass filtering, spatial averaging, median filtering, unsharp masking, or image restoration; types of noise and removal methods are introduced in more detail in the image processing part. The second step is thresholding. We assume the raster data to be vectorized is a grayscale image, and this step generates a binary image where black (1) is the foreground and white (0) is the background. Thresholding compares the grayscale value (intensity) of each pixel against a pre-set threshold: if the value is greater than the threshold, the pixel is assigned the foreground value (1); otherwise it is assigned the background value (0). The threshold can be a global one applied to the whole image (simple thresholding) or a local one applied to a part of the image (adaptive thresholding). The third step is thinning or skeletonization, a kind of image erosion that keeps only the primary feature (e.g., the medial axis) in the image and removes the rest. The fourth step is chain coding, which links the edge pixels found in the previous step; how to encode linear features or boundaries in a raster with the chain code method was discussed in the 2D ordering section of the physical DBMS design part. The last step is the transformation into vectors. Here a vector reduction is needed, and one must balance line smoothness against conversion accuracy. The Douglas-Peucker algorithm is commonly used for line simplification. Its basic idea is to choose a distance threshold beforehand depending on the target scale, connect the two endpoints of the polyline to be simplified, find the vertex of the polyline farthest from this connecting line, and compare that distance to the threshold. If the distance is greater than the threshold, this vertex is connected to each of the polyline endpoints and the steps above are repeated recursively until the distance falls below the threshold.
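
A minimal recursive Python sketch of the Douglas-Peucker simplification described above (a distance tolerance stands in for the "threshold depending on the target scale"; names are illustrative):

import math

def point_line_distance(p, a, b):
    # Perpendicular distance from p to the line through a and b.
    if a == b:
        return math.dist(p, a)
    num = abs((b[0] - a[0]) * (a[1] - p[1]) - (a[0] - p[0]) * (b[1] - a[1]))
    return num / math.dist(a, b)

def douglas_peucker(polyline, tolerance):
    # Keep the endpoints; recurse on the farthest vertex if it exceeds the tolerance.
    if len(polyline) < 3:
        return list(polyline)
    a, b = polyline[0], polyline[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(polyline) - 1):
        d = point_line_distance(polyline[i], a, b)
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [a, b]                      # all intermediate vertices are dropped
    left = douglas_peucker(polyline[:index + 1], tolerance)
    right = douglas_peucker(polyline[index:], tolerance)
    return left[:-1] + right               # avoid duplicating the split vertex

line = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(douglas_peucker(line, 1.0))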

Rasterization

Rasterization is usually done for display purposes and will not be discussed here.

Network analysis

Networks are typical models for representing topology and are usually represented by graphs. The network space and the types of graphs were discussed in the Geospatial Data section. From a computing perspective, three representations are useful for networks: the set of labeled edges, the adjacency matrix, and the adjacency list. Compared to a set of labeled edges, which is efficient for storage, an adjacency matrix, which stores the path cost of each node pair in matrix form, is efficient for computation. However, this computing efficiency comes at the cost of inefficient storage, because duplicate matrix elements are stored for an undirected graph and extra values such as 0 and infinity have to be stored to represent the path from a node to itself and the absence of a path between two nodes. The adjacency list strikes a balance between storage and computing efficiency by eliminating the empty entries of the adjacency matrix. Below I go through some basic algorithms for the network model.

Systematic traversal of the nodes

A systematic traversal of the nodes in a network follows one of two basic strategies: depth-first or breadth-first traversal. For a depth-first traversal, the nodes to visit are kept on a stack, which follows a last-in-first-out (LIFO) discipline. For a breadth-first traversal, by contrast, the nodes are kept in a queue, which follows a first-in-first-out (FIFO) discipline.
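
A minimal Python sketch contrasting the two traversals on an adjacency-list graph: a list used as a LIFO stack for depth-first, and collections.deque used as a FIFO queue for breadth-first (the graph and names are illustrative only).

from collections import deque

def depth_first(graph, start):
    visited, order, stack = set(), [], [start]      # LIFO stack
    while stack:
        node = stack.pop()
        if node not in visited:
            visited.add(node)
            order.append(node)
            stack.extend(reversed(graph[node]))     # push neighbours
    return order

def breadth_first(graph, start):
    visited, order, queue = {start}, [], deque([start])   # FIFO queue
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return order

g = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(depth_first(g, 'A'))    # ['A', 'B', 'D', 'C']
print(breadth_first(g, 'A'))  # ['A', 'B', 'C', 'D']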

Shortest path problem

The shortest path problem is one of the classical network problems: find the shortest or least-cost path within a network. Three algorithms are introduced here: Dijkstra's algorithm, the A* algorithm, and the Floyd-Warshall algorithm. Dijkstra's algorithm finds the shortest paths from one starting node to all other nodes. It follows a breadth-first-like strategy, and its time complexity is O(n²), where n is the number of vertices. The A* algorithm is a goal-directed version of Dijkstra's algorithm. It uses a combined heuristic that considers the path cost so far plus the distance to the destination given by an estimation function; thus A* heads toward the goal instead of traversing unnecessary nodes. Its time complexity is O(n²), where n is the number of vertices. The Floyd-Warshall algorithm finds the all-pairs shortest paths and allows negative edge weights (but no negative cycles). The idea is to consider every possible intermediate node in turn and keep the shortest path found. Its time complexity is O(n³), where n is the number of vertices.
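
A minimal Python sketch of Dijkstra's algorithm on a weighted adjacency-list graph using a binary heap; with a heap the complexity is O((n + e) log n) rather than the O(n²) of the array-based version described above. The graph and names are illustrative only.

import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbour, edge_cost), ...]}; returns the shortest cost to every node.
    dist = {node: float('inf') for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                          # stale entry, node already settled
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

g = {'A': [('B', 4), ('C', 1)], 'B': [('D', 1)], 'C': [('B', 2), ('D', 5)], 'D': []}
print(dijkstra(g, 'A'))   # {'A': 0, 'B': 3, 'C': 1, 'D': 4}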

Traveling salesperson problem

The traveling salesperson problem asks for the shortest possible route that visits every node exactly once and returns to the starting point. No polynomial-time solution is known for this problem; a brute-force solution has Θ(n!) time complexity.

Map algebra

According to Tomlin’s representation model, layer-by-layer geographic analysis can be classified as local analysis, focal analysis, and zonal analysis, depending on the cell ranges considered in the analytical operations. This can be called as map algebra.

Local analysis

A local operation computes a new value for each cell location as a function of the existing data explicitly associated with that location on one or more layers. Classification is one type of local operation: it assigns each cell to a class via a mathematical function or a look-up table and gives the cell the corresponding class number. Generalization is another type of local operation, which reduces the detail associated with individual locations on a layer. Overlay operations consider multiple layers and compute the new value for each location according to some criteria combining the layers considered.

Focal analysis

Focal operations compute new values for each location as a function of the values in its neighborhood. The neighborhood can be the immediate neighborhood, meaning the directly connected cell locations, or an extended neighborhood, meaning the neighbors within a certain distance. Additionally, a floating neighborhood means that every location on the layer is considered in turn, while a static neighborhood characterizes each location by its spatial relation to a fixed zone. Focal analysis includes neighborhood operations, interpolation operations, surficial operations, and connectivity operations. Neighborhood operations assign new attribute values to individual locations based on distance, topology, or direction within the neighborhood. Interpolation operations assign new attribute values to individual locations by considering, for example averaging, the values in the neighborhood; here interpolation does not have the same meaning as in geostatistics. Surficial operations assign surficial characteristic values, such as slope and aspect, to individual locations. Connectivity operations assign new attribute values to individual locations based on a running total over the neighborhood, as in optimum path finding and intervisibility.
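
As a generic illustration of a focal operation (not any particular GIS package function), here is a minimal NumPy sketch of a 3 x 3 focal mean computed for every cell of a raster layer, using edge padding so that border cells keep a full window.

import numpy as np

def focal_mean(layer, radius=1):
    # Mean of the (2*radius+1)^2 immediate neighbourhood around every cell.
    padded = np.pad(layer, radius, mode='edge')
    out = np.zeros_like(layer, dtype=float)
    rows, cols = layer.shape
    for i in range(rows):
        for j in range(cols):
            window = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            out[i, j] = window.mean()
    return out

dem = np.array([[1., 2., 3.],
                [4., 5., 6.],
                [7., 8., 9.]])
print(focal_mean(dem))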

Zonal analysis

Zonal operations compute new values for each location as a function of the existing values associated with a zone containing that location. The zone on which the operations are applied can be the entire zone or sub-zones. Zonal operations include search operations and measurement operations: search operations retrieve information characterizing the individual locations on a layer that coincide with the zones of another layer, and measurement operations assign new values to individual locations on a layer corresponding to a measurement made over a zone.

Terrain-related analysis

Under this topic, I go through five sub-topics: terrain data acquisition, data format, deterministic interpolation, topographical analysis, and hydrological analysis.

Terrain data acquisition

There are three main approaches to acquiring terrain data: ground surveying, aerial or satellite-based photogrammetry, and active remote sensing. Ground surveying generates high-accuracy elevation data at the cost of time and labor. Photogrammetric data capture is discussed in detail in the photogrammetry section; the basic idea is to compute the 3D coordinates of points of interest from stereo pairs. Geo-referencing using DGPS/INS helps to obtain the exterior orientation parameters of the sensors at exposure time, correlation matching and least-squares matching help to determine point pairs on the stereo images, and space intersection and bundle adjustment are then used to calculate the 3D coordinates of the objects. Sampling strategies for photogrammetric data capture include systematic sampling over the image, progressive sampling (which adapts to the complexity of the terrain surface and leads to a hierarchical sampling pattern), selective sampling (which needs human intervention), and systematic measurement along contours. Progressive sampling can be combined with selective sampling into a composite sampling strategy. Depending on the platform used to obtain the images, photogrammetric data capture can be aerial or satellite-based (e.g., ASTER data), with aerial photogrammetry usually giving better quality. Active remote sensing techniques generally include RADAR and LiDAR, which use radio waves and pulsed laser light, respectively, to carry out ranging measurements. The SRTM data are an example of a product generated by RADAR interferometry.

Data format

Three common terrain data formats are introduced here: regular grids (DEMs), contours, and the triangulated irregular network (TIN). Regular grids are the most commonly used terrain format because of their similarity to the array structures of digital computers, which makes their management relatively convenient. Regular grids can be generated by gridding interpolation from contours, ground surveying, or photogrammetric data. To generate grids from contours, three steps are needed: convert the scanned contour maps into vectors (introduced later), manually tag each contour line with its elevation value, and compute the grid values with the sequential steepest slope algorithm. This algorithm follows these steps: 1) establish four lines through the grid point along eight directions, 2) find the intersection of each of the eight directions with the nearest contour, 3) calculate the slope along each of the four lines, 4) select the steepest slope, and 5) calculate the elevation value of the grid point by linear interpolation along the selected line. The drawback of regular grids is the fixed grid interval, so the grid density cannot adapt to the complexity of the terrain relief, although a quadtree structure can be used to encode and compress terrain rasters.

Contour lines are lines on a map connecting points of equal elevation. Digitizing a paper contour map follows these steps: 1) scan the paper map, 2) remove the noise, and 3) generate the contours by edge enhancement, binary extraction, skeletonization, and contour following. These processing methods are discussed in detail under the 'Digital Image Processing' topic. To generate contours from a TIN, one decides on the contour intervals, interpolates the values using the neighboring vertices, and carries out line simplification at the end if necessary.

The TIN format can adapt to the complexity of the terrain relief and avoids gathering redundant data over simple relief. TIN overcomes the inflexibility of regular grids at the cost of a more complex data format and representation, which includes nodes, edges, triangles, and topology. A TIN is created based on the Delaunay triangulation. To construct a TIN, one starts from a point set with the 3D coordinates of each point and creates triangles that are as close to equilateral as possible by connecting neighboring points. The Delaunay triangulation of a point set is unique; it requires that the circumcircle of each triangle contains no other member of the point set in its interior and that the external edges of the triangulation form the convex hull of the point set. Thiessen polygons, or Voronoi diagrams, are the duals of Delaunay triangulations and are obtained by connecting the circumcenters of neighboring triangles.

Deterministic interpolation

Interpolation is the process of predicting an attribute value at an unsampled location from measurements made at point locations within the same area or region. Interpolation of terrain data is useful in the following cases: computing elevations at individual point locations, computing the elevations of a rectangular grid from scattered sampling points (i.e., gridding), computing the elevations of points between contours, and densifying or coarsening a rectangular grid (i.e., resampling). The interpolation methods introduced in this section are deterministic, i.e., they do not use geostatistics. Deterministic interpolation can be classified according to whether the predicted values equal the observed values at all sampled points, or according to whether the same set of interpolation parameters is used for the whole area. An exact interpolation predicts the same values as the observed values at each surveyed point, while an inexact interpolation does not. A global fit interpolation uses the same function and the same parameters for the whole study area, while a local fit interpolation considers the local situation at each location and applies different interpolation parameters. Deterministic interpolation has two main drawbacks. The first is that the goodness of the prediction can only be evaluated by computing estimates for a set of extra checkpoints that were not used in the original interpolation. The second is that there is no a priori way to determine the best weighting parameters or the size of the search window. Different deterministic interpolation methods are introduced below.

Trend surface interpolation is a common global fit method, which fits a polynomial surface to the whole area. The geographic coordinates of the sampled points are used to build the function, which is essentially a polynomial expanded to two dimensions, and the polynomial parameters are estimated by minimizing the squared deviations from the trend. The method has the advantages of a simple form and of estimating the same surface regardless of the orientation of the reference geography. However, it has a few main drawbacks: 1) the predicted surface is extremely simple, 2) the surface does not pass through the data points but rather has an averaging character, 3) the surface accelerates without limit in areas without control points, which leads to edge effects, and 4) high-degree polynomial trend surfaces present computational difficulties.

Local fit interpolation covers several sub-topics, including TIN interpolation, grid interpolation, spline surface interpolation, and gridding. TIN interpolation can be computed by linear interpolation, second-degree exact fit interpolation, or quintic interpolation. Among these, linear interpolation is the most straightforward: it involves the three vertices of the triangular facet in question and computes three coefficients. Linear interpolation produces a continuous but not smooth surface, where the normal is constant within each triangular facet but changes abruptly across the edges. The second-degree exact fit interpolation takes account of the three closest triangular facets and fits a second-degree polynomial trend surface involving six vertices; six coefficients need to be determined, and the resulting contours are curved rather than straight. Like linear interpolation, it produces a continuous but not smooth surface with abrupt changes of the normal across edges. In contrast, quintic interpolation produces a continuous and smooth surface by fitting a bivariate fifth-degree polynomial in x and y, involving 21 coefficients in total.

Grid interpolation has several common methods, including nearest-neighbor interpolation, bilinear interpolation, and cubic convolution. The nearest-neighbor method simply assigns the value of the nearest grid mesh point to the unknown point. Bilinear interpolation takes the grid mesh as the basic unit and defines a piecewise polynomial function for each unknown point; four coefficients have to be determined, and it produces a continuous but not smooth surface. Cubic convolution is similar to bilinear interpolation except that its 16 coefficients are calculated from the 16 nearest input mesh points; it produces a continuous surface with a tendency to smooth the data. Additionally, inverse distance weighted (IDW) interpolation weights the nearby data points according to their distances from the location being interpolated.
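
A minimal Python/NumPy sketch of IDW interpolation: the estimate at an unsampled location is the distance-weighted average of the sample values, with the power parameter controlling how quickly the weights decay (function names and the default power are my own choices).

import numpy as np

def idw(sample_xy, sample_z, target_xy, power=2.0):
    # sample_xy: (n, 2) coordinates, sample_z: (n,) values, target_xy: (2,) location.
    d = np.linalg.norm(sample_xy - target_xy, axis=1)
    if np.any(d < 1e-12):
        return float(sample_z[np.argmin(d)])       # exact hit on a sample point
    w = 1.0 / d ** power                           # inverse-distance weights
    return float(np.sum(w * sample_z) / np.sum(w))

xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
z = np.array([100.0, 110.0, 105.0, 120.0])
print(idw(xy, z, np.array([2.0, 3.0])))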

Gridding is the process of computing the elevations of a rectangular grid from scattered sampling points. This can be done by the moving average method and the linear projection method.

Besides the interpolation methods introduced above, spline surface interpolation is another local fit method. It minimizes the cumulative sum of the squares of the second-derivative terms taken over each point of the surface. I will not write more tedious details about gridding and spline surface interpolation here.

Topographical analysis

Generated terrain data, especially regular grid data, have many applications in topographical and hydrological analysis. Slope and aspect are first derivatives of the DEM. Slope is the rate of elevation change at each location and is a measure of the steepness of the terrain surface. Aspect measures the direction of the elevation change at each location and can be understood as the direction the cell faces. Surface curvature is the second-order derivative of the DEM and measures the rate of change of slope. A shaded relief can be created from a DEM; it simulates how the surface would look under different sun positions, defined by azimuth (the rotation angle from north) and altitude (the elevation above the horizon). An orthophoto or orthoimage is a rectified digital image in which the central projection of the photograph is transformed into an orthogonal view of the ground. The process of removing the distorting effects of tilt and terrain relief is called orthorectification, or simply rectification; it can be done by polynomial rectification or differential rectification, where the latter gives more accurate results at the cost of a more expensive process. A DEM or contours can also be used to calculate 3D surface area and volume. To calculate the surface volume from contours, one splits the ground along the contour planes into a series of horizontal slabs; each slab is viewed as a prismoid whose height is the contour interval and whose upper and lower faces are the corresponding areas enclosed by the contours, and the surface volume is the sum of all slab volumes. To calculate the surface volume from a regular grid, one simply sums, for each cell, the product of the cell area and its average height.
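
A minimal NumPy sketch of slope and aspect as first derivatives of a gridded DEM, using finite differences via numpy.gradient. The conventions assumed here (rows running north-south, aspect measured clockwise from north, results in degrees) are my own simplifying choices, not a standard from the references.

import numpy as np

def slope_aspect(dem, cellsize=1.0):
    # dz/dy along rows (assumed north-south), dz/dx along columns (assumed west-east).
    dzdy, dzdx = np.gradient(dem, cellsize)
    slope = np.degrees(np.arctan(np.hypot(dzdx, dzdy)))      # steepness in degrees
    aspect = np.degrees(np.arctan2(-dzdx, dzdy)) % 360.0     # downslope direction, clockwise from north (assumed convention)
    return slope, aspect

dem = np.array([[10., 10., 10.],
                [ 9.,  9.,  9.],
                [ 8.,  8.,  8.]])
slope, aspect = slope_aspect(dem, cellsize=30.0)
print(slope.round(2))
print(aspect.round(1))    # 180 everywhere: the surface dips toward the south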

Hydrological analysis

A few commonly used hydrological analyses are briefly introduced here. Depression filling fixes local anomalies in the elevations to ensure the proper delineation of basins and streams. Flow direction describes the steepest downhill path from each location, usually restricted to four or eight directions for square regular grids. The drainage network connects the flow movements based on the flow directions. The watershed is an attribute of each point of the network that identifies the region upstream of that point. Finally, the flow accumulation of each cell is the number (or weighted sum) of all cells upstream of it that drain into it, following the directions indicated in the drainage network.

Geostatistics

Geostatistics is a big topic, and I will not dive too much into it here. These topics are visited: Kriging, point pattern analysis, autocorrelation analysis, and spatial statistical modeling.

Kriging

Unlike the deterministic interpolation introduced above, geostatistical interpolation such as Kriging can determine the size, orientation, and shape of the neighborhood and can evaluate the errors associated with the interpolated values. Variograms or semivariograms are important tools in Kriging interpolation and serve as a reference for the size of the neighborhood by plotting the semivariance as a function of the lag distance. The semivariance at a given lag is half the average squared difference between values separated by that lag, expressing the degree of relationship between points on the surface, and the lag distance is the separation distance at which the semivariance is measured. On a typical semivariogram, the nugget is the semivariance value when the lag distance is zero, and the range is the distance over which the regionalized variable is autocorrelated; beyond the range the autocorrelation vanishes. The range can be used as the maximum neighborhood in Kriging interpolation, and beyond it the semivariance becomes stable and roughly equal to the variance of the surface; this stable semivariance value is called the sill. There are some common assumptions about regionalized variables. Stationarity assumes that the correlation between the values of a regionalized variable at two locations is independent of their absolute spatial locations. Intrinsic stationarity assumes that the variance of the difference between the values at two locations is the same whenever the two locations are separated by the same distance and direction. Second-order stationarity assumes that the covariance between the values at two locations is the same whenever the two locations are separated by the same distance. If the regionalized variable meets the stationarity assumption, the lag distance is treated as a scalar when generating the semivariogram, which means the lag can be measured over all possible directions. If the regionalized variable does not meet this assumption, i.e., it is anisotropic, the lag distance is treated as a vector rather than a scalar; in that case the directions between locations are discretized into bins and the variograms are estimated separately for each direction class.
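
A minimal Python/NumPy sketch of an experimental (isotropic) semivariogram: for each lag bin, it averages half the squared differences between all sample pairs whose separation falls in that bin. It assumes stationarity, so the lag is treated as a scalar; the names, binning choices, and synthetic data are my own.

import numpy as np

def experimental_semivariogram(xy, z, lag_width, n_lags):
    # xy: (n, 2) sample coordinates, z: (n,) sample values.
    n = len(z)
    gamma = np.zeros(n_lags)
    counts = np.zeros(n_lags, dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            h = np.linalg.norm(xy[i] - xy[j])              # scalar lag distance
            k = int(h // lag_width)
            if k < n_lags:
                gamma[k] += 0.5 * (z[i] - z[j]) ** 2       # semivariance contribution
                counts[k] += 1
    lags = (np.arange(n_lags) + 0.5) * lag_width           # bin centres
    return lags, np.where(counts > 0, gamma / np.maximum(counts, 1), np.nan)

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(50, 2))
z = xy[:, 0] * 0.1 + rng.normal(0, 1, 50)                  # synthetic trend plus noise
lags, gam = experimental_semivariogram(xy, z, lag_width=10.0, n_lags=8)
print(np.round(gam, 2))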

Kriging interpolation generally follows these steps: 1) determine the experimental covariance function from the measured elevations at the reference points, 2) choose a model covariance function that resembles the experimental one (e.g., an exponential, Gaussian, or spherical model), 3) determine the parameters of the model covariance function by least-squares adjustment, and 4) determine the interpolated elevation and its quality. There are several different Kriging interpolations, including simple Kriging, ordinary Kriging, block Kriging, point Kriging, and universal Kriging. Simple Kriging generates the best linear estimate of the elevation and is used when the trend is completely known (that is, all parameters and covariates are known). Ordinary (punctual) Kriging generates the best linear unbiased estimate of the elevation; it assumes stationarity and treats the trend as a simple constant. In this case the local means are not necessarily closely related to the population mean, so only the samples in the local neighborhood are used for the estimate. Two variants of ordinary Kriging, block Kriging and point Kriging, estimate the value of a block or of a point, respectively, from a set of nearby sample values. Universal Kriging is used when the regression coefficients are unknown and no stationarity is assumed; in this case the regionalized variable is viewed as a drift surface plus a residual surface. Universal Kriging follows three general steps: 1) achieve stationarity by estimating and removing the drift, 2) krige the stationary residuals, and 3) combine the estimated residuals with the drift to obtain estimates of the actual surface. In short, geostatistical interpolation has the advantages of smoothing, declustering, anisotropy, and precision. Smoothing means that Kriging estimates are based on the proportion of the total sample variance accounted for by random noise. Declustering means that the weight assigned to a point is lowered if its information is duplicated by nearby, highly correlated points. Anisotropy means that Kriging can account for directions of correlation, giving greater weights to samples lying in the direction of higher correlation. Lastly, precision means that Kriging computes the most precise estimates possible from the measured sample points.

Point pattern analysis

Point pattern analyses indicate how individuals are located with respect to each other by testing the null hypothesis of complete spatial randomness. Statistics proposed to test spatial point patterns include Pielou's, Clark and Evans's, Pollard's, Johnson and Zimmer's, and Eberhardt's statistics.

Auto-correlation analysis

Auto-correlation analysis includes global and local auto-correlation tests. Common statistics for global tests are global Moran's I, global Geary's C, and Ripley's K. Common statistics for local tests are local Moran's I, local Geary's C, and the Getis-Ord Gi* statistic. These can be used to detect hot spots and cold spots along with their confidence levels.
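
A minimal NumPy sketch of global Moran's I with a user-supplied spatial weights matrix (here a simple inverse-distance weight with a zero diagonal); this illustrates the statistic itself, not any particular software implementation, and the weight scheme is my own choice.

import numpy as np

def global_morans_i(values, weights):
    # values: (n,) attribute; weights: (n, n) spatial weights with zero diagonal.
    z = values - values.mean()
    s0 = weights.sum()
    return (len(values) / s0) * (z @ weights @ z) / (z @ z)

coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]], dtype=float)
vals = np.array([10.0, 12.0, 11.0, 13.0, 30.0])
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
w = np.zeros_like(d)
nz = d > 0
w[nz] = 1.0 / d[nz]          # inverse-distance weights, zero on the diagonal
print(round(global_morans_i(vals, w), 3))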

Spatial statistical modeling

Statistical models accounting for spatial autocorrelation in model residuals include autocovariate regression, spatial eigenvector mapping, generalized least squares, conditional autoregressive models, simultaneous autoregressive models, generalized estimating equations, and geographically weighted regression or GWR.

Spatial data mining

Knowledge discovery from databases (KDD) is the process of discovering hidden information or data patterns that are valid, novel, useful, and understandable from large databases. Typically, a KDD process starts with data selection, which determines a subset of the data in a database. Data pre-processing is the data cleaning process that fixes noise, duplications, voids, and so on. Data enrichment combines the selected data with other, external data. Data reduction and projection transform the data into more efficient representations by reducing the number of dimensions or records. Data mining is the process of discovering patterns from the data after the above operations. Pattern interpretation and reporting evaluate and help understand the discovered patterns. This section mainly focuses on data pre-processing and data mining techniques.

Data pre-processing

General data pre-processing includes the following aspects: descriptive data summarization, data cleaning, data integration and transformation, data reduction, discretization, and concept hierarchy generation.

Four aspects are introduced below.

Data cleaning

Data cleaning aims to fill in missing values, remove outliers, fix inconsistent data, and eliminate redundancy. To handle noisy data, one can use: 1) binning – sort the data and partition it into bins, then smooth by bin means, bin medians, bin boundaries, etc.; 2) regression – smooth the data by fitting regression functions; 3) clustering – detect and remove outliers; and 4) combined computer and human inspection – detect suspicious values by computer and have a human double-check them.

Data integration and transformation

Data integration aims to integrate multiple databases, data cubes, or data files. Data transformation normalizes and aggregates the integrated data. The following operations are practical here: 1) smoothing – remove noise from the data; 2) aggregation – summarize the data or construct data cubes; 3) generalization – climb concept hierarchies; 4) normalization – scale the data to fall within a small, specified range; and 5) attribute or feature construction – construct new attributes from the given attributes.

Data reduction

Data reduction aims to reduce the data volume by reducing the number of dimensions or records while producing similar analytical results. Data reduction can be achieved by 1) data cube aggregation; 2) dimensionality reduction – remove unnecessary attributes; 3) data compression; 4) numerosity reduction; and 5) discretization and concept hierarchy generation.

Discretization and concept hierarchy generation

The target of data discretization is to divide continuous data into discrete intervals. Typical methods include 1) binning; 2) histogram analysis; 3) clustering analysis; 4) decision-tree analysis; and 5) correlation analysis.

Data mining techniques mainly include segmentation or clustering, classification, association, deviation, trends and regression, and generalization. These techniques are briefly introduced below.

Segmentation or clustering

Segmentation or clustering determines a finite set of implicit groups that describe the data by partitioning the data into groups or classes. Cluster analysis algorithms, normally unsupervised machine learning, are applied to determine the classes. There are four general families of clustering methods: partitioning methods, hierarchical methods, density-based methods, and grid-based methods. I will not dive too deeply into specific algorithms.

K-means clustering and k-medoids clustering are two commonly used partitioning methods. K-means clustering starts with the user randomly selecting k seeds that initially represent the cluster means. Each of the remaining objects is then assigned to its closest cluster, and a new mean is calculated for each cluster from its objects; the assignment and update steps are repeated until a convergence criterion is met. The method works well when the clusters are compact and well separated, but it is sensitive to noise and requires the user to predefine the number of clusters. Unlike k-means, k-medoids clustering uses a medoid, which is an actual object, to represent each cluster. Specific approaches include partitioning around medoids (PAM), clustering large applications (CLARA), and clustering large applications based on randomized search (CLARANS).
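
A minimal NumPy sketch of k-means following the steps above: pick k random seeds, assign each object to the nearest mean, recompute the means, and iterate until the assignments stop changing. The names and the convergence test are my own choices.

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]   # random seeds
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its closest cluster mean.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # convergence: assignments unchanged
        labels = new_labels
        # Update step: recompute each cluster mean from its members.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers

pts = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centers = k_means(pts, k=2)
print(labels, centers.round(2))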

Hierarchical clustering methods are based on a tree of clusters. There are two categories of algorithms: the bottom-up (agglomerative) approach and the top-down (divisive) approach. Agglomerative nesting (AGNES) is a bottom-up approach that merges the nodes with the least dissimilarity, while divisive analysis (DIANA) is a top-down approach that proceeds in the inverse order of AGNES.

Density-based clustering can discover clusters of arbitrary shapes and regards clusters as dense regions of objects in the data separated by regions of low density (i.e., noise). Density-based spatial clustering of applications with noise (DBSCAN) is one of the most important algorithms in this category. It examines the number of data points within a given distance of a specific point and compares it to a pre-defined minimum number of points, which defines the density of that point's neighborhood. Ordering points to identify the clustering structure (OPTICS) is an extension of DBSCAN that imposes an ordering on the objects to support hierarchical density-based clustering. The detailed algorithms are not introduced here.

Grid-based clustering methods quantize the object space into a finite number of cells to construct a grid structure. They have relatively fast processing times because the cell is the basic data unit and the processing time depends on the number of cells rather than the number of objects. The statistical information grid approach (STING) is an example of grid-based clustering; it is query-independent, easy to parallelize, and supports incremental updates.

Classification

Classification aims to predict the class labels to which data belong based on training datasets, by finding rules or methods to assign data items to pre-existing classes. This is usually referred to as supervised machine learning. General methods include Bayesian classification, decision tree induction, artificial neural networks, and support vector machines.

Association

Association aims to find rules that predict object relationships as well as the values of some attributes based on the values of other attributes. Association rules are a type of dependency relationship, expressed as x -> y (c%, r%), where x and y are disjoint sets of database items, c% is the confidence, and r% is the support. Bayesian networks can be used in data association.

Deviation

Deviation aims to find data items that exhibit unusual deviations from the expectation, which include errors or special cases. Common methods of deviation detection include clustering, outlier detection, and evolution analysis.

Trends and regression

Trend and regression analysis fit lines or curves to the data for summarization, explanation, and prediction. Linear or logistic regression, stepwise regression, and sequential pattern extraction are commonly used.

Generalization

Generalization produces a compact description of the data. Summary rules are a small set of logical statements that condense the information in the database, and characteristic rules are assertions that data items belonging to a specified concept have the stated properties.

Main references

Rigaux, P., Scholl, M. and Voisard, A., 2002. Spatial Databases: With Application to GIS, San Francisco: Morgan Kaufmann Publishers.

Stefanakis, E., 2014. Geographic Databases and Information Systems, North Charleston, SC: CreateSpace Independent Publishing Platform.

Stefanakis, E., 2015. Web Mapping and Geospatial Web Services: An Introduction, North Charleston, SC: CreateSpace Independent Publishing Platform.

Tomlin, C.D., 1990. Geographic information systems and cartographic modeling, Englewood Cliffs, N.J.: Prentice-Hall.

Worboys, M. and Duckham, M., 2004. GIS: a computing perspective, Boca Raton: CRC Press.

Supplementary course notes

UNB GGE 4423 Advanced GIS taught by Dr. Emmanuel Stefanakis

UofC ENGO 645 Spatial Databases and Data Mining taught by Dr. Emmanuel Stefanakis