GIS Basics: Geospatial Data

January 20, 2021

Good to begin well, better to end well.

– June K. Robinson

I set up a mind map and wrote detailed documentation to help myself organize, understand, and memorize knowledge and concepts regarding GIS when preparing for my PhD candidacy exam.

This is the third part of the doc: Geospatial Data.

This topic is about geospatial data, and I will start from the data capture methods and spatial object, and go through the data repository, modeling process, data quality, spatial data model, and spatial data format.

Data capture methods

Methods to acquire geospatial data include geodetic measurement, census and sampling, aerial photographs and remote sensing images, legacy data sources, other sensors, and volunteered geographic information (VGI). The term VGI was proposed in 2007. In contrast to citizen science and crowdsourcing, VGI highlights the active attitude of people when contributing data.

Spatial object

Spatial objects or geographic objects compose a theme which is a collection of geospatial information corresponding to a particular topic. A spatial object can be an atomic object or a complex object if it consists of multiple spatial objects. According to its geometry, a spatial object can also be classified as a point object, linear object, or area object. Three components are essential for a spatial object: identifier, spatial attributes, and thematic attributes. The spatial dimensions of a spatial object include its geographic position, geometry characteristics, graphical properties, and spatial relations to other objects. The schema of a spatial object is the collection of its attributes. Dynamic spatial objects also have their properties like life and motion, and temporal relations exist among dynamic spatial objects.

Modeling process

A model is an artificial construction in which the source domain is represented in the target domain. The purpose of a model is to simplify and abstract away from the source domain. Modeling stages in GIS follow a life cycle that starts from the application domain, which is the subject of an application domain model. For example, if the application domain is a package delivery network, then the application domain model would be a network model constructed through an initial study. A conceptual computational model, for example an object-oriented model or a relational model, is then developed through system analysis. After system design, a logical computational model is developed for a specific kind of system such as an RDBMS, OODBMS, or ORDBMS. Finally, the physical computational model is established from the logical model through system implementation.

Modeling space

The discussion of modeling space includes Euclidean space, network space, and metric space.

Euclidean space

Euclidean space is a space of any finite number of dimensions (usually 2D or 3D) in which the axioms and postulates of Euclidean geometry apply. A Euclidean plane is a Euclidean space with two dimensions, and a Cartesian plane is a Euclidean plane plus one specific method of representing points. The search space or embedding space is the finite portion of space that is of interest for observation or analysis. In a Euclidean plane, typical simple transformations of geometries include translation, scaling, rotation, reflection, and shear. Transformations of the Euclidean plane can be broadly classified as Euclidean, similarity, affine, projective, and topological transformations, where all Euclidean transformations are similarity transformations, all similarity transformations are affine transformations, and all affine transformations are projective transformations. Translation, rotation, and reflection are examples of Euclidean transformations, which preserve the size and shape of the geometry. Uniform scaling is an example of a similarity transformation, which preserves shape but not size. Shear is an example of an affine transformation, which preserves affine properties (e.g., parallelism) but not shape. Central projection is an example of a projective transformation, which preserves projective properties. Topological transformations preserve the topological properties of embedded objects, that is, the properties preserved under a homeomorphism. Elliptic space and hyperbolic space are examples of non-Euclidean spaces, which I will not dive into here.
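To make the hierarchy of transformations more concrete, here is a minimal sketch in Python using 3x3 homogeneous matrices. The function names (apply, translation, scaling, rotation, shear_x, dist) are my own choices for illustration and do not come from any GIS library. The sketch compares segment lengths before and after each transformation: translation and rotation leave them unchanged, uniform scaling multiplies them all by the same factor, and shear changes them unevenly.

```python
import math

def apply(matrix, point):
    """Apply a 3x3 homogeneous transformation matrix to a 2D point."""
    x, y = point
    a, b, c = matrix[0]
    d, e, f = matrix[1]
    return (a * x + b * y + c, d * x + e * y + f)

def translation(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]

def scaling(s):
    # Uniform scaling about the origin.
    return [[s, 0, 0], [0, s, 0], [0, 0, 1]]

def rotation(theta):
    # Counterclockwise rotation about the origin, theta in radians.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def shear_x(k):
    # Shear parallel to the x-axis.
    return [[1, k, 0], [0, 1, 0], [0, 0, 1]]

def dist(p, q):
    return math.hypot(q[0] - p[0], q[1] - p[1])

origin, east, north = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
transformations = [
    ("translation", translation(2, 3)),
    ("rotation", rotation(math.pi / 2)),
    ("scaling", scaling(2)),
    ("shear", shear_x(1.0)),
]
for name, m in transformations:
    # Length of the two unit segments after the transformation.
    horizontal = dist(apply(m, origin), apply(m, east))
    vertical = dist(apply(m, origin), apply(m, north))
    print(name, round(horizontal, 3), round(vertical, 3))
```

Under the shear, the horizontal unit segment keeps its length while the vertical one is stretched, so the shape changes even though parallel lines stay parallel.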

Network space

Network space focuses on the topological properties of embedded objects and is usually represented by a graph. A graph is a highly abstracted model of spatial relations and expresses merely the connectedness between elements. A graph consists of nodes and edges (or arcs), where a node is a distinguished point that connects a list of edges, and an edge is a polyline that starts at a node and ends at a node. Graphs can be classified according to their properties. A graph with a label assigned to each edge is called a labeled graph. A connected graph is one in which any two nodes are connected by some path. A path from a node back to itself that traverses at least one edge is called a cycle; a graph that contains a cycle is cyclic, and a graph without cycles is acyclic. A tree is a connected, acyclic graph. A rooted tree has one of its nodes (i.e., the root) distinguished from the other nodes (i.e., immediate descendants and leaves). A directed graph is a graph in which each edge is assigned a direction, and it can be further classified as a cyclic directed graph or a directed acyclic graph. A graph is planar if it can be drawn in the plane so that its edges intersect only at nodes of the graph, and non-planar otherwise. Lastly, a dual graph can be constructed by associating a node with each face of a planar graph. For example, the dual graph of a diagonal triangulation of a polygon is a tree.
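As a small illustration of these definitions, the following Python sketch tests whether an undirected graph is connected and whether it is a tree. The representation (a list of node names plus a list of node pairs) and the function names are assumptions made for this example only.

```python
from collections import deque

def is_connected(nodes, edges):
    """Breadth-first search: every node must be reachable from the first one."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = nodes[0]
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)

def is_tree(nodes, edges):
    """A connected graph on n nodes is acyclic exactly when it has n - 1 edges."""
    return len(edges) == len(nodes) - 1 and is_connected(nodes, edges)

nodes = ["a", "b", "c", "d"]
tree_edges = [("a", "b"), ("a", "c"), ("c", "d")]
cyclic_edges = tree_edges + [("d", "a")]      # adds the cycle a-c-d-a
forest_edges = [("a", "b"), ("c", "d")]       # acyclic but not connected

print(is_tree(nodes, tree_edges))    # True
print(is_tree(nodes, cyclic_edges))  # False
print(is_tree(nodes, forest_edges))  # False
```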

Metric space

Metric space includes the concept of distance between objects in the space. Euclidean space is an archetypal example of a metric space. Geodesic distance is the length of the shortest path between two points: the line segment connecting them in a Cartesian plane, or the arc of the great circle through them in elliptic (spherical) space. Manhattan distance is the sum of the offsets along the individual directions, for example, along the x-axis and y-axis. Spherical Manhattan distance is measured as the difference in latitudes plus the difference in longitudes. The concept of distance can be generalized to the consequence of motion; for example, the travel time distance is the minimum time required to travel from one node to another. In addition, the difference between the positions of two nodes in a fixed list such as a gazetteer is termed the lexicographic distance. A metric has three defining properties: the distance between two points is never negative and is zero only when the points coincide, the distance is symmetric, and the distance satisfies the triangle inequality. Travel time distance may not define a metric space because it can lack symmetry: the trip from A to B may take longer than the trip from B to A.
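The distance notions above can be sketched in a few lines of Python. The great-circle distance uses the haversine formula on a sphere; the default radius of 6371 km is an assumed mean Earth radius, and the travel times in the last lines are made-up numbers that only illustrate how symmetry can fail.

```python
import math

def euclidean(p, q):
    """Straight-line distance in a Cartesian plane."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def manhattan(p, q):
    """Sum of the offsets along the x-axis and y-axis."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

def great_circle(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance on a sphere (angles in degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
print(great_circle(0, 0, 0, 90))   # a quarter of the equator on the assumed sphere

# Hypothetical travel times in minutes: one-way streets make the trip asymmetric,
# so this "distance" violates the symmetry property of a metric.
travel_time = {("A", "B"): 10, ("B", "A"): 15}
print(travel_time[("A", "B")] == travel_time[("B", "A")])  # False
```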

Coordinate systems

A coordinate system is a set of rules for determining the position of points in space using an ordered set of numbers. Depending on the reference surface it is associated with, a coordinate system can be a planar system, spherical system, or ellipsoidal system, where the spherical and ellipsoidal systems are often referred to as a 2.5-dimensional space. Otherwise, a coordinate system can be a true 3D space in which volume can be calculated.

Cartesian, polar, and geographic coordinate systems

A Cartesian coordinate system has its axes orthogonal to each other. A 2D Cartesian coordinate system is a planar system, and a 3D Cartesian coordinate system is a true 3D system. In particular, a geocentric Cartesian coordinate reference system puts the origin at the Earth’s center of mass; the Z-axis coincides with the rotation axis of the Earth, and the X-axis lies in the equatorial plane and passes through the Greenwich zero meridian. The Y-axis then completes a right-handed 3D Cartesian coordinate system.

A polar coordinate system is a planar system defined in a 2D space, where the position of a point is represented by a distance and an azimuth. It is used, for example, in the Hough transform for straight-line fitting in digital image processing.

A geographic coordinate system is a spherical system or ellipsoidal system referencing locations on the Earth’s surface using two angular measurements: latitude and longitude, closely related to parallels and meridians. However, a lat-lon pair can refer to different locations depending on the spheroid or ellipsoid it is based on.

Geoid, best-fitting ellipsoid, and datum

The geoid is the specific equipotential surface that best fits, in a least-squares sense, the global mean sea level. The mean sea level is the arithmetic mean height of the sea obtained by long-term observation of the sea level in one place using a tide gauge. The mean sea level itself is not an equipotential surface. Because the geoid is hard to describe mathematically, reference ellipsoids are used as mathematical models to approximate the geoid, either globally or locally. Typical reference ellipsoids include Clarke 1866, Hayford 1909, GRS 1980, etc., defined by parameters such as the semi-major axis, semi-minor axis, and flattening. A geodetic datum can then be defined from the selected reference ellipsoid by mounting it in a geocentric Cartesian coordinate system and specifying the center point, orientation, and scale of the reference surface. Hence, the horizontal datum and vertical datum are defined simultaneously. In this scenario, the height is measured along the ellipsoidal normal and is called the ellipsoidal height (h), which is what a GNSS normally offers. However, a physically meaningful height should be measured perpendicular to the equipotential surfaces, along the direction of the plumb line, from the geoid; this height is called the orthometric height (H). The geoid undulation or geoid height (N) is the approximate difference between the ellipsoidal height and the orthometric height (i.e., N = h - H), typically ranging from about -110 m to 90 m across the Earth. It is an approximation because of the small angle between the ellipsoidal normal and the plumb line, along which h and H are respectively measured at a given location. This angle is called the deflection of the vertical and is usually small enough to be neglected. Therefore, the vertical datum is often defined by a geoid model (e.g., CGVD2013) or a tidal model (e.g., CGVD28) instead of the third axis of a pure Cartesian coordinate system. Nevertheless, it should be noted that the realization of a vertical datum is usually combined with a reference ellipsoid so that it is compatible with space-based positioning techniques.
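The relations among the ellipsoid parameters and the three heights can be written as a short Python sketch. The GRS 1980 semi-axes below are the published values as I remember them (the semi-minor axis is a derived, rounded quantity), while the sample ellipsoidal height and geoid undulation are hypothetical numbers chosen only to show the arithmetic.

```python
# GRS 1980 reference ellipsoid (metres); the semi-minor axis is a derived value.
GRS80_A = 6378137.0
GRS80_B = 6356752.314140

def flattening(a, b):
    """Flattening f = (a - b) / a of an ellipsoid with semi-axes a and b."""
    return (a - b) / a

def orthometric_height(h, n):
    """H = h - N, neglecting the deflection of the vertical."""
    return h - n

f = flattening(GRS80_A, GRS80_B)
print(1 / f)  # inverse flattening, close to 298.2572221

# Hypothetical GNSS observation: ellipsoidal height 50 m where the geoid
# lies 20 m below the ellipsoid (N = -20 m).
h, n = 50.0, -20.0
print(orthometric_height(h, n))  # 70.0
```

A negative N means the geoid is below the ellipsoid at that location, so the orthometric height ends up larger than the ellipsoidal height.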

Map projections and projected coordinate systems

A map projection is the transformation from geographic coordinates to the Cartesian coordinates of a planar surface, which allows simpler analysis and visualization. This process leads to map distortions in the form of area distortions, form distortions (namely angle and direction distortions), and distance distortions. Map projections can be classified according to different criteria. According to the shape of the projection surface, namely a plane, cone, or cylinder, a projection can be an azimuthal, conic, or cylindrical projection, respectively. According to the placement of the projection surface with respect to the globe, a projection can be a tangent, secant, or multisurface projection. Depending on the orientation of the projection surface relative to the Earth, a projection can be a normal, transverse, or oblique projection. Depending on which type of map distortion it avoids, a projection can be classified as an equivalent (equal-area), conformal, or equidistant projection, preserving area, form, and distance, respectively. Finally, based on the implementation method, a map projection can be a geometric, semi-geometric, or conventional projection. The selection of a projection should depend on the location and the shape of the area to be projected, as well as the application of the resulting map.
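As one concrete instance of such a transformation, the sketch below implements the forward equations of the spherical Mercator projection, a conformal cylindrical projection in normal aspect: x = R (lon - lon0) and y = R ln(tan(pi/4 + lat/2)). The sphere radius and the function name are assumptions for this example; production work would normally rely on a projection library and an ellipsoidal model.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean radius of a spherical Earth

def mercator(lat_deg, lon_deg, radius=EARTH_RADIUS_M, lon0_deg=0.0):
    """Forward spherical Mercator: geographic degrees to planar metres."""
    lam = math.radians(lon_deg - lon0_deg)
    phi = math.radians(lat_deg)
    x = radius * lam
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y

# Equal 30-degree steps in latitude map to growing steps in y,
# which is the distance distortion paid for preserving angles.
y30 = mercator(30, 0)[1]
y60 = mercator(60, 0)[1]
print(y30, y60 - y30)
```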

Astronomical coordinate system

An astronomical coordinate system is used to describe the positions of satellites, planets, stars, galaxies, and other celestial objects relative to physical reference points available to a situated observer. There are four basic systems: the equatorial coordinate system, the altazimuth (horizontal) coordinate system, the ecliptic coordinate system, and the galactic coordinate system. I will not dive into these systems.

Entity based model

An entity based model or object based model treats the space as populated by discrete identifiable entities, each with a geospatial reference. An entity should be identifiable, relevant (meaning to be of interest), and describable (meaning to have characteristics).

The vector mode is usually used for representing an entity based spatial model, and it has three basic geometry types, namely point, line, and polygon. Typically, a polyline is a finite set of line segments (i.e., edges) such that each edge endpoint is shared by exactly two edges, except for the two extreme points. A closed polyline has no extreme points, or equivalently its two extreme points are identical. A simple polyline has no pair of nonconsecutive edges intersecting at any place. A monotone polyline is one for which there is a line in the Euclidean plane such that the projection of the vertices onto the line preserves the ordering of the vertex list. A simple polygon is defined as the area enclosed by a simple closed polyline. A convex polygon has all its internal angles not greater than 180°; equivalently, the line segment joining any pair of points inside a convex polygon is fully included in the polygon. A monotone polygon is a simple polygon whose boundary can be split into exactly two monotone polylines. Every convex polygon is monotone, but not every monotone polygon is convex. A multi-polygon consists of more than one non-connected simple polygon. A polygon with an interior polygon representing empty space can be viewed as a polygon with a hole.
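The convexity condition can be checked with cross products: walking around the boundary of a simple polygon, every turn must go the same way. The sketch below assumes the input is already a simple polygon given as a list of (x, y) vertices without a repeated closing vertex; the function names are mine.

```python
def cross(o, a, b):
    """z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def is_convex(vertices):
    """True if all turns along the ring have the same orientation.

    Assumes a simple polygon; collinear triples (angle of 180 degrees) are allowed.
    """
    n = len(vertices)
    sign = 0
    for i in range(n):
        turn = cross(vertices[i], vertices[(i + 1) % n], vertices[(i + 2) % n])
        if turn != 0:
            if sign == 0:
                sign = 1 if turn > 0 else -1
            elif (turn > 0) != (sign > 0):
                return False
    return True

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(is_convex(square))   # True
print(is_convex(l_shape))  # False: the vertex (1, 1) has an internal angle of 270 degrees
```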

Spatial relationships between objects include distance, direction, and topology. Typically, spatial relationships can be represented by five data structures in an entity based model, namely the spaghetti model, the network model, the topological model or node-arc-area model, the doubly connected edge list or DCEL, and the object DCEL.

Spaghetti model

A spaghetti model can represent planar configurations of points, lines, and polygons by storing the geometry of each spatial object as a set of lists of straight-line segments. For example, a polygon can be represented by [1,5,10,6,8], where the elements in the list are vertex identifiers. The geometry of each spatial object is described independently of other objects, so mixed geometries can be represented in this model. The structure is attractive for its simplicity, and it is easy to input new objects. However, no topology between lines and polygons is stored in this model, and space usage is inefficient because of the redundant representation of shared boundaries.
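A minimal Python sketch of a spaghetti-style representation follows. The vertex identifiers, coordinates, and polygon names are invented for illustration. Each polygon is stored on its own as a list of vertex identifiers, so finding out that two polygons are neighbors requires comparing their edge lists; nothing in the structure records that relationship.

```python
from collections import Counter

# Hypothetical vertex table and two polygons stored independently.
vertices = {1: (0, 0), 2: (1, 0), 3: (1, 1), 4: (0, 1), 5: (2, 0), 6: (2, 1)}
polygons = {"A": [1, 2, 3, 4], "B": [2, 5, 6, 3]}

def ring_edges(ring):
    """Undirected edges of a closed ring given as a list of vertex identifiers."""
    return [frozenset((ring[i], ring[(i + 1) % len(ring)])) for i in range(len(ring))]

def shared_edges(polys):
    """Edges that appear in more than one polygon, found by brute-force comparison."""
    counts = Counter(edge for ring in polys.values() for edge in ring_edges(ring))
    return [tuple(sorted(edge)) for edge, count in counts.items() if count > 1]

print(shared_edges(polygons))  # [(2, 3)]: the common boundary is stored twice
```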

Network model

The network spatial model can store topological relationships among points and polylines. The network can be either planar or nonplanar. The arcs in the network are represented by ordered nodes. However, no information about the relations between 2D objects is stored in this model.

Topological model

The topological model, also known as the node-arc-area model, is a planar subdivision of the embedding space into adjacent polygons. Unlike in the network model, the network in the topological model can only be planar. In this model, a directed arc is represented by its start node and end node, and the left polygon and right polygon are also stored with each directed arc. A region in the topological model is represented by directed arcs. This model is efficient for the computation of topological queries and for consistent updates. However, some spatial objects may not have semantic meaning in a real-world application, and the complex structure makes operations such as the input of new objects inefficient, since they may require recomputing part of the graph.
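To show why topological queries are cheap in this model, here is a small Python sketch of an arc table for two adjacent squares. The node, arc, and polygon identifiers are hypothetical, and "OUT" stands for the unbounded outside region. Because each directed arc carries its left and right polygon, adjacency can be read straight from the table without touching any coordinates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    arc_id: str
    start_node: str
    end_node: str
    left_polygon: str
    right_polygon: str

# Squares A (west) and B (east) share the arc a1 running north from n2 to n3.
arcs = [
    Arc("a1", "n2", "n3", "A", "B"),
    Arc("a2", "n3", "n2", "A", "OUT"),  # remaining boundary of A, counterclockwise
    Arc("a3", "n2", "n3", "B", "OUT"),  # remaining boundary of B, counterclockwise
]

def boundary_arcs(arc_table, polygon):
    """Identifiers of the arcs that bound the given polygon."""
    return [a.arc_id for a in arc_table if polygon in (a.left_polygon, a.right_polygon)]

def adjacent_polygons(arc_table, polygon):
    """Polygons on the other side of any arc bounding the given polygon."""
    neighbors = set()
    for a in arc_table:
        if a.left_polygon == polygon:
            neighbors.add(a.right_polygon)
        elif a.right_polygon == polygon:
            neighbors.add(a.left_polygon)
    return neighbors

print(boundary_arcs(arcs, "A"))
print(adjacent_polygons(arcs, "A"))
```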

Doubly connected edge list

The doubly connected edge list or DCEL ignores the geometry details and only focuses on the topology of a connected planar graph by storing the sequence of arcs around a node and an area. Specifically, it does not store the composition of an area, the consisting nodes of arcs, and coordinates of nodes. For example, the DCEL relation stores the arc ID, begin node, end node, left area, right area, previous arc, and next arc.

Object DCEL

An object DCEL is used to describe the aggregations of strongly connected areal objects. It also ignores the details of geometry, and only stores the topology between the objects. For example, an object DCEL relation stores the arc ID, begin node, and next arc.

Additionally, the description of topology can follow the 4-intersection model or the 9-intersection model, which characterize spatial objects by their boundary and interior (in the 4-intersection model), or by their boundary, interior, and exterior (in the 9-intersection model), and describe topological relationships based on which combinations of these parts intersect. Operations on the entity based model include set-oriented operations, topological operations, and Euclidean operations. Set-oriented operations include equals, member of, is empty, subset of, disjoint from, intersection, union, and difference. Topological operations include boundary, interior, closure, meets, overlaps, is inside, covers, connected, components, extremes, is within, etc. Euclidean operations compute distance, bearing/angle, length, area, perimeter, and centroid. These operations are discussed in detail in the data manipulation section.
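The idea behind the 4-intersection model can be illustrated with a deliberately simplified one-dimensional analogue, where each object is a closed interval, its boundary is the pair of endpoints, and its interior is the open interval between them. This reduction to intervals is my own simplification for illustration; the model itself is normally applied to regions in the plane. The sketch reports, for each of the four boundary/interior combinations, whether the intersection is non-empty.

```python
def four_intersection(a, b):
    """Emptiness pattern of the four intersections for intervals a and b.

    Each interval is a pair (low, high) with low < high.
    """
    def in_interior(x, interval):
        return interval[0] < x < interval[1]

    return {
        "boundary-boundary": bool(set(a) & set(b)),
        "interior-interior": max(a[0], b[0]) < min(a[1], b[1]),
        "boundary-interior": any(in_interior(x, b) for x in a),
        "interior-boundary": any(in_interior(x, a) for x in b),
    }

print(four_intersection((0, 1), (2, 3)))  # disjoint: all four are empty
print(four_intersection((0, 2), (2, 4)))  # meet: only the boundaries intersect
print(four_intersection((0, 3), (2, 5)))  # overlap: interiors intersect
```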

Field based model

A field based model or space based model treats geographic information as collections of spatial distributions. In this model, each point in space is associated with one or more attribute values defined as functions of x and y. Fields can be classified by the properties of their attributes: nominal field, ordinal field, interval field, and ratio field. Nominal fields consist of simple labels like names of cities. Ordinal fields consist of ordered labels like developing countries and developed countries. Interval fields consist of quantities on a scale without a fixed origin, such as the Celsius or Fahrenheit temperature scales. Ratio fields consist of quantities on a scale with a fixed zero, such as Kelvin temperature, height, and weight. Ratio attributes support all arithmetical operations including addition, subtraction, multiplication, and division, whereas for interval attributes only differences are meaningful and ratios are not. Fields can also be classified by their mathematical properties, such as continuous, differentiable, and discrete. Every differentiable field must also be continuous, while not every continuous field is differentiable. Based on direction dependence, a field can be an isotropic field or an anisotropic field. Additionally, spatial autocorrelation and spatial heterogeneity are important properties of spatial phenomena that can be displayed in a field based model. According to the first law of geography, everything is related to everything else, but near things are more related than distant things. This describes the essence of spatial autocorrelation. Spatial heterogeneity describes a patchy landscape, and spatial dependence refers to the local non-independence of occurrences that are near each other. Spatial dependence is the process that creates clusters of occurrences, which is the reason for spatial heterogeneity.

The tessellation mode is usually used for representing a field based spatial model, and it decomposes the embedding space into discrete cells. The decomposition can lead to a regular tessellation by using uniform cells or an irregular tessellation by using irregular cells. In the tessellation mode, a point can be represented by a single cell, an arc by a sequence of neighboring cells, and a connected area by a collection of contiguous cells. A fixed representation model tessellating a plane uses a regular grid or raster, which is a collection of polygonal units of equal size. The regular cells used in partitioning the plane can be squares, hexagons, or triangles. A nested tessellation scheme that decomposes a sphere is the core of a discrete global grid system (DGGS). Operations on this tessellation mode of representation can be local, focal, and zonal, as proposed by Tomlin. This topic is discussed in the data manipulation section.

Typical irregular tessellations of an embedding plane include polygon triangulation, partitioning along the medial axis of a polygon, polygon trapezoidalization, and convex partitioning. The Delaunay triangulation, whose dual is the Thiessen (Voronoi) polygon diagram, is widely used in the generation of TINs. A constrained Delaunay triangulation is constrained to follow a given set of edges. A greedy triangulation aims to produce the minimum total edge length in the triangulation. The medial axis of a polygon, also called the skeleton of the polygon, is the Voronoi diagram computed for the line segments composing the polygon boundary. The largest circle inscribed in the polygon and centered at any point on the skeleton touches the polygon boundary at two or more points. Polygon trapezoidalization partitions a simple polygon into trapezoids, where a trapezoid is a quadrilateral with at least two parallel edges. The convex partitioning of a simple polygon aims to minimize the number of elements in the partition under the constraint that every element is convex.
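The local, focal, and zonal operations mentioned above can be sketched on a tiny raster stored as nested Python lists. The grid values, zone labels, and function names are made up for this example, and the focal operation assumes a 3x3 neighborhood clipped at the grid edges.

```python
from collections import defaultdict

raster = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
zones = [["a", "a", "b"],
         ["a", "b", "b"],
         ["c", "c", "c"]]

def local_op(grid, func):
    """Local: each output cell depends only on the same input cell."""
    return [[func(value) for value in row] for row in grid]

def focal_mean(grid):
    """Focal: each output cell is the mean of its 3x3 neighborhood."""
    rows, cols = len(grid), len(grid[0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [grid[i][j]
                      for i in range(max(0, r - 1), min(rows, r + 2))
                      for j in range(max(0, c - 1), min(cols, c + 2))]
            out_row.append(sum(window) / len(window))
        result.append(out_row)
    return result

def zonal_sum(grid, zone_grid):
    """Zonal: aggregate the cell values that fall inside each zone."""
    totals = defaultdict(int)
    for value_row, zone_row in zip(grid, zone_grid):
        for value, zone in zip(value_row, zone_row):
            totals[zone] += value
    return dict(totals)

print(local_op(raster, lambda v: v * 10))
print(focal_mean(raster))
print(zonal_sum(raster, zones))
```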

Data repository

The following spatial database architecture or techniques are introduced: distributed spatial database, spatial data infrastructure, data warehouse, and data mart.

Distributed spatial database

Spatial data are frequently collected and maintained by individual organizations and governments. Thus, it is necessary to connect multiple sites through a computer network. A distributed DBMS or DDBMS is needed to manage a distributed database, and it can be a homogeneous DDBMS or a heterogeneous DDBMS depending on whether or not the DBMS software and data models are the same among the databases. For a heterogeneous DDBMS using different DBMS software and different data models, a gateway interface is used to provide unified access. Two design techniques for a relational distributed database are fragmentation and replication. Fragmentation means storing subsets of a relation in different database units, where subsets of tuples form a horizontal fragmentation and subsets of attributes form a vertical fragmentation. Replication means storing duplicated fragments in different database units to improve reliability and performance. A distributed database has the advantages of decentralization, reliability, better performance for local users, and modularity. It also faces challenges such as the complexity of design, implementation, development, and maintenance, security concerns, and consistency constraints across multiple databases.
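Horizontal and vertical fragmentation can be illustrated with a toy relation held in memory. The parcel table, attribute names, and helper functions below are hypothetical; a real DDBMS would handle fragment placement and reconstruction itself. The point of the sketch is that a union rebuilds a horizontally fragmented relation, while a join on the key rebuilds a vertically fragmented one.

```python
# Hypothetical relation: one dictionary per tuple, "id" is the key.
parcels = [
    {"id": 1, "region": "east", "owner": "Ana", "area_ha": 12.5},
    {"id": 2, "region": "west", "owner": "Ben", "area_ha": 8.0},
    {"id": 3, "region": "east", "owner": "Cy", "area_ha": 20.0},
]

def horizontal_fragment(relation, predicate):
    """Subset of tuples: every attribute is kept, only some rows."""
    return [dict(row) for row in relation if predicate(row)]

def vertical_fragment(relation, attributes, key="id"):
    """Subset of attributes: every row is kept, the key is always included."""
    return [{name: row[name] for name in (key, *attributes)} for row in relation]

def join_on_key(left, right, key="id"):
    """Rebuild full tuples from two vertical fragments."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]

east = horizontal_fragment(parcels, lambda row: row["region"] == "east")
west = horizontal_fragment(parcels, lambda row: row["region"] == "west")
ownership = vertical_fragment(parcels, ["owner"])
geography = vertical_fragment(parcels, ["region", "area_ha"])

print(sorted(east + west, key=lambda row: row["id"]) == parcels)  # True
print(join_on_key(ownership, geography) == parcels)               # True
```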

Spatial data infrastructure

A spatial data infrastructure is a collection of technologies, policies, and operational frameworks that promote the availability of, and facilitate access to, geospatial data. The goal is to ensure the economy of resources in data collection, the availability of updated data to users, and the interoperability of individual applications. Necessary components of a spatial data infrastructure include geographic data, metadata, data catalogs, visualization tools, and services and software. The architecture of a spatial data infrastructure has geographic data and metadata at the lower level, search and retrieval services at the middle level, and user and application interfaces at the upper level. To facilitate interoperability and efficient access to and retrieval of data, the Open Geospatial Consortium has provided standard ways to deliver and map geographic data and processes over the web. These include a series of standard services and specifications such as web map service (WMS), web feature service (WFS), web coverage service (WCS), geography markup language (GML), keyhole markup language (KML), etc.

Data warehouse

A data warehouse or DW is a data repository used for reporting on and analyzing vast collections of data in an organization. A data warehouse is enterprise-oriented, integrated, non-volatile, and read-only; it draws on heterogeneous sources, holds multiple levels of detail, and supports decision making. Data in a data warehouse are organized in a multidimensional data structure, called the multidimensional paradigm, which facilitates navigation within the database across different levels of granularity. This can be viewed as a data cube that has facts, dimensions, and measures. Data structures like the star schema, snowflake schema, and fact constellation can be used to model the multidimensional paradigm. A spatial data warehouse integrates spatial data along with non-spatial data, where the spatial granularity can be measured in a non-geometric spatial dimension, a geometric-to-non-geometric spatial dimension, or a fully geometric spatial dimension.

Data mart

A data mart is a specialized, subject-oriented, highly aggregated mini data warehouse dealing with a coarser granularity. A data mart is built from a subset of the data warehouse, and one enterprise data warehouse can have multiple data marts. OnLine Analytical Processing or OLAP is a popular client of the data warehouse or data mart to support decision making. Common functions of OLAP include drill-down, drill-up, drill-across, filtering, slicing, dicing, pivoting, and outlier detection. An OLAP server can structure data by a relational approach (i.e., ROLAP server), a multidimensional approach (i.e., MOLAP server), or a hybrid approach (i.e., HOLAP server).
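Two of these OLAP functions, drill-up (roll-up) and slicing, can be mimicked on a tiny fact table in plain Python. The facts, dimension names, and measure below are invented for illustration, and a real OLAP server would precompute and index such aggregates rather than scan the rows.

```python
from collections import defaultdict

# Hypothetical fact table: dimensions year, province, land_use; measure area_km2.
facts = [
    {"year": 2019, "province": "NB", "land_use": "forest", "area_km2": 60.0},
    {"year": 2019, "province": "NB", "land_use": "urban", "area_km2": 5.0},
    {"year": 2020, "province": "NB", "land_use": "forest", "area_km2": 58.0},
    {"year": 2020, "province": "AB", "land_use": "urban", "area_km2": 9.0},
]

def roll_up(rows, dimensions, measure="area_km2"):
    """Aggregate the measure over every dimension not listed in `dimensions`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(totals)

def slice_cube(rows, dimension, value):
    """Fix one dimension to a single value, keeping the other dimensions."""
    return [row for row in rows if row[dimension] == value]

print(roll_up(facts, ["year"]))
print(roll_up(slice_cube(facts, "land_use", "forest"), ["province"]))
```

Drill-down is the reverse of roll_up: listing more dimensions in the call returns the measure at a finer granularity.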

Data quality

Data quality largely depends on the uncertainty of the spatial modeling process. The uncertainty includes uncertain specifications, uncertain measurements, and uncertain transformations. An uncertain specification means vagueness of definition, for example, the boundary of a lake in the real world. Uncertain measurements include a lack of correlation between observations and reality, namely inaccuracy, and a lack of detail in an observation, namely imprecision. An uncertain transformation occurs when one needs to convert raw measurements into the specification of the properties of interest. Uncertainty arises from imperfection in the tools or methods of representing, observing, measuring, and making inferences about the world. Fuzzy set theory is a common approach to locational uncertainty in GIS, where a fuzzy membership function grades the level of belief that an element belongs to the set. Based on fuzzy set theory, a series of operations can be applied, such as the fuzzy overlay operation, fuzzy distance operation, fuzzy select operation, and more.
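As a minimal sketch of the fuzzy approach, the Python code below grades how strongly a location belongs to a lake from its water depth and combines memberships with the usual minimum, maximum, and complement operators. The linear ramp, its thresholds, and the function names are assumptions made for this example; other membership functions are equally possible.

```python
def lake_membership(depth_m, dry=0.0, wet=0.5):
    """Degree of belonging to the lake: 0 at or below `dry`, 1 at or above `wet`."""
    if depth_m <= dry:
        return 0.0
    if depth_m >= wet:
        return 1.0
    return (depth_m - dry) / (wet - dry)

def fuzzy_and(a, b):
    """Fuzzy intersection (overlay) using the minimum operator."""
    return min(a, b)

def fuzzy_or(a, b):
    """Fuzzy union using the maximum operator."""
    return max(a, b)

def fuzzy_not(a):
    """Fuzzy complement."""
    return 1.0 - a

shore = lake_membership(0.25)   # a location in the vague boundary zone
wetland = 0.8                   # hypothetical membership in a second fuzzy set
print(shore)                      # 0.5
print(fuzzy_and(shore, wetland))  # 0.5
print(fuzzy_or(shore, wetland))   # 0.8
print(fuzzy_not(shore))           # 0.5
```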

Spatial data format

Standard spatial data formats as well as standard exchange formats are needed to promote interoperability. De facto standards include the drawing interchange format or DXF, digital geographic information exchange standard or DIGEST, topologically integrated geographic encoding and referencing or TIGER, national transfer format or NTF, spatial data transfer standard or SDTS, etc. Standard formats for raster representation include GIF, TIFF, JPEG, etc. Standard formats for vector representation include TIGER, ESRI shapefile, ESRI TIN, etc. Standard exchange formats are needed in the request and response process, including the Open Geospatial Datastore Interface or OGDI, XML, KML, GML, etc. Organizations like the Open Geospatial Consortium (OGC) and ISO Technical Committee 211 (ISO/TC 211) work on global standardization issues regarding GIS.

Main references

Rigaux et al., 2002. Spatial databases: with application to GIS, San Francisco: Morgan Kaufmann Publishers.

Stefanakis, E., 2014. Geographic Databases and Information Systems, North Charleston, SC: CreateSpace Independent Publishing Platform.

Stefanakis, E., 2015. Web Mapping and Geospatial Web Services: An Introduction, North Charleston, SC: CreateSpace Independent Publishing Platform.

Tomlin, C.D., 1990. Geographic information systems and cartographic modeling, Englewood Cliffs, N.J.: Prentice-Hall.

Worboys, M. and Duckham, M., 2004. GIS: a computing perspective, Boca Raton: CRC Press.

Supplement course notes

UNB GGE 4423 Advanced GIS taught by Dr. Emmanuel Stefanakis

UofC ENGO 645 Spatial Databases and Data Mining taught by Dr. Emmanuel Stefanakis