Computational Geography emerged in the 1980s in response to the reductionist limitations of early GIS software, which inhibited deep analyses of rich geographic data. Today, Computational Geography continues to integrate a wide range of domains to facilitate spatial analyses that require computational resources or ontological paradigms beyond those made available in traditional GIS software packages. These include novel approaches for the mass creation of geospatial data, large-scale database design for the effective storage and querying of spatial identifiers (i.e., distributed spatial databases), and methodologies which enable simulations and/or analysis in the context of large-scale, frequently near-real-time, spatially explicit sources of information. The topics studied within Computational Geography directly enable many of the world's largest public databases, including Google Maps and OpenStreetMap (OSM), as well as many modern analytic pipelines designed to study human behavior through the integration of large volumes of location information (e.g., mobile phone data) with other geospatial sources (e.g., satellite imagery).
Runfola, D. (2022). Computational Geography. The Geographic Information Science & Technology Body of Knowledge (1st Quarter 2022 Edition), John P. Wilson (Ed.). DOI: 10.22224/gistbok/2022.1.7.
This entry was first published on March 7, 2022. No earlier editions exist.
The term “Computational Geography” was coined by David Mark and colleagues at the National Center for Geographic Information and Analysis in the 1980s, and formalized by Stan Openshaw with the founding of The Centre for Computational Geography at the University of Leeds (Batty, 2020; for a broader review on the history of quantitative analysis in geography, see Mark, 2003). Often used interchangeably with the term “GeoComputation,” computational geography emerged as a movement in response to reductionist decisions made to simplify spatial data creation, retention, and analysis during the emergence of Geographic Information Systems (GIS) through the 1980s (Gahegan & GeoComputation International Steering Group, 2001). While many of these early limitations in GIS capabilities have been overcome over the last four decades, Computational Geographers continue to engage with challenges that are technically or ontologically infeasible to implement in contemporary GIS software platforms. Today, the overlap between computational geography and GeoComputation remains substantial, with the core distinction between the groups being an emphasis on computer engineering and optimization at a database or system level within the computational geography community (e.g., building general-purpose distributed spatial databases (Hughes et al., 2015), primitive operators in parallel environments (Shook et al., 2016), or new strategies for distributed processing (Worboys & Duckham, 2006)), as contrasted to a more frequent focus on model-specific optimizations in the GeoComputation community (e.g., deriving novel ways to distribute agent-based models).
Figure 1. Diagram illustrating topics comprising the field of Computational Geography, circa 2021. Source: author.
The topics that Computational Geography has engaged with have been redefined over time, focusing on discovery-oriented research that augments or improves contemporary, tradecraft-driven GIS capabilities. Alongside this evolution, the disciplinary contributors to Computational Geography have also shifted, growing to incorporate not only geographers with strong computational skills, but also information and computer scientists who engage with spatial data sources. This interdisciplinary shift is reflected in academia today, with the two programs offering Ph.D. degrees in Computational Geography both sitting outside of Geography departments (Texas A&M, 2021; William & Mary, 2021).
The rapid increase in available information has resulted in both practical and theoretical challenges to traditional GIS-based modalities of inquiry, ranging from the inability of modern desktop GIS software to handle large datasets in non-distributed environments, to ontological and technical discussions around how spatial datasets can be more helpfully represented than allowed for in common GIS frameworks (Bostock et al., 2017). Many tools and techniques derived by computational geographers have, in response, begun to provide solutions to these challenges. These contributions have been at the nexus of three topical areas: (1) spatial data representation and storage, (2) spatial analytics, statistics, and GeoAI, and (3) computational optimization. Recognizing that each of these topical areas warrants an individual article, here we focus specifically on contributions at the intersections of these topical areas (see Figure 1): spatial data indexing and retrieval, distributed spatial data integration and querying, and spatially distributed analytics.
2. Optimizing Spatial Data Indexing and Retrieval in a Computational Geography Framework
As the volume and nature of spatial information have grown and changed, so too have the technical capabilities required to record and query that data. Within the GIS community, the development of Spatial Database Management Systems (SDBMS) enabled new data models, storage, and querying techniques based on geographic information, providing new capabilities for analysis in both GIS desktop and programmatic environments. In parallel with these efforts, the computer science and computational geography communities sought to implement spatial querying functionality in traditional, aspatial database environments. These efforts, inclusive of software engineering (e.g., Hughes et al., 2015) and the development of new referencing systems (Tsui, 1997), have today resulted in a wide range of solutions for optimizing spatial data indexing and retrieval, depending on the researcher's task.
The addition of spatial query functionality to relational databases was one of the earliest developments, largely due to the prevalence of relational databases in enterprise data storage systems. With each data entity represented as a row, a range of techniques was derived to add a geometry column of information, as well as to implement algorithms that select rows based on the values contained within this geometry column. These techniques continue to be advanced to this day, with recent research focused on the parallelization or distribution of spatial queries (Giannousis et al., 2019; Ilba, 2021) and optimal spatial indexing strategies in relational databases (Chaves Carniel et al., 2018; Schön et al., 2013).
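As a minimal sketch of the relational pattern described above, the following pure-Python example stores each entity's bounding box in dedicated geometry columns of an in-memory SQLite table and selects rows by rectangle intersection. The table, column names, and B-tree index are illustrative assumptions, not the mechanics of a production spatial database such as PostGIS.

```python
import sqlite3

# In-memory database standing in for an enterprise relational store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE places (name TEXT, minx REAL, miny REAL, maxx REAL, maxy REAL)"
)
# A plain B-tree index over the bounding-box columns serves as a crude
# stand-in for a true spatial index (e.g., an R-tree).
conn.execute("CREATE INDEX bbox_idx ON places (minx, maxx, miny, maxy)")

conn.executemany(
    "INSERT INTO places VALUES (?, ?, ?, ?, ?)",
    [
        ("park", 0.0, 0.0, 2.0, 2.0),
        ("lake", 5.0, 5.0, 8.0, 7.0),
        ("trail", 1.0, 1.0, 6.0, 3.0),
    ],
)

def query_bbox(minx, miny, maxx, maxy):
    """Return names of rows whose bounding box intersects the query window."""
    rows = conn.execute(
        "SELECT name FROM places "
        "WHERE minx <= ? AND maxx >= ? AND miny <= ? AND maxy >= ?",
        (maxx, minx, maxy, miny),
    )
    return [r[0] for r in rows]

print(query_bbox(1.5, 0.5, 3.0, 1.5))  # park and trail intersect this window
```

Real systems replace the rectangle-only filter with a two-stage query: a fast bounding-box index scan followed by an exact geometry test on the candidates.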
Distinct from relational databases, document-oriented approaches sought to enable a more flexible data acquisition strategy, in which an explicit schema does not need to be defined a priori. This allows, for example, for the addition of new types of data collection for some units within the database, alongside a range of other benefits. Novel algorithms to enable spatial indexing and searching within document store databases were developed alongside many of the databases themselves (e.g., MongoDB's 2dsphere index), with mixed results depending on the computational task of interest (Bartoszewski et al., 2019; Makris et al., 2021). Research into improving the efficiency of spatial querying in document-oriented databases continues today (Makris et al., 2021; Xiang et al., 2016; Yan et al., 2016; Zhu & Gong, 2014).
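The schema-less pattern can be illustrated with a toy document store in pure Python: records are free-form dictionaries (note that only one carries a `wifi` field), and a proximity query emulates the behavior of a 2dsphere-style `$near` search. The records and the `near` helper are hypothetical, not MongoDB's actual API.

```python
from math import radians, sin, cos, asin, sqrt

# Toy "document store": schema-less records, some with fields others lack.
docs = [
    {"_id": 1, "loc": (52.52, 13.40), "name": "station"},            # Berlin
    {"_id": 2, "loc": (48.86, 2.35), "name": "cafe", "wifi": True},  # Paris
    {"_id": 3, "loc": (51.51, -0.13), "name": "museum"},             # London
]

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) pairs, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def near(point, max_km):
    """Emulate a 2dsphere-style $near query: documents within max_km, closest first."""
    hits = [(haversine_km(point, d["loc"]), d) for d in docs]
    return [d for dist, d in sorted(hits, key=lambda x: x[0]) if dist <= max_km]

# All documents within 500 km of London, nearest first:
print([d["name"] for d in near((51.51, -0.13), 500)])
```

A production document store would back this scan with a geohash- or S2-based index rather than computing every distance, but the query semantics are the same.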
In addition to relational and document-oriented database approaches, key-value stores have recently been a focus of research. The advantage of key-value stores lies in their underlying, column-based structure, which is highly efficient for the processing of sparse datasets. Further, most key-value stores are designed to operate efficiently in highly distributed environments. The majority of research into this topic focuses on implementations of spatial querying on top of the popular HBase and Accumulo databases, with a range of algorithms supporting spatial querying and indexing in this paradigm being derived. The most popular of these is the GeoMesa project (Hughes et al., 2015).
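A core trick behind spatial indexing in sorted key-value stores is encoding two-dimensional coordinates into a single sortable key via a space-filling curve, so that nearby points tend to land near one another in key order. The following is a minimal sketch of one such encoding, a Z-order (Morton) code; the bit width and linear discretization are simplifying assumptions, not GeoMesa's actual key scheme.

```python
def z_order_key(lon, lat, bits=16):
    """Interleave the bits of discretized lon/lat into one sortable integer
    (a Z-order / Morton code), so that nearby points tend to share key
    prefixes when rows are stored in sorted key order."""
    # Discretize each coordinate into the range [0, 2**bits).
    x = int((lon + 180.0) / 360.0 * (2 ** bits - 1))
    y = int((lat + 90.0) / 180.0 * (2 ** bits - 1))
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions: longitude
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions: latitude
    return key

# Nearby points produce numerically close keys; distant ones do not.
berlin = z_order_key(13.40, 52.52)
potsdam = z_order_key(13.06, 52.40)
sydney = z_order_key(151.21, -33.87)
print(abs(berlin - potsdam) < abs(berlin - sydney))  # True
```

A range scan over contiguous key prefixes then approximates a spatial window query, which is what makes this encoding a natural fit for stores like HBase and Accumulo that only support sorted-key access.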
For researchers interested in non-traditional modalities for spatial analysis, a small number of graph-based databases have been created in which spatial, social, and other relationships can be conceptualized as networks. Most common of these is Neo4j, which enables unstructured entity characterization, with entities being members of one or a series of inter-related networks with defined nodes and edges (Webber, 2012). Neo4j-Spatial specifically enables querying across a network defined by geography, including common operations such as searching within specified regions or distances from a point (Taverner, 2012). Other graph-based databases with spatial functionality include RedisGraph and AllegroGraph.
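To make the graph-based modality concrete, the sketch below combines a spatial predicate (nodes within a radius) with a graph traversal (one relationship hop) over a toy property graph held in plain Python structures. The node identifiers, `ROAD` relationship, and helper functions are hypothetical illustrations, not Neo4j's query language.

```python
from math import dist

# Toy property graph: nodes carry coordinates, edges carry a relationship type.
nodes = {
    "a": {"xy": (0.0, 0.0)},
    "b": {"xy": (1.0, 0.5)},
    "c": {"xy": (5.0, 5.0)},
}
edges = [("a", "b", "ROAD"), ("b", "c", "ROAD")]

def within_distance(point, radius):
    """Spatial predicate over graph nodes: ids within radius of point."""
    return {n for n, p in nodes.items() if dist(p["xy"], point) <= radius}

def neighbors(node_id, rel):
    """Graph traversal: ids connected to node_id by a rel-typed edge."""
    return {b for a, b, r in edges if a == node_id and r == rel} | \
           {a for a, b, r in edges if b == node_id and r == rel}

# Combine the two: nodes reachable by one ROAD hop from any node
# lying within 2 units of the origin.
start = within_distance((0.0, 0.0), 2.0)          # {"a", "b"}
reachable = set().union(*(neighbors(n, "ROAD") for n in start))
print(sorted(reachable))  # ['a', 'b', 'c']
```

The same composition, spatial filter feeding a network traversal, is what systems like Neo4j-Spatial expose declaratively.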
A wide range of more nascent frameworks for distributed spatial databases have recently been proposed by the community, built on or combining the technologies noted above. These include SpatialSpark, GeoSpark, Simba, LocationSpark, SparkGIS, TrajSpark, DITA, Gragoon, and likely many more. A full survey of these tools, and their contributions to this growing suite of techniques, can be found in Alam et al., 2021.
3. Primitive Geospatial Operators in Partitioned and Parallelized Environments
The challenge of representing spatial data continues to evolve, with different conceptions promoting markedly different definitions of spatial boundaries (e.g., fuzzy boundaries), geographic relationships (e.g., network representations), and geographic attributes (e.g., fuzzy class membership vs. discrete; see Goodchild et al., 1998). As the volume of geographic data has grown, accessing information that may be collected and stored across a broad range of modalities has become a core challenge. The most common solutions today involve the distribution of primitive geospatial operations across multiple computer nodes, aggregating, integrating, or selecting data that may be of use for a particular analysis.
An intuitive example of this challenge can be explored with a traditional “Zonal Statistics” operation, in which a grid of values (e.g., temperature measured every 500 meters over a surface) is aggregated to a coarser geographic region (e.g., a country). This is frequently a necessary first step in an analysis which seeks to understand the relationship(s) between one geospatial dataset (e.g., precipitation) and another (e.g., country boundaries). In cases with millions or billions of points, the time costs of averaging individual values across an entire country on an individual computer can be untenable, stretching into years, decades, or more. As an alternative, multiple computers can load partitions of the data, aggregate the values for a set of points, and then relay their findings to a centralized node to produce the final calculation (Goodman et al., 2019; J. Zhang et al., 2014). To support the wide range of different geospatial operations, implementations of primitive geospatial operations in parallel environments have included point-in-region searches (Kondor et al., 2014; Priya & Kalpana, 2018; Tarmur & Ozturan, 2019), spatial joins (Aghajarian & Prasad, 2017; You et al., 2015), clips (Puri & Prasad, 2015), and a wide range of techniques, frequently based on the R-tree algorithm, designed to automatically partition spatial data for distribution across nodes for general application (Roumelis et al., 2017, 2019; You et al., 2013; J. Zhang & You, 2013). These techniques provide the “glue” between database architectures and analyses (e.g., machine learning or simulation models) a researcher may ultimately seek to perform.
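The partition-aggregate-merge pattern behind a distributed zonal statistics operation can be sketched in pure Python. The synthetic grid, zone labels, and thread pool (standing in for separate compute nodes) are all illustrative assumptions; the key point is that each partition reduces to compact (sum, count) pairs before anything is sent to the central node.

```python
from concurrent.futures import ThreadPoolExecutor
import random

# Synthetic grid: (zone_id, value) measurements, standing in for millions
# of temperature cells tagged with the region they fall inside.
random.seed(0)
cells = [(random.choice(["A", "B"]), random.uniform(10.0, 30.0))
         for _ in range(100_000)]

def partial_stats(partition):
    """Per-node work: reduce one partition to (sum, count) per zone."""
    acc = {}
    for zone, value in partition:
        s, n = acc.get(zone, (0.0, 0))
        acc[zone] = (s + value, n + 1)
    return acc

def merge(partials):
    """Central node: combine partial (sum, count) pairs into zonal means."""
    total = {}
    for acc in partials:
        for zone, (s, n) in acc.items():
            ts, tn = total.get(zone, (0.0, 0))
            total[zone] = (ts + s, tn + n)
    return {zone: s / n for zone, (s, n) in total.items()}

# Split the grid into four partitions and reduce them concurrently;
# in a real deployment each partition would live on a separate node.
partitions = [cells[i::4] for i in range(4)]
with ThreadPoolExecutor() as pool:
    means = merge(pool.map(partial_stats, partitions))
print(means)  # zonal means near the middle of the 10-30 value range
```

Because means decompose into sums and counts, the merge step is exact; statistics that do not decompose this way (e.g., medians) require more elaborate distributed algorithms.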
4. Spatially Distributed Analyses
The demand for spatial data analysis is growing from a range of industries, but the computational costs (i.e., time required to process on a computer core) of geometry-aware algorithms have inhibited this growth. One area of intensive inquiry to overcome this bottleneck has been improving our capability to distribute spatial analytic models across a range of processors or computational nodes (i.e., spatial parallel or distributed computing). The models being distributed can be highly variable in nature, ranging from agent-based simulations to convolutional neural networks (e.g., Brewer et al., 2021; Goodman et al., 2020) or other machine learning approaches. Two approaches have been taken: specialized solutions specific to a single typology of analytic model, and generalized approaches that seek to distribute data across any arbitrary target model.
Model-specific distribution approaches have emerged in a number of subfields. One of the most prominent of these has been agent-based simulation and modeling, with packages such as OpenABL (Cosenza et al., 2018) providing easily accessible, distributed approaches for a range of primitive functions required by the agent-based modeling community. OpenABL built heavily on a long lineage of ABM-specific distributed systems, including the well-known REPAST, REPAST-HPC, Mason, D-Mason, Flame, and more (see Cosenza et al., 2018, for a full review of the evolution of these and related efforts).
The remote sensing community has been highly active in the development of distributed analytic models, predominantly for large-scale classification of satellite imagery as a part of near-real-time algorithms (Hawick & James, 1997), and for on-demand processing of archival imagery (Yang et al., 2005; M. Zhang et al., 2015). Building on these approaches, recent research has begun to focus on the analysis of satellite imagery in distributed environments using deep learning models, i.e., distributing computation across infrastructure such as GPUs specialized for deep learning algorithms (Li & Choi, 2021; Sedona et al., 2019).
Augmenting these specialized approaches, general frameworks to enable distributed spatial model analyses have included GeoBeam (He et al., 2019), Niharika (Ray et al., 2013), and a number of novel implementations of R-tree indexing to promote concurrent spatial operations (Dai, 2009). More broadly, distributed frameworks such as Dask and Spark have also been used as flexible engines to assist in the distribution of spatial models across arbitrary numbers of nodes (e.g., Erlacher et al., 2021).
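The generalized pattern these engines automate, treating the analytic model as a black box mapped over spatial partitions and then combined, can be sketched in a few lines of pure Python. The `distribute` helper, the toy point-counting "model," and the use of a thread pool in place of a cluster scheduler are all illustrative assumptions, not the API of Dask or Spark.

```python
from concurrent.futures import ThreadPoolExecutor

def distribute(model, partitions, combine):
    """Run an arbitrary per-partition model concurrently, then combine the
    partial results. The model is a black box to the scheduler, which is
    what makes the pattern general rather than model-specific."""
    with ThreadPoolExecutor() as pool:
        return combine(list(pool.map(model, partitions)))

# Example "model": count points falling inside the unit square, per partition.
points = [(0.2, 0.3), (1.5, 0.1), (0.9, 0.9), (2.0, 2.0), (0.1, 0.7)]
partitions = [points[0::2], points[1::2]]

in_unit_square = lambda pts: sum(0 <= x <= 1 and 0 <= y <= 1 for x, y in pts)
total = distribute(in_unit_square, partitions, sum)
print(total)  # 3 points fall inside the unit square
```

Swapping in a different `model` function (e.g., a per-tile classifier) requires no change to the distribution logic, which is precisely the flexibility the generalized frameworks trade against model-specific optimization.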
5. Tools & Techniques Common to Computational Geography
Today, study in Computational Geography requires a depth of understanding of topics which include:
Outside of the small selection of computational geography programs today, most scholars in the field acquire this diverse range of skills through interdisciplinary collaboration, or through multiple degrees in interrelated fields (e.g., computer science or data science, and geography).