Chapter 9 - GIS data collection

This chapter is an overview of the problem of collecting data, which is becoming less expensive in GIS. Nevertheless, you need to be clear about the three important forms of data collection: primary capture, secondary capture, and transfer from external sources.

9.1 Introduction

The typology of data sources in Table 9.1 is fairly simple: data you collect yourself or data you get from other sources. But a very important source of environmental data is the simple data matrix (often called a "flat file"), such as this extract of rainfall for 30 California cities:

ID  NAME       X    Y    RAINFALL
1   Eureka     254  650  39.6
2   Red Bluff  362  602  233.3
3   Thermal    726  185  18.2
.
.
.
30  Colusa     350  538  15.9

There are very easy tools for incorporating these data into most GIS - but you can also do this with a spreadsheet and plot the data in 2D or 3D. As we shall see, however, you need some way of relating the data to (x, y) locations.
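As a rough illustration of relating a flat file to (x, y) locations, here is a minimal Python sketch that reads the three rows shown above into records. The function name `parse_flat_file` and the field names are my own choices, not part of any GIS package, and the sketch assumes the layout is always ID, name, X, Y, RAINFALL separated by whitespace.

```python
# Rows copied from the extract above; the elided rows are not reproduced.
FLAT_FILE = """\
ID NAME X Y RAINFALL
1 Eureka 254 650 39.6
2 Red Bluff 362 602 233.3
3 Thermal 726 185 18.2
"""


def parse_flat_file(text):
    """Return a list of dicts, one per data row (hypothetical helper)."""
    lines = text.strip().splitlines()
    records = []
    for line in lines[1:]:  # skip the header row
        parts = line.split()
        # The ID is the first token and X, Y, RAINFALL are the last three;
        # whatever lies between is the (possibly multi-word) city name.
        records.append({
            "id": int(parts[0]),
            "name": " ".join(parts[1:-3]),
            "x": float(parts[-3]),
            "y": float(parts[-2]),
            "rainfall": float(parts[-1]),
        })
    return records


records = parse_flat_file(FLAT_FILE)
for r in records:
    print(f'{r["name"]:<10} x={r["x"]:.0f} y={r["y"]:.0f} rainfall={r["rainfall"]}')
```

Once each row carries its own x and y, the records can be handed to a spreadsheet chart or a GIS as point features. Note that names such as "Red Bluff" contain a space, which is why the sketch counts columns from both ends rather than splitting blindly.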

One aspect of GIS that is steadily improving is the ready availability of data, either in raw form or spatially referenced, even as ArcGIS shapefiles. In September 2008 we were able to download Arc data for Hurricane Hanna from NOAA. Hence data capture costs for many purposes are approaching zero, although you may be benefiting from quite expensive efforts on the part of others.

Figure 9.1 is worth studying as a guide to your own research, although you probably won't be digitizing. But GIS data in all their many forms can quickly become bewildering if you don't organize them and delete what you don't want.

9.2 Primary geographic data capture

Most of us will be dealing with vector data, but rasters are the most common primary data source because there are so many sensors, as illustrated in Figure 9.2, which may be confusing. Think of the plot as a generally L-shaped "cloud" of sensors: large-pixel sensors at the bottom-right capturing data frequently (think weather), and small-pixel sensors at the top-left being used for special local purposes (think a plane taking detailed photos of wetland vegetation). What's missing from this already-full plot is spatial extent: the area being viewed.

A simple formula expresses this in general terms:
DATA VOLUME = (SPATIAL EXTENT / SPATIAL RESOLUTION)^D
where D is the dimension of analysis (D = 0 for a point sample, 1 for sampling along a line, 2 for a square region, etc.). The idea is that data capture views a region of a certain extent (max - min coordinates) and divides it into cells of a given resolution. The bigger the extent and the smaller the resolution, the more data. You can extend this idea into the temporal realm as well, but at least time has only one dimension! Data volume grows at a very rapid rate; a substantial amount of the floor space of the sprawling USGS EROS Data Center is devoted to storage.
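To make the formula above concrete, here is a small Python sketch that evaluates it for each dimension D. The function name `data_volume` and the example numbers (a 100 km extent sampled at 100 m) are my own hypothetical choices, not figures from the book.

```python
def data_volume(extent, resolution, d):
    """Number of cells: (extent / resolution) ** d.

    extent and resolution must be in the same units; d is the dimension
    of analysis (0 = point, 1 = line, 2 = area, 3 = volume).
    """
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    return (extent / resolution) ** d


# Hypothetical example: a 100 km extent sampled every 100 m.
for d in (0, 1, 2, 3):
    print(f"D = {d}: {data_volume(100_000, 100, d):,.0f} cells")

# Halving the resolution (100 m -> 50 m) quadruples a 2-D data set.
print(data_volume(100_000, 50, 2) / data_volume(100_000, 100, 2))
```

The last line is the point to remember: in two dimensions, making the cells half as wide gives you four times as much data to store.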

And when you add to this formula the idea of multispectral imagery (imagine sensors capturing images at multiple wavelengths, as in Figure 9.3), volume grows even faster. (And we're not talking here about the even vaster holdings of the military/government so-called intelligence industry.)
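Here is a rough Python sketch of how the number of spectral bands multiplies storage. Everything in it is an assumption for illustration: the function name `raster_bytes`, a square scene 10 km on a side with 10 m pixels, seven bands, and one byte per sample with no compression.

```python
def raster_bytes(extent_m, resolution_m, bands=1, bytes_per_sample=1):
    """Approximate uncompressed size of a square, multiband raster."""
    cells_per_side = extent_m / resolution_m
    return cells_per_side ** 2 * bands * bytes_per_sample


# Hypothetical scene: 10 km x 10 km at 10 m resolution.
single = raster_bytes(10_000, 10)
multi = raster_bytes(10_000, 10, bands=7)
print(f"1 band:  {single / 1e6:.1f} MB")
print(f"7 bands: {multi / 1e6:.1f} MB")
```

Each added band is another full copy of the grid, so the band count is a straight multiplier on top of the extent and resolution terms.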

One downside of our online course is that we cannot do fieldwork together. These days it is fairly easy to connect a GPS receiver to a laptop and take both outside to do real-time mapping, resulting in a table of (x, y, t) measurements that can be plotted in a spreadsheet or GIS. Here, for example, are GPS coordinates for a few of the "boundary stones" marking the limits of the US District of Columbia. Some of you could probably do this with your cell phone!
LONG               LAT
77° 9' 34.07" W    38° 53' 24.44" N
77° 9' 48.60" W    38° 54'  5.69" N
77° 8' 46.20" W    38° 54' 42.52" N
77° 8' 45.68" W    38° 52'  7.49" N
77° 7'  6.14" W    38° 50' 58.92" N
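Most spreadsheets and GIS packages want decimal degrees rather than degrees, minutes, and seconds, so coordinates like those above usually need converting before you can plot them. Here is a minimal Python sketch of that conversion. The function name `dms_to_decimal` and the regular expression are my own, and the sketch assumes the exact notation used in the table (degree sign, single quote for minutes, double quote for seconds, then a hemisphere letter).

```python
import re

# Degrees, minutes, seconds, and a hemisphere letter, with optional spaces.
DMS_PATTERN = re.compile(
    r"""(\d+)\s*°\s*(\d+)\s*'\s*([\d.]+)\s*"\s*([NSEW])"""
)


def dms_to_decimal(text):
    """Convert a string such as 77° 9' 34.07" W to signed decimal degrees."""
    match = DMS_PATTERN.search(text)
    if match is None:
        raise ValueError(f"not a DMS coordinate: {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # West longitudes and south latitudes are negative by convention.
    return -value if hemisphere in "SW" else value


# First boundary stone from the table above.
lon = dms_to_decimal("""77° 9' 34.07" W""")
lat = dms_to_decimal("""38° 53' 24.44" N""")
print(f"x = {lon:.5f}, y = {lat:.5f}")
```

Note the sign flip for western longitudes: forget it and your boundary stones land in central Asia instead of Washington.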

9.3 Secondary geographic data capture

Every technical profession has its history of drudgery, and I came to large-scale GIS just as digitizing was peaking as the main means of capturing spatial data. You can skim this section, but if you ever need to do historical research, you may have to become familiar with these processes.

But give some attention to the discussion surrounding Figure 9.8, particularly because it demonstrates how GIS supports both transformations among forms of representation, and movement between fields (left) and objects (right) and back again.

I do this all the time in my research. And remember: everything becomes a pixel when it is viewed on a computer.
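The movement between objects and fields can be sketched in a few lines of Python. This is only a toy illustration under my own assumptions (hypothetical point coordinates, a tiny 3 x 2 grid, and made-up function names), not the procedure shown in Figure 9.8: point objects are counted into grid cells to make a field, and then the cells at or above a threshold are pulled back out as discrete objects.

```python
def points_to_grid(points, cell_size, n_cols, n_rows):
    """Objects -> field: count the points falling in each grid cell."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = int(x // cell_size)
        row = int(y // cell_size)
        # Ignore points that fall outside the grid.
        if 0 <= col < n_cols and 0 <= row < n_rows:
            grid[row][col] += 1
    return grid


def grid_to_cells(grid, threshold):
    """Field -> objects: (row, col) of cells whose value meets the threshold."""
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value >= threshold
    ]


points = [(0.5, 0.5), (0.7, 0.2), (2.5, 1.5)]  # hypothetical coordinates
grid = points_to_grid(points, cell_size=1.0, n_cols=3, n_rows=2)
cells = grid_to_cells(grid, threshold=2)
print(grid)
print(cells)
```

The round trip is lossy: the two points in the first cell come back as a single cell, with their exact positions gone. That is worth keeping in mind every time you convert between representations.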

You should read the discussion of error (which supplements Chapter 6) to get a sense of how many places it can arise in science (not only GIScience). This theme appears in many places in the book and is a useful antidote to the notion that data are always accurate. In general (cf. Figure 6.1) the idea is that
MODEL = DATA + MEASUREMENT ERROR
      = (REALITY + CONCEPTUAL ERROR) + MEASUREMENT ERROR
so there are many places for uncertainty to arise. The examples shown here are illustrated in practice by Chapter 16 of the ArcGIS book, which is optional.

9.4 Obtaining data from external sources (data transfer)

Environmental science may be the most active area of so-called "data transfer" in which data are shared among researchers and managers. Look at Table 9.3 and check off the kinds of data you might want to include in a research project. A web search on any of these keywords will bring up many hits. Look at some of the URLs in Table 9.4 to investigate data of interest to your environmental ideas.

Box 9.2 (War story) When I was teaching urban planning at Boston University in the early 1970s I invited Don Cooke to talk to my class about the arcane topics of geocoding, DIME, and TIGER files. He stayed in New England and became an innovator in a vast and prosperous industry; I went to Africa to have adventures!

9.5-6 Capturing and managing data (and Conclusion)

As I've said at various points, you may not be managing a large GIS project, but in this class you'll be conducting original research that will require you to acquire and manage data. The exercises and assignments are designed to give you experience in these areas; some of the students in my classes use this research as a "pilot" for later, more elaborate projects. Data collection probably won't be a large part of your own research, but you will spend some time acquiring data from secondary sources and even more time managing and reshaping it for your purposes. Even at a small scale you will be exposed to tradeoffs among speed, quality, price, and scale, two dimensions of which were shown in Figure 9.2 and are sketched in Figure 9.17.