It may seem odd to begin this guide by discussing measurement in the abstract - after all, every measurement is of a WHAT that is WHERE and WHEN - but until we're comfortable describing measurements in the abstract, relating them to one another or to spatial and temporal contexts is difficult. To be meaningful, all data must somehow be made visual; we must inspect at least some of the measurements. And before data can be visualized and analyzed, you must be clear on the type of measurements that have been made. There are many taxonomies for classifying measurements, but I adopt a practical approach: the types will evolve out of our attempts to describe and analyze data. Let's begin with the simplest kind of measurement: what we do when we count things (objects, minutes, telephone poles passing on the road). Here are the first 64 integers:
EXHIBIT 1.1
1 2 3 4 5 6 7 8 9 10
1: 1 2 3 4 5 6 7 8 9 10
11: 11 12 13 14 15 16 17 18 19 20
21: 21 22 23 24 25 26 27 28 29 30
31: 31 32 33 34 35 36 37 38 39 40
41: 41 42 43 44 45 46 47 48 49 50
51: 51 52 53 54 55 56 57 58 59 60
61: 61 62 63 64
I've given each row a convenient label so you can see the index of the first measurement in that row. This is an easy vector of measurements to analyze: its minimum is 1 and its maximum is 64. It doesn't beg to be analyzed in any further way. We could begin with 0 (or any other integer) and keep counting, but our familiarity with this sequence is so great that its quantitative behavior is immediately obvious. Yet the index sequence is fundamentally important for keeping track of all other measurements.
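As a minimal sketch in Python (the language I'll use for these asides; the variable names are mine, not part of any dataset), the index does exactly this bookkeeping:

    measurements = [7.8, 7.8, 7.2, 8.0]          # any small vector (hypothetical values)
    for i, m in enumerate(measurements, start=1):
        print(i, m)                              # the index keeps track of each measurement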
The easiest distinction to make is among three kinds of measurement: amounts, counts, and labels.
Amounts are our usual measurements of things: weight (mass), distance (microns), volume (cubic meters), even time (seconds of duration).
To experiment, here are some numbers that could be amounts. They've been organized in rows of length 10 simply for visual clarity.
EXHIBIT 1.2
1 2 3 4 5 6 7 8 9 10
1: 7.8 7.8 7.2 8.0 7.4 9.1 7.6 7.5 6.5 8.4
11: 9.7 7.3 8.5 6.4 8.4 5.6 7.0 10.5 6.1 5.8
21: 5.6 5.2 7.2 10.7 5.9 9.9 5.0 5.8 7.1 8.9
31: 5.2 10.3 10.6 6.5 7.0 8.8 10.1 9.2 9.3 10.0
41: 8.1 11.4 12.2 12.5 6.4 4.8 13.2 13.3 13.3 3.4
51: 11.0 12.8 3.2 13.4 13.5 2.8 4.8 10.9 1.5 12.2
61: 1.8 13.1 2.8 13.6
Without any formal analysis, you can see that there are 64 values, that none appears to be negative or greater than about 20 (is this true?), and that they are reported to 1 decimal place. As given, we don't know whether these measurements have been rounded or truncated to make them more manageable (e.g. the first measurement could actually have been 7.854234...).
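The distinction matters: a reported 7.8 is consistent with truncation of 7.854234... but not with conventional rounding, as a quick sketch shows (the unrounded value is hypothetical):

    import math

    raw = 7.854234                         # a hypothetical unrounded measurement
    rounded = round(raw, 1)                # 7.9 - conventional rounding
    truncated = math.floor(raw * 10) / 10  # 7.8 - truncation simply drops the digits
    print(rounded, truncated)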
Counts are a second kind of measurement: simple enumerations of the numbers of objects (3 pears, 5 balls, 2 words, ...). When counts become large (beyond about 100, i.e. more than 2 digits), the relative differences among neighboring values become so small that the data take on the characteristics of amounts. They are still integers, but statistically they lose their integral quality.
Here are 64 measurements that, because they are integers, could be counts, e.g. sunspots per year for 64 years, births in 64 counties, and of course red balls counted in each of 64 drawings from an urn! Note that each of these examples requires a spatial and a temporal specification as well, in keeping with the P/T/S paradigm, but because we are concerned solely with measurement, we ignore the WHERE and WHEN.
EXHIBIT 1.3
1 2 3 4 5 6 7 8 9 10
1: 9 12 6 11 12 8 5 12 3 5
11: 10 6 8 6 10 7 13 5 6 8
21: 8 8 6 7 7 7 7 3 13 13
31: 9 6 13 10 5 9 3 10 5 9
41: 6 8 11 14 11 6 3 6 8 8
51: 11 7 11 13 10 10 8 13 7 6
61: 2 2 14 9
It is a bit easier to characterize these measurements: there is clearly no value > 100, and a casual scan of the first digits suggests that there is no value > 20. In fact we can get a sense of the frequencies of measurements < 10 (1 digit) and < 100 (2 digits) as well.
It may not be easy to tell exactly how many unique values are represented here, but we can tell a lot just by examining the numbers. By the second value we know there is more than one unique value, and the repeated values (the two 5s in the first row, for example) make it obvious that there are fewer than 64 unique values - though we knew that already because the values are small overall.
Labels are a third kind of measurement, different from amounts and counts, and usually represented by words, letters, or non-numeric symbols. In statistical literature they may be called factors, indicators, dummy variables, etc. Here are 64 labels, in this case letters. Do you notice anything similar between the pattern of counts above and this pattern of labels?
EXHIBIT 1.4
1 2 3 4 5 6 7 8 9 10
1: S D E N D T H D R H
11: I E T E I A O H E T
21: T T E A A A A R O O
31: S E O I H S R I H S
41: E T N L N E R E T T
51: N A N O I I T O A E
61: C C L S
An important difference between labels and amounts or counts is that labels usually cannot be ordered. Without any information about what our labels mean, for example, we can't say that S is less than D just because they appeared in the data in that order, or that D is less than S, although they could be alphabetized that way. Some labels are obviously ordered (SLOW < MEDIUM < FAST), but not in the sense (as with 1 < 2 < 3) that differences between them are meaningful. And if the number of possible labels becomes large (say about 2^5 = 32), then they begin to take on the characteristics of an amount, with lots of different values.
If a label is unique (no more than 1 object can have a given value), then the label can be a key: a measurement used to unambiguously identify individual objects or cases, and perhaps to link objects together. The amounts in the first example aren't an appropriate key because they are decimal (technically rational) numbers and may even be truncated, and the counts and labels certainly won't serve as identifiers because their data are full of repeats; in fact all of the labels appear at least twice (which ones appear only twice?) and one value appears 10 times (which?).
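As a sketch in Python, both the key test and the answers to those two questions fall out of a tally (the letters are typed in from Exhibit 1.4):

    from collections import Counter

    labels = ("SDENDTHDRH" "IETEIAOHET" "TTEAAAAROO" "SEOIHSRIHS"
              "ETNLNERETT" "NANOIITOAE" "CCLS")     # the 64 letters of Exhibit 1.4

    print(len(set(labels)) == len(labels))          # False: repeats, so not a valid key

    tally = Counter(labels)
    print(tally.most_common(1))                     # [('E', 10)] - appears 10 times
    print([v for v, n in tally.items() if n == 2])  # ['L', 'C'] - appear only twice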
The simplest kind of label is binary, which can be represented by 0 and 1 (or M and F, or YES and NO, etc.). Sometimes this is called a 'dummy' or 'indicator' variable. Combining binary variables creates new labels (first 4, then 8, then 16, etc.), and combining lots of binaries soon results in distinctions that come closer to amounts. Here are the counts turned into binary measurements based on whether the original number is greater than or equal to 10 (notice how easy it is to detect this from examining the counts themselves: a count >= 10 is exactly a two-digit count!). A sketch of the conversion follows the exhibit.
EXHIBIT 1.5
1 2 3 4 5 6 7 8 9 10
1: 0 1 0 1 1 0 0 1 0 0
11: 1 0 0 0 1 0 1 0 0 0
21: 0 0 0 0 0 0 0 0 1 1
31: 0 0 1 1 0 0 0 1 0 0
41: 0 0 1 1 1 0 0 0 0 0
51: 1 0 1 1 1 1 0 1 0 0
61: 0 0 1 0
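Here is that conversion as a minimal Python sketch, using just the first row of counts for brevity:

    counts = [9, 12, 6, 11, 12, 8, 5, 12, 3, 5]    # first row of Exhibit 1.3

    binary = [1 if c >= 10 else 0 for c in counts]
    print(binary)   # [0, 1, 0, 1, 1, 0, 0, 1, 0, 0] - matches Exhibit 1.5

    # The 'easy detection' mentioned above: a count >= 10 is exactly a 2-digit count.
    assert binary == [1 if len(str(c)) == 2 else 0 for c in counts]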
So for small vectors of data, rather a lot of insightful information can be gleaned from examining the raw values themselves as they come out of our experiment or measurement project. And if the number of values (amounts, counts, labels) is dauntingly large, examine at least a sample of your data to decide what next steps to take.
You should thoroughly explore any dataset before analyzing it formally, but a few cases require special consideration.
Zeros may or may not be implied in the analysis (you don't see the letter B in the labels above, either because it did not appear or because it could not). And zero cannot be passed to a logarithm function.
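A two-line demonstration in Python:

    import math

    try:
        math.log(0)
    except ValueError as err:
        print(err)      # 'math domain error' - the logarithm of zero is undefined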
Negative amounts are possible but negative counts are not. (In accounting there can be negative Euros, but these are amounts, not counts.) Here are examples of negatives from the realms of Phenomenon/Time/Space:
       REALM        EXAMPLE
WHAT   Phenomenon   temperature, overdrawn bank balance
WHEN   Time         before present, BC, negative change
WHERE  Space        below sea level, west of Greenwich
Missing values are often a necessary headache. Using the label 'NA' is helpful; numeric codes like -999 are not, because they can always be (mis)processed as numbers.
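A small sketch of why: a numeric code silently contaminates statistics, while an explicit not-a-number value can be recognized and dropped (the values are hypothetical):

    values = [7.8, 7.2, -999, 8.0]          # -999 meant as 'missing'
    print(sum(values) / len(values))        # -244.0 - nonsense

    nan = float('nan')
    values = [7.8, 7.2, nan, 8.0]
    clean = [v for v in values if v == v]   # NaN != NaN, so this drops it
    print(sum(clean) / len(clean))          # 7.666... - the intended mean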
Ratios and percents are always the result of comparing measurements, and they can be tricky, leading to errors such as references to 'knots per hour' (which technically would be a unit of acceleration, since a knot is already a speed) or 'square acres' (which is strictly a measurement of length to the fourth power, since an acre is already an area). In general the word 'rate' should be reserved for temporal analysis, discussed below.
This section deals with the description of single sets of measurements (technically vectors) rather than the analysis of matrices of multiple variables, which will be discussed below in Sec. 1.2.2.
So far we've examined data without regard to the order in which they were reported or organized. If there were accompanying information about time or geographic measurements of space, then we could use the temporal or spatial information to visualize the measurements. But for the moment assume, as with the second list above, that the measurements just came out of some detailed report in no particular sequence: they are merely a 'pile' of numbers that can be organized in any fashion. An obvious way to do this is by magnitude from smallest to largest, technically called 'ordering.' (Arranging from largest to smallest is often called 'ranking' - the terms 'ascending' and 'descending' are also used - but these terms can be confusing.) Here are the amounts ordered and incidentally grouped into 4 rows of equal length. (The fact that there are n = 64 measurements is just to make the analysis a bit easier - any n up to about 100 would be fairly easy to visualize.)
EXHIBIT 1.11
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1: 1.5 1.8 2.8 2.8 3.2 3.4 4.8 4.8 5.0 5.2 5.2 5.6 5.6 5.8 5.8 5.9
17: 6.1 6.4 6.4 6.5 6.5 7.0 7.0 7.1 7.2 7.2 7.3 7.4 7.5 7.6 7.8 7.8
33: 8.0 8.1 8.4 8.4 8.5 8.8 8.9 9.1 9.2 9.3 9.7 9.9 10.0 10.1 10.3 10.5
49: 10.6 10.7 10.9 11.0 11.4 12.2 12.2 12.5 12.8 13.1 13.2 13.3 13.3 13.4 13.5 13.6
This ordering gives us direct information about the measurements, including the minimum = 1.5 and the maximum = 13.6. No values lie outside this range, which is important to know: if these measurements are weights in tons, then it's unlikely we'll encounter anything weighing less than a pound; if distances in miles, then none is likely to be greater than 100 miles. And so forth. Very useful to know.
Because the data are shown in 4 equal rows we can also see that the median or middle value at the end of the second row is about 8, and the quartiles at the end of the first and third rows (1/4 and 3/4 along the ordered data) are about 6 and 10 (or maybe 11?). It's often therefore useful to order measurements and examine them in 4 rows (or columns).
Because of the way I've presented our vector it is easy to see these 5 statistics (minimum, maximum, median, and the two quartiles), but even with a long vector of ordered measurements, counting and examining values gives a lot of information about your data. Note that I've also been casual about the values of these statistics: we don't really care that the median is exactly 7.9 (the average of values #32 and #33). And we can see that the median is close to (just a bit larger than) halfway between the minimum and the maximum.
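For readers who want to verify these numbers, here is a minimal sketch. I use a crude midpoint convention for the median and quartiles; statistics packages offer several slightly different definitions:

    amounts = [7.8, 7.8, 7.2, 8.0, 7.4, 9.1, 7.6, 7.5, 6.5, 8.4,
               9.7, 7.3, 8.5, 6.4, 8.4, 5.6, 7.0, 10.5, 6.1, 5.8,
               5.6, 5.2, 7.2, 10.7, 5.9, 9.9, 5.0, 5.8, 7.1, 8.9,
               5.2, 10.3, 10.6, 6.5, 7.0, 8.8, 10.1, 9.2, 9.3, 10.0,
               8.1, 11.4, 12.2, 12.5, 6.4, 4.8, 13.2, 13.3, 13.3, 3.4,
               11.0, 12.8, 3.2, 13.4, 13.5, 2.8, 4.8, 10.9, 1.5, 12.2,
               1.8, 13.1, 2.8, 13.6]            # the 64 amounts of Exhibit 1.2

    x = sorted(amounts)                          # 'ordering': smallest to largest
    n = len(x)                                   # 64

    minimum, maximum = x[0], x[-1]               # 1.5 and 13.6
    median  = (x[n//2 - 1]   + x[n//2])   / 2    # values #32 and #33 -> 7.9
    lower_q = (x[n//4 - 1]   + x[n//4])   / 2    # values #16 and #17 -> 6.0
    upper_q = (x[3*n//4 - 1] + x[3*n//4]) / 2    # values #48 and #49 -> 10.55
    print(minimum, lower_q, median, upper_q, maximum)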
Examining individual data values is always a critical step in analysis; even with 'big data' it's useful to dip a spoon into the measurements and see what comes up. The exhibits above are all examples of what could be regarded as simple tables, but more formally such exhibits should have headings, titles, etc.
Here is a frequency table of the labels, such as you'd get if you had run a tally of each value. It helps us see which values occur more frequently than others.
label:     T  A  S  E  I  H  N  D  O  R  L  C
frequency: 9  7  5 10  6  5  5  3  6  4  2  2
This table is OK, but it makes comparing the frequencies difficult and it doesn't allow for nice formatting. Still, we have transformed labels into counts by enumerating the frequency with which each label appears. Data visualization should be like that: a restless search for simpler ways to display information. Here's a vertical orientation of the same tabular data, along with a useful statistic, the total number of label values in our data.
label  frequency
-----  ---------
T          9
A          7
S          5
E         10
I          6
H          5
N          5
D          3
O          6
R          4
L          2
C          2
         ---
total     64
Note that this table contains its own headings (I could call them 'labels' but we don't need another use of the term!), a total, and a few strategically placed lines. The designers of early statistical 'packages' devoted many hours to refining these kinds of line-printer presentations, and you should honor their hard work. Another (less desirable, I think) way of organizing the above table would be with the frequency column first.
There are at least two other ways to present this tabular data - by ordering each of the columns - but I won't do that here because it's more of an analytical question.
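For completeness, producing such a tally programmatically takes only a few lines; a sketch, with a layout only roughly that of the table above:

    from collections import Counter

    labels = ("SDENDTHDRH" "IETEIAOHET" "TTEAAAAROO" "SEOIHSRIHS"
              "ETNLNERETT" "NANOIITOAE" "CCLS")    # the 64 letters again

    tally = Counter(labels)
    print("label  frequency")
    print("-----  ---------")
    for label, freq in tally.items():              # first-appearance order
        print(f"{label:<5}  {freq:>9}")
    print(f"total  {sum(tally.values()):>9}")      # 64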
Almost all research reports feature tables, usually of statistics, but if your dataset is small enough it can be fully presented in a table (the vectors displayed above are simple tables). This display will allow others to examine your data in detail and perhaps do their own analysis.
Another useful statistic is the mean: the sum of the measurements divided by the number of values, in this case 526.9 / 64 = 8.2. Note that I report this statistic to the same 1-decimal precision as the measurements themselves. To give the mean as 8.2328125 etc. is unwarranted by the precision of the data, and in fact I call this 'spurious precision,' a common design flaw of many datagraphics (including maps) and tables - less is more when it comes to precision! Other statistics frequently computed are the standard deviation and variance; you can look these up, but we'll not be using them for datagraphic design.
But a statistic that is worth examining is skewness, not so much for its actual value as for what it tells about the 'shape' of the data, as we'll see below. For the moment, note that the skewness of our dataset is close to zero, indicating that the values are roughly symmetric about the mean.
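Both statistics take only a few lines; a sketch, reusing the amounts list defined in the earlier sketch (scipy's skew implements one of several skewness formulas):

    from statistics import mean
    from scipy.stats import skew   # population (Fisher-Pearson) skewness by default

    m = mean(amounts)              # 8.2328125 - reported as 8.2
    g1 = skew(amounts)             # close to zero for these values
    print(round(m, 1), round(g1, 2))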
Counts are described in the same way as amounts - they share the same statistics - but the description of label measurements is different from the other two. A raw listing of labels isn't very informative, but as with numbers there is an obvious ordering, alphabetic. Here are 64 labels, in this case three-letter country codes, alphabetized:
 1: BEN BEN BEN BEN CIV CIV CIV CIV CIV CIV CMR CMR CMR CMR CMR CMR
17: CMR CMR CMR CMR CMR CMR CMR CMR GHA GHA GHA GHA GHA GHA GIN GIN
33: GIN GIN GIN GIN GIN GIN GNB LBR LBR LBR NGA NGA NGA NGA NGA NGA
49: NGA NGA NGA NGA NGA NGA NGA NGA NGA NGA NGA SEN SEN SEN SEN SLE
Scanning this alphabetized list gives some sense of which values are most common - especially Nigeria (NGA) and Cameroon (CMR), the areally largest of these countries - by the longer runs of repeated values. A tally of the measurements gives a frequency table:
BEN  CMR  GHA  GIN  CIV  LBR  NGA  GNB  SEN  SLE
  4   14    6    8    6    3   17    1    4    1
which highlights the frequent (Nigeria) and infrequent (Sierra Leone) values. Although it is easy enough to pick out the more and less frequent values above, an even easier way is to order (actually rank, from largest to smallest) the table by frequency:
NGA  CMR  GIN  GHA  CIV  SEN  BEN  LBR  SLE  GNB  TGO  GMB
 17   14    8    6    6    4    4    3    1    1    0    0
to appreciate how often Nigeria and Cameroon were chosen. In fact, this is quite common with data: a few values often dominate the measurements. This ranked table also shows the 2 countries (TGO and GMB) that were never selected. Note also, with respect to ordering, that it's usually most informative to examine raw vectors sorted from smallest to largest and tables ranked from largest to smallest. Anyway, always ask yourself whether the data could be reorganized to reveal more insights!
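As a sketch, ranking a frequency table is one call to most_common; note the small trick needed to keep the never-selected candidates visible (the codes are reconstructed from the ranked table above):

    from collections import Counter

    codes = (["BEN"] * 4 + ["CIV"] * 6 + ["CMR"] * 14 + ["GHA"] * 6 +
             ["GIN"] * 8 + ["GNB"] + ["LBR"] * 3 + ["NGA"] * 17 +
             ["SEN"] * 4 + ["SLE"])               # the 64 country codes

    tally = Counter(codes)
    tally.update({"TGO": 0, "GMB": 0})            # candidates never selected
    for code, n in tally.most_common():           # 'ranking': largest first
        print(code, n)                            # NGA 17, CMR 14, ..., TGO 0, GMB 0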
So far the discussion has been pretty non-visual, but the excitement - and what this book is about - comes when we visualize the data and create datagraphics. Begin with the first dataset. The simplest way to see the values is a 'rug plot' that arrays the numbers as lines in empty space. Here are the indices (1 through 64):
The graphic doesn't show anything special (except for that annoying fluctuation in the gaps between the lines due to pixelization!) but does conform exactly to what we expect: a regular 'rhythm' of numbers 1, 2, 3, ..., 64.
And here's a rug plot of the amounts:
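Such rug plots are easy to reproduce; here is a minimal matplotlib sketch, again assuming the amounts list defined in the earlier sketch:

    import matplotlib.pyplot as plt

    # One short vertical line per value; clusters and gaps become visible.
    fig, ax = plt.subplots(figsize=(8, 1.2))
    ax.eventplot(amounts, orientation='horizontal', colors='black')
    ax.set_yticks([])                    # a rug plot needs no vertical scale
    ax.set_xlabel('amount')
    plt.tight_layout()
    plt.show()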
The data matrix is an organization of the measurements into rows and columns. Here is a simple table of 12 rows (plus a heading) × 5 columns showing each of the kinds of measurements we've discussed, as well as a sequential measurement that orders the data temporally (although the unit of time is unstated). This is not a real data matrix, in the sense that the measurements are not from the phenomenological world and the columns do not relate to one another analytically (though we'll pursue that below).
ID       LABEL  SEQUENCE  COUNT  AMOUNT
Sigma      D       28       2      9.1
Gamma      B       29       7      9.7
Delta      B       30       6      7.6
Kappa      C       31       8      7.5
Mu         D       32       8      9.8
Beta       A       33       3      4.2
Pi         D       34       6      4.7
Epsilon    D       35       4      4.8
Tau        C       36       8      4.2
Lambda     C       37      10      2.0
Alpha      A       38       9      5.8
Rho        A       39       1      4.2
But the table highlights several things that should be considered about any table you present.
Ordering is an often-neglected design element. Temporal observations have a natural order, as above, but the table could instead be ordered alphabetically by either of the first 2 columns or numerically by either of the last 2. Always think about how you want the table to be read: if you want to show what's biggest, order by size; if there are many rows, alphabetical order makes sense so readers can find a particular object.
Totals and other summaries should be included if useful; no summary would be shown for the first 3 columns, but the last 2 columns might be totalled, and the last column could also include a mean.
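Both ordering and summarizing are mechanical once the matrix is in software; a sketch with pandas (one common tool for data matrices), using just the first four rows of the example matrix:

    import pandas as pd

    df = pd.DataFrame({
        "ID":       ["Sigma", "Gamma", "Delta", "Kappa"],
        "LABEL":    ["D", "B", "B", "C"],
        "SEQUENCE": [28, 29, 30, 31],
        "COUNT":    [2, 7, 6, 8],
        "AMOUNT":   [9.1, 9.7, 7.6, 7.5],
    })

    print(df.sort_values("AMOUNT", ascending=False))   # reorder by size
    print(df[["COUNT", "AMOUNT"]].sum())               # totals for numeric columns
    print(round(df["AMOUNT"].mean(), 1))               # a mean for the last column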
Orientation is fundamental to presentation; because it is easier to compare measurements in columns than in rows, a transposition of the above table would be very difficult to read. Here's a simple little table that's easy to read.
       REGION
         A     B   TOTAL
WELL   175   225     400
SICK    25    75     100
TOTAL  200   300     500
Although there are only 4 numbers in the dataset, the totals improve interpretation. It's fairly easy to see that region B is 'sicker' than A, but this relationship between region and condition would be less obvious if the table were transposed. The marginal and grand totals also make it easy to get a sense of the proportions. Below I discuss causality, but even in this table the 'independent' spatial variable runs horizontally and the 'dependent' epidemiological variable runs vertically.
Here's the above table showing percents:
       REGION
         A     B   TOTAL
n      200   300     500
WELL    88    75      80
SICK    12    25      20
TOTAL  100   100     100
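Deriving such a percent table is a small computation; a sketch with plain dictionaries (the variable names are mine), computing column percents per region:

    counts = {"A": {"WELL": 175, "SICK": 25},
              "B": {"WELL": 225, "SICK": 75}}

    for region, cells in counts.items():
        n = sum(cells.values())                               # 200 and 300
        pcts = {k: round(100 * v / n) for k, v in cells.items()}
        print(region, n, pcts)                                # A: 88/12, B: 75/25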
Headings are a key element of any table; they show what the columns and rows represent: percents, thousands, or units such as parts per million. Other marginal elements (like the totals above) can summarize columns.
In sum, the overall design of a table must be considered. If it's too wide to fit on a page or screen, consider dropping some of the columns. If it's too long, drop some of the rows (or include all the data in an appendix, or even online). Summarize the smallest counts (in an OTHER category), or even substitute statistics (totals) for small occurrences.
You should be aware that many other terms are used to refer to the data matrix, including dataset, array, data frame, etc. Here are some of the ways that different fields refer to the rows and columns of data matrices.