Temporal analysis (lung deaths)

Introduction

This lab will examine deaths from lung diseases (bronchitis, emphysema and asthma) in the United Kingdom. To begin, I don’t reveal anything more about the measurements because I want to explore the data first as abstract numbers and only later as appropriately labeled and structured.

This lab exercise was written using Excel 2010 so be aware that some operations may be different for your version, and any of these operations can be performed with any reasonably powerful statistical software.

Data input and description

Read in to Excel and generate descriptive statistics

Download http://ldecola.net/edu/labs/uk_dat.txt and save it locally.

Run Excel and read in the file using File > Open > Type: All Files and File name: uk_dat.txt. Then select Delimited > Next > Delimiters: Space then Finish.

What is the nature of this data matrix? How many rows and columns? Do the variables (columns) appear to be similar?

Save your spreadsheet as an .xlsx file.

Descriptive statistics

This step requires the Excel Data Analysis add-in (see http://ldecola.net/edu/docs/useful_info.html for how to install it).

To generate statistics run Data > Data Analysis > Data Analysis > Descriptive Statistics, OK > Input Range: select all columns then Enter and check Summary statistics then OK.

Expand the columns to see what they contain and delete the redundant columns.

How many rows (‘observations’ or data points) are there?

Move the column labels and remove the redundant columns that identify the statistics.

Several of the statistics have excess significant figures; I suggest you reduce them to 2 or 3 figures to the right of the decimal.

Group the statistics into a few categories: typicality (e.g. mean), variation (standard deviation), and other. A possible grouping: 1) Count, 2) Sum 3) Mean, Median, Mode, 4) Minimum, Maximum, Range, 5) Standard Deviation, Standard Error, 6) Skewness, 7 (other) Kurtosis, Sample Variance.

Discuss the various statistics and what they tell you about these variables.  Which of the variables is more strongly skewed than the others?

  1. EXTRA      Make a nice table with a few of the important statistics.

Plot the data

Do this for each of the columns of uk_dat data in turn:

Select the column and make a scatter diagram. This will plot the variable in the order in which it appears in the worksheet. What do you notice about each of the variables, and how does this correspond with the descriptive statistics you examined above?

This is a convenient way to look at the statistical ‘behavior’ of a variable. It’s easy to see how the 3rd variable ranges between about 1000 and 4000, with at least one ‘outlier’.

Use scatter to plot the third column against the first, as well as against the second. Without concerning yourself about what the variables actually measure, what kind of patterns to you see?

Finally, make a line plot of the 3rd column alone.

Sort and format the data

Sort by the first 2 columns

Sorting is a way to organize data and can be done by a single or by multiple variables. We want to sort by column A and then within it, by column B. In Excel 2010 this is done with ‘levels.’ Select the 3 columns, then Data > Sort > Sort by: Column A then Add Level > Then by: Column B and OK.

Examine the first 2 columns and discuss their values. What regularities to you now see?

Look again at the line plot of the 3rd variable and see what has happened to the pattern.

In light of the fact that the data are a time series, discuss the pattern you now see.

NOTE: Usually before I sort a new dataset I add a column of integers (e.g. 1, 2, …72) so that I can return to the original order, but as these data were randomly shuffled that order is no longer needed!

Format the dates

Insert a blank column to the left of column C. In what is now cell C1 insert the formula =DATE(A1+1973,B1,15).

Copy the formula down to the bottom of the data range.

Then right-click the new column and select Format > Custom > Type then fill in the string of characters yyyy-mmm.

Insert a new row at the top of the data and fill in the labels (headers): yearnum, month, date, and deaths.

Consider what these 2 operations have done to the columns of data. Nothing has been changed except formatting, but now the variables have their intended meaning.

NOTE: you could have created this new year/month column and then have sorted on this column only.

Plot the data as a time series

Format the table

Set the font of the data as a fixed-width face (Courier New or Lucinda Console work well, though for some reason they appear to be about 10% larger than Times New Roman).

View > Freeze the top row and give it a bold face so the column headings stand out. You could use this in a report or as a saved file; a screenshot would not give anyone access to the data, however.

Make an exhibit for presentation

Make a line chart of deaths by date. The default plot shows a temporal pattern fairly clearly but has several deficiencies which you should try to correct.

Format the time axis so that it shows the years only. Getting the years to center on their data will be difficult, but you can change fontsThis will be

Move the v-axis (deaths) to the right of the plot so it is easier to see recent data.

Change the intervals on the vertical axis to something less cluttered.

Remove the redundant legend, but you may want to add something to the title, axis, or other text.

Aspect ratio of about .6 (the ‘golden ratio’) of vertical to horizontal) looks nice to me.

Format the chart for publication

Discuss the temporal behavior of the data in light of the appearance of this visualization. E.g. does there appear to be a trend to the data from year to year? The pattern is clearly seasonal; when do the peaks and troughs appear?

Add a trend line with its formula and R2 and discuss these statistics.

EXTRA: Add a continuous ID representing the sequence of months in the data (1, 2, …, 72) and regress this variable on deaths to estimate the change per month over the period (I get a decrease of 6.6 per month). Discuss this result, as well as the R2.

Analysis by year and by month

Here I assume you’re using Excel 2010, but similar instructions will work for Excel 2003.

Years

Insert > PivotTable > select all the columns of data. When the table worksheet appears put years in the row labels and deaths in the Values, then select Sum of deaths. You should get a simple table with 2 columns: 6 years and the total deaths for each year.

I don’t know how to plot data in a pivot table so do the following. Copy the data from the table and the paste into another part of the worksheet.

Make a column bar plot of the deaths by year. Discuss the trend if any.

EXTRA: discuss the merits of bar v. line v. scatter plots.

Months

Make a pivot table with months in the column and next to it average deaths.

Copy the average deaths and plot these data.

Discuss the seasonality of the average deaths: when maxima and minima occur, and why?

EXTRA: What kind of mathematical model (‘curve’) would be appropriate to model these data?

Conclusion

This is an excellent dataset for the analysis of temporal phenomena, and because the data have been analyzed so thoroughly you can compare your results with more rigorous analyses.

References

Diggle, P. J. (1990) Time Series: A Biostatistical Introduction. Oxford, table A.3.

Venables, W. N. and B. D. Ripley (1999) Modern applied statistics with S-PLUS. New York, Springer-Verlag.