3.1: Statistics – Visualising data

Visualisation is critical to data analysis. Tables are necessary to record the data, but it is usually very difficult to distinguish patterns in tables of numbers, particularly for large data sets. Graphs, by contrast, allow the reader to see complex data sets simply and concisely. Plots can reveal hidden structure in the data, as well as outlying or unusual results, and they allow preconceived ideas to be challenged. Visualisation of data is best described using an example.

Table IS3.1.1 contains a typical water data set that may have been collected as part of an operational monitoring program. A simple graphical representation of one analyte (iron in this case) from these data is presented in Figure IS3.1.1. The graph makes clear a number of features of the data that are lost in the table. The mean of the data can be seen to be close to 0.2 mg/L, and the majority of the data lie within one standard deviation of the mean. Of note are four data points that exceed the aesthetic guideline value listed in the Guidelines, as well as periods of consecutive increases or decreases in the data. Such a chart is often referred to as a quality control chart or Shewhart chart, and is discussed further in Information Sheet 3.5.

Table IS3.1.1 Example of a water quality data set

Date | Iron (mg/L) | Manganese (mg/L) | Total Hardness as CaCO₃ (mg/L) | Colour (PCU) | Turbidity (NTU) | Electrical Conductivity (µS/cm)
28/11/1999 | 0.08 | 0.02
28/11/1999 | 0.1 | 0.005
30/01/2000 | 0.13 | 0.02
5/03/2000 | 0.09 | 0.01 | 37 | 15 | 0.5 | 125
16/04/2000 | 0.1 | 0.01
21/05/2000 | 0.13 | 0.01 | 18 | 105
25/06/2000 | 0.14 | 0.01 | 29 | 24 | 1 | 110
6/08/2000 | 0.12 | 0.005 | 35 | 12 | 0.7 | 120
3/09/2000 | 0.42 | 0.02 | 15 | 145 | 13 | 75
3/10/2000 | 0.16 | 0.01 | 15 | 27 | 1.6 | 99
6/11/2000 | 0.21 | 0.01 | 26 | 9 | 0.6 | 120
4/12/2000 | 0.25 | 0.01 | 27 | 18 | 0.6 | 125
8/01/2001 | 0.35 | 0.03 | 27 | 0.5 | 125
5/02/2001 | 0.26 | 0.02 | 24 | 0.8 | 120
5/03/2001 | 0.29 | 0.01 | 14 | 58 | 4.6 | 84
9/04/2001 | 0.22 | 0.01 | 24 | 0.9 | 110
15/05/2001 | 0.21 | 0.01 | 25 | 0.6 | 115
17/06/2001 | 0.21 | 0.01 | 26 | 0.6 | 115
16/07/2001 | 0.12 | 0.01 | 23 | 0.9 | 105
12/08/2001 | 0.11 | 0.01 | 23 | 8 | 0.7 | 115
9/09/2001 | 0.07 | 0.01 | 33 | 0.9 | 130
15/10/2001 | 0.16 | 0.01 | 24 | 19 | 0.8 | 110
11/11/2001 | 0.19 | 0.01 | 29 | 5 | 0.6 | 120
3/02/2002 | 0.36 | 0.03 | 28 | 5 | 0.9 | 136
11/03/2002 | 0.22 | 0.01 | 32 | 6 | 1 | 155
28/04/2002 | 0.81 | 0.04 | 43 | 5 | 2.6 | 218
26/05/2002 | 0.07 | 0.01 | 28 | 8 | 0.3 | 125
16/06/2002 | 0.11 | 0.01 | 25 | 6 | 0.6 | 116
21/07/2002 | 0.14 | 0.01 | 27 | 4 | 0.4 | 112
11/08/2002 | 0.24 | 0.01 | 18 | 18 | 12 | 82
8/09/2002 | 0.13 | 0.01 | 24 | 7 | 1.2 | 110
13/10/2002 | 0.12 | 0.01 | 31 | 5 | 0.4 | 131
10/11/2002 | 0.13 | 0.01 | 31 | 5 | 0.5 | 130
2/02/2003 | 0.18 | 0.01 | 16 | 19 | 1.7 | 85
16/03/2003 | 0.2 | 0.01 | 21 | 6 | 0.6 | 98
20/04/2003 | 0.17 | 0.01 | 24 | 5 | 0.5 | 108
11/05/2003 | 0.2 | 0.01 | 24 | 4 | 0.5 | 110
22/06/2003 | 0.17 | 0.01 | 31 | 3 | 0.9 | 111
21/07/2003 | 0.14 | 0.01 | 21 | 10 | 1 | 92
24/08/2003 | 0.14 | 0.01 | 27 | 4 | 0.5 | 106
7/09/2003 | 0.13 | 0.01 | 27 | 5 | 0.4 | 108
12/10/2003 | 0.14 | 0.01 | 32 | 5 | 0.5 | 125
16/11/2003 | 0.13 | 0.01 | 27 | 7 | 0.4 | 105
7/03/2004 | 0.16 | 0.01 | 31 | 4 | 0.6 | 128

Not every analyte was reported on every date; rows with fewer than six results list the reported values in column order.

Figure IS3.1.1 Graphical presentation of iron data from Table IS3.1.1
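
The reference lines of a chart like Figure IS3.1.1 can be derived directly from the iron column of Table IS3.1.1. The following is a minimal Python sketch, not the method used to produce the figure. It assumes an aesthetic guideline value for iron of 0.3 mg/L (the text refers to the guideline but does not state the number), and the variable and function names are illustrative only.

```python
from statistics import mean, stdev

# Iron results (mg/L) from Table IS3.1.1, in date order.
iron = [
    0.08, 0.10, 0.13, 0.09, 0.10, 0.13, 0.14, 0.12, 0.42, 0.16, 0.21, 0.25,
    0.35, 0.26, 0.29, 0.22, 0.21, 0.21, 0.12, 0.11, 0.07, 0.16, 0.19,
    0.36, 0.22, 0.81, 0.07, 0.11, 0.14, 0.24, 0.13, 0.12, 0.13,
    0.18, 0.20, 0.17, 0.20, 0.17, 0.14, 0.14, 0.13, 0.14, 0.13, 0.16,
]

# Assumption: aesthetic guideline value for iron, not stated in the text.
AESTHETIC_GUIDELINE = 0.3

centre = mean(iron)
spread = stdev(iron)  # sample standard deviation
lower, upper = centre - spread, centre + spread

# Points inside the one-standard-deviation band, and points above the guideline.
within_band = [x for x in iron if lower <= x <= upper]
exceedances = [x for x in iron if x > AESTHETIC_GUIDELINE]


def longest_rising_run(values):
    """Return the largest number of consecutive increases in a sequence."""
    best = current = 0
    for previous, value in zip(values, values[1:]):
        current = current + 1 if value > previous else 0
        best = max(best, current)
    return best


print(f"mean = {centre:.2f} mg/L, standard deviation = {spread:.2f} mg/L")
print(f"{len(within_band)} of {len(iron)} results lie within one standard deviation")
print(f"results above {AESTHETIC_GUIDELINE} mg/L: {exceedances}")
print(f"longest run of consecutive increases: {longest_rising_run(iron)}")
```

Plotting the results against date, with horizontal lines at `centre`, `lower`, `upper` and the guideline value, would give a chart of the same general form as the figure.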

Another useful graphical representation of data is the frequency histogram (Figure IS3.1.2), which can identify individual features of the distribution of the data and their relationship with other analytes. Frequency histograms can provide an indication of the normality of the data (or, in most cases, the lack of it). For example, compare the frequency histogram for total hardness with those for iron and manganese; the latter two show significant departure from normality. Tests of normality (e.g. the Anderson-Darling test or maximum likelihood methods) can be considered but are unlikely to be definitive when sample sizes are small, as is usually the case with water quality data. Further aspects of the distribution (e.g. skewness and kurtosis) also need to be considered. In practice, water quality data are rarely normally distributed. Accordingly, alternative distributions (most notably the lognormal distribution) should be considered when significant departures from normality are observed.
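
As a rough illustration of these points, the sketch below bins the iron results from Table IS3.1.1 into a text histogram and compares the skewness of the raw values with that of their natural logarithms. It is a minimal example under stated assumptions: the bin width of 0.1 mg/L and the helper names `skewness` and `histogram` are choices made here, and a lower skewness after the log transform only suggests, and does not demonstrate, that a lognormal distribution is a better description.

```python
import math

# Iron results (mg/L) from Table IS3.1.1, in date order.
iron = [
    0.08, 0.10, 0.13, 0.09, 0.10, 0.13, 0.14, 0.12, 0.42, 0.16, 0.21, 0.25,
    0.35, 0.26, 0.29, 0.22, 0.21, 0.21, 0.12, 0.11, 0.07, 0.16, 0.19,
    0.36, 0.22, 0.81, 0.07, 0.11, 0.14, 0.24, 0.13, 0.12, 0.13,
    0.18, 0.20, 0.17, 0.20, 0.17, 0.14, 0.14, 0.13, 0.14, 0.13, 0.16,
]


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((x - centre) ** 2 for x in values) / n
    m3 = sum((x - centre) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def histogram(values, width_hundredths=10):
    """Count values per bin; bins are indexed in whole hundredths of a mg/L
    so that results such as 0.20 fall in the intended bin despite
    floating-point rounding."""
    counts = {}
    for x in values:
        index = round(x * 100) // width_hundredths
        counts[index] = counts.get(index, 0) + 1
    return counts


counts = histogram(iron)
for index in sorted(counts):
    low = index * 0.1
    print(f"{low:.1f} to {low + 0.1:.1f} mg/L | {'#' * counts[index]}")

raw_skew = skewness(iron)
log_skew = skewness([math.log(x) for x in iron])
print(f"skewness of raw data: {raw_skew:.2f}")
print(f"skewness of log-transformed data: {log_skew:.2f}")
```

A symmetric distribution such as the normal has a skewness of zero, so a large positive value for the raw data is consistent with the long right-hand tail visible in the histogram.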

Figure IS3.1.2 Histogram of data for selected analytes from Table IS3.1.1. Note that most data sets shown are evidently non-normal and skewed

In summary, the first step in data analysis is to present the data graphically, ideally for both the 12-month reporting period and a longer period: either the full period for which data are available or a ten-year period.

If more detailed or involved data analysis techniques are to be considered, the advice of a statistician should be sought.

Australian Drinking Water Guidelines 6 2011, v3.9
