3.3: Statistics – Statistical principles
This information sheet sets out some general statistical principles and considerations for designing and interpreting water quality monitoring programs. Ideally, expert statistical advice should be sought in devising and interpreting such a program. Further information and references can be found in the Australian Guidelines for Monitoring and Reporting, National Water Quality Management Strategy Paper No. 7 (ANZECC/ARMCANZ 2000).
Summary statistics
A fundamental task in many statistical analyses is to characterise the location and variability of a data set. This is usually described by the mean (µ) and standard deviation (σ). Most statistical packages can produce these values for a given data set, and percentiles are also simple to derive. None of these three statistics requires an underlying assumption of normality, and all can be derived using simple statistical tools and commonly available spreadsheet packages. However, several data-handling issues usually need to be addressed before the statistics can be estimated, as follows.
Outliers and ‘less than’ values
Two persistent problems cause difficulties in the use of the mean in assessing water quality data:
outliers – that is, numbers that appear to be extreme when compared with other data in the data set. These are not numbers generated by some malfunction of measuring equipment or transcription errors, which clearly ought to be discarded. They are numbers that seem anomalous, although there is no obvious explanation and they cannot be discarded on technical grounds; and
values that are recorded as less than the limit of detection.
As an example, consider a data set in which several results are reported as less than the limit of detection (e.g. <0.5) and one result (21.3) is far higher than the rest.
The first problem is what to do about the less-than values. Should they be ignored, replaced by 0.25, replaced by 0, or should the < symbol simply be dropped? There is no clear answer, except that it can be shown that using L/2, where L is the limit of detection, is effectively a worst-case method and not the even-handed approach it appears to be at first sight (Ellis 1989). If the values below the limit of detection are critical in determining how a supply performs against the Guidelines, then steps should be taken to reduce the limit of detection. Statistical treatment of values below the detection limit is possible but is complex and not entirely satisfactory. In the absence of any alternative, however, it is recommended that values below the detection limit be replaced by L/2, as a conservative approach. The original data set should be kept intact so that it can be seen which data were substituted in this way, and the substitution should be noted when presenting results. Further information on dealing with less-than values can be found in Croghan and Egeghy (2003) and Smith et al. (2006).
For determining the 95th percentile, up to 95% of the reported results can be less than the limit of detection and the statistic can still be found readily. The values in the lowest 95% serve only to fix the rank of the value reported as the 95th percentile; they do not contribute to it arithmetically. If more than 95% of the reported values are below the detection limit, the 95th percentile should be reported simply as less than the limit of detection. Percentiles are discussed further in the following section.
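This rank-based logic can be sketched as follows; the data set of 20 results, of which 85% are censored, is assumed for illustration:

```python
# Sketch: rank-based 95th percentile when most results are below the
# limit of detection (LOD). The censored values only occupy the lower
# ranks; they do not contribute arithmetically to the reported statistic.

raw = ["<0.2"] * 17 + [0.3, 0.4, 0.9]   # 20 hypothetical results, 85% censored

# Sort with censored results ranked below every numeric result
# (the placeholder key -1.0 assumes all numeric results are positive).
ranked = sorted(raw, key=lambda v: -1.0 if isinstance(v, str) else v)

n = len(ranked)
rank = -(-95 * n // 100)     # ceiling of 0.95 * n, in exact integer arithmetic
p95 = ranked[rank - 1]
print(p95)                   # -> 0.4

# If more than 95% of results were censored, p95 would land on a
# censored rank and should be reported simply as "<LOD".
```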
For determining averages, it is necessary to substitute the values at less than the detection limit with L/2 and to note that this substitution was made, as well as noting what proportion of data was below the detection limit; for example: Average: 0.3 mg/L (notes: detection limit 0.2 mg/L, 12 samples taken, 3 samples below detection limit which were substituted with a value of 0.1 mg/L). The substitution of censored data will necessarily introduce biases to the calculated means and standard deviations, and this approach should be used only when the proportion of censored data is relatively low (e.g. 3 out of 12 results, in the above example).
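The worked example above can be sketched in code as follows; the nine hypothetical numeric results are chosen only so that the substituted mean comes out at 0.3 mg/L:

```python
# Sketch of the worked example: 12 samples, 3 below a detection limit of
# 0.2 mg/L, each censored value substituted with L/2 = 0.1 mg/L before
# the mean is calculated. All values are hypothetical.

LOD = 0.2
raw = ["<0.2", "<0.2", "<0.2",
       0.25, 0.3, 0.35, 0.4, 0.45, 0.3, 0.35, 0.4, 0.5]

# Substitute censored results with L/2, keeping raw intact for the record.
substituted = [LOD / 2 if isinstance(v, str) else v for v in raw]
n_censored = sum(isinstance(v, str) for v in raw)

mean = sum(substituted) / len(substituted)
print(f"Average: {mean:.1f} mg/L "
      f"(detection limit {LOD} mg/L, {len(raw)} samples taken, "
      f"{n_censored} samples below detection limit, "
      f"substituted with {LOD / 2} mg/L)")
```

Keeping the original list (with its < entries) alongside the substituted one preserves the record of which data were altered, as the text recommends.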
The second problem with the data set is the very high 21.3 value. Is it genuine, or an analytical error? If it is genuine, is it valid to include it in the calculation of the mean (and hence the 95th percentile) when it will clearly have a marked effect on the result? The answer is that it must be included in the calculation as it may have an impact on the health of people receiving the water. To remove it would have the same effect as censoring the data set. Only those data points that have been clearly shown to be in error should be removed.
Simple spreadsheet packages, such as Excel™, may have difficulty deriving percentiles and averages if some of the data are recorded as <x, since the software may treat these entries as non-numerical and simply ignore them. Substitution with numerical values such as L/2, or manual calculation, may therefore be necessary to avoid producing misleading results.
If the detection limit is above the guideline value, a more sensitive assay should be adopted. Detection limits are also discussed in Chapter 9 (Section 9.10.3).
Skewness and kurtosis
A further characterisation of the data can include measures of skewness and kurtosis.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetrical if it looks the same to the left and right of the centre point.
The skewness for a normal distribution is zero, and any symmetrical data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left (i.e. the left tail is long compared to the right tail); positive values indicate data that are skewed right (i.e. the right tail is long compared to the left tail). Further advice on how to treat strongly skewed data sets can be found in McBride (2005).
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.
There are different definitions of kurtosis in use. Under the common 'excess kurtosis' convention, the standard normal distribution has a kurtosis of zero (under the classical definition it is 3). Which definition is used is a matter of convention, so when using software to compute the sample kurtosis, it is necessary to be aware of which convention is being followed.
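As an illustration of the convention issue, the following sketch computes population-moment skewness and excess kurtosis (the convention under which a normal distribution scores zero) from first principles, so that the definition in use is explicit. Statistical packages often apply small-sample corrections, so their results may differ slightly:

```python
# Sketch: skewness and excess kurtosis from population moments.
# The example data sets are hypothetical and chosen for illustration.
import math

def skewness(data):
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((x - m) ** 2 for x in data) / n)   # population sd
    return sum((x - m) ** 3 for x in data) / (n * s ** 3)

def excess_kurtosis(data):
    n = len(data)
    m = sum(data) / n
    s2 = sum((x - m) ** 2 for x in data) / n             # population variance
    return sum((x - m) ** 4 for x in data) / (n * s2 ** 2) - 3  # 0 for normal

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(skewness(symmetric))       # -> 0.0 (symmetrical data)
print(skewness(right_skewed))    # positive: long right tail
print(excess_kurtosis(symmetric))  # negative: flat, uniform-like data
```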
The histogram (Information Sheet 3.1) is an effective graphical technique for showing both the skewness and kurtosis of a data set. This may be sufficient to indicate whether the data are approximately normal, or are skewed and need further statistical treatment (e.g. transformation) before statistics are calculated and performance assessments undertaken.
Measurement error
A set of results is no more than a series of snapshots of some process over the period of sampling. A statistic calculated from these results, such as a percentile, a mean, or a standard deviation, can never exactly coincide with the true statistic, except by chance. The true statistic could only be determined by continuous error-free measurement of every drop of water – an impossibility in water quality analysis.
Values determined experimentally from a set of measurements are, thus, often referred to as estimates of the true statistic. These estimates may be too high or too low – there is no way of knowing. This uncertainty is known as the measurement error (although the term ‘error’ is unfortunate as it really means ‘small departures from the true result’, not mistakes made in analysis), and quantification of this error is an important component of statistical methods.
If more detailed or involved data analysis techniques are to be considered, the advice of a statistician should be sought.
References
Croghan CW, Egeghy PP (2003). Methods of dealing with values below the limit of detection using SAS. South Eastern SAS Users Group Conference.
Ellis JC (1989). Handbook on the Design and Interpretation of Monitoring Programmes. Water Research Centre, Medmenham, United Kingdom, Technical Report NS29.
McBride GB (2005). Using Statistical Methods for Water Quality Management. Issues, problems and solutions. Wiley Series in Statistics in Practice.
Smith D, Silver E, Harnly M (2006). Environmental samples below the limits of detection – comparing regression methods to predict environmental concentrations. SAS Conference Proceedings: Western Users of SAS Software.
Sokal RR, Rohlf FJ (1969). Biometry. WH Freeman and Company, San Francisco.