LECTURE 2: Additional Notes

 

The Normal Distribution

The Normal Distribution is a mathematical function that defines the distribution of scores in a population with respect to two population parameters.  The first parameter is denoted by the Greek letter $\mu$ (mu) and represents the population mean.  The second parameter is denoted by the Greek letter $\sigma$ (sigma) and represents the population standard deviation (the standard deviation is equal to the square root of the variance, so the variance is represented as $\sigma^2$).  A different normal distribution is generated whenever the population mean or the population standard deviation is different.

Hint:  Most of the time that Greek letters are used to refer to means, standard deviations and variances they are referring to populations, not samples.

Above: ANOVA assumes that the populations tested have the same variance. ANOVA tests whether the central tendency (mean) differs between samples (middle diagram).

The normal distribution is used in statistical analysis in order to make standardised comparisons across different populations (treatments).  The kinds of parametric statistical techniques we use assume that a population is normally distributed.  This means that we can compare directly between two populations. 

The major underlying conceptual premise is that the area under the normal distribution represents the proportion of scores that are within a particular range.  Not only that, but it is possible to calculate exactly what proportion of scores fall within a range by standardising the distribution.  Effectively, we can compare two or more populations by discovering how much overlap in area there is between the two normal distributions that describe those populations.
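As a minimal sketch of this idea, the following Python uses the standard library's `statistics.NormalDist` to find the proportion of scores within a range, first by standardising the limits and then directly.  The population mean of 100, the standard deviation of 15 and the range 85 to 115 are illustrative assumptions, not values taken from these notes.

```python
from statistics import NormalDist

# Hypothetical population parameters, chosen only for illustration.
mu, sigma = 100.0, 15.0
low, high = 85.0, 115.0

# Standardise the limits of the range: z = (X - mu) / sigma.
z_low = (low - mu) / sigma
z_high = (high - mu) / sigma

# The area under the standard normal curve between the two z-scores
# is the proportion of scores expected to fall within the range.
standard_normal = NormalDist(0.0, 1.0)
proportion = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

# The same area computed on the unstandardised distribution.
population = NormalDist(mu, sigma)
direct = population.cdf(high) - population.cdf(low)

print(f"z-scores: {z_low:+.2f} to {z_high:+.2f}")
print(f"proportion within range: {proportion:.4f}")
print(f"computed without standardising: {direct:.4f}")
```

Because the limits here sit one standard deviation either side of the mean, the two calculations agree on roughly 68% of scores, which is why standardising lets populations with different means and standard deviations be compared on a common scale.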

If the shared area is large there is a very small probability that the two populations are different.  If the shared area is small there is a much greater probability that the two populations are different.  In other words, we can reason directly about the null hypothesis and what it might mean by comparing two normal distributions.  The null hypothesis states that the two normal distributions have the same mean.  So, if the two means are the same, and we assume that both populations are normally distributed with the same standard deviation, then the two curves coincide and the shared area is the entire area under the curves.  If the null hypothesis is false, the shared area will be smaller.  It is mathematically impossible for the shared area ever to equal zero, but we can make intelligent decisions on the basis of how much area is shared.
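The relationship between the difference in means and the shared area can be sketched with `NormalDist.overlap` from the Python standard library (available from Python 3.8).  The baseline mean of 50, the common standard deviation of 10 and the list of mean differences are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical populations sharing a common standard deviation.
SIGMA = 10.0
baseline = NormalDist(50.0, SIGMA)

def shared_area(mean_difference):
    """Area shared by the baseline curve and a curve shifted by mean_difference."""
    shifted = NormalDist(50.0 + mean_difference, SIGMA)
    return baseline.overlap(shifted)

for diff in (0.0, 5.0, 10.0, 20.0, 40.0):
    print(f"mean difference {diff:5.1f}: shared area = {shared_area(diff):.4f}")
```

With identical means the shared area is the whole area under the curve, and it shrinks towards (but never reaches) zero as the means move apart.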

So, when we use certain parametric statistical techniques we are effectively using the normal distribution as a means of standardising the comparisons, such that we can associate particular probabilities to the differences between populations and either accept or reject the null hypothesis.

Examples

 

Note that not all distributions are normal, symmetrical functions. Often data are "positively" or "negatively" skewed. A positively skewed distribution has a dense concentration at low values with a few outliers with very large values. This shape is commonly found when we test response times (e.g. button presses), where there is a constraint on the minimum time for response (the motor time needed to make a button response), but little limit on the maximum time to make a response. One hint is that a positively skewed distribution looks like a "P" on its side. We will discuss skewed data more in a later example. However, here are two graphical examples of distributions that have different shapes:

 

                

 

The example on the left shows two different distributions with the same mean: one distribution is normal and the other is positively skewed. The image on the right also shows two different distributions with the same means: in this case one distribution is positively skewed and the other distribution is negatively skewed.
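The direction of skew can also be checked numerically.  The sketch below simulates response times under an assumed model (a fixed minimum motor time plus an exponentially distributed decision time); the model, the 200 ms and 150 ms constants and the helper name `skewness` are illustrative assumptions rather than anything prescribed in these notes.

```python
import random
from statistics import mean, median, pstdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical response times in ms: a minimum motor time plus an
# open-ended decision time (assumed exponential, mean 150 ms).
MOTOR_TIME_MS = 200.0
times = [MOTOR_TIME_MS + random.expovariate(1 / 150.0) for _ in range(5000)]

def skewness(values):
    """Mean of the cubed standardised deviations (positive = right tail)."""
    m, sd = mean(values), pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

# Reflecting the data about a constant produces a negatively skewed sample.
mirrored = [1000.0 - t for t in times]

print(f"mean {mean(times):.0f} ms, median {median(times):.0f} ms")
print(f"skewness of response times: {skewness(times):+.2f}")
print(f"skewness of mirrored data:  {skewness(mirrored):+.2f}")
```

In the positively skewed sample the long right tail pulls the mean above the median; the mirrored sample has the same spread but the opposite sign of skew.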

Summary

Simply on the basis of adopting the notion of shared variance, or area under a normal distribution, or proportion of scores in common (all the same thing) we can see how important it is to assume that the distributions have the same form.  If they do not we cannot make direct comparisons between the groups.

Therefore there are three very important assumptions that are made when using parametric statistical techniques:

 

We are now going to change tack.  Instead of using common sense to argue why the distributions we look at should be normally distributed and have the same variance and form, we will show how exactly the same assumptions follow from the theoretical statistics that underlie the statistical tests we can use for comparing populations.

The Chi-Square Distribution

Having talked about the importance of the normal distribution we will now move on to discussing how we can use the normal distribution as a basis for making statistical inferences about treatment populations on the basis of samples that we have taken.  This requires an understanding of the chi-square distribution and the way in which it can be used to estimate differences between sample and population variances on the basis of knowledge about population means.  In 1876 F.R. Helmert derived the chi-square distribution, but it was Karl Pearson who in 1900 first used it as a means of testing hypotheses.

Let's assume that there is a population of scores that are normally distributed.  If we repeatedly take samples of size one from this population and convert these data points into a standardised form, i.e.

$$ z = \frac{X - \mu}{\sigma} $$

and we then square this number, then the distribution of the random variable $z^2$ is a chi-square distribution with 1 degree of freedom.  $z^2$ is the ratio of an unbiased estimate of the population variance, $(X - \mu)^2$, to the actual population variance $\sigma^2$ (the square of the standard deviation), so it incorporates knowledge of both the population variance and the population mean.  If two independent observations are drawn at random from a normally distributed population then the sum of their squared standardised scores also has a chi-square distribution, with 2 degrees of freedom.  If n independent observations are drawn at random from a normally distributed population then the sum of their squared standardised scores also has a chi-square distribution, with n degrees of freedom.
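The construction above can be checked by simulation.  The following sketch repeatedly draws scores from a hypothetical normal population (mean 50, standard deviation 10, both assumed for illustration), standardises and squares them, and compares the mean and variance of the sums with the known values for a chi-square distribution (mean equal to the degrees of freedom, variance equal to twice the degrees of freedom).

```python
import random
from statistics import mean, pvariance

random.seed(2)  # fixed seed so the sketch is repeatable

MU, SIGMA = 50.0, 10.0   # hypothetical population parameters
REPEATS = 20000

def sum_of_squared_z(n):
    """Draw n independent scores, standardise each, and sum the squares."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(MU, SIGMA)
        z = (x - MU) / SIGMA
        total += z ** 2
    return total

results = {}
for df in (1, 2, 5):
    draws = [sum_of_squared_z(df) for _ in range(REPEATS)]
    results[df] = (mean(draws), pvariance(draws))
    print(f"df={df}: mean {results[df][0]:.2f} (theory {df}), "
          f"variance {results[df][1]:.2f} (theory {2 * df})")
```

The simulated values only approximate the theoretical ones, but the agreement improves as `REPEATS` is increased.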

Thus, there is a family of chi-square distributions that are related to each other according to the associated degrees of freedom.  If we knew everything there was to know about a population we could generate the appropriate chi-square distribution and use this as a description of our data.  However, we usually cannot exhaustively sample from a population, so the question is what happens when we take the values we have obtained from our sample and compare them to what other research has stated about the population?

Basically, if we treat a sample as n independent observations drawn at random from a normally distributed population, where n represents the number of subjects in the sample, and we use the sample mean in place of the population mean, then we obtain a chi-square distribution with n - 1 degrees of freedom.  In other words, the sample, if randomly taken from a population, also generates a distribution that is based on the chi-square distribution.  Given this, we can directly compare the sample-based estimate of the population variance with our knowledge of the actual population variance.

The ratio of the sample-based estimate of the population variance to the actual population variance can range from 0 to positive infinity, and is expected to be close to 1 when the sample really does come from that population.  Since we can derive the probability of obtaining a given ratio when the sample is taken from a specific population (as the null hypothesis dictates), we can test the null hypothesis.

The test statistic that is used in these circumstances is:

$$ \chi^2 = \frac{(n - 1)\,\hat{\sigma}^2}{\sigma^2} $$

where $\hat{\sigma}^2$ is the estimate of the population variance due to the sample, $\sigma^2$ is the variance of the population, and n is the number of observations in the sample, so that the statistic has n - 1 degrees of freedom.  Using the chi-square distribution we can test whether a sample and population variance are the same or different.  However, this requires us to know what the population variance is.  Unfortunately we rarely know the population variance.  This means that we need to adopt a slightly different test.
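As a hedged illustration of the variance test described above, the sketch below computes the statistic as (n - 1) times the sample-based variance estimate divided by the population variance.  The scores and the "known" population variance of 25 are invented for the example, and the critical value is the upper 5% point for 7 degrees of freedom as found in standard chi-square tables.

```python
from statistics import variance

# Hypothetical sample of scores and a hypothetical known population variance.
sample = [52.0, 47.0, 61.0, 55.0, 49.0, 58.0, 44.0, 50.0]
population_variance = 25.0

n = len(sample)
df = n - 1
sample_variance = variance(sample)   # unbiased estimate (divides by n - 1)

# Ratio of the sample-based estimate to the population variance,
# scaled by the degrees of freedom.
chi_square = df * sample_variance / population_variance

# Upper 5% critical value for 7 degrees of freedom (standard table value).
CRITICAL_5_PERCENT_DF7 = 14.067

print(f"sample variance estimate: {sample_variance:.3f}")
print(f"chi-square = {chi_square:.2f} with {df} degrees of freedom")
print("reject H0" if chi_square > CRITICAL_5_PERCENT_DF7 else "do not reject H0")
```

For these invented scores the statistic falls below the tabled critical value, so the sample variance would not be judged different from the assumed population variance at the 5% level.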

The F-Distribution

The statistic we will use is based on a different distribution, the F-distribution.  Fortunately, the F-distribution is based on the ratio of two chi-square statistics.  Thus, it is necessarily the case that the assumptions that apply to the chi-square statistic also apply to the F-statistic.

The chi-square random variable and its associated distribution and statistic have been explained in some detail because the F random variable, its distribution and test statistic can be defined as the ratio of two independent chi-square variables, each divided by its degrees of freedom.  The distribution of this ratio was determined by R.A. Fisher (1924) and given the name F in his honour.

Let's imagine that there are two populations of scores, each having a normal distribution.  (Don't forget that this is an essential assumption if we are to base this statistic on the chi-square statistic.)  Assume also that both populations have the same variance but not necessarily the same means.  Let us further assume that two independent random samples have been taken from the two populations.

We already know that for each random sample there is an associated chi-square statistic that compares it to a known normal population:

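For the first sample A:

$$ \chi^2_A = \frac{(n_A - 1)\,\hat{\sigma}^2_A}{\sigma^2} $$

For the second sample B:

$$ \chi^2_B = \frac{(n_B - 1)\,\hat{\sigma}^2_B}{\sigma^2} $$

where $n_A$ and $n_B$ are the sample sizes, $\hat{\sigma}^2_A$ and $\hat{\sigma}^2_B$ are the sample-based estimates of the population variance, and $\sigma^2$ is the variance of the population.

We can rewrite both these chi-square statistics to single out the sample-based population estimates:

$$ \hat{\sigma}^2_A = \frac{\chi^2_A\,\sigma^2}{n_A - 1} $$

and

$$ \hat{\sigma}^2_B = \frac{\chi^2_B\,\sigma^2}{n_B - 1} $$

If the two populations have equal variances then the common $\sigma^2$ cancels, and the ratio of the sample variances to each other is:

$$ F = \frac{\hat{\sigma}^2_A}{\hat{\sigma}^2_B} = \frac{\chi^2_A / (n_A - 1)}{\chi^2_B / (n_B - 1)} $$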

If the two populations are the same population then the F-ratio is expected to be close to 1.  If the two populations are different then the F-ratio will tend to be greater than 1, and it will increase still further the greater the difference.  The F-ratio depends, then, on knowing the sample variances for the two samples and the degrees of freedom associated with each sample, where the degrees of freedom are based on the sample sizes.
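A minimal sketch of the calculation, using two invented samples: the F-ratio is simply the ratio of the two sample-based variance estimates, and each estimate carries its own degrees of freedom (sample size minus one).  The data values are hypothetical.

```python
from statistics import variance

# Two hypothetical independent samples (illustrative values only).
sample_a = [52.0, 47.0, 61.0, 55.0, 49.0, 58.0, 44.0, 50.0]
sample_b = [51.0, 53.0, 49.0, 50.0, 54.0, 48.0, 52.0]

# Unbiased sample-based estimates of the population variance.
var_a = variance(sample_a)
var_b = variance(sample_b)

# Degrees of freedom associated with each estimate.
df_a = len(sample_a) - 1
df_b = len(sample_b) - 1

# The F-ratio: no knowledge of the population variance is needed,
# because it cancels when the two estimates are divided.
f_ratio = var_a / var_b

print(f"variance estimates: A = {var_a:.3f}, B = {var_b:.3f}")
print(f"F({df_a}, {df_b}) = {f_ratio:.3f}")
```

The resulting value would then be compared with the F-distribution for the two degrees of freedom, which is where a statistical table or package is normally used.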

Summary of Assumptions and Requirements

1)  The reason we do statistics is to make inferences from samples to populations.

2)  To do this we have to assume that the populations have certain properties so that the theoretical statistical models we adopt are appropriate for making these kinds of inferences.

3)  From the chi-square distribution two fundamental assumptions are necessary:

4)  From the F-ratio statistic there are two further fundamental assumptions:

So before we proceed with an analysis of the data we have collected we have to make sure that these assumptions have been met.