The Normal Distribution is a mathematical function that defines the distribution of
scores in a population with respect to two population parameters. The first parameter is the Greek letter
$\mu$ (mu), which represents the population mean. The
second parameter is the Greek letter $\sigma$ (sigma), which represents the population standard deviation
(the standard deviation is equal to the square root of the variance, so the
variance is represented as $\sigma^2$). A different
normal distribution is generated whenever the population mean or the
population standard deviation is different.
Hint: Most of the time that Greek letters are
used to refer to means, standard deviations and variances they are referring to
populations, not samples.
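
To make the role of the two parameters concrete, here is a minimal sketch of the normal density function. The function name `normal_pdf` and the example values (a mean of 100 with standard deviations of 10 and 20) are assumptions chosen purely for illustration.

```python
import math


def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Same mean, different standard deviations (illustrative values only):
# the wider distribution has a lower peak.
peak_narrow = normal_pdf(100.0, mu=100.0, sigma=10.0)
peak_wide = normal_pdf(100.0, mu=100.0, sigma=20.0)
print(peak_narrow, peak_wide)
```

Changing `mu` slides the curve along the score axis, while changing `sigma` stretches or squeezes it, which is why each pair of values gives a different normal distribution.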

Above:
ANOVA assumes that the populations tested have the same variance. ANOVA tests
whether the central tendency (mean) differs between samples (middle
diagram).
The
normal distribution is used in statistical analysis in order to make
standardised comparisons across different populations (treatments). The kinds of parametric statistical
techniques we use assume that a population is normally distributed. This means that we can compare directly
between two populations.
The
major underlying conceptual premise is that the area under the normal
distribution represents the proportion of scores that are within a particular
range. Not only that, but it is
possible to calculate exactly what proportion of scores fall within a range by
standardising the distribution.
Effectively, we can compare two or more populations by discovering how
much overlap in area there is between the two normal distributions that
describe those populations.
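
As a rough sketch of what "standardising" buys us, the snippet below converts the ends of a range into z-scores and reads the proportion of scores off the standard normal curve. It uses `NormalDist` from Python's standard `statistics` module. The helper name `proportion_between` and the example population (mean 100, standard deviation 15) are assumptions for illustration.

```python
from statistics import NormalDist


def proportion_between(low, high, mu, sigma):
    """Proportion of a normal population (mean mu, SD sigma) scoring between low and high."""
    standard = NormalDist(0.0, 1.0)   # the standardised (z) distribution
    z_low = (low - mu) / sigma        # standardise the lower bound
    z_high = (high - mu) / sigma      # standardise the upper bound
    return standard.cdf(z_high) - standard.cdf(z_low)


# Scores within one standard deviation of the mean: about 0.68 of the population.
within_one_sd = proportion_between(85.0, 115.0, mu=100.0, sigma=15.0)
print(round(within_one_sd, 4))
```

Because the calculation happens on the z scale, the same function works for any normal population, whatever its mean and standard deviation.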
If
the shared area is large there is a very small probability that the two
populations are different. If the
shared area is small there is a much greater probability that the two
populations are different. In
other words we can directly reason about the null hypothesis and what it might
mean by comparing two normal distributions. The null hypothesis states that the two normal distributions
have the same mean. So, if the two
means are the same and we assume that both the populations are normally
distributed, and we further assume that the standard deviations are the same,
then the shared area under each of the population distribution curves will be
constituted by all the area under the curves. If the null hypothesis is false, the shared area will be
much smaller. It is mathematically
impossible for the shared area ever to equal zero, but we can make intelligent
decisions on the basis of how much area is shared.
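
The argument about shared area can be sketched numerically. For two normal curves with the same standard deviation, the curves cross midway between the means, so the overlap works out to $2\,\Phi(-|\mu_A - \mu_B| / 2\sigma)$, where $\Phi$ is the standard normal cumulative distribution. The function name `shared_area` and the example means are assumptions for illustration.

```python
from statistics import NormalDist


def shared_area(mu_a, mu_b, sigma):
    """Area common to two normal curves that have the same standard deviation."""
    # The curves cross halfway between the two means; each tail beyond the
    # crossing point contributes the same amount to the overlap.
    half_gap_in_sd = abs(mu_a - mu_b) / (2.0 * sigma)
    return 2.0 * NormalDist().cdf(-half_gap_in_sd)


# Illustrative populations with SD 15 and increasingly different means.
for mu_b in (100.0, 115.0, 145.0, 250.0):
    print(mu_b, shared_area(100.0, mu_b, 15.0))
```

With equal means the shared area is the whole area under the curve. As the means move apart it shrinks towards zero without ever mathematically reaching it, although for very distant means the computed value becomes too small to distinguish from zero in floating point.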
So,
when we use certain parametric statistical techniques we are effectively using
the normal distribution as a means of standardising the comparisons, such that
we can associate particular probabilities to the differences between
populations and either accept or reject the null hypothesis.
Note that not all
distributions are normal, symmetrical functions. Often data are "positively" or
"negatively" skewed. A positively skewed distribution has a dense concentration
at low values with a few outliers with very large values. This shape is
commonly found when we test response times (e.g. button presses), where there
is a constraint on the minimum time for a response (the motor time needed to make a button
response), but little limit on the maximum time to make a response. One hint is
that a positively skewed distribution looks like a "P" on its side. We will discuss skewed data more in a later
example. However, here are two graphical examples of distributions that have
different shapes:

The example on the left
shows two different distributions with the same mean: one distribution is
normal and the other is positively skewed. The image on the right also shows
two different distributions with the same means: in this case one distribution
is positively skewed and the other distribution is negatively skewed.
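
One way to put a number on skew is the moment-based skewness coefficient (third central moment divided by the 1.5th power of the second). This is only one of several common definitions, and the response times below are made-up values for illustration, not real data.

```python
from statistics import mean


def sample_skewness(values):
    """Moment-based skewness: positive for a long right tail, negative for a long left tail."""
    n = len(values)
    centre = mean(values)
    m2 = sum((v - centre) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - centre) ** 3 for v in values) / n   # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical response times in milliseconds: bunched near a floor, with one slow response.
response_times = [310, 320, 325, 330, 340, 345, 360, 380, 450, 900]
mirrored = [-t for t in response_times]   # flipping the data flips the sign of the skew

print(sample_skewness(response_times))   # positive
print(sample_skewness(mirrored))         # negative
```

A symmetrical set of scores gives a coefficient of zero, which is what we would expect, on average, from normally distributed data.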
Simply
on the basis of adopting the notion of shared variance, or area under a normal
distribution, or proportion of scores in common (all the same thing) we can see
how important it is to assume that the distributions have the same form. If they do not we cannot make direct
comparisons between the groups.
Therefore
there are three very important assumptions that are made when using parametric
statistical techniques:
We
are now going to change tack. Instead of using common sense to argue why the
distributions we look at should be normally distributed and have the same
variance and form, we will show how exactly the same assumptions follow from the
theoretical statistics that underlie the statistical tests that we can use for
comparing populations.
Having
talked about the importance of the normal distribution we will now move on to
discussing how we can use the normal distribution as a basis for making
statistical inferences about treatment populations on the basis of samples that
we have taken. This requires an
understanding of the chi-square distribution and the way in which it can be
used to estimate differences between sample and population variances on the
basis of knowledge about population means. In 1876 F. R. Helmert derived the chi-squared distribution,
but it was Karl Pearson who in 1900 first used it as a means of testing
hypotheses.
Let's
assume that there is a population of scores that are normally distributed. If we repeatedly take samples of size
one from this population and convert these data points into a standardised
form, i.e.

$$z = \frac{X - \mu}{\sigma}$$

where $X$ is the sampled score, and
we then square this number, then the distribution of the random variable $z^2$ is a chi-square distribution with 1 degree of
freedom. The squared deviation $(X - \mu)^2$ is an unbiased estimate of the population variance based on a single observation, so $z^2$ expresses that estimate relative to the
actual population variance $\sigma^2$ (the square of the standard deviation),
using knowledge of the population mean $\mu$. If two independent observations are drawn at random
from a normally distributed
population, then the sum of their squared standardised scores also has a chi-square
distribution, with 2 degrees of freedom. If $n$ independent observations are drawn at random
from a normally distributed population, then the sum of their squared standardised scores also
has a chi-square distribution, with $n$ degrees of freedom.
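
A small simulation can make this construction tangible. The sketch below draws scores from a normal population, standardises them using the known population mean and standard deviation, and sums the squares. A chi-square distribution with n degrees of freedom has a mean of n, so the simulated averages should land close to n. The population values (mean 100, standard deviation 15), the seed and the number of draws are arbitrary assumptions.

```python
import random


def sum_of_squared_z(n, mu, sigma, rng):
    """Draw n independent scores from a normal population and sum their squared z-scores."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        z = (x - mu) / sigma      # standardise with the known population values
        total += z * z
    return total


def simulated_mean(n, draws=20000, seed=1):
    """Average of many simulated chi-square values with n degrees of freedom."""
    rng = random.Random(seed)     # fixed seed so the run is repeatable
    values = [sum_of_squared_z(n, 100.0, 15.0, rng) for _ in range(draws)]
    return sum(values) / draws


for n in (1, 2, 5):
    print(n, simulated_mean(n))   # each average should be close to n
```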
Thus,
there is a family of chi-squared distributions that are related to each other
according to the associated degrees of freedom. If we knew everything there was to know about a
population we could generate the appropriate chi-square distribution and use
this as a description of our data.
However, we usually cannot exhaustively sample from a population so the
question is what happens when we substitute the values we have obtained from
our sample and compare it to what other research has stated about the
population?
Basically,
if we treat a sample as $s$ independent observations drawn at random
from a normally distributed population, where $s$ represents the number of subjects in the sample,
then we obtain a chi-square
distribution with $s-1$ degrees of
freedom. One degree of freedom is lost because the sample mean has to stand in for the unknown population mean. In other words, a
sample, if randomly taken from a population, also generates a distribution that
is based on the chi-square distribution. Given this, we can directly compare the sample-based
estimate of the population variance with our knowledge of the actual population
variance.
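
The loss of one degree of freedom can be checked with a simulation sketch. Here each sample's variance is estimated around its own sample mean (which is what `statistics.variance` does), then scaled by the degrees of freedom and divided by the known population variance. The average of that quantity should be close to the sample size minus one rather than the sample size. The population values, sample size, seed and number of draws are assumptions for illustration.

```python
import random
from statistics import variance


def scaled_sample_variance(size, mu, sigma, rng):
    """(size - 1) * sample variance / population variance for one random sample."""
    sample = [rng.gauss(mu, sigma) for _ in range(size)]
    # variance() measures spread around the sample mean and divides by size - 1.
    return (size - 1) * variance(sample) / sigma ** 2


def average_statistic(size, draws=5000, seed=7):
    """Average of the scaled variance over many simulated samples."""
    rng = random.Random(seed)     # fixed seed so the run is repeatable
    total = sum(scaled_sample_variance(size, 50.0, 10.0, rng) for _ in range(draws))
    return total / draws


print(average_statistic(6))   # close to 5, i.e. 6 - 1 degrees of freedom
```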
The
ratio of the sample-based estimate of the population variance to the actual
population variance can range from 0 to positive infinity, and will be close to 1 when the sample really does come from that population. Since we can derive the probability of
obtaining a given ratio when the sample is taken from a specific population (as the
null hypothesis dictates), we can test the null hypothesis.
The
test statistic that is used in these circumstances is:

$$\chi^2 = \frac{(s - 1)\,\hat{\sigma}^2}{\sigma^2}$$

where $\hat{\sigma}^2$ is the estimate of the population variance due to the sample,
$\sigma^2$ is the variance of the population, and $s - 1$ is the degrees of freedom for a sample of $s$ subjects. Using the chi-square distribution we can test whether
a sample and population variance are the same or different. However, this requires us to know what
the population variance is.
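
As a minimal sketch of this calculation, the function below multiplies the sample-based variance estimate by its degrees of freedom and divides by a population variance that is assumed to be known. The function name and the five scores are hypothetical, and the sketch stops at the statistic itself. Turning it into a probability would need chi-square tables or a statistics package.

```python
from statistics import variance


def chi_square_for_variance(sample, population_variance):
    """Return (chi-square statistic, degrees of freedom) for a sample against a known variance."""
    size = len(sample)
    if size < 2:
        raise ValueError("need at least two scores to estimate a variance")
    if population_variance <= 0:
        raise ValueError("population variance must be positive")
    df = size - 1
    estimate = variance(sample)   # sample-based estimate, divides by size - 1
    return df * estimate / population_variance, df


# Hypothetical scores whose variance estimate is 10.
scores = [4, 6, 8, 10, 12]
print(chi_square_for_variance(scores, population_variance=10.0))  # (4.0, 4)
print(chi_square_for_variance(scores, population_variance=5.0))   # (8.0, 4)
```

When the estimate matches the population variance the statistic equals its degrees of freedom. The further the estimate lies above the population variance, the larger the statistic becomes.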
Unfortunately we rarely know the population variance. This means that we need to
adopt a slightly different test.
The
statistic we will use is based on a different distribution, the
F-distribution. Fortunately, the
F-distribution is based on the ratio of two chi-square statistics. Thus, it is necessarily the case that
the assumptions that apply to the chi-square statistic also apply to the
F-statistic.
The
chi-square random variable and its associated distribution and statistic have
been explained in some detail because the F random variable, its distribution
and test statistic can be defined as the ratio of two independent chi-square
variables, each divided by its degrees of freedom. The distribution of this ratio was determined by R.A. Fisher
(1924) and given the name F in his honour.
Let's
imagine that there are two populations of scores, each having a normal
distribution. (Don't forget that this is an essential assumption if we are to base
this statistic on the chi-square statistic.) Assume also that both populations have the same variance but
not necessarily the same means.
Let us further assume that two independent random samples have been
taken, one from each of the two populations.
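We
already know that for each random sample there is an associated chi-square
statistic that compares it to a known normal population. Below, $s_A$ and $s_B$ are the numbers of subjects in the two samples, $\hat{\sigma}_A^2$ and $\hat{\sigma}_B^2$ are the sample-based estimates of the population variance, and $\sigma^2$ is the variance that both populations share.
For
the first sample A:

$$\chi^2_A = \frac{(s_A - 1)\,\hat{\sigma}_A^2}{\sigma^2}$$

For
the second sample B:

$$\chi^2_B = \frac{(s_B - 1)\,\hat{\sigma}_B^2}{\sigma^2}$$

We
can rewrite both these chi-square statistics to single out the sample-based
population estimates:

$$\hat{\sigma}_A^2 = \frac{\chi^2_A\,\sigma^2}{s_A - 1}$$

and

$$\hat{\sigma}_B^2 = \frac{\chi^2_B\,\sigma^2}{s_B - 1}$$

If
the two populations have equal variances then $\sigma^2$ cancels, and the ratio of the sample variances to
each other is:

$$F = \frac{\hat{\sigma}_A^2}{\hat{\sigma}_B^2} = \frac{\chi^2_A / (s_A - 1)}{\chi^2_B / (s_B - 1)}$$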
If
the two populations are the same population then the F-ratio is expected to be close to 1. If the
two populations are different then the F-ratio will tend to be greater than 1 when the larger variance estimate is placed in the numerator, and it will increase still
further the greater the difference.
The F-ratio depends, then, on knowing the sample variances for the two samples
and the degrees of freedom associated with each sample, where the degrees of
freedom are based on the sample sizes.
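
The ingredients of the F-ratio can be sketched in a few lines: two sample variances and their degrees of freedom. The function name `f_ratio` and the two small samples are hypothetical, and sample A is simply placed in the numerator. Converting the ratio into a probability would need F tables or a statistics package, which this sketch leaves out.

```python
from statistics import variance


def f_ratio(sample_a, sample_b):
    """Return (F, df_a, df_b): the ratio of two sample variances and their degrees of freedom."""
    if len(sample_a) < 2 or len(sample_b) < 2:
        raise ValueError("each sample needs at least two scores")
    var_a = variance(sample_a)    # sample-based estimate from sample A
    var_b = variance(sample_b)    # sample-based estimate from sample B
    if var_b == 0:
        raise ValueError("sample B has no variability, so the ratio is undefined")
    return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1


# Hypothetical samples: A is much more spread out than B.
sample_a = [4, 6, 8, 10, 12]      # variance estimate 10
sample_b = [7, 8, 8, 8, 9]        # variance estimate 0.5

print(f_ratio(sample_a, sample_b))   # (20.0, 4, 4)
print(f_ratio(sample_a, sample_a))   # (1.0, 4, 4)
```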
1) The reason we do statistics is to make
inferences from samples to populations.
2) To do this we have to assume that the
populations have certain properties so that the theoretical statistical models
we adopt are appropriate for making these kinds of inferences.
3) From the chi-square distribution two
fundamental assumptions are necessary:
4) From the F-ratio statistic there are
two further fundamental assumptions:
So before we proceed with an
analysis of the data we have collected we have to make sure that these
assumptions have been met.