APPENDIX D
MORE ABOUT STATISTICS
In this Appendix we will discuss, in additional detail, some statistical topics
including error, sample size, the Student's "t" distribution, the
"F" distribution, and confidence intervals.
STATISTICAL INFERENCE
In statistical inference we make generalizations based on samples, and,
traditionally, such inferences have been divided into problems of estimation
and hypothesis testing. In estimation we assign a numerical value to a
population parameter on the basis of sample data. In any CER, we are attempting
to predict the population behavior(s) from a sample. We need to ask ourselves
the question, "How good is our estimation?" When we test a
hypothesis, we accept or reject assumptions concerning the parameters or the
form of a population. When we use a CER, we need to be confident of the CER's
ability to predict the future; without this confidence, the CER should not be used.
Z-scores
In general, if X is a measurement belonging to a set of data having the mean
$\bar{x}$ (or $\mu$ for a population) and the standard deviation $s$ (or
$\sigma$ for a population), then its value in standard units, denoted by Z, is

$$Z = \frac{X - \bar{x}}{s} \quad \text{or} \quad Z = \frac{X - \mu}{\sigma},$$

depending on whether the data constitute a sample or a
population. In these units, Z tells us how many standard deviations a value
lies above or below the mean of the set of data to which it belongs.
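As a small illustration of converting a measurement to standard units, the sketch below uses Python's standard `statistics` module. The function name `z_score` and the data values are made up for this example and are not part of the original text.

```python
from statistics import mean, stdev, pstdev

def z_score(x, data, population=False):
    """Return x in standard units relative to data.

    Uses the sample standard deviation s by default, or the population
    standard deviation when population=True.
    """
    center = mean(data)
    spread = pstdev(data) if population else stdev(data)
    return (x - center) / spread

# Hypothetical measurements, invented for illustration only.
measurements = [12.0, 15.0, 9.0, 14.0, 10.0]

z = z_score(15.0, measurements)
print(f"Z = {z:.3f}")  # positive: 15.0 lies above the mean of 12.0
```

A positive Z means the value lies above the mean, a negative Z means it lies below, and a value equal to the mean gives Z = 0.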
Error
An estimate is generally called a point estimate, since it consists of a single
number, or a single point on the real number scale. Although this is the most
common way in which estimates are expressed, it leaves room for many questions.
For instance, it does not tell us on how much information the estimate is
based, and it does not tell us anything about the possible size of the error.
And, of course, we must expect an error. An estimate's reliability depends upon
two things: the size of the sample and the size of the population standard
deviation, $\sigma$. Any statistics textbook will show that the
error term is

$$E = Z(\alpha/2) \cdot \frac{\sigma}{\sqrt{n}},$$

where $Z(\alpha/2)$ represents the number of standard deviations from the mean
by which we are willing to allow our estimate to be "off" either way, with
probability $1 - \alpha$. This result applies when n is large and the population is
infinite. The two values which are most commonly used for $1 - \alpha$
are 0.95 and 0.99, with corresponding Z scores Z(0.025) = 1.96 (standard
deviations) for $1 - \alpha$ = 0.95 and Z(0.005) = 2.575 for
$1 - \alpha$ = 0.99, respectively.
There is one complication with this result. To be able to judge the size of the error
we might make when we use $\bar{x}$ as an estimate of $\mu$, we must know the value
of the population standard deviation, $\sigma$. Since this is not the case in most
practical situations, we have no choice but to replace $\sigma$ with an estimate,
usually the sample standard deviation, s. In general, this is considered to be reasonable
provided the sample is sufficiently large (n ≥ 30).
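The error term can be evaluated numerically. The following is a minimal sketch, assuming the large-sample result described above; the function name `margin_of_error` and the input values are hypothetical. It obtains the Z value from `statistics.NormalDist` rather than from a table, so it uses 1.95996 and 2.57583 where the text quotes the rounded values 1.96 and 2.575.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Maximum error E = Z(alpha/2) * sigma / sqrt(n) for a large sample."""
    alpha = 1.0 - confidence
    # Z(alpha/2): the standard normal value with area alpha/2 to its right.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * sigma / sqrt(n)

# Hypothetical inputs: sigma (or s, when n >= 30) of 12.0 and a sample of 64.
E = margin_of_error(sigma=12.0, n=64, confidence=0.95)
print(f"E = {E:.2f}")
```

Quadrupling the sample size halves E, since n enters only through its square root.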
Sample Size
The formula for E can also be used to determine the sample size that is needed to
attain a desired degree of precision. Suppose that we want to use the mean of a
large random sample to estimate the mean of a population, and we want to be
able to assert with probability $1 - \alpha$ that the error of this estimate will be less
than some prescribed quantity E. Solving the previous equation for n, we get

$$n = \left[ \frac{Z(\alpha/2) \cdot \sigma}{E} \right]^2$$
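The sample-size calculation can be sketched the same way. The function name `required_sample_size` and the inputs are assumptions made for this illustration. The result is rounded up to the next whole number, because a fractional observation is not possible and rounding down would not meet the prescribed error.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, max_error, confidence=0.95):
    """Smallest n for which Z(alpha/2) * sigma / sqrt(n) <= max_error."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return ceil((z * sigma / max_error) ** 2)

# Hypothetical inputs: sigma = 12.0, and we want the error to be at most 2.0.
n = required_sample_size(sigma=12.0, max_error=2.0, confidence=0.95)
print(n)
```

Note that the calculation needs a value for the population standard deviation before any data are collected, so in practice an estimate from earlier or similar studies has to be supplied.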
Confidence Intervals
For large random samples from infinite populations, the sampling distribution of
the mean is approximately normal with the mean $\mu$ and the standard deviation
$\sigma/\sqrt{n}$, namely, that

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

is a random variable having approximately the standard
normal distribution. Since the probability is $1 - \alpha$
that a random variable having the standard
normal distribution will take on a value between $-Z(\alpha/2)$ and $Z(\alpha/2)$,
namely, that $-Z(\alpha/2) < Z < Z(\alpha/2)$, we can substitute into
this inequality the foregoing expression for Z and it yields

$$-Z(\alpha/2) < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < Z(\alpha/2)$$

Using some algebraic manipulation, we get

$$\bar{x} - Z(\alpha/2) \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + Z(\alpha/2) \cdot \frac{\sigma}{\sqrt{n}}$$

and we can assert with probability $1 - \alpha$ that it will be satisfied
for any given sample. In other words, we can assert with $100(1 - \alpha)$%
confidence that the interval above, determined on the basis of a large random
sample, contains the population mean we are trying to estimate. When $\sigma$ is
unknown and n is at least 30, we replace $\sigma$ by the sample standard deviation, s.
An interval such as this is called a confidence interval, its endpoints are called
confidence limits, and the probability $1 - \alpha$
is called the degree of confidence. Again, the
values most commonly used for $1 - \alpha$
are 0.95 and 0.99, the corresponding values of $Z(\alpha/2)$
are 1.96 and 2.575, and
the resulting confidence intervals are referred to as 95% and 99% confidence
intervals for $\mu$.
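A large-sample confidence interval for the mean can be sketched as follows. This is an illustration only: the function name `large_sample_ci` and the data are invented, and the sample standard deviation s stands in for the unknown population standard deviation, which the text allows when n is at least 30.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def large_sample_ci(data, confidence=0.95):
    """Return (lower, upper) confidence limits for the population mean."""
    n = len(data)
    if n < 30:
        raise ValueError("large-sample interval assumes n >= 30")
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation used in place of sigma
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical sample of 35 measurements, invented for illustration.
data = [50.0 + (i % 7) for i in range(35)]

low95, high95 = large_sample_ci(data, confidence=0.95)
low99, high99 = large_sample_ci(data, confidence=0.99)
print(f"95%: ({low95:.2f}, {high95:.2f})")
print(f"99%: ({low99:.2f}, {high99:.2f})")
```

The 99% interval is wider than the 95% interval for the same data: a higher degree of confidence is bought with a less precise statement.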
Confidence Intervals for Means (Small Samples)
To develop corresponding theory which applies also to small samples, it will be
necessary to assume that the population we are sampling has roughly the shape
of a normal distribution. We can then base our methods on the statistic

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}},$$

whose sampling distribution is a continuous distribution
called the t distribution. This distribution is symmetrical
and bell-shaped with zero mean. The exact shape of the t distribution depends
on a parameter called the number of degrees of freedom, given by n-1, the
sample size less one. For the t distribution we define $t(\alpha/2)$ in the same
way in which $Z(\alpha/2)$ was defined.
However, $t(\alpha/2)$ depends on n-1
(degrees of freedom) and its value must be looked up in a table of values. In
the same way as before we can arrive at the following small sample confidence
interval for $\mu$:

$$\bar{x} - t(\alpha/2) \cdot \frac{s}{\sqrt{n}} < \mu < \bar{x} + t(\alpha/2) \cdot \frac{s}{\sqrt{n}}$$
The degree of confidence is $1 - \alpha$,
and the only difference between this formula and the large
sample formula is that $t(\alpha/2)$ takes
the place of $Z(\alpha/2)$.
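The small-sample interval can be sketched in the same style. Python's standard library has no t distribution, so this illustration takes the critical value as an argument, looked up in a table as the text describes. The function name `small_sample_ci` and the data are hypothetical; the value 2.262 is the standard table entry for t(0.025) with 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def small_sample_ci(data, t_crit):
    """Return (lower, upper) limits: x_bar -/+ t_crit * s / sqrt(n).

    t_crit is t(alpha/2) for n - 1 degrees of freedom, taken from a table.
    Assumes the population is roughly normal.
    """
    n = len(data)
    x_bar = mean(data)
    half_width = t_crit * stdev(data) / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical sample of n = 10 measurements, so n - 1 = 9 degrees of freedom.
sample = [9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0, 10.2, 10.4]
T_0025_DF9 = 2.262  # table value of t(0.025) for 9 degrees of freedom

low, high = small_sample_ci(sample, T_0025_DF9)
print(f"95% interval: ({low:.3f}, {high:.3f})")
```

Because 2.262 is larger than the corresponding Z value of 1.96, the small-sample interval is wider than a normal-based interval would be for the same data.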
Analysis of Variance and the F Statistic
The F statistic is a statistic for a test concerning the differences among means.
It is defined as:

$$F = \frac{\text{estimate of } \sigma^2 \text{ based on the variation among the } \bar{x}\text{'s}}{\text{estimate of } \sigma^2 \text{ based on the variation within the samples}}$$

and is called a variance ratio. The F distribution is a
theoretical distribution which depends on two parameters called the numerator
and denominator degrees of freedom. When the F statistic is used to compare the
means of k samples of size n, the numerator and denominator degrees of freedom
are, respectively, k-1 and k(n-1).
This is a simple form of an analysis of variance. The basic idea of an analysis of
variance is to express a measure of the total variation of a set of data as a
sum of terms, which can be attributed to specific sources, or causes of
variation. Two such sources of variation could be (1) actual differences and
(2) chance differences. As a measure of the total variation of a set of observations
consisting of k samples of size n, we use the total sum of squares,

$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n} \left( X_{ij} - \bar{x}_{\cdot\cdot} \right)^2,$$

where $X_{ij}$ is the jth observation of the ith sample,
i = 1, 2, ... , k, and j = 1, 2, ... , n, and
$\bar{x}_{\cdot\cdot}$, the mean of all
the kn measurements or observations, is called the grand mean. If we divide the
total sum of squares by kn-1, we get the variance of the combined data.
Letting $\bar{x}_{i\cdot}$ denote the
mean of the ith sample, i =
1, 2, ..., k, we can write the following identity:

$$SST = n \sum_{i=1}^{k} \left( \bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot} \right)^2 + \sum_{i=1}^{k} \sum_{j=1}^{n} \left( X_{ij} - \bar{x}_{i\cdot} \right)^2$$
Looking closely at the two terms into which the total sum of squares SST has been
partitioned, we find that the first term is a measure of the variation among
the means. Similarly, the second term is a measure of the variation within the
individual samples, or chance variation. Dividing the first term by k-1 and the
second by k(n-1), we get the numerator and the denominator of the F statistic
as defined above. The first term is often referred to as the treatment sum of
squares, SS(Tr), and the second term as the error sum of squares, SSE, which
measures experimental error, or chance.
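The partition of the total sum of squares and the resulting F statistic can be checked numerically. The sketch below assumes k samples of equal size n, as in the text; the function name `one_way_anova`, the dictionary keys, and the data are all invented for this illustration.

```python
from statistics import mean

def one_way_anova(samples):
    """Partition the total sum of squares for k samples of equal size n."""
    k = len(samples)
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same size n")

    grand_mean = mean([x for s in samples for x in s])
    sample_means = [mean(s) for s in samples]

    # Variation among the sample means (treatment sum of squares).
    ss_treatment = n * sum((m - grand_mean) ** 2 for m in sample_means)
    # Variation within the individual samples (error sum of squares).
    ss_error = sum((x - m) ** 2
                   for s, m in zip(samples, sample_means) for x in s)
    # Total variation about the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for s in samples for x in s)

    df_num, df_den = k - 1, k * (n - 1)
    f_stat = (ss_treatment / df_num) / (ss_error / df_den)
    return {"ss_total": ss_total, "ss_treatment": ss_treatment,
            "ss_error": ss_error, "df": (df_num, df_den), "F": f_stat}

# Hypothetical data: k = 3 samples of size n = 3.
result = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(result)
```

For these made-up data the treatment and error sums of squares add up to the total sum of squares, as the identity requires. The computed F would then be compared with a tabulated F value for the given numerator and denominator degrees of freedom.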
Refer to a statistics textbook for a further explanation of these and other
statistical subjects.