STATISTICAL_HYPOTHESIS_TESTING

One may be faced with the problem of making a definite decision with respect to an uncertain hypothesis which is known only through its observable consequences.
A 'statistical hypothesis test', or more briefly, ''hypothesis test'', is an algorithm to state the alternative (for or against the hypothesis) which minimizes certain risks.
This article describes the commonly used frequentist treatment of hypothesis testing.
From the Bayesian point of view, it is appropriate to treat hypothesis testing as a special case of normative decision theory (specifically a model selection problem), and it is possible to accumulate evidence in favor of (or against) a hypothesis using likelihood ratios known as Bayes factors.
There are several preparations we make before we observe the data.
#The hypothesis must be stated in mathematical/statistical terms that make it possible to calculate the probability of possible samples assuming the hypothesis is correct. For example: ''The mean response to treatment being tested is equal to the mean response to the placebo in the control group. Both responses have the normal distribution with this unknown mean and the same known standard deviation ... (value).''
#A test statistic must be chosen that will summarize the information in the sample that is relevant to the hypothesis. Ideally it is a sufficient statistic, that is, one that retains all the information in the sample about the parameter being tested. In the example given above, it might be the numerical difference between the two sample means, ''m''₁ − ''m''₂.
#The distribution of the test statistic is used to calculate the probability of sets of possible values (usually an interval or union of intervals). In this example, the difference between sample means would have a normal distribution with a standard deviation equal to the ''common standard deviation'' times the factor \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, where ''n''₁ and ''n''₂ are the sample sizes.
#Among all the sets of possible values, we must choose one that we think represents the most extreme evidence ''against'' the hypothesis. That is called the 'critical region' of the test statistic. The probability of the test statistic falling in the critical region when the hypothesis is correct is called the ''alpha'' value (or ''size'') of the test.
#The probability that a sample falls in the critical region when the parameter is θ, where θ is a value under the alternative hypothesis, is called the ''power'' of the test at θ. The ''power function'' of a critical region is the function that maps θ to the power at θ.
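The preparations above can be sketched for the treatment/placebo example. This is a minimal illustration, assuming a two-sided critical region of the form |''m''₁ − ''m''₂| > ''c'' and made-up values for the standard deviation, sample sizes and ''alpha''; the function names are hypothetical and not part of any standard procedure.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(sigma, n1, n2):
    """Standard deviation of m1 - m2 when both groups share a known sigma."""
    return sigma * sqrt(1 / n1 + 1 / n2)


def critical_value(sigma, n1, n2, alpha):
    """Cutoff c such that P(|m1 - m2| > c) = alpha when the means are equal."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_error(sigma, n1, n2)


def power(theta, sigma, n1, n2, alpha):
    """Probability that m1 - m2 falls in the critical region when the
    true difference in means is theta."""
    se = standard_error(sigma, n1, n2)
    c = critical_value(sigma, n1, n2, alpha)
    diff = NormalDist(mu=theta, sigma=se)  # distribution of m1 - m2
    return diff.cdf(-c) + (1 - diff.cdf(c))


# Illustrative numbers only (assumptions, not taken from any study).
sigma, n1, n2, alpha = 10.0, 50, 50, 0.05
size = power(0.0, sigma, n1, n2, alpha)        # power at theta = 0 is the size
power_at_5 = power(5.0, sigma, n1, n2, alpha)  # power at theta = 5

print(f"standard error = {standard_error(sigma, n1, n2):.3f}")
print(f"size           = {size:.3f}")
print(f"power at 5     = {power_at_5:.3f}")
```

Evaluating the power function at θ = 0 recovers the size of the test, which is one way to check that the critical region was built for the intended ''alpha''.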
After the data is available, the test statistic is calculated and we determine whether it is inside the critical region.
If the test statistic is inside the critical region, then our conclusion is one of the following:
#The hypothesis is incorrect, therefore reject the null hypothesis. (Therefore the critical region is sometimes called the 'rejection region', while its complement is the 'acceptance region'.)
#An event of probability less than or equal to ''alpha'' has occurred.
The researcher has to choose between these logical alternatives.
In the example we would say: the observed response to treatment is statistically significant.
If the test statistic is outside the critical region, the only conclusion is that

★ ''There is not enough evidence to reject the hypothesis.''
This is ''not'' the same as evidence in favor of the hypothesis. That we cannot obtain using these arguments, since lack of evidence against a hypothesis is not evidence for it. On this basis, statistical research progresses by eliminating error, not by ''finding the truth''.
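The decision step can be sketched as follows. The sketch assumes the two-sample setting of the running example with a known common standard deviation and a two-sided critical region; the data, the value of sigma and the function names are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_and_decision(treatment, placebo, sigma, alpha=0.05):
    """Return (z, rejected) for a two-sided test of equal means
    when the common standard deviation sigma is known."""
    se = sigma * sqrt(1 / len(treatment) + 1 / len(placebo))
    z = (mean(treatment) - mean(placebo)) / se
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)
    return z, abs(z) > cutoff  # True means z is inside the critical region


def conclusion(rejected):
    """Word the outcome the way the text does."""
    if rejected:
        return "reject the hypothesis (or a rare event has occurred)"
    return "there is not enough evidence to reject the hypothesis"


# Made-up responses; sigma = 2 is assumed known.
treatment = [12, 15, 14, 16, 13]
placebo_a = [10, 11, 9, 12, 10]
placebo_b = [13, 14, 15, 13, 14]

z_a, rejected_a = z_and_decision(treatment, placebo_a, sigma=2.0)
z_b, rejected_b = z_and_decision(treatment, placebo_b, sigma=2.0)
print(f"z = {z_a:.2f}: {conclusion(rejected_a)}")
print(f"z = {z_b:.2f}: {conclusion(rejected_b)}")
```

Note that the second outcome is worded as a failure to reject, not as acceptance of the hypothesis.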

Common test statistics


{| border=1 cellspacing=0 cellpadding=5
|Name
|Formula
|Assumptions
|-
|One-sample z-test
|z=\frac{\overline{x}-\mu_0}{\frac{\sigma}{\sqrt{n}}}
|(Normal distribution 'or' ''n'' > 30) 'and' σ known.

(''z'' is the distance from the mean in standard deviations. It is possible to calculate a minimum proportion of a population that falls within ''k'' standard deviations; see Chebyshev's inequality.)
|-
|Two-sample z-test
|z=\frac{(\overline{x}_1 - \overline{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}
|Normal distribution 'and' independent observations 'and' (σ₁ 'and' σ₂ known)
|-
|One-sample t-test
|t=\frac{\overline{x}-\mu_0}{\frac{s}{\sqrt{n}}},

df=n-1
|(Normal population 'or' ''n'' > 30) 'and' σ unknown
|-
|Two-sample pooled t-test
|t=\frac{(\overline{x}_1 - \overline{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},

s_p^2=\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},

df=n_1 + n_2 - 2
|(Normal populations 'or' ''n''₁ + ''n''₂ > 40) 'and' independent observations 'and' σ₁ = σ₂ 'and' (σ₁ 'and' σ₂ unknown)
|-
|Two-sample unpooled t-test
|t=\frac{(\overline{x}_1 - \overline{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}},

df=\frac{(n_1 - 1)(n_2 - 1)}{(n_2 - 1)c^2 + (n_1 - 1)(1 - c)^2},

c=\frac{\frac{s_1^2}{n_1}}{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

'or' df=\min\{n_1,n_2\} - 1
|(Normal populations 'or' ''n''₁ + ''n''₂ > 40) 'and' independent observations 'and' σ₁ ≠ σ₂ 'and' (σ₁ 'and' σ₂ unknown)
|-
|Paired t-test
|t=\frac{\overline{d}-d_0}{\frac{s_d}{\sqrt{n}}},

df=n-1
|(Normal population of differences 'or' ''n'' > 30) 'and' σ unknown
|-
|One-proportion z-test
|z=\frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}
|''np'' > 10 'and' ''n''(1 − ''p'') > 10
|-
|Two-proportion z-test, equal variances
|z=\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}(1 - \hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}},

\hat{p}=\frac{x_1 + x_2}{n_1 + n_2}
|''n''₁''p''₁ > 5 'and' ''n''₁(1 − ''p''₁) > 5 'and' ''n''₂''p''₂ > 5 'and' ''n''₂(1 − ''p''₂) > 5 'and' independent observations
|-
|Two-proportion z-test, unequal variances
|z=\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}
|''n''₁''p''₁ > 5 'and' ''n''₁(1 − ''p''₁) > 5 'and' ''n''₂''p''₂ > 5 'and' ''n''₂(1 − ''p''₂) > 5 'and' independent observations
|}
The statistics for some other tests have their own page on Wikipedia, including the Wald test and the likelihood ratio test.
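Three of the ''t'' statistics in the table can be computed directly from raw samples. The following is a sketch using only the Python standard library; the function names and sample data are illustrative assumptions, and the unpooled degrees of freedom use the Welch-Satterthwaite form, in which the second term of the denominator is (1 − ''c'')². Turning a statistic into a p-value additionally requires the ''t'' distribution, which is not covered here.

```python
from math import sqrt
from statistics import mean, variance  # variance() is the sample variance s^2


def one_sample_t(x, mu0):
    """One-sample t statistic and its degrees of freedom."""
    n = len(x)
    t = (mean(x) - mu0) / sqrt(variance(x) / n)
    return t, n - 1


def pooled_t(x1, x2, delta0=0.0):
    """Two-sample pooled t statistic; delta0 is the hypothesized mu1 - mu2."""
    n1, n2 = len(x1), len(x2)
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    t = (mean(x1) - mean(x2) - delta0) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def unpooled_t(x1, x2, delta0=0.0):
    """Two-sample unpooled t statistic with Welch-Satterthwaite df."""
    n1, n2 = len(x1), len(x2)
    a, b = variance(x1) / n1, variance(x2) / n2
    t = (mean(x1) - mean(x2) - delta0) / sqrt(a + b)
    c = a / (a + b)
    df = (n1 - 1) * (n2 - 1) / ((n2 - 1) * c**2 + (n1 - 1) * (1 - c) ** 2)
    return t, df


# Made-up measurements for illustration.
x1 = [5.1, 4.9, 5.6, 5.8, 6.0]
x2 = [4.1, 4.5, 4.4, 4.9, 4.6]

print(one_sample_t(x1, mu0=5.0))
print(pooled_t(x1, x2))
print(unpooled_t(x1, x2))
```

With equal sample sizes the pooled and unpooled statistics coincide and only the degrees of freedom differ, which makes a convenient sanity check.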

Criticism


Some statisticians have commented that pure "significance testing" has what is actually a rather strange goal of detecting the existence of a "real" difference between two populations. In practice a difference can almost always be found given a large enough sample; what is typically the more relevant goal of science is a determination of causal effect size. The amount and nature of the difference, in other words, is what should be studied. Many researchers also feel that hypothesis testing is something of a misnomer: in practice a single statistical test in a single study never "proves" anything.
"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is almost always false in the real world.... If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal about rejecting it?" (Cohen 1990)
(The above criticism only applies to point hypothesis tests. If one were testing, for example, whether a parameter is greater than zero, it would not apply.)
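The dependence on sample size can be checked numerically. The sketch below uses a one-sample ''z''-test of a point hypothesis and holds the observed mean fixed at a tiny distance from the hypothesized value while the sample size grows; the numbers and the function name are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(xbar, mu0, sigma, n):
    """Two-sided p-value of a one-sample z-test with known sigma."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A difference of 0.01 standard deviations, observed at three sample sizes.
sample_sizes = (100, 10_000, 250_000)
p_values = [two_sided_p_value(xbar=0.01, mu0=0.0, sigma=1.0, n=n)
            for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n = {n:>7}: p = {p:.2e}")
```

The same small difference moves from clearly non-significant to highly significant purely because ''n'' increases.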
Yet Cohen's argument misses the point. Of course, if the null hypothesis is false, even slightly, then at any significance level a large enough sample would lead us to reject it. But the purpose of significance testing is not to establish the truth or falsity of claims. The purpose of such tests is to show how confident we can be in a claim that is based solely on a sample-based estimate. It is, in effect, a reflection of the reliability of our sample design and estimation strategy. It is not concerned with some impractical or costly sample design that would yield a near-perfect estimate, but with practical and cost-effective designs that can be carried out by real people in the real world.
People do not care what they could do if the world were perfect. They want to know what can be said given what is known now, with limited time, money, and resources, and given that even a census of the population would leave plenty of sources of error. We have an imperfect, incomplete set of data, and we want to make useful claims that allow us to progress. We want actionable intelligence. Hypothesis tests in effect define minimum thresholds that, if met, allow us to be reasonably confident that our claims are true or false. They do not allow us to prove that they are true or false.
One should not get hung up on hypothesis tests per se, but try to think more in terms of sampling error. There is some true value out there, unknown to us, and a (theoretical) distribution of all possible sample estimates (hopefully tightly clustered around the true value; otherwise, what good are they?). We want to know how likely it is for any given sample estimate to be unacceptably far from the true value. If it is highly unlikely, then we are confident that for the vast majority of samples our estimate has acceptable error, and hence it is highly likely that the single sample we actually took yields an estimate with acceptable error. Note, of course, that "acceptable error" is a matter of judgment. One person's acceptable error is another person's life-ending tragedy.
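The probability of an unacceptably large error can be made concrete for a sample mean. This sketch assumes the sample mean is normally distributed around the true value with a known population standard deviation; the tolerance, the standard deviation and the function name are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def prob_error_exceeds(tolerance, sigma, n):
    """P(|sample mean - true mean| > tolerance) for a sample of size n,
    assuming a normal sampling distribution with known sigma."""
    se = sigma / sqrt(n)
    return 2 * (1 - NormalDist().cdf(tolerance / se))


# How often is the estimate off by more than 3 units when sigma = 15?
p25 = prob_error_exceeds(tolerance=3.0, sigma=15.0, n=25)
p100 = prob_error_exceeds(tolerance=3.0, sigma=15.0, n=100)
p400 = prob_error_exceeds(tolerance=3.0, sigma=15.0, n=400)

for n, p in ((25, p25), (100, p100), (400, p400)):
    print(f"n = {n:>3}: P(error > 3) = {p:.4f}")
```

Which of these probabilities counts as small enough depends on what error is acceptable for the decision at hand.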
"... surely, God loves the .06 nearly as much as the .05." (Rosnow and Rosenthal 1989)
"How has the virtually barren technique of hypothesis testing come to assume such importance in the process by which we arrive at our conclusions from our data?" (Loftus 1991)
"Despite the stranglehold that hypothesis testing has on experimental psychology, I find it difficult to imagine a less insightful means of transiting from data to conclusions." (Loftus 1991)
Even when the null hypothesis is rejected, effect sizes should be taken into consideration. If the effect is statistically significant but the effect size is very small, then it is a stretch to consider the effect theoretically important.
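The gap between statistical significance and effect size can be illustrated with summary statistics. The sketch below uses Cohen's ''d'', the difference in means divided by the pooled standard deviation, as one common standardized effect size; that choice, the numbers and the function names are assumptions made for illustration.

```python
from math import sqrt


def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two samples."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


def cohens_d(m1, s1, n1, m2, s2, n2):
    """Standardized difference in means (Cohen's d)."""
    return (m1 - m2) / pooled_sd(s1, n1, s2, n2)


def pooled_t_statistic(m1, s1, n1, m2, s2, n2):
    """Two-sample pooled t statistic for the hypothesis mu1 = mu2."""
    return (m1 - m2) / (pooled_sd(s1, n1, s2, n2) * sqrt(1 / n1 + 1 / n2))


# Made-up summary statistics: two very large groups, half a unit apart.
summary = dict(m1=100.5, s1=15.0, n1=10_000, m2=100.0, s2=15.0, n2=10_000)
d = cohens_d(**summary)
t = pooled_t_statistic(**summary)

print(f"t = {t:.2f}")  # beyond 1.96, the two-sided 5% cutoff for large samples
print(f"d = {d:.3f}")  # about one thirtieth of a standard deviation
```

Here the test statistic clears the usual 5% cutoff while the groups differ by only a few hundredths of a standard deviation.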

See also




Comparing means test decision tree

Counternull

Multiple comparisons

Omnibus test

Behrens-Fisher problem

Bootstrapping (statistics)

Checking if a coin is fair

Falsifiability

Fisher's method for combining independent tests of significance

Null hypothesis

P-value

Statistical theory

Statistical significance

Type I error, Type II error

This article provided by Wikipedia.