Statistical Tests | What, when, how?

Demystifying hypothesis testing - Blog | luminousmen

Inferential Statistics

This term refers to the set of statistical tests that researchers and data scientists use to make inferences about the data they are working with. The tests give them a certain level of confidence about whether an observed pattern in the data is due to the intervention or just by chance! Which test can be used in a given situation depends a lot on the distribution of the data, the type of variable under test, and the overall research design.

There are a lot of statistical tests out there, but I'll focus on those that I've learned and implemented in my professional and academic journey. Please feel free to add more in comments if you'd like.

In broad terms, the statistical tests are divided into 2 categories:
  1. Parametric tests
  2. Non-parametric tests
Parametric Tests: These are most commonly used when the distribution of your data looks normal.
Non-parametric Tests: These are used when the data is not normally distributed.
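Since the choice hinges on normality, it helps to check it first. Below is a minimal sketch using the Shapiro-Wilk normality test from scipy (the data here is synthetic, purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=50, scale=5, size=200)   # roughly normal data
skewed_sample = rng.exponential(scale=5.0, size=200)    # clearly non-normal data

# Shapiro-Wilk: Ho = the sample comes from a normal distribution.
# A small p-value (< 0.05) suggests the data is not normal,
# pointing you toward a non-parametric test.
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    stat, p = stats.shapiro(sample)
    kind = "parametric" if p > 0.05 else "non-parametric"
    print(f"{name}: p = {p:.4f} -> consider a {kind} test")
```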

Before I get into any of those tests, I will walk through some key concepts that we will need for all of them.

Population

The population is ideally your whole data. For example:
  1. You want to know the average weight of people around the world. You cannot go ahead and measure it for each and every person. But still, the world population is the population that we're talking about.
  2. You have over 100 M records in your dataset. You most probably won't be able to use the whole dataset for building models, but you still want to analyze it.
How do you make that analysis?

Sample

A sample is a subset taken (randomly or not) from the whole population. For example:
  1. You took weights of 10,000 people from all over the world.
  2. You randomly selected 100,000 records from the 100 M records in the overall dataset.

Hypothesis

In statistical testing, we define two hypotheses:
NULL Hypothesis (Ho): This is the default state, what has always been assumed. Like someone is innocent until proven guilty.
ALTERNATE Hypothesis (Ha): This is what you claim to be true. Like someone is guilty.

You conduct these tests to understand whether the new claim is actually valid. In more technical terms, if you take two random samples from the population, you could define your hypotheses as:
Ho = There is no significant difference between the two samples taken
Ha = There is a significant difference between the two samples taken

Critical Value

A critical value is a point (or points) on the scale of the test statistic beyond which we REJECT the Ho with a certain confidence level. For example, in the test output below:
AndersonResult(statistic=1.383562257554786,
               critical_values=array([0.574, 0.654, 0.785, 0.916, 1.089]),
               significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
There are 5 critical values with 5 significance levels. This is how you interpret it: 
  • to reject the Ho with 85% confidence (1-significance level), the test statistic should be greater than 0.574
  • to reject the Ho with 90% confidence (1-significance level), the test statistic should be greater than 0.654
and so on... In this case, the test statistic is above all the critical values, so we can reject the Ho with 99% confidence.
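An output like the one above comes from scipy's Anderson-Darling test. Here is a sketch of producing and interpreting such a result, using synthetic, deliberately non-normal data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)  # deliberately non-normal sample

result = stats.anderson(data, dist='norm')
print("test statistic:", result.statistic)

# Compare the statistic against each critical value / significance level pair
for cv, sl in zip(result.critical_values, result.significance_level):
    if result.statistic > cv:
        print(f"reject Ho at the {sl}% significance level ({100 - sl:.1f}% confidence)")
    else:
        print(f"fail to reject Ho at the {sl}% significance level")
```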


p-value

For this, just remember that the p-value, or probability value, is compared with a significance level. If the p-value is less than the significance level (0.05 for 95% confidence), you can reject the Ho. Otherwise, you fail to reject it.
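As a quick sketch, here is that rule applied to a one-sample t-test (synthetic data; the true mean is deliberately far from the hypothesized one, so the test should reject Ho):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=55, scale=5, size=40)  # true mean is actually 55

# Ho: the sample comes from a population with mean 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05  # significance level for 95% confidence
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject Ho")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject Ho")
```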

Now let's go through the most common tests out there:

Correlational tests

These tests look for an association/correlation between two variables. For example, the higher the salary, the higher the expenses. Hence, there's a good correlation between the salary and expense variables.

Pearson Correlation
Tests for the correlation and its strength between two continuous variables, like salary and expense.
DataFrame.corr(method='pearson', min_periods=1)
or
scipy.stats.pearsonr(x, y)

Spearman Correlation (non-parametric)
Tests for the correlation and its strength between two ordinal categorical variables, like grades and IQ.
DataFrame.corr(method='spearman', min_periods=1)
or
scipy.stats.spearmanr(a, b)

Chi-Square Test
Tests for the association and its strength between two nominal categorical variables, like gender and color.
scipy.stats.chi2_contingency(observed)
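The correlational tests above can be sketched as follows (all data is synthetic; note that for two nominal variables, the chi-square test of independence works on a contingency table of observed counts):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Pearson: two continuous variables (expense is built to depend on salary)
salary = rng.normal(60_000, 10_000, size=100)
expense = 0.5 * salary + rng.normal(0, 2_000, size=100)
r, p_pearson = stats.pearsonr(salary, expense)

# Spearman: rank-based, suitable for ordinal data or monotonic relationships
rho, p_spearman = stats.spearmanr(salary, expense)

# Chi-square test of independence: two nominal variables,
# summarized as a gender x color contingency table of counts
observed = np.array([[10, 20],
                     [30, 25]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, chi2 = {chi2:.3f}")
```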

'Mean' comparison tests

Z-test
This test compares a sample with its population to see if the sample truly represents the population, i.e. whether it was actually taken from the given population.

Ho = The sample is taken from the population, i.e. the sample mean is the same as the population mean
Ha = The sample is NOT taken from the population, i.e. the sample mean is NOT the same as the population mean

This is how you calculate the Z-statistic:
z = (x̄ - μ) / (σ / √n), where

x̄ = sample mean
μ = population mean
σ = population standard deviation
n = sample size (so σ / √n is the standard error of the mean)

NOTE: Sample size should be > 30
statsmodels.stats.weightstats.ztest(x1, x2=None, value=0,
                                    alternative='two-sided',
                                    usevar='pooled', ddof=1.0)
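The Z-statistic formula can also be computed by hand; here is a sketch using scipy (the synthetic sample is drawn from the assumed population, so Ho should hold):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=100, scale=15, size=50)  # n > 30

# z = (sample mean - population mean) / (sigma / sqrt(n)),
# with the population parameters assumed known
mu, sigma = 100, 15
z = (sample.mean() - mu) / (sigma / np.sqrt(len(sample)))

# Two-sided p-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z))
print(f"z = {z:.3f}, p = {p_value:.4f}")
```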
T-test
A T-test compares the means of two samples. It is used when the population parameters (mean and standard deviation) are NOT known.

It works when the sample is small (n < 30) and the population standard deviation is unknown.
scipy.stats.ttest_ind(a, b, axis=0, equal_var=True,
                      nan_policy='propagate')

Paired T-test
Used to compare the means of two related measurements from the same population. For example: whether the IQ of a person increased after taking a particular course.
scipy.stats.ttest_rel(a, b)

Independent T-test (Two Sampled T-test)
Used to understand if there is a significant difference between the means of two unrelated groups, like the scores of boys and girls.
scipy.stats.ttest_ind(a, b)

One Sampled T-test
Used to compare the mean of a sample with a fixed value.
scipy.stats.ttest_1samp(a, popmean, axis=0)
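The three t-test variants can be sketched as follows (all data is synthetic; the paired "after" scores are built to improve, so that test should reject Ho):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Independent (two-sample) t-test: scores of two unrelated groups
boys = rng.normal(70, 8, size=25)
girls = rng.normal(75, 8, size=25)
t_ind, p_ind = stats.ttest_ind(boys, girls, equal_var=True)

# Paired t-test: the same people measured before and after a course
before = rng.normal(100, 10, size=20)
after = before + rng.normal(5, 3, size=20)  # scores tend to improve by ~5
t_rel, p_rel = stats.ttest_rel(before, after)

# One-sample t-test: compare a sample mean with a fixed value
t_one, p_one = stats.ttest_1samp(before, popmean=100)

print(f"independent p = {p_ind:.4f}, paired p = {p_rel:.4f}, one-sample p = {p_one:.4f}")
```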
ANOVA (Analysis of Variance)

One-way ANOVA
This test is used when you have a categorical independent variable with 2 or more categories and a normally distributed continuous target (dependent) variable, and you want to compare the means of the target variable across those categories to see if they are the same or if any category is significantly different.

For example: whether there's a difference between the average scores of students across 3 different schools, where each student went to only 1 of them.
scipy.stats.f_oneway(sample1, sample2, ..., sample_n)
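Here is a sketch of the school-scores example with f_oneway (synthetic data; one school is given a clearly higher mean, so the test should reject Ho):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# Scores of students from three different schools
school_a = rng.normal(70, 10, size=30)
school_b = rng.normal(72, 10, size=30)
school_c = rng.normal(85, 10, size=30)  # deliberately different mean

f_stat, p_value = stats.f_oneway(school_a, school_b, school_c)
if p_value < 0.05:
    print(f"p = {p_value:.6f}: at least one school mean differs significantly")
else:
    print(f"p = {p_value:.6f}: no significant difference detected")
```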


I'll add more as I learn!

Cheers!






