Statistics for Data Science

Pearson Correlation

  • Pearson Correlation Coefficient
    The Pearson Correlation Coefficient is a number between -1 and 1 that measures the strength of the linear relationship between two series of data. A correlation coefficient of 1 represents a perfect positive linear correlation and a correlation coefficient of -1 represents a perfect negative linear correlation. When the correlation coefficient is close to 0, there is no evidence of a linear relationship between the two series of data.
    The problem with looking only at the Pearson Correlation Coefficient is that a perfect linear correlation can arise from just a few samples. Therefore we analyze a second measurement that represents the confidence in the correlation coefficient.
  • Two-Tailed p-value
    The p-value is the standard method in statistics to measure the significance of an empirical analysis. For every relationship we start with the null hypothesis, which is that the two series of data are unrelated. The p-value is a number between 0 and 1 representing the probability that data at least this extreme would have arisen if the null hypothesis were true. Therefore a low p-value (such as 0.01) is taken as evidence that the null hypothesis can be rejected and that the relationship between the two series of data is highly significant.
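Both measurements can be computed together with `scipy.stats.pearsonr`. A minimal sketch (the data below is made up for illustration):

```python
# Sketch: Pearson r and its two-tailed p-value via scipy (illustrative data)
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]  # roughly y = 2x

r, p = stats.pearsonr(x, y)
print(f"r = {r:.4f}, two-tailed p-value = {p:.2e}")
```

Because y follows x almost perfectly, r is close to 1 and the p-value is tiny, so the null hypothesis of no relationship can be rejected.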

OLS Regression Results

  • R-squared [0,1]
    represents the percentage of variance in the dependent variable that is explained by the independent variables. A major drawback is that R-squared increases with the number of variables. Therefore the adjusted R-squared, which only increases when an added variable improves the explanatory power of the regression, is usually preferred.
  • Prob(F-Statistic)
    is the overall significance of the regression. The null hypothesis is “all the regression coefficients are equal to zero”. The Prob(F-Statistic) is the probability of observing an F-statistic at least this large if the null hypothesis were true. If the Prob(F-Statistic) is close to zero, we know that the overall regression is meaningful.
  • AIC (Akaike’s Information Criteria) / BIC (Bayesian information criteria)
    are useful for model selection, because they penalize model complexity when a new variable is added. A lower AIC or BIC implies a better model.
  • Prob(Omnibus)
    tests the assumption of OLS that the errors are normally distributed. The null hypothesis is that the errors are normally distributed. Therefore a Prob(Omnibus) above the significance level (e.g. > 0.05) means the normality assumption cannot be rejected.
  • Durbin-Watson
    tests the OLS assumption that the errors are uncorrelated by checking for autocorrelation in the residuals. A value close to 2 indicates no autocorrelation; values roughly between 1.5 and 2.5 are usually considered acceptable.
  • Prob(Jarque-Bera)
    tests the assumption of OLS that the errors are normally distributed. The null hypothesis is that the errors are normally distributed. Therefore a Prob(Jarque-Bera) above the significance level (e.g. > 0.05) means the normality assumption cannot be rejected.
  • Regression Coefficient (coef)
    The regression coefficient shows if there is a positive or negative correlation between the independent variable and the dependent variable. The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant.
  • Standard Error (std err)
    The standard error is an estimate of the standard deviation of the coefficient, the amount it varies across cases. It can be thought of as a measure of the precision with which the regression coefficient is measured.
  • t
    The t statistic is the regression coefficient divided by its standard error.
  • P>|t|
    The p-value shows the significance of the individual coefficient. A value < 0.05 indicates that the coefficient is statistically significant at the 5% level. For example, if 95% of the t-distribution is closer to the mean than the t-value of the coefficient you are looking at, then you have a p-value of 5% (significance level of 5%).
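Several of the quantities above can be reproduced by hand with NumPy. A minimal sketch on synthetic data (the variable names and the generated data are illustrative assumptions, not part of any real regression output):

```python
import numpy as np

# Synthetic data: y = 3 + 2x + noise (illustrative only)
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # regression coefficients (coef)
resid = y - X @ beta
k = X.shape[1]                                 # number of estimated parameters

# R-squared and adjusted R-squared
ss_res = resid @ resid
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)

# Standard errors and t statistics: t = coef / std err
sigma2 = ss_res / (n - k)
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stat = beta / std_err
```

In practice `statsmodels.api.OLS(y, X).fit().summary()` reports these same quantities, together with the diagnostic tests described above, in one table.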

OLS Regression Plots

  • Y and Fitted vs X
    A fit plot shows predicted values of the response variable versus actual values of Y. If the linear regression model is perfect, the predicted values will exactly equal the observed values.
  • Residuals versus Variable
    The residuals are defined as Residual = Observed – Predicted and are therefore a measure of the prediction error. If the residual is a positive value (on the y-axis) the prediction was too low, and a negative value means the prediction was too high. Ideally all measurements lie on a horizontal zero line. From the plot you can see whether some value ranges have a high error and could be declared outliers.
  • Partial Regression Plot
    shows the effect of adding a new variable to an existing model while controlling for the effect of the predictors already in use. It is useful for spotting points with high influence.
  • CCPR Plot
    The CCPR (component and component-plus-residual) plot provides a way to judge the effect of one regressor on the response variable while taking into account the effects of the other independent variables.
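The residual definition used in the plots above (Residual = Observed – Predicted) can be sketched in a few lines; the observed and predicted values here are made up for illustration:

```python
import numpy as np

observed = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 6.8, 9.4])

# Residual = Observed - Predicted:
# positive -> prediction was too low, negative -> prediction was too high
residuals = observed - predicted
print(residuals)
```

Plotting these residuals against a variable (or against the fitted values) is exactly what the residual plots described above do; points far from the zero line flag observations the model predicts poorly.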
