Principal Component Analysis (PCA) and Ordinary Least Squares (OLS) are two important statistical methods. They are even better when performed together. We will explore these methods using matrix operations in R and introduce a basic Principal Component Regression (PCR) technique.
We will generate a simple data set of four highly correlated explanatory variables from the Gaussian distribution, and a response variable that is a linear combination of them with added random noise.
> sigma=matrix(.9, nrow=4, ncol=4) + diag(4)*0.1
> set.seed(2021)
> data <- as.data.frame(mvrnorm(20, mu = mu, Sigma = sigma),
+ empirical = T)
> …
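For readers who prefer Python, the idea of Principal Component Regression can be sketched roughly as follows. This is not the article's R code: it is a minimal NumPy illustration that assumes a zero mean vector (the `mu` above is not shown in the excerpt), made-up true coefficients, and a made-up noise level. The function name `pcr_coefficients` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2021)

# Same covariance structure as the R snippet: unit variances, 0.9 correlations.
sigma = np.full((4, 4), 0.9) + np.eye(4) * 0.1

# ASSUMPTION: zero mean vector; the excerpt does not show how `mu` is defined.
X = rng.multivariate_normal(mean=np.zeros(4), cov=sigma, size=20)

# ASSUMPTION: hypothetical true coefficients and noise level, for illustration only.
beta_true = np.array([1.0, 2.0, 3.0, 4.0])
y = X @ beta_true + rng.normal(scale=0.5, size=20)


def pcr_coefficients(X, y, n_components):
    """Regress y on the first principal components of X and map the
    fitted coefficients back to the original variables."""
    Xc = X - X.mean(axis=0)          # center the predictors
    yc = y - y.mean()                # center the response
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T          # keep the leading directions
    scores = Xc @ V                  # principal component scores
    gamma, *_ = np.linalg.lstsq(scores, yc, rcond=None)  # OLS on the scores
    return V @ gamma                 # back to the original coordinates


beta_pcr_1 = pcr_coefficients(X, y, n_components=1)
beta_pcr_4 = pcr_coefficients(X, y, n_components=4)
print("PCR with 1 component :", np.round(beta_pcr_1, 3))
print("PCR with 4 components:", np.round(beta_pcr_4, 3))
```

Keeping all four components reproduces ordinary least squares on the centered data; dropping components trades some bias for more stable coefficients when the predictors are highly correlated.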
Bootstrap is a method of random sampling with replacement. Among its other applications such as hypothesis testing, it is a simple yet powerful approach for checking the stability of regression coefficients. In our previous article, we explored the permutation test, which is a related concept but executed without replacement.
Linear regression relies on several assumptions, and under the CLT the coefficient estimates are taken to be approximately normally distributed. The confidence intervals describe what would happen if we repeated the experiment thousands and thousands of times: on average, the stated share of those intervals would contain the true line. The bootstrap approach does not rely on those assumptions*, but…
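As a rough illustration of the bootstrap idea for regression coefficients, here is a minimal Python sketch using only the standard library. It resamples (x, y) pairs with replacement, refits a simple one-variable least-squares line each time, and reports a percentile interval for the slope. The toy data, the function names, and the choice of a percentile interval are assumptions made for this example, not the article's own procedure.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slopes(xs, ys, n_boot=2000, seed=2021):
    """Refit the slope on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:  # degenerate resample: slope is undefined
            continue
        slopes.append(ols_slope(bx, by))
    return slopes


def percentile_interval(values, alpha=0.05):
    """Simple percentile interval from the bootstrap distribution."""
    ordered = sorted(values)
    lower = ordered[int((alpha / 2) * len(ordered))]
    upper = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lower, upper


# ASSUMPTION: toy data with a true slope of 2 and Gaussian noise.
data_rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [1.0 + 2.0 * x + data_rng.gauss(0.0, 1.0) for x in xs]

slopes = bootstrap_slopes(xs, ys)
lower, upper = percentile_interval(slopes)
print(f"slope estimate: {ols_slope(xs, ys):.3f}")
print(f"95% bootstrap interval: ({lower:.3f}, {upper:.3f})")
```

A narrow interval suggests the slope is stable under resampling; a wide one, or one that changes sign, suggests the coefficient depends heavily on a few observations.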
To compare outcomes in experiments, we often use Student’s t-test. It assumes that the data are randomly selected from the population, that they either come in large samples (>30) or are normally distributed, and that the groups have equal variances.
If we do not happen to meet these assumptions, we may use one of the simulation tests. In this article, we will introduce the Permutation Test.
Rather than assuming an underlying distribution, the permutation test builds its own distribution by breaking up the associations between or among groups. Often we are interested in the difference of means or medians between the groups, and the null hypothesis is that there is…
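The mechanics described here can be sketched in a few lines of Python. This is a minimal two-sided test on the difference of means, using only the standard library; the sample values are invented for illustration, the function name is hypothetical, and the "+1" correction in the p-value is one common convention rather than something the article prescribes.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_perm=10000, seed=2021):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels without replacement
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Count the observed arrangement itself so the p-value is never zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# ASSUMPTION: made-up measurements for two groups.
group_a = [12.1, 11.8, 13.0, 12.6, 12.9, 13.4]
group_b = [10.2, 10.9, 9.8, 11.1, 10.5, 10.0]

observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference in means: {observed:.3f}")
print(f"permutation p-value: {p_value:.4f}")
```

Because the labels are shuffled rather than resampled, every permuted data set contains exactly the original observations, which is the "without replacement" contrast with the bootstrap.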
When we perform traditional AB testing, we need a randomized environment for the experiment. But what if we cannot randomly choose the participants?
In this article, we will explore two powerful techniques for estimating the effect in non-randomized experiments: difference in differences and propensity score matching. We will briefly introduce these methods using a classic case study by David Card and Alan B. Krueger, published in 1994.
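The arithmetic behind difference in differences is short enough to show directly. The sketch below is a minimal Python illustration with invented group averages; the numbers are not figures from the Card and Krueger study, and the function name is hypothetical.

```python
def did_estimate(treated_before, treated_after, control_before, control_after):
    """Difference in differences: the change in the treated group
    minus the change in the control group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# ASSUMPTION: hypothetical average outcomes, chosen only to show the arithmetic.
effect = did_estimate(
    treated_before=20.0,
    treated_after=21.0,
    control_before=23.0,
    control_after=21.0,
)
print(f"estimated effect: {effect:.1f}")
```

The control group's change stands in for what would have happened to the treated group without the intervention, which is why the method relies on the two groups following parallel trends.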
In our previous example, we estimated the effect of different sales approaches in a randomized environment.
Suppose now that we have a retail chain with a presence in different cities or countries. We want to…
In this article, we will learn how to run a Decision Tree classifier using Python and the sklearn package. We will also see how to split the dataset into training and testing sets, and how to measure the accuracy of the model. Finally, we will plot the model.
For this algorithm, there are fewer than 25 lines of code, which is typical for some of the Machine Learning algorithms.
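A workflow of that size might look roughly like the sketch below. It assumes the Iris data set bundled with sklearn (the article's own data set is not shown here), an arbitrary 70/30 split, and a depth limit of 3. It prints a text rendering of the tree with `export_text` in place of a graphical plot.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# ASSUMPTION: the bundled Iris data set stands in for the article's data.
iris = load_iris()

# Hold out 30% of the rows for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# A shallow tree keeps the printed model readable.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Accuracy is the share of test rows classified correctly.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")

# Text rendering of the fitted decision rules.
print(export_text(clf, feature_names=iris.feature_names))
```

For a graphical version, `sklearn.tree.plot_tree` draws the same fitted model with matplotlib.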
These are the economies of South Korea, Taiwan, Singapore, and Hong Kong. The World Bank database (https://data.worldbank.org) for some reason does not have the data for Taiwan; therefore, we will look at the economies of only three Tigers.
Although these Tigers are in the same club, in this article we will explore some of their differences. For our analysis, we have obtained various features such as GDP, Electric power consumption, Patent applications, Researchers in R&D, Export/Import, and many others. …
Data Analytics in R and Python, Machine Learning, Management, Innovations, Mathematics, Chess, Kafka, Joyce, Stravinsky, Rossini, Bubble Tea and Ice Cream