Updated: Aug 28, 2019
First Principles Series
*Jupyter notebook for the code in this post can be downloaded here.
I have always been the kind of learner who needs a foundational understanding before I can grasp a concept and move on. I can "understand what you're telling me" without that foundation, but I can't understand its implications and practical use. In the end, what good is information if you can't use it? So, I usually take the time to build the foundational knowledge that lets me view the topic from a holistic point of view (even when it's a tedious and ambiguous journey).
When I was learning about linear regression, the curriculum taught that a log transformation could make my data less skewed. Taken at face value, that's easy to understand. Where I started struggling was with why log transforming data makes it more normal, and what exactly that means for the interpretation of the model. This is where I hit that infamous foundational knowledge gap. Things seemed to get more complex as I started to pull back the curtains on this mess.
We'll start with how a log transformation works. To log transform a column of your data, you replace each datum with the logarithm of itself. Most commonly it's the natural logarithm, with e as the base, so that's what I'll be using in this post, but there are other variations as well. e is a mathematical constant, an irrational number approximately equal to 2.71828.
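A minimal sketch of that definition, using Python's standard library (the input value 20 here is just an arbitrary example):

```python
import math

# The natural log answers: e raised to what power equals x?
x = 20.0
log_x = math.log(x)  # math.log defaults to base e
print(round(log_x, 4))  # about 2.9957

# Raising e back to that power recovers the original value
print(round(math.e ** log_x, 4))  # 20.0
```

Applying this to every value in a column is all a log transformation does.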
"The natural logarithm of x is the power to which e would have to be raised to equal x." (Wikipedia)
The number returned is the power to which our base, e, has to be raised to reach our input number. Because we're dealing with exponents, getting to the next integer in the output takes progressively larger steps in the input. Below is a graph that shows this effect more clearly.
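You can see those progressively larger steps by asking which inputs produce each integer output:

```python
import math

# The input needed to reach each successive integer output
# grows by a factor of e each time
for k in range(1, 5):
    print(k, round(math.exp(k), 2))
# ln(x) = 1 at x ≈ 2.72, = 2 at x ≈ 7.39, = 3 at x ≈ 20.09, = 4 at x ≈ 54.6
```

Each unit step in the output requires multiplying the input by about 2.72, which is exactly why a log transform compresses large values so strongly.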
Log-transforming our data pulls the larger data points closer to the smaller ones. The outcome is that it removes our issue of skewness to a degree. It's not a surefire way to meet the assumptions you need for your linear regression model, but in the right circumstances, it can be very helpful.
We'll be using the King County housing data, a standard dataset for beginner projects. The dataset can be found on Kaggle or in the GitHub repo for this blog post. For those who are interested, King County has an easy-to-use open data platform with a wide variety of county datasets available for download.
We need a couple of libraries to get started: Pandas, Matplotlib.pyplot, Statsmodels.api, and Numpy. Once we have our dataset loaded, we can use plt to display a histogram of the data to quickly assess normality. We'll also use pd.Series.kurtosis() and pd.Series.skew() to calculate the level of kurtosis and skewness in the data. These numbers give us a baseline to compare against after the log transformation. Both functions return a single number where zero is considered normal, and the further you get from zero, the more kurtosis or skewness your data shows.
As we can see from the histogram in image 2, our data is not normally distributed; it is skewed right. Our kurtosis and skewness numbers are both higher than one. To solve this problem, we can perform a log transformation on the Series. Numpy makes this a simple task.
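The transformation itself is one line with np.log, which applies the natural log element-wise and hands back a Series. Again using a synthetic right-skewed sample in place of the actual price column:

```python
import numpy as np
import pandas as pd

# Stand-in right-skewed sample (the post uses the 'price' column)
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=10_000))

# np.log replaces each value with its natural logarithm
log_price = np.log(price)

# Skewness drops from well above 1 to roughly zero
print(price.skew(), '->', log_price.skew())
```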
Just looking at our visualization in image 3, the data looks much more normally distributed. It isn't a perfectly normal distribution, but there is more of a bell shape than before. Our numbers confirm this as well. While our kurtosis is now negative, both numbers are within 1 of a normal distribution, which passes our normality assumption.
Just for demonstration purposes, let's look at some left-skewed data. The first couple of lines reverse the dataset to be left-skewed and make sure there are no negative or zero values, as those cause an error when we go to log transform it.
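One way to do that reversal, sketched on the same synthetic stand-in sample: reflect every value around the maximum (which mirrors the distribution, flipping right skew to left skew), then add 1 so the smallest value is 1 rather than 0:

```python
import numpy as np
import pandas as pd

# Stand-in right-skewed sample, as before
rng = np.random.default_rng(1)
price = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=10_000))

# Reflect around the maximum to flip the skew, then add 1
# so no value is zero (log(0) is undefined)
left_skewed = price.max() - price + 1

print(price.skew(), left_skewed.skew())  # positive -> negative skew
```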
Once we have our left-skewed data, displayed in image 4, we repeat the same steps as before. Our data is closer to a normal distribution than before, but we still have some issues that we can try to solve with a log transformation. I had to increase the number of bins in the histogram for us to be able to see what's happening.
Looking at the histogram in image 5, we can tell our data didn't get closer to a normal distribution, and the numbers show how much worse it got. Because a majority of the values sat at the high end of our data set, the log transformation pulled them tightly together, and we get extreme positive kurtosis.
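Repeating the transform on the synthetic left-skewed sample shows the same failure: the log stretches the few small values far apart while squeezing the bulk of large values together, so kurtosis explodes rather than shrinking:

```python
import numpy as np
import pandas as pd

# Rebuild the stand-in left-skewed sample from the previous step
rng = np.random.default_rng(1)
price = pd.Series(rng.lognormal(mean=13, sigma=0.5, size=10_000))
left_skewed = price.max() - price + 1

# Logging left-skewed data makes it worse, not better
log_left = np.log(left_skewed)
print(left_skewed.kurtosis(), '->', log_left.kurtosis())
```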
As we can see from the last example, a few issues can arise from log transforming your data. It's far from a one-size-fits-all solution to skewness. The first problem to figure out is what to do with negative or zero values. You can't take their log, because e raised to any power is always positive; no exponent can produce a negative number or zero. So, you have to make sure your column doesn't contain observations like that, or you'll get errors or invalid values. Additionally, if the data has a wide range of values, there is the potential of creating more skewness.
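A quick sketch of what actually happens, and one common workaround (shifting the data so its minimum is positive; the small array here is just illustrative):

```python
import numpy as np

# NumPy doesn't raise for log(0) or log of a negative; it returns
# -inf / nan with a RuntimeWarning (math.log would raise ValueError)
with np.errstate(divide='ignore', invalid='ignore'):
    print(np.log(0.0))   # -inf
    print(np.log(-5.0))  # nan

# A common workaround: shift the column so the minimum becomes 1
data = np.array([-3.0, 0.0, 2.0, 10.0])
shifted = data - data.min() + 1
print(np.log(shifted))
```

Note that shifting changes the shape of the distribution slightly and has to be remembered when interpreting the model.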
The second issue is that log transforming a column can create high positive kurtosis, which may violate your normality assumption to a degree you find unacceptable. You'll have to make sure you don't end up with results like the ones we saw in our left-skewed data transformation.
Now that you know when and why to use a log transform, you need to know what it means for your results. The best resource I've found is an article written by University of Virginia Senior Research Data Scientist Clay Ford. He describes easy-to-implement formulas for interpretation when you've log-transformed the dependent variable, the independent variable, or both. It's a great article to read at length, but the formulas in plain English are in the section called rules for interpretation.
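As a sketch of the first of those standard rules (the coefficient value here is hypothetical, not taken from the housing model): when only the dependent variable is logged, exponentiating a coefficient tells you the multiplicative effect on y of a one-unit increase in x.

```python
import math

# Hypothetical fitted model: log(y) = b0 + b1 * x, with b1 = 0.05
b1 = 0.05

# A one-unit increase in x multiplies y by exp(b1),
# i.e. a (exp(b1) - 1) * 100 percent change in y
pct_change = (math.exp(b1) - 1) * 100
print(round(pct_change, 2))  # about a 5.13% increase in y per unit of x
```

The other two rules follow the same spirit: with only x logged, a 1% increase in x changes y by roughly b1/100 units, and with both logged, b1 is an elasticity, so a 1% increase in x gives about a b1% change in y.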
So those are the first principles that I felt were missing when I was learning to use log transformations in my linear regression model. My hope is that you now know when to use log transformations and can more confidently implement them and interpret your results.