• Joe Wilkinson

Northwind Hypothesis Tests

Introduction


For this project, I worked with the Northwind database--a free, open-source dataset created by Microsoft containing data from a fictional company. The goal of this project was to gather information from a real-world database and use my knowledge of statistical analysis and hypothesis testing to generate analytical insights that can be of value to the company. I tested four separate hypotheses and analyzed the results.


My first step was to import the necessary libraries and to query the database for the tables I might need. I used SQLAlchemy to connect to the database and run queries using Python. Using this connection, I created a few helper functions to gather information about the individual tables, so I could get a better feel for the data I had to work with. This exploration allowed me to identify what hypotheses we had the data to answer. The hypotheses that I wanted to test were:


1. Do discounts have a statistically significant effect on the number of products customers order? If so, at what level(s) of discount?

2. Is the business seasonal?

3. Which supply regions have lower reorder levels?

4. Which customers generate statistically higher revenues?


Hypothesis Tests


1. Discount Effect


The first hypothesis I tested was, "do discounts have a statistically significant effect on the number of products customers order?". I started by querying the OrderDetail table. Filtering by OrderId showed that each record in the OrderDetail table describes one product within one order. Thus, each observation in the Order table has one or more observations in the OrderDetail table. Meaning, we don't need the Orders table as it mainly describes details about the shipment and not what's in the order.


When looking at all the observations of the same product in the OrderDetail table shows that the company didn't always use the price listed in the Product table's UnitPrice. The quantities and discount levels don't seem to be causing the difference, and it's not the average price. So we'll assume the Product table's UnitPrice to be the suggested retail price and therefore isn't relevant for what we're trying to accomplish.


The only table that has pertinent data is the OrderDetail table. I queried the database to return the columns with the data we need and place it into a DataFrame. Looking at the value counts for the 'Discount' column:


0.00 1317

0.05 185

0.10 173

0.20 161

0.15 157

0.25 154

0.03 3

0.02 2

0.01 1

0.04 1

0.06 1


There are a few discount rates that the company gave to so few customers that I won't be able to come to any statistically significant conclusions and thus dropped any rows with a discount of 0.01, 0.02, 0.03, 0.04, and 0.06. Now that I had the data, I needed to run the tests, and I split the data into two groups, control (without a discount given) and treatment (with a discount given). To avoid sampling bias and get the true mean for each group, I created a function that took a DataFrame and k (optional) as arguments. This function takes the mean of k random samples and returns the overall mean. While it may not be the exact population mean, it's going to be very close because of the law of large numbers.

Visually looking at the mean distributions for both the control and treatment group, we see a clear difference between the two. The difference does look significant, but we'll run a hypothesis test to check to see if that is the case.


Now that we have our treatment and control datasets, we can execute our hypothesis test.

We'll take a sample of 100 observations from our control and treatment groups and run a one-sided, two-sample t-test with an alpha level of 0.05. Our results are very significant, and we reject the null hypothesis and say that the discounted quantities are higher than the non-discounted quantities.


Now that we know there is a statistical difference, we need to determine at what level the discounts are most productive. We'll use ANOVA to determine if there is more variation between the different levels of discount or within each level of discount.


Form Hypothesis:

I created a new DataFrame where each column is a sample of a different discount rate (0.05, 0.10, 0.15, 0.20, 0.25). The new DataFrame allows me to run the calculations easily.

Looking at the density plot, we can see that there is very minimal difference between each discount level. If the level of the discount rate affected, we would expect to see a lot more variation. I ran a one-way ANOVA test to make sure the numbers showed the same result that the graph seemed to.


Our F-statistic of 0.2258 suggests the between-groups variance is roughly 1/4 of the within-group variance. We fail to reject the null hypothesis of equal mean values due to the p-value being > 0.05. These results aren't surprising considering the density plots of our samples. They are visually close to identical, so our recommendation is to stick with low levels of discounts as there is no evidence to say that increasing the discount level increases the quantity bought, so the outcome is reduced revenue earned.


2. Seasonality


Next, we'll look at the seasonality of the business. We want to see if there is a time of year where sales are significantly higher. If a season does have higher sales, this gives us more insight when we're forecasting sales and planning marketing campaigns.


Hypothesis:

For this test, we'll need data from the 'Order' and 'OrderDetail' tables. The 'Order' table has the date of each transaction, and the 'OrderDetail' table has the unit price, quantity, and discount, which are used to calculate the total revenue of the transaction. Once I retrieved the relevant data from the database and converted the 'OrderDate' column into the DateTime datatype, I wanted to see how the data looked. I created a bar graph of the value counts for the number of orders by month.

This initial graph does seem to indicate that there might be some seasonality to the number of orders. Even though the number of orders differs by month, that doesn't mean that there aren't other factors at play and what we're seeing isn't actual seasonality.

By looking at the revenue each month, there is a clear trend of growth. This trend distorts the revenue numbers. I don't want a month to have higher numbers simply because it's chronologically the last one to have an observation and therefore, the company had time to grow. To compensate for this, I had to calculate a regression line for the expected revenue. This regression line allows me to calculate the variance between what I expect and the revenues we observed. A positive variation indicates that a month has high revenue and a negative variation indicating low revenues. The variation is how I will determine seasonality while accounting for the overall growth of the company.

The last month we have observations for only has observations for the first few days of the month. The lack of observations for that month skews the data, so we'll drop this partial month from our dataset. Another problem is that the dataset we have is from a small window of time. The small window means there is an unequal number of months (i.e., May and June only have one full month of observations while all other months have two). I want to group the data into seasons (spring, summer, fall, and winter), this allows me to run an ANOVA test to determine if there is a difference between the seasons in terms of revenue or if the difference is just variation within the seasons.


When we run a one-way ANOVA test on the variance in revenues by season, we get an F-statistic of 0.3051. The F-statistic indicates that most of the variation in the dataset is within the seasons and not between them. A high p-value of 0.8213 indicates that there isn't enough statistical proof to indicate that there is a difference in revenue between the seasons.


3. Supply Region Reorder Level


I wanted to look at product reorder levels by region. I wanted to be able to see if any regions required higher or lower reorder levels. If we find a statistically different region, this might give the company a clue on what a specific region does differently that helps or hinders the company's ability to appropriately stock products.


Hypothesis:

To run this test, we'll need Id, CompanyName, and Region from the 'Supplier' table. As well as Id, ProductName, SupplierId, ReorderLevel, Discontinued from the 'Product' table. I dropped any discounted products because they are no longer relevant to our business. From there, I was able to group by the 'Region' and average all of the products' reorder levels within each region.

After dropping South America because it doesn't have any active products, I could see that most region's average reorder level looks relatively similar. The one exception being Eastern Asia, which looks meaningfully smaller. To make sure that we have a statistically significant result and that there aren't other regions that are different from the general population, I ran a one sample t-test for each region to see if that specific region meets our criteria for being different.

As we expected, Eastern Asia has a significantly lower reorder level than other regions. I can reject the null hypothesis and accept the alternative hypothesis that Eastern Asia is different from other regions in terms of reorder level. No other regions had a statistically significant difference. In future work, we'll want to look into what makes Eastern Asia so different. Are they able to deliver items faster? Do their products sell infrequently enough not to affect the bottom line? Is there an opportunity to generate more revenues if we reorder their product earlier?


4. Biggest Customers


We want to determine which customers generate a higher portion of revenue than expected so we can focus our efforts more on customers that make us the most money. If they choose to use another supplier, it would have a much more profound effect on our bottom line than an average customer.


Hypothesis:

Three database tables contain the data I need, Customer, Order, OrderDetail. Once I joined these tables, I was able to calculate the revenue for each product transaction by using the quantity, unit price, and discount columns. The result is a new column called 'Revenue.' When looking at the data, there were a handful of rows that had null values. I felt comfortable dropping any row with null values because it made up such a small percentage of our observations.


My next step was to group our observations by OrderId and summing the revenue column, giving us total revenue for each transaction, which is what we want to know. Merging this data with our customer info table gave us OrderId, Revenue, and CompanyName for each transaction.


By creating a list of DataFrames of each customer's purchases, I was able to run a one sample t-test on each company's orders. If the t-statistic associated with the company's revenue is higher than the critical t-statistic and the p-value is lower than our predetermined alpha, 0.05, we'll reject the null hypothesis and assume the revenue is higher than the general population of our customers.



After running the test on each of the 80+ customers we have, we find that four customers generate significantly high enough revenues to reject the null hypothesis, Ernst Handel ($398,626.32), QUICK-Stop ($393,821.00), Hungry Owl All-Night Grocers ($136,219.55), and Save-a-lot Markets ($431,146.19) compared to the average revenue generated of $45,537.33.


Conclusion


1. Discount Effect

We discovered that giving a discount did increase the number of items sold but that there was no differentiation between the levels of discounts. My recommendation is to give discounts at the same rate but decrease the level of discount to increase profit.


2. Seasonality

Most of the variation in the dataset is within the seasons and not between them. In the future, our sales and cash flow projections won't have to try to adjust for seasonality.


3. Supply Region's Effect on Reorder Level

Eastern Asia has a significantly lower reorder level than in other regions. No other regions had a statistically significant difference.


4. Customer's Generating the Most Revenue

After running the test on each of the 80+ customers we have, we find that four customers generate significantly high enough revenues to reject the null hypothesis, Ernst Handel ($398,626.32), QUICK-Stop ($393,821.00), Hungry Owl All-Night Grocers ($136,219.55), and Save-a-lot Markets ($431,146.19) compared to the average revenue generated of $45,537.33.


Future Work


In future work, I'll want to look into what makes Eastern Asia so different in terms of reorder level. Are they able to deliver items faster? Do their products sell infrequently enough not to affect the bottom line? Is there an opportunity to generate more revenues if we reorder their product earlier?


I'll also want to identify which customers generated the least amount of revenue and see if I can tie any features to these two groups of customers. This information will help inform us of which marketing channels are yielding the highest revenues.