
Milk Basket EDA Question on Dairy DataSet
Question:
You are given a dataset with the following tables: Dairy_Products
, Dairy_Farms
, Milk_Production
, Suppliers
, and Dairy_Sales
. Perform an advanced exploratory data analysis (EDA) to answer the following questions:
Product Performance Analysis:
- Identify which dairy product has the highest sales (in terms of quantity sold and total revenue). Visualize the performance of each product over time.
- Analyze the average price per unit for each product type (Liquid vs. Solid) and see if there is any significant difference.
Farm Production & Performance Analysis:
- Identify which farm produces the most milk in a given period and analyze how production correlates with product sales.
- Determine whether there is any correlation between milk quality rating and the sales of products derived from that milk.
Sales Analysis:
- Find out if there are any seasonal patterns in the sales of dairy products.
- Identify the top customers based on total amount spent and quantity purchased.
Supplier Performance Analysis:
- Analyze the supplier data and determine if there is any correlation between the suppliers’ products and sales. Which supplier’s products are the highest-selling?
- Investigate if product expiration dates are affecting sales, and whether products closer to expiry sell more quickly.
Time-based Trends:
- Visualize the monthly trends in production and sales of dairy products over the last year.
Example
Dairy_Farms:
Milk_Production:
Suppliers:
Dairy_Sales:
Solution:
Here’s a step-by-step Python solution using pandas, matplotlib, seaborn, and other relevant libraries.
First, let’s assume that you have loaded the data from your SQL database into Pandas DataFrames (let’s call them df_dairy_products, df_dairy_farms, df_milk_production, df_suppliers, df_dairy_sales).
Dairy_Products
:
product _id | product _name | product _type | quantity | unit _price | production _date | expiration _date | supplier _id |
---|---|---|---|---|---|---|---|
1 | Milk | Liquid | 500 | 1.20 | 2025-01-10 | 2025-02-10 | 1 |
2 | Cheese | Solid | 200 | 5.50 | 2025-01-12 | 2025-04-12 | 2 |
3 | Butter | Solid | 150 | 3.00 | 2025-01-15 | 2025-02-15 | 3 |
farm _id | farm _name | farm _location | owner _name | established _year | contact _number |
---|---|---|---|---|---|
1 | Green Valley Dairy | Texas, USA | John Doe | 1998 | 123-456-7890 |
2 | Sunny Acres Dairy | California, USA | Jane Smith | 2005 | 987-654-3210 |
production_id | farm_id | date_of_production | liters_produced | quality_rating |
---|---|---|---|---|
1 | 1 | 2025-01-10 | 1000 | A |
2 | 1 | 2025-01-12 | 1200 | A |
3 | 2 | 2025-01-14 | 800 | B |
supplier_id | supplier_name | contact_info | |
---|---|---|---|
1 | Dairy Direct | 123 Dairy St, Texas, USA | rinku@gmail.com |
2 | CheeseWorld | 456 Cheese Ln, Wisconsin, USA | rahul@gmail.com |
3 | ButterBest | 789 Butter Rd, Ohio, USA | jalpa@gmail.com |
sale_id | product_id | sale_date | quantity_sold | total_amount | customer_id |
---|---|---|---|---|---|
1 | 1 | 2025-01-17 | 50 | 60.00 | 101 |
2 | 2 | 2025-01-17 | 30 | 165.00 | 102 |
3 | 3 | 2025-01-17 | 20 | 60.00 | 103 |
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Sample DataFrames assumed to be loaded from SQL
# Example: df_dairy_products, df_dairy_farms, df_milk_production, df_suppliers, df_dairy_sales
1. Product Performance Analysis
a. Identify the highest-selling products (quantity and revenue):
# Merge dairy products with sales to calculate total sales
product_sales = pd.merge(df_dairy_products, df_dairy_sales, on='product_id')
# Calculate total revenue for each sale
product_sales['total_revenue'] = product_sales['quantity_sold'] * product_sales['unit_price']
# Group by product to get total quantity sold and total revenue
product_performance = product_sales.groupby('product_name').agg(
total_quantity_sold=('quantity_sold', 'sum'),
total_revenue=('total_revenue', 'sum')
).reset_index()
# Sort products by total revenue and quantity sold
top_selling_products = product_performance.sort_values(by=['total_revenue', 'total_quantity_sold'], ascending=False)
# Display the top-selling products
print(top_selling_products.head())
# Visualization
plt.figure(figsize=(10, 6))
sns.barplot(data=top_selling_products, x='product_name', y='total_revenue')
plt.title('Top-Selling Dairy Products by Revenue')
plt.xticks(rotation=45)
plt.show()
b. Average price comparison between Liquid and Solid products:
# Group by product type (Liquid vs Solid) and calculate average unit price
product_type_price = df_dairy_products.groupby('product_type').agg(
avg_price=('unit_price', 'mean')
).reset_index()
# Visualize average prices for each product type
plt.figure(figsize=(8, 5))
sns.barplot(data=product_type_price, x='product_type', y='avg_price')
plt.title('Average Price per Unit for Each Product Type')
plt.show()
2. Farm Production & Performance Analysis
a. Identify which farm produces the most milk:
b. Correlation between milk quality rating and sales of derived products:
# Merge milk production data with farms
farm_production = pd.merge(df_milk_production, df_dairy_farms, on='farm_id')
# Group by farm to calculate total milk produced
farm_performance = farm_production.groupby('farm_name').agg(
total_milk_produced=('liters_produced', 'sum')
).reset_index()
# Sort farms by total milk produced
top_farms = farm_performance.sort_values(by='total_milk_produced', ascending=False)
# Display top farms
print(top_farms.head())
# Merge sales data with farm production to check correlation
product_sales_with_quality = pd.merge(product_sales, df_milk_production[['production_id', 'quality_rating']], on='production_id')
# Group by quality rating and calculate total revenue and quantity sold
quality_sales_analysis = product_sales_with_quality.groupby('quality_rating').agg(
total_quantity_sold=('quantity_sold', 'sum'),
total_revenue=('total_revenue', 'sum')
).reset_index()
# Visualize the analysis
plt.figure(figsize=(10, 6))
sns.barplot(data=quality_sales_analysis, x='quality_rating', y='total_revenue')
plt.title('Sales Performance Based on Milk Quality Rating')
plt.show()
3. Sales Analysis
a. Seasonal patterns in sales:
b. Identify top customers based on total amount spent:
# Convert sale_date to datetime if not already
df_dairy_sales['sale_date'] = pd.to_datetime(df_dairy_sales['sale_date'])
# Extract month from sale date
df_dairy_sales['month'] = df_dairy_sales['sale_date'].dt.month
# Group by month to get total sales
monthly_sales = df_dairy_sales.groupby('month').agg(
total_sales=('total_amount', 'sum')
).reset_index()
# Plot monthly sales trends
plt.figure(figsize=(10, 6))
sns.lineplot(data=monthly_sales, x='month', y='total_sales', marker='o')
plt.title('Monthly Sales Trend')
plt.xticks(ticks=np.arange(1, 13), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()
# Group by customer_id to calculate total amount spent
customer_spending = df_dairy_sales.groupby('customer_id').agg(
total_spent=('total_amount', 'sum'),
total_quantity=('quantity_sold', 'sum')
).reset_index()
# Sort by total amount spent
top_customers = customer_spending.sort_values(by='total_spent', ascending=False)
# Display the top customers
print(top_customers.head())
4. Supplier Performance Analysis
a. Correlation between suppliers’ products and sales:
b. Product expiration and sales correlation:
# Merge sales data with suppliers' product information
product_sales_with_suppliers = pd.merge(df_dairy_products, df_dairy_sales, on='product_id')
product_sales_with_suppliers = pd.merge(product_sales_with_suppliers, df_suppliers[['supplier_id', 'supplier_name']], on='supplier_id')
# Group by supplier and calculate total sales
supplier_sales = product_sales_with_suppliers.groupby('supplier_name').agg(
total_sales=('total_amount', 'sum'),
total_quantity_sold=('quantity_sold', 'sum')
).reset_index()
# Visualize supplier performance
plt.figure(figsize=(12, 6))
sns.barplot(data=supplier_sales, x='supplier_name', y='total_sales')
plt.title('Supplier Performance Based on Sales')
plt.xticks(rotation=45)
plt.show()
# Calculate the difference between current date and expiration date
df_dairy_products['days_to_expiry'] = (pd.to_datetime(df_dairy_products['expiration_date']) - pd.to_datetime('today')).dt.days
# Merge sales data with product expiry data
product_sales_with_expiry = pd.merge(df_dairy_sales, df_dairy_products[['product_id', 'days_to_expiry']], on='product_id')
# Analyze the relationship between days to expiry and sales
plt.figure(figsize=(10, 6))
sns.scatterplot(data=product_sales_with_expiry, x='days_to_expiry', y='total_amount')
plt.title('Sales and Days to Expiry Correlation')
plt.show()
5. Time-based Trends
a. Visualize the monthly trends in production and sales:
# Create a column for month in both milk production and dairy sales
df_milk_production['month'] = pd.to_datetime(df_milk_production['date_of_production']).dt.month
df_dairy_sales['month'] = pd.to_datetime(df_dairy_sales['sale_date']).dt.month
# Monthly production data
monthly_production = df_milk_production.groupby('month').agg(
total_production=('liters_produced', 'sum')
).reset_index()
# Monthly sales data
monthly_sales = df_dairy_sales.groupby('month').agg(
total_sales=('total_amount', 'sum')
).reset_index()
# Merge production and sales for comparison
monthly_trends = pd.merge(monthly_production, monthly_sales, on='month')
# Plot the comparison of production and sales trends
plt.figure(figsize=(10, 6))
sns.lineplot(data=monthly_trends, x='month', y='total_production', label='Milk Production', marker='o')
sns.lineplot(data=monthly_trends, x='month', y='total_sales', label='Sales', marker='o')
plt.title('Monthly Production vs Sales')
plt.xticks(ticks=np.arange(1, 13), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.legend()
plt.show()