Milk Basket Python interview questions

Question:

You are given a dataset with the following tables: Dairy_Products, Dairy_Farms, Milk_Production, Suppliers, and Dairy_Sales. Perform an advanced exploratory data analysis (EDA) to answer the following questions:

Product Performance Analysis:
- Identify which dairy product has the highest sales (in terms of quantity sold and total revenue). Visualize the performance of each product over time.
- Analyze the average price per unit for each product type (Liquid vs. Solid) and see if there is any significant difference.
Farm Production & Performance Analysis:
- Identify which farm produces the most milk in a given period and analyze how production correlates with product sales.
- Determine whether there is any correlation between milk quality rating and the sales of products derived from that milk.
Sales Analysis:
- Find out if there are any seasonal patterns in the sales of dairy products.
- Identify the top customers based on total amount spent and quantity purchased.
Supplier Performance Analysis:
- Analyze the supplier data and determine if there is any correlation between the suppliers’ products and sales. Which supplier’s products are the highest-selling?
- Investigate if product expiration dates are affecting sales, and whether products closer to expiry sell more quickly.
Time-based Trends:
- Visualize the monthly trends in production and sales of dairy products over the last year.

Example Dairy_Products:

product _id	product _name	product _type	quantity	unit _price	production _date	expiration _date	supplier _id
1	Milk	Liquid	500	1.20	2025-01-10	2025-02-10	1
2	Cheese	Solid	200	5.50	2025-01-12	2025-04-12	2
3	Butter	Solid	150	3.00	2025-01-15	2025-02-15	3

Dairy_Farms:

farm _id	farm _name	farm _location	owner _name	established _year	contact _number
1	Green Valley Dairy	Texas, USA	John Doe	1998	123-456-7890
2	Sunny Acres Dairy	California, USA	Jane Smith	2005	987-654-3210

Milk_Production:

production_id	farm_id	date_of_production	liters_produced	quality_rating
1	1	2025-01-10	1000	A
2	1	2025-01-12	1200	A
3	2	2025-01-14	800	B

Suppliers:

supplier_id	supplier_name	contact_info	email
1	Dairy Direct	123 Dairy St, Texas, USA	rinku@gmail.com
2	CheeseWorld	456 Cheese Ln, Wisconsin, USA	rahul@gmail.com
3	ButterBest	789 Butter Rd, Ohio, USA	jalpa@gmail.com

Dairy_Sales:

sale_id	product_id	sale_date	quantity_sold	total_amount	customer_id
1	1	2025-01-17	50	60.00	101
2	2	2025-01-17	30	165.00	102
3	3	2025-01-17	20	60.00	103

Solution: Here’s a step-by-step Python solution using pandas, matplotlib, seaborn, and other relevant libraries. First, let’s assume that you have loaded the data from your SQL database into Pandas DataFrames (let’s call them df_dairy_products, df_dairy_farms, df_milk_production, df_suppliers, df_dairy_sales).

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Sample DataFrames assumed to be loaded from SQL

# Example: df_dairy_products, df_dairy_farms, df_milk_production, df_suppliers, df_dairy_sales

1. Product Performance Analysis

a. Identify the highest-selling products (quantity and revenue):

# Merge dairy products with sales to calculate total sales
product_sales = pd.merge(df_dairy_products, df_dairy_sales, on='product_id')

# Calculate total revenue for each sale
product_sales['total_revenue'] = product_sales['quantity_sold'] * product_sales['unit_price']

# Group by product to get total quantity sold and total revenue
product_performance = product_sales.groupby('product_name').agg(
    total_quantity_sold=('quantity_sold', 'sum'),
    total_revenue=('total_revenue', 'sum')
).reset_index()

# Sort products by total revenue and quantity sold
top_selling_products = product_performance.sort_values(by=['total_revenue', 'total_quantity_sold'], ascending=False)

# Display the top-selling products
print(top_selling_products.head())

# Visualization
plt.figure(figsize=(10, 6))
sns.barplot(data=top_selling_products, x='product_name', y='total_revenue')
plt.title('Top-Selling Dairy Products by Revenue')
plt.xticks(rotation=45)
plt.show()

b. Average price comparison between Liquid and Solid products:

# Group by product type (Liquid vs Solid) and calculate average unit price
product_type_price = df_dairy_products.groupby('product_type').agg(
    avg_price=('unit_price', 'mean')
).reset_index()

# Visualize average prices for each product type
plt.figure(figsize=(8, 5))
sns.barplot(data=product_type_price, x='product_type', y='avg_price')
plt.title('Average Price per Unit for Each Product Type')
plt.show()

2. Farm Production & Performance Analysis

a. Identify which farm produces the most milk:

b. Correlation between milk quality rating and sales of derived products:

# Merge milk production data with farms
farm_production = pd.merge(df_milk_production, df_dairy_farms, on='farm_id')

# Group by farm to calculate total milk produced
farm_performance = farm_production.groupby('farm_name').agg(
    total_milk_produced=('liters_produced', 'sum')
).reset_index()

# Sort farms by total milk produced
top_farms = farm_performance.sort_values(by='total_milk_produced', ascending=False)

# Display top farms
print(top_farms.head())

# Merge sales data with farm production to check correlation
product_sales_with_quality = pd.merge(product_sales, df_milk_production[['production_id', 'quality_rating']], on='production_id')

# Group by quality rating and calculate total revenue and quantity sold
quality_sales_analysis = product_sales_with_quality.groupby('quality_rating').agg(
    total_quantity_sold=('quantity_sold', 'sum'),
    total_revenue=('total_revenue', 'sum')
).reset_index()

# Visualize the analysis
plt.figure(figsize=(10, 6))
sns.barplot(data=quality_sales_analysis, x='quality_rating', y='total_revenue')
plt.title('Sales Performance Based on Milk Quality Rating')
plt.show()

3. Sales Analysis

a. Seasonal patterns in sales:

b. Identify top customers based on total amount spent:

# Convert sale_date to datetime if not already
df_dairy_sales['sale_date'] = pd.to_datetime(df_dairy_sales['sale_date'])

# Extract month from sale date
df_dairy_sales['month'] = df_dairy_sales['sale_date'].dt.month

# Group by month to get total sales
monthly_sales = df_dairy_sales.groupby('month').agg(
    total_sales=('total_amount', 'sum')
).reset_index()

# Plot monthly sales trends
plt.figure(figsize=(10, 6))
sns.lineplot(data=monthly_sales, x='month', y='total_sales', marker='o')
plt.title('Monthly Sales Trend')
plt.xticks(ticks=np.arange(1, 13), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

# Group by customer_id to calculate total amount spent
customer_spending = df_dairy_sales.groupby('customer_id').agg(
    total_spent=('total_amount', 'sum'),
    total_quantity=('quantity_sold', 'sum')
).reset_index()

# Sort by total amount spent
top_customers = customer_spending.sort_values(by='total_spent', ascending=False)

# Display the top customers
print(top_customers.head())

4. Supplier Performance Analysis

a. Correlation between suppliers’ products and sales:

b. Product expiration and sales correlation:

# Merge sales data with suppliers' product information
product_sales_with_suppliers = pd.merge(df_dairy_products, df_dairy_sales, on='product_id')
product_sales_with_suppliers = pd.merge(product_sales_with_suppliers, df_suppliers[['supplier_id', 'supplier_name']], on='supplier_id')

# Group by supplier and calculate total sales
supplier_sales = product_sales_with_suppliers.groupby('supplier_name').agg(
    total_sales=('total_amount', 'sum'),
    total_quantity_sold=('quantity_sold', 'sum')
).reset_index()

# Visualize supplier performance
plt.figure(figsize=(12, 6))
sns.barplot(data=supplier_sales, x='supplier_name', y='total_sales')
plt.title('Supplier Performance Based on Sales')
plt.xticks(rotation=45)
plt.show()

# Calculate the difference between current date and expiration date
df_dairy_products['days_to_expiry'] = (pd.to_datetime(df_dairy_products['expiration_date']) - pd.to_datetime('today')).dt.days

# Merge sales data with product expiry data
product_sales_with_expiry = pd.merge(df_dairy_sales, df_dairy_products[['product_id', 'days_to_expiry']], on='product_id')

# Analyze the relationship between days to expiry and sales
plt.figure(figsize=(10, 6))
sns.scatterplot(data=product_sales_with_expiry, x='days_to_expiry', y='total_amount')
plt.title('Sales and Days to Expiry Correlation')
plt.show()

5. Time-based Trends

a. Visualize the monthly trends in production and sales:

# Create a column for month in both milk production and dairy sales
df_milk_production['month'] = pd.to_datetime(df_milk_production['date_of_production']).dt.month
df_dairy_sales['month'] = pd.to_datetime(df_dairy_sales['sale_date']).dt.month

# Monthly production data
monthly_production = df_milk_production.groupby('month').agg(
    total_production=('liters_produced', 'sum')
).reset_index()

# Monthly sales data
monthly_sales = df_dairy_sales.groupby('month').agg(
    total_sales=('total_amount', 'sum')
).reset_index()

# Merge production and sales for comparison
monthly_trends = pd.merge(monthly_production, monthly_sales, on='month')

# Plot the comparison of production and sales trends
plt.figure(figsize=(10, 6))
sns.lineplot(data=monthly_trends, x='month', y='total_production', label='Milk Production', marker='o')
sns.lineplot(data=monthly_trends, x='month', y='total_sales', label='Sales', marker='o')
plt.title('Monthly Production vs Sales')
plt.xticks(ticks=np.arange(1, 13), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.legend()
plt.show()

Milk Basket EDA Question on Dairy DataSet

1. Product Performance Analysis

b. Average price comparison between Liquid and Solid products:

2. Farm Production & Performance Analysis

3. Sales Analysis

4. Supplier Performance Analysis

b. Product expiration and sales correlation:

5. Time-based Trends

Recent Posts

Recent Post

Top 10 Python Script

Amazon SQL interview questions

Deloitte SQL interview questions 2

Deloitte SQL interview questions

AI Learning

Resourses

Quick Links

Support