# Import Libraries

In [None]:
import numpy as np
import pandas as pd
from datetime import timedelta
from matplotlib import pyplot as plt
import seaborn as sns

import warnings
# Ignore all warnings
warnings.filterwarnings("ignore")

# Context:

A large e-commerce company has contracted us to provide delivery services from several of their warehouses (or sites) to customers' doorsteps at several locations across the country.

The dataset contains information about the performance of our delivery agents working at our client's sites across cities in India.

The idea of this exercise is to gain insights into their performance and make data-driven strategic recommendations for improvement.

***NOTE: The dataset contains information for only one month, i.e., June 2023***

# Features in the Dataset
* Site Code - A unique ID belonging to the site at which our delivery agents work.
* City - Name of the city in which the site is located.
* Vehicle Type - "Van DCD" means the delivery agent drives his own van and delivers packages. "Bike" means the delivery agent rides his own bike and delivers packages. "Van D+DA" means the delivery agent doesn't drive, but is instead driven around the city by a van driver.
* Cluster - Cities are grouped into clusters based on certain characteristics.
* Date - The date that the delivery agent reported for work.
* Delivery Agent ID - A unique identifier assigned to each delivery agent who works with us.
* Shift - "A" indicates morning, "B" indicates afternoon, "C" indicates evening.
* Unsuccessful_Attempts - The number of packages that the delivery agent attempted to deliver, but the delivery was not possible for various reasons.
* Process_Deviations - The number of times the delivery agent deviated from the process during the shift.
* Delivered - The number of packages that the delivery agent delivered during the shift.
* Customer_Rejects - The number of packages that customers rejected when the delivery agent went to deliver them.
* Picked_up_Customer_Returns - The number of packages the delivery agent picked up from customers who wanted to return something that they had ordered earlier.
* Picked_up_Seller - The number of packages that the delivery agent picked up from sellers on the platform. These packages would be sent out for delivery to customers the next day.
* billing_amt - The amount that we bill our clients for the services rendered by our delivery agents.


# Question 1

Load the dataset into a dataframe and name it "daily_df". Display a sample of 10 random rows, display the shape of the dataset, the datatypes of each column, and check for missing values.
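The inspection steps asked for here can be sketched as follows. Since the file name and format of the dataset are not stated, the snippet reads a tiny, made-up CSV from memory; the rows and the reduced set of columns are purely illustrative.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: three invented rows, subset of columns.
csv_text = """Site Code,City,Vehicle Type,Date,Delivery Agent ID,Shift,Delivered
S1,Pune,Bike,2023-06-01,DA_1,A,40
S1,Pune,Bike,2023-06-02,DA_1,B,
S2,Mumbai,Van DCD,2023-06-01,DA_2,A,95
"""

# With the real data this would be pd.read_csv("<path to the dataset>").
daily_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Up to 10 random rows (the toy frame has fewer than 10).
sample_rows = daily_df.sample(n=min(10, len(daily_df)), random_state=0)
shape = daily_df.shape            # (rows, columns)
dtypes = daily_df.dtypes          # datatype of each column
missing = daily_df.isna().sum()   # missing values per column

print(sample_rows, shape, dtypes, missing, sep="\n\n")
```

Passing `parse_dates` is a choice made for the sketch; without it the "Date" column would be read as text.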

# Question 2

**Part-A**

You will notice that there are delivery agents who have used different vehicle types in different shifts.

For each delivery agent, find the vehicle type that was used for the majority of their shifts, then overwrite the minority values with that majority value.

For example, if DA_1 used a bike in 20 shifts and a van in 2 shifts, the "Vehicle Type" for all 22 shifts should be set to "Bike", as that was used for the majority of the shifts.
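One possible way to apply the majority rule is a per-agent `groupby` followed by `transform`, shown here on invented rows. How to break a tie between two equally frequent vehicle types is not specified; the sketch simply takes the first value returned by `Series.mode()`, which is an assumption.

```python
import pandas as pd

# Toy data: DA_1 mostly rides a bike, DA_2 mostly uses "Van D+DA".
daily_df = pd.DataFrame({
    "Delivery Agent ID": ["DA_1"] * 4 + ["DA_2"] * 3,
    "Vehicle Type": ["Bike", "Bike", "Van DCD", "Bike",
                     "Van D+DA", "Van D+DA", "Bike"],
})

# For each agent, take the most frequent vehicle type and broadcast it to
# every row of that agent. mode() returns sorted values, so a tie falls back
# to the alphabetically first type (assumed tie rule).
daily_df["Vehicle Type"] = (
    daily_df.groupby("Delivery Agent ID")["Vehicle Type"]
    .transform(lambda s: s.mode().iloc[0])
)

print(daily_df)
```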

**Part-B:**

Total assigned packages can be calculated using the following formula:

Total Assigned = Unsuccessful_Attempts + Delivered + Customer_Rejects + Picked_up_Customer_Returns + Picked_up_Seller

Use this formula and add a new column in the dataframe called "Total Assigned"

**Part-C:**

"Productivity" is a metric that measures how much work a delivery agent does during a shift. It can later be compared with the rest of the workforce to identify delivery agents who are outperforming as well as those who are underperforming.

It can be calculated using the following formula:

Productivity = Delivered + Customer_Rejects + Picked_up_Customer_Returns + Picked_up_Seller.

Calculate "Productivity" and show it in a new column in the daily_df.
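The two formulas (Total Assigned from Part-B and Productivity from Part-C) differ only in the Unsuccessful_Attempts term, so one can be derived from the other. A minimal sketch on invented numbers:

```python
import pandas as pd

# Two invented shifts.
daily_df = pd.DataFrame({
    "Unsuccessful_Attempts": [3, 0],
    "Delivered": [40, 55],
    "Customer_Rejects": [2, 1],
    "Picked_up_Customer_Returns": [4, 0],
    "Picked_up_Seller": [1, 6],
})

productive_cols = [
    "Delivered",
    "Customer_Rejects",
    "Picked_up_Customer_Returns",
    "Picked_up_Seller",
]

# Row-wise sum of the four productive columns.
# Note: sum(axis=1) skips NaN, i.e. treats a missing count as 0.
daily_df["Productivity"] = daily_df[productive_cols].sum(axis=1)

# Total Assigned = Productivity + Unsuccessful_Attempts.
daily_df["Total Assigned"] = (
    daily_df["Productivity"] + daily_df["Unsuccessful_Attempts"]
)

print(daily_df[["Productivity", "Total Assigned"]])
```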


**Part-D:**

Analyze the data and report the ranges of productivity that different vehicle types have in different types of shifts. What are your observations?
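As a starting point, the ranges can be tabulated with a grouped aggregation. The data below is invented, and the choice of summary statistics (count, min, median, max) is only one option; quantiles or box plots would serve equally well.

```python
import pandas as pd

# Invented productivity values for two vehicle types and two shifts.
daily_df = pd.DataFrame({
    "Vehicle Type": ["Bike", "Bike", "Bike", "Van DCD", "Van DCD", "Van DCD"],
    "Shift": ["A", "A", "B", "A", "A", "B"],
    "Productivity": [30, 50, 20, 80, 120, 60],
})

# One row per (vehicle type, shift) combination.
productivity_ranges = (
    daily_df.groupby(["Vehicle Type", "Shift"])["Productivity"]
    .agg(["count", "min", "median", "max"])
)

print(productivity_ranges)
```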

**Part E:**

Low productivity adversely affects both service levels and the unit-level economics of the business. Curbing it is essential to running a good operation.

Now that you have a sense of the productivity levels of different vehicle types in each shift, use this information to formulate logic to classify all the samples across all shifts as either "productivity-ok" or "productivity-low". Add this information into a column called "Productivity Category".

* Keep in mind that the productivity of one vehicle type is not comparable to the other. A bike may travel faster when there is more traffic. A van may be able to carry a much larger number of packages, etc...

* Keep in mind that productivity from Shift A is not comparable to productivity from Shifts B or C or vice versa. Each shift should have its own threshold for defining low productivity.

* Most importantly, keep in mind that the ideal solution would be to identify a smaller set of people who contribute to the largest part of the problem.

* You are not being given any specific formula here to do this. You will need to think creatively and also give an explanation of your method and why you chose to do things that way.
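One way to respect the constraints above is to compute a separate threshold for every (vehicle type, shift) combination and flag only the shifts that fall below it. The sketch below uses the 10th percentile of each group as the threshold; that cut-off is an assumption chosen for illustration, not a value taken from the exercise, and the data is invented.

```python
import numpy as np
import pandas as pd

# Invented data: five bike shifts and five van shifts, all in Shift A.
daily_df = pd.DataFrame({
    "Vehicle Type": ["Bike"] * 5 + ["Van DCD"] * 5,
    "Shift": ["A"] * 10,
    "Productivity": [10, 40, 42, 45, 50, 30, 90, 95, 100, 110],
})

LOW_QUANTILE = 0.10  # assumed cut-off; tune it after looking at the distributions

# Threshold per (vehicle type, shift), broadcast back to every row of the group.
threshold = (
    daily_df.groupby(["Vehicle Type", "Shift"])["Productivity"]
    .transform(lambda s: s.quantile(LOW_QUANTILE))
)

daily_df["Productivity Category"] = np.where(
    daily_df["Productivity"] < threshold,
    "productivity-low",
    "productivity-ok",
)

print(daily_df)
```

A low quantile keeps the flagged set small by construction, which fits the goal of isolating the few shifts that account for most of the problem.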

# Question 3

**Part A:**

The daily_df contains information of how each delivery agent performed on each day that they reported in June 2023. Create a new dataframe that summarizes the performance of each delivery agent for the entire month of June 2023, using the data from the daily_df.

Name this new dataframe "monthly_df".

The new dataframe should contain the following columns:
* Delivery Agent ID
* Vehicle Type
* Cluster
* City
* Site Code
* Shifts Worked
* Total Assigned
* Unsuccessful_Attempts
* Process_Deviations
* Productivity
* billing_amt
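The monthly summary can be sketched with a named aggregation. This assumes that every row of daily_df is exactly one shift, and that an agent's vehicle type, cluster, city, and site do not change within the month (so taking the first value is safe); both assumptions should be checked against the real data. All rows below are invented.

```python
import pandas as pd

daily_df = pd.DataFrame({
    "Delivery Agent ID": ["DA_1", "DA_1", "DA_2"],
    "Vehicle Type": ["Bike", "Bike", "Van DCD"],
    "Cluster": ["C1", "C1", "C2"],
    "City": ["Pune", "Pune", "Mumbai"],
    "Site Code": ["S1", "S1", "S2"],
    "Shift": ["A", "B", "A"],
    "Total Assigned": [50, 60, 100],
    "Unsuccessful_Attempts": [3, 5, 10],
    "Process_Deviations": [0, 1, 2],
    "Productivity": [47, 55, 90],
    "billing_amt": [500.0, 550.0, 1200.0],
})

monthly_df = (
    daily_df.groupby("Delivery Agent ID", as_index=False)
    .agg(**{
        # Descriptive attributes: assumed constant per agent.
        "Vehicle Type": ("Vehicle Type", "first"),
        "Cluster": ("Cluster", "first"),
        "City": ("City", "first"),
        "Site Code": ("Site Code", "first"),
        # One row per shift, so counting rows gives shifts worked.
        "Shifts Worked": ("Shift", "count"),
        # Additive measures are summed over the month.
        "Total Assigned": ("Total Assigned", "sum"),
        "Unsuccessful_Attempts": ("Unsuccessful_Attempts", "sum"),
        "Process_Deviations": ("Process_Deviations", "sum"),
        "Productivity": ("Productivity", "sum"),
        "billing_amt": ("billing_amt", "sum"),
    })
)

print(monthly_df)
```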

**Part B:**

"Delivery Success Rate (DSR)" is a metric that measures the quality of a delivery agent's work. An agent with a low DSR has a higher share of unsuccessful attempts.

DSR can be calculated using the following formula:

DSR = Productivity / Total Assigned.

Using this formula, calculate and add a new column called "DSR" in the monthly_df. Round the values to 2 decimal places.
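Because Total Assigned equals Productivity plus Unsuccessful_Attempts, DSR is the same as 1 - Unsuccessful_Attempts / Total Assigned. A small sketch on invented totals follows; the handling of agents with zero assigned packages (left as NaN here) is an assumption, since the exercise does not say what to do in that case.

```python
import numpy as np
import pandas as pd

# Invented monthly totals; DA_3 has nothing assigned.
monthly_df = pd.DataFrame({
    "Delivery Agent ID": ["DA_1", "DA_2", "DA_3"],
    "Productivity": [102, 90, 0],
    "Total Assigned": [110, 100, 0],
})

# Replace a zero denominator with NaN so the ratio is NaN rather than inf.
monthly_df["DSR"] = (
    monthly_df["Productivity"] / monthly_df["Total Assigned"].replace(0, np.nan)
).round(2)

print(monthly_df)
```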

**Part C:**

The productivity in the monthly_df is an aggregate of the whole month's productivity. Create a new column called "Avg_Productivity" that contains the average productivity per shift of each delivery agent. The values in this column should be expressed as integers.

Do the same with Process_Deviations. Calculate the average number of deviations per shift worked and put it into a column called "Avg_Deviations". Here the values should be rounded to 2 decimal places.
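Both averages divide a monthly total by "Shifts Worked". The exercise asks for integer values in "Avg_Productivity" without saying whether to round or truncate; the sketch below rounds to the nearest integer, which is an assumption. The numbers are invented.

```python
import pandas as pd

monthly_df = pd.DataFrame({
    "Shifts Worked": [2, 4],
    "Productivity": [104, 351],
    "Process_Deviations": [1, 3],
})

# Average productivity per shift, rounded and stored as an integer.
monthly_df["Avg_Productivity"] = (
    (monthly_df["Productivity"] / monthly_df["Shifts Worked"]).round().astype(int)
)

# Average deviations per shift, rounded to 2 decimal places.
monthly_df["Avg_Deviations"] = (
    monthly_df["Process_Deviations"] / monthly_df["Shifts Worked"]
).round(2)

print(monthly_df)
```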

**Part D:**

Examine the distributions of the features in the monthly_df in both forms - tabular and plotted.

All columns that contain categorical information (data that has no ordinal value) should be shown as bar graphs. All columns that contain numerical data (data with ordinal value) should be shown as distribution curves or histograms.

State your observations and inferences.

**Part-E**

Now that you have understood the distributions of the data in the monthly_df, you need to:

1.   Classify each agent on the basis of their DSR as either "DSR-ok" or "DSR-low"
2.   Classify each agent on the basis of the number of deviations they've had as either "deviations-ok" or "deviations-high"

This classification should be represented in two new columns in the monthly_df called "DSR Category" and "Deviations Category"

Note that DSR and Deviations are comparable across all vehicle types, all shifts and all geographical areas. For example - there is no valid reason for vans in Mumbai to have lower DSR or higher deviations when compared to bikes in Hyderabad or Bangalore or Pune. The same benchmark of quality applies to all vehicle types working all shifts in all cities.  

Once again, you are not being given any explicit thresholds to use for this classification. You are expected to use your analysis to decide the thresholds.

An effective solution would be one which identifies a smaller group of people that contribute to the largest part of the problem.
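Since a single benchmark applies to everyone, one candidate rule is a global outlier fence. The sketch below uses Tukey's fences (1.5 times the interquartile range beyond the quartiles) on invented values; this is only one of several defensible choices, and it assumes that "Avg_Deviations" is the deviations measure being classified.

```python
import numpy as np
import pandas as pd

# Invented values: one agent stands out on both metrics.
monthly_df = pd.DataFrame({
    "DSR": [0.95, 0.96, 0.97, 0.94, 0.96, 0.70],
    "Avg_Deviations": [0.0, 0.1, 0.0, 0.2, 0.1, 1.5],
})

def tukey_fences(values, k=1.5):
    """Return (lower, upper) outlier fences based on the IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

dsr_lower, _ = tukey_fences(monthly_df["DSR"])
_, dev_upper = tukey_fences(monthly_df["Avg_Deviations"])

# Low DSR is bad, so only the lower fence matters for DSR.
monthly_df["DSR Category"] = np.where(
    monthly_df["DSR"] < dsr_lower, "DSR-low", "DSR-ok"
)
# High deviations are bad, so only the upper fence matters here.
monthly_df["Deviations Category"] = np.where(
    monthly_df["Avg_Deviations"] > dev_upper, "deviations-high", "deviations-ok"
)

print(monthly_df)
```

Fences of this kind flag only the tail of each distribution, which keeps the flagged group small. Whether that small group really accounts for most of the problem still has to be verified on the real data.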


# Question 4

**Part-A**

Create a new dataframe that shows the number of A, B, and C shifts each delivery agent has done in the whole month. Also include in a separate column the total number of low productivity shifts each delivery agent has had in the whole month. Call this new dataframe "shifts_df". Make sure that there are no nan values in the new dataframe.

To be clear, the new dataframe (shifts_df) should have the following columns:

* Delivery Agent ID
* Total no. of 'A' Shifts
* Total no. of 'B' Shifts
* Total no. of 'C' Shifts
* Total no. of low productivity shifts
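A cross-tabulation gives the per-shift counts with zeros instead of NaN for shifts an agent never worked. The sketch below runs on invented rows and assumes that the "Productivity Category" column from Question 2 already exists in daily_df.

```python
import pandas as pd

daily_df = pd.DataFrame({
    "Delivery Agent ID": ["DA_1", "DA_1", "DA_1", "DA_2", "DA_2"],
    "Shift": ["A", "A", "B", "C", "C"],
    "Productivity Category": [
        "productivity-ok", "productivity-low", "productivity-ok",
        "productivity-low", "productivity-low",
    ],
})

# Count shifts per agent and shift type; reindex guarantees A, B and C
# columns even if one shift type never occurs in the data.
shift_counts = (
    pd.crosstab(daily_df["Delivery Agent ID"], daily_df["Shift"])
    .reindex(columns=["A", "B", "C"], fill_value=0)
    .rename(columns=lambda s: f"Total no. of '{s}' Shifts")
)

# Number of low-productivity shifts per agent (True counts as 1).
low_counts = (
    daily_df["Productivity Category"].eq("productivity-low")
    .groupby(daily_df["Delivery Agent ID"]).sum()
    .rename("Total no. of low productivity shifts")
)

shifts_df = shift_counts.join(low_counts).reset_index()
shifts_df.columns.name = None  # drop the leftover "Shift" axis label

print(shifts_df)
```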

**Part B**

Merge the monthly_df and the shifts_df so that all the features are in one single dataframe. Call this new dataframe "final_df".

Then calculate and create a new column called "%_low_prod_shifts". For the values in this column divide the total number of low productivity shifts by the total no. of shifts worked and round the result to two decimal places.

**Part C**

Create three new columns - "%_shifts_A", "%_shifts_B", and "%_shifts_C".

If a delivery agent had a total of 10 shifts, and 5 out of them were Shift A, then "%_shifts_A" should be 0.5.

All values in the new columns should be rounded to 2 decimal places.
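The merge from Part B and the share columns from Part B and Part C can be sketched together. The frames below are invented, and the shift-count column names are assumed to match the list given in Question 4, Part-A. Uppercase shift letters are used in all three share column names.

```python
import pandas as pd

monthly_df = pd.DataFrame({
    "Delivery Agent ID": ["DA_1", "DA_2"],
    "Shifts Worked": [10, 4],
})
shifts_df = pd.DataFrame({
    "Delivery Agent ID": ["DA_1", "DA_2"],
    "Total no. of 'A' Shifts": [5, 0],
    "Total no. of 'B' Shifts": [3, 1],
    "Total no. of 'C' Shifts": [2, 3],
    "Total no. of low productivity shifts": [1, 2],
})

# validate= raises an error if an agent appears more than once on either side.
final_df = monthly_df.merge(
    shifts_df, on="Delivery Agent ID", how="left", validate="one_to_one"
)

# Share of low-productivity shifts.
final_df["%_low_prod_shifts"] = (
    final_df["Total no. of low productivity shifts"] / final_df["Shifts Worked"]
).round(2)

# Share of each shift type.
for shift in ["A", "B", "C"]:
    final_df[f"%_shifts_{shift}"] = (
        final_df[f"Total no. of '{shift}' Shifts"] / final_df["Shifts Worked"]
    ).round(2)

print(final_df)
```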

# Question 5

**Part-A:**

*Irregularity:*

When we say a delivery agent is irregular to work, we mean that they were associated with us for a certain period, however, during this period they were absent frequently.

For example - A delivery agent's first day of work was 5th June and the last day of work was 25th June. However during this period, the agent worked for only 10 out of the possible 20 days. We would then say that this delivery agent is "irregular".

Now consider another scenario: a delivery agent has worked for only 5 out of the possible 30 days in the whole month, and those 5 days fall towards the end of the month. This probably means that they joined us late, so they should not be classified as "irregular"; ideally they should be classified as "new".

Keeping this context in mind, classify each delivery agent in the dataset as "regular" or "irregular" or "new". This classification should be shown in a new dataframe called "regularity_df". This new dataframe should contain only two columns - "Delivery Agent ID" and "Regularity Classification"

In this question, we are not explicitly prescribing the logic or the threshold of working days to be used for this classification. You are expected to get creative and take a calculated decision on how to go about this.

Once all the delivery agents have been classified, explain the logic and reasoning behind the method you have chosen.
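One possible rule, sketched on invented attendance data: measure each agent's tenure as the span between their first and last working day, compute the fraction of that span actually worked, and treat agents whose first day falls late in the month as new joiners. Both thresholds below (a first working day on or after 21 June, and an attendance of at least 80%) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Invented attendance: day-of-month numbers in June 2023 for each agent.
attendance = {
    "DA_1": range(1, 31),      # worked every day
    "DA_2": range(5, 26, 2),   # every other day between the 5th and 25th
    "DA_3": range(26, 31),     # only the last five days
}
daily_df = pd.DataFrame(
    [(agent, pd.Timestamp(2023, 6, day))
     for agent, days in attendance.items() for day in days],
    columns=["Delivery Agent ID", "Date"],
)

NEW_JOINER_CUTOFF = pd.Timestamp(2023, 6, 21)  # assumed threshold
MIN_ATTENDANCE = 0.8                           # assumed threshold

summary = daily_df.groupby("Delivery Agent ID")["Date"].agg(
    first_day="min", last_day="max", days_worked="nunique"
)
# Tenure counts both the first and the last day.
summary["tenure_days"] = (summary["last_day"] - summary["first_day"]).dt.days + 1
summary["attendance"] = summary["days_worked"] / summary["tenure_days"]

# Conditions are checked in order: late joiners first, then attendance.
summary["Regularity Classification"] = np.select(
    [summary["first_day"] >= NEW_JOINER_CUTOFF,
     summary["attendance"] >= MIN_ATTENDANCE],
    ["new", "regular"],
    default="irregular",
)

regularity_df = summary.reset_index()[
    ["Delivery Agent ID", "Regularity Classification"]
]
print(regularity_df)
```

Measuring attendance against tenure rather than the full month avoids penalising agents who joined or left mid-month, which is the distinction the two scenarios above draw.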

**Part-B**

Merge the regularity_df into the final_df so that all the features are available in one single dataframe.

# Question 6

**Part-A**

Keeping in mind that the final_df contains a mix of categorical and numerical variables, find out if any of the features are correlated to each other using an appropriate method. Find out and report if any variables have a strong positive or negative correlation.

Explain which method(s) you have chosen and why.
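Because the data mixes variable types, no single coefficient covers every pair. As one illustration, the sketch below uses Spearman rank correlation for numeric pairs and a hand-computed Cramér's V (without bias correction) for categorical pairs; other choices, such as the correlation ratio for numeric-categorical pairs, are equally valid. The data is invented, and `cramers_v` is a helper written for this sketch, not a library function.

```python
import numpy as np
import pandas as pd

final_df = pd.DataFrame({
    "DSR": [0.90, 0.92, 0.95, 0.97, 0.99, 0.85],
    "Avg_Deviations": [0.8, 0.6, 0.4, 0.2, 0.1, 1.0],
    "Avg_Productivity": [40, 55, 42, 60, 47, 52],
    "Vehicle Type": ["Bike", "Bike", "Van DCD", "Van DCD", "Bike", "Van DCD"],
    "Regularity Classification": ["regular", "irregular", "regular",
                                  "regular", "irregular", "regular"],
})

# Numeric vs numeric: Spearman captures monotonic (not only linear) relations.
numeric_corr = final_df.select_dtypes("number").corr(method="spearman")

def cramers_v(a, b):
    """Cramér's V for two categorical series (0 = no association, 1 = perfect).

    Assumes each series has at least two distinct categories.
    """
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(observed.shape) - 1))))

print(numeric_corr.round(2))
print(cramers_v(final_df["Vehicle Type"], final_df["Regularity Classification"]))
```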

**Part-B**

Explain what you have gathered and understood about this data after having examined the correlations.

# Question 7

Following a comprehensive analysis of the final_df, the next phase involves categorizing delivery agents into distinct clusters. The primary aim of establishing these clusters is to uncover specific and noteworthy attributes within each group, enabling the formulation of targeted strategies for improvement.

In this scenario, the objective is to segment delivery agents into various groups, facilitating an understanding of which group requires enhancements in specific metrics. Typically, high-performing delivery agents exhibit elevated DSR (Delivery Success Rate), minimal deviations, heightened productivity, and consistent work attendance. Conversely, underperforming agents display contrasting characteristics. Furthermore, there will be individuals who fall within intermediate ranges.

During the creation of agent groups, it's essential to facilitate the breakdown into smaller subsets within the population. This breakdown aids in pinpointing precise factors contributing to subpar performance.

**Part-A**

* Keeping the context in mind, choose and implement a method to perform this grouping. Create a new column to indicate which group each delivery agent belongs to. Explain why you have chosen this method.
* How many groups have you chosen to create and why?
______________________________________________________________________________
*IMPORTANT NOTES:*
* You don't necessarily have to use every single feature for the grouping. You may choose to leave out some variables and explain why you have chosen to leave them out.

* Remember that the final_df contains several features which were created from original features in the raw data so that analysis and grouping becomes easier. Carefully go through all the features we currently have and discard what you think is not required when grouping on the basis of performance.

* Remember that performances are comparable across geographical areas. There is no reason to have different standards in different cities.

* You may choose to use hard coded rules for grouping the delivery agents (OR) you may even choose to use ML algorithms. Both approaches are valid as long as you are able to justify the logic and reasoning behind your approach.

* If you're finding it difficult to decide how many groups to divide them into, don't worry, this is common when dealing with data in the real world. Finding an appropriate solution requires trying out multiple approaches and analyzing results to select the method that seems to make most sense or fetches the highest scores on certain metrics.
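As an illustration of the hard-coded-rules option mentioned in the notes above, the sketch below labels each agent by the combination of problem flags they trigger, so that every group name states directly what needs improving. The input columns are assumed to exist from earlier questions, the 30% cut-off for low-productivity shifts is an assumption, and the "Performance Group" column name is hypothetical. A clustering algorithm would be an equally valid alternative.

```python
import pandas as pd

# Invented agents with the category columns built in earlier questions.
final_df = pd.DataFrame({
    "Delivery Agent ID": ["DA_1", "DA_2", "DA_3", "DA_4"],
    "DSR Category": ["DSR-ok", "DSR-low", "DSR-ok", "DSR-low"],
    "Deviations Category": ["deviations-ok", "deviations-ok",
                            "deviations-high", "deviations-high"],
    "%_low_prod_shifts": [0.05, 0.10, 0.40, 0.60],
    "Regularity Classification": ["regular", "regular", "irregular", "new"],
})

LOW_PROD_SHARE = 0.30  # assumed cut-off for "too many low-productivity shifts"

# One boolean flag per performance dimension.
flags = pd.DataFrame({
    "low DSR": final_df["DSR Category"].eq("DSR-low"),
    "high deviations": final_df["Deviations Category"].eq("deviations-high"),
    "low productivity": final_df["%_low_prod_shifts"].gt(LOW_PROD_SHARE),
    "irregular": final_df["Regularity Classification"].eq("irregular"),
})

def label(row):
    """Join the names of all triggered flags into a group label."""
    issues = [name for name, flagged in row.items() if flagged]
    return " + ".join(issues) if issues else "no issues"

# "Performance Group" is a hypothetical column name.
final_df["Performance Group"] = flags.apply(label, axis=1)

print(final_df[["Delivery Agent ID", "Performance Group"]])
```

With four flags there are at most 16 groups; sparsely populated combinations can be merged after inspecting the group sizes.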

**Part-B**

Now that the delivery agents are grouped, explore the descriptive statistics or distributions across all the features of each group. Based on that, make your recommendations for what areas need to be improved for each group.

If you'd like to, you may use the same methods used in previous questions to explore distributions.