Exploratory data analysis using ggplot


3rd Mar 2019

R ggplot data-analysis exploratory-data-analysis

In this post, I perform exploratory data analysis on a transactional dataset using R and ggplot2.

This is part of my series documenting small experiments in R and solving data analysis problems. These experiments may be redundant and may already have been written and blogged about by various people, but this is more of a personal diary and a personal learning process. If anyone gets inspired or learns something new along the way, that’s the best thing that could happen. If someone more knowledgeable than me stumbles upon this blog and thinks there is a much better way to do things, or that I have erred somewhere, please feel free to share the feedback and help everyone grow.

EDA (Exploratory Data Analysis) means taking a deep dive into a dataset, trying to understand its patterns and deriving prima facie hypotheses from the data. It’s an important skill to develop and is the first step for any data scientist. In this post, I explore the Instacart dataset, which was made available online by Instacart; here is the link.

Link to my GitHub repo.

My objective was to explore the dataset and sharpen my R skills (mainly ggplot2).

This analysis is incomplete because my laptop couldn’t handle the number of transactions in the dataset, and RStudio stopped responding when I performed certain operations.

I will pick it up later and publish a part 2 of this post. Let’s consider this part 1 of my little project.

About the Dataset:

This dataset contains data on more than 3 million grocery orders from about 200,000 Instacart users; all the data is anonymised for privacy. For each user, there are between 4 and 100 of their orders, with the sequence of products purchased in each order. The following details are also present: the day of the week and hour of the day the order was placed, and a relative measure of time between orders.

The dataset consists of the following tables:

1. Orders: (3.4 million rows, 206k users)
- order_id: order identifier
- user_id: customer identifier
- eval_set: which evaluation set this order belongs in (see SET described below)
- order_number: the order sequence number for this user (1 = first, n = nth)
- order_dow: the day of the week the order was placed on
- order_hour_of_day: the hour of the day the order was placed on
- days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)

2. Products: (50k rows)
- product_id: product identifier
- product_name: name of the product
- aisle_id: foreign key
- department_id: foreign key

3. Aisles: (134 rows)
- aisle_id: aisle identifier
- aisle: the name of the aisle

4. Departments: (21 rows)
- department_id: department identifier
- department: the name of the department

5. Order_products_SET: (30 million+ rows)

These files specify which products were purchased in each order. order_products__prior.csv contains previous order contents for all customers. ‘reordered’ indicates that the customer has a previous order that contains the product. Note that some orders will have no reordered items.

- order_id: foreign key
- product_id: foreign key
- add_to_cart_order: order in which each product was added to cart
- reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the following three evaluation sets (eval_set in orders):

- “prior”: orders prior to that user’s most recent order (~3.2m orders)
- “train”: training data supplied (~131k orders)
- “test”: test data reserved for machine learning models (~75k orders)
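Since the tables are linked through foreign keys, most questions need them joined first. Here is a minimal sketch of that join using base R’s merge(), on tiny made-up data frames that only mimic the column layout described above (the rows and values are illustrative, not taken from the real files):

```r
# Tiny made-up tables that mimic the column layout described above
products <- data.frame(
  product_id    = c(1, 2),
  product_name  = c("Banana", "Organic Baby Spinach"),
  aisle_id      = c(24, 83),
  department_id = c(4, 4)
)
aisles <- data.frame(
  aisle_id = c(24, 83),
  aisle    = c("fresh fruits", "fresh vegetables")
)
departments <- data.frame(department_id = 4, department = "produce")
order_products <- data.frame(
  order_id          = c(10, 10, 11),
  product_id        = c(1, 2, 1),
  add_to_cart_order = c(1, 2, 1),
  reordered         = c(1, 0, 1)
)

# Follow the foreign keys: order line -> product -> aisle -> department
full <- merge(order_products, products, by = "product_id")
full <- merge(full, aisles, by = "aisle_id")
full <- merge(full, departments, by = "department_id")

print(full)
```

On the real 30 million+ row table a join like this is memory hungry, so it may be worth aggregating first and joining the (much smaller) result.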

Preparation:

I started by looking over all the tables in the dataset to understand what kinds of fields and data are present.

I realised that some fields are categorical in nature and should be treated as such. For example, the order_hour_of_day, order_dow and eval_set fields from the orders table are definitely categorical, so we should convert them to categorical variables. Similarly, the product_name, department and aisle fields are also categorical and are treated as such.

Once I had some idea about the dataset, I started thinking about what I would like to know from a dataset like this. The easier questions that come to mind are: Which are the most sold products? Which are the most reordered products? When are orders placed? How many varieties of products are there?
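As a rough illustration of that conversion, here is a minimal sketch using factor() on a toy stand-in for the orders table. The rows are made up, and the explicit levels (0 to 6 for days, 0 to 23 for hours) are my assumption about the value ranges:

```r
# Toy stand-in for the orders table (values are made up for illustration)
orders <- data.frame(
  order_id          = 1:4,
  eval_set          = c("prior", "prior", "train", "test"),
  order_dow         = c(0, 1, 1, 6),
  order_hour_of_day = c(10, 14, 9, 23)
)

# Convert the categorical columns to factors; explicit levels keep
# days and hours in their natural order on plot axes, even when a
# value does not occur in the data
orders$eval_set          <- factor(orders$eval_set)
orders$order_dow         <- factor(orders$order_dow, levels = 0:6)
orders$order_hour_of_day <- factor(orders$order_hour_of_day, levels = 0:23)

str(orders)
```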

Analysis:

The way I want to analyse this is to first look into the orders and try to answer as many questions as possible, then look into the customers, and then into the aisles and departments.

So let’s dig in. First things first: let’s plot the top 50 ordered products. (There is no significance to the number 50.)

It turns out that bananas, strawberries, spinach and avocado are the most ordered products in the dataset.

All of the top 50 products are farm produce, either organic or non-organic. That makes sense if you think about it: vegetables and fruits are bought more regularly than most other products, so this shouldn’t be surprising.

Building on that thought, produce should also be the most re-ordered category. Remember: the reordered flag is set to 1 if the product has been ordered by the user in the past, and 0 otherwise.
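A plot like this can be sketched as follows. This is not necessarily the exact code behind the chart in this post; it assumes the order lines have already been joined to product names, and it uses a handful of made-up rows in place of the real table:

```r
library(ggplot2)

# Toy stand-in for order_products joined to products (made-up rows)
order_products <- data.frame(
  product_name = c("Banana", "Banana", "Banana",
                   "Organic Strawberries", "Organic Strawberries",
                   "Organic Baby Spinach")
)

top_n_products <- 50  # arbitrary cut-off, as in the post

# Count order lines per product, sort descending, keep the top N
counts <- as.data.frame(table(order_products$product_name),
                        stringsAsFactors = FALSE)
names(counts) <- c("product_name", "n")
counts <- head(counts[order(-counts$n), ], top_n_products)

# Horizontal bars, with the most ordered product at the top
p <- ggplot(counts, aes(x = reorder(product_name, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Product", y = "Number of orders")
print(p)
```

Filtering the rows to reordered == 1 before counting would give the corresponding re-ordered chart.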

Highest Re-ordered products (top 50)

This tells us that almost the same set of products is re-ordered by users, as the same products appear in the re-ordered list.

So, are there products that are in the top 50 most ordered list but not in the top 50 most re-ordered list? That would mean not many users thought of ordering those products again, maybe because of the quality or for some other reason.

Organic Italian Parsley Bunch and Blueberries are two products that are among the most ordered products but are not in the top 50 re-ordered products:

##                            Var1  Freq
## 1 Organic Italian Parsley Bunch 60621
## 2                   Blueberries 55946

So, Organic Italian Parsley Bunch and Blueberries are not re-ordered by as many users as the other products in the top 50 list.
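The comparison between the two lists is a set difference. A minimal sketch with setdiff(), using short hypothetical name vectors in place of the two real top 50 lists:

```r
# Hypothetical top-N name vectors (the real ones would hold 50 names each)
top_ordered <- c("Banana", "Organic Strawberries", "Blueberries",
                 "Organic Italian Parsley Bunch")
top_reordered <- c("Banana", "Organic Strawberries", "Organic Whole Milk")

# Products in the most-ordered list that are missing
# from the most-reordered list
ordered_not_reordered <- setdiff(top_ordered, top_reordered)
print(ordered_not_reordered)
```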

Let’s check what percentage of the ordered products are actually reorders, which tells us whether this information is insightful or not.

Percentage of products reordered

reordered    count percentage
        0 13307953  0.4103025
        1 19126536  0.5896975

Roughly 59% of all the products ordered are reorders. So, it seems this information is important.

Our dataset also records when customers place their orders, both the day and the time. Let’s see when customers order the most.
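A summary like the one above is just a count per flag value divided by the total. A minimal sketch on a made-up reordered column (five rows instead of 32 million):

```r
# Made-up reordered flags standing in for the order_products table
order_products <- data.frame(reordered = c(1, 0, 1, 1, 0))

# Count rows per flag value, then divide by the total for the share
summary_tbl <- as.data.frame(table(reordered = order_products$reordered))
names(summary_tbl)[2] <- "count"
summary_tbl$percentage <- summary_tbl$count / sum(summary_tbl$count)

print(summary_tbl)
```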

From the above graph, it’s clear that the maximum number of orders are placed between 10:00 AM and 4:00 PM on days 0 and 1. I’m assuming here that day 0 is Saturday and day 1 is Sunday (it’s not clear from the dataset which day of the week each number stands for), which would make the most sense: people buy groceries over the weekend. Under the same assumption, a good chunk of people also order groceries on Thursday and Friday.

Our dataset also contains a variable that records the days passed since the customer’s last order. Analysing this variable tells us how frequently customers order.
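One way to look at day and hour together is a heatmap of order counts. The sketch below uses geom_tile() on randomly generated data, so it shows only the mechanics and not the real pattern; the choice of a heatmap is mine and may differ from the chart used above:

```r
library(ggplot2)

set.seed(1)
# Randomly generated stand-in for the orders table (not real data)
orders <- data.frame(
  order_dow         = sample(0:6, 500, replace = TRUE),
  order_hour_of_day = sample(0:23, 500, replace = TRUE)
)

# Count orders for every day/hour combination
counts <- as.data.frame(table(order_dow = orders$order_dow,
                              order_hour_of_day = orders$order_hour_of_day))

# One tile per day/hour cell, coloured by the number of orders
p <- ggplot(counts, aes(x = order_hour_of_day, y = order_dow, fill = Freq)) +
  geom_tile() +
  labs(x = "Hour of day", y = "Day of week", fill = "Orders")
print(p)
```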

Days since last order

Looking at the above graph, we can easily see that among the regular customers of Instacart, the maximum number of reorders happen on the 7th day since the previous order, which ties into the earlier finding that people shop for groceries on a weekly basis. The 30th day also towers tall because Instacart capped this variable at 30 days, so we can’t tell how many of those orders actually occurred on the 30th day and how many came later.

I’ll continue this analysis in the next blog post, where I’ll go deep into the aisles, departments and customers.
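A chart of this variable can be sketched as a bar per day count. The values below are made up, and I am assuming the column is called days_since_prior_order, as I believe it is in the published orders.csv (the table description above shortens it to days_since_prior):

```r
library(ggplot2)

# Made-up values; NA marks a user's first order, which has no prior order
orders <- data.frame(
  days_since_prior_order = c(NA, 7, 7, 14, 30, 30, 3, 7)
)

# Drop first orders before plotting
with_prior <- orders[!is.na(orders$days_since_prior_order), , drop = FALSE]

# One bar per day count; the bar at 30 also absorbs every longer gap
p <- ggplot(with_prior, aes(x = days_since_prior_order)) +
  geom_bar() +
  labs(x = "Days since prior order", y = "Number of orders")
print(p)
```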
