Predicting Customer Churn in Telecom Industry

In this post, my focus is to try and build a simple model to predict whether a customer will churn or not given a dataset.

This is part of my series documenting my small experiments using R or Python to solve data analysis / data science problems. These experiments might be redundant and may already have been written and blogged about by various people, but this is more of a personal diary and part of my personal learning process. I hope I’m able to engage and inspire anyone else who is going through a similar process. If a more knowledgeable person stumbles upon this blog and thinks there is a much better way to do things, or that I have erred somewhere, please feel free to share feedback and help not just me, but everyone, grow together as a community.

Customer attrition is a big issue in any industry. Not surprisingly, one of the major focuses of a data scientist is to reduce customer attrition and increase customer retention. It is relatively easy to predict and detect in industries where a monthly billing service exists, e.g. telecom, internet, and streaming services. From an organizational perspective, it is always cheaper to retain an existing customer than to spend money acquiring a new one.

You can find the dataset that I have used here, and you can find the code in my GitHub repo here.

Dataset

This dataset has a total of 11 columns: a column called Churn, which is our dependent variable, and 10 columns which are our predictor variables.

We will try to identify the variables that are significant in predicting customer churn and build a logistic regression model that accurately predicts it.

Dataset Sample

Step 1: Import the dataset

First things first, let’s import the dataset in R, check the variables, and check whether there are any null or missing values in the dataset. While we are at it, we should check the five-number summary of the dataset as well.

No Missing values
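The import-and-check step from the screenshot above can be sketched roughly like this in R (the file name telecom_churn.csv is an assumption; point read.csv at your own copy of the dataset):

```r
# Load the dataset (file name is an assumption; use your local path)
churn_data <- read.csv("telecom_churn.csv", header = TRUE)

# Inspect structure and variable types
str(churn_data)

# Check for missing values, column by column
colSums(is.na(churn_data))

# Five-number summary (plus mean) for every column
summary(churn_data)
```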

Step 2: Convert to factor variables and perform outlier treatment

As you can notice, the variables ‘Churn’, ‘ContractRenewal’ and ‘DataPlan’ are actually factor variables, so let’s convert them to factors.

Converting to Factor variables
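A minimal sketch of the conversion (column names are taken from the text above):

```r
# Convert the binary columns to factors
factor_cols <- c("Churn", "ContractRenewal", "DataPlan")
churn_data[factor_cols] <- lapply(churn_data[factor_cols], as.factor)

str(churn_data[factor_cols])  # verify the conversion
```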

The next step is to treat outliers, as outliers may affect our model’s performance. In our five-number summary above, I noticed that for the variables AccountWeeks, DataUsage and CustServCalls there is a huge difference between the third-quartile value and the maximum value. This suggests that there may be outliers in those variables, so let’s check and treat them accordingly.

Outlier Treatment

So, as suspected, there were a few outliers present in those variables. I capped the variables at the 99th percentile to limit the influence of extreme values on the model.

The rest of the variables, DayMins, DayCalls, MonthlyCharge, RoamMins and OverageFee, seem to be in acceptable ranges without any outliers, so I don’t think we need to cap any of them.
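The capping described above could look something like this (the exact treatment in the screenshots may differ; this sketch simply clips each variable at its own 99th percentile):

```r
# Cap a numeric vector at its 99th percentile
cap_at_p99 <- function(x) {
  p99 <- quantile(x, probs = 0.99, na.rm = TRUE)
  pmin(x, p99)
}

outlier_cols <- c("AccountWeeks", "DataUsage", "CustServCalls")
churn_data[outlier_cols] <- lapply(churn_data[outlier_cols], cap_at_p99)

summary(churn_data[outlier_cols])  # max should now equal the 99th percentile
```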

Step 3: Creating dummy Variables

The next step is to create dummy variables for all our categorical variables. If you don’t know what a dummy variable is, here is the explanation from socialresearchmethods.net:

A dummy variable is a numerical variable used in regression analysis to represent subgroups of the sample in your study. In research design, a dummy variable is often used to distinguish different treatment groups. In the simplest case, we would use a 0,1 dummy variable where a person is given a value of 0 if they are in the control group or a 1 if they are in the treated group. Dummy variables are useful because they enable us to use a single regression equation to represent multiple groups. This means that we don’t need to write out separate equation models for each subgroup. The dummy variables act like ‘switches’ that turn various parameters on and off in an equation. Another advantage of a 0,1 dummy-coded variable is that even though it is a nominal-level variable you can treat it statistically like an interval-level variable.

Creating Dummy Variables

The next step is to remove one dummy variable from each categorical variable, so that a variable with n levels is represented by (n - 1) dummies. This avoids perfect collinearity between the dummies (the “dummy variable trap”).
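One way to sketch both steps in base R is model.matrix(), which expands each factor into 0/1 dummies and automatically drops one reference level per factor:

```r
# model.matrix() expands factors into 0/1 dummies and drops one
# reference level per factor, leaving (n - 1) dummies each.
# [, -1] removes the intercept column it adds by default.
predictors <- model.matrix(Churn ~ ., data = churn_data)[, -1]
model_data <- data.frame(Churn = churn_data$Churn, predictors)
```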

Step 4: Divide the data

Let’s split the dataset into training and testing datasets; as is the convention, we will do a 70:30 split.

Divide the dataset
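A minimal 70:30 split in base R (the seed value is an assumption, used only for reproducibility):

```r
set.seed(123)  # seed value is an assumption, for reproducibility
train_idx <- sample(seq_len(nrow(model_data)),
                    size = floor(0.7 * nrow(model_data)))
train <- model_data[train_idx, ]
test  <- model_data[-train_idx, ]
```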

Step 5: Time for first model

Okay, let’s build our first logistic regression model with all the variables.

First model
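The full-model fit can be sketched as follows (assuming the train data frame from the previous step):

```r
# Logistic regression on all predictors
model1 <- glm(Churn ~ ., data = train, family = binomial(link = "logit"))
summary(model1)  # coefficient table, significance stars, and AIC
```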

So, this model shows that 3 variables are significant in predicting churn: ContractRenewal_Not_Renewed (negative impact), CustServCalls and RoamMins.

Also, note the AIC score of 1383.2. Among candidate models, the one with the lowest AIC score is the most preferred.

AIC: The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models.

Now that we have built our initial regression model considering all the predictors, let’s check its significance.

Step 6: Check Model Significance and robustness

Next, we check the model’s significance using the log-likelihood ratio test.

Interpretation of the log-likelihood ratio test: H0: all betas are zero; H1: at least one beta is nonzero. The log likelihood of the intercept-only model is -859.04, while that of the full model is -680.61. So we can say that 1 - (-680.61 / -859.04) = 20.77% of the uncertainty inherent in the intercept-only model is explained by the full model. The chi-squared likelihood ratio statistic is significant, and the p-value lets us reject the null hypothesis in favour of the alternative that at least one beta is nonzero. So the model is significant.
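The likelihood ratio test can be reproduced roughly as follows (the lmtest package is an assumption; you can also compare log likelihoods by hand with logLik()):

```r
library(lmtest)

# Intercept-only (null) model vs. the full model from Step 5
null_model <- glm(Churn ~ 1, data = train, family = binomial)
lrtest(null_model, model1)  # chi-squared statistic and p-value
```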

Model robustness: next, we check whether our model is robust by using McFadden’s pseudo-R-squared test.

The McFadden’s pseudo-R-squared test suggests that about 20.77% of the variance in the data is captured by our model which, for a pseudo-R-squared, suggests a reasonably robust model.
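McFadden’s pseudo-R-squared can be computed directly from the two log likelihoods (pscl::pR2() reports the same quantity under “McFadden”):

```r
# McFadden's pseudo-R-squared: 1 - LL(full) / LL(intercept-only)
null_model <- glm(Churn ~ 1, data = train, family = binomial)
mcfadden <- 1 - as.numeric(logLik(model1) / logLik(null_model))
mcfadden  # expected to be around 0.2077 for this model
```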

Going back to the output of our first model, we noticed that 3 variables are significant in predicting the Churn,

ContractRenewal_Not_Renewed , CustServCalls & RoamMins

Also, ContractRenewal, which is a categorical variable, has a negative impact on customer churn.

However, we need to find out whether there are more significant variables that we should consider.

Let’s find out the power of Odds and Probability of the variables impacting on Customer Churn.

Step 7: Odds Explanatory Power

Odds Ratio

Odds Probability

Interpretation: If a particular variable, as shown in the following table, is increased by one unit, the odds of customer churn (vs. not churning) and the probability of customer churn change as shown in the table.

For categorical variables: e.g. when the customer renews the contract, the odds that the customer will churn are 0.12 compared to when the customer does not renew. Similarly, when the customer opts for a data plan, the odds that the customer will churn are 0.32 compared to when the customer doesn’t opt for one. What this tells us is that there are additional significant variables that we should consider in our model.
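The odds ratios and the implied probabilities in the tables can be sketched from the fitted coefficients (model1 is the full model from Step 5):

```r
# Odds ratios: exponentiate the logistic regression coefficients
odds <- exp(coef(model1))

# Probability implied by each odds ratio: odds / (1 + odds)
prob <- odds / (1 + odds)

round(cbind(odds_ratio = odds, probability = prob), 3)
```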

Step 8: Model Accuracy: Confusion Matrix

Now that we have confirmed there are additional significant variables, let’s check the performance of our model using a classification table / confusion matrix.

confusion matrix

Interpretation: 1821 out of (287 + 1821) customers were correctly identified as churned, which translates to a positive predictive value of 86.3%. 10 out of 13 customers were correctly identified as not churned, which translates to a negative predictive value of 76.9%. At 86.3%, the model provides good accuracy. Sensitivity is 0.99 and specificity is 0.033.

confusion matrix test

Interpretation: 1024 out of (181 + 1024) customers were correctly identified as churned, which translates to a positive predictive value of 84.9%. 5 out of (5 + 2) customers were correctly identified as not churned, which translates to a negative predictive value of 71.4%. At 84.9%, the model provides good accuracy. Sensitivity is 0.998 and specificity is 0.026.
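The confusion matrices above can be sketched like this (the 0.5 cutoff is an assumption; the post does not state the threshold used):

```r
# Predicted probabilities on the test set, classified at a 0.5 cutoff
pred_prob  <- predict(model1, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Classification table: rows = actual, columns = predicted
conf_mat <- table(Actual = test$Churn, Predicted = pred_class)
conf_mat

# Overall accuracy: correct predictions over all predictions
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```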

Thus, the model shows pretty much similar performance on both Training as well as Testing datasets.

Further Steps: Model Refining

So far, we have checked the model’s significance, robustness and accuracy, and we are pretty happy with what we have. We could conclude our model here and move on, but we shouldn’t: we can further refine the model to improve its accuracy. When we checked the odds ratios earlier, we discovered variables which could be significant in predicting whether the customer will churn or not. Our next steps should be:

  1. Variable Selection (we can use step() function)

  2. Checking for Model Significance, robustness, accuracy

  3. Finally interpreting results
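Step 1 of the plan above, AIC-based variable selection with step(), might look like this:

```r
# Stepwise selection by AIC, adding and dropping terms in both directions
refined_model <- step(model1, direction = "both", trace = FALSE)
summary(refined_model)  # compare its AIC against the full model's
```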

Advanced topics include understanding variable interactions and further refining the models.

I hope this was helpful to anyone who reads it :). Let me know if there are any doubts or clarifications that you seek.
