In this project we use churn data from Kaggle. The main goal of the project is to predict whether a customer will change their telco provider (churn).
The training dataset contains 4250 samples. Each sample has 19 features and one boolean target variable, "churn", which indicates the class of the sample. The 19 input features and the target variable are:
- state, string. 2-letter code of the US state of customer residence
- account_length, numerical. Number of months the customer has been with the current telco provider
- area_code, string="area_code_AAA" where AAA = 3 digit area code.
- international_plan, (yes/no). Whether the customer has an international plan.
- voice_mail_plan, (yes/no). Whether the customer has a voice mail plan.
- number_vmail_messages, numerical. Number of voice-mail messages.
- total_day_minutes, numerical. Total minutes of day calls.
- total_day_calls, numerical. Total number of day calls.
- total_day_charge, numerical. Total charge of day calls.
- total_eve_minutes, numerical. Total minutes of evening calls.
- total_eve_calls, numerical. Total number of evening calls.
- total_eve_charge, numerical. Total charge of evening calls.
- total_night_minutes, numerical. Total minutes of night calls.
- total_night_calls, numerical. Total number of night calls.
- total_night_charge, numerical. Total charge of night calls.
- total_intl_minutes, numerical. Total minutes of international calls.
- total_intl_calls, numerical. Total number of international calls.
- total_intl_charge, numerical. Total charge of international calls.
- number_customer_service_calls, numerical. Number of calls to customer service.
- churn, (yes/no). Customer churn - target variable.
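As a first orientation step, the data can be loaded with pandas and the target distribution inspected. The sketch below is a minimal example; the file path is an assumption about where the Kaggle CSV is stored locally.

```python
import pandas as pd

# Path is an assumption; point it at your local copy of the Kaggle training file.
df = pd.read_csv("data/train.csv")

print(df.shape)                                   # expected (4250, 20): 19 features + churn
print(df["churn"].value_counts(normalize=True))   # class balance of the target
print(df.dtypes)                                  # which columns are categorical vs. numerical
```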
The project covers the following topics and techniques:
- Exploratory Data Analysis (EDA)
- Inferential Statistics
- Data Visualisation
- Oversampling & Undersampling for Class Imbalance
- Feature Engineering
- Feature Selection
- Cross Validation
- Clustering
- Predictive Modeling
- Machine Learning
- Hyperparameter Tuning
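The imbalance handling and cross-validation steps interact: resampling has to happen inside each training fold rather than before the split. A minimal sketch of that pattern is shown below, using an imbalanced-learn pipeline with SMOTEENN and a placeholder logistic regression on synthetic data; the real notebooks use the actual churn features.

```python
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features, with a similar class imbalance (~14% positives).
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.86, 0.14], random_state=42)

# Putting SMOTEENN inside the pipeline means it is applied only to the training folds,
# so the cross-validation scores are not inflated by resampled validation data.
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("resample", SMOTEENN(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(scores.mean(), scores.std())
```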
Tools and libraries used:
- Python, Jupyter Notebook
- Pandas, numpy
- Seaborn, matplotlib
- ImbLearn
- Scikit-Learn (SkLearn), AutoSklearn
- MLFlow
- SHAP
- XGBoost, LightGBM, Catboost
- HyperOpt
Brief information on each Python script and notebook is given in the table below; for full details, see the files themselves. They are listed in the order in which they were developed.
Notebook | Description |
---|---|
Research | First notebook with EDA and Data Visualisation |
Preprocessing | Python script with functions for Data Handling |
Feature Engineering | Notebook with experiments for feature engineering (all necessary engineering techniques were included in preprocessing.py) |
KMeans Research | Notebook with the first clustering approach (unsuccessful) |
KMeans + SVM | Cluster label as a feature + Support Vector Machine classifier (no handling of class imbalance; see the first sketch after this table) |
Undersample + KMeans + SVM | Undersampling + everything else as in the notebook above |
SMOTE + KMeans + SVM | Oversampling and Undersampling techniques (SMOTE, SMOTETomek, SMOTEENN) |
Logistic Regression | Basic logistic regression with SMOTE, SMOTEENN, and no imbalance handling |
SkLearn Models + XGB | Different basic models and XGBoost tried on SMOTE & SMOTEENN data (best so far: XGBoost with SMOTEENN) |
AutoSkLearn | Auto SkLearn implementation (works only in Google Colab) |
Feature Selection | Notebook with different feature selection techniques (the final selection function was included in preprocessing.py; see the feature selection sketch after this table) |
XGB Tuning | Tuning of the XGBoost model, using Hyperopt for the parameter search and MLflow for experiment tracking (see the tuning sketch after this table) |
CatBoost Tuning | Tuning of the CatBoost model |
LightGBM Tuning | Tuning of the LightGBM model |
Train | Final script for training and saving the XGBoost model (see the last sketch after this table) |
Model Inference | Final script for predicting on the test data and saving submission.csv for Kaggle |
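The "cluster as a feature" idea from the KMeans + SVM notebooks can be sketched roughly as follows: fit KMeans on the scaled training features, append the cluster label as an extra column, and train an SVM on the augmented matrix. This is an illustrative reconstruction on synthetic data, not the notebooks' exact code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn data.
X, y = make_classification(n_samples=1000, weights=[0.86, 0.14], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Fit KMeans on the training data only and append the cluster id as an extra feature.
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_train_s)
X_train_km = np.column_stack([X_train_s, km.predict(X_train_s)])
X_test_km = np.column_stack([X_test_s, km.predict(X_test_s)])

svm = SVC(kernel="rbf", class_weight="balanced").fit(X_train_km, y_train)
print(svm.score(X_test_km, y_test))
```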
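For the Feature Selection notebook, several techniques were compared. One commonly used option, recursive feature elimination with cross-validation (RFECV), is sketched below as an illustration; the estimator and synthetic data are placeholders, and the final selection function in preprocessing.py may use a different method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=1000, n_features=25, n_informative=8, random_state=42)

# Recursive feature elimination with cross-validation, scored on F1.
selector = RFECV(
    RandomForestClassifier(n_estimators=200, random_state=42),
    step=1, cv=5, scoring="f1",
)
selector.fit(X, y)

print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask over the original columns
```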
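The XGB Tuning notebook combines Hyperopt for the search and MLflow for tracking. A rough sketch of that wiring is below; the search space, metric, and data are placeholders rather than the values actually used in the notebook.

```python
import mlflow
import xgboost as xgb
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.86, 0.14], random_state=42)

# Illustrative search space; the real one is defined in the tuning notebook.
space = {
    "max_depth": hp.choice("max_depth", [3, 4, 5, 6]),
    "learning_rate": hp.uniform("learning_rate", 0.01, 0.3),
    "n_estimators": hp.choice("n_estimators", [100, 200, 400]),
}

def objective(params):
    with mlflow.start_run(nested=True):
        model = xgb.XGBClassifier(eval_metric="logloss", **params)
        score = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
        mlflow.log_params(params)
        mlflow.log_metric("cv_f1", score)
    # Hyperopt minimises, so return the negative score as the loss.
    return {"loss": -score, "status": STATUS_OK}

with mlflow.start_run(run_name="xgb_tuning"):
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25, trials=Trials())
    print(best)
```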
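Finally, the Train and Model Inference scripts boil down to fitting the tuned XGBoost model, persisting it, and writing predictions for the Kaggle test set. The sketch below is a simplified stand-in: the file paths, the one-hot encoding, the id column, and the submission format are assumptions, and the real scripts rely on the functions in preprocessing.py.

```python
import pandas as pd
import xgboost as xgb

# --- training (Train script, roughly) ---------------------------------------
train = pd.read_csv("data/train.csv")                      # path is an assumption
X_train = pd.get_dummies(train.drop(columns=["churn"]))    # simplistic encoding stand-in
y_train = (train["churn"] == "yes").astype(int)

model = xgb.XGBClassifier(n_estimators=400, max_depth=5, eval_metric="logloss")
model.fit(X_train, y_train)
model.save_model("xgb_churn.json")                         # persisted model artefact

# --- inference (Model Inference script, roughly) ----------------------------
test = pd.read_csv("data/test.csv")                        # assumed to contain an "id" column
X_test = pd.get_dummies(test.drop(columns=["id"])).reindex(columns=X_train.columns, fill_value=0)
preds = model.predict(X_test)

submission = pd.DataFrame({"id": test["id"], "churn": pd.Series(preds).map({1: "yes", 0: "no"})})
submission.to_csv("submission.csv", index=False)
```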
Many different models, methods, and frameworks were tried along the way. The final accuracy on the Kaggle platform is 0.88 on both the public and private leaderboards.
To sum up, a model with this level of performance could save the company a substantial amount of money in practice by flagging customers who are likely to churn.