Customer Churn Prediction

Objective: Predict telecom customer churn with XGBoost + SHAP explainability, deployed as a live API and dashboard.
Stack: Python · XGBoost · FastAPI · Streamlit · Docker · HuggingFace
Dataset: IBM Telco Customer Churn — 7,043 customers, 21 features
Result: 0.85+ AUC-ROC · End-to-end deployed on HuggingFace Spaces

Project Overview

Customer churn prediction is a classic binary classification problem in the telecom domain. The goal is to predict whether a customer will stop using a service (churn = 1) or continue (churn = 0) based on their usage patterns, demographics, and subscription details.

"It costs 5–25x more to acquire a new customer than to retain an existing one. Even a 5% reduction in churn can increase profits by 25–95%. Every prediction in this project comes with a SHAP explanation so business teams understand WHY a customer is at risk."

Dataset: IBM Telco Customer Churn (Kaggle) — 7,043 customers, 21 features. Target distribution: 73.5% Stay, 26.5% Churn — class imbalanced. Features include 3 numerical (tenure, MonthlyCharges, TotalCharges) and 16 categorical (Contract, PaymentMethod, InternetService, etc.).

Phase 1 — Exploratory Data Analysis

EDA is the process of understanding data before modelling — exploring patterns, distributions, correlations, and anomalies.

Key steps: Loaded dataset with pandas, checked shape/dtypes/missing values. Found TotalCharges was stored as string due to 11 blank entries — converted using pd.to_numeric(errors='coerce'). Visualized churn distribution and plotted distributions for all features.
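The TotalCharges fix can be reproduced in a few lines (toy values; the real column came from the Telco CSV):

```python
import pandas as pd

# Blank strings force pandas to store the column as object (string) dtype.
df = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", " "]})
assert df["TotalCharges"].dtype == object

# errors='coerce' turns unparseable entries (the blanks) into NaN,
# leaving a clean numeric column to impute or drop.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

print(df["TotalCharges"].isna().sum())  # count of blanks that became NaN
```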

Key findings from EDA:

Phase 2 — Data Preprocessing

Preprocessing converts raw data into a format ML models can understand. Most models can't work with strings or missing values — everything must be numerical and clean.

Pipeline steps: Removed customerID (no predictive power). Fixed TotalCharges. Encoded target (Yes → 1, No → 0). Train/Test split 80/20 with stratify=y. Built a ColumnTransformer: StandardScaler for numerical, OneHotEncoder for categorical. Applied SMOTE only on training data to handle class imbalance.

| Technique | What It Does | Why Used Here |
|---|---|---|
| StandardScaler | Normalizes features to mean=0, std=1 | Prevents TotalCharges from dominating the model |
| OneHotEncoder | Converts categories to binary columns | ML models need numbers, not strings |
| SMOTE | Creates synthetic minority-class examples | Balances 4,100 Stay vs 1,400 Churn → 4,100 vs 4,100 |
| Pipeline | Bundles preprocessing + model together | Ensures the same transformations in training and production |
"SMOTE is applied ONLY on training data — never on test data. Applying SMOTE to test data causes data leakage, where the model 'sees' test data during training and reports fake accuracy."
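The fit-on-train-only rule can be sketched with scikit-learn alone (toy data with illustrative column names; SMOTE itself, from the imbalanced-learn package, would be applied to the transformed training split under the same rule):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Tiny frame standing in for the Telco data.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 10, 60, 3, 17],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.3, 70.7,
                       99.65, 89.1, 29.75, 104.8, 56.15],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Two year", "Month-to-month", "Two year",
                 "Month-to-month", "One year"],
    "Churn": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns="Churn"), df["Churn"]

# Stratified 80/20 split keeps the churn ratio identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

# Fit on the training set ONLY, then transform both splits. SMOTE
# (imblearn's SMOTE().fit_resample) would run on X_train_t only.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```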

Phase 3 — Model Training & Selection

Four models were trained and compared. XGBoost was selected as the final model based on AUC-ROC performance.

| Model | Characteristics | Role in Project |
|---|---|---|
| Logistic Regression | Simple, interpretable, fast | Baseline |
| Random Forest | Ensemble of trees, handles non-linearity | Comparison |
| XGBoost | Gradient-boosted trees, regularization, handles missing values | Final model (best AUC) |
| LightGBM | Faster than XGBoost, histogram-based | Comparison |

Hyperparameter tuning with Optuna: 50 trials using Bayesian optimization, optimizing for AUC-ROC via cross-validation. Parameters tuned: n_estimators, max_depth, learning_rate, subsample, colsample_bytree, reg_alpha, reg_lambda. All experiments tracked with MLflow.

Why AUC-ROC and not accuracy? With 73.5% non-churners, a model that predicts "never churn" achieves 73.5% accuracy but is completely useless. AUC-ROC measures the model's ability to rank churners above non-churners regardless of threshold. Our model achieved 0.85+ AUC-ROC.
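The "never churn" baseline argument is easy to verify numerically: a constant predictor scores well on accuracy but exactly 0.5 on AUC-ROC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
# Labels mirroring the Telco imbalance: ~73.5% stay (0), ~26.5% churn (1).
y_true = (rng.random(10_000) < 0.265).astype(int)

# Degenerate "never churn" model: constant predictions and scores.
acc = accuracy_score(y_true, np.zeros_like(y_true))
auc = roc_auc_score(y_true, np.zeros(len(y_true)))

print(f"accuracy: {acc:.3f}")  # close to 0.735 despite learning nothing
print(f"AUC-ROC:  {auc:.3f}")  # exactly 0.5: no ranking ability at all
```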

Phase 4 — SHAP Explainability

SHAP (SHapley Additive exPlanations) explains WHY the model made a specific prediction. It assigns an importance value to each feature for each individual prediction, grounded in game theory (Shapley values).

How SHAP values work: A positive SHAP value pushes the prediction toward Churn. A negative SHAP value pushes toward No Churn. The sum of all SHAP values plus the base value equals the final prediction score.
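The additivity property can be checked by hand without the shap library: for a linear model, the exact Shapley value of feature i is w_i * (x_i - E[x_i]), a standard result, and the base value is the model's output at the feature means.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))           # toy feature matrix
w, b = np.array([0.8, -1.2, 0.5]), 0.3  # toy linear "model"

x = X[0]                                # one customer to explain
base_value = w @ X.mean(axis=0) + b     # expected model output
shap_values = w * (x - X.mean(axis=0))  # exact per-feature contributions

prediction = w @ x + b
# Positive values push toward Churn, negative toward No Churn, and
# base value + sum of contributions reconstructs the prediction exactly.
assert np.isclose(base_value + shap_values.sum(), prediction)
print(shap_values.round(3))
```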

Visualizations created:

Key SHAP findings: Low tenure is the strongest predictor of churn. High MonthlyCharges increases churn risk. Month-to-month contract is a strong churn indicator. Absence of TechSupport or OnlineSecurity significantly increases churn risk.

Phase 5 — FastAPI Backend

FastAPI is a modern Python web framework for building REST APIs. The ML model lives in the API — the dashboard sends customer data and receives predictions back. This separation allows the model to be used by any frontend.

| Endpoint | Method | Purpose |
|---|---|---|
| /health | GET | Returns API status — used for monitoring |
| / | GET | Root endpoint, confirms the API is running |
| /predict | POST | Single-customer prediction + SHAP explanation |
| /predict/batch | POST | Batch predictions for multiple customers |
"Critical bug fixed: the model was loaded BEFORE the FastAPI app was created. HuggingFace's health check hit /health immediately on startup, but the app wasn't ready yet, causing a timeout crash. Fix: create app = FastAPI() first, register /health second, then load the model third."

Phase 6 — Streamlit Dashboard

Streamlit is a Python library for building interactive web apps without writing HTML/CSS/JavaScript. Three pages were built: a Home overview, a Single Prediction form (19 customer features → churn probability + SHAP), and a Batch Prediction page (CSV upload → download results).

How the dashboard talks to the API: User fills form → app.py sends POST request to FastAPI → FastAPI returns JSON → app.py displays result. The dashboard requires no ML libraries — all ML work happens in the API.

Environment fix: Local machine uses http://127.0.0.1:8000, HuggingFace uses the live URL. os.getenv() was used so the URL can be set via environment variable without changing code.
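A sketch of that environment fix (the variable name API_URL is an assumption):

```python
import os

# Locally the fallback default applies; on HuggingFace the environment
# variable points at the live Space, with no code change needed.
API_URL = os.getenv("API_URL", "http://127.0.0.1:8000")
print(API_URL)
```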

Phase 7 — Docker & Deployment

Docker packages the application and all dependencies into a container — a portable, self-contained unit that runs the same everywhere. Each HuggingFace Space is a Git repository: push code and it automatically builds and deploys.

HuggingFace Spaces specs: Free tier offers 512MB RAM, 0.1 CPU. Port 7860 is the only exposed port. Spaces sleep after 48h inactivity and wake on visit (1–2 min cold start).
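A minimal Dockerfile consistent with those constraints might look like this (file and module names are assumptions, not the project's actual layout):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# HuggingFace Spaces only exposes port 7860 -- bind to it explicitly.
EXPOSE 7860
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
```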

The key Docker issues resolved during deployment are included in the table below.

Problems Faced & Solutions

| Problem | Cause | Fix |
|---|---|---|
| Streamlit blank screen | Python 3.14 incompatibility | Downgraded to Python 3.11.9 |
| Windows path errors | Hardcoded Windows paths broke on other operating systems | Used pathlib.Path throughout |
| Git push 403 | GitHub no longer accepts passwords | Used a Personal Access Token (PAT) |
| HF push rejected | Binary files (.png, .pkl) too large for HF git | Used git filter-branch to remove them from history |
| numpy conflict | shap needs numpy≥2, streamlit 1.32 needs numpy<2 | Upgraded to streamlit 1.45+ |
| scikit-learn mismatch | Model trained on 1.8.0, Docker installed 1.3.2 | Pinned scikit-learn==1.8.0 in requirements |
| API timeout on HF | Model loaded before the FastAPI app was created | Moved app = FastAPI() to the top of main.py |
| Port blocked on HF | API ran on 8000, HF only exposes 7860 | Changed the port to 7860 in the Dockerfile CMD |

Designed and crafted with ❤️ by Biswajit Pradhan