Customer churn prediction is a classic binary classification problem in the telecom domain. The goal is to predict whether a customer will stop using a service (churn = 1) or continue (churn = 0) based on their usage patterns, demographics, and subscription details.
Dataset: IBM Telco Customer Churn (Kaggle) — 7,043 customers, 21 features. Target distribution: 73.5% Stay, 26.5% Churn — class imbalanced. Features include 3 numerical (tenure, MonthlyCharges, TotalCharges) and 16 categorical (Contract, PaymentMethod, InternetService, etc.).
EDA is the process of understanding data before modelling — exploring patterns, distributions, correlations, and anomalies.
Key steps: Loaded dataset with pandas, checked shape/dtypes/missing values. Found TotalCharges was stored as string due to 11 blank entries — converted using pd.to_numeric(errors='coerce'). Visualized churn distribution and plotted distributions for all features.
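The TotalCharges fix above can be sketched on a toy frame that reproduces the issue (blank strings forcing the column to `object` dtype); the real project applies the same call to the loaded Kaggle CSV.

```python
import pandas as pd

# Toy frame reproducing the issue: TotalCharges stored as strings, with a
# blank entry like the 11 bad rows in the real dataset.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 0],
    "TotalCharges": ["29.85", "1889.5", "108.15", " "],
})
print(df["TotalCharges"].dtype)  # object, i.e. strings

# errors="coerce" turns unparseable entries (the blanks) into NaN,
# making the column numeric so it can be imputed or dropped.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
print(df["TotalCharges"].dtype)         # float64
print(df["TotalCharges"].isna().sum())  # 1 — the blank became NaN
```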
Key findings from EDA:
Preprocessing converts raw data into a format ML models can understand. Models can't work with strings or missing values — everything must be numerical and clean.
Pipeline steps: Removed customerID (no predictive power). Fixed TotalCharges. Encoded target (Yes → 1, No → 0). Train/Test split 80/20 with stratify=y. Built a ColumnTransformer: StandardScaler for numerical, OneHotEncoder for categorical. Applied SMOTE only on training data to handle class imbalance.
| Technique | What It Does | Why Used Here |
|---|---|---|
| StandardScaler | Normalizes features to mean=0, std=1 | Prevents TotalCharges from dominating the model |
| OneHotEncoder | Converts categories to binary columns | ML models need numbers, not strings |
| SMOTE | Creates synthetic minority class examples | Balances 4,100 Stay vs 1,400 Churn → 4,100 vs 4,100 |
| Pipeline | Bundles preprocessing + model together | Ensures same transformations in training and production |
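The pipeline steps above can be sketched with scikit-learn on a toy frame whose columns mirror the Telco dataset; the logistic regression is a stand-in classifier. SMOTE is omitted here to keep the sketch dependency-light — in the project it sits between preprocessing and the model (via imblearn's pipeline) so synthetic samples are generated from training data only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with Telco-style columns.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22, 3, 60],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.3, 70.7, 99.65, 89.1, 29.75],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year",
                 "Month-to-month", "Two year", "Month-to-month", "Two year"],
    "Churn": [1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns="Churn"), df["Churn"]

# Scale numerical columns, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# stratify=y keeps the class ratio identical in train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
```

Because preprocessing lives inside the Pipeline, `model.fit` learns the scaler and encoder from training data only, and `model.predict_proba` replays the exact same transformations at inference time.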
Four models were trained and compared. XGBoost was selected as the final model based on AUC-ROC performance.
| Model | Characteristics | Role in Project |
|---|---|---|
| Logistic Regression | Simple, interpretable, fast | Baseline |
| Random Forest | Ensemble of trees, handles non-linearity | Comparison |
| XGBoost | Gradient boosted trees, regularization, handles missing values | Final model (best AUC) |
| LightGBM | Faster than XGBoost, histogram-based | Comparison |
Hyperparameter tuning with Optuna: 50 trials using Bayesian optimization, optimizing for AUC-ROC via cross-validation. Parameters tuned: n_estimators, max_depth, learning_rate, subsample, colsample_bytree, reg_alpha, reg_lambda. All experiments tracked with MLflow.
Why AUC-ROC and not accuracy? With 73.5% non-churners, a model that predicts "never churn" achieves 73.5% accuracy but is completely useless. AUC-ROC measures the model's ability to rank churners above non-churners regardless of threshold. Our model achieved 0.85+ AUC-ROC.
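The "never churn" trap can be demonstrated in a few lines: a constant predictor scores high accuracy on the imbalanced labels but has zero ranking ability, which AUC-ROC exposes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.265).astype(int)  # ~26.5% churners

y_pred = np.zeros_like(y_true)   # always predict "no churn"
scores = np.zeros(len(y_true))   # constant score: cannot rank anyone

print(accuracy_score(y_true, y_pred))  # ~0.735 — looks decent
print(roc_auc_score(y_true, scores))   # 0.5 — no better than chance
```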
SHAP (SHapley Additive exPlanations) explains WHY the model made a specific prediction. It assigns an importance value to each feature for each individual prediction, grounded in game theory (Shapley values).
How SHAP values work: A positive SHAP value pushes the prediction toward Churn. A negative SHAP value pushes toward No Churn. The sum of all SHAP values plus the base value equals the final prediction score.
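The additivity property above can be verified exactly on a tiny hand-built model by brute-forcing Shapley values: average each feature's marginal contribution over all orderings. The model function and its numbers here are purely illustrative, not taken from the trained XGBoost model.

```python
from itertools import permutations

FEATURES = ["tenure", "MonthlyCharges"]

def model(present):
    """Toy churn score given a coalition of 'known' features."""
    score = 0.30              # base value (average prediction)
    if "tenure" in present:
        score += 0.25         # short tenure pushes toward churn
    if "MonthlyCharges" in present:
        score += 0.10         # high charges push toward churn
    return score

base_value = model(set())
prediction = model(set(FEATURES))

# Shapley value of a feature = its marginal contribution to the score,
# averaged over every possible order in which features are revealed.
shap_values = {f: 0.0 for f in FEATURES}
orderings = list(permutations(FEATURES))
for order in orderings:
    present = set()
    for f in order:
        before = model(present)
        present.add(f)
        shap_values[f] += (model(present) - before) / len(orderings)

# Additivity: base value + sum of SHAP values == final prediction.
print(shap_values, base_value + sum(shap_values.values()), prediction)
```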
Visualizations created:
Key SHAP findings: Low tenure is the strongest predictor of churn. High MonthlyCharges increases churn risk. Month-to-month contract is a strong churn indicator. Absence of TechSupport or OnlineSecurity significantly increases churn risk.
FastAPI is a modern Python web framework for building REST APIs. The ML model lives in the API — the dashboard sends customer data and receives predictions back. This separation allows the model to be used by any frontend.
| Endpoint | Method | Purpose |
|---|---|---|
| /health | GET | Returns API status — used for monitoring |
| / | GET | Root endpoint, confirms API is running |
| /predict | POST | Single customer prediction + SHAP explanation |
| /predict/batch | POST | Batch predictions for multiple customers |
Streamlit is a Python library for building interactive web apps without writing HTML/CSS/JavaScript. Three pages were built: a Home overview, a Single Prediction form (19 customer features → churn probability + SHAP), and a Batch Prediction page (CSV upload → download results).
How the dashboard talks to the API: User fills form → app.py sends POST request to FastAPI → FastAPI returns JSON → app.py displays result. The dashboard requires no ML libraries — all ML work happens in the API.
Environment fix: Local machine uses http://127.0.0.1:8000, HuggingFace uses the live URL. os.getenv() was used so the URL can be set via environment variable without changing code.
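The dashboard-side call can be sketched as below; the env var name API_URL is an assumption, and the default matches local development while HuggingFace would set the live URL.

```python
import os
import requests

# Falls back to the local API when the env var is not set; no code change
# is needed between local development and the deployed dashboard.
API_URL = os.getenv("API_URL", "http://127.0.0.1:8000")

def predict_single(customer: dict) -> dict:
    """POST one customer's features to the API and return the JSON result."""
    response = requests.post(f"{API_URL}/predict", json=customer, timeout=30)
    response.raise_for_status()
    return response.json()
```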
Docker packages the application and all dependencies into a container — a portable, self-contained unit that runs the same everywhere. Each HuggingFace Space is a Git repository: push code and it automatically builds and deploys.
HuggingFace Spaces specs: Free tier offers 512MB RAM, 0.1 CPU. Port 7860 is the only exposed port. Spaces sleep after 48h inactivity and wake on visit (1–2 min cold start).
Key Docker issues resolved:
| Problem | Cause | Fix |
|---|---|---|
| Docker build failure | Typo `--upgrad` instead of `--upgrade` in the pip command | Corrected the flag spelling |
| Streamlit blank screen | Python 3.14 incompatibility | Downgraded to Python 3.11.9 |
| Windows path errors | Hardcoded Windows paths broke on other OS | Used pathlib.Path throughout |
| Git push 403 | GitHub no longer accepts passwords | Used Personal Access Token (PAT) |
| HF push rejected | Binary files (.png, .pkl) too large for HF git | Used git filter-branch to remove history |
| numpy conflict | shap needs numpy≥2, streamlit 1.32 needs numpy<2 | Upgraded to streamlit 1.45+ |
| scikit-learn mismatch | Model trained on 1.8.0, Docker installed 1.3.2 | Pinned scikit-learn==1.8.0 in requirements |
| API timeout on HF | Model loaded before FastAPI app was created | Moved app = FastAPI() to top of main.py |
| Port blocked on HF | API ran on 8000, HF only exposes 7860 | Changed port to 7860 in Dockerfile CMD |
Designed and crafted with ❤️ by Biswajit Pradhan