Telco Customer Churn Prediction Pipeline

This project emphasizes engineering discipline as much as model quality. The core work was not just fitting a classifier, but structuring the pipeline so preprocessing, evaluation, and inference all remain inspectable.

Case Study

Project Overview

What I built, the problem, and the solution

Reworked exploratory notebook experimentation into a cleaner pipeline architecture that separates data preparation, model training, inference, and artifact tracking while preserving reproducibility and reviewability.

Challenge Framing

Churn modeling often starts in notebooks and becomes difficult to reproduce, compare, or operationalize once preprocessing and training logic spread across experiments.

Solution Strategy

I centralized configuration, separated data and training stages, used PySpark for scalable preparation, and logged experiments plus artifacts with MLflow.
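A minimal sketch of what config-driven stage execution could look like. The field names, the default path, and the stage callables are hypothetical; the source does not show the actual keys in config.yaml, and a plain dict stands in here for the parsed YAML file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings; the real config.yaml keys are not shown in the source."""
    raw_path: str = "data/raw/telco.csv"  # assumed path, for illustration only
    test_size: float = 0.2
    random_seed: int = 42


def load_config(overrides: dict) -> PipelineConfig:
    # In a real project this dict would come from parsing config.yaml.
    return PipelineConfig(**overrides)


def run_pipeline(config: PipelineConfig, stages: dict) -> list:
    """Run each stage in order, handing the same immutable config to all of them."""
    executed = []
    for name, stage in stages.items():
        stage(config)
        executed.append(name)
    return executed


config = load_config({"test_size": 0.2, "random_seed": 7})
seen_seeds = []
stages = {
    # Stand-in stages that only record the seed they were given.
    "prepare": lambda cfg: seen_seeds.append(cfg.random_seed),
    "train": lambda cfg: seen_seeds.append(cfg.random_seed),
    "infer": lambda cfg: seen_seeds.append(cfg.random_seed),
}
order = run_pipeline(config, stages)
```

Because the config object is frozen and shared, every stage sees the same seed and split ratio, which is the property that replaces per-notebook assumptions.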

Project Highlights

PySpark-first preprocessing flow with reusable train/test artifacts.
MLflow tracking for metrics, parameters, artifacts, and model versions.
Streaming-style inference telemetry to surface latency and batch behavior.
Config-driven execution that replaces one-off notebook assumptions.

Gallery

Product screens and workflow snapshots

Selected frames show the pipeline decomposition, the MLflow experiment surface, and the inference telemetry output.

Pipeline decomposition

The repository is split into preparation, training, inference, and utility layers for reproducible execution.

MLflow experiment surface

Metrics, models, and processed outputs are logged as first-class artifacts rather than buried in notebook outputs.

Inference telemetry

Batch-level prediction summaries keep latency and output behavior visible after training ends.

Tech Stack

Built with tools chosen for reliability and iteration speed

The stack covers ML pipeline engineering, traceability, artifact lineage, experiment management, and model evaluation for imbalanced classification.

AI / ML

Python
PySpark
scikit-learn
XGBoost
pandas
NumPy

DevOps

MLflow

Key Features

Reusable preprocessing

Cleaning, encoding, scaling, and split generation are orchestrated as pipeline code rather than notebook cells.

Model experimentation

Multiple classical baselines stay comparable through consistent evaluation and shared artifact outputs.

Experiment visibility

MLflow captures model parameters, metrics, processed datasets, and serialized artifacts.

Inference observability

Prediction latency and batch-level analytics are surfaced through a streaming-style inference wrapper.
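One way such a wrapper could be structured is sketched below. The function names and the toy prediction rule are assumptions for illustration; in the actual project the predictor would be a loaded model artifact, and the telemetry records would presumably be logged rather than collected in a list.

```python
import time
from typing import Callable, Iterable, Iterator, List, Sequence


def batched(rows: Sequence[dict], batch_size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of rows, imitating a stream of micro-batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def predict_with_telemetry(
    batches: Iterable[Sequence[dict]],
    predict: Callable[[Sequence[dict]], List[int]],
) -> Iterator[dict]:
    """Call predict on each batch and yield one telemetry record per batch."""
    for batch_id, batch in enumerate(batches):
        started = time.perf_counter()
        predictions = predict(batch)
        latency_ms = (time.perf_counter() - started) * 1000.0
        yield {
            "batch_id": batch_id,
            "rows": len(batch),
            "latency_ms": latency_ms,
            "churn_rate": sum(predictions) / len(predictions) if predictions else 0.0,
        }


def toy_predict(batch: Sequence[dict]) -> List[int]:
    # Stand-in for a real model: flag customers with tenure under 12 months.
    return [1 if row["tenure"] < 12 else 0 for row in batch]


rows = [{"tenure": t} for t in (1, 5, 24, 36, 2)]
telemetry = list(predict_with_telemetry(batched(rows, 2), toy_predict))
```

Keeping the wrapper a generator means telemetry is emitted as each batch completes, so slow batches become visible without waiting for the whole run.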

Architecture

System architecture designed as a readable engineering story

Each layer stays explicit so reviewers can quickly understand where preparation, training, and inference responsibilities live.

Preparation Layer

Raw telco data is transformed into consistent train/test artifacts with PySpark-backed cleaning and feature prep.

PySpark, pandas, config.yaml

Training Layer

Classifiers are trained, evaluated, and versioned with a shared experiment workflow.

scikit-learn, XGBoost, MLflow

Inference Layer

Saved models and preprocessing assets are reused for prediction and telemetry logging.

Python, MLflow, serialized artifacts

System Flow

Key stages broken down as a readable execution path

The pipeline section keeps the most important engineering steps visible without collapsing them into generic bullet lists.

Ingest

Load the telco churn dataset and normalize schema-level issues such as incomplete numeric values.
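The schema normalization step can be illustrated with a plain-Python stand-in. The project itself uses PySpark for this; the sketch below only shows the coercion logic, and the column names and sample rows are assumed for illustration rather than taken from the repository.

```python
import csv
import io
from typing import List, Optional


def parse_numeric(value: str) -> Optional[float]:
    """Return a float, or None when the field is blank or not a number."""
    text = value.strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def load_rows(handle, numeric_columns: List[str]) -> List[dict]:
    """Read CSV records and coerce the listed columns to float-or-None."""
    rows = []
    for record in csv.DictReader(handle):
        for column in numeric_columns:
            record[column] = parse_numeric(record[column])
        rows.append(record)
    return rows


# Assumed sample: one clean row, one blank charge, one non-numeric charge.
sample = io.StringIO(
    "customerID,tenure,TotalCharges\n"
    "A1,12,350.5\n"
    "A2,0, \n"
    "A3,3,n/a\n"
)
rows = load_rows(sample, ["tenure", "TotalCharges"])
```

Mapping incomplete values to an explicit missing marker at ingest keeps the later imputation step from having to guess which strings were meant to be numbers.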

CSV, PySpark

Prepare

Handle missing values, remove outliers, encode categoricals, and scale key numeric features.
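The four preparation steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's PySpark code: median imputation, the 1.5 × IQR outlier rule, and population-standard-deviation scaling are choices made here, since the source does not say which strategies the pipeline uses, and the sample values are invented.

```python
import statistics


def impute_median(values):
    """Replace None with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def standardize(values):
    """Scale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def one_hot(values):
    """Encode categories as 0/1 vectors in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


charges = [20.0, 30.0, None, 40.0, 50.0, 900.0]
contracts = ["monthly", "yearly", "monthly", "monthly", "yearly", "monthly"]

filled = impute_median(charges)
low, high = iqr_bounds(filled)
keep = [i for i, v in enumerate(filled) if low <= v <= high]
kept_charges = [filled[i] for i in keep]
kept_contracts = [contracts[i] for i in keep]
scaled = standardize(kept_charges)
encoded, categories = one_hot(kept_contracts)
```

In a train/test setting the median, the outlier bounds, and the scaling statistics would be computed on the training split only and then reused for the test split and at inference time.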

PySpark, config.yaml

Train

Benchmark classical ML baselines and log metrics, parameters, and artifacts into MLflow.

scikit-learn, XGBoost, MLflow

Infer

Load serialized assets for downstream prediction while recording latency and batch-level telemetry.

Python, MLflow

Timeline

A case-study flow that explains how the system took shape

This timeline keeps the implementation story concise: what was framed first, what was hardened next, and what ultimately made the project reproducible and reviewable.

Phase 01

Notebook consolidation

Moved scattered preparation and evaluation logic into a coherent repository structure.

Phase 02

Pipeline hardening

Created modular data, training, and inference stages with shared configuration.

Phase 03

Experiment tracking

Added MLflow-based lineage so metrics and model artifacts remain reviewable across runs.

Challenges

Technical constraints, decisions, and the reasoning behind them

Each challenge is tied to a concrete design choice and a specific outcome.

Solution

Centralized logic under scriptable modules and orchestrated pipelines with shared config.

Outcome

Reduced manual repetition and improved auditability.

Solution

Used PySpark for deterministic transformation, then converted to pandas for estimator compatibility.

Outcome

Kept the system scalable without sacrificing model tooling.

Solution

Elevated F1, precision, and recall beside raw accuracy during evaluation and reporting.
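A small worked example of why accuracy alone is not enough on imbalanced churn data. The labels below are invented for illustration; the helper computes the standard definitions of precision, recall, and F1 for the positive (churn) class.

```python
def classification_summary(y_true, y_pred, positive=1):
    """Compute accuracy plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Ten customers, two of whom churn.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_no_churn = [0] * 10                       # never flags anyone
one_hit_one_miss = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # finds one churner, one false alarm

baseline = classification_summary(y_true, always_no_churn)
candidate = classification_summary(y_true, one_hit_one_miss)
```

Both predictors score 0.8 accuracy, but the first has zero recall and zero F1 while the second reaches 0.5 on precision, recall, and F1, which is the difference the class-aware metrics are meant to expose.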

Outcome

Made the model evaluation more transparent and technically defensible.

Results

Metrics and outcomes presented for quick technical review

The emphasis here is signal, not decoration: key numbers, verifiable outcomes, and the context needed to interpret them responsibly.

0.845

Best CV F1

Random Forest baseline during notebook benchmarking.

0.735

Holdout Recall

Kept churn detection visible instead of hiding behind accuracy.

80/20

Train/Test Split

Consistent split strategy for reproducible evaluation.
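A sketch of a seeded 80/20 split in plain Python. Stratifying by label is an assumption made here because it suits imbalanced churn data; the source only states that the split is consistent, and the function name and label counts are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), preserving class proportions."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):  # fixed class order keeps the shuffle sequence stable
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Invented example: 25 churners and 75 non-churners.
labels = [1] * 25 + [0] * 75
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=7)
repeat_train, repeat_test = stratified_split(labels, test_size=0.2, seed=7)
```

With the seed fixed in shared configuration, every run and every model sees the same holdout rows, which is what makes metrics comparable across experiments.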

4 stages

Pipeline Scope

Data prep, training, evaluation, and inference telemetry.

Key Results

Established a reproducible churn workflow from raw CSV to tracked model artifact.
Benchmarked Random Forest as the strongest baseline at roughly 0.845 cross-validated F1.
Surfaced class-aware metrics so churn detection quality stays visible.
Created a stronger foundation for future API serving and scheduled retraining.

Business Impact

Business value

Supports retention-risk analysis with a pipeline that can evolve into batch scoring or service-based predictions.

Engineering value

Shows ML systems maturity through configuration, artifacts, reproducibility, and observability rather than just raw metrics.
