Module 02

A reproducible churn workflow designed to bridge notebook experimentation and deployable ML systems.

Telco Customer Churn Prediction Pipeline

This project emphasizes engineering discipline as much as model quality. The core work was not just fitting a classifier, but structuring the pipeline so preprocessing, evaluation, and inference all remain inspectable.

Overview

Reworked exploratory notebook experimentation into a cleaner pipeline architecture that separates data preparation, model training, inference, and artifact tracking while preserving reproducibility and reviewability.

Problem

Churn modeling often starts in notebooks and becomes difficult to reproduce, compare, or operationalize once preprocessing and training logic spreads across experiments.

Approach

I centralized configuration, separated data and training stages, used PySpark for scalable preparation, and logged experiments plus artifacts with MLflow.

Project Overview

A modern engineering case study, structured for both recruiters and builders

Project Highlights

  • PySpark-first preprocessing flow with reusable train/test artifacts.
  • MLflow tracking for metrics, parameters, artifacts, and model versions.
  • Streaming-style inference telemetry to surface latency and batch behavior.
  • Config-driven execution that replaces one-off notebook assumptions.

Tech Stack

Built with tools chosen for reliability and iteration speed

Focus areas: ML pipeline engineering, traceability, artifact lineage, experiment management, and model evaluation for imbalanced classification.

Core Stack

  • Python
  • PySpark
  • MLflow
  • scikit-learn
  • XGBoost
  • pandas
  • NumPy

Key Features

Reusable preprocessing

Cleaning, encoding, scaling, and split generation are orchestrated as pipeline code rather than notebook cells.
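To make the idea of preprocessing as pipeline code concrete, here is a minimal sketch in plain Python rather than PySpark. The function names `fit_preprocessor` and `transform`, the column names, and the sample rows are all hypothetical and not taken from the project. It covers median imputation, standard scaling, and one-hot encoding only; outlier removal and split generation are left out.

```python
from statistics import mean, median, pstdev


def fit_preprocessor(rows, numeric, categorical):
    """Learn imputation, scaling, and encoding parameters from training rows only."""
    params = {"numeric": {}, "categories": {}}
    for col in numeric:
        observed = [r[col] for r in rows if r[col] is not None]
        fill = median(observed)
        filled = [fill if r[col] is None else r[col] for r in rows]
        params["numeric"][col] = {
            "fill": fill,
            "mean": mean(filled),
            "std": pstdev(filled) or 1.0,  # guard against constant columns
        }
    for col in categorical:
        params["categories"][col] = sorted({r[col] for r in rows})
    return params


def transform(rows, params):
    """Apply learned parameters; unseen categories encode as all zeros."""
    out = []
    for r in rows:
        vec = []
        for col, p in params["numeric"].items():
            value = p["fill"] if r[col] is None else r[col]
            vec.append((value - p["mean"]) / p["std"])
        for col, levels in params["categories"].items():
            vec.extend(1.0 if r[col] == level else 0.0 for level in levels)
        out.append(vec)
    return out


# Invented sample rows for illustration.
train = [
    {"tenure": 2, "contract": "month-to-month"},
    {"tenure": None, "contract": "two-year"},
    {"tenure": 10, "contract": "month-to-month"},
]
params = fit_preprocessor(train, ["tenure"], ["contract"])
features = transform(train, params)
unseen = transform([{"tenure": 6, "contract": "one-year"}], params)
print(features[1], unseen[0])
```

Keeping `fit_preprocessor` separate from `transform` means the parameters learned on training data can be stored and reapplied unchanged at inference time.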

Model experimentation

Multiple classical baselines stay comparable through consistent evaluation and shared artifact outputs.

Experiment visibility

MLflow captures model parameters, metrics, processed datasets, and serialized artifacts.
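As a rough picture of what a tracked run holds, the sketch below writes parameters, metrics, and artifact paths to a JSON file per run. It is a dependency-free stand-in and does not use or reproduce MLflow's API. The helpers `log_run` and `best_run`, the directory layout, and the metric values are hypothetical and chosen only for illustration.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def log_run(root, params, metrics, artifacts=()):
    """Write one run's params, metrics, and artifact paths to <root>/<run_id>/run.json."""
    run_id = uuid.uuid4().hex[:8]
    run_dir = Path(root) / run_id
    run_dir.mkdir(parents=True)
    record = {
        "run_id": run_id,
        "logged_at": time.time(),
        "params": dict(params),
        "metrics": dict(metrics),
        "artifacts": [str(a) for a in artifacts],
    }
    (run_dir / "run.json").write_text(json.dumps(record, indent=2, sort_keys=True))
    return run_dir


def best_run(root, metric):
    """Return the stored record with the highest value for `metric`."""
    records = [json.loads(p.read_text()) for p in Path(root).glob("*/run.json")]
    return max(records, key=lambda r: r["metrics"][metric])


# Illustrative values only; these are not the project's results.
root = tempfile.mkdtemp()
log_run(root, {"model": "random_forest", "n_estimators": 200}, {"f1": 0.80})
log_run(root, {"model": "logistic_regression"}, {"f1": 0.75})
winner = best_run(root, "f1")
print(winner["params"]["model"])
```

Because every run is stored with the same fields, comparing baselines becomes a query over records instead of a search through notebook output.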

Inference observability

Prediction latency and batch-level analytics are surfaced through a streaming-style inference wrapper.
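One way to picture a streaming-style wrapper is a generator that times each batch and emits telemetry next to the predictions. The sketch below is an assumption about the shape of such a wrapper, not the project's implementation. `timed_batches`, `toy_predict`, the telemetry field names, and the input rows are hypothetical.

```python
import time
from statistics import mean


def timed_batches(predict, batches):
    """Yield (predictions, telemetry) per batch, timing each predict call."""
    for index, batch in enumerate(batches):
        start = time.perf_counter()
        predictions = predict(batch)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        yield predictions, {
            "batch": index,
            "rows": len(batch),
            "latency_ms": elapsed_ms,
            "positive_rate": mean(predictions) if predictions else 0.0,
        }


def toy_predict(batch):
    # Hypothetical stand-in for a loaded model:
    # flag short-tenure, month-to-month customers as likely churners.
    return [
        1 if row["tenure"] < 12 and row["contract"] == "month-to-month" else 0
        for row in batch
    ]


batches = [
    [{"tenure": 3, "contract": "month-to-month"}, {"tenure": 40, "contract": "two-year"}],
    [{"tenure": 8, "contract": "month-to-month"}],
]
telemetry = [stats for _, stats in timed_batches(toy_predict, batches)]
for stats in telemetry:
    print(stats)
```

Because the wrapper is a generator, batches are scored lazily, and per-batch latency or predicted churn rate can be logged as soon as each batch completes.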

Architecture

System architecture designed as a readable engineering story

Each layer stays explicit so reviewers can quickly understand where ingestion, orchestration, persistence, and model-serving responsibilities live.

01

Preparation Layer

Raw telco data is transformed into consistent train/test artifacts with PySpark-backed cleaning and feature prep.

PySpark, pandas, config.yaml

02

Training Layer

Classifiers are trained, evaluated, and versioned with a shared experiment workflow.

scikit-learn, XGBoost, MLflow

03

Inference Layer

Saved models and preprocessing assets are reused for prediction and telemetry logging.

Python, MLflow, serialized artifacts

AI Pipeline

Pipeline stages broken down as a readable execution path

The pipeline section keeps the most important engineering steps visible without collapsing them into generic bullet lists.

01

Ingest

Load the telco churn dataset and normalize schema-level issues such as incomplete numeric values.
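A small illustration of what normalizing incomplete numeric values can look like at load time, using the standard `csv` module as a stand-in for PySpark. The column names follow the widely used public Telco churn CSV and are an assumption about this project's data. The rows are invented, and `to_float` and `load_rows` are hypothetical helpers.

```python
import csv
import io

# Invented sample rows; the second record has a blank TotalCharges value.
RAW = """customerID,tenure,MonthlyCharges,TotalCharges,Churn
0001,1,29.85,29.85,No
0002,0,52.55, ,No
0003,2,53.85,108.15,Yes
"""

NUMERIC_COLUMNS = ("tenure", "MonthlyCharges", "TotalCharges")


def to_float(text):
    """Parse a numeric field, mapping blank or malformed text to None."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def load_rows(handle):
    """Read CSV records, coercing numeric columns and binarizing the label."""
    rows = []
    for record in csv.DictReader(handle):
        for column in NUMERIC_COLUMNS:
            record[column] = to_float(record[column])
        record["Churn"] = 1 if record["Churn"] == "Yes" else 0
        rows.append(record)
    return rows


rows = load_rows(io.StringIO(RAW))
missing = [r["customerID"] for r in rows if r["TotalCharges"] is None]
print(missing)
```

Mapping blanks to an explicit missing marker at ingest keeps the later imputation step honest, because a blank string never silently becomes zero.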

CSV, PySpark

02

Prepare

Handle missing values, remove outliers, encode categoricals, and scale key numeric features.

PySpark, config.yaml

03

Train

Benchmark classical ML baselines and log metrics, parameters, and artifacts into MLflow.

scikit-learn, XGBoost, MLflow

04

Infer

Load serialized assets for downstream prediction while recording latency and batch-level telemetry.

Python, MLflow

Timeline

A case-study flow that explains how the system took shape

This timeline keeps the implementation story concise: what was framed first, what was hardened next, and what ultimately moved the project toward production readiness.

Phase 01

Notebook consolidation

Moved scattered preparation and evaluation logic into a coherent repository structure.

Phase 02

Pipeline hardening

Created modular data, training, and inference stages with shared configuration.
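The sketch below shows one possible shape for modular stages driven by a shared configuration object. The `PipelineConfig` fields, the default path, the stage functions, and the `run` helper are hypothetical. A real setup would presumably load these values from `config.yaml`, which is omitted here to avoid a YAML dependency.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PipelineConfig:
    # All defaults are placeholders, not the project's actual settings.
    raw_path: str = "data/raw/telco.csv"
    test_fraction: float = 0.2
    seed: int = 42
    model: str = "random_forest"
    model_params: dict = field(default_factory=dict)


def prepare(config, state):
    state["prepared_from"] = config.raw_path
    state["test_fraction"] = config.test_fraction
    return state


def train(config, state):
    state["model"] = (config.model, dict(config.model_params))
    return state


def infer(config, state):
    state["scored_with"] = state["model"][0]
    return state


STAGES = {"prepare": prepare, "train": train, "infer": infer}


def run(config, stage_names):
    """Run the named stages in order, threading one state dict through them."""
    state = {}
    for name in stage_names:
        state = STAGES[name](config, state)
    return state


result = run(PipelineConfig(model="xgboost"), ["prepare", "train", "infer"])
print(result)
```

Every stage reads from the same immutable config, so changing the model or split fraction is a one-line edit rather than a hunt through notebook cells.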

Phase 03

Experiment tracking

Added MLflow-based lineage so metrics and model artifacts remain reviewable across runs.

Challenges

Technical constraints, decisions, and the reasoning behind them

This section is intentionally recruiter-friendly and engineer-friendly at the same time: each challenge is tied to a concrete design choice and a specific outcome.

Challenge

Turning exploratory notebooks into reusable production-style code.

Solution

Centralized logic under scriptable modules and orchestrated pipelines with shared config.

Outcome

Reduced manual repetition and improved auditability.

Challenge

Balancing scalable preprocessing with Python modeling libraries.

Solution

Used PySpark for deterministic transformation, then converted to pandas for estimator compatibility.

Outcome

Kept the system scalable without sacrificing model tooling.

Challenge

Keeping imbalanced-class performance honest.

Solution

Tracked F1, precision, and recall alongside raw accuracy in evaluation and reporting.
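A short worked example shows why accuracy alone can mislead on imbalanced churn labels. The labels and predictions below are made up, and `churn_metrics` is a hypothetical helper that applies the standard definitions of precision, recall, and F1.

```python
def churn_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for the positive (churn) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 2 churners among 10 customers.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_stay = [0] * 10                        # never predicts churn
some_signal = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # finds one of the two churners

baseline = churn_metrics(y_true, always_stay)
candidate = churn_metrics(y_true, some_signal)
print(baseline)
print(candidate)
```

Both predictors score 0.8 accuracy, but the first one never detects a churner (recall and F1 of 0.0) while the second reaches 0.5 on precision, recall, and F1.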

Outcome

Made the model discussion more credible for reviewers and recruiters.

Results

Metrics and outcomes presented for quick technical review

The emphasis here is signal, not decoration: key numbers, verifiable outcomes, and the context needed to interpret them responsibly.

Best CV F1

0.845

Random Forest baseline during notebook benchmarking.

Holdout Recall

0.735

Kept churn detection visible instead of hiding behind accuracy.

Train/Test Split

80/20

Consistent split strategy for reproducible evaluation.
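A reproducible 80/20 split can be sketched as a seeded shuffle within each class. Stratifying by label and the seed value of 42 are assumptions; the source states only the 80/20 ratio and that the split is consistent. The `stratified_split` helper and the class balance below are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 25 + [0] * 75  # made-up class balance
train_idx, test_idx = stratified_split(labels)
repeat = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With a fixed seed the same rows land in the same partition on every run, which keeps metrics comparable across experiments.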

Pipeline Scope

4 stages

Data prep, training, evaluation, and inference telemetry.

Key Results

  • Established a reproducible churn workflow from raw CSV to tracked model artifact.
  • Benchmarked Random Forest as the strongest baseline at roughly 0.845 cross-validated F1.
  • Surfaced class-aware metrics so churn detection quality stays visible.
  • Created a stronger foundation for future API serving and scheduled retraining.

Research + Business Impact

Business value

Supports retention-risk analysis with a pipeline that can evolve into batch scoring or service-based predictions.

Engineering value

Shows ML systems maturity through configuration, artifacts, reproducibility, and observability rather than just raw metrics.