Big Data Machine Learning with MLlib and scikit-learn on Distributed Data
From the big data models python curriculum · Updated May 31, 2026
TL;DR
You'll learn how to train machine learning models on datasets too big for one machine using Spark MLlib and distributed scikit-learn. You'll understand when to use each framework and how to handle the unique challenges of distributed ML. By the end, you'll be able to build and deploy scalable ML pipelines that work across clusters.
1. The Mental Model
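When your dataset doesn't fit in memory on one machine, you need distributed machine learning. MLlib runs natively on Spark clusters and partitions data automatically, while scikit-learn can be scaled out through Dask-ML or joblib's Dask backend. The key insight: some algorithms distribute naturally (a random forest trains its trees independently, and linear regression only needs sums that can be computed per partition and then combined), while others force every worker to synchronize on each iteration (k-means has to agree on new centroids before the next pass can start). That's the whole idea.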
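To make the "distributes naturally" point concrete, here is a minimal sketch in plain Python, with no Spark involved. It assumes a single feature and treats each inner list as one partition; the names `Stats`, `partition_stats`, and `solve` are made up for illustration and are not MLlib's implementation. Each partition reduces its rows to five numbers, and only those numbers have to travel to the driver.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for one-feature least squares."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other):
        # Merging is plain addition, so partitions can be combined in any order.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


def partition_stats(rows):
    """Worker side: summarize one partition of (x, y) rows."""
    stats = Stats()
    for x, y in rows:
        stats = stats.merge(Stats(1, x, y, x * x, x * y))
    return stats


def solve(stats):
    """Driver side: closed-form slope and intercept from the merged statistics."""
    slope = (stats.n * stats.sxy - stats.sx * stats.sy) / (stats.n * stats.sxx - stats.sx ** 2)
    intercept = (stats.sy - slope * stats.sx) / stats.n
    return slope, intercept


# Three hypothetical partitions of data generated by y = 2x + 1
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

total = Stats()
for part in partitions:
    total = total.merge(partition_stats(part))

slope, intercept = solve(total)
print(slope, intercept)
```

Because the merge step is just addition, the result does not depend on how the rows were split across partitions.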
2. The Core Material
MLlib: Spark's Native ML Library
MLlib is built for distributed computing from the ground up. It operates on Spark DataFrames and RDDs, automatically distributing computation across your cluster.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
# Initialize Spark
spark = SparkSession.builder.appName("MLlibExample").getOrCreate()
# Load distributed data
# Three slashes: use the cluster's default NameNode, then an absolute path
df = spark.read.parquet("hdfs:///large_dataset.parquet")
# Feature engineering on distributed data
assembler = VectorAssembler(inputCols=["feature1", "feature2", "feature3"],
                            outputCol="features")
feature_df = assembler.transform(df)
# Train model - computation happens across cluster
lr = LinearRegression(featuresCol="features", labelCol="target")
model = lr.fit(feature_df)
MLlib's strength is seamless integration with Spark's ecosystem. You can read from HDFS, process with Spark SQL, train with MLlib, and write results back—all in one pipeline.
Scikit-learn with Distributed Computing
Scikit-learn wasn't designed for distributed computing, but you can distribute it using Dask or joblib backends.
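import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from joblib import parallel_backend
# Start (or connect to) a Dask cluster; Client() with no arguments starts a local one
client = Client()
# Option 1: Dask-ML (scikit-learn compatible)
df = dd.read_parquet("large_dataset.parquet")
X = df[["feature1", "feature2", "feature3"]]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
# Dask-ML's linear models work on Dask arrays with known chunk sizes
X_train_arr = X_train.to_dask_array(lengths=True)
y_train_arr = y_train.to_dask_array(lengths=True)
# Distributed linear regression
model = LinearRegression()
model.fit(X_train_arr, y_train_arr)
# Option 2: Distributed scikit-learn with joblib
# joblib only speeds up estimators that use it internally (those with n_jobs),
# so this uses a random forest; SGDRegressor.fit would still run on one core
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
with parallel_backend('dask'):
    rf.fit(X_train.compute(), y_train.compute())  # compute() brings to memory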
Key Differences Between MLlib and Distributed Scikit-learn
Data Handling:
- MLlib: Works directly on Spark DataFrames, keeps data distributed
- Dask-ML: Works on Dask DataFrames, similar distributed approach
- Joblib + sklearn: Distributes the computation, not the data; the training set usually has to fit in memory on each worker
Algorithm Coverage:
- MLlib: Fewer algorithms but all truly distributed
- Scikit-learn: More algorithms, varying levels of distributed support
graph TD
A["Large Dataset"] --> B["MLlib Pipeline"]
A --> C["Dask-ML Pipeline"]
B --> D["Spark DataFrame"]
C --> E["Dask DataFrame"]
D --> F["Native Distributed Training"]
E --> G["Distributed Training"]
F --> H["Spark Model"]
G --> I["Sklearn-compatible Model"]
Handling Different Algorithm Types
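Easier to Distribute (work splits cleanly across partitions):
- Random Forest (trees are independent, so training is close to embarrassingly parallel)
- Linear/Logistic Regression (each iteration only needs gradients or sums aggregated across partitions)
Harder to Distribute (more sequential dependence or communication):
- Gradient Boosting (each tree corrects the previous one, so trees are built in sequence and only the work inside a tree is parallel)
- K-means clustering (every iteration must synchronize centroids across workers)
- Neural networks
- SVM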
# Random Forest - naturally parallel
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(numTrees=100, maxDepth=5)
rf_model = rf.fit(train_data) # Trees trained in parallel
# K-means - requires coordination between iterations
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=5, maxIter=20)
kmeans_model = kmeans.fit(data) # Centroids updated synchronously
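The "coordination between iterations" in k-means can be sketched in plain Python. This is an illustrative toy on one-dimensional points, not how MLlib implements `KMeans`; the functions `assign_and_sum` and `kmeans_step` and the sample data are assumptions made up for this example. Workers can process their partitions independently within an iteration, but nobody can start the next iteration until the merged centroids are known.

```python
def assign_and_sum(partition, centroids):
    """Worker step: per-centroid (sum, count) for one partition."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in partition:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def kmeans_step(partitions, centroids):
    """Driver step: merge the partial results, then update every centroid."""
    total_sums = [0.0] * len(centroids)
    total_counts = [0] * len(centroids)
    for part in partitions:  # on a cluster these calls would run in parallel
        sums, counts = assign_and_sum(part, centroids)
        for i in range(len(centroids)):
            total_sums[i] += sums[i]
            total_counts[i] += counts[i]
    # Synchronization point: new centroids exist only after all partitions report.
    return [
        total_sums[i] / total_counts[i] if total_counts[i] else centroids[i]
        for i in range(len(centroids))
    ]


# Hypothetical 1-D data split over three partitions
partitions = [[1.0, 2.0, 10.0], [3.0, 11.0], [12.0]]
centroids = [0.0, 5.0]
for _ in range(5):
    centroids = kmeans_step(partitions, centroids)

print(centroids)
```

Each pass costs one round of communication, which is why `maxIter` has a direct effect on runtime for iterative algorithms on a cluster.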
3. Worked Example
Let's build a complete distributed ML pipeline to predict customer churn on a 10GB dataset.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
# Step 1: Initialize Spark with cluster resources
spark = SparkSession.builder \
    .appName("ChurnPrediction") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()
# Step 2: Load large dataset (automatically partitioned)
# Three slashes: use the cluster's default NameNode, then an absolute path
df = spark.read.csv("hdfs:///customer_data.csv", header=True, inferSchema=True)
print(f"Dataset size: {df.count()} rows across {df.rdd.getNumPartitions()} partitions")
# Step 3: Preprocessing pipeline
indexer = StringIndexer(inputCol="customer_type", outputCol="customer_type_indexed")
assembler = VectorAssembler(
    inputCols=["age", "monthly_charges", "total_charges", "customer_type_indexed"],
    outputCol="features"
)
# Step 4: Model training
lr = LogisticRegression(featuresCol="features", labelCol="churn")
# Step 5: Create and fit pipeline (all steps distributed)
pipeline = Pipeline(stages=[indexer, assembler, lr])
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_data) # Training distributed across cluster
# Step 6: Evaluation
predictions = model.transform(test_data)
evaluator = BinaryClassificationEvaluator(labelCol="churn", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc:.3f}")
The entire pipeline runs distributed: data stays partitioned across the cluster, preprocessing runs in parallel on each partition, and model training uses the cores of every executor.
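If the `areaUnderROC` number feels abstract, the following plain-Python sketch shows one way to read it: the probability that a randomly chosen churner gets a higher score than a randomly chosen non-churner. This pairwise version is quadratic and only meant for a handful of rows; it is not how Spark computes the metric, and the function name and sample values are made up for illustration.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = churned) and model scores
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.6]
auc = area_under_roc(labels, scores)
print(f"AUC: {auc:.3f}")
```

A value of 0.5 means the ranking is no better than chance, and 1.0 means every churner is scored above every non-churner.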
4. Production Pitfalls & Best Practices
4.1 Real-World Best Practices
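Partition Management: Aim for partitions of roughly 100-200MB; Spark's default split size for file reads (spark.sql.files.maxPartitionBytes) is 128MB. Much smaller partitions add scheduling overhead, while partitions approaching 1GB risk memory pressure and spills.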
Data Persistence: Cache intermediate DataFrames you'll reuse multiple times using .cache() or .persist().
Resource Allocation: Size your Spark executors based on your data—start with 4-8GB executor memory and 2-4 cores per executor.
Algorithm Selection: Choose algorithms that parallelize well for distributed training—Random Forest over SVM, SGD over batch gradient descent.
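Model Serialization: Save Spark models with MLlib's native save() and load() methods. Save scikit-learn models with pickle or joblib, and pin the scikit-learn version, because a pickled model is not guaranteed to load under a different version.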
Monitoring: Track Spark UI metrics during training—look for data skew, GC pressure, and task failures.
Feature Engineering: Push feature transformations into the ML pipeline so they're applied consistently during training and inference.
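As a rough aid for the partition management advice above, the arithmetic can be sketched in a few lines. The 128MB target is an assumption chosen for illustration, and `suggested_partitions` is a hypothetical helper, not a Spark API; the result is a starting point to adjust after looking at the Spark UI.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def suggested_partitions(dataset_bytes, target_partition_bytes=128 * MB):
    """Number of partitions needed so that each one is about the target size."""
    if dataset_bytes <= 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(dataset_bytes / target_partition_bytes))


# The 10GB churn dataset from the worked example
n_partitions = suggested_partitions(10 * GB)
print(n_partitions)
```

The returned number could then be passed to `df.repartition(...)` if the partition count Spark chose on read is far off.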
4.2 Common Bugs & Anti-Patterns
Problem: Collecting large datasets to driver
# BAD - brings entire dataset to single machine
predictions_list = model.transform(large_df).collect()
# GOOD - keep processing distributed
model.transform(large_df).write.parquet("predictions_output/")
Problem: Creating too many small partitions
# BAD - creates thousands of tiny partitions
df = spark.read.csv("data/*.csv") # 5000 small files
# GOOD - repartition for optimal size
df = spark.read.csv("data/*.csv").repartition(200)
Problem: Not handling data skew
# BAD - some partitions much larger than others
df.groupBy("user_id").count() # Power users dominate partitions
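# GOOD - salt keys for even distribution, then aggregate in two stages
from pyspark.sql.functions import rand, sum as spark_sum
df_salted = df.withColumn("salt", (rand() * 10).cast("int"))
partial = df_salted.groupBy("user_id", "salt").count()  # a hot user is split into up to 10 groups
counts = partial.groupBy("user_id").agg(spark_sum("count").alias("count"))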
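The idea behind salting can be checked without a cluster. The sketch below is plain Python with made-up data and a hypothetical `salted_counts` helper: a first stage counts per (user, salt) pair so that no single key carries all of a heavy user's rows, and a second stage drops the salt and adds the partial counts back together.

```python
import random
from collections import Counter


def salted_counts(user_ids, num_salts=10, seed=0):
    """Two-stage count: first per (user_id, salt), then per user_id."""
    rng = random.Random(seed)
    # Stage 1: a hot user is spread over up to num_salts different keys.
    partial = Counter((uid, rng.randrange(num_salts)) for uid in user_ids)
    # Stage 2: drop the salt and add up the partial counts.
    final = Counter()
    for (uid, _salt), count in partial.items():
        final[uid] += count
    return partial, final


# Hypothetical skewed data: one user produces almost all events
events = ["power_user"] * 1000 + ["a", "b", "c"]
partial, final = salted_counts(events)

largest_group = max(count for (uid, _salt), count in partial.items() if uid == "power_user")
print(final["power_user"], largest_group)
```

The final counts match an unsalted count exactly; what changes is that the largest group any single task has to process is a fraction of the original.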
Problem: Inefficient cross-validation
# BAD - sequential cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X.compute(), y.compute(), cv=5)
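# GOOD - run the folds in parallel on a Dask cluster (data must still fit in memory)
from dask.distributed import Client
from joblib import parallel_backend
client = Client()  # the 'dask' joblib backend needs an active Client
with parallel_backend('dask'):
    scores = cross_val_score(model, X.compute(), y.compute(), cv=5, n_jobs=-1)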
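Cross-validation parallelizes well because each fold is fitted and scored independently of the others. The following standard-library sketch illustrates that with a thread pool standing in for a cluster and a trivial mean predictor standing in for a model; `kfold_indices` and `score_fold` are made-up names for this example, not scikit-learn or Dask functions.

```python
from concurrent.futures import ThreadPoolExecutor


def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) for contiguous, nearly equal folds."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def score_fold(job):
    """Fit a mean predictor on the training rows, return MSE on the test rows."""
    y, train, test = job
    mean = sum(y[i] for i in train) / len(train)
    return sum((y[i] - mean) ** 2 for i in test) / len(test)


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
jobs = [(y, train, test) for train, test in kfold_indices(len(y), 3)]

# No fold needs another fold's result, so all of them can run at once.
with ThreadPoolExecutor(max_workers=3) as pool:
    scores = list(pool.map(score_fold, jobs))

print(scores)
```

With a real estimator the per-fold work is the expensive part, which is what a distributed backend spreads across workers.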
4.3 When to Use What (Decision Matrix)
| Data Size | Algorithm Type | Team Expertise | Recommended Approach |
|---|---|---|---|
| < 1GB | Any | Sklearn familiar | Scikit-learn (single machine) |
| 1-100GB | Tree-based | Mixed | MLlib or Dask-ML |
| 100GB+ | Any distributed | Spark ecosystem | MLlib |
| < 10GB | Small neural networks (MLP) | Python/sklearn | Scikit-learn + joblib |
| 10GB+ | Linear models | Spark experience | MLlib |
| Streaming data | Online learning | Real-time needs | MLlib streaming algorithms (e.g. StreamingKMeans), or a pre-trained MLlib model applied with Structured Streaming |
5. Now Try It
Build a distributed recommendation system using the MovieLens 25M dataset (about 262MB as a zip download, larger once unzipped). Download it, load it into Spark, and train an ALS (Alternating Least Squares) model to predict movie ratings.
Your task: Create a complete pipeline that loads the ratings data, splits it into train/test, trains an ALS model with MLlib, and evaluates RMSE on the test set. Include proper error handling for missing data and optimize the ALS hyperparameters (rank, iterations, regularization).
Success looks like: A working Spark application that outputs "Test RMSE: X.XX" and saves the trained model to disk. The entire process should handle the data in a distributed fashion without collecting large datasets to the driver.
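To sanity-check whatever your evaluator reports, it can help to compute RMSE by hand on a few rows you have pulled out with `take()` or `limit()`. A minimal sketch, assuming plain Python lists of actual and predicted ratings; the `rmse` helper and the sample values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long lists of ratings."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Three hypothetical ratings and model predictions
value = rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0])
print(f"Test RMSE: {value:.2f}")
```

Use this only on a small sample; the full evaluation should stay distributed.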