Big Data Storage, Graph Processing, and Advanced Topics

From the big data models python curriculum · Updated May 31, 2026

Big Data Storage, Graph Processing, and Advanced Topics

TL;DR

You'll learn how to store massive datasets efficiently using columnar formats like Parquet, process graph data with NetworkX and distributed systems, and handle advanced big data challenges like streaming and machine learning at scale. These tools let you work with datasets that don't fit in memory and solve complex network problems. By the end, you'll know when to use each approach and how to avoid common performance traps.

1. The Mental Model

Big data isn't just "lots of data" — it's data that breaks your normal tools. Storage becomes about compression and column-oriented access patterns instead of rows. Graph processing means thinking in nodes and edges, not tables. That's the whole idea.

2. The Core Material

2.1 Columnar Storage with Parquet

Traditional CSV files store data row by row, but Parquet stores it column by column. This matters because most analytics queries only need a few columns from wide tables.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

# Create sample data
data = {
    'user_id': np.arange(1000000),
    'timestamp': pd.date_range('2023-01-01', periods=1000000, freq='1min'),
    'revenue': np.random.exponential(10, 1000000),
    'category': np.random.choice(['A', 'B', 'C'], 1000000)
}
df = pd.DataFrame(data)

# Write to Parquet with compression
df.to_parquet('sales_data.parquet', compression='snappy', index=False)

# Read only specific columns (this is where Parquet shines)
revenue_data = pd.read_parquet('sales_data.parquet', columns=['revenue', 'category'])

Parquet files are typically 10-20x smaller than equivalent CSVs and read much faster when you only need some columns.

2.2 Graph Data Structures and NetworkX

Graphs represent relationships: social networks, web links, transportation routes. NetworkX makes this intuitive in Python.

import networkx as nx
import matplotlib.pyplot as plt

# Create a directed graph
G = nx.DiGraph()

# Add nodes and edges
G.add_nodes_from([1, 2, 3, 4, 5])
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (2, 4)])

# Basic graph metrics
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Is connected: {nx.is_weakly_connected(G)}")

# Find shortest path
shortest = nx.shortest_path(G, source=1, target=4)
print(f"Shortest path from 1 to 4: {shortest}")

# Calculate centrality measures
betweenness = nx.betweenness_centrality(G)
print(f"Betweenness centrality: {betweenness}")

2.3 Distributed Graph Processing with Dask

When graphs get too big for NetworkX, you need distributed processing. Dask extends pandas and NumPy patterns to multiple machines.

import dask.dataframe as dd
from dask.distributed import Client

# Connect to Dask cluster (or use local threads)
client = Client('localhost:8786')  # or Client() for local

# Load large dataset as Dask DataFrame
df = dd.read_parquet('large_graph_edges.parquet')

# Graph operations on distributed data
edge_counts = df.groupby('source_node').size().compute()
node_degrees = df['target_node'].value_counts().compute()

# Dask delayed for custom graph algorithms
import dask
from dask import delayed

@delayed
def process_subgraph(partition):
    # Custom processing on each partition
    return partition.groupby('community').agg({'weight': 'sum'})

results = [process_subgraph(partition) for partition in df.to_delayed()]
final_result = dask.compute(*results)

2.4 Streaming Data Processing

Real-world big data often arrives as streams. Apache Kafka handles the messaging, while stream processors handle the computation.

from kafka import KafkaConsumer, KafkaProducer
import json
from collections import defaultdict
import time

# Simple streaming aggregation
class StreamingAggregator:
    def __init__(self):
        self.window_size = 60  # seconds
        self.counts = defaultdict(int)
        self.window_start = time.time()

    def process_message(self, message):
        current_time = time.time()

        # Check if we need a new window
        if current_time - self.window_start > self.window_size:
            self.emit_results()
            self.reset_window()

        # Process the message
        event_type = message.get('event_type')
        self.counts[event_type] += 1

    def emit_results(self):
        print(f"Window results: {dict(self.counts)}")

    def reset_window(self):
        self.counts.clear()
        self.window_start = time.time()

# Consumer setup
consumer = KafkaConsumer(
    'events',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8'))
)

aggregator = StreamingAggregator()
for message in consumer:
    aggregator.process_message(message.value)

3. Worked Example

Let's build a complete pipeline that processes social network data to find influential users.

import pandas as pd
import networkx as nx
import pyarrow.parquet as pq

# Step 1: Load edge data from Parquet
edges_df = pd.read_parquet('social_network_edges.parquet')
print(f"Loaded {len(edges_df)} edges")

# Step 2: Create NetworkX graph
G = nx.from_pandas_edgelist(
    edges_df, 
    source='from_user', 
    target='to_user', 
    edge_attr='weight',
    create_using=nx.DiGraph()
)

# Step 3: Calculate influence metrics
pagerank_scores = nx.pagerank(G, weight='weight')
betweenness_scores = nx.betweenness_centrality(G)
in_degree_scores = dict(G.in_degree(weight='weight'))

# Step 4: Combine metrics into influence score
influence_data = []
for node in G.nodes():
    influence_score = (
        0.5 * pagerank_scores.get(node, 0) +
        0.3 * betweenness_scores.get(node, 0) +
        0.2 * (in_degree_scores.get(node, 0) / max(in_degree_scores.values()))
    )
    influence_data.append({
        'user_id': node,
        'pagerank': pagerank_scores.get(node, 0),
        'betweenness': betweenness_scores.get(node, 0),
        'weighted_in_degree': in_degree_scores.get(node, 0),
        'influence_score': influence_score
    })

# Step 5: Save results
results_df = pd.DataFrame(influence_data)
results_df = results_df.sort_values('influence_score', ascending=False)
results_df.to_parquet('user_influence_scores.parquet', index=False)

print("Top 5 most influential users:")
print(results_df.head())

4. Production Pitfalls & Best Practices

4.1 Real-World Best Practices

  • Partition Parquet files by date/category — enables query pruning and faster reads
  • Use appropriate compression — snappy for speed, gzip for size, lz4 for balance
  • Batch graph updates — don't rebuild NetworkX graphs for every edge change
  • Set memory limits in Dask — prevents OOM kills with memory_limit='4GB'
  • Use connection pooling — reuse Kafka producers/consumers instead of creating new ones
  • Monitor partition skew — uneven data distribution kills distributed performance
  • Cache intermediate results — store computed centrality measures for reuse

4.2 Common Bugs & Anti-Patterns

Bad: Loading entire Parquet file to get metadata

# This loads all data into memory
df = pd.read_parquet('huge_file.parquet')
print(f"Shape: {df.shape}")

Good: Use parquet metadata

# This reads only metadata
parquet_file = pq.ParquetFile('huge_file.parquet')
print(f"Rows: {parquet_file.metadata.num_rows}")

Bad: Converting large graphs to pandas then NetworkX

# Memory explosion for big graphs
df = pd.read_csv('10million_edges.csv')
G = nx.from_pandas_edgelist(df)  # OOM!

Good: Stream construction or use igraph

# Build graph incrementally
G = nx.DiGraph()
for chunk in pd.read_csv('10million_edges.csv', chunksize=10000):
    edges = list(zip(chunk['source'], chunk['target']))
    G.add_edges_from(edges)

Bad: Synchronous message processing

# Blocks on slow operations
for message in consumer:
    slow_database_write(message)  # Creates backlog

Good: Async processing with batching

batch = []
for message in consumer:
    batch.append(message)
    if len(batch) >= 100:
        asyncio.create_task(process_batch(batch))
        batch = []

4.3 When to Use What (Decision Matrix)

Data Size Structure Use Case Best Tool
< 1GB Tabular Analytics queries Pandas + Parquet
> 1GB Tabular Distributed analytics Dask DataFrame
< 1M nodes Graph Network analysis NetworkX
> 1M nodes Graph Distributed graphs igraph or GraphX
Real-time Streaming Event processing Kafka + custom processors
Batch Any ETL pipelines Apache Airflow + Dask

5. Now Try It

Build a recommendation system using graph data:

  1. Create a bipartite graph with users and items they've interacted with
  2. Use NetworkX to calculate item similarity based on shared users
  3. Generate recommendations for a target user
  4. Save the similarity matrix as Parquet for fast lookup
  5. Bonus: Stream new interactions and update recommendations incrementally

What success looks like: You can input a user ID and get back the top 10 recommended items with similarity scores, and the whole pipeline runs in under 30 seconds for 10,000 users and 1,000 items.


Get the full big data models python curriculum

Clone the complete plan to your dashboard for unlimited AI-generated notes, practice quizzes, and a personalised revision schedule.

Create Free Account