Big Data Storage, Graph Processing, and Advanced Topics
TL;DR
You'll learn how to store massive datasets efficiently using columnar formats like Parquet, process graph data with NetworkX and distributed systems, and handle advanced big data challenges like streaming and machine learning at scale. These tools let you work with datasets that don't fit in memory and solve complex network problems. By the end, you'll know when to use each approach and how to avoid common performance traps.
1. The Mental Model
Big data isn't just "lots of data"; it's data that breaks your normal tools. Storage shifts from row-by-row layouts to compressed, column-oriented ones, because analytics reads a few columns across many rows. Graph processing means thinking in nodes and edges, not tables. That's the whole idea.
2. The Core Material
2.1 Columnar Storage with Parquet
Traditional CSV files store data row by row, but Parquet stores it column by column. This matters because most analytics queries only need a few columns from wide tables.
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

# Create sample data
data = {
    'user_id': np.arange(1000000),
    'timestamp': pd.date_range('2023-01-01', periods=1000000, freq='1min'),
    'revenue': np.random.exponential(10, 1000000),
    'category': np.random.choice(['A', 'B', 'C'], 1000000)
}
df = pd.DataFrame(data)

# Write to Parquet with compression
df.to_parquet('sales_data.parquet', compression='snappy', index=False)

# Read only specific columns (this is where Parquet shines)
revenue_data = pd.read_parquet('sales_data.parquet', columns=['revenue', 'category'])
```
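Parquet files are often several times smaller than the equivalent CSV, though the exact ratio depends on the data types, how repetitive the values are, and the compression codec. They also read much faster when you only need some columns, because the reader can skip the unneeded columns on disk instead of parsing every row.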
2.2 Graph Data Structures and NetworkX
Graphs represent relationships: social networks, web links, transportation routes. NetworkX makes this intuitive in Python.
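To make the "nodes and edges" idea concrete, here is a rough standard-library sketch of what a graph library manages for you: an adjacency list plus a breadth-first search for the shortest path by edge count. The helper `bfs_shortest_path` is a hypothetical name for illustration, not part of NetworkX, and the edge list is just a small example.

```python
from collections import deque

# Illustrative directed edges: (source, destination)
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (2, 4)]

# Adjacency list: each node maps to the nodes it points to
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)
    adjacency.setdefault(dst, [])  # make sure every node has an entry


def bfs_shortest_path(adj, source, target):
    """Return one path with the fewest edges from source to target, or None."""
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adj[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # target is not reachable from source


path = bfs_shortest_path(adjacency, 1, 4)
print(path)  # [1, 2, 4]
```

A dictionary of lists is fine for small graphs held in memory. A library adds node and edge attributes, many ready-made algorithms, and consistent handling of directed and undirected graphs.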
```python
import networkx as nx
import matplotlib.pyplot as plt

# Create a directed graph
G = nx.DiGraph()

# Add nodes and edges
G.add_nodes_from([1, 2, 3, 4, 5])
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (2, 4)])

# Basic graph metrics
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Is connected: {nx.is_weakly_connected(G)}")

# Find shortest path
shortest = nx.shortest_path(G, source=1, target=4)
print(f"Shortest path from 1 to 4: {shortest}")

# Calculate centrality measures
betweenness = nx.betweenness