Processing Large Smart-Meter CSVs in Python in a Memory-Efficient Way
How Energy Companies Actually Read Electricity Data
Smart Meter Datasets Grow Fast
A single household at hourly resolution is manageable; an entire city at hourly resolution can quickly become a multi‑gigabyte (or larger) CSV.
If you try to load a 100 GB CSV in Python with a simple df = pd.read_csv(...), you're effectively asking your machine to materialize the entire dataset in memory at once, which often ends in a MemoryError.
This post walks through a clean, tutorial-friendly pattern:
Generate a dummy smart-meter CSV (small enough for demos, but realistic).
Compute total electricity consumption using chunked reads (streaming through the file instead of loading it all at once).
Add a few production-minded improvements that make chunking even more effective.
Step 1: Creating a Dummy Electricity Consumption Dataset
The demo file has three columns:
timestamp: hourly time points
meter_id: unique smart meter identifier
kwh: electricity consumption per hour
Pandas makes it easy to generate hourly timestamps using pd.date_range, which returns equally spaced time points according to a specified frequency (e.g., hourly).
Here's a small script that generates the file:
import pandas as pd
import numpy as np
# Optional: make results reproducible
np.random.seed(42)
data = {
"timestamp": pd.date_range(start="2026-01-01", periods=500, freq="h"),  # "H" is deprecated in pandas >= 2.2
"meter_id": ["METER_001"] * 500,
"kwh": np.random.uniform(0.5, 5.0, 500).round(2), # hourly kWh values
}
df = pd.DataFrame(data)
df.to_csv("smart_meter_data.csv", index=False)
print("File 'smart_meter_data.csv' created successfully!")
A quick intuition check: 500 hourly rows is about 20.8 days of readings—perfect for a demo. For "real" scenarios, just imagine the same schema repeated across thousands (or millions) of smart meter IDs (METER_001, METER_002, and so on).
Step 2: Why pd.read_csv(...) Can Crash on City-Scale Data
When a CSV is bigger than available RAM (or big enough to trigger memory pressure), attempting to read it all into a single DataFrame can fail.
The key concept is to stream the file: read a limited number of rows, compute what you need, discard the chunk, then repeat.
Pandas supports this pattern directly: using the chunksize parameter makes pd.read_csv return a TextFileReader iterator, yielding one DataFrame “chunk” at a time. The documentation describes chunksize as the number of lines read per chunk and notes that passing it returns an iterator for chunked processing.
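To see the iterator behavior concretely, here is a minimal, self-contained sketch. It builds a small in-memory CSV (so there is no dependency on an external file) and reads it in chunks of 100 rows; note that since pandas 1.2 the TextFileReader also works as a context manager, which closes the underlying handle even if processing raises an error.

```python
import io
import pandas as pd

# Build a small in-memory CSV so the example is self-contained.
csv_text = "kwh\n" + "\n".join(str(round(0.5 + 0.01 * i, 2)) for i in range(250))

# With chunksize set, read_csv returns a TextFileReader, not a DataFrame.
with pd.read_csv(io.StringIO(csv_text), chunksize=100) as reader:
    sizes = [len(chunk) for chunk in reader]  # each item is an ordinary DataFrame

print(sizes)  # [100, 100, 50]
```

250 rows with chunksize=100 yield two full chunks and one partial one—the iterator simply stops when the file is exhausted.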
Step 3: Compute Total kWh Using Chunked Reads (Streaming)
The chunking approach is exactly the right “data engineering” pattern for large flat files:
Use chunk['kwh'].sum() (vectorized Pandas) instead of Python's sum(...)
Print both chunk-level and running totals (more intuitive for learners)
import pandas as pd
total_energy_kwh = 0.0
for i, chunk in enumerate(pd.read_csv("smart_meter_data.csv", chunksize=100), start=1):
    chunk_sum = chunk["kwh"].sum()
    total_energy_kwh += chunk_sum
    print(f"Chunk {i}: chunk_sum={chunk_sum:.2f} kWh | running_total={total_energy_kwh:.2f} kWh")
print(f"\nTotal Energy Consumed: {total_energy_kwh:.2f} kWh")
What’s happening here?
chunksize=100 means: read 100 rows (100 hours, in this dataset) at a time.
Each chunk is a normal DataFrame, so you can run standard Pandas operations on it.
You keep only a small running accumulator (total_energy_kwh) in memory.
This is exactly the kind of “streaming aggregate” operation chunking is best at.
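Sums are not the only aggregate that streams well: anything expressible as a small set of running accumulators works. As a sketch, here is a streaming mean that keeps only a sum and a row count (using an in-memory CSV with the same kwh column as the demo file, so it runs standalone):

```python
import io
import pandas as pd

# Self-contained stand-in for a large CSV (same kwh schema as the demo file).
csv_text = "kwh\n" + "\n".join(["2.0"] * 150 + ["4.0"] * 50)

total = 0.0
count = 0
with pd.read_csv(io.StringIO(csv_text), chunksize=64) as reader:
    for chunk in reader:
        total += chunk["kwh"].sum()  # vectorized per-chunk sum
        count += len(chunk)          # running row count

mean_kwh = total / count
print(f"mean = {mean_kwh:.2f} kWh over {count} rows")  # mean = 2.50 kWh over 200 rows
```

The two accumulators occupy a few bytes regardless of whether the file holds two hundred rows or two billion—that constant memory footprint is the whole point of streaming aggregates.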
Production-minded improvements
Chunking is powerful, but you get the best results when you combine it with “load less data” and efficient datatypes.
1) Read only the columns you need with usecols
If your goal is total kWh, you don’t need timestamps or meter IDs at all. Pandas explicitly calls out usecols as a way to limit columns read into memory.
total_energy_kwh = 0.0
for chunk in pd.read_csv("smart_meter_data.csv", usecols=["kwh"], chunksize=100_000):
    total_energy_kwh += chunk["kwh"].sum()
print(f"Total Energy Consumed: {total_energy_kwh:.2f} kWh")
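The "efficient datatypes" half of the advice works the same way: read_csv accepts a dtype mapping, and for smart-meter data it is often reasonable to assume kwh fits in float32 and meter_id has low cardinality (a good fit for category). A self-contained sketch comparing the two footprints:

```python
import io
import pandas as pd

# Synthetic CSV: 1,000 rows, only 3 distinct meter IDs.
csv_text = "meter_id,kwh\n" + "\n".join(
    f"METER_{i % 3:03d},{1.5 + i % 5}" for i in range(1000)
)

# Default dtypes: object strings for meter_id, float64 for kwh.
default_df = pd.read_csv(io.StringIO(csv_text))

# Slimmer dtypes: category for low-cardinality IDs, float32 for readings.
slim_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"meter_id": "category", "kwh": "float32"},
)

print(default_df.memory_usage(deep=True).sum())  # larger
print(slim_df.memory_usage(deep=True).sum())     # noticeably smaller
```

The same dtype argument can be combined with chunksize, shrinking every chunk as it is read.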
2) Choose a practical chunksize
Smaller chunks reduce peak memory usage.
Larger chunks can be faster.
In real workloads, values like 50_000 to 1_000_000 rows per chunk are common, subject to your RAM and number of columns.
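Rather than guessing, you can calibrate: read one chunk, measure what it actually occupies with memory_usage(deep=True), and size your chunks against a memory budget. A sketch (the exact byte counts vary by platform and pandas version, so treat the numbers as illustrative):

```python
import io
import pandas as pd

# Synthetic CSV standing in for a large file.
csv_text = "meter_id,kwh\n" + "\n".join(
    f"METER_{i % 10:03d},{0.5 + (i % 40) * 0.1:.1f}" for i in range(5000)
)

# Read just one chunk to sample the per-row cost.
with pd.read_csv(io.StringIO(csv_text), chunksize=1000) as reader:
    chunk = next(reader)

bytes_per_chunk = chunk.memory_usage(deep=True).sum()
bytes_per_row = bytes_per_chunk / len(chunk)
print(f"~{bytes_per_row:.0f} bytes/row; a 256 MB budget allows "
      f"roughly {256 * 2**20 // bytes_per_row:,.0f} rows per chunk")
```

In practice you would leave generous headroom below your budget, since intermediate results (groupbys, copies) temporarily multiply the working set.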
City-scale files: total kWh per meter (not just one household)
The previous demo file has only METER_001, but city-scale datasets usually have many meters mixed together.
A very common pattern is:
Compute per-meter totals within each chunk
Accumulate results into a dictionary (or Series) across chunks
import pandas as pd
totals = {} # meter_id -> total kWh
for chunk in pd.read_csv("smart_meter_data.csv", chunksize=200_000):
    # Sum within this chunk
    chunk_totals = chunk.groupby("meter_id")["kwh"].sum()
    # Add into the running dictionary
    for meter_id, kwh_sum in chunk_totals.items():
        totals[meter_id] = totals.get(meter_id, 0.0) + kwh_sum
totals_series = pd.Series(totals).sort_values(ascending=False)
print(totals_series.head(10))
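The dictionary loop is easy to follow, but pandas can also merge the per-chunk results directly: Series.add(..., fill_value=0) aligns on meter_id and treats meters missing from one side as zero. A self-contained sketch with two meters in an in-memory CSV:

```python
import io
import pandas as pd

# 60 readings of 1.0 kWh for one meter, 40 readings of 2.0 kWh for another.
csv_text = "meter_id,kwh\n" + "\n".join(
    ["METER_001,1.0"] * 60 + ["METER_002,2.0"] * 40
)

totals = pd.Series(dtype="float64")
with pd.read_csv(io.StringIO(csv_text), chunksize=30) as reader:
    for chunk in reader:
        chunk_totals = chunk.groupby("meter_id")["kwh"].sum()
        # add() aligns on meter_id; fill_value=0 covers meters
        # that appear on only one side of the addition.
        totals = totals.add(chunk_totals, fill_value=0)

print(totals["METER_002"], totals["METER_001"])  # 80.0 60.0
```

This keeps the accumulation vectorized end to end, which matters once a city-scale file contains hundreds of thousands of distinct meters.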
Wrap-up
In smart-meter analytics, the rules of thumb are:
Small CSV file? Load it normally and keep it simple with df = pd.read_csv("smart_meter_data.csv").
Large file (millions of rows)? Stream it with chunksize, compute incremental results, and keep memory usage stable, as shown above.
Combine chunking with usecols to go even further.

