Processing Large Smart-Meter CSVs in Python in a Memory-Efficient Way
How Energy Companies Actually Read Electricity Data
Smart Meter Datasets Grow Fast
A single household at hourly resolution is manageable; an entire city at hourly resolution can quickly become a multi‑gigabyte (or larger) CSV.
If you try to load a 100 GB CSV in Python with a simple df = pd.read_csv(...), you're effectively asking your machine to materialize the entire dataset in memory at once, which often ends in a MemoryError.
This post walks through a clean, tutorial-friendly pattern:
Generate a dummy smart-meter CSV (small enough for demos, but realistic).
Compute total electricity consumption using chunked reads (streaming through the file instead of loading it all at once).
Add a few production-minded improvements that make chunking even more effective.
Step 1: Creating a Dummy Electricity Consumption Dataset
The demo file has three columns:
timestamp: hourly time points
meter_id: unique smart meter identifier
kwh: electricity consumption per hour
Pandas makes it easy to generate hourly timestamps using pd.date_range, which returns equally spaced time points according to a specified frequency (e.g., hourly).
Here's a small script that generates the file:
import pandas as pd
import numpy as np
# Optional: make results reproducible
np.random.seed(42)
data = {
"timestamp": pd.date_range(start="2026-01-01", periods=500, freq="h"),  # "H" is deprecated in pandas >= 2.2
"meter_id": ["METER_001"] * 500,
"kwh": np.random.uniform(0.5, 5.0, 500).round(2), # hourly kWh values
}
df = pd.DataFrame(data)
df.to_csv("smart_meter_data.csv", index=False)
print("File 'smart_meter_data.csv' created successfully!")
A quick intuition check: 500 hourly rows is about 20.8 days of readings—perfect for a demo. For "real" scenarios, just imagine the same schema repeated across thousands (or millions) of smart meter IDs (METER_001, METER_002, and so on).
Step 2: Why pd.read_csv(...) Can Crash on City-Scale Data
When a CSV is bigger than available RAM (or big enough to trigger memory pressure), attempting to read it all into a single DataFrame can fail.
The key concept is to stream the file: read a limited number of rows, compute what you need, discard the chunk, then repeat.
Pandas supports this pattern directly: using the chunksize parameter makes pd.read_csv return a TextFileReader iterator, yielding one DataFrame “chunk” at a time. The documentation describes chunksize as the number of lines read per chunk and notes that passing it returns an iterator for chunked processing.
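To see the iterator behavior concretely, here is a minimal, self-contained sketch. It builds a small in-memory CSV (so there is no dependency on an external file) and reads it in chunks of 100 rows; note that since pandas 1.2 the TextFileReader also works as a context manager, which closes the underlying handle even if processing raises an error.

```python
import io
import pandas as pd

# Build a small in-memory CSV so the example is self-contained.
csv_text = "kwh\n" + "\n".join(str(round(0.5 + 0.01 * i, 2)) for i in range(250))

# With chunksize set, read_csv returns a TextFileReader, not a DataFrame.
with pd.read_csv(io.StringIO(csv_text), chunksize=100) as reader:
    sizes = [len(chunk) for chunk in reader]  # each item is an ordinary DataFrame

print(sizes)  # [100, 100, 50]
```

250 rows with chunksize=100 yield two full chunks and one partial one—the iterator simply stops when the file is exhausted.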
Step 3: Compute Total kWh Using Chunked Reads (Streaming)
The chunking approach is exactly the right “data engineering” pattern for large flat files:
Use chunk['kwh'].sum() (vectorized Pandas) instead of Python's sum(...)
Print both chunk-level and running totals (more intuitive for learners)
import pandas as pd
total_energy_kwh = 0.0
for i, chunk in enumerate(pd.read_csv("smart_meter_data.csv", chunksize=100), start=1):
    chunk_sum = chunk["kwh"].sum()
    total_energy_kwh += chunk_sum
    print(f"Chunk {i}: chunk_sum={chunk_sum:.2f} kWh | running_total={total_energy_kwh:.2f} kWh")
print(f"\nTotal Energy Consumed: {total_energy_kwh:.2f} kWh")
What’s happening here?
chunksize=100 means: read 100 rows (100 hours, in this dataset) at a time.
Each chunk is a normal DataFrame, so you can run standard Pandas operations on it.
You keep only a small running accumulator (total_energy_kwh) in memory.
This is exactly the kind of “streaming aggregate” operation chunking is best at.
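Sums are not the only aggregate that streams well: anything expressible as a small set of running accumulators works. As a sketch, here is a streaming mean that keeps only a sum and a row count (using an in-memory CSV with the same kwh column as the demo file, so it runs standalone):

```python
import io
import pandas as pd

# Self-contained stand-in for a large CSV (same kwh schema as the demo file).
csv_text = "kwh\n" + "\n".join(["2.0"] * 150 + ["4.0"] * 50)

total = 0.0
count = 0
with pd.read_csv(io.StringIO(csv_text), chunksize=64) as reader:
    for chunk in reader:
        total += chunk["kwh"].sum()  # vectorized per-chunk sum
        count += len(chunk)          # running row count

mean_kwh = total / count
print(f"mean = {mean_kwh:.2f} kWh over {count} rows")  # mean = 2.50 kWh over 200 rows
```

The two accumulators occupy a few bytes regardless of whether the file holds two hundred rows or two billion—that constant memory footprint is the whole point of streaming aggregates.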
Production-minded improvements
Chunking is powerful, but you get the best results when you combine it with “load less data” and efficient datatypes.
1) Read only the columns you need with usecols
If your goal is total kWh, you don’t need timestamps or meter IDs at all. Pandas explicitly calls out usecols as a way to limit columns read into memory.
total_energy_kwh = 0.0
for chunk in pd.read_csv("smart_meter_data.csv", usecols=["kwh"], chunksize=100_000):
    total_energy_kwh += chunk["kwh"].sum()
print(f"Total Energy Consumed: {total_energy_kwh:.2f} kWh")
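The "efficient datatypes" half of the advice works the same way: read_csv accepts a dtype mapping, and for smart-meter data it is often reasonable to assume kwh fits in float32 and meter_id has low cardinality (a good fit for category). A self-contained sketch comparing the two footprints:

```python
import io
import pandas as pd

# Synthetic CSV: 1,000 rows, only 3 distinct meter IDs.
csv_text = "meter_id,kwh\n" + "\n".join(
    f"METER_{i % 3:03d},{1.5 + i % 5}" for i in range(1000)
)

# Default dtypes: object strings for meter_id, float64 for kwh.
default_df = pd.read_csv(io.StringIO(csv_text))

# Slimmer dtypes: category for low-cardinality IDs, float32 for readings.
slim_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"meter_id": "category", "kwh": "float32"},
)

print(default_df.memory_usage(deep=True).sum())  # larger
print(slim_df.memory_usage(deep=True).sum())     # noticeably smaller
```

The same dtype argument can be combined with chunksize, shrinking every chunk as it is read.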
2) Choose a practical chunksize
Smaller chunks reduce peak memory usage.
Larger chunks can be faster.
In real workloads, values like 50_000 to 1_000_000 rows per chunk are common, subject to your RAM and number of columns.
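Rather than guessing, you can calibrate: read one chunk, measure what it actually occupies with memory_usage(deep=True), and size your chunks against a memory budget. A sketch (the exact byte counts vary by platform and pandas version, so treat the numbers as illustrative):

```python
import io
import pandas as pd

# Synthetic CSV standing in for a large file.
csv_text = "meter_id,kwh\n" + "\n".join(
    f"METER_{i % 10:03d},{0.5 + (i % 40) * 0.1:.1f}" for i in range(5000)
)

# Read just one chunk to sample the per-row cost.
with pd.read_csv(io.StringIO(csv_text), chunksize=1000) as reader:
    chunk = next(reader)

bytes_per_chunk = chunk.memory_usage(deep=True).sum()
bytes_per_row = bytes_per_chunk / len(chunk)
print(f"~{bytes_per_row:.0f} bytes/row; a 256 MB budget allows "
      f"roughly {256 * 2**20 // bytes_per_row:,.0f} rows per chunk")
```

In practice you would leave generous headroom below your budget, since intermediate results (groupbys, copies) temporarily multiply the working set.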
City-scale files: total kWh per meter (not just one household)
The previous demo file has only METER_001, but city-scale datasets usually have many meters mixed together.
A very common pattern is:
Compute per-meter totals within each chunk
Accumulate results into a dictionary (or Series) across chunks
import pandas as pd
totals = {} # meter_id -> total kWh
for chunk in pd.read_csv("smart_meter_data.csv", chunksize=200_000):
    # Sum within this chunk
    chunk_totals = chunk.groupby("meter_id")["kwh"].sum()
    # Add into the running dictionary
    for meter_id, kwh_sum in chunk_totals.items():
        totals[meter_id] = totals.get(meter_id, 0.0) + kwh_sum
totals_series = pd.Series(totals).sort_values(ascending=False)
print(totals_series.head(10))
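The dictionary loop is easy to follow, but pandas can also merge the per-chunk results directly: Series.add(..., fill_value=0) aligns on meter_id and treats meters missing from one side as zero. A self-contained sketch with two meters in an in-memory CSV:

```python
import io
import pandas as pd

# 60 readings of 1.0 kWh for one meter, 40 readings of 2.0 kWh for another.
csv_text = "meter_id,kwh\n" + "\n".join(
    ["METER_001,1.0"] * 60 + ["METER_002,2.0"] * 40
)

totals = pd.Series(dtype="float64")
with pd.read_csv(io.StringIO(csv_text), chunksize=30) as reader:
    for chunk in reader:
        chunk_totals = chunk.groupby("meter_id")["kwh"].sum()
        # add() aligns on meter_id; fill_value=0 covers meters
        # that appear on only one side of the addition.
        totals = totals.add(chunk_totals, fill_value=0)

print(totals["METER_002"], totals["METER_001"])  # 80.0 60.0
```

This keeps the accumulation vectorized end to end, which matters once a city-scale file contains hundreds of thousands of distinct meters.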
Wrap-up
In smart-meter analytics, the rules of thumb are:
Small CSV file? Load it normally and keep it simple with df = pd.read_csv("smart_meter_data.csv").
Large file (millions of rows)? Stream it with chunksize, compute incremental results, and keep memory usage stable, as shown above.
Combine chunking with usecols to go even further.

