Python: Split up iterable into evenly-sized chunks

In a recent tool I developed, I needed to split a list (or any iterable) into equal-sized lists of values. In this post, I document how I got there: the steps it took and the decisions behind them.

Since Python 3.12, you can use itertools.batched instead 🥳
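As a quick sketch of what that looks like: the snippet below uses itertools.batched where it is available and otherwise falls back to a small stand-in, modelled on the rough equivalent given in the Python documentation. The fallback is only an assumption for running the example on older interpreters, not part of the tool described in this post.

```python
import sys
from itertools import islice

if sys.version_info >= (3, 12):
    from itertools import batched
else:
    def batched(iterable, n):
        """Rough stand-in for itertools.batched on older Python versions."""
        if n < 1:
            raise ValueError("n must be at least one")
        iterator = iter(iterable)
        # islice pulls at most n items; an empty tuple means the input is exhausted
        while batch := tuple(islice(iterator, n)):
            yield batch


# batched yields tuples, not lists, and the last batch may be shorter
print(list(batched(range(7), 3)))  # [(0, 1, 2), (3, 4, 5), (6,)]
```

Note that the batches are tuples, so code that expects lists (like the tests further down) would need a conversion.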

The StackOverflow way

The most comfortable way to solve problems is often to just search on the internet. When looking for "python chunk list", one of the first results is an already answered question on StackOverflow.

python-chunks/divide_chunks_simple.py

# Source: https://stackoverflow.com/a/312464
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

While this tiny function solves the original question in a beautiful and simple way, there is still room for improvement.
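To make that concrete, here is a small sketch of where the function works and where it stops working. The function is repeated so the snippet runs on its own; the sample inputs are made up for illustration.

```python
def chunks(lst, n):
    """Yield successive n-sized chunks from lst (same function as above)."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


# Works for anything that supports len() and slicing
print(list(chunks([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
print(list(chunks("abcdef", 4)))         # ['abcd', 'ef']

# A generator supports neither, so consuming the chunks raises a TypeError
numbers = (x * x for x in range(5))
try:
    list(chunks(numbers, 2))
except TypeError as error:
    print(f"TypeError: {error}")
```

Because chunks is itself a generator function, the error only shows up once the result is iterated, not when chunks() is called.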

Iterate over anything

A shortcoming of the solution above: It only works with sequences that support len() and slicing, such as lists. If your source data is an iterator or generator, which supports neither, it won't work. As I'm a great fan of composable functions, I refactored the solution to allow that:

python-chunks/divide_chunks_iterable.py

from collections import deque


def divide_chunks(data, chunksize):
    """
    Divide an iterator into chunks of the given size.
    The last chunk might be smaller than the chunksize.

    :param data: Anything that iterates values
    :param chunksize: Size of each chunk
    """
    data_iter = iter(data)
    buffer = deque()

    while True:
        try:
            buffer.append(next(data_iter))
        except StopIteration:
            break

        if len(buffer) == chunksize:
            yield list(buffer)
            buffer.clear()

    if buffer:
        yield list(buffer)

Compared with the (much smaller) previous approach, this one uses a deque as a temporary buffer until enough items have been collected.

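As soon as the iterator has ended, whatever is left is yielded as the last chunk. You cannot use return for that, as it would drop the value: inside a generator, a returned value is attached to the StopIteration exception and never reaches the consumer's loop.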
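The difference between the two can be seen in a minimal sketch. The two functions below are hypothetical stand-ins that hard-code their chunks; they only illustrate how Python treats return inside a generator.

```python
def leftover_with_return():
    yield [1, 2]
    # In a generator, the returned value is attached to StopIteration
    # and never reaches a for loop or list()
    return [3]


def leftover_with_yield():
    yield [1, 2]
    yield [3]


print(list(leftover_with_return()))  # [[1, 2]]
print(list(leftover_with_yield()))   # [[1, 2], [3]]

# The returned value is only visible when handling StopIteration manually
gen = leftover_with_return()
next(gen)
try:
    next(gen)
except StopIteration as stop:
    print(stop.value)  # [3]
```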

Adding and verifying types

Over the last few releases, Python has gained great support for type hints, which add clarity about what functions receive and return, and allow for nice integration with your IDE of choice.

Because we don't know exactly what data comes in and goes out, we need to define a generic type variable as a placeholder for the type checker to fill in. The checker is then able to infer the result type and use it to verify what the caller does with the result later.
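As a smaller illustration of that idea, here is a sketch with a hypothetical helper called first, which is not part of the tool. The type variable Item ties the element type of the input to the return type, so the checker can work out a concrete type at every call site.

```python
from typing import Iterable, TypeVar

Item = TypeVar("Item")


def first(items: Iterable[Item]) -> Item:
    """Return the first value of any iterable (hypothetical helper)."""
    for item in items:
        return item
    raise ValueError("items is empty")


number = first([1, 2, 3])  # a type checker infers int
word = first(("a", "b"))   # a type checker infers str

print(number + 1)    # 2
print(word.upper())  # A
# first([1, 2, 3]).upper() would fail with an AttributeError at runtime,
# but a type checker such as mypy can flag it before the code runs
```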

python-chunks/divide_chunks_types.py

from collections import deque
from typing import TypeVar, Iterable


ChunkValue = TypeVar("ChunkValue")


def divide_chunks(
    data: Iterable[ChunkValue], chunksize: int
) -> Iterable[list[ChunkValue]]:
    """
    Divide an iterator into chunks of the given size.
    The last chunk might be smaller than the chunksize.

    :param data: Anything that iterates values
    :param chunksize: Size of each chunk
    """
    if not isinstance(chunksize, int):
        raise TypeError(f"Chunksize must be an int, got {type(chunksize)} instead")

    # next() requires an Iterator, so let the type checker infer Iterator[ChunkValue]
    data_iter = iter(data)
    buffer: deque[ChunkValue] = deque()

    while True:
        try:
            buffer.append(next(data_iter))
        except StopIteration:
            break

        if len(buffer) == chunksize:
            yield list(buffer)
            buffer.clear()

    if buffer:
        yield list(buffer)

I'm not using the Generator type here, as we don't use any generator-specific features (sending values in or returning a final value), and the Python docs note that a generator that only yields values can be annotated with the simpler and more generic Iterable type instead.
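For comparison, here is a sketch of both annotation styles on a hypothetical generator that yields square numbers. Both functions behave identically at runtime; only the amount of detail given to the type checker differs.

```python
from typing import Generator, Iterable


def squares_iterable(limit: int) -> Iterable[int]:
    """Only yields values, so the short annotation is enough."""
    for value in range(limit):
        yield value * value


def squares_generator(limit: int) -> Generator[int, None, None]:
    """Generator[YieldType, SendType, ReturnType] spells out all three."""
    for value in range(limit):
        yield value * value


print(list(squares_iterable(4)))   # [0, 1, 4, 9]
print(list(squares_generator(4)))  # [0, 1, 4, 9]
```

One trade-off of the shorter form: a type checker will not accept next() directly on a value typed as Iterable, so callers have to wrap it in iter() first, or the function can be annotated with Iterator instead.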

Testing it

To ensure that the function works in all cases, I added a unit test to the project's test suite (using pytest).

I often write parametrized test functions, which let pytest generate a separate test for each case and give nicer output in case something goes wrong. The use of dataclasses and typing again allows for a nicer IDE experience.

python-chunks/test_divide_chunks.py

from typing import Iterable
from dataclasses import dataclass

import pytest

from divide_chunks_types import divide_chunks


@dataclass
class DivideCase:
    name: str
    input: Iterable
    chunksize: int
    expected: list


DIVIDE_TEST_CASES: list[DivideCase] = [
    # Simple list of numbers
    DivideCase(
        name="simple",
        input=[1, 2, 3, 4, 5],
        chunksize=1,
        expected=[[1], [2], [3], [4], [5]],
    ),
    # Use of a range object (a lazy sequence instead of a list)
    DivideCase(
        name="range",
        input=range(10),
        chunksize=2,
        expected=[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]],
    ),
    # Handling of leftovers when the iterator ends
    DivideCase(
        name="leftover",
        input=[1, 2, 3, 4, 5],
        chunksize=2,
        expected=[[1, 2], [3, 4], [5]],
    ),
    # Behaviour with an empty input
    DivideCase(
        name="empty",
        input=[],
        chunksize=2,
        expected=[],
    ),
]


@pytest.mark.parametrize(
    "case", DIVIDE_TEST_CASES, ids=[c.name for c in DIVIDE_TEST_CASES]
)
def test_divide_chunks(case: DivideCase):
    # Ensure that the algorithm works correctly
    assert list(divide_chunks(case.input, case.chunksize)) == case.expected
    # Ensure that double iter() works as well
    assert list(divide_chunks(iter(case.input), case.chunksize)) == case.expected


def test_divide_chunks_typing():
    with pytest.raises(TypeError, match="Chunksize must be an int.*"):
        list(divide_chunks([], None))

In the test I call the function twice, once with the input directly, and once wrapped with iter(), to ensure that both cases work correctly.
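One property the cases above do not cover is laziness. The following sketch shows two hypothetical extra tests, which are not part of the original suite, that check it with an endless iterator. The untyped function is repeated here so the snippet runs standalone; in the real test module it would be imported instead.

```python
from collections import deque
from itertools import count, islice


def divide_chunks(data, chunksize):
    """Untyped version from above, repeated so the snippet runs standalone."""
    data_iter = iter(data)
    buffer = deque()

    while True:
        try:
            buffer.append(next(data_iter))
        except StopIteration:
            break

        if len(buffer) == chunksize:
            yield list(buffer)
            buffer.clear()

    if buffer:
        yield list(buffer)


def test_divide_chunks_is_lazy():
    # count() never ends, so this only passes if chunks are produced on demand
    endless_chunks = divide_chunks(count(), 3)
    assert list(islice(endless_chunks, 2)) == [[0, 1, 2], [3, 4, 5]]


def test_divide_chunks_consumes_only_what_it_needs():
    source = iter(range(10))
    chunker = divide_chunks(source, 4)
    assert next(chunker) == [0, 1, 2, 3]
    # The shared iterator has advanced by exactly one chunk
    assert next(source) == 4
```

Both functions use plain assert statements, so pytest would collect them like any other test without further setup.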

And with the right invocation of pytest (the --cov options come from the pytest-cov plugin), we can see that we tested all of the code and ran all of the test code:

Bash

$ pipenv run pytest --cov-report term-missing --cov=.