
Data Classes in Python and How to Build Them

How to Create Data Classes In Python with Named Tuples, Typed Named Tuples and The dataclass Decorator.

Nadim Jendoubi

Feb 10, 2023 — 8 min read


Data classes are used in every OOP language: they are classes that contain only fields and CRUD methods for accessing them. The author of the book Fluent Python has an interesting take on this, which we will discover together.

Data Class Builders

Martin Fowler wrote in his book Refactoring: Improving the Design of Existing Code:

Data classes are like children. They are okay as a starting point, but to participate as a grownup object, they need to take some responsibility.

This should give you an idea about the points we’re visiting today:

An overview of Data class Builders
Building a class with Named Tuples
Building a class with Typed Named Tuples
Building a class with @dataclass Decorator
Refactoring data classes
Pattern Matching class instances

Overview of Data Class Builders

Python offers multiple ways to build a data class, which is basically a collection of fields and methods. We will cover three of the data class builders:

Named Tuples through collections.namedtuple
Typed Named Tuples through typing.NamedTuple, which is a Named Tuple but with type hints for its fields.
@dataclasses.dataclass

We will start by building the same class with the different builders:

class Family:
    def __init__(self, mother, father, daughter):
        self.mother = mother
        self.father = father
        self.daughter = daughter


adams_family = Family('Morticia Addams', 'Gomez Addams', 'Wednesday Addams')

print(adams_family)
# <__main__.Family object at 0x107142f10>

This is a classic data class with a boilerplate __init__ method that does nothing but initialize the fields. The __repr__ and __eq__ methods inherited from object are not very useful here, so we would have to implement them ourselves to get more meaningful results, and that is not the Pythonic way.
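To make the boilerplate concrete, here is a minimal sketch of what the hand-written version might look like once __repr__ and __eq__ are added. The name FamilyManual is a hypothetical one chosen for this illustration only.

```python
class FamilyManual:
    """Hand-written equivalent of what the data class builders generate."""

    def __init__(self, mother, father, daughter):
        self.mother = mother
        self.father = father
        self.daughter = daughter

    def __repr__(self):
        # Readable representation listing every field
        return (f'FamilyManual(mother={self.mother!r}, '
                f'father={self.father!r}, daughter={self.daughter!r})')

    def __eq__(self, other):
        # Compare field by field instead of by object identity
        if not isinstance(other, FamilyManual):
            return NotImplemented
        return ((self.mother, self.father, self.daughter)
                == (other.mother, other.father, other.daughter))


a = FamilyManual('Morticia Addams', 'Gomez Addams', 'Wednesday Addams')
b = FamilyManual('Morticia Addams', 'Gomez Addams', 'Wednesday Addams')

print(a)
# FamilyManual(mother='Morticia Addams', father='Gomez Addams', daughter='Wednesday Addams')
print(a == b)
# True
```

Every new field has to be repeated in three places here, which is the repetition the builders are meant to remove.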

Implementing the same class with a Named Tuple saves us the boilerplate code and gives us a useful __repr__ method, which is the more elegant, Pythonic way.

from collections import namedtuple
Family = namedtuple('Family', 'mother father daughter')

adams_family = Family('Morticia Addams', 'Gomez Addams', 'Wednesday Addams')
print(adams_family)

# Family(mother='Morticia Addams', father='Gomez Addams', daughter='Wednesday Addams')

We can also implement it with a Typed Named Tuple:

from typing import NamedTuple

class Family(NamedTuple):
    mother: str
    father: str
    daughter: str

    def __str__(self):
        return f'The Father is {self.father}, the mother is called {self.mother} and the daughter is {self.daughter}'

adams_family = Family('Morticia Addams', 'Gomez Addams', 'Wednesday Addams')

print(adams_family)
# The Father is Gomez Addams, the mother is called Morticia Addams and the daughter is Wednesday Addams

Or we implement it using the @dataclass Decorator:

from dataclasses import dataclass

@dataclass(frozen=True)
class Family:
    mother: str
    father: str
    daughter: str

    def __str__(self):
        return f'The Father is {self.father}, the mother is called {self.mother} and the daughter is {self.daughter}'

adams_family = Family('Morticia Addams', 'Gomez Addams', 'Wednesday Addams')

print(adams_family)
# The Father is Gomez Addams, the mother is called Morticia Addams and the daughter is Wednesday Addams

Notice how the @dataclass decorator does not depend on inheritance or a metaclass, so it should not interfere with our own business logic.

We can compare the 3 Builders as such:

Selected features compared across the three data class builders. Image from the Fluent Python Book

The classes built by typing.NamedTuple and @dataclass have an __annotations__ attribute holding the type hints for the fields, accessible with inspect.get_annotations(MyClass) or typing.get_type_hints(MyClass).
Instances built with collections.namedtuple and typing.NamedTuple are immutable since the tuples are immutable, whereas @dataclass instances are mutable unless frozen is set to True.

Building a class with Named Tuples

The collections.namedtuple function is a factory that builds subclasses of tuple enhanced with field names, a class name, and an informative __repr__ method.

from collections import namedtuple

# Create the class
City = namedtuple('City', 'name country population coordinates')

# Create the instance
tokyo = City('Tokyo', 'JP', 36.933, (35.689722, 139.691667))

print(tokyo)
# City(name='Tokyo', country='JP', population=36.933, coordinates=(35.689722, 139.691667))

print(tokyo.coordinates)
# (35.689722, 139.691667)

As a tuple subclass, the City class inherits useful methods like __eq__ and the special methods used for comparison, such as __lt__.
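As a small illustration of that inherited behavior, the sketch below compares and sorts instances of a hypothetical Point named tuple. Comparison works field by field, exactly as it does for plain tuples.

```python
from collections import namedtuple

Point = namedtuple('Point', 'x y')

p1 = Point(1, 2)
p2 = Point(1, 3)

print(p1 == Point(1, 2))  # True, equality compares the field values
print(p1 == (1, 2))       # True, a named tuple is still a tuple
print(p1 < p2)            # True, ordering is inherited from tuple
print(sorted([p2, p1]))   # [Point(x=1, y=2), Point(x=1, y=3)]
```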

As a namedtuple, the class also gives us extra attributes and methods, such as the _fields class attribute, the _make(iterable) class method and the _asdict() instance method.

Class attributes and class methods are shared by all instances. Instance attributes and instance methods are not: they are specific to each instance.

from collections import namedtuple

# Create the class
City = namedtuple('City', 'name country population coordinates')

print(City._fields)
# ('name', 'country', 'population', 'coordinates')

# Create the Coordinate class
Coordinate = namedtuple('Coordinate', 'lat lon')

# Create Tuple
delhi_data = ('Delhi NCR', 'IN', 21.935, Coordinate(28.613889, 77.208889))

# Create Instance from Tuple
delhi = City._make(delhi_data)

# Print a dictionary that is ready to be serialized to JSON
print(delhi._asdict())
# {'name': 'Delhi NCR', 'country': 'IN', 'population': 21.935, 'coordinates': Coordinate(lat=28.613889, lon=77.208889)}
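To show what "ready to serialize" means in practice, here is a hedged sketch that passes the result of _asdict() to json.dumps from the standard library. Note that the nested Coordinate is written out as a JSON array, because the json module treats a named tuple like any other tuple.

```python
import json
from collections import namedtuple

City = namedtuple('City', 'name country population coordinates')
Coordinate = namedtuple('Coordinate', 'lat lon')

delhi = City('Delhi NCR', 'IN', 21.935, Coordinate(28.613889, 77.208889))

# _asdict() gives a plain dict keyed by field name
payload = json.dumps(delhi._asdict())
print(payload)
# {"name": "Delhi NCR", "country": "IN", "population": 21.935, "coordinates": [28.613889, 77.208889]}
```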

Building a class with Typed Named Tuples

A Typed Named Tuple is a Named Tuple with type hints for its fields, and it supports the regular class statement syntax.

from typing import NamedTuple

class Coordinate(NamedTuple):
    lat: float
    lon: float

Python by default does not enforce any type hints and there is no impact on the runtime behavior of our apps, so these are mostly for documentation purposes.

The type hints are intended primarily to support third-party type checkers, like Mypy or any IDE’s type checker.
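A quick way to see that the hints are not enforced is to build an instance with values of the wrong type. In this sketch the variable name trash is just an illustrative choice.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    lat: float
    lon: float


# Neither argument is a float, yet Python raises no error at runtime
trash = Coordinate('Ni!', None)
print(trash)
# Coordinate(lat='Ni!', lon=None)
```

A static type checker such as Mypy would be expected to flag this call, but the interpreter itself accepts it.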

Type Hints

Let’s talk about type hints or what Python calls Type Annotations. A Type Annotation is the explicit definition of a Type for a function’s argument, return value, variables and attributes.

Python itself does not enforce type annotations at runtime, so annotating our code or not has no effect on how our app runs.

The basic syntax of a variable annotation is:

var_name: some_type = a_value

Python stores these annotations in the class's __annotations__ dictionary, i.e. each annotated variable is saved there with its type, and it only becomes a class attribute if it is also given a value.

class DemoClass: 
    a: int
    b: float = 1.1
    c = 'spam'

print(DemoClass.__annotations__)
# {'a': <class 'int'>, 'b': <class 'float'>}
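The sketch below spells out what happens to each of the three names, first in a plain class and then under @dataclass. DemoDataClass is a hypothetical name used only to contrast the two.

```python
from dataclasses import dataclass, fields


class DemoClass:
    a: int            # annotation only: recorded, but no attribute is created
    b: float = 1.1    # annotation and value: recorded, and a class attribute
    c = 'spam'        # value only: a plain class attribute, not recorded


print(hasattr(DemoClass, 'a'))  # False
print(DemoClass.b)              # 1.1
print(DemoClass.c)              # spam


@dataclass
class DemoDataClass:
    a: int
    b: float = 1.1
    c = 'spam'


# Only the annotated names become fields of the dataclass
print([f.name for f in fields(DemoDataClass)])  # ['a', 'b']
print(DemoDataClass(9))                         # DemoDataClass(a=9, b=1.1)
```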

Building a Class with DataClass Decorator

The dataclasses module provides the @dataclass decorator and functions for automatically adding generated special methods such as __init__.

We can pass it multiple arguments and each argument generates its equivalent method. Here's a selection of these arguments:

The list of arguments passed to @dataclass, provided in the book Fluent Python
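As a small illustration of how these arguments switch generated methods on or off, the sketch below uses two of them, order and eq. The classes Version and Token are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Version:
    major: int
    minor: int


# order=True generates __lt__, __le__, __gt__ and __ge__,
# comparing the fields in declaration order like a tuple
print(Version(1, 2) < Version(1, 10))   # True
print(Version(1, 2) == Version(1, 2))   # True, eq=True is the default


@dataclass(eq=False)
class Token:
    value: int


# With eq=False no __eq__ is generated, so equality falls back to identity
print(Token(1) == Token(1))             # False
```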

There are a few points to keep in mind in our day to day code:

Class attributes are similar to static attributes in other languages: they belong to the class itself.
Instance attributes are specific to each instance.
If we provide a default value for one field, every field declared after it must also have a default, as Python does not allow parameters without defaults after parameters with defaults.
@dataclass rejects fields with a mutable default value such as a list, dict or set, since a default shared between instances is easily mutated by accident and is a common source of bugs. This class will therefore be rejected:

from dataclasses import dataclass, field

@dataclass
class Team:
    team_name: str
    members: list = []

# ValueError: mutable default <class 'list'> for field members is not allowed: use default_factory

# The solution is to use default_factory
@dataclass
class Team:
    team_name: str
    members: list = field(default_factory=list)
# each instance gets its own members list instead of all instances
# sharing the same list from the class,
# which is rarely what we want and is often a bug
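The following sketch checks both halves of that claim: the class with the list default is rejected when it is defined, and the default_factory version gives every instance its own list. BrokenTeam is a hypothetical name used so both classes can live in one script.

```python
from dataclasses import dataclass, field

# The rejection happens when the class is created, not when it is instantiated
try:
    @dataclass
    class BrokenTeam:
        team_name: str
        members: list = []
except ValueError as exc:
    print('rejected:', exc)


@dataclass
class Team:
    team_name: str
    members: list = field(default_factory=list)


red = Team('red')
blue = Team('blue')
red.members.append('Ada')

print(red.members)   # ['Ada']
print(blue.members)  # []
```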

@dataclass does not generate a __post_init__ method, but the generated __init__ calls it if we define one, so any validation or computation that must run after __init__ goes there.
A plain annotated attribute cannot serve as a typed class attribute, because @dataclass turns it into an instance field (for example one annotated with set[…]). To declare a typed class attribute we import a pseudotype named typing.ClassVar, which uses the generics [] notation to set the type of the variable and also mark it as a class attribute, like this:

from dataclasses import dataclass
from typing import ClassVar

# Minimal stand-in for the base class, assumed here so the example runs
@dataclass
class ClubMember:
    name: str

@dataclass
class HackerClubMember(ClubMember):
    all_handles: ClassVar[set[str]] = set()
    handle: str = ''

    def __post_init__(self):
        cls = self.__class__
        if self.handle == '':
            self.handle = self.name.split()[0]
        if self.handle in cls.all_handles:
            msg = f'handle {self.handle!r} already exists.'
            raise ValueError(msg)
        cls.all_handles.add(self.handle)

# all_handles is a class attribute of type set-of-str, with an empty set as its default value
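A smaller, self-contained sketch of the same two ideas may help: __post_init__ for validation and derived values, and ClassVar for a typed class attribute. The Rectangle class and its created counter are hypothetical and not taken from the book.

```python
from dataclasses import dataclass, field, fields
from typing import ClassVar


@dataclass
class Rectangle:
    created: ClassVar[int] = 0        # class attribute, not a field
    width: float
    height: float
    area: float = field(init=False)   # computed, so not an __init__ parameter

    def __post_init__(self):
        # Runs right after the generated __init__
        if self.width < 0 or self.height < 0:
            raise ValueError('sides must be non-negative')
        self.area = self.width * self.height
        type(self).created += 1


r = Rectangle(2.0, 3.0)
print(r)
# Rectangle(width=2.0, height=3.0, area=6.0)
print([f.name for f in fields(Rectangle)])
# ['width', 'height', 'area']

try:
    Rectangle(-1.0, 3.0)
except ValueError as exc:
    print('rejected:', exc)

print(Rectangle.created)
# 1
```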

If we want to pass an argument to __init__ that is not a field and only have it forwarded to __post_init__, we import another pseudotype named InitVar from the dataclasses module, which uses the same syntax as typing.ClassVar, like this:

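from dataclasses import dataclass, InitVar

# DatabaseType and my_database are assumed to be defined elsewhere
@dataclass
class DbHandler:
    i: int
    j: int = None
    database: InitVar[DatabaseType] = None

    def __post_init__(self, database):
        if self.j is None and database is not None:
            self.j = database.lookup('j')

handler = DbHandler(10, database=my_database)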

Finally, the @dataclass decorator doesn’t care about the types in the annotations, except in the two cases just discussed: typing.ClassVar and InitVar.

Refactoring Data classes

The main idea of OOP is to encapsulate behavior and data together in the same code unit, which is what we call a class. This is not always the case, especially in large projects, where we might end up with the code that deals with the instances scattered across the codebase.

The solution is to bring back the responsibility to the class itself, unless we’re in one of these situations:

We have just started the project or are developing a new feature and only need a simple implementation or scaffolding of the data. With time, our class should evolve to include the methods that define the instances' behavior, i.e. the class becomes independent.
We are using a data class as an intermediate representation for JSON or some other interchange format. Each of the data class builders offers a way to convert an instance to a plain dict, which is close to JSON format: the _asdict() method for named tuples and the dataclasses.asdict() function for @dataclass classes.
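As a sketch of the interchange case, the example below gives a dataclass its own to_json method instead of leaving the conversion to outside code. The to_json name is an assumption made for this illustration; asdict and json.dumps come from the standard library.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Coordinate:
    lat: float
    lon: float


@dataclass
class City:
    name: str
    country: str
    coordinates: Coordinate

    def to_json(self):
        # asdict() recurses into nested dataclasses, producing plain dicts
        return json.dumps(asdict(self))


tokyo = City('Tokyo', 'JP', Coordinate(35.689722, 139.691667))
print(tokyo.to_json())
# {"name": "Tokyo", "country": "JP", "coordinates": {"lat": 35.689722, "lon": 139.691667}}
```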

Pattern Matching Class Instances

Class patterns are designed to match class instances by type and by attributes and there are 3 variations of class patterns:

Simple class patterns, which match instances against a class or a type.

class Example:
    ex_attr: str

var = Example()

match var:
    case Example():
        print('var is an instance of Example')

Keyword class patterns, which match instances based on attribute values.

import typing

class City(typing.NamedTuple):
    continent: str
    name: str
    country: str

cities = [
    City('Asia', 'Tokyo', 'JP'),
    City('Asia', 'Delhi', 'IN'),
    City('North America', 'Mexico City', 'MX'),
    City('North America', 'New York', 'US'),
    City('South America', 'São Paulo', 'BR'),
]

def match_asian_cities():
    results = []
    for city in cities:
        match city:
            case City(continent='Asia'):
                results.append(city)
    return results

# matches cities in the continent of Asia

Positional class patterns, which match instances based on the attribute value in a specific position.

import typing

class City(typing.NamedTuple):
    continent: str
    name: str
    country: str

cities = [
    City('Asia', 'Tokyo', 'JP'),
    City('Asia', 'Delhi', 'IN'),
    City('North America', 'Mexico City', 'MX'),
    City('North America', 'New York', 'US'),
    City('South America', 'São Paulo', 'BR'),
]

def match_asian_cities():
    results = []
    for city in cities:
        match city:
            case City('Asia', _, country):
                results.append(country)
    return results

# returns the list of countries in the continent of Asia

Positional patterns work because the classes created by the class builders automatically get a __match_args__ class attribute, which lists the attribute names in the order they will be used in positional patterns.

print(City.__match_args__) 
# ('continent', 'name', 'country')

Conclusion

Data classes have always been a cornerstone of OOP, and it is vital that we choose the right implementation to avoid subtle bugs or performance issues down the line. I hope this has been informative, but just in case I am attaching different resources that could help your Python Journey.