Python Pickle: Serialize Your Objects [With Examples]

If you want to serialize and deserialize Python objects you might have considered using the Python Pickle module.

The Python Pickle module allows you to serialize and deserialize a Python object structure. It provides two functions to write to and read from file objects (dump() and load()), and two functions to write to and read from bytes objects (dumps() and loads()).

We will go through a few examples to show how pickle works both with file objects and bytes objects. We will also test it with multiple data types.

It’s time to pickle!

Python Pickle Example

The Python Pickle module is used to perform serialization and deserialization of Python objects.

Serializing a Python object means converting it into a byte stream that can be stored in a file or in a bytes object. Pickled data can then be read back using the process called deserialization.

To store a pickled object in a bytes object use the dumps() function. To read an object from a bytes object that contains its pickled representation use the loads() function.

Let’s see an example of how you can use the pickle module to serialize a Python list.

>>> import pickle
>>> animals = ['tiger', 'lion', 'giraffe']
>>> pickle.dumps(animals)
b'\x80\x04\x95\x1e\x00\x00\x00\x00\x00\x00\x00]\x94(\x8c\x05tiger\x94\x8c\x04lion\x94\x8c\x07giraffe\x94e.'

After importing the pickle module we define a list and then use the pickle dumps() function to generate a bytes representation of our list.

Now, we will store the pickled bytes in a variable and use the loads() function to convert them back to our original list.

>>> pickled_animals = pickle.dumps(animals)
>>> unpickled_animals = pickle.loads(pickled_animals)
>>> print(unpickled_animals)
['tiger', 'lion', 'giraffe']         

The letter s at the end of the dumps() and loads() pickle functions stands for string. The pickle module also provides two functions that use files to store and read pickled data: dump() and load().
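As a quick check of the round trip described above, here is a minimal sketch (the variable names simply follow the earlier example). It confirms that in Python 3 dumps() returns a bytes object, and that loads() builds a new list that is equal to the original without being the same object.

```python
import pickle

animals = ['tiger', 'lion', 'giraffe']

# In Python 3 dumps() returns a bytes object, not a text string
pickled_animals = pickle.dumps(animals)
print(type(pickled_animals))

# loads() rebuilds a new list with the same content
unpickled_animals = pickle.loads(pickled_animals)
print(unpickled_animals == animals)   # same content
print(unpickled_animals is animals)   # but not the same object
```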

Save a Python Dictionary Using Pickle

With the pickle module you can save different types of Python objects.

Let’s use the dumps() function to pickle a Python dictionary.

>>> animals = {'tiger': 23, 'lion': 45, 'giraffe': 67}
>>> pickled_animals = pickle.dumps(animals)
>>> print(pickled_animals)
b'\x80\x04\x95$\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x05tiger\x94K\x17\x8c\x04lion\x94K-\x8c\x07giraffe\x94KCu.'         

And then the loads() function to get the dictionary back from its pickled representation.

>>> new_animals = pickle.loads(pickled_animals)
>>> print(new_animals)
{'tiger': 23, 'lion': 45, 'giraffe': 67}

So, this confirms that we can also save dictionary objects in a string of bytes using Pickle.

Write Pickled Python Dictionary to a File

The pickle module also allows you to store the pickled representation of a Python object in a file.

To store a pickled object to a file use the dump() function. To read an object from its pickled representation stored in a file use the load() function.

Firstly, we will open a file in binary mode using the Python open function, store the pickled dictionary in the file and close the file.

>>> import pickle
>>> animals = {'tiger': 23, 'lion': 45, 'giraffe': 67}
>>> f = open('data.pickle', 'wb')
>>> pickle.dump(animals, f)
>>> f.close()

The data.pickle file will be created in the current working directory, which is typically the directory from which you started the Python shell or program.

Note: remember to close the file when you are done with it.

If you look at the content of the data.pickle file with a text editor you will see data in binary format.

€•$       }”(Œtiger”KŒlion”K-Œgiraffe”KCu.

Now, read the bytes from the file and get back the original dictionary object using the load() function.

>>> f = open('data.pickle', 'rb')
>>> unpickled_animals = pickle.load(f)
>>> f.close()
>>> print(unpickled_animals)
{'tiger': 23, 'lion': 45, 'giraffe': 67}         

This time we have opened the file in read binary mode considering that we only want to read its content.

In the next section we will see if the pickle module can also serialize nested objects.

Pickle a Nested Dictionary Object

Let’s find out if a Python nested dictionary can be serialized and deserialized using the Pickle module.

Update the dictionary used in the previous section to include dictionaries as values mapped to each key.

>>> animals = {'tiger': {'count': 23}, 'lion': {'count': 45}, 'giraffe': {'count': 67}}         

Write the pickled nested dictionary to a file. The code is identical to the one we have seen before to pickle a basic dictionary.

>>> f = open('data.pickle', 'wb')
>>> pickle.dump(animals, f)
>>> f.close()

No errors so far…

Now, convert the pickled data back to the nested dictionary:

>>> f = open('data.pickle', 'rb')
>>> unpickled_animals = pickle.load(f)
>>> f.close()
>>> print(unpickled_animals)
{'tiger': {'count': 23}, 'lion': {'count': 45}, 'giraffe': {'count': 67}}         

The nested dictionary looks good.

Using Pickle With a Custom Class

I want to find out if I can pickle a Python custom class…

Let’s create a class called Animal that contains two attributes.

class Animal:
    def __init__(self, name, group):
        self.name = name
        self.group = group

Then create one object and pickle it into a file.

tiger = Animal('tiger', 'mammals')
f = open('data.pickle', 'wb')
pickle.dump(tiger, f)
f.close()

And finally, read the data using the pickle load() function.

f = open('data.pickle', 'rb')
data = pickle.load(f)
print(data)
f.close()

This is the content of the data object:

<__main__.Animal object at 0x0353BF58>

And here are the attributes of our object…as you can see they are correct.

>>> print(data.__dict__)
{'name': 'tiger', 'group': 'mammals'} 

You can customise this output by adding the __str__ method to the class.
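Here is a minimal sketch of what that could look like. The format of the string returned by __str__ is an arbitrary choice made for this example, not something pickle requires.

```python
import pickle

class Animal:
    def __init__(self, name, group):
        self.name = name
        self.group = group

    def __str__(self):
        # Human-readable representation used by print() and str()
        return "Animal(name={}, group={})".format(self.name, self.group)

tiger = Animal('tiger', 'mammals')

# Round trip through pickle, here using bytes instead of a file
data = pickle.loads(pickle.dumps(tiger))
print(data)
```

With this method in place, print(data) shows the attribute values instead of the memory address of the object.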

Save Multiple Objects with Pickle

Using the same class defined in the previous section we will save two objects in a file using the pickle module.

Create two objects of type Animal and pickle them into a file as a list of objects:

tiger = Animal('tiger', 'mammals')
crocodile = Animal('crocodile', 'reptiles')
f = open('data.pickle', 'wb')
pickle.dump([tiger, crocodile], f)
f.close()

You can access each object using a for loop.

f = open('data.pickle', 'rb')
data = pickle.load(f)
f.close()

for animal in data:
    print(animal.__dict__)

[output]
{'name': 'tiger', 'group': 'mammals'}
{'name': 'crocodile', 'group': 'reptiles'}
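Wrapping the objects in a list is not the only option. As an alternative sketch, you can call dump() once per object on the same file and then call load() repeatedly until it raises EOFError. To keep the example self-contained it uses plain tuples instead of the Animal class and writes to a temporary directory, both of which are assumptions made for illustration.

```python
import os
import pickle
import tempfile

animals = [('tiger', 'mammals'), ('crocodile', 'reptiles')]

# Temporary location chosen only for this example
path = os.path.join(tempfile.mkdtemp(), 'data.pickle')

# Write each object with its own dump() call
f = open(path, 'wb')
for animal in animals:
    pickle.dump(animal, f)
f.close()

# Read objects back one at a time until load() reports the end of the file
loaded = []
f = open(path, 'rb')
while True:
    try:
        loaded.append(pickle.load(f))
    except EOFError:
        break
f.close()

print(loaded)
```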

Pickle and Python With Statement

So far we have had to remember to close the file object every time we finished working with it.

Instead of doing that we can use the with open statement that takes care of closing the file automatically.

Here is how our code to write multiple objects becomes:

tiger = Animal('tiger', 'mammals')
crocodile = Animal('crocodile', 'reptiles')

with open('data.pickle', 'wb') as f:
    pickle.dump([tiger, crocodile], f) 

And now use the with open statement also to read the pickled data…

with open('data.pickle', 'rb') as f:
    data = pickle.load(f)

print(data)

[output]
[<__main__.Animal object at 0x7f98a015d2b0>, <__main__.Animal object at 0x7f98a01a4fd0>] 

Nice, it’s a lot more concise.

No more f.close() every time we read or write a file.

Using Python Pickle with Lambdas

So far we have used the pickle module with variables, but what happens if we use it with a function?

Define a simple lambda function that returns the sum of two numbers:

>>> import pickle
>>> pickle.dumps(lambda x,y : x+y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <function <lambda> at 0x7fbc60296c10>: attribute lookup <lambda> on __main__ failed 

The pickle module cannot serialize a lambda function.
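The reason is that pickle stores functions by reference (module and name) rather than by value, and a lambda has no name that can be looked up again. The sketch below illustrates the difference using math.sqrt as an example of an importable function. It catches both PicklingError and AttributeError because, as the tracebacks in this tutorial show, the exception type can vary depending on where the lambda is defined.

```python
import math
import pickle

# A named, importable function is pickled by reference:
# only its module and name are stored, not its code
pickled_sqrt = pickle.dumps(math.sqrt)
restored_sqrt = pickle.loads(pickled_sqrt)
print(restored_sqrt is math.sqrt)

# A lambda has no importable name, so pickling fails
try:
    pickle.dumps(lambda x, y: x + y)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as error:
    lambda_pickled = False
    print("Cannot pickle lambda:", type(error).__name__)
```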

As an alternative we can use the dill module that extends the functionality of the pickle module.

You might get the following error when you try to import the dill module…

>>> import dill
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'dill' 

In that case you have to install the dill module using pip:

$ pip install dill
Collecting dill
  Downloading dill-0.3.3-py2.py3-none-any.whl (81 kB)
     |████████████████████████████████| 81 kB 4.4 MB/s 
Installing collected packages: dill
Successfully installed dill-0.3.3 

The dill module provides the dumps and loads functions in the same way the pickle module does.

Let’s first create a bytes object from the lambda using the dumps function:

>>> import dill
>>> pickled_lambda = dill.dumps(lambda x,y : x+y)
>>> print(pickled_lambda)
b'\x80\x04\x95\x9e\x00\x00\x00\x00\x00\x00\x00\x8c\ndill._dill\x94\x8c\x10_create_function\x94\x93\x94(h\x00\x8c\x0c_create_code\x94\x93\x94(K\x02K\x00K\x00K\x02K\x02KCC\x08|\x00|\x01\x17\x00S\x00\x94N\x85\x94)\x8c\x01x\x94\x8c\x01y\x94\x86\x94\x8c\x07<stdin>\x94\x8c\x08<lambda>\x94K\x01C\x00\x94))t\x94R\x94c__builtin__\n__main__\nh\x0bNN}\x94Nt\x94R\x94.'

Then unpickle the data using the loads function:

>>> print(dill.loads(pickled_lambda))
<function <lambda> at 0x7f9558408280>
>>> unpickled_lambda = dill.loads(pickled_lambda)
>>> unpickled_lambda(1,3)
4 

It works!

The lambda function returns the result we expect.

Error When Pickling a Class with a Lambda Attribute

Let’s go back to the custom class we have defined before…

We have already seen how to serialize and deserialize it. Now let’s add a new attribute and set its value to a lambda function.

class Animal:
    def __init__(self, name, group):
        self.name = name
        self.group = group
        self.description = lambda: print("The {} belongs to {}".format(self.name, self.group)) 

Note: this lambda attribute doesn’t take any input arguments. It just prints a string based on the values of the other two class instance attributes.

Firstly, confirm that the class works fine:

tiger = Animal('tiger', 'mammals')
tiger.description()
crocodile = Animal('crocodile', 'reptiles')
crocodile.description() 

And here you can see the output of the lambda function:

$ python3 exclude_class_attribute.py
The tiger belongs to mammals 
The crocodile belongs to reptiles

You know that the pickle module cannot serialize a lambda function. And here is what happens when we serialize our two objects created from the custom class.

Traceback (most recent call last):
  File "multiple_objects.py", line 16, in <module>
    pickle.dump([tiger, crocodile], f)
AttributeError: Can't pickle local object 'Animal.__init__.<locals>.<lambda>' 

This is caused by the lambda attribute inside our two objects.

Exclude Python Class Attribute from Pickling

Is there a way to exclude the lambda attribute from the serialization process of our custom object?

Yes, to do that we can use the class __getstate__() method.

Python Pickle __getstate__

To understand what the __getstate__ method does let’s start by looking at the content of __dict__ for one of our class instances.

tiger = Animal('tiger', 'mammals')
print(tiger.__dict__)

[output]
{'name': 'tiger', 'group': 'mammals', 'description': <function Animal.__init__.<locals>.<lambda> at 0x7fbc9028ca60>} 

To be able to serialize this object using pickle we want to exclude the lambda attribute from the serialization process.

In order to avoid serializing the lambda attribute using __getstate__() we will first copy the state of our object from self.__dict__ and then remove the attribute that cannot be pickled.

class Animal:
    def __init__(self, name, group):
        self.name = name
        self.group = group
        self.description = lambda: print("The {} is a {}".format(self.name, self.group))

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['description']
        return state 

Note: we are using the dict.copy() method to make sure we don’t modify the original state of the object.

Let’s see if we can pickle this object now…

tiger = Animal('tiger', 'mammals')
pickled_tiger = pickle.dumps(tiger)

Before continuing confirm that no exception is raised by the Python interpreter when pickling the object.

Now, unpickle the data and verify the value of __dict__.

unpickled_tiger = pickle.loads(pickled_tiger)
print(unpickled_tiger.__dict__)

[output]
{'name': 'tiger', 'group': 'mammals'} 

It worked! And the unpickled object doesn’t contain the lambda attribute anymore.

Restore the Original Structure of a Python Object Using Pickle

We have seen how to exclude an attribute that pickle cannot serialize from the serialization process of a Python object.

But, what if we want to preserve the original structure of an object as part of pickling / unpickling?

How can we get our lambda attribute back after unpickling the bytes representation of our object?

We can use the __setstate__ method which, as explained in the official documentation, is called with the unpickled state as part of the unpickling process.

Python Pickle __setstate__

Update our class to implement the __setstate__() method. This method will restore the instance attributes and then add the lambda attribute that wasn’t part of the pickled object.

class Animal:
    def __init__(self, name, group):
        self.name = name
        self.group = group
        self.description = lambda: print("The {} is a {}".format(self.name, self.group))

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['description']
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.description = lambda: print("The {} is a {}".format(self.name, self.group)) 

Let’s pickle and unpickle an object to confirm that we get back the lambda attribute.

tiger = Animal('tiger', 'mammals')
pickled_tiger = pickle.dumps(tiger) 

unpickled_tiger = pickle.loads(pickled_tiger)
print(unpickled_tiger.__dict__)

[output]
{'name': 'tiger', 'group': 'mammals', 'description': <function Animal.__setstate__.<locals>.<lambda> at 0x7f9380253e50>} 

All good, the unpickled object also contains the lambda attribute.

Pickling and Unpickling Between Python 2 and Python 3

I want to find out if there are any limitations when it comes to pickling data with one version of Python and unpickling it with a different one.

Is there backward compatibility with the pickle module between Python 2 and 3?

In this test I will use Python 3.8.5 to serialize a list of tuples and Python 2.7.16 to deserialize it.

Python 3.8.5 (default, Sep  4 2020, 02:22:02) 
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> animals = [('tiger', 'mammals'), ('crocodile', 'reptiles')]
>>> with open('data.pickle', 'wb') as f:
...     pickle.dump(animals, f)
...
>>> exit()  

Exit from the Python shell to confirm that the file data.pickle has been created.

$ ls -al data.pickle 
-rw-r--r--  1 myuser  mygroup  61  3 May 12:01 data.pickle 

Now use Python 2 to unpickle the data:

Python 2.7.16 (default, Dec 21 2020, 23:00:36) 
[GCC Apple LLVM 12.0.0 (clang-1200.0.30.4) [+internal-os, ptrauth-isa=sign+stri on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('data.pickle', 'rb') as f:
...     data = pickle.load(f)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 864, in load
     dispatch[key](self)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 892, in load_proto
    raise ValueError, "unsupported pickle protocol: %d" % proto
ValueError: unsupported pickle protocol: 4 

It didn’t work. The Python interpreter raises a ValueError exception complaining that the pickle protocol is unsupported.

Let’s find out why and which protocol the interpreter is referring to…

Default Protocol for Python Pickle

According to the documentation of the Pickle module a default protocol version is used for pickling by your Python interpreter.

The DEFAULT_PROTOCOL value depends on the version of Python you use…

…ok, we are getting somewhere…

[Figure: Default protocol for Python Pickle module]

It looks like the default protocol for Python 3.8 is 4. This matches the error we have seen, considering that the Python 2 interpreter is complaining with the error “unsupported pickle protocol: 4”.

Using the Python shell we can confirm the value of the pickle DEFAULT_PROTOCOL for our Python 3 interpreter.

Python 3.8.5 (default, Sep  4 2020, 02:22:02) 
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> print(pickle.DEFAULT_PROTOCOL)
4 

I wonder if I can use the Python 3.8.5 interpreter to generate pickled data and specify a protocol version supported by Python 2.7.16.

Protocol version 3 was added in Python 3.0 and protocol version 2 was implemented in Python 2.3.

So we should be able to use version 2 when pickling our list of tuples…

We can pass the protocol as the third argument of the pickle dump() function:

[Figure: Dump protocol for Python Pickle]

Let’s try it…

>>> import pickle
>>> animals = [('tiger', 'mammals'), ('crocodile', 'reptiles')]
>>> with open('data.pickle', 'wb') as f:
...     pickle.dump(animals, f, 2)
... 
>>>  

And now let’s unpickle it with Python 2:

Python 2.7.16 (default, Dec 21 2020, 23:00:36) 
[GCC Apple LLVM 12.0.0 (clang-1200.0.30.4) [+internal-os, ptrauth-isa=sign+stri on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> with open('data.pickle', 'rb') as f:
...     data = pickle.load(f)
... 
>>> print(data)
[(u'tiger', u'mammals'), (u'crocodile', u'reptiles')] 

It worked!

So, now you know how to save data with pickle if you need it to be exchanged between applications that use different versions of Python.

You can get the highest protocol available for the pickle module used by your Python interpreter by looking at the value of pickle.HIGHEST_PROTOCOL. You can pass this value to the functions dump() and dumps().
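The following sketch shows one way to pass the protocol to dumps() and to check which protocol a pickle was written with. It relies on the fact that, for protocol 2 and above, the stream starts with the byte 0x80 followed by one byte holding the protocol number (you can see this in the b'\x80\x04...' outputs earlier in this tutorial).

```python
import pickle

animals = [('tiger', 'mammals'), ('crocodile', 'reptiles')]

# The protocol can also be passed as a keyword argument
pickled_v2 = pickle.dumps(animals, protocol=2)
pickled_highest = pickle.dumps(animals, protocol=pickle.HIGHEST_PROTOCOL)

# Second byte of the stream holds the protocol number (protocol >= 2)
print(pickled_v2[1])
print(pickled_highest[1] == pickle.HIGHEST_PROTOCOL)

# Data written with an older protocol can still be read by this interpreter
print(pickle.loads(pickled_v2) == animals)
```

Keep in mind that data pickled with HIGHEST_PROTOCOL may not be readable by older Python versions, so the right value depends on who needs to read the data.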

Compression for Data Generated with Python Pickle

If you have a huge amount of data to save using pickle, you can reduce the size of your data by applying bzip2 compression to it. To do that you can use the Python bz2 module.

The bz2 module provides the class bz2.BZ2File that allows you to open a file compressed with bzip2 in binary mode.

Here is how we can use it with a list of tuples and together with pickle:

>>> import pickle
>>> import bz2
>>> animals = [('tiger', 'mammals'), ('crocodile', 'reptiles')]
>>> with bz2.BZ2File('data.pickle.compressed', 'w') as f:
...     pickle.dump(animals, f)
... 
>>>

We can use the built-in Python type() function to confirm the type of our file object.

>>> type(f)
<class 'bz2.BZ2File'> 

And now let’s unpickle the compressed data…

>>> with bz2.BZ2File('data.pickle.compressed', 'r') as f:
...     print(pickle.load(f))
... 
[('tiger', 'mammals'), ('crocodile', 'reptiles')] 

Nice one 🙂
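To get a feel for how much space compression can save, here is a small sketch that compresses pickled bytes in memory with bz2.compress() and compares the sizes. The sample data is made up for this example and is deliberately repetitive; the actual saving depends on your data, and very small objects can even grow when compressed.

```python
import bz2
import pickle

# Made-up sample data with a lot of repetition
animals = [('animal{}'.format(i), 'mammals') for i in range(5000)]

pickled = pickle.dumps(animals)
compressed = bz2.compress(pickled)

# Size in bytes before and after compression
print(len(pickled), len(compressed))

# Decompress first, then unpickle
restored = pickle.loads(bz2.decompress(compressed))
print(restored == animals)
```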

Python Pickle and Pandas DataFrames

Let’s find out if we can use the pickle module to serialize and deserialize a Pandas dataframe.

First of all create a new dataframe:

>>> import pandas as pd
>>> df = pd.DataFrame({"Animals": ["Tiger", "Crocodile"], "Group": ["Mammals", "Reptiles"]})
>>> print(df)
     Animals     Group
0      Tiger   Mammals
1  Crocodile  Reptiles 

Can we serialize this object?

>>> import pickle
>>> pickled_dataframe = pickle.dumps(df) 

Yes, we can!

Let’s see if we get back the original dataframe using the pickle loads() function.

>>> unpickled_dataframe = pickle.loads(pickled_dataframe)
>>> print(unpickled_dataframe)
     Animals     Group
0      Tiger   Mammals
1  Crocodile  Reptiles 

Yes, we do!

The Pandas library also provides its own functions to pickle and unpickle a dataframe.

You can use the function to_pickle() to serialize the dataframe to a file:

>>> df.to_pickle('./dataframe.pickle') 

This is the file that contains the pickled dataframe:

$ ls -al dataframe.pickle
-rw-r--r--  1 myuser  mygroup  706  3 May 14:42 dataframe.pickle 

To get the dataframe back you can use the read_pickle() function.

>>> import pandas as pd
>>> unpickled_dataframe = pd.read_pickle('./dataframe.pickle')
>>> print(unpickled_dataframe)
     Animals     Group
0      Tiger   Mammals
1  Crocodile  Reptiles 

Exactly what we were expecting.

Python Pickle Security

Everything we have seen so far about the pickle module is great but at the same time the Pickle module is not secure.

It's important to only unpickle data that you trust, that is, data whose source you definitely know.

Why?

The Pickle deserialization process is insecure.

Pickled data can be constructed in such a way to execute arbitrary code when it gets unpickled.

For example, a class can run arbitrary code during unpickling through the __setstate__() method we have seen in one of the previous sections.

Here is a basic class that explains how this would work:

import pickle, os 

class InsecurePickle:
    def __init__(self, name):
        self.name = name

    def __getstate__(self):
        return self.__dict__

    def __setstate__(self, state):
        os.system('echo Executing malicious command')

As you can see in the implementation of the __setstate__ method we can call any arbitrary command that can harm the system that unpickles the data.

Let’s see what happens when we pickle and unpickle this object…

insecure1 = InsecurePickle('insecure1')
pickled_insecure1 = pickle.dumps(insecure1)
unpickled_insecure1 = pickle.loads(pickled_insecure1)

Here is the output of this code:

$ python3 pickle_security.py
Executing malicious command

For example, you could use the os.system call to create a reverse shell and gain access to the target system.
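One mitigation described in the official pickle documentation is to restrict which globals can be loaded by overriding Unpickler.find_class(). The sketch below is adapted from that idea. The names SAFE_BUILTINS, RestrictedUnpickler and restricted_loads, as well as the set of allowed names, are choices made for this example. This reduces the attack surface, but it is not a guarantee of safety, so untrusted data should still be avoided.

```python
import builtins
import io
import pickle

# Builtins we choose to allow (an assumption for this example)
SAFE_BUILTINS = {'range', 'complex', 'set', 'frozenset', 'slice'}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow a small set of harmless builtins
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        # Refuse every other global, including os.system
        raise pickle.UnpicklingError(
            "global '{}.{}' is forbidden".format(module, name))

def restricted_loads(data):
    """Like pickle.loads() but with restricted globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain containers and allowed builtins load normally
print(restricted_loads(pickle.dumps([1, 2, range(15)])))

# Hand-written pickle that would call os.system('echo hello world')
malicious = b"cos\nsystem\n(S'echo hello world'\ntR."
try:
    restricted_loads(malicious)
    blocked = False
except pickle.UnpicklingError as error:
    blocked = True
    print("Blocked:", error)
```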

Protecting Pickled Data with HMAC

One of the ways to protect pickled data from tampering is to have a secure connection between the two parties exchanging pickled data.

It’s also possible to increase security of data shared between multiple systems by using a cryptographic signature.

The idea behind it is that:

  1. Pickled data is signed before being stored on the filesystem or before being transmitted to another party.
  2. Its signature can then be verified before the data is unpickled.

This process can help you detect whether pickled data has been tampered with and might therefore be unsafe to read.

We will apply a cryptographic signature to the Pandas dataframe defined before using the Python hmac module:

>>> import pandas as pd
>>> import pickle
>>> df = pd.DataFrame({"Animals": ["Tiger", "Crocodile"], "Group": ["Mammals", "Reptiles"]})
>>> pickled_dataframe = pickle.dumps(df) 

Assume that sender and receiver share the following secret key:

secret_key = '25345-abc456'

The sender generates a digest for the data using the hmac.new() function.

>>> import hmac, hashlib
>>> digest =  hmac.new(secret_key.encode(), pickled_dataframe, hashlib.sha256).hexdigest()
>>> print(digest)
022396764cea8a60a492b391798e4155daedd99d794d15a4d574caa182bab6ba  

The receiver knows the secret key and can calculate the digest to confirm that its value is the same as the value received with the pickled data.

If the two digest values are the same, the receiver knows that the pickled data has not been tampered with since it was signed by a party that holds the secret key.
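The receiver side of this process could look like the sketch below. The helper names sign() and verified_loads() are hypothetical, and a list of tuples is used instead of the dataframe to keep the example free of external dependencies. The comparison uses hmac.compare_digest(), which is designed to reduce the risk of timing attacks compared to the == operator.

```python
import hashlib
import hmac
import pickle

secret_key = '25345-abc456'

def sign(data, key):
    """Return the hex HMAC-SHA256 digest of data for the given key."""
    return hmac.new(key.encode(), data, hashlib.sha256).hexdigest()

def verified_loads(data, received_digest, key):
    """Unpickle data only if its digest matches the received one."""
    expected_digest = sign(data, key)
    if not hmac.compare_digest(expected_digest, received_digest):
        raise ValueError("Digest mismatch: data may have been tampered with")
    return pickle.loads(data)

# Sender side
animals = [('tiger', 'mammals'), ('crocodile', 'reptiles')]
pickled_animals = pickle.dumps(animals)
digest = sign(pickled_animals, secret_key)

# Receiver side: untouched data is accepted
print(verified_loads(pickled_animals, digest, secret_key))

# Receiver side: modified data is rejected before it is unpickled
tampered = pickled_animals + b'extra'
try:
    verified_loads(tampered, digest, secret_key)
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```

Note that the digest is checked before pickle.loads() is called, so tampered data never reaches the unpickler.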

Conclusion

If you didn’t get the chance to use the pickle module before going through this tutorial, now you should have a pretty good idea of how pickle works.

We have seen how to use pickle to serialize lists, dictionaries, nested dictionaries, lists of tuples, custom classes and Pandas dataframes.

You have also learned how to exclude certain attributes that are not supported by pickle from the serialization process.

Finally we have covered security issues that can occur when exchanging data serialized with pickle.

Now it’s your turn…

…how are you planning to use the pickle module in your application?
