How To Check For Duplicates in a Python List

Are you writing a Python application and need to check for duplicates in a list? You are in the right place. Let’s find out how to work with duplicates.

There are several ways to check for duplicates in a Python list. Converting the list to a set tells you whether it contains duplicates: if the set is smaller than the list, at least one element is repeated. To find out which items are duplicated, you can use collections.Counter.

There are two aspects of duplicates you might want to know more about:

  • How to know if there are any duplicates in a list.
  • If duplicates are present in the list, how to identify which elements are duplicates.

Let’s get started!

Check If a Python List Has Duplicates

I have the following list and first I want to know if this list contains any duplicates:

>>> planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars']

We can see if this list has any duplicates by using the properties of a Python set.

Here is what happens when I convert this list to a set:

>>> set(planets)
{'earth', 'mars', 'jupiter', 'mercury'} 

Ignore the fact that the order of the elements has changed (considering that a set is unordered).

The important thing to notice is that the duplicate string “mars” has disappeared because a set only contains unique values.

So, to check if a list contains any duplicates we can simply compare the size of the list with the size of the set. If they are different the list contains duplicates.

The size of the list and the set are:

>>> len(planets)
5
>>> len(set(planets))
4 

We can write a function that uses a conditional statement to verify if a list contains any duplicates and that returns True if it does.

>>> def has_duplicates(values):
...     if len(values) != len(set(values)):
...         return True
...     else:
...         return False
...
>>> 
>>> has_duplicates(planets)
True 

Let’s redefine the list, remove the duplicate string and pass the list to our function again:

>>> planets = ['mercury', 'earth', 'mars', 'jupiter']
>>> has_duplicates(planets)
False 

Et voilà, this time it returns False as we expected.
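Building the whole set can be more work than necessary when a duplicate appears early in a large list. A minimal sketch of an alternative, assuming all the elements are hashable, stops at the first repeated value. The name has_duplicates_early_exit is only illustrative:

```python
def has_duplicates_early_exit(values):
    """Return True as soon as a repeated value is found."""
    seen = set()
    for value in values:
        if value in seen:
            # We have met this value before, so there is no need to keep going
            return True
        seen.add(value)
    return False


print(has_duplicates_early_exit(['mercury', 'earth', 'mars', 'jupiter', 'mars']))  # True
print(has_duplicates_early_exit(['mercury', 'earth', 'mars', 'jupiter']))  # False
```

Both versions give the same answer. The difference is only in how much of the list they need to read before answering.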

Search For Duplicates in a Python List

Now that we know how to check IF a list contains duplicates it would be useful to get the value of duplicate elements.

We could come up with some convoluted code that uses for loops to compare the list with the set and work out which elements were dropped, but that wouldn’t be the right approach.

A better approach could be to create a dictionary where every key is an item in the list and each value the number of times that item is present in the list.

We can achieve this result simply by using collections.Counter, a dictionary subclass in which the elements of an iterable become dictionary keys and their counts become the values. Let’s go back to the original list, the one that contains ‘mars’ twice:

>>> from collections import Counter
>>> planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars']
>>> Counter(planets)
Counter({'mars': 2, 'mercury': 1, 'earth': 1, 'jupiter': 1})

With a single line of code we can see that the string ‘mars’ appears two times in the list.

We can then create a list of duplicates using the following list comprehension:

>>> [key for key, count in Counter(planets).items() if count > 1]
['mars']

This expression creates a list that contains keys for which the count value is greater than 1 (they appear more than one time in the original list).
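If you also want to know how many times each duplicate appears, the same idea can be wrapped in a small function. This is only a sketch and duplicate_counts is a hypothetical name, not something provided by the collections module:

```python
from collections import Counter


def duplicate_counts(values):
    """Return a dictionary {value: count} for values that appear more than once."""
    counts = Counter(values)  # count every element in a single pass
    return {value: count for value, count in counts.items() if count > 1}


planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars']
print(duplicate_counts(planets))  # {'mars': 2}
```

Building the Counter once and iterating over its items avoids counting the whole list again for every key.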

Check For Duplicates in Two Python Lists

In some cases you might want to find elements that are the same in two different lists.

Let’s take the following lists:

>>> planets1 = ['mercury', 'earth', 'mars']
>>> planets2 = ['earth', 'jupiter', 'saturn']

We convert them into sets and look at the methods available to sets, in case there is anything that can help us.

>>> p1 = set(planets1)
>>> p2 = set(planets2)
>>> p1.
p1.add(                          p1.intersection(                 p1.remove(
p1.clear(                        p1.intersection_update(          p1.symmetric_difference(
p1.copy(                         p1.isdisjoint(                   p1.symmetric_difference_update(
p1.difference(                   p1.issubset(                     p1.union(
p1.difference_update(            p1.issuperset(                   p1.update(
p1.discard(                      p1.pop(                           

The intersection method could be the one, let’s confirm it using its help page:

>>> help(p1.intersection)

Yes, that’s the correct method…

>>> p1.intersection(p2)
{'earth'} 

The result is a set that contains the element in common.

We can obtain the same result by using the & operator:

>>> p1 & p2
{'earth'} 
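A set does not keep the order of the original lists. If the order matters, one possible sketch is to use a set only for the lookups and walk the first list in order. The function name common_elements is made up for this example and the elements are assumed to be hashable:

```python
def common_elements(first, second):
    """Return the items of first that also appear in second, in first's order."""
    lookup = set(second)  # fast membership tests
    seen = set()          # avoid returning the same item twice
    result = []
    for item in first:
        if item in lookup and item not in seen:
            result.append(item)
            seen.add(item)
    return result


planets1 = ['mercury', 'earth', 'mars', 'saturn']
planets2 = ['saturn', 'earth', 'jupiter']
print(common_elements(planets1, planets2))  # ['earth', 'saturn']
```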

Check For Duplicates in a List of Tuples

What if we have a list of tuples and we want to verify whether there are any duplicates and which ones they are?

Let’s say we have created a game and we use a list of tuples to store first name and score for each player.

But for some reason we didn’t consider that there could be two players with the same first name and score.

When we identify the problem we decide to create a function that tells us if there is a duplicate in our list of tuples and which one is the duplicate.

>>> scores = [('Jane', 45), ('Anthony', 340), ('Jake', 34), ('Jane', 45)]

We can use the same approach explained before with collections.Counter to get back a dictionary that tells us which tuples are duplicated and how many times each one is present.

>>> from collections import Counter
>>> Counter(scores)
Counter({('Jane', 45): 2, ('Anthony', 340): 1, ('Jake', 34): 1}) 

Pretty simple to do. That’s one of the reasons why I love Python. Things you might think require lots of code can often be written with just a couple of lines.

Let’s write a function that raises an exception at the first duplicate tuple found in the list.

from collections import Counter 

def has_duplicates(elements):
    counter = Counter(elements) 

    for key, value in counter.items():
        if value > 1:
            raise ValueError("Duplicate score found {}".format(key))
 
scores = [('Jane', 45), ('Anthony', 340), ('Jake', 34), ('Jane', 45)]
has_duplicates(scores)

The output is:

# python3 duplicates_list.py
Traceback (most recent call last):
  File "duplicates_list.py", line 12, in <module>
    has_duplicates(scores)
  File "duplicates_list.py", line 8, in has_duplicates
    raise ValueError("Duplicate score found {}".format(key))
ValueError: Duplicate score found ('Jane', 45) 

This is just to give you an idea of the logic you can implement depending on what you need your Python program to do.
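In a real program you would probably not let the exception stop the script. A minimal sketch of how the caller could handle it is shown here. The function is renamed check_no_duplicates purely for this example, to make clear that it raises instead of returning a boolean:

```python
from collections import Counter


def check_no_duplicates(elements):
    """Raise ValueError for the first element that appears more than once."""
    for key, value in Counter(elements).items():
        if value > 1:
            raise ValueError("Duplicate score found {}".format(key))


scores = [('Jane', 45), ('Anthony', 340), ('Jake', 34), ('Jane', 45)]

try:
    check_no_duplicates(scores)
    message = "No duplicates found"
except ValueError as error:
    # Keep the program running and report the problem instead
    message = str(error)

print(message)  # Duplicate score found ('Jane', 45)
```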

Find Duplicates in a List of Dictionaries

This time we want to find duplicate objects in a list of dictionaries.

>>> users = [{'name':'Jane', 'score': 45}, {'name':'Anthony', 'score': 234}, {'name':'John', 'score': 786}, {'name':'Jane', 'score': 45}]

A duplicate dictionary would be one that has the same values for both keys ‘name’ and ‘score’.

With a list comprehension we can generate a list of lists where each list contains both values for each dictionary:

>>> [list(user.values()) for user in users]
[['Jane', 45], ['Anthony', 234], ['John', 786], ['Jane', 45]] 

I wonder what happens if I use collections.Counter with this list of lists:

>>> from collections import Counter
>>> Counter([['Jane', 45], ['Anthony', 234], ['John', 786], ['Jane', 45]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/claudiosabato/opt/anaconda3/lib/python3.8/collections/__init__.py", line 552, in __init__
    self.update(iterable, **kwds)
  File "/Users/claudiosabato/opt/anaconda3/lib/python3.8/collections/__init__.py", line 637, in update
    _count_elements(self, iterable)
TypeError: unhashable type: 'list' 

Why are we getting the error unhashable type: ‘list’?

This error is caused by the fact that you cannot use lists as the keys of a dictionary: dictionary keys have to be hashable, and lists are not hashable because they are mutable.
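A quick way to see the difference is to call the built-in hash() function on a tuple and on a list. The helper is_hashable below is a hypothetical name used only for this sketch:

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key or set element."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable(('Jane', 45)))    # True
print(is_hashable(['Jane', 45]))    # False
print(is_hashable(('Jane', [45])))  # False, the tuple contains a list
```

The last call shows that a tuple is only hashable when everything inside it is hashable too.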

So, let’s convert our list of lists into a list of tuples and then apply collections.Counter to it again.

To get a list of tuples we have to update the previous list comprehension and also add the tuple() function:

>>> [tuple(user.values()) for user in users]
[('Jane', 45), ('Anthony', 234), ('John', 786), ('Jane', 45)] 

And now let’s apply Counter to it:

>>> Counter([tuple(user.values()) for user in users])
Counter({('Jane', 45): 2, ('Anthony', 234): 1, ('John', 786): 1}) 

The only duplicate dictionary is the one whose values are ‘Jane’ and 45.
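One caveat: tuple(user.values()) depends on the order in which the keys were added to each dictionary, so two dictionaries with the same data but a different key order would produce different tuples. A possible sketch that avoids this reads the keys explicitly. The function name duplicate_records and its keys parameter are assumptions made for this example:

```python
from collections import Counter


def duplicate_records(records, keys):
    """Return the value tuples that appear more than once for the given keys."""
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return [values for values, count in counts.items() if count > 1]


users = [
    {'name': 'Jane', 'score': 45},
    {'score': 234, 'name': 'Anthony'},
    {'name': 'John', 'score': 786},
    {'score': 45, 'name': 'Jane'},  # same data as the first user, different key order
]
print(duplicate_records(users, ('name', 'score')))  # [('Jane', 45)]
```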

Find Duplicates in a List and Get Their Index

We have seen how to find duplicates in a list, but how can we get their index in the list?

Let’s first create a function that uses the list comprehension we have created at the beginning of this tutorial to get duplicates in a list:

from collections import Counter 

def get_duplicates(values):
    return [key for key, count in Counter(values).items() if count > 1]

planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars', 'earth']
duplicates = get_duplicates(planets)
print(duplicates) 

We get back a list that contains the duplicates:

# python3 duplicates_list.py
['earth', 'mars'] 

The next step is to get the indexes in the list for each element that has duplicates. For that we will use the enumerate function.

Here is how you can generate all the indexes in our list using enumerate:

>>> [index for index, value in enumerate(planets)]
[0, 1, 2, 3, 4, 5] 

Next, create a function that takes our list and one of its elements as inputs and returns a dictionary in which the key is the element and the value is the list of indexes at which that element appears.

It’s easier to code than to explain 🙂

def get_indexes_for_element(values, element):
    element_indexes = [index for index, value in enumerate(values) if value == element]
    return { element : element_indexes } 

Let’s call it to see if it returns what we expect:

planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars', 'earth']
print(get_indexes_for_element(planets, 'earth'))

[output]
{'earth': [1, 5]} 

Exactly what we want!

Time to put everything together…

…we will create a list of dictionaries where each dictionary has the format we have just seen with the string ‘earth’.

Let’s add a third function that goes through all the duplicates and generates the final list of dictionaries:

def get_indexes_for_duplicates(values, duplicates):
    indexes_for_duplicates = [] 

    for duplicate in duplicates:
        indexes_for_duplicates.append(get_indexes_for_element(values, duplicate))

    return indexes_for_duplicates 

Here is the final code:

from collections import Counter 

def get_duplicates(values):
    return [key for key, count in Counter(values).items() if count > 1]

def get_indexes_for_element(values, element):
    element_indexes = [index for index, value in enumerate(values) if value == element]
    return { element : element_indexes } 

def get_indexes_for_duplicates(values, duplicates):
    indexes_for_duplicates = [] 

    for duplicate in duplicates:
        indexes_for_duplicates.append(get_indexes_for_element(values, duplicate))

    return indexes_for_duplicates
 

planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars', 'earth']
duplicates = get_duplicates(planets)
print(get_indexes_for_duplicates(planets, duplicates))

And the output is…

# python3 duplicates_list.py
[{'earth': [1, 5]}, {'mars': [2, 4]}] 

It works well 🙂
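The code above scans the list once for every duplicate. For longer lists, a possible alternative is to collect all the indexes in a single pass with collections.defaultdict. Note that this sketch returns one dictionary instead of a list of dictionaries, and that indexes_of_duplicates is a name chosen just for this example:

```python
from collections import defaultdict


def indexes_of_duplicates(values):
    """Map every duplicated value to the list of indexes where it appears."""
    positions = defaultdict(list)
    for index, value in enumerate(values):
        positions[value].append(index)
    # Keep only the values found at more than one index
    return {value: indexes for value, indexes in positions.items() if len(indexes) > 1}


planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars', 'earth']
print(indexes_of_duplicates(planets))  # {'earth': [1, 5], 'mars': [2, 4]}
```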

Find Duplicates in a Python List and Remove Them

One last thing that can be useful to do is to remove any duplicate elements from a list.

We could use the list remove() method to do that, but it would only work well if a single duplicate of a given element is present in the list.

Let’s have a look at this example:

>>> planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars', 'earth']
>>> planets.remove('earth')
>>> planets
['mercury', 'mars', 'jupiter', 'mars', 'earth']
>>> planets.remove('mars')
>>> planets
['mercury', 'jupiter', 'mars', 'earth'] 

The list remove() method deletes the first occurrence of a given element from a list.

For this approach to work, after removing a given element we need to confirm if the list still contains any duplicates.

We can use a while loop that is executed as long as the list of duplicates is not empty:

from collections import Counter 

def get_duplicates(values):
    return [key for key, count in Counter(values).items() if count > 1]

planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars', 'earth']
print("The initial list is {}".format(planets)) 

while len(get_duplicates(planets)) != 0:
    duplicates = get_duplicates(planets)
    print("Loop iteration: the duplicates in the list are {}".format(duplicates)) 
    planets.remove(duplicates[0])

print("The list without duplicates is {}".format(planets)) 

If the list still contains duplicates we remove from the list the first element in the duplicates list. Eventually the duplicates list will be empty and the execution of the while loop will stop.

# python3 remove_duplicates.py
The initial list is ['mercury', 'earth', 'mars', 'jupiter', 'mars', 'earth']
Loop iteration: the duplicates in the list are ['earth', 'mars']
Loop iteration: the duplicates in the list are ['mars']
The list without duplicates is ['mercury', 'jupiter', 'mars', 'earth'] 
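A shorter alternative relies on the fact that dictionaries keep insertion order (Python 3.7 and later). Keep in mind that this sketch behaves differently from the while loop: it keeps the first occurrence of each element instead of the last one, and it returns a new list instead of modifying the original. The function name is only illustrative:

```python
def remove_duplicates_keep_first(values):
    """Return a new list without duplicates, keeping the first occurrence of each item."""
    # dict.fromkeys() creates one key per distinct item, in order of first appearance
    return list(dict.fromkeys(values))


planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars', 'earth']
print(remove_duplicates_keep_first(planets))  # ['mercury', 'earth', 'mars', 'jupiter']
```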

How To Remove Duplicate Numbers From a List

Let’s find out if the approach we just used to remove duplicate strings from a list also works with a list of numbers.

Firstly we will make our code more generic by using an additional function that receives a list and returns the same list without duplicates.

def get_list_without_duplicates(values):
    print("The initial list is {}".format(values)) 

    while len(get_duplicates(values)) != 0:
        duplicates = get_duplicates(values)
        print("Loop iteration: the duplicates in the list are {}".format(duplicates))
        values.remove(duplicates[0])

    print("The list without duplicates is {}".format(values))
    return values 

The implementation of the get_duplicates() function doesn’t change compared to the previous code. And here is how we can call the new function:

planets = ['mercury', 'earth', 'mars', 'jupiter', 'mars', 'earth']
print(get_list_without_duplicates(planets)) 

Confirm that the result is correct before continuing.

Now, let’s try to pass a list of numbers instead.

numbers = [1, 2, 3, 3, 3, 4, 3, 5, 5, 7, 54, 45, 43, 43, 2, 1]
print(get_list_without_duplicates(numbers)) 

Our program does the job:

# python3 remove_duplicate_numbers.py
The initial list is [1, 2, 3, 3, 3, 4, 3, 5, 5, 7, 54, 45, 43, 43, 2, 1]
Loop iteration: the duplicates in the list are [1, 2, 3, 5, 43]
Loop iteration: the duplicates in the list are [2, 3, 5, 43]
Loop iteration: the duplicates in the list are [3, 5, 43]
Loop iteration: the duplicates in the list are [3, 5, 43]
Loop iteration: the duplicates in the list are [3, 5, 43]
Loop iteration: the duplicates in the list are [5, 43]
Loop iteration: the duplicates in the list are [43]
The list without duplicates is [4, 3, 5, 7, 54, 45, 43, 2, 1]
[4, 3, 5, 7, 54, 45, 43, 2, 1] 

If you want the list to be sorted you can do it using the list sort() method in the get_list_without_duplicates() function before the return statement.

def get_list_without_duplicates(values):
    ...
    ...
    values.sort()
    return values 

Try to run the program and confirm that you receive a sorted list.
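When a sorted result is all you need and the original order does not matter, a set combined with the built-in sorted() function could be enough. This is a sketch that assumes the elements are hashable and can be compared with each other. The name sorted_unique is made up for the example:

```python
def sorted_unique(values):
    """Return a new sorted list that contains each value only once."""
    return sorted(set(values))


numbers = [1, 2, 3, 3, 3, 4, 3, 5, 5, 7, 54, 45, 43, 43, 2, 1]
print(sorted_unique(numbers))  # [1, 2, 3, 4, 5, 7, 43, 45, 54]
print(len(numbers))            # 16, the original list is left untouched
```

Unlike get_list_without_duplicates(), this version does not modify the list passed to it.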

Conclusion

After going through this tutorial you should have a clear idea of how to check if a list has duplicates and how to get the value and index of the duplicates.

We have also seen how this works with lists of lists, lists of tuples and lists of dictionaries.

And now it’s time for you to use the method you feel is best for you.

Happy coding!
