Filename Pattern Matching with the Python glob Module

What is the Python glob module? For which type of Python applications can you use it?

The Python glob module is used to find pathnames that match a particular pattern. The general term “glob” describes methods for matching patterns according to Unix shell rules. With glob, you can use wildcards (“*”, “?”, “[ranges]”) in addition to exact string matching, which makes path retrieval more powerful.

Let’s learn the Python glob module!

What Is the Python glob Module?

The Python glob module searches for pathnames that match a pattern you define. The pattern follows the rules used by the Unix shell.

The glob() function of the glob module returns all the file paths that match a given pattern. We can use it to look for one specific filename or, more usefully, combine it with wildcard characters to find every file whose name fits a pattern.

To use glob, we have to import it using the Python import statement as shown below.

import glob

Note: glob is part of the Python standard library, so you do not need to install it separately (e.g. using pip or conda) before importing it.

After importing the glob module, we can use it to find the files that match a specific pattern on our local machine.

The following code snippet matches a specific filename in the directory /opt/app/tutorial/ and shows it in the output.

import glob

file = glob.glob("/opt/app/tutorial/file1.txt")
print(file)

The output is:

['/opt/app/tutorial/file1.txt']

In the code above, we are searching for only one text file. We can also search for multiple files with the same file extension.

import glob

files = glob.glob("/opt/app/tutorial/*.txt")
print(files)

As you can see, we have specified “*.txt”, which matches all the files with the .txt extension in the specified directory. The character “*” is a wildcard.

Let’s assume we have the following five txt files in the directory /opt/app/tutorial/:

file1.txt
file2.txt
file3.txt
file4.txt
file5.txt

After executing the Python code, we get the following output:

['/opt/app/tutorial/file2.txt', '/opt/app/tutorial/file3.txt', '/opt/app/tutorial/file1.txt', '/opt/app/tutorial/file4.txt', '/opt/app/tutorial/file5.txt']

The glob.glob() function returns the list of all the files that match the path we have specified.

You can also see that…

The glob() function of the Python glob module returns results in arbitrary order.
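
If a predictable order matters, one option is to sort the returned list. Below is a minimal sketch that assumes a throwaway directory created with tempfile (the directory and file names are only for illustration, not part of the examples above). glob.escape() is used as a precaution in case the temporary path contains wildcard characters.

```python
import glob
import os
import tempfile

# Hypothetical setup: a temporary directory holding three empty .txt files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("file2.txt", "file3.txt", "file1.txt"):
        open(os.path.join(tmp_dir, name), "w").close()

    # Escape the directory part so only "*.txt" is treated as a pattern
    pattern = os.path.join(glob.escape(tmp_dir), "*.txt")

    # sorted() returns a new list ordered alphabetically by path
    sorted_names = [os.path.basename(p) for p in sorted(glob.glob(pattern))]

print(sorted_names)  # → ['file1.txt', 'file2.txt', 'file3.txt']
```

Sorting happens after glob() has returned, so it does not change how the files are matched.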

Using Absolute and Relative Paths with Python glob

The path we provide to the glob() function can be absolute or relative.

Let’s see what this means…

  • Absolute path: this is the full path of the file from the root of the filesystem.
  • Relative path: this is the path relative to the current directory.

The two code snippets we have seen before use an absolute path.

Now, let’s see one example using a relative path…

import glob
 
files = glob.glob("*.txt")
print(files)

The path specified in this code snippet is a relative path and it refers to the current directory.

From the output you can see that the list of files only contains filenames and not the full path for each file:

['file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file5.txt']
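
If you search with a relative pattern but need full paths afterwards, one possible approach is to convert each match with os.path.abspath(). The sketch below assumes a temporary directory used as the current directory just for the demonstration; the file name is hypothetical.

```python
import glob
import os
import tempfile

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical setup: one empty file in the temporary directory
    open(os.path.join(tmp_dir, "file1.txt"), "w").close()

    os.chdir(tmp_dir)  # make the temporary directory the current directory
    try:
        # A relative pattern returns relative paths
        relative_matches = glob.glob("*.txt")
        # abspath() resolves each match against the current directory
        absolute_matches = [os.path.abspath(p) for p in relative_matches]
    finally:
        os.chdir(original_dir)  # restore the directory we started from

print(relative_matches)  # → ['file1.txt']
print(all(os.path.isabs(p) for p in absolute_matches))  # → True
```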

How to Use glob to Find Files Recursively

Searching recursively means applying the same search to every subdirectory, and to the subdirectories inside those. We can use glob this way to find files that live in subdirectories.

The recursive nature of the glob module is useful because you might not know the exact location of the files you are looking for.

To enable the recursive behavior, pass recursive=True to the glob() function. This argument is False by default.

Before using the recursive argument we want to understand the behaviour of the double-asterisk (**) wildcard when used with the glob.glob() function.

Create a directory called test_dir in the current directory and then inside test_dir create a file called file6.txt. Our directory structure becomes the following:

file1.txt
file2.txt
file3.txt
file4.txt
file5.txt
test_dir/file6.txt

Now execute the previous Python code that uses “*.txt” as pathname expression. The output is the following:

['file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file5.txt']

Now, pass the recursive argument to the glob function:

import glob

files = glob.glob("*.txt", recursive=True)
print(files)

When you execute the code you will notice that the output doesn’t change. That’s because the recursive argument has to be used in conjunction with the double-asterisk wildcard (**).

Update the expression passed to the glob() function as shown below (don’t change anything else in the previous code):

files = glob.glob("**/*.txt", recursive=True)

This time in the output you will also see “test_dir/file6.txt”. Our code has matched the .txt files in the current directory and also the .txt file in the test_dir subdirectory.

['file2.txt', 'file3.txt', 'file1.txt', 'file4.txt', 'file5.txt', 'test_dir/file6.txt']

Now, let’s try to set recursive to False without changing the pathname expression:

files = glob.glob("**/*.txt", recursive=False)

The output is:

['test_dir/file6.txt']

Our code has only matched the file in the subdirectory.

You now know how the double-asterisk works together with the recursive parameter when using the glob() function.
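
To try this without touching your own files, here is a self-contained sketch that rebuilds a smaller version of the directory structure in a temporary directory and compares both settings. The temporary directory and the names() helper are assumptions made for this illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical setup: file1.txt at the top level, file6.txt in test_dir
    os.mkdir(os.path.join(tmp_dir, "test_dir"))
    for rel_path in ("file1.txt", os.path.join("test_dir", "file6.txt")):
        open(os.path.join(tmp_dir, rel_path), "w").close()

    # Escape the directory part so only "**" and "*.txt" act as wildcards
    pattern = os.path.join(glob.escape(tmp_dir), "**", "*.txt")

    def names(paths):
        # Express each match relative to tmp_dir, sorted, with "/" separators
        return sorted(
            os.path.relpath(p, tmp_dir).replace(os.sep, "/") for p in paths
        )

    with_recursion = names(glob.glob(pattern, recursive=True))
    without_recursion = names(glob.glob(pattern, recursive=False))

print(with_recursion)     # → ['file1.txt', 'test_dir/file6.txt']
print(without_recursion)  # → ['test_dir/file6.txt']
```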

Using Wildcard Characters with glob

With the glob module you can use wildcards. The most commonly used wildcard characters are explained below.

Asterisk (*): the asterisk, also known as the star, matches zero or more characters within a single file or directory name.

filepath = "*.txt"
filepath = "*.py"
filepath = "*.jpg"

In the examples above, the * will match all the files with the specified extension.

Double Asterisk (**): when recursive=True is set, two asterisks (**) match any files and zero or more directories and subdirectories. Use the double asterisk when you want to search for files in the current folder as well as in its subfolders. Without recursive=True, the double asterisk behaves like a single asterisk.

filepath = "docs/**/*.txt"
filepath = "python/**/*.py"
filepath = "images/**/*.jpg"

Square Brackets ([ ]): square brackets are a very powerful wildcard because they let you match a single character against a set or range of characters.

For example:

  • [aeiou]: let’s say you want to match a lowercase vowel at a given position in the filename. To achieve this, specify all the lowercase vowels within square brackets.
  • [0-9]: if you want to match any single digit, specify the range of digits from 0 to 9 within square brackets.
  • [A-Z]: it matches any uppercase letter, while [a-z] matches any lowercase letter. You can combine uppercase and lowercase letters this way: [a-zA-Z]. Do not separate the ranges with a comma, because the comma would be treated as one more character to match.
  • [!abc]: if you want to exclude certain characters, use square brackets and put the (!) symbol first. This example matches any character except a, b, or c.
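
A quick way to experiment with these expressions without creating any files is the fnmatch module from the standard library, which the glob module relies on for matching names. The sketch below is only an illustration: the list of names and the matching() helper are made up, and fnmatchcase() is used so that the results do not depend on the operating system's case rules.

```python
import fnmatch

# Hypothetical filenames, used only to test the patterns
names = ["a.txt", "Echo.txt", "fox.txt", "1.txt", "67.txt"]

def matching(pattern):
    # Keep the names that match the pattern, comparing case-sensitively
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]

print(matching("[aeiou]*.txt"))   # → ['a.txt']
print(matching("[a-zA-Z]*.txt"))  # → ['a.txt', 'Echo.txt', 'fox.txt']
print(matching("[0-9]*.txt"))     # → ['1.txt', '67.txt']
print(matching("[!a-c]*.txt"))    # → ['Echo.txt', 'fox.txt', '1.txt', '67.txt']
print(matching("?.txt"))          # → ['a.txt', '1.txt']
```

The last pattern uses the question mark wildcard, which matches exactly one character.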

Let’s try some practical examples to understand the wildcard characters.

We have already seen how the * and ** wildcards work. Now let’s see an example with the square brackets ([ ]).

Within square brackets, we specify the pattern of numbers or letters to search for:

import glob

path1 = "[a-e]*.txt"
print('Files with a name that starts with a letter between a and e in the current directory')
print(glob.glob(path1))

path2 = "[1-5]*.txt"
print('Files with a name that starts with a number between 1 and 5 in the current directory')
print(glob.glob(path2))

Create the following files in the current directory to confirm the wildcard expressions work as expected:

a.txt
ask.txt
cat.txt
d.txt
echo.txt
delta.txt
fox.txt
1.txt
134.txt
345.txt
67.txt

And now execute the code.

The output is:

Files with a name that starts with a letter between a and e in the current directory
['echo.txt', 'ask.txt', 'a.txt', 'd.txt', 'delta.txt', 'cat.txt']
Files with a name that starts with a number between 1 and 5 in the current directory
['345.txt', '1.txt', '134.txt']

Our code works as expected and the files “fox.txt” and “67.txt” are not matched.

Why Would You Use iglob vs glob in Python?

Until now we have used the glob() function. In the glob module there is another function called iglob().

The iglob() function is similar to the glob() function. The main difference is that iglob() returns a generator that yields the file names matching the given pattern.

Let’s confirm the type of object the iglob() function returns by using one of the previous code examples. We will just replace the glob function with the iglob function.

import glob

files = glob.iglob("*.txt")
print(type(files))

[output]
<class 'generator'>

The glob() function goes through all the matching files and stores their names in memory at once as a list, while iglob() returns a generator that lets you iterate over the matches without holding the complete list of results in memory.

By passing the generator to the Python next() function you will get back the first filename returned by the iglob function:

print(next(files))

[output]
file2.txt

The iglob() function is useful when a pattern matches a large number of files. With glob(), the complete list of matching paths is built in memory before we can use any of them.

To avoid this, iglob() hands us the matching filenames one at a time through a generator, which can reduce memory usage.

To print all the files matched by the iglob function you can use a Python for loop:

import glob

files = glob.iglob("*.txt")

for filename in files:
    print(filename)
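
Because iglob() produces matches on demand, you can also stop early. The following sketch takes only the first two matches with itertools.islice() and leaves the rest in the generator. The temporary directory and file names are assumptions for the demonstration.

```python
import glob
import itertools
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical setup: five empty .txt files
    for i in range(1, 6):
        open(os.path.join(tmp_dir, f"file{i}.txt"), "w").close()

    matches = glob.iglob(os.path.join(glob.escape(tmp_dir), "*.txt"))

    # islice() pulls just two items from the generator
    first_two = list(itertools.islice(matches, 2))

    # Whatever has not been consumed yet is still available
    remaining = list(matches)

print(len(first_two), len(remaining))  # → 2 3
```

Which two files come first is not guaranteed, since the order of the results is arbitrary.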

Conclusion

You now should have a clear understanding of how the glob module works. You have learned how the glob module is particularly useful for tasks involving filename pattern matching and how to obtain a list of all the files that adhere to a particular pattern.

We covered some practical examples of how the glob module actually works, including some of the most used wildcard characters.

And you have also seen the difference between glob() and iglob(), and when iglob() can be a better choice than glob().

Bonus read: deepen your understanding of Python and learn more about Python yield that we have mentioned when describing the behaviour of the iglob() function.
