
Search for Youtube Videos Using Python with 6 Lines of Code

I was wondering how I could use Python to search for videos on Youtube without having to do it myself.

And it was actually pretty simple!

That’s one of the things I love about Python: it allows you to create programs quickly and with just a few lines of code.

In this article I will show you how to search for a video on Youtube using Python. The program I will write can easily be used for any search.

How Youtube Search Works

First of all, in order to be able to search for videos using a program we need to understand the URL structure used by Youtube when we search for a video.

If I search for “Mozart” directly in Youtube I get redirected to the following URL:

https://www.youtube.com/results?search_query=mozart

So, the only part of the URL that changes is the search term.
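
To make that observation concrete, here is a minimal sketch that builds the search URL from a single-word search term. The names SEARCH_URL and simple_search_url are my own, chosen only for illustration.

```python
# Base of the search results URL seen in the browser. Only the part
# after "search_query=" changes from one search to the next.
SEARCH_URL = "https://www.youtube.com/results?search_query="


def simple_search_url(term):
    """Return the search results URL for a single-word search term."""
    return SEARCH_URL + term


print(simple_search_url("mozart"))
print(simple_search_url("beethoven"))
```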

Let’s start creating a simple Python program that does this specific search and returns the HTML from Youtube.

The urllib package

The main package used in Python to work with URLs is urllib, which includes several modules. The one we are interested in is urllib.request, which can be used to open and read URLs.

I will use urllib.request to get the HTML for the search results page on Youtube and print its HTML.

Python programs can access the code in another module using the import statement, so let’s:

  1. Import urllib.request in our program.
  2. Use the urlopen function of the urllib.request module to get the HTML of the Youtube search page.
  3. Print the HTML of the page.

For HTTP and HTTPS URLs, the urlopen function returns a http.client.HTTPResponse object whose body can be read using the read() method.

The read() method returns a bytes object because there is no way for urlopen to automatically determine the encoding of the byte stream it receives from the HTTP server. For this reason you also need to decode the bytes object returned by read() into a string using the decode() method.

import urllib.request

html = urllib.request.urlopen("https://www.youtube.com/results?search_query=mozart")
print(html.read().decode())
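
As a side note, you can see the difference between bytes and str without contacting Youtube at all. The sketch below decodes a hand-written bytes value; the sample content is made up for illustration, and it assumes UTF-8, which is also what decode() uses when called without arguments.

```python
# A hand-written stand-in for what read() returns: raw bytes, not text.
raw = "<title>Die Zauberflöte</title>".encode("utf-8")

print(type(raw))     # the bytes type
text = raw.decode()  # decode() defaults to UTF-8
print(type(text))    # the str type
print(text)

# The "ö" takes two bytes in UTF-8, so the two lengths differ by one.
print(len(raw), len(text))
```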

Here is a fragment of the HTML of the page printed by our program…

I’m showing you the part of the HTML we will focus on to identify the URL of a video from the search results page:

<div class="yt-lockup-content">
<h3 class="yt-lockup-title ">
<a href="/watch?v=ULihXz-MHH8" class="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link " data-sessionlink="itct=CHMQ3DAYCyITCP_O6sPq9OgCFVUMFgodouUKPjIGc2VhcmNoUgZtb3phcnSaAQMQ9CQ"  title="Sylvia Schwartz: Mozart - Duet Papageno &amp; Papagena from &quot;Die Zauberflöte&quot; (with Thomas Quasthoff)" rel="spf-prefetch" aria-describedby="description-id-143900" dir="ltr">Sylvia Schwartz: Mozart - Duet Papageno &amp; Papagena from &quot;Die Zauberflöte&quot; (with Thomas Quasthoff)</a>

In the third line of the HTML above you can see:

href="/watch?v=ULihXz-MHH8"

Why are we looking at this part of the HTML?

If I click on any Youtube video I get redirected to a URL in the following format:

https://www.youtube.com/watch?v=ULihXz-MHH8

Can you see the last part of the URL?

ULihXz-MHH8 is a unique identifier for this specific video. Youtube video identifiers are made of 11 characters.

So, to get the URL of each video in the Youtube search results page I have to find occurrences similar to the one we have seen above.
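
As a quick check on that URL format, this sketch pulls the identifier out of the href shown above using the standard urllib.parse module, then rebuilds the full video URL. It is only an illustration for a single known link, not the way the rest of this article finds identifiers.

```python
from urllib.parse import urlparse, parse_qs

href = "/watch?v=ULihXz-MHH8"  # taken from the HTML fragment above

# parse_qs turns the query string "v=ULihXz-MHH8" into {"v": ["ULihXz-MHH8"]}
query = parse_qs(urlparse(href).query)
video_id = query["v"][0]

print(video_id, len(video_id))
print("https://www.youtube.com" + href)
```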

How do we do that?

Finding a Pattern in HTML Using Regular Expressions

To find occurrences that include the 11-character identifier we can use regular expressions.

A regular expression (also known as regex) is a sequence of characters that defines a search pattern.

In this case the sequence of characters is:

/watch?v=<11_characters_identifier>

The module used in Python for regular expressions is called re. You can find more details about this module in the official Python documentation.

For the program we are creating we just need to know one specific function of this module: findall.

The function findall returns all non-overlapping matches for a specific pattern in a string (the HTML content of the Youtube search results page).

The generic syntax of the findall function is:

re.findall(pattern, string)
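
Before applying it to HTML, here is a small sketch of findall on an ordinary sentence. The text and the pattern are made up for this example and have nothing to do with Youtube yet.

```python
import re

text = "Symphony 40, Symphony 41 and Requiem 626"

# \d+ matches one or more digits; findall returns every match as a list
# of strings, in the order they appear in the text.
numbers = re.findall(r"\d+", text)
print(numbers)
```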

Note: regular expression patterns in Python are usually written as raw strings, prefixed with the letter ‘r’, so that backslashes reach the re module unchanged.

I will explain regular expression patterns in a different article. For now we just want to focus on the regular expression required to find the identifiers of the Youtube videos in the HTML of the search results page.

Once again, this is the string we are looking for:

/watch?v=<11_characters_identifier>

And here is the regular expression pattern:

r"watch\?v=(\S{11})"

So, let’s explain it:

  • r: as mentioned before, it marks the pattern as a raw string, so that backslashes are passed to the re module unchanged.
  • backslash ( \ ): used to escape special characters like the question mark ( ? ).
  • \S: matches any non-whitespace character.
  • {11}: specifies that exactly 11 copies of the previous regular expression should be matched. In this case \S.
  • round parentheses ( … ): indicate the start and end of a group. We use a group to define what the regular expression has to return, in this case just the 11-character identifiers (excluding the initial part, /watch?v=).
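
To see what the group changes, the sketch below runs the pattern on a shortened copy of the link from the HTML fragment shown earlier, once with the parentheses and once without. The last pattern is my own stricter variation, not part of the program in this article: it assumes identifiers only contain letters, digits, underscores and hyphens, which holds for every identifier shown here.

```python
import re

# Shortened version of the link from the search results HTML shown earlier.
html = '<a href="/watch?v=ULihXz-MHH8" class="yt-uix-tile-link">Mozart</a>'

# With the group, findall returns only the text inside the parentheses.
with_group = re.findall(r"watch\?v=(\S{11})", html)

# Without the group, findall returns the whole match.
without_group = re.findall(r"watch\?v=\S{11}", html)

# Assumption: identifiers use only letters, digits, "_" and "-".
stricter = re.findall(r"watch\?v=([\w-]{11})", html)

print(with_group)
print(without_group)
print(stricter)
```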

Time to Update our Python Code

The Python code we have written so far is:

import urllib.request

html = urllib.request.urlopen("https://www.youtube.com/results?search_query=mozart")
print(html.read().decode())

The next step is to add a line that uses the findall function to find the pattern we are looking for:

import urllib.request
import re

html = urllib.request.urlopen("https://www.youtube.com/results?search_query=mozart")
video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
print(video_ids)

Here is the output of the script:

['shoVsQhou-8', 'shoVsQhou-8', 'Rb0UmrCXxVA', 'Rb0UmrCXxVA', 'iUohO2MSot8', 'iUohO2MSot8', 'QEDZd066a2k', 'QEDZd066a2k', 'QHl6wYCwlcQ', 'QHl6wYCwlcQ',
......
(not all identifiers included to keep the output small)
...
'FpK1tjbeeA0', 'FpK1tjbeeA0', 'sjTLIW-qx_A', 'sjTLIW-qx_A', 'pB2p_r5Gvs8']

Basically we get back the list video_ids, which contains all the 11-character identifiers found in the Youtube search results page.
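
In the output above most identifiers appear more than once. If you only want each video once, one possible approach is sketched below, using a few identifiers copied from that output. It is an optional extra step and not part of the program in this article.

```python
# A few identifiers copied from the output above, duplicates included.
video_ids = ['shoVsQhou-8', 'shoVsQhou-8', 'Rb0UmrCXxVA', 'Rb0UmrCXxVA',
             'iUohO2MSot8', 'iUohO2MSot8']

# Dictionary keys are unique and keep insertion order, so this drops the
# repeats while keeping the identifiers in their original order.
unique_ids = list(dict.fromkeys(video_ids))
print(unique_ids)
```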

Finally, we can get the full URL of a video in the following way:

"https://www.youtube.com/watch?v=" + video_ids[i]

where the index i allows you to pick any element in the list video_ids. To select the first result we can use video_ids[0].
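
The same concatenation works for the whole list at once. Here is a minimal sketch that builds the URL of every video, using three sample identifiers copied from the output above; the name video_urls is my own.

```python
video_ids = ['shoVsQhou-8', 'Rb0UmrCXxVA', 'iUohO2MSot8']  # sample identifiers

# Build one full URL per identifier.
video_urls = ["https://www.youtube.com/watch?v=" + video_id
              for video_id in video_ids]

for url in video_urls:
    print(url)
```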

So, here is the version of the program that prints the URL for the first search result in Youtube:

import urllib.request
import re

search_keyword = "mozart"
html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
print("https://www.youtube.com/watch?v=" + video_ids[0])

And this is the output of our program, the URL of the first video in the Youtube search results when we search for “mozart”:

https://www.youtube.com/watch?v=Rb0UmrCXxVA

As you can see I have stored the value “mozart” in the variable search_keyword.

Now, let’s say I want to search for “mozart piano”…

Here is what happens when I replace the value of the search_keyword variable and run the program. I get the following error back:

http.client.InvalidURL: URL can't contain control characters. '/results?search_query=mozart piano' (found at least ' ')

It looks like this program only works for search queries that contain a single term.

How would you update it to support multiple terms?

I will leave it for you to solve! 🙂
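
If you want to compare your answer with one possible approach: the space has to be encoded before it goes into the URL. The sketch below uses urlencode from the standard urllib.parse module, which replaces each space with a plus sign. The function name build_search_url is made up for this example. The code only builds and prints the URL, which you can then pass to urlopen as in the program above.

```python
from urllib.parse import urlencode


def build_search_url(search_keyword):
    """Return a search URL that is safe to pass to urlopen."""
    # urlencode escapes spaces and other special characters in the value.
    query = urlencode({"search_query": search_keyword})
    return "https://www.youtube.com/results?" + query


print(build_search_url("mozart piano"))
```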

Conclusion

In this article we have covered a lot, and you now have an interesting program that you can extend however you like.

So, let’s recap what I have explained:

  • The urllib package and the urllib.request module.
  • Regular expressions in Python.
  • How to use a Python program to perform a Youtube search.

Everything clear? 🙂
