Web Scraping: find, find_all and get

Introduction (Recap)

In a previous tutorial, Web Scraping: The Basics, we went through the basics of web scraping with Python. That involved obtaining a webpage’s HTML source with a GET request and parsing it with BeautifulSoup in order to access elements.

In this tutorial we shall delve deeper into the find() function and introduce you to two more functions provided by BeautifulSoup. If you will be doing web scraping with BeautifulSoup you will definitely use these three functions. A good understanding of these functions is almost mandatory.

We will continue using Wikipedia’s 2022 World Cup Squads page.

find()

As the name suggests, the find function finds descendants of a “soup” element through filters you specify.

In the previous post, we were able to retrieve the wiki’s header through:

title_tag = src_soup.find("span", {"class": "mw-page-title-main"})

We told the find function to find a <span> tag that has a class value of “mw-page-title-main”. find() returns the FIRST element that fits these criteria. That is important to note because you may assume that only one element has a certain class value. However, if you are familiar with web development, you know that a class can be shared among several elements. Therefore, if find() returns an element you didn’t expect, it’s more likely than not that another element with the same class appears earlier in the parse tree created by BeautifulSoup.
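To make the "first match" behaviour concrete, here is a minimal sketch. The HTML snippet and the class name "note" are made up for illustration; they are not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: two <span> tags share the same class.
html = """
<div>
  <span class="note">first note</span>
  <p>Some text</p>
  <span class="note">second note</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first matching element in document order.
first = soup.find("span", {"class": "note"})
print(first.text)  # first note
```

The second <span> is never returned by find(), even though it matches the same filter.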

The basic syntax of the find() function is as follows:

# Find() syntax
find(tag, attrs)

The attributes (attrs) argument is not mandatory. If we call find("span") without it, we still get the heading, because the heading is the first <span> tag in the parse tree.
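A small sketch of the same idea, assuming a simplified stand-in for the page (the real Wikipedia markup is much larger, and the second <span> here is invented):

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: the title <span> comes first.
html = (
    '<h1><span class="mw-page-title-main">2022 FIFA World Cup squads</span></h1>'
    '<p><span class="other">a later span</span></p>'
)
soup = BeautifulSoup(html, "html.parser")

with_attrs = soup.find("span", {"class": "mw-page-title-main"})
without_attrs = soup.find("span")  # no attrs: first <span> wins

print(without_attrs.text)           # 2022 FIFA World Cup squads
print(with_attrs == without_attrs)  # True
```

This only works because the title happens to be the first <span>. On a page where another <span> comes earlier, the call without attrs would return that one instead.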

You may also have noticed that the attrs argument is a Python dictionary. If you want to be specific about an element, you can include more than one attribute. For example, the “From Wikipedia, the free encyclopedia” text is contained in a <div> tag that has both an id attribute (id="siteSub") and a class attribute (class="noprint").

We access it the following way:

sub = src_soup.find("div", {"id": "siteSub", "class": "noprint"})
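As a rough illustration of why the extra attribute helps, the sketch below uses a made-up snippet in which two <div> tags share the class "noprint" and only the id tells them apart. The first <div> is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: both <div> tags have class "noprint".
html = """
<div id="other" class="noprint">Not this one</div>
<div id="siteSub" class="noprint">From Wikipedia, the free encyclopedia</div>
"""
src_soup = BeautifulSoup(html, "html.parser")

# Every key in the dictionary must match for an element to be returned.
sub = src_soup.find("div", {"id": "siteSub", "class": "noprint"})
print(sub.text)  # From Wikipedia, the free encyclopedia
```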

That’s just about it for the find() function. You can find more about it in the official docs. One other thing worth mentioning is that you can pass the id and class attributes as keyword arguments instead of using the dictionary. This is not limited to id and class: BeautifulSoup turns any keyword argument it does not recognise into a filter on the attribute of that name. The dictionary is still needed for attribute names that are not valid Python keyword names, such as data-* attributes. For example:

src_soup.find("div", id="siteSub")
# and
src_soup.find("span", class_="mw-page-title-main")  # notice the underscore after "class"

find_all()

The find() function locates the first element in a parse tree that matches the given filters. In some cases you may want to locate every element that fits a certain set of parameters. In such cases, the find_all() function is more appropriate.

The find_all() function returns a list of the elements that match the given filters. Even if it finds only one element it returns a list of one, and if it finds nothing it returns an empty list. In other words, find_all() always returns a list.

find_all() is versatile and can do a lot, but I’ll just go through some helpful and commonly used filters.
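The "always a list" behaviour can be checked with a short sketch. The HTML below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two <li> tags, one <p>, no <table>.
html = "<ul><li>Qatar</li><li>Ecuador</li></ul><p>Only paragraph</p>"
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")       # two matches
single = soup.find_all("p")       # one match, still a list
missing = soup.find_all("table")  # no match, empty list

print(len(items), len(single), len(missing))  # 2 1 0
print(isinstance(single, list))               # True
```

Because the result is list-like, you can loop over it or index into it, but you cannot call .text on it directly. You have to pick an element first, for example single[0].text.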

Obtaining Different Elements

On Wikipedia’s 2022 World Cup Squads page, we may want to obtain all the headers on the page at once. To do that, we pass a list of header tag names to find_all(), for example find_all(["h1", "h2", "h3"]).
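A minimal sketch of passing a list of tag names, using a made-up page outline rather than the real Wikipedia source:

```python
from bs4 import BeautifulSoup

# Hypothetical outline of a page with three header levels.
html = (
    "<h1>Squads</h1><p>intro</p>"
    "<h2>Group A</h2><h3>Ecuador</h3><h3>Netherlands</h3>"
)
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of the listed tags, in document order.
headers = soup.find_all(["h1", "h2", "h3"])
for header in headers:
    print(header.name, header.text)

names = [header.name for header in headers]
print(names)  # ['h1', 'h2', 'h3', 'h3']
```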

Obtain a Specific Number

You may want to collect only a certain number of similar tags. For example, on our Wikipedia page we may want to work only with the players of the teams in group A. Calling find_all("table") returns every table on the page (41 tables). To get just the first 4, we pass the limit parameter: find_all("table", limit=4).

In conclusion, find_all() can do a lot for you, but I find these two features the most useful. To find out more, read the docs.
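The effect of limit can be sketched on a small generated page. Six dummy tables stand in for the 41 on the real page, and the "Team N" cell text is invented.

```python
from bs4 import BeautifulSoup

# Build six hypothetical tables: "Team 1" ... "Team 6".
html = "".join(
    f"<table><tr><td>Team {i}</td></tr></table>" for i in range(1, 7)
)
soup = BeautifulSoup(html, "html.parser")

all_tables = soup.find_all("table")
first_four = soup.find_all("table", limit=4)  # stop after four matches

print(len(all_tables), len(first_four))  # 6 4
print(first_four[-1].text)               # Team 4
```

The same result could be obtained by slicing, find_all("table")[:4], but limit lets the search stop early instead of collecting every table first.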

get()

The get() function allows you to access the value of an attribute. Sometimes you are not interested in the innerHTML/text. For instance, you may need the alt text of an <img> or the href value of an <a> tag in order to scrape a linked webpage.

On our webpage there’s a link to the Covid-19 wiki page. Once we have located its <a> tag, we can get the href of this link by calling get("href") on that tag.
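A minimal sketch of get(), assuming a simplified link. The href value, the image and the variable names below are illustrative, not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link and one image.
html = (
    '<p><a href="/wiki/COVID-19_pandemic" title="COVID-19 pandemic">'
    'COVID-19 pandemic</a> <img src="ball.png" alt="A football"></p>'
)
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
image = soup.find("img")

href = link.get("href")             # value of the href attribute
alt_text = image.get("alt")         # value of the alt attribute
no_id = link.get("id")              # attribute is absent: returns None
fallback = link.get("id", "no id")  # optional default instead of None

print(href)      # /wiki/COVID-19_pandemic
print(alt_text)  # A football
print(no_id)     # None
print(fallback)  # no id
```

An href like this one is relative, so it would need to be joined with the site’s base URL before you request the linked page.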

Conclusion

With these three functions, together with control structures, we shall be able to obtain data from the wiki. That is what we shall do in the final installment of this series. In the meantime, you can practice with the functions and see what you can obtain from the page. See you soon.
