Web Scraping: The Basics
What is Web Scraping?
In simple words, web scraping is the process of extracting data from a webpage and making it usable. Plenty of websites contain invaluable data. For example, Wikipedia’s 2022 FIFA World Cup squads page contains a lot of information about the players at the 2022 World Cup, and that is the data we will be extracting in this web scraping series of tutorials.
Why Web Scrape?
When trying to answer “business” questions with data, it can be difficult to find appropriate datasets; the ones you do find may be incomplete or inaccurate. It is therefore important to know how to obtain data from various sources.
Web scraping can, of course, be done manually: you can visit a web page in your browser and type the data into a database or spreadsheet. The problem with this method is that it is extremely time-consuming.
With the Python programming language, one can extract data from a webpage efficiently and with ease. The caveat is that you need basic knowledge of how webpages are structured in the browser (HTML) and, additionally, the basics of the Python programming language.
Extracting Players’ Data
We are going to extract data from Wikipedia’s 2022 FIFA World Cup squads page, so get your favourite browser and console ready!
To start basic web scraping you will require two (2) libraries:
- Requests – Allows you to make HTTP requests with Python
- BeautifulSoup – Is used for parsing HTML and XML. It creates a tree structure of the HTML you get from an HTTP request
Step 1: Install the libraries
In your console, install the libraries:
pip install requests beautifulsoup4
Step 2: Import the libraries
In your Python script or console, import the libraries:
import requests
from bs4 import BeautifulSoup
Step 3: Make a GET request to the URL
url = "https://en.wikipedia.org/wiki/2022_FIFA_World_Cup_squads"
response = requests.get(url)
response
If everything went okay, you should see a response with status code 200. This means that the GET request was successful.
It is possible to get different responses. The most common are:
- 404 – Not Found – the requested resource does not exist
- 400 – Bad Request – usually an error on the client side
- 300 – Multiple Choices – the request has more than one possible response
- 429 – Too Many Requests – triggered by a suspicious number of requests from a single user/IP in a short time
Therefore, you should only proceed with scraping if you get a 200 (OK) response, i.e. a successful GET request.
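If you are ever unsure what a particular status code means, Python’s standard library can look up its official phrase. Here is a minimal sketch using http.HTTPStatus, with no third-party packages or network access required (the describe helper is just for illustration):

```python
from http import HTTPStatus

def describe(code: int) -> str:
    """Return 'code - phrase' for an HTTP status code."""
    return f"{code} - {HTTPStatus(code).phrase}"

# Look up the codes mentioned above.
for code in (200, 404, 400, 300, 429):
    print(describe(code))
```

You could use a check like response.status_code == 200 before continuing with any scraping logic.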
Step 4: Parsing the Source and Accessing Elements
Get the HTML source through response.text and parse it using BeautifulSoup.
source = response.text
src_soup = BeautifulSoup(source, "html.parser")
After parsing the source you should be ready to extract data from the page. BeautifulSoup provides us with functions to access the HTML elements. For example, if I wanted to extract the title of the page, I would inspect the element in my browser to determine its HTML tag and the tag’s attributes.
I would then use the find() function provided by BeautifulSoup to access it.
title_tag = src_soup.find("span", {"class": "mw-page-title-main"})
title_tag
To take it a step further, you can obtain the text as follows:
title_tag.text
So there we have it: the basics of web scraping. With the knowledge you have gained from this post, you can extract a few things from any web page. In the next post I shall dive deeper into the functions provided to us by BeautifulSoup, in particular find() and find_all().
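As a small preview of the difference between those two functions, here is a sketch run against a tiny inline HTML snippet rather than a live page (the snippet and its tags are made up for illustration; they merely mimic the structure of a real page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a real page's source.
html = """
<html><body>
  <span class="mw-page-title-main">2022 FIFA World Cup squads</span>
  <h3>Group A</h3>
  <h3>Group B</h3>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_h3 = soup.find("h3")      # find() returns only the first match
all_h3 = soup.find_all("h3")    # find_all() returns every match as a list
print(first_h3.text)
print([tag.text for tag in all_h3])
```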
Full Script
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/2022_FIFA_World_Cup_squads"
response = requests.get(url)
source = response.text
src_soup = BeautifulSoup(source, "html.parser")
title_tag = src_soup.find("span", {"class": "mw-page-title-main"})
title_tag
title_tag.text