Web Scraping: The Basics
What is Web Scraping?
In simple words, web scraping is the process of extracting data from a webpage and making it usable. Plenty of websites contain invaluable data. For example, Wikipedia’s 2022 FIFA World Cup squads page contains a lot of information about the players at the 2022 World Cup, and that is the data we will be extracting in this web scraping series of tutorials.
Why Web Scrape?
When trying to answer “business” questions with data, it can be difficult to find appropriate datasets; the ones you do find may be incomplete or inaccurate. It is therefore important to know how to obtain data from various sources.
Web scraping can, of course, be done manually: you can visit a web page in your browser and type the data into a database or spreadsheet. The problem with this method is that it is extremely time-consuming.
With the Python programming language, one can extract data from a webpage efficiently and with ease. The caveat is that you need basic knowledge of how webpages are structured in the browser (HTML) and, additionally, the basics of the Python programming language.
Extracting Players’ Data
We are going to extract data from Wikipedia’s 2022 FIFA World Cup squads page, so get your favourite browser and console ready!
To start basic web scraping you will require two (2) libraries:
- Requests – Allows you to make HTTP requests with Python
- BeautifulSoup – Is used for parsing HTML and XML. It creates a tree structure of the HTML you get from an HTTP request
Step 1: Install the libraries
In your console, install the libraries:
pip install requests beautifulsoup4
Step 2: Import the libraries
In your Python script or console, import the libraries:
import requests
from bs4 import BeautifulSoup
Step 3: Make a GET request to the URL
url = "https://en.wikipedia.org/wiki/2022_FIFA_World_Cup_squads"
response = requests.get(url)
response
If everything went okay, you should see a response with status code 200. This means that the GET request was successful.
It is possible to get different responses. The most common are:
- 404 – Not Found – the requested resource does not exist
- 400 – Bad Request – usually an error on the client side
- 300 – Multiple Choices – the request has more than one possible response
- 429 – Too Many Requests – triggered by a suspicious number of requests from a single user/IP in a short time
Therefore, you should only proceed with scraping if you get a 200 (OK) response, i.e. a successful GET request.
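If you are ever unsure what a particular status code means, Python’s standard library can look up its official phrase. Here is a minimal sketch using http.HTTPStatus, with no third-party packages or network access required (the describe helper is just for illustration):

```python
from http import HTTPStatus

def describe(code: int) -> str:
    """Return 'code - phrase' for an HTTP status code."""
    return f"{code} - {HTTPStatus(code).phrase}"

# Look up the codes mentioned above.
for code in (200, 404, 400, 300, 429):
    print(describe(code))
```

You could use a check like response.status_code == 200 before continuing with any scraping logic.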
Step 4: Parsing the Source and Accessing Elements
Get the HTML source through response.text and parse it using BeautifulSoup.
source = response.text
src_soup = BeautifulSoup(source, "html.parser")
After parsing the source you should be ready to extract data from the page. BeautifulSoup provides us with functions to access the HTML elements. For example, if I wanted to extract the title of the page, I would inspect the element in my browser to determine its HTML tag and the tag’s attributes.
I would then use the find() function provided by BeautifulSoup to access it.
title_tag = src_soup.find("span", {"class": "mw-page-title-main"})
title_tag
To take it a step further, you can obtain the text as follows:
title_tag.text
So there we have it: the basics of web scraping. With the knowledge you have gained from this post, you can extract a few things from any web page. In the next post I shall dive deeper into the functions provided to us by BeautifulSoup, in particular find() and find_all().
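As a small preview of the difference between those two functions, here is a sketch run against a tiny inline HTML snippet rather than a live page (the snippet and its tags are made up for illustration; they merely mimic the structure of a real page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a real page's source.
html = """
<html><body>
  <span class="mw-page-title-main">2022 FIFA World Cup squads</span>
  <h3>Group A</h3>
  <h3>Group B</h3>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_h3 = soup.find("h3")      # find() returns only the first match
all_h3 = soup.find_all("h3")    # find_all() returns every match as a list
print(first_h3.text)
print([tag.text for tag in all_h3])
```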
Full Script
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/2022_FIFA_World_Cup_squads"
response = requests.get(url)
source = response.text
src_soup = BeautifulSoup(source, "html.parser")
title_tag = src_soup.find("span", {"class": "mw-page-title-main"})
title_tag
title_tag.text