Web Scraping: Scraping Player Details

Introduction

In the previous two posts of this series, Web Scraping: The Basics and Web Scraping: find, find_all and get, I introduced you to the basics of web scraping. In this post, we'll use a Python script and what we have learnt so far to scrape player data from the Wikipedia page for the 2022 World Cup squads (https://en.wikipedia.org/wiki/2022_FIFA_World_Cup_squads).

Scraping

The first step in our script is to import the necessary libraries: requests, pandas, and BeautifulSoup. Requests is a library for making HTTP requests, pandas is a library for data manipulation and analysis, and BeautifulSoup is a library for parsing and navigating HTML and XML documents.

import requests
import pandas as pd
from bs4 import BeautifulSoup

Next, we define the URL of the Wikipedia page we want to scrape and use the requests.get function to make an HTTP GET request to it. This returns a Response object, whose text attribute holds the source code of the page. We then create a BeautifulSoup object from the source code, which allows us to parse and navigate the page using its HTML structure (you can also pass "lxml" instead of "html.parser"; it is usually faster but requires the lxml package to be installed).

url = "https://en.wikipedia.org/wiki/2022_FIFA_World_Cup_squads"
response = requests.get(url)
source = response.text
src_soup = BeautifulSoup(source, "html.parser")
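Before parsing, it can help to confirm that the request actually succeeded, because requests.get does not raise an error for a 404 or 500 response by default. The sketch below is one possible way to do that. The helper name fetch_html, the User-Agent string and the 10-second timeout are illustrative assumptions and are not part of the original script.

```python
import requests

# Hypothetical User-Agent; replace it with one that identifies your script.
HEADERS = {"User-Agent": "wc-squads-tutorial/0.1 (learning project)"}


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page source for url, raising on HTTP errors.

    The `get` parameter exists only so the function can be tried
    without network access; by default it is requests.get.
    """
    response = get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, the source could be obtained with source = fetch_html(url) and passed to BeautifulSoup as before.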

Now that we have the page loaded and parsed, we can start extracting the data we want. In this script, we use the find_all method of the BeautifulSoup object to find all of the tr elements with the class "nat-fs-player"; each of these is a table row holding one player's information. We loop through these rows, extracting the data for each player using the find_all and find methods to locate specific HTML elements within the row. We also create two variables: player_number, a counter for the players, and players_dict, a dictionary that stores each player's details. We will later use players_dict to write the CSV file.

players = src_soup.find_all("tr", {"class": "nat-fs-player"})
player_number = 0
players_dict = {}
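To see what find_all returns for a class filter, here is a small self-contained example. The HTML fragment is a simplified stand-in written for illustration; apart from the class name, it is not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment: one header row and two player rows carrying the class.
sample_html = """
<table>
  <tr><th>No.</th><th>Pos.</th><th>Player</th></tr>
  <tr class="nat-fs-player"><td>1</td><td><a>GK</a></td></tr>
  <tr class="nat-fs-player"><td>2</td><td><a>DF</a></td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
rows = sample_soup.find_all("tr", {"class": "nat-fs-player"})

# Only rows with the class are returned; the header row is skipped.
positions = [row.find_all("td")[1].find("a").text for row in rows]
print(len(rows), positions)  # prints: 2 ['GK', 'DF']
```

find_all always returns a list-like result, so an empty result (for example, if the class name changes on the page) simply means the loop body never runs.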

For each player, we extract the following information:

  • First and last name
  • Date of birth
  • Position
  • Number of international caps (appearances for the national team)
  • Number of international goals scored
  • Club team
  • Country of the club team

We also split the date of birth into its individual parts (day, month, and year) for easier analysis. Note the use of the find(), find_all() and get() methods.

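for player in players:
    # Every <td> cell in the row, in column order.
    player_dets = player.find_all("td")
    player_position = player_dets[1].find("a").text
    # Drop the first 13 and last 11 characters, leaving only the written date.
    player_dob = player_dets[2].text
    pdob = player_dob[13:]
    player_dob = pdob[:len(pdob) - 11]
    # Split the date of birth into day, month and year.
    dob_split = player_dob.split(" ")
    date = dob_split[0]
    mob = dob_split[1]
    yob = dob_split[2]
    player_caps = player_dets[3].text.strip()
    player_goals = player_dets[4].text.strip()
    # The alt text of the flag image is used as the club's country.
    club_team_country = player_dets[5].find("img").get("alt")
    player_club = player_dets[5].find_all("a")[1].text
    # Names containing ", " are split into other names and first name.
    player_name = player.find("th", {"scope": "row"}).get("data-sort-value")
    if ", " in player_name:
        first_name = player_name.split(", ")[1]
        other_names = player_name.split(", ")[0]
    else:
        first_name = player_name
        other_names = "N/A"
    # Store this player's details under a running player number.
    player_number += 1
    players_dict[player_number] = [first_name, other_names, player_name, player_dob, date, mob, yob, player_position, player_caps, player_goals, player_club, club_team_country]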
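The fixed slice offsets used for the date of birth depend on the cell text having exactly the expected length, so they can break if the page layout changes. As an alternative, the sketch below searches for the written date with a regular expression. The helper name split_dob and the sample cell text are assumptions made for illustration; the real cell text may differ.

```python
import re

# Matches a date written as "day month-name year", e.g. "24 April 1988".
DOB_PATTERN = re.compile(r"(\d{1,2}) ([A-Za-z]+) (\d{4})")


def split_dob(cell_text):
    """Return (day, month, year) strings found in cell_text, or None."""
    match = DOB_PATTERN.search(cell_text)
    if match is None:
        return None
    return match.groups()


# Assumed cell text, for illustration only.
print(split_dob("(1988-04-24)24 April 1988 (aged 34)\n"))
# prints: ('24', 'April', '1988')
```

Returning None when nothing matches lets the caller decide how to handle an unexpected cell, instead of failing with an IndexError part-way through the loop.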

Each player's details are stored in players_dict under a unique key (the player number), as the full script at the end of this post shows. Once all of the player data has been extracted, we use the pandas library to convert this dictionary into a DataFrame, which allows us to easily manipulate and analyze the data. Finally, we use the to_csv method of the DataFrame to save the data to a CSV file.

wc_players_dict = pd.DataFrame.from_dict(players_dict, orient="index", columns=["First Name", "Other Names", "Full Name", "DOB", "Day of Birth", "Month of Birth", "Year of Birth", "Position", "International Caps", "International Goals", "Club", "Club Country"])

wc_players_dict.to_csv("WorldCup2022Players.csv")
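To make the role of orient="index" concrete, here is a minimal sketch with two made-up rows. The sample names and the reduced set of columns are illustrative only and do not come from the scraped data.

```python
import pandas as pd

# Two made-up rows keyed by player number, mirroring the shape of
# players_dict but with fewer columns to keep the example short.
sample_players = {
    1: ["Ada", "Example", "GK"],
    2: ["Sam", "Sample", "DF"],
}

sample_df = pd.DataFrame.from_dict(
    sample_players,
    orient="index",  # each dictionary key becomes a row label
    columns=["First Name", "Other Names", "Position"],
)

# to_csv returns the CSV text when no file path is given.
csv_text = sample_df.to_csv()
print(csv_text)
```

The first column of the CSV holds the player numbers and has an empty header. If that is not what you want, to_csv accepts index_label to name the column or index=False to leave it out.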

Conclusion

The Python script we have used is relatively simple but powerful: it extracts and organizes the players' data from the webpage. Play around with the script to see if you can extract more data or organize it differently. I did not go into detail about pandas because it is beyond the scope of web scraping and could be a series of its own.

Thank you for reading through the series. I hope you learnt something. If you have a question or comment, feel free to reach out.

If you want to see what I did with the data, you can find out more on my GitHub: https://github.com/Hammy25/2022WorldCupPlayerAnalysis

Full Script:

# Python script to scrape player data
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/2022_FIFA_World_Cup_squads"

response = requests.get(url)
source = response.text
src_soup = BeautifulSoup(source, "html.parser")
players = src_soup.find_all("tr", {"class": "nat-fs-player"})
player_number = 0
players_dict = {}

for player in players:
    player_dets = player.find_all("td")
    player_position = player_dets[1].find("a").text
    player_dob = player_dets[2].text
    pdob = player_dob[13:]
    player_dob = pdob[:len(pdob) - 11]
    dob_split = player_dob.split(" ")
    date = dob_split[0]
    mob = dob_split[1]
    yob = dob_split[2]
    player_caps = player_dets[3].text.strip()
    player_goals = player_dets[4].text.strip()
    club_team_country = player_dets[5].find("img").get("alt")
    player_club = player_dets[5].find_all("a")[1].text
    player_name = player.find("th", {"scope": "row"}).get("data-sort-value")
    if ", " in player_name:
        first_name = player_name.split(", ")[1]
        other_names = player_name.split(", ")[0]
    else:
        first_name = player_name
        other_names = "N/A"
    player_number += 1
    players_dict[player_number] = [first_name, other_names, player_name, player_dob, date, mob, yob, player_position, player_caps, player_goals, player_club, club_team_country]

wc_players_dict = pd.DataFrame.from_dict(players_dict, orient="index", columns=["First Name", "Other Names", "Full Name", "DOB", "Day of Birth", "Month of Birth", "Year of Birth", "Position", "International Caps", "International Goals", "Club", "Club Country"])

wc_players_dict.to_csv("WorldCup2022Players.csv")
