Having taken on many small projects involving web scraping and LINE Bot applications over the years, I had always assumed that the technology and process behind web scraping were common knowledge. With the rise of AI tools like ChatGPT and Bard, I also hadn't been asked about scraping projects in a while. Then today a friend asked me about scraping data from the MLB website. I simply opened the browser's developer tools, checked the Network tab on the page they sent me, and told them which API would return the data they wanted. Seeing their confused expression, I realized that many people have no idea how to start building a web scraper, or mistakenly believe that learning Python alone is enough to scrape data. That is how this article was born.
Myths and Misconceptions#
- As long as I learn Python, I can easily write a web scraper.
- As long as I can write code, I can effortlessly scrape the data I want.
- Learning one set of web scraping techniques means I can scrape anything I want.
Myth-Busting Section#
- A web scraper is a program, and programs are written by people; if you don't understand the retrieval process yourself, the program won't somehow execute it correctly for you.
- Web scraping merely automates a data-retrieval process, and it only works if you understand that process.
- Before writing a web scraper, you must be able to retrieve the data manually and understand the entire flow.
Demonstration of Writing a Web Scraping Program - Using the MLB Website as an Example#
Target Data: 10 years of game data
Source: https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/717664/final/box
Data to be scraped:
WP:Alvarado.
HBP:Harris II, M (by Walker, T); Riley, A (by Walker, T).
Pitches-strikes:Morton, C 104-64; Lee, D 8-4; Jiménez, J 11-8; Minter 13-9; Iglesias, R 15-10; Yates 14-9; Walker, T 103-53; Bellatti 21-13; Covey 18-14; Alvarado 16-13.
Groundouts-flyouts:Morton, C 2-4; Lee, D 1-0; Jiménez, J 0-0; Minter 1-0; Iglesias, R 0-2; Yates 1-0; Walker, T 3-2; Bellatti 0-2; Covey 5-0; Alvarado 2-0.
Batters faced:Morton, C 27; Lee, D 3; Jiménez, J 2; Minter 3; Iglesias, R 5; Yates 3; Walker, T 26; Bellatti 8; Covey 7; Alvarado 5.
Inherited runners-scored:Bellatti 1-1.
Umpires:HP: Larry Vanover. 1B: Jacob Metz. 2B: Edwin Moscoso. 3B: D.J. Reyburn.
Weather:78 degrees, Sunny.
Wind:4 mph, Out To CF.
First pitch:1:08 PM.
T:3:08.
Att:30,572.
Venue:Citizens Bank Park.
September 11, 2023
Step 1: Identify the Data Source#
Since we want to scrape data from the page, it helps to know that the data shown on a webpage essentially comes from one of two places:
- SSR (server-side rendering) - The backend returns the entire HTML page, data included.
- CSR (client-side rendering) - The frontend calls a backend API and renders the returned data onto the page.
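A quick way to tell which case you are dealing with is to fetch the raw HTML (before any JavaScript runs) and check whether the data appears in it. A minimal sketch, using a value we can see on the page (the User-Agent header is just to avoid being served an error page):

```python
import requests

# Fetch the raw HTML the backend returns, before any JavaScript runs
url = "https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/717664/final/box"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text

# If a value visible in the browser is absent from the raw HTML,
# the page is CSR and the data must come from an API call
print("Citizens Bank Park" in html)  # False would confirm the data is not in the HTML
```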
Here, I will open the browser's developer tools > Network, set the filter to Fetch/XHR, and then refresh the page, checking each request's response one by one. After going through them, we find one request that should be the API we are looking for, because its response contains the data displayed on the page.
Clicking on the request and looking at its Headers tab reveals the API URL.
https://ws.statsapi.mlb.com/api/v1.1/game/717664/feed/live?language=en
We reasonably suspect that 717664 is the game number. To confirm, we can look at the webpage's URL.
https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/717664/final/box
It does look like 717664 is the game number, and the same pattern holds for other games:
https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/716590/final/box
https://ws.statsapi.mlb.com/api/v1.1/game/716590/feed/live?language=en
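In other words, once we know a game number, we can construct the feed URL for that game directly. A small sketch of the pattern:

```python
# Build the live-feed URL for a given game number, following the
# URL pattern observed above
def feed_url(game_pk: int) -> str:
    return f"https://ws.statsapi.mlb.com/api/v1.1/game/{game_pk}/feed/live?language=en"

print(feed_url(717664))  # the Braves vs. Phillies game above
print(feed_url(716590))  # the second game we checked
```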
Step 2: Confirm the Data Source Contains the Required Data#
In the request's Preview tab we can right-click > Copy object, or open the Response tab, select all, and copy.
For simplicity, we can paste it into an online JSON viewer (jsoneditoronline, json.parser) for inspection and use Ctrl + F to search for keywords. We find that the info section of the API's JSON response contains the data we want to scrape. Bingo!
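If you would rather check programmatically than eyeball the JSON, a few lines of Python can do the same search. A minimal sketch, assuming (as observed while browsing the response) that the labeled box-score entries sit under liveData > boxscore > info:

```python
import requests

# Fetch the live feed for one game and print the box-score info entries
url = "https://ws.statsapi.mlb.com/api/v1.1/game/717664/feed/live?language=en"
data = requests.get(url, timeout=30).json()

# Assumption: the labeled entries (Weather, Venue, Att, ...) sit under
# liveData -> boxscore -> info, as seen when browsing the JSON
for entry in data["liveData"]["boxscore"]["info"]:
    print(entry.get("label"), entry.get("value", ""))
```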
Step 3: Use External Tools to Verify API Feasibility#
Here, we need to confirm whether the API requires special authentication or anything else that only the original webpage can provide. We can use Postman to test it; if you have never used this tool, plenty of tutorial articles are a search away.
We confirmed that simply calling the API is enough to obtain the data, so we can proceed to the next step.
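The same check also works without Postman: if a bare request carrying no cookies, tokens, or browser headers returns 200 with the expected JSON, the API is open. A minimal sketch:

```python
import requests

# A bare GET with no cookies, tokens, or special headers; if this returns
# 200 with the expected JSON, the API needs no special authentication
res = requests.get(
    "https://ws.statsapi.mlb.com/api/v1.1/game/717664/feed/live",
    params={"language": "en"},
    timeout=30,
)
print(res.status_code)           # expect 200
print(res.json().get("gamePk"))  # expect 717664
```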
Step 4: How to Continuously Obtain Information from Different Games#
Programs are designed by humans, so to write a web scraper you must understand the entire process yourself; just reading a book or a tutorial is not enough. In this example, the overall logic of the scraper should be:
- Scrape all game numbers for 10 years.
- Use the API above to obtain all game information for the 10 years.
- Store the game information in a variable and then write this information to a CSV file.
How do we collect all the game numbers? We can work backwards from the game URL to find its parent page.
https://www.mlb.com/gameday/braves-vs-phillies/2023/09/11/717664/final/box
Trimming it down to
https://www.mlb.com/gameday/
leads us to
https://www.mlb.com/scores
This is the place. Next, use the method from earlier to find which request retrieves this data.
We find the API:
https://bdfed.stitch.mlbinfra.com/bdfed/transform-mlb-scoreboard?stitch_env=prod&sortTemplate=4&sportId=1&&sportId=51&startDate=2023-09-11&endDate=2023-09-11&gameType=E&&gameType=S&&gameType=R&&gameType=F&&gameType=D&&gameType=L&&gameType=W&&gameType=A&&gameType=C&language=en&leagueId=104&&leagueId=103&&leagueId=160&contextTeamId=
Throwing this API URL into Postman, we can see on the left the parameters that accompany the call. Some of their functions are unclear, so to be safe, don't change them arbitrarily.
However, we can see startDate & endDate among them. We can change these to test whether multiple days of data can be fetched at once; if so, it will speed up our scraping considerably.
Bingo! We can retrieve all game data from 2023-09-01 to 2023-09-11 in a single call, but the call takes a full 8 seconds and the response is very large. In practice, fetch at most about one month at a time; more than that risks a timeout.
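This experiment can be scripted as well. A minimal sketch, with the query parameters copied from the captured URL (only startDate and endDate are changed; the rest are left as captured since their exact function is unclear, and the dates > games structure is assumed from browsing the response):

```python
import requests

url = "https://bdfed.stitch.mlbinfra.com/bdfed/transform-mlb-scoreboard"
# Parameters copied from the captured request; repeated keys (sportId,
# gameType, leagueId) are passed as lists so requests serializes them
# as repeated query parameters, just like the original URL
payload = {
    "stitch_env": "prod",
    "sortTemplate": "4",
    "sportId": ["1", "51"],
    "startDate": "2023-09-01",
    "endDate": "2023-09-11",
    "gameType": ["E", "S", "R", "F", "D", "L", "W", "A", "C"],
    "language": "en",
    "leagueId": ["104", "103", "160"],
    "contextTeamId": "",
}
res = requests.get(url, params=payload, timeout=60)
dates = res.json().get("dates", [])
# Assumption: the response groups games under dates -> games
print(sum(len(d.get("games", [])) for d in dates), "games found")
```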
Step 5: Transform These Processes into Program Logic#
Don't rush to write code here; first translate the process above into program flow and logic, and only then start writing. Below is a simple sketch of how that process turns into a scraping program.
import calendar
import requests

# Declare global variables
# Declare scores_api_url
scores_api_url = "https://bdfed.stitch.mlbinfra.com/bdfed/transform-mlb-scoreboard"
# Declare game_api_url; the game_id placeholder is replaced with an actual game number
game_api_url = "https://ws.statsapi.mlb.com/api/v1.1/game/game_id/feed/live?language=en"
# Declare the first year to scrape (an int, so it works with range())
start_year = 2012
# Declare the last year to scrape
end_year = 2022

# Main program block
def main():
    # Store all game data
    game_data = []
    day_list = get_month()
    # Loop through all months
    for seDay in day_list:
        # Use get_scores_data to fetch all game numbers from the first
        # to the last day of that month
        gameId_list = get_scores_data(seDay[0], seDay[1])
        # Use get_game_data to fetch every game's data and accumulate it
        game_data = game_data + get_game_data(gameId_list)
    # Save game_data to CSV
    ...

# Get the first and last day of each month from start_year to end_year
def get_month() -> list:
    result = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            # monthrange returns the weekday of the first day and the
            # number of days in the month; we only need the day count
            _, days_in_month = calendar.monthrange(year, month)
            # First day
            first_day = f"{year}-{month:02}-01"
            # Last day
            last_day = f"{year}-{month:02}-{days_in_month:02}"
            result.append((first_day, last_day))
    return result

# Function to scrape scores data
def get_scores_data(sDay: str, eDay: str) -> list:
    gameId_list = []
    url = scores_api_url
    # Set the payload; repeated query parameters (sportId, gameType,
    # leagueId) are passed as lists so requests serializes them the
    # same way as the captured URL
    payload = {
        "stitch_env": "prod",
        "sortTemplate": "4",
        "sportId": ["1", "51"],
        "startDate": sDay,
        "endDate": eDay,
        "gameType": ["E", "S", "R", "F", "D", "L", "W", "A", "C"],
        "language": "en",
        "leagueId": ["104", "103", "160"],
        "contextTeamId": "",
    }
    res = get_api(url, payload)
    if res != {}:
        # Loop through the dates in the response
        for date in res.get("dates", []):
            # Loop through the games played on that date
            for game in date.get("games", []):
                gameId_list.append(game.get("gamePk"))
    return gameId_list

# Function to scrape game data
def get_game_data(gameId_list: list) -> list:
    result = []
    # Loop through gameId_list to fetch the data for every gameId
    for gameId in gameId_list:
        # Substitute the actual gameId into the URL
        url = game_api_url.replace("game_id", str(gameId))
        res = get_api(url, {})
        # If the API call returned a value
        if res != {}:
            # Extract the required information from the res dict here
            game_info = ...
            result.append(game_info)
    return result

# Function to call an API
def get_api(url: str, payload: dict) -> dict:
    res = requests.get(url, params=payload)
    if res.status_code == 200:
        return res.json()
    return {}

# Program entry point
if __name__ == "__main__":
    main()
The overall program for scraping ten years of game data looks roughly like this. Some functions are deliberately left unimplemented for you to try. Once it runs, there will still be plenty of room for optimization and problems to hit; below are the likely issues and directions for improvement.
- Responses that are too large may cause the request to time out.
- A sudden burst of API calls may make the server refuse access.
- Holding too much data in memory at once may crash the program.
- Saving the data to CSV is not implemented (a minimal sketch follows below).
- You could persist finished work so the next run doesn't start over from the beginning.
- You could use multiprocessing to speed up the scraping.
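For the CSV item above, here is a minimal sketch. It assumes each entry in game_data is a flat dict with identical keys; the actual field names depend on what you extract in get_game_data:

```python
import csv

def save_to_csv(game_data: list, path: str = "games.csv") -> None:
    # Assumes every item in game_data is a flat dict sharing the same keys;
    # the real field names depend on what get_game_data extracts
    if not game_data:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(game_data[0].keys()))
        writer.writeheader()
        writer.writerows(game_data)
```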