Getting Started with Beautiful Soup
What is Beautiful Soup?
Beautiful Soup is a Python library that allows you to extract the data you want from HTML and XML documents.
Overall flow
1. Use the urllib.request module to retrieve the HTML from a website.
2. Pass the HTML to Beautiful Soup to retrieve a BeautifulSoup object representing the HTML tree structure.
3. Use the methods included in Beautiful Soup to filter the BeautifulSoup object and retrieve only the information you are after.
Installing
To install Beautiful Soup using pip or conda:
pip install beautifulsoup4
conda install beautifulsoup4
Importing
To import Beautiful Soup:
from bs4 import BeautifulSoup
Terminology
Different Objects
There are 4 types of objects in Beautiful Soup:
Tag
The term tag is often used interchangeably with element.
For example, <p>Hello World</p> is a <p> tag.
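As a quick sketch (using the html.parser backend), accessing an element of a parsed document returns a Tag object:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello World</p>", "html.parser")
tag = soup.p
print(type(tag))   # <class 'bs4.element.Tag'>
print(tag)         # <p>Hello World</p>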
NavigableString
The text within a tag. A NavigableString is similar to a Python string; however, it also supports some features specific to Beautiful Soup. You can convert a NavigableString to a normal Python string using the built-in str() function.
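For instance, here is a minimal sketch showing that the text of a tag comes back as a NavigableString and can be converted to a plain string:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello World</p>", "html.parser")
text = soup.p.string
print(type(text))   # <class 'bs4.element.NavigableString'>
print(str(text))    # Hello World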
BeautifulSoup
A BeautifulSoup object represents the whole parsed document (you can think of it as a collection of all the tags in an HTML document). It behaves similarly to a Tag; however, unlike a Tag, it does not have any attributes.
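As a small sketch, you can confirm the type of the parsed document and its special name:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello World</p>", "html.parser")
print(type(soup))   # <class 'bs4.BeautifulSoup'>
print(soup.name)    # [document]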
Comments
A Comment is a special type of NavigableString which is displayed with special formatting. A comment is indicated by the <!--comment--> syntax.
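Here is a minimal sketch showing that the text inside a comment is parsed as a Comment object:

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p><!--my comment--></p>", "html.parser")
comment = soup.p.string
print(type(comment))   # <class 'bs4.element.Comment'>
print(comment)         # my comment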
Other terminology
Name
Every tag has a name, which can be accessed using .name:
html = """<p>Hello World</p>"""
soup = BeautifulSoup(html, 'html.parser')
tag = soup.p
tag.name
'p'
Attributes
A tag can have multiple attributes. For example, the tag <div id="people"> has an attribute "id" with value "people". You can access the value of an attribute using the tag["attribute"] syntax.
my_html = """<div id="people"> <div id="profile"> <p>Alex</p> </div></div>"""tag = BeautifulSoup(my_html).divtag["id"]
'people'
Parent
The Tag.parent property returns the parent of the particular tag:
my_html = """<div id="people"> <div id="profile"> <p>Alex</p> </div></div>"""soup = BeautifulSoup(my_html)p_tag = soup.find("p")print(p_tag.parent)
<div id="profile"><p>Alex</p></div>
Notice how the parent tag printed includes the child tag <p>Alex</p>.
Children
The Tag.children property in Beautiful Soup returns a generator used to iterate over the immediate child elements of a Tag. To iterate over the children of the <div> tag:
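(The original snippet for this step was not included; the following minimal sketch reuses the my_html document from above.)

from bs4 import BeautifulSoup

my_html = """<div id="people"> <div id="profile"> <p>Alex</p> </div></div>"""
soup = BeautifulSoup(my_html, "html.parser")

# Iterate over the immediate children of the outer <div>
# (here a whitespace string and the inner <div id="profile">)
for child in soup.div.children:
    print(child)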
Sibling
Tags that share the same parent are known as siblings. We can navigate between siblings using the .next_sibling and .previous_sibling properties.
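For example, here is a small sketch (the HTML snippet is assumed for illustration):

from bs4 import BeautifulSoup

html = """<div><p id="first">Alex</p><p id="second">Bob</p></div>"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p", id="first")
print(first_p.next_sibling)                   # <p id="second">Bob</p>
print(first_p.next_sibling.previous_sibling)  # <p id="first">Alex</p>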
Web Scraping Workflow
Inspect the Website
Right-click on the element you want to inspect and select "Inspect" to look at its HTML code.
Parse HTML
Now that we know the structure of the HTML we are trying to retrieve data from, we can start to parse it. We will use Beautiful Soup to parse the page and search for specific elements.
To connect to the website and get the HTML, we will use urllib, which is part of the Python standard library.
URL of the website to retrieve data from:
url = "url_we_want_to_retrieve_data_from"
Connect to the website using urllib:
import urllib.request

try:
    page = urllib.request.urlopen(url)
except:
    print("Error")
We next pass the page object to Beautiful Soup:
soup = BeautifulSoup(page, 'html.parser')
In layman's terms, a parser checks whether the input conforms to a particular language. For example, an HTML parser checks whether the input is valid HTML, which lets us know that we have properly structured HTML data to work with (or not, if there are any errors).
Extracting information
Find specific tags
To find the tag we are interested in, we can search in many different ways, such as by using the find(), find_all(), and select() methods.
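The sketch below illustrates these search methods on an assumed HTML snippet:

from bs4 import BeautifulSoup

html = """<div><p class="intro">Hello</p><p>World</p></div>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p"))                  # first <p> tag
print(soup.find_all("p"))              # list of all <p> tags
print(soup.find("p", class_="intro"))  # first <p> with class="intro"
print(soup.select("div > p"))          # tags matching a CSS selector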
Getting text from the tag
Once we have identified the tag we are interested in, we can use the get_text() method to extract the text within the tag. To extract the text contained within the <b> tag:
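(The original snippet was not shown; the HTML below is assumed for illustration.)

from bs4 import BeautifulSoup

html = """<p>This is <b>important</b> text</p>"""
soup = BeautifulSoup(html, "html.parser")

b_tag = soup.find("b")
print(b_tag.get_text())   # important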
Web Scraping Example
Let us say we are interested in extracting the following paragraph from the Python Wikipedia page:
Inspect the website
We can right-click the highlighted paragraph and click "Inspect" to understand which tag contains the information we are after:
Parse HTML
To connect to the website using urllib and store the HTML of the page in a BeautifulSoup object:
# Importing required libraries
from bs4 import BeautifulSoup
import urllib.request

# URL of the page we would like to scrape from
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

try:
    page = urllib.request.urlopen(url)
except:
    print("Error opening the URL")

# Create a BeautifulSoup object to store the data
soup = BeautifulSoup(page, 'html.parser')

# Print the BeautifulSoup object and check the type of the object
print(soup)
print(type(soup))
<!DOCTYPE html><html class="client-nojs" dir="ltr" lang="en"><head><meta charset="utf-8"/><title>Python (programming language) - Wikipedia</title><script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"a75a29f4-8bc1-435e-b5c0-4e1962eaf0a3","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python_(programming_language)","wgTitle":"Python (programming language)","wgCurRevisionId":1031165903,"wgRevisionId":1031165903,"wgArticleId":23862,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with dead external links","Articles with dead external links from June 2021","Articles with short description","Short description is different from Wikidata","Use dmy dates from August 2020","Articles containing potentially dated statements from March 2021","All articles containing potentially dated statements","Articles containing potentially dated statements from February 2021","Pages using Sister project links with wikidata namespace mismatch","Pages using Sister project links with hidden wikidata","Wikipedia articles with GND identifiers","Wikipedia articles with BNF identifiers","Wikipedia articles with LCCN identifiers","Wikipedia articles with FAST identifiers","Wikipedia articles with MA identifiers","Wikipedia articles with SUDOC identifiers","Articles with example Python (programming language) code","Good articles","Python (programming language)","Class-based programming languages","Computational notebook","Computer science in the Netherlands","Cross-platform free software","Cross-platform software","Dutch inventions","Dynamically typed programming languages","Educational programming languages","High-level programming languages",...<class 'bs4.BeautifulSoup'>
Extracting information
We know from the first step of inspecting the website that the paragraph of interest is the third child of the <div> tag with class="mw-parser-output".
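The original code for this step was not included; the following is a sketch of how the extraction could look, assuming the structure described above (the paragraph index may need adjusting if Wikipedia's markup changes):

# Find the <div> that wraps the article body
content_div = soup.find("div", class_="mw-parser-output")

# Collect its <p> tags and print the paragraph we are after
# (the index here is an assumption based on the inspection above)
paragraphs = content_div.find_all("p")
print(paragraphs[2].get_text())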
Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.[31]