Getting Started with Beautiful Soup
What is Beautiful Soup?
Beautiful Soup is a Python library that allows you to extract the data you want from HTML and XML documents.
Overall flow
1. Use the urllib.request module to retrieve the HTML from a website.
2. Pass the HTML to Beautiful Soup to retrieve a BeautifulSoup object representing the HTML tree structure.
3. Use the methods included in Beautiful Soup to filter the BeautifulSoup object and retrieve only the information you are after.
Installing
To install Beautiful Soup using pip or conda:
pip install beautifulsoup4
conda install beautifulsoup4
Importing
To import Beautiful Soup:
from bs4 import BeautifulSoup
Terminology
Different Objects
There are 4 types of objects in Beautiful Soup:
Tag
The term tag is often used interchangeably with element.
For example, <p>Hello World</p> is a <p> tag.
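As a quick sketch (using the html.parser backend), accessing an element of a parsed document returns a Tag object:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello World</p>", "html.parser")
tag = soup.p
print(type(tag))   # <class 'bs4.element.Tag'>
print(tag)         # <p>Hello World</p>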
NavigableString
The text within a tag. A NavigableString is similar to a Python string; however, it also supports some features specific to Beautiful Soup. You can convert a NavigableString to a normal Python string using the built-in str() function.
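For instance, here is a minimal sketch showing that the text of a tag comes back as a NavigableString and can be converted to a plain string:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello World</p>", "html.parser")
text = soup.p.string
print(type(text))   # <class 'bs4.element.NavigableString'>
print(str(text))    # Hello World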
BeautifulSoup
A BeautifulSoup object represents the whole parsed document (you can think of it as a collection of all the tags in an HTML document). It behaves similarly to a Tag; however, unlike a Tag, it does not have any attributes.
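As a small sketch, you can confirm the type of the parsed document and its special name:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello World</p>", "html.parser")
print(type(soup))   # <class 'bs4.BeautifulSoup'>
print(soup.name)    # [document]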
Comments
A Comment is a special type of NavigableString which is displayed with special formatting. A comment is indicated by the <!--comment--> syntax.
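Here is a minimal sketch showing that the text inside a comment is parsed as a Comment object:

from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p><!--my comment--></p>", "html.parser")
comment = soup.p.string
print(type(comment))   # <class 'bs4.element.Comment'>
print(comment)         # my comment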
Other terminology
Name
Every tag has a name, which can be accessed using .name:
html = """<p>Hello World</p>"""
soup = BeautifulSoup(html, 'html.parser')
tag = soup.p
tag.name
'p'
Attributes
A tag can have multiple attributes. For example, the tag <div id="people"> has an attribute "id" with value "people". You can access the value of an attribute using the tag["attribute"] syntax.
my_html = """<div id="people"> <div id="profile"> <p>Alex</p> </div></div>"""tag = BeautifulSoup(my_html).divtag["id"]
'people'
Parent
The Tag.parent property returns the parent of the particular tag:
my_html = """<div id="people"> <div id="profile"> <p>Alex</p> </div></div>"""soup = BeautifulSoup(my_html)p_tag = soup.find("p")print(p_tag.parent)
<div id="profile"><p>Alex</p></div>
Notice how the parent tag printed includes the child tag <p>Alex</p>.
Children
The Tag.children property in Beautiful Soup returns a generator used to iterate over the immediate child elements of a Tag. To iterate over the children of the <div> tag:
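(The original snippet for this step was not included; the following minimal sketch reuses the my_html document from above.)

from bs4 import BeautifulSoup

my_html = """<div id="people"> <div id="profile"> <p>Alex</p> </div></div>"""
soup = BeautifulSoup(my_html, "html.parser")

# Iterate over the immediate children of the outer <div>
# (here a whitespace string and the inner <div id="profile">)
for child in soup.div.children:
    print(child)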
Sibling
Tags that share the same parent are known as siblings. We can navigate between siblings using the .next_sibling and .previous_sibling properties.
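For example, here is a small sketch (the HTML snippet is assumed for illustration):

from bs4 import BeautifulSoup

html = """<div><p id="first">Alex</p><p id="second">Bob</p></div>"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p", id="first")
print(first_p.next_sibling)                   # <p id="second">Bob</p>
print(first_p.next_sibling.previous_sibling)  # <p id="first">Alex</p>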
Web Scraping Workflow
Inspect the Website
Right-click on the element you want to inspect and select "Inspect" to look at its HTML code.
Parse HTML
Now that we know the structure of the HTML we are trying to retrieve data from, we can start to parse it. We will use Beautiful Soup to parse the page and search for specific elements.
To connect to the website and get the HTML, we will use urllib, which is part of the Python standard library.
URL of the website to retrieve data from:
url = "url_we_want_to_retrieve_data_from"
Connect to the website using urllib:
import urllib.request

try:
    page = urllib.request.urlopen(url)
except:
    print("Error")
We next pass the page object to Beautiful Soup:
soup = BeautifulSoup(page, 'html.parser')
In layman's terms, a parser checks whether the input conforms to a particular language. For example, an HTML parser checks whether the input is valid HTML, which lets us know that we have properly structured HTML data to work with (or not, if there are any errors).
Extracting information
Find specific tags
To find the tag we are interested in, we can search in many different ways, such as by using the find(), find_all(), and select() methods.
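The sketch below illustrates these search methods on an assumed HTML snippet:

from bs4 import BeautifulSoup

html = """<div><p class="intro">Hello</p><p>World</p></div>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p"))                  # first <p> tag
print(soup.find_all("p"))              # list of all <p> tags
print(soup.find("p", class_="intro"))  # first <p> with class="intro"
print(soup.select("div > p"))          # tags matching a CSS selector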
Getting text from the tag
Once we have identified the tag we are interested in, we can use the get_text() method to extract the text within the tag. To extract the text contained within the <b> tag:
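(The original snippet was not shown; the HTML below is assumed for illustration.)

from bs4 import BeautifulSoup

html = """<p>This is <b>important</b> text</p>"""
soup = BeautifulSoup(html, "html.parser")

b_tag = soup.find("b")
print(b_tag.get_text())   # important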
Web Scraping Example
Let us say we are interested in extracting the following paragraph from the Python Wikipedia page:
Inspect the website
We can right-click the highlighted paragraph and click "Inspect" to understand which tag contains the information we are after:
Parse HTML
To connect to the website using urllib and store the HTML of the page in a BeautifulSoup object:
# Importing required libraries
from bs4 import BeautifulSoup
import urllib.request

# URL of the page we would like to scrape from
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

try:
    page = urllib.request.urlopen(url)
except:
    print("Error opening the URL")

# Create a BeautifulSoup object to store the data
soup = BeautifulSoup(page, 'html.parser')

# Print the BeautifulSoup object and check the type of the object
print(soup)
print(type(soup))
<!DOCTYPE html><html class="client-nojs" dir="ltr" lang="en"><head><meta charset="utf-8"/><title>Python (programming language) - Wikipedia</title><script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"a75a29f4-8bc1-435e-b5c0-4e1962eaf0a3","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Python_(programming_language)","wgTitle":"Python (programming language)","wgCurRevisionId":1031165903,"wgRevisionId":1031165903,"wgArticleId":23862,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with dead external links","Articles with dead external links from June 2021","Articles with short description","Short description is different from Wikidata","Use dmy dates from August 2020","Articles containing potentially dated statements from March 2021","All articles containing potentially dated statements","Articles containing potentially dated statements from February 2021","Pages using Sister project links with wikidata namespace mismatch","Pages using Sister project links with hidden wikidata","Wikipedia articles with GND identifiers","Wikipedia articles with BNF identifiers","Wikipedia articles with LCCN identifiers","Wikipedia articles with FAST identifiers","Wikipedia articles with MA identifiers","Wikipedia articles with SUDOC identifiers","Articles with example Python (programming language) code","Good articles","Python (programming language)","Class-based programming languages","Computational notebook","Computer science in the Netherlands","Cross-platform free software","Cross-platform software","Dutch inventions","Dynamically typed programming languages","Educational programming languages","High-level programming languages",...<class 'bs4.BeautifulSoup'>
Extracting information
We know from the first step of inspecting the website that the paragraph of interest is the third child of the <div> tag with class="mw-parser-output".
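The original code for this step was not included; the following is a sketch of how the extraction could look, assuming the structure described above (the paragraph index may need adjusting if Wikipedia's markup changes):

# Find the <div> that wraps the article body
content_div = soup.find("div", class_="mw-parser-output")

# Collect its <p> tags and print the paragraph we are after
# (the index here is an assumption based on the inspection above)
paragraphs = content_div.find_all("p")
print(paragraphs[2].get_text())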
Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.[31]