Wednesday, July 14, 2021

Web Scraping

What is Web Scraping? It's just scraping the web for data haha ;)

Sometimes the data we need exists only as web pages. A lot of data isn't available through downloadable datasets or APIs, but it is sitting right there on web pages.

Web scraping is one way to access that data without using an API. We don't have to wait for an API provider to build one.

So how does Web Scraping work?

Web scraping loads a web page into Python so we can extract the information we want and then work with it using Python analysis tools such as pandas and NumPy.

The Python requests library lets us download a web page. The Python BeautifulSoup library (bs4) is then used to extract the relevant parts of the page, using HTML tags and CSS selectors.

Before we do the web scraping, it is important to understand the structure of the web page and find a way to extract parts of that structure.

Web Page Structure

Web pages use HTML. When a web browser like Chrome or Firefox downloads a web page, it reads the HTML to determine how to render and display the page's contents.

Downloading the page
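
Here is a minimal sketch of the download step with requests. The URL below is just a simple example page; any page you're allowed to scrape works the same way:

import requests

# example URL for a simple static page (swap in whatever page you want to scrape)
url = "http://dataquestio.github.io/web-scraping-pages/simple.html"

# requests.get() fetches the page and returns a Response object
response = requests.get(url)

# response.content holds the raw bytes of the page body
content = response.content
print(content)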

Running the download gives us a Response object; its content attribute holds the raw bytes of the page:

b'<!DOCTYPE html>\n<html>\n <head>\n <title>A simple example page</title>\n </head>\n <body>\n <p>Here is some simple content for this page.</p>\n </body>\n</html>'

Formatted nicely, the same HTML looks like this:

<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>
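
Next, we parse the downloaded content. Here is a sketch of that step, using BeautifulSoup and navigating the tree with tag names as attributes (the variable names are just illustrative):

from bs4 import BeautifulSoup

# initialize the parser on the bytes we downloaded above
parser = BeautifulSoup(content, "html.parser")

# navigate the tree by using tag names as attributes
body = parser.body        # the <body> tag
p = body.p                # the first <p> tag inside <body>
print(p.text)             # the text inside the paragraph

head = parser.head        # the <head> tag
title = head.title        # the <title> tag
title_text = title.text
print(title_text)         # the text inside the title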

Output:

Here is some simple content for this page.
A simple example page

BeautifulSoup turns the page into a tree of Tag objects (head, title, body, p, and so on), and each Tag's text attribute gives us the text inside that element.


Using the find_all method

Using the tag type as an attribute, as we did above, isn't always the best way to parse a document. It's usually better to use the find_all() method, which returns a list of all matching elements.

If we only want the first occurrence of an item, we'll need to index the list to get it. Aside from this difference, the process is the same as passing in the tag type as an attribute.
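
Here is a sketch of the same extraction done with find_all(); note the [0] indexing to pull the first match out of each list (variable names are illustrative):

# find_all() returns a list of every matching tag
body = parser.find_all("body")
print(body)

p = body[0].find_all("p")
print(p)
print(p[0].text)

head = parser.find_all("head")
print(head)

title = head[0].find_all("title")
print(title)
print(title[0].text)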

Output:

[<body> <p>Here is some simple content for this page.</p> </body>]
[<p>Here is some simple content for this page.</p>]
Here is some simple content for this page.
[<head> <title>A simple example page</title> </head>]
[<title>A simple example page</title>]
A simple example page


Element IDs

HTML allows elements to have IDs, and IDs are unique within a page. With a unique ID, we can reach the exact element associated with it.

HTML uses the div tag to create a divider that splits the page into logical units. We can think of a divider as a "box" that contains content. For example, different dividers might hold a web page's footer, sidebar, and horizontal menu.

Here is an example page:

<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p id="first">
                First paragraph.
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph.
            </b>
        </p>
    </body>
</html>


Rendered in a browser, the page looks like this:

First paragraph.

Second paragraph.

There are two paragraphs on this page. The first one is nested inside a div tag. Luckily, both paragraphs have IDs, so we can access them easily even though they are nested.

We can use find_all() with the additional id argument to get exactly the element we want.
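
Here is a sketch of selecting the paragraphs by their IDs (this assumes parser was built from the example page above):

# pass id= to find_all() to match only elements with that ID,
# then index with [0] to get the single matching Tag
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)

second_paragraph = parser.find_all("p", id="second")[0]
print(second_paragraph.text)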

Output:

First paragraph.
Second paragraph.

Without indexing into the result with [0], we get the error below, because find_all() returns a ResultSet rather than a single Tag:


Output:

AttributeError                            Traceback (most recent call last)
<ipython-input-1-799c88851397> in <module>()
     15
     16 second_paragraph = parser.find_all("p", id="second")
---> 17 second_paragraph_text = second_paragraph.text
     18 print(second_paragraph_text)

AttributeError: 'ResultSet' object has no attribute 'text'

Element classes

In HTML, elements can also have classes. Instances of classes aren't necessarily unique. Many different elements can belong to the same class, usually because they share a common purpose.

For example, suppose we want three dividers to display three paragraphs, and we want all three dividers to share the same look and feel. That is when we use a class.

We can create one class and apply it to all three dividers that display the paragraphs. A single element can also have multiple classes.




The page looks like this:

First paragraph.

Second paragraph.

First outer paragraph.

Second outer paragraph.

Select Elements by class

To select elements by class, we just need to pass the class_ parameter to the find_all() method (class_ with a trailing underscore, because class is a reserved word in Python).
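
Here is a sketch of selecting paragraphs by class. The class names inner-text and outer-text are taken from the CSS selector examples later in this post, so treat them as assumptions about the example page:

# class is a reserved word in Python, so BeautifulSoup uses class_ instead
first_inner_paragraph = parser.find_all("p", class_="inner-text")[0]
print(first_inner_paragraph.text)

second_inner_paragraph = parser.find_all("p", class_="inner-text")[1]
print(second_inner_paragraph.text)

first_outer_paragraph = parser.find_all("p", class_="outer-text")[0]
print(first_outer_paragraph.text)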

Output:

First paragraph.
Second paragraph.
First outer paragraph.


CSS Selectors

First, what is CSS? Cascading Style Sheets (CSS) is a language for adding styles to an HTML page.

CSS uses selectors to attach styles to elements and classes of elements. Selectors can add styles like background colors, text colors, borders, padding, and so on.

How do CSS selectors work?

p {
    color: red;
}

This selector applies to every p tag, so all of the paragraph text turns red. The page looks like this:

First paragraph.

Second paragraph.

First outer paragraph.

Second outer paragraph.


p.inner-text {
    color: red;
}

This selector applies only to p tags with the class inner-text. The page looks like this:

First paragraph.

Second paragraph.

First outer paragraph.

Second outer paragraph.


p#first {
    color: red;
}

This selector applies only to the p tag with the ID first. The page looks like this:

First paragraph.

Second paragraph.

Using CSS Selectors to select elements when we do web scraping




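The example page for this part combines classes and IDs. Reconstructed from the selector output further below, it looks roughly like this (the exact markup of the second inner paragraph is an assumption):

<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
            <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>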
The above HTML renders like this in a web page:

First paragraph.

Second paragraph.

First outer paragraph.

Second outer paragraph.


In the CSS, a class selector is a name preceded by a full stop (“.”) and an ID selector is a name preceded by a hash character (“#”). The difference between an ID and a class is that an ID can be used to identify one element, whereas a class can be used to identify more than one.
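
Here is a sketch of the corresponding select() calls; select() takes a CSS selector string and returns a list of matching tags:

# select elements by class (note the leading ".")
first_items = parser.select(".first-item")
print(first_items)
print(type(first_items))      # older bs4 versions return a plain list; newer ones return a ResultSet
print(first_items[0].text)

outer_text = parser.select(".outer-text")
print(outer_text)
print(type(outer_text))
print(outer_text[0].text)

# select an element by ID (note the leading "#")
second = parser.select("#second")
print(second)
print(type(second))
print(second[0].text)
print(type(second[0].text))   # the text itself is a plain string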


Output:

[<p class="inner-text first-item" id="first"> First paragraph. </p>, <p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>]
<class 'list'>
First paragraph.
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>, <p class="outer-text"> <b> Second outer paragraph. </b> </p>]
<class 'list'>
First outer paragraph.
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>]
<class 'list'>
First outer paragraph.
<class 'str'>


Nesting CSS Selectors

div p - This selector targets any p tag inside a div tag.
div .first-item - This selector targets any element with the class first-item that is inside a div tag.
body div #first - This selector targets any element with the ID first that is inside a div tag, which is itself inside a body tag.
.first-item #first - This selector targets any element with the ID first that is inside an element with the class first-item.




We will scrape the team statistics from the 2014 Super Bowl, which live on this page:

http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html
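
Here is a sketch of pulling a few numbers out of the statistics table with nested selectors. The row IDs (turnovers, total-plays, total-yards) and the td cell positions are assumptions about how that page is laid out, so check the page source before relying on them:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
parser = BeautifulSoup(response.content, "html.parser")

# assumed structure: each statistic lives in a table row with an id,
# and the per-team values sit in <td> cells inside that row
turnovers = parser.select("#turnovers")[0].select("td")[1].text
total_plays = parser.select("#total-plays")[0].select("td")[2].text
total_yards = parser.select("#total-yards")[0].select("td")[1].text

print(turnovers, total_plays, total_yards)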



Output:

1 72 396


Here are the takeaways from this article

  • Technique - web scraping

Skills Acquired:
  • How to load a web page into Python
  • How to extract the information we want (retrieving elements from a web page)
  • Working with the Python libraries pandas, NumPy, BeautifulSoup, and requests
Tools Used:
  • pandas
  • numpy
  • BeautifulSoup (bs4) - select() for CSS selectors, find_all() for HTML elements
  • requests
  • HTML - tags, ids, classes
  • CSS Selectors 

Conclusion:

Web scraping is most useful when you need to gather a lot of information from many web pages quickly. 

For example, suppose we wanted to find the total number of yards each NFL team gained in every single NFL game over an entire season. We could do this manually, but it would take days. Instead, we could write a script to automate it in a couple of hours, and have a lot more fun doing it.

Thank you!























