Wednesday, July 14, 2021

Web Scraping

What is Web Scraping? It's just scraping the web for data haha ;)

Sometimes the data we need exists only as web pages. A lot of data isn't available through downloadable datasets or APIs, but it is sitting right there on web pages.

Web scraping is one way to access that data without using an API. We don't have to wait for an API provider to build one.

So how does Web Scraping work?

Web scraping loads a web page into Python so we can extract the information we want and then work with it using Python analysis tools such as pandas and NumPy.

The Python requests library lets us download a web page. The Python BeautifulSoup library (bs4) is then used to extract the relevant parts of the page, using HTML tags and CSS selectors.

Before we do the web scraping, it is important to understand the structure of the web page and find a way to extract parts of that structure.

Web Page Structure

Web pages use HTML. When a web browser like Chrome or Firefox downloads a web page, it reads the HTML to determine how to render and display the page's contents.

Downloading the page
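
Here is a minimal sketch of the download step with requests. The URL below is just a simple example page; any page you're allowed to scrape works the same way:

import requests

# example URL for a simple static page (swap in whatever page you want to scrape)
url = "http://dataquestio.github.io/web-scraping-pages/simple.html"

# requests.get() fetches the page and returns a Response object
response = requests.get(url)

# response.content holds the raw bytes of the page body
content = response.content
print(content)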

Running the download gives us a Response object; its content attribute holds the raw bytes of the page:

b'<!DOCTYPE html>\n<html>\n <head>\n <title>A simple example page</title>\n </head>\n <body>\n <p>Here is some simple content for this page.</p>\n </body>\n</html>'

Formatted nicely, the same HTML looks like this:

<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>
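
Next, we parse the downloaded content. Here is a sketch of that step, using BeautifulSoup and navigating the tree with tag names as attributes (the variable names are just illustrative):

from bs4 import BeautifulSoup

# initialize the parser on the bytes we downloaded above
parser = BeautifulSoup(content, "html.parser")

# navigate the tree by using tag names as attributes
body = parser.body        # the <body> tag
p = body.p                # the first <p> tag inside <body>
print(p.text)             # the text inside the paragraph

head = parser.head        # the <head> tag
title = head.title        # the <title> tag
title_text = title.text
print(title_text)         # the text inside the title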

Output:

Here is some simple content for this page.
A simple example page

BeautifulSoup turns the page into a tree of Tag objects (head, title, body, p, and so on), and each Tag's text attribute gives us the text inside that element.


Using the find_all method

Using the tag type as an attribute, as we did above, isn't always the best way to parse a document. It's usually better to use the find_all() method, which returns a list of all matching elements.

If we only want the first occurrence of an item, we'll need to index the list to get it. Aside from this difference, the process is the same as passing in the tag type as an attribute.
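
Here is a sketch of the same extraction done with find_all(); note the [0] indexing to pull the first match out of each list (variable names are illustrative):

# find_all() returns a list of every matching tag
body = parser.find_all("body")
print(body)

p = body[0].find_all("p")
print(p)
print(p[0].text)

head = parser.find_all("head")
print(head)

title = head[0].find_all("title")
print(title)
print(title[0].text)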

Output:

[<body> <p>Here is some simple content for this page.</p> </body>]
[<p>Here is some simple content for this page.</p>]
Here is some simple content for this page.
[<head> <title>A simple example page</title> </head>]
[<title>A simple example page</title>]
A simple example page


Element IDs

HTML allows elements to have IDs, and IDs are unique within a page. With a unique ID, we can reach the exact element associated with it.

HTML uses the div tag to create a divider that splits the page into logical units. We can think of a divider as a "box" that contains content. For example, different dividers might hold a web page's footer, sidebar, and horizontal menu.

Here is an example page:

<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p id="first">
                First paragraph.
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph.
            </b>
        </p>
    </body>
</html>


Rendered in a browser, the page looks like this:

First paragraph.

Second paragraph.

There are two paragraphs on this page. The first one is nested inside a div tag. Luckily, both paragraphs have IDs, so we can access them easily even though they are nested.

We can use find_all() with the additional id argument to get exactly the element we want.
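
Here is a sketch of selecting the paragraphs by their IDs (this assumes parser was built from the example page above):

# pass id= to find_all() to match only elements with that ID,
# then index with [0] to get the single matching Tag
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)

second_paragraph = parser.find_all("p", id="second")[0]
print(second_paragraph.text)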

Output:

First paragraph.
Second paragraph.

Without indexing into the result with [0], we get the error below, because find_all() returns a ResultSet rather than a single Tag:


Output:

AttributeError                            Traceback (most recent call last)
<ipython-input-1-799c88851397> in <module>()
     15
     16 second_paragraph = parser.find_all("p", id="second")
---> 17 second_paragraph_text = second_paragraph.text
     18 print(second_paragraph_text)

AttributeError: 'ResultSet' object has no attribute 'text'

Element classes

In HTML, elements can also have classes. Instances of classes aren't necessarily unique. Many different elements can belong to the same class, usually because they share a common purpose.

For example, suppose we want three dividers to display three paragraphs, and we want all three dividers to share the same look and feel. That is when we use a class.

We can create one class and apply it to all three dividers that display the paragraphs. A single element can also have multiple classes.




The page looks like this:

First paragraph.

Second paragraph.

First outer paragraph.

Second outer paragraph.

Select Elements by class

To select elements by class, we just need to pass the class_ parameter to the find_all() method (class_ with a trailing underscore, because class is a reserved word in Python).
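
Here is a sketch of selecting paragraphs by class. The class names inner-text and outer-text are taken from the CSS selector examples later in this post, so treat them as assumptions about the example page:

# class is a reserved word in Python, so BeautifulSoup uses class_ instead
first_inner_paragraph = parser.find_all("p", class_="inner-text")[0]
print(first_inner_paragraph.text)

second_inner_paragraph = parser.find_all("p", class_="inner-text")[1]
print(second_inner_paragraph.text)

first_outer_paragraph = parser.find_all("p", class_="outer-text")[0]
print(first_outer_paragraph.text)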

Output:

First paragraph.
Second paragraph.
First outer paragraph.


CSS Selectors

First, what is CSS? Cascading Style Sheets (CSS) is a language for adding styles to an HTML page.

CSS uses selectors to attach styles to elements and classes of elements. Selectors can add styles like background colors, text colors, borders, padding, and so on.

How do CSS selectors work?

p {
    color: red;
}

This selector applies to every p tag, so all of the paragraph text turns red. The page looks like this:

First paragraph.

Second paragraph.

First outer paragraph.

Second outer paragraph.


p.inner-text {
    color: red;
}

This selector applies only to p tags with the class inner-text. The page looks like this:

First paragraph.

Second paragraph.

First outer paragraph.

Second outer paragraph.


p#first {
    color: red;
}

This selector applies only to the p tag with the ID first. The page looks like this:

First paragraph.

Second paragraph.

Using CSS Selectors to select elements when we do web scraping




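The example page for this part combines classes and IDs. Reconstructed from the selector output further below, it looks roughly like this (the exact markup of the second inner paragraph is an assumption):

<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
            <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>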
The above HTML renders like this in a web page:

First paragraph.

Second paragraph.

First outer paragraph.

Second outer paragraph.


In the CSS, a class selector is a name preceded by a full stop (“.”) and an ID selector is a name preceded by a hash character (“#”). The difference between an ID and a class is that an ID can be used to identify one element, whereas a class can be used to identify more than one.
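
Here is a sketch of the corresponding select() calls; select() takes a CSS selector string and returns a list of matching tags:

# select elements by class (note the leading ".")
first_items = parser.select(".first-item")
print(first_items)
print(type(first_items))      # older bs4 versions return a plain list; newer ones return a ResultSet
print(first_items[0].text)

outer_text = parser.select(".outer-text")
print(outer_text)
print(type(outer_text))
print(outer_text[0].text)

# select an element by ID (note the leading "#")
second = parser.select("#second")
print(second)
print(type(second))
print(second[0].text)
print(type(second[0].text))   # the text itself is a plain string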


Output:

[<p class="inner-text first-item" id="first"> First paragraph. </p>, <p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>]
<class 'list'>
First paragraph.
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>, <p class="outer-text"> <b> Second outer paragraph. </b> </p>]
<class 'list'>
First outer paragraph.
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>]
<class 'list'>
First outer paragraph.
<class 'str'>


Nesting CSS Selectors

div p - This selector targets any p tag inside a div tag.
div .first-item - This selector targets any element with the class first-item that is inside a div tag.
body div #first - This selector targets any element with the ID first that is inside a div tag, which is itself inside a body tag.
.first-item #first - This selector targets any element with the ID first that is inside an element with the class first-item.




We will scrape the team statistics from the 2014 Super Bowl, which live on this page:

http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html
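
Here is a sketch of pulling a few numbers out of the statistics table with nested selectors. The row IDs (turnovers, total-plays, total-yards) and the td cell positions are assumptions about how that page is laid out, so check the page source before relying on them:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
parser = BeautifulSoup(response.content, "html.parser")

# assumed structure: each statistic lives in a table row with an id,
# and the per-team values sit in <td> cells inside that row
turnovers = parser.select("#turnovers")[0].select("td")[1].text
total_plays = parser.select("#total-plays")[0].select("td")[2].text
total_yards = parser.select("#total-yards")[0].select("td")[1].text

print(turnovers, total_plays, total_yards)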



Output:

1 72 396


Here are the takeaways from this article

  • Technique - web scraping

Skills Acquired:
  • How to load a web page into Python
  • How to extract the information we want (retrieving elements from a web page)
  • Working with the Python libraries pandas, NumPy, BeautifulSoup, and requests
Tools Used:
  • pandas
  • numpy
  • BeautifulSoup (bs4) - select() for CSS selectors, find_all() for HTML elements
  • requests
  • HTML - tags, ids, classes
  • CSS Selectors 

Conclusion:

Web scraping is most useful when you need to gather a lot of information from many web pages quickly. 

For example, suppose we wanted to find the total number of yards each NFL team gained in every single NFL game over an entire season. We could do this manually, but it would take days. Instead, we could write a script to automate it in a couple of hours, and have a lot more fun doing it.

Thank you!























