Web scraping with Python and BeautifulSoup

BeautifulSoup is a Python library that makes web scraping straightforward. You can install it through pip: run pip install beautifulsoup4 (the bs4 package on PyPI is just a thin wrapper around it), ideally inside a virtual environment.
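A minimal setup sketch, assuming a Unix-like shell (the environment name scraper-env is just an example):

```shell
# Optional but recommended: create and activate a virtual environment first.
python3 -m venv scraper-env
source scraper-env/bin/activate

# "beautifulsoup4" is the real package; "bs4" on PyPI is a wrapper around it.
pip install beautifulsoup4
```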

Let's start by creating a Python file called crawler.py. By the way, we are going to be using Python 3 throughout.

In the Python file, we are going to import BeautifulSoup, along with urlopen and the Request class from the standard library, which we will use for web scraping.

from bs4 import BeautifulSoup
from urllib.request import urlopen, Request

We are going to use the Request class to set the 'User-Agent' header, because some websites won't let you crawl them with Python's default user agent.

req = Request('your-url', headers={'User-Agent': 'Mozilla/5.0'})  # pretend to be a browser
the_page = urlopen(req)                               # open the URL
page_html = the_page.read()                           # read the raw HTML
page_soup = BeautifulSoup(page_html, 'html.parser')   # parse it with the built-in parser

We use the urlopen function to open the URL and the read method to get the page's HTML. From there, BeautifulSoup provides a whole range of methods to extract the data you need from the page. To see them all, type dir(BeautifulSoup) in the Python console and it will list the methods you can use.
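To see the parser in action without fetching a live page, here is a self-contained sketch that feeds BeautifulSoup a small HTML string (the snippet and its contents are made up for illustration):

```python
from bs4 import BeautifulSoup

# A made-up HTML document standing in for a real page's HTML.
html = """
<html>
  <head><title>Demo Page</title></head>
  <body>
    <h1>Welcome</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

print(soup.title.get_text())                      # the text inside <title>
print(soup.find('p', class_='intro').get_text())  # the first <p class="intro">
```

Parsing a string this way behaves exactly like parsing the bytes returned by read(), so it is a handy way to experiment.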

You can use the find_all() method to find all occurrences of the tag you pass as an argument. To get all the h2 tags, write:

the_h2_tags = page_soup.find_all('h2')
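For example, against a hypothetical page containing three h2 headings (the markup below is invented for illustration), find_all() returns a list of tag objects you can loop over:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with several h2 tags, for illustration only.
html = "<h2>One</h2><h2>Two</h2><h2>Three</h2>"
page_soup = BeautifulSoup(html, 'html.parser')

the_h2_tags = page_soup.find_all('h2')
print(len(the_h2_tags))                         # how many h2 tags were found
print([tag.get_text() for tag in the_h2_tags])  # the text inside each one
```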

You can use the get_text() method to get the text inside a particular tag. Lastly, you can use the find(tag_name) method to get the first occurrence of a particular tag. That is all for this introduction to web scraping with BeautifulSoup; you can find a lot more information in the BeautifulSoup documentation online.
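A short sketch of the difference between find() and find_all(), again on made-up markup:

```python
from bs4 import BeautifulSoup

html = "<p>first</p><p>second</p>"
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')       # only the first occurrence (a single tag)
all_p = soup.find_all('p')     # every occurrence (a list of tags)

print(first_p.get_text())      # text of the first <p>
print(len(all_p))              # number of <p> tags found
```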

Thank you very much, and feel free to comment. This is my first article, and I welcome contributions and suggestions.
