Basics of Web Scraping w/ Raspberry Pi & Python [w/ Sample Project]

Overview

In this tutorial, you will learn about web scraping and how to automatically extract your desired information from a site by using a Raspberry Pi. As an example, we read the Tech News page of the Reuters website automatically and send news headlines to an arbitrary email account at a specific time.

What is Web Scraping?

Web scraping means extracting the information you need from a web page using code. Python is one of the best programming languages for web scraping: with its libraries, you can extract and process information from a web page in a few lines of code.

In this tutorial, we use the Beautiful Soup library for web scraping.

Web Scraping with Raspberry Pi & Python

Note

In this tutorial, Python 3 has been used. If you want to use Python 2, you may need to change some commands.

Installing Required Library

You need the three libraries below for web scraping. (You may already have some of them; Re is part of Python's standard library and needs no installation.)

  • Bs4 (Beautiful Soup, installed as beautifulsoup4)
  • Requests
  • Re

You can use the pip command to install the other two libraries. To do this, enter the following commands in the Raspberry Pi terminal one by one:

pip3 install beautifulsoup4
pip3 install requests

To check that the libraries are installed correctly, import them in Python IDLE. If no error appears, the installation succeeded.
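
As an alternative to importing each library by hand, the check can be sketched as a short script. This is only an illustration: `missing_modules` is a hypothetical helper name, and the script relies on the standard library's `importlib.util.find_spec`, which returns `None` when Python cannot locate a module.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that Python cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The three modules this tutorial imports
missing = missing_modules(["bs4", "requests", "re"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All libraries are installed.")
```

If the script reports a missing module, install it with pip3 and run the check again.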

We want to read a news page and extract its headlines. I have chosen the Tech News page from Reuters.

Right-click on the page and select View page source.

Look for the title of the first news story you see on the page.

If you are familiar with HTML structure, you can easily analyze the page source around that title. (Do not worry if you are not! A quick search will teach you the basic concepts.)
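
To make the structure concrete, here is a minimal sketch of what a headline tag can look like and how its text is pulled out. The HTML fragment is a made-up sample modeled on the `h3` tag with class `story-title` that the code in this tutorial searches for, not a copy of the real page. The sketch uses Python's built-in `html.parser` module only so that it runs without extra libraries; the tutorial itself uses Beautiful Soup for this job, and `HeadlineParser` is a hypothetical class name.

```python
from html.parser import HTMLParser

# Hypothetical fragment, modeled on the tag the scraper looks for
SAMPLE_HTML = """
<article>
  <h3 class="story-title">
    Example headline one
  </h3>
  <p>Summary text that is not a headline.</p>
  <h3 class="story-title">Example headline two</h3>
</article>
"""


class HeadlineParser(HTMLParser):
    """Collect the text inside every <h3 class="story-title"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h3" and dict(attrs).get("class") == "story-title":
            self.in_headline = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_headline = False

    def handle_data(self, data):
        # Only keep text that appears inside a headline tag
        if self.in_headline:
            self.headlines[-1] += data


parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
headlines = [text.strip() for text in parser.headlines]
print(headlines)
```

The paragraph text is skipped because it sits outside the headline tags, which is the same idea Beautiful Soup applies when it filters by tag name and class.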

Code

Use the following code to extract the news headlines from the HTML code:

from bs4 import BeautifulSoup 
import requests 
import re 
 
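news = ""

# Download the page; the timeout keeps the script from hanging on a slow connection
r = requests.get("https://www.reuters.com/news/technology", timeout=10)
soup = BeautifulSoup(r.text, 'html.parser')
# Find every <h3 class="story-title"> element (the headline tag seen in the page source)
result = soup.find_all('h3', attrs = {'class':'story-title'})

# Build a numbered list of headlines, one per line
for i, headline in enumerate(result, start=1):
    news += "%s- %s\n" % (i, headline.text.strip())

print(news)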

Sending Email Using Python & Raspberry Pi

To send emails you need the following libraries, both of which are included with Python:

  • Smtplib
  • Mime

The following code emails the scraped news headlines to the email addresses you specify:

from bs4 import BeautifulSoup 
import requests 
import re 
import smtplib 
from email.mime.text import MIMEText

news = ""

# Download the page and collect the headline tags
r = requests.get("https://www.reuters.com/news/technology", timeout=10)
soup = BeautifulSoup(r.text, 'html.parser')
result = soup.find_all('h3', attrs = {'class':'story-title'})

# Build a numbered list of headlines, one per line
for i, headline in enumerate(result, start=1):
    news += "%s- %s\n" % (i, headline.text.strip())

gmail_user = 'my@gmail.com'      # Enter your email address
gmail_password = '*******'       # Enter your email password
sent_from = gmail_user
to = ['to1@gmail.com', 'to2@gmail.com']      # List of destination email addresses

# Build the message; MIMEText keeps the headers separate from the body
# and encodes any non-ASCII characters in the headlines
msg = MIMEText(news, 'plain', 'utf-8')
msg['Subject'] = 'Raspberry pi NEWS'
msg['From'] = sent_from
msg['To'] = ', '.join(to)

try:
    server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
    server.ehlo()
    server.login(gmail_user, gmail_password)
    server.sendmail(sent_from, to, msg.as_string())
    server.close()

    print('Email sent!')
except Exception as error:
    # Show the actual error to make troubleshooting easier
    print('Something went wrong...', error)

In the code above, enter the sender's email address and password, and a list of recipient email addresses.
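
If you would rather not type the password into the script itself, one option is to read it from environment variables. The sketch below is only an illustration under assumptions: `load_credentials` is a hypothetical helper, and the variable names `GMAIL_USER` and `GMAIL_PASSWORD` are made up for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read the sender address and password from environment variables."""
    user = env.get("GMAIL_USER")          # hypothetical variable name
    password = env.get("GMAIL_PASSWORD")  # hypothetical variable name
    if not user or not password:
        raise RuntimeError("Set GMAIL_USER and GMAIL_PASSWORD before running the script")
    return user, password


# Demonstration with a plain dictionary standing in for the real environment
demo_user, demo_password = load_credentials(
    {"GMAIL_USER": "my@gmail.com", "GMAIL_PASSWORD": "example-password"}
)
print(demo_user)
```

In the real script you would call `load_credentials()` with no argument and assign the result to `gmail_user` and `gmail_password`. Keep in mind that cron does not load your shell profile, so a scheduled script only sees variables that are also defined in the crontab.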

Note

The sender account must be a Gmail account, because the code connects to Gmail's SMTP server. The recipients can use any email service provider (Gmail, Yahoo, etc.).

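If you get the “Something went wrong” message after running the code, follow these steps:

  1. Check that you have typed the code correctly and entered the correct email address and password.
  2. Check that port 465 of smtp.gmail.com has not been blocked by your ISP.
  3. Go to this link and put Allow less secure apps in the ON position. Google has since removed this setting for personal accounts; if you cannot find it, turn on 2-Step Verification and create an App Password to use in place of your normal password in the code.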
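
To find out which of these checks applies, it helps to look at the exception that was raised while sending. The following is a rough sketch, not a complete diagnosis: `explain_failure` is a hypothetical helper, and the mapping from exception type to step number is an assumption about the most common causes.

```python
import smtplib


def explain_failure(error):
    """Map an exception raised while sending to a troubleshooting hint."""
    if isinstance(error, smtplib.SMTPAuthenticationError):
        return "Login rejected: check the address and password (step 1) or the account settings (step 3)."
    if isinstance(error, smtplib.SMTPException):
        return "The mail server refused the request: %s" % (error,)
    if isinstance(error, OSError):
        # Connection refused, timeouts and DNS failures all end up here
        return "Could not reach smtp.gmail.com on port 465: the port may be blocked (step 2)."
    return "Unexpected error: %r" % (error,)


# Demonstration with hand-made exceptions instead of a real connection
print(explain_failure(smtplib.SMTPAuthenticationError(535, b"Bad credentials")))
print(explain_failure(ConnectionRefusedError("connection refused")))
```

In the email script, you could catch the exception with `except Exception as error:` and print `explain_failure(error)` instead of the generic message.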

Running the Code on Raspberry Pi w/ Scheduling

If you want to run a program on the Raspberry Pi on a schedule (every hour or every day, for example), follow these steps:

step1. Save the code as reuters.py and move it to /home/pi/

step2. Add the following line to the beginning of the code:

#!/usr/bin/python3

step3. Make the script executable by entering the following command in the terminal:

chmod +x reuters.py

Then check that the script runs correctly by entering the following command in the terminal:

./reuters.py

step4. Enter the following command in the terminal:

crontab -e

Go to the bottom of the file that opens and add the following line:

*/5 * * * * /home/pi/reuters.py

Then save it and exit.

This command will cause your application to run every 5 minutes. You can change it to your liking. The pattern is as follows:

Minutes Hours Day of month Month Weekday Command
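
To make the field order concrete, the short sketch below splits the crontab line from this tutorial into its named parts. It is only an illustration of the layout: `FIELD_NAMES` and `split_cron_line` are hypothetical names, and cron itself does the real parsing.

```python
# Names of the five time fields, in the order cron expects them
FIELD_NAMES = ["minute", "hour", "day_of_month", "month", "weekday"]


def split_cron_line(line):
    """Split a crontab entry into its five time fields and the command."""
    # Split on whitespace at most five times so the command stays in one piece
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five time fields followed by a command")
    schedule = dict(zip(FIELD_NAMES, parts[:5]))
    return schedule, parts[5]


schedule, command = split_cron_line("*/5 * * * * /home/pi/reuters.py")
print(schedule)
print(command)
```

Here the minute field is `*/5` (every fifth minute) and the other four fields are `*`, meaning any value, which is why the script runs every 5 minutes.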

For more examples check this site.

Note

Make sure that your Raspberry Pi is connected to the internet.

What’s Next?

  • Program your Raspberry Pi so that it emails you every day at 7 am and 7 pm.
  • Try scraping another site.

Comments (7)

  • George Jackson

    Really cool tutorial! Thanks for going over the basics for this stuff, and for your clear and easy to follow examples. I like the web scraping output, how you listed all of the news articles. That really shows off the power of it. Then making that run via a schedule, and emailing out the results takes it to the next level. Thanks for this! I plan to implement some of this stuff in a Raspberry pi hosted website that I’m building

    January 12, 2020 at 4:42 pm
    • Saeed Hosseini

      That’s great

      January 26, 2020 at 1:13 pm
  • Rafael

    Hey! I followed this tutorial and everything went great! I tried expanding on it and getting the world news. The website sometimes works and other times doesn’t, which is very weird. Can you help?
    https://www.reuters.com/news/world
    I’m afraid I’m being blocked at times.

    May 12, 2020 at 12:19 pm
  • Alex S Fliegel

    Hey this is really great! Thanks so much for the baseline, it’s great to get a head start in BS4. Whenever you are adding usernames and passwords to a script, it is always best practice to store the username and password (along with other sensitive data) in a separate config.py file and import it. That way, when you decide to share your awesome code on GitHub you can add config.py to the .gitignore file so no one can get into your accounts!

    Thanks again!

    January 13, 2021 at 2:32 am
    • Mehran Maleki

      Hi. You’re quite welcome! And thanks for the advice. We’ll take that into consideration.

      January 16, 2021 at 11:34 am
  • Sammy

    Some of the code appears to be missing:
    -At the beginning, when talking about installing the libraries
    -Step 3 and 4 in the “Scheduling” section

    Can you clarify what the codes will be, as well as maybe publishing the full code block for the python program?

    October 5, 2021 at 8:43 pm
    • Mehran Maleki

      Hi!
      There must have been a mistake while publishing the tutorial! Now the article is updated and the codes are added.

      October 6, 2021 at 6:39 am
