Basics of Web Scraping w/ Raspberry Pi & Python [w/ Sample Project]

Overview

In this tutorial, you will learn about web scraping and how to automatically extract your desired information from a site by using a Raspberry Pi. As an example, we read the Tech News page of the Reuters website automatically and send news headlines to an arbitrary email account at a specific time.

What You Will Learn

  • Web scraping basics in Python
  • Implementation of Web scraping projects on the Raspberry Pi board
  • How to send an email with Raspberry Pi and Python
  • Run an application automatically at specific times on the Raspberry Pi

What is Web Scraping?

Web scraping means extracting required information from a web page using code. Python is one of the best programming languages for web scraping. With Python libraries, you can extract and process information from a web page in just a few lines of code.

In this tutorial, we use the Beautiful Soup library for web scraping.

Web Scraping with Raspberry Pi & Python

Note

In this tutorial, Python 3 has been used. If you want to use Python 2, you may need to change some commands.

Installing the Required Libraries

You need the three libraries below for web scraping (you may already have some of them):

  • bs4 (Beautiful Soup)
  • requests
  • re

Use the pip command to install bs4 and requests. The re module is part of the Python standard library, so it does not need to be installed with pip. Enter the following commands in the Raspberry Pi terminal one by one:

pip3 install bs4
pip3 install requests

To confirm that the libraries are installed correctly, import them in Python IDLE. If no error appears, the installation succeeded.
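The same check can be scripted instead of typing each import by hand. The sketch below assumes the import names bs4, requests, and re used in this tutorial; the helper name missing_modules is made up for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that Python cannot find on this system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used in this tutorial (bs4 provides Beautiful Soup)
required = ["bs4", "requests", "re"]

missing = missing_modules(required)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All libraries are installed.")
```

If a name is reported as missing, install it with pip3 and run the check again.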

We want to read a news page and extract the news headlines. I have chosen the Tech News page from Reuters.

Right-click on the page and select View page source.

Look for the headline of the first news item you see on the page.

If you are familiar with HTML structure, you can easily read the page source: each headline sits inside an h3 tag with the class story-title. (Do not worry if you are not! A simple search will teach you the basic concepts.)
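To make the tag structure concrete, here is a small sketch that pulls headlines out of a hand-written HTML fragment. The fragment, its headlines, and the class name HeadlineParser are invented for illustration; only the h3 tag and the story-title class come from the Reuters page as used in this tutorial. It relies on Python's built-in html.parser module so that it runs without installing anything. Beautiful Soup, which the rest of this tutorial uses, does the same job with far less code.

```python
from html.parser import HTMLParser

# Hand-written stand-in for the page source (not real Reuters content)
SAMPLE_HTML = """
<div class="story-content">
  <h3 class="story-title">
      First example headline
  </h3>
  <p>Summary text that we do not want.</p>
  <h3 class="story-title">Second example headline</h3>
  <h3 class="other">Not a story title</h3>
</div>
"""


class HeadlineParser(HTMLParser):
    """Collect the text inside every <h3 class="story-title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are inside a matching <h3>
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "story-title" in classes:
            self.in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h3" and self.in_title:
            self.in_title = False
            self.titles[-1] = self.titles[-1].strip()


parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
print(parser.titles)
```

The parser keeps only the text of the two matching h3 elements and ignores the paragraph and the h3 with a different class.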

Code

Use the following code to extract the news headlines from the HTML code:

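from bs4 import BeautifulSoup
import requests
import re

# Download the Tech News page and parse its HTML
r = requests.get("https://www.reuters.com/news/technology")
soup = BeautifulSoup(r.text, 'html.parser')

# Every headline is an <h3> tag with the class "story-title"
result = soup.find_all('h3', attrs={'class': 'story-title'})

# Build a numbered list of headlines, one per line
news = ""
for i, headline in enumerate(result, start=1):
    news += "%s- %s\n" % (i, headline.text.strip())

print(news)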

Sending Email Using Python & Raspberry Pi

To send emails, you need the following libraries (both are part of the Python standard library):

  • smtplib
  • email.mime

The following code emails the previously read news headlines to the email addresses you specify:

from bs4 import BeautifulSoup
import requests
import re
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Scrape the headlines exactly as in the previous listing
r = requests.get("https://www.reuters.com/news/technology")
soup = BeautifulSoup(r.text, 'html.parser')
result = soup.find_all('h3', attrs={'class': 'story-title'})

news = ""
for i, headline in enumerate(result, start=1):
    news += "%s- %s\n" % (i, headline.text.strip())

gmail_user = 'my@gmail.com'    # Enter your email address
gmail_password = '*******'     # Enter your email password
sent_from = gmail_user
to = ['to1@gmail.com', 'to2@gmail.com']    # List of destination email addresses

# Build the message: headers first, then the headlines as a UTF-8 text body
msg = MIMEMultipart()
msg['Subject'] = 'Raspberry pi NEWS'
msg['From'] = sent_from
msg['To'] = ', '.join(to)
msg.attach(MIMEText(news, 'plain', 'utf-8'))

try:
    server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
    server.ehlo()
    server.login(gmail_user, gmail_password)
    server.sendmail(sent_from, to, msg.as_string())
    server.quit()

    print('Email sent!')
except Exception as e:
    # Show the actual error instead of hiding it
    print('Something went wrong...', e)

In the code above, enter the sender's email address and password, and a list of recipient email addresses.
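If you prefer to keep message building separate from sending, the standard library's EmailMessage class can assemble the headers and body for you. This is only a sketch of one possible arrangement: the helper name build_news_email is hypothetical, and the addresses and headlines are placeholders.

```python
from email.message import EmailMessage


def build_news_email(sender, recipients, news):
    """Build an email whose body is the headline list (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = "Raspberry pi NEWS"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    # set_content() picks a suitable charset, so non-ASCII headlines are safe
    msg.set_content(news)
    return msg


# Placeholder addresses and headlines, as in the tutorial code
message = build_news_email(
    "my@gmail.com",
    ["to1@gmail.com", "to2@gmail.com"],
    "1- First headline\n2- Second headline\n",
)
print(message)

# To send it, pass the object to a logged-in server from the code above:
# server.send_message(message)
```

Printing the message before sending it is a cheap way to confirm that the subject, the recipients, and the body look right.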

Note

The sender account must be a Gmail account, because the code connects to smtp.gmail.com, but the recipients can use any email service provider (Gmail, Yahoo, etc.).

If you get the “Something went wrong” message after running the code, follow these steps:

  1. Check that you have typed the code correctly and entered the correct email address and password.
  2. Check that port 465 of smtp.gmail.com is not blocked by your ISP.
  3. In your Google account security settings, switch Allow less secure apps to ON. If that option is no longer offered for your account, create an app password and use it in place of your normal password.
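The port check in step 2 can be done from Python itself. The following is a minimal sketch, assuming a plain TCP connection attempt is enough to tell whether the port is reachable; the function name port_is_open is made up for illustration.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


# On the Raspberry Pi you could call, for example:
# print(port_is_open("smtp.gmail.com", 465))
```

A True result only shows that the port is reachable. It says nothing about whether the login will succeed, so steps 1 and 3 still apply.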

Running the Code on Raspberry Pi w/ Scheduling

If you want to run a program on the Raspberry Pi on a schedule (every hour or every day, for example), follow these steps:

Step 1. Save the code as reuters.py and move it to /home/pi/

Step 2. Add the following line to the beginning of the code:

#!/usr/bin/python3

Step 3. Make the script executable by entering the following command in the terminal:

chmod +x reuters.py

Then check that the script runs correctly by entering the following command in the terminal:

./reuters.py

Step 4. Enter the following command in the terminal:

crontab -e

Go to the bottom of the file that opens and add the following line:

*/5 * * * * /home/pi/reuters.py

Then save it and exit.

This line makes your application run every 5 minutes. You can change the schedule to your liking. The pattern is as follows:

Minutes Hours Day of month Month Weekday Command
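To see how a crontab line maps onto this pattern, the sketch below splits a few example lines into named fields. The helper split_cron_line and the field names are made up for illustration, and it only splits the text; it does not validate the schedule the way cron does.

```python
CRON_FIELDS = ["minute", "hour", "day_of_month", "month", "weekday"]


def split_cron_line(line):
    """Split one crontab line into its five time fields and the command."""
    parts = line.split(None, 5)   # at most 5 splits: the command may contain spaces
    if len(parts) != 6:
        raise ValueError("expected 5 time fields followed by a command")
    entry = dict(zip(CRON_FIELDS, parts[:5]))
    entry["command"] = parts[5]
    return entry


examples = [
    "*/5 * * * * /home/pi/reuters.py",    # every 5 minutes (the line used above)
    "0 * * * * /home/pi/reuters.py",      # at minute 0 of every hour
    "0 7,19 * * * /home/pi/reuters.py",   # at 07:00 and 19:00 every day
]

for line in examples:
    print(split_cron_line(line))
```

An asterisk means "every value" for that field, */5 means "every fifth value", and a comma separates a list of values.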

For more examples check this site.

Note

Make sure that your Raspberry Pi is connected to the internet.

What’s Next?

  • Program your Raspberry Pi so that it emails you every day at 7 am and 7 pm.
  • Try scraping another site.

Comments (3)

  • George Jackson

    Really cool tutorial! Thanks for going over the basics for this stuff, and for your clear and easy to follow examples. I like the web scraping output, how you listed all of the news articles. That really shows off the power of it. Then making that run via a schedule, and emailing out the results takes it to the next level. Thanks for this! I plan to implement some of this stuff in a Raspberry pi hosted website that I’m building

    January 12, 2020 at 4:42 pm
    • Saeed Hosseini

      That’s great

      January 26, 2020 at 1:13 pm
  • Rafael

    Hey! I followed this tutorial and everything went great! I tried expanding on it and getting the world news. The website sometimes works and other times doesn’t, which is very weird. Can you help?
    https://www.reuters.com/news/world
    I’m afraid I’m being blocked at times.

    May 12, 2020 at 12:19 pm
