Web Scraping with Python on Raspberry Pi: Basics and Sample Project

Overview

Discover the fundamentals of web scraping with Python on a Raspberry Pi. This tutorial will guide you through the process of automatically extracting desired information from websites. As an illustration, we will extract news headlines from the Tech News page of the Reuters website and send them to a designated email account at specific times.

What You Will Learn

What is Web Scraping?

Web scraping means extracting required information from a web page using code. Python is one of the best programming languages for web scraping. Using Python libraries, you can extract and process information from a web page with just a few lines of code.

In this tutorial we use the Beautiful Soup library for web scraping.

Web Scraping with Python & Raspberry Pi

Note

In this tutorial, Python 3 has been used. If you want to use Python 2, you may need to change some commands.

Installing Required Library

To perform web scraping, you will need the following libraries (some may be pre-installed):

  • Beautiful Soup 4 (bs4)
  • Requests
  • Regular Expressions (re), which is part of the Python standard library and needs no installation

You can use the pip command to install the other two libraries. To do this, enter the following commands in the Raspberry Pi terminal one by one:

pip3 install beautifulsoup4
pip3 install requests

To ensure that the libraries are installed correctly, import them in Python IDLE. If you do not encounter an error, the libraries are installed correctly.
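
The same check can also be scripted. The following is a minimal sketch, not part of the original project: the helper name `missing_modules` and the `REQUIRED_MODULES` list are made up for illustration, and the check only confirms that each module can be found, not that it works.

```python
import importlib.util

# Third-party modules this tutorial relies on (import names, not pip names).
# This list is an assumption made for the sketch.
REQUIRED_MODULES = ["bs4", "requests"]


def missing_modules(names):
    """Return the module names from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required libraries are installed.")
```

If a name is reported as missing, install it with pip3 and run the check again.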

  • We want to read a news page and extract the news headlines. I have chosen the Tech News page from Reuters.
  • Right-click on the page and select View page source.
  • Look for the title of the first news item you see on the page.
  • If you are familiar with HTML structure, you can easily find the element that holds the headline. (Do not worry if you are not! A quick search will teach you the basic concepts.)
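
To make the structure concrete, here is a small sketch that pulls headline text out of an HTML fragment. The fragment is hypothetical: it is only shaped like the markup the tutorial's code expects (an `h3` element with the class `story-title`), and the real page source may differ. The sketch uses the standard library's `html.parser` as a stand-in so that it runs without a network connection or extra packages; the tutorial's own code uses Beautiful Soup for the same job.

```python
from html.parser import HTMLParser

# Hypothetical fragment, shaped like the markup the tutorial's code expects.
SAMPLE_HTML = """
<article class="story">
  <div class="story-content">
    <a href="/article/example-1">
      <h3 class="story-title">
        First example headline
      </h3>
    </a>
  </div>
</article>
<article class="story">
  <div class="story-content">
    <a href="/article/example-2">
      <h3 class="story-title">Second example headline</h3>
    </a>
  </div>
</article>
"""


class StoryTitleParser(HTMLParser):
    """Collect the text of every <h3 class="story-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "story-title" in classes:
            self._inside_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self._inside_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._inside_title:
            self.titles[-1] += data


parser = StoryTitleParser()
parser.feed(SAMPLE_HTML)
titles = [title.strip() for title in parser.titles]
print(titles)
```

The tag name and the class are the two things to look for in the page source, because they are what the scraping code searches by.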

Code

Use the following code to extract the news headlines from the HTML code:

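from bs4 import BeautifulSoup
import requests
import re  # not used in this snippet

i = 1       # running headline number
news = ""   # accumulated text, one headline per line

# Download the page and parse its HTML
r = requests.get("https://www.reuters.com/news/technology")
soup = BeautifulSoup(r.text, 'html.parser')

# Each headline is an <h3> element with the class "story-title"
result = soup.find_all('h3', attrs = {'class':'story-title'})

for new in result:
    news += ("%s- "%i)
    news += new.text.strip()
    news += "\n"
    i += 1

print(news)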
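
The numbering loop can also be factored into a small function, which makes it easy to try out without downloading anything. This is a sketch only: `format_headlines` is a hypothetical helper name, and it takes plain strings rather than Beautiful Soup elements.

```python
def format_headlines(titles):
    """Return a numbered list with one "N- title" entry per line."""
    lines = []
    for number, title in enumerate(titles, start=1):
        lines.append("%d- %s" % (number, title.strip()))
    # Match the original loop: every entry ends with a newline.
    return "\n".join(lines) + "\n" if lines else ""


print(format_headlines(["  First example headline ", "Second example headline"]))
```

With the scraping code above, the input would be the `text` of each element in `result`.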

Sending Email Using Python & Raspberry Pi

To send emails you need the following libraries, both of which ship with Python:

  • SMTP library (smtplib)
  • MIME library (email.mime)

The following code emails the previously read news headlines to the email addresses you specify:

from bs4 import BeautifulSoup
import requests
import re
import smtplib
from email.mime.multipart import MIMEMultipart

i = 1
news = ""

r = requests.get("https://www.reuters.com/news/technology")
soup = BeautifulSoup(r.text, 'html.parser')
result = soup.find_all('h3', attrs = {'class':'story-title'})

for new in result:
    news += ("%s- "%i)
    news += new.text.strip()
    news += "\n"
    i += 1

gmail_user = 'your_address@gmail.com'   # Enter your email address (placeholder)
gmail_password = '*******'   # Enter your email password
sent_from = gmail_user
to = ['recipient1@example.com', 'recipient2@example.com']   # List of destination email addresses (placeholders)

# Email subject; the blank line after it separates the headers from the body
msg = """Subject: Raspberry pi NEWS

"""
# Add the news to the email message
msg += news

try:
    server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
    server.ehlo()
    server.login(gmail_user, gmail_password)
    server.sendmail(sent_from, to, msg)
    server.close()

    print ('Email sent!')
except Exception as error:
    # Show the underlying error to make troubleshooting easier
    print ('Something went wrong...', error)

In the code above, enter the email address of the sender and its password and a list of recipient email addresses.
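
One thing to watch for: when `sendmail` is given a plain string, smtplib encodes it as ASCII, so a headline containing accented letters or curly quotes can make the send fail. A possible alternative, sketched here and not the method used in the code above, is to build the message with the standard library's `EmailMessage` class, which encodes the body as UTF-8. The helper name `build_message` and all addresses are placeholders.

```python
from email.message import EmailMessage


def build_message(sender, recipients, subject, body):
    """Build an email whose body may contain non-ASCII characters."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(body)  # a str body is encoded as UTF-8
    return msg


# Placeholder addresses and a made-up headline, for illustration only.
message = build_message(
    "sender@example.com",
    ["first@example.com", "second@example.com"],
    "Raspberry Pi NEWS",
    "1- Café robots make headlines\n",
)
print(message["Subject"])
```

A message built this way would be sent with `server.send_message(message)` instead of `server.sendmail(...)`.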

Note

The sender account should be a Gmail account, but the recipients can use any email service provider (Gmail, Yahoo, etc.).

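If you get the “Something went wrong” message after running the code, follow these steps:

  1. Check that you have typed the code correctly and entered the correct email address.
  2. Check that port 465 of smtp.gmail.com has not been blocked by your ISP.
  3. Go to this link and turn Allow less secure apps ON. Google has since retired this setting for most accounts; if it is unavailable, turn on 2-Step Verification and create an app password to use in place of your account password.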
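
It is easier to pick the right step when the script reports what actually failed. The sketch below maps common exception types to a hint; `explain_send_error` is a hypothetical helper, and the wording of the hints is only a suggestion.

```python
import smtplib


def explain_send_error(error):
    """Return a short hint for an exception raised while sending mail."""
    # Check the most specific types first: SMTP errors are also OSErrors.
    if isinstance(error, smtplib.SMTPAuthenticationError):
        return "Login rejected: check the address, password, and account security settings."
    if isinstance(error, UnicodeEncodeError):
        return "The message contains non-ASCII characters that could not be encoded."
    if isinstance(error, smtplib.SMTPException):
        return "The mail server reported an error: %s" % error
    if isinstance(error, OSError):
        return "Could not reach the server: check the network and that port 465 is open."
    return "Unexpected error: %r" % error


print(explain_send_error(ConnectionRefusedError()))
```

To use it, catch the exception in the script's `except` block as `except Exception as error:` and print `explain_send_error(error)`.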

Running the Code on Raspberry Pi with Scheduling

If you want to run a program on the Raspberry Pi board on a schedule (every hour or every day, for example), follow these steps:

step1. Save the code as reuters.py and move it to /home/pi/

step2. Add the following line to the beginning of the code:

#!/usr/bin/python3

step3. Make the script executable by entering the following command in the terminal:

chmod +x reuters.py

Then check that the script runs correctly by entering the following command in the terminal:

./reuters.py

step4. Enter the following command in the terminal:

crontab -e

Go to the bottom of the file that opens and add the following line:

*/5 * * * * /home/pi/reuters.py

Then save it and exit.

This command will cause your application to run every 5 minutes. You can change it to your liking. The pattern is as follows:

Minutes Hours Day of month Month Weekday Command
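
To see how a crontab line maps onto this pattern, the sketch below splits a line into its five time fields and the command. It is an illustration only: `parse_cron_line` is a hypothetical helper, it does not validate the field values, and cron itself does not need it.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "weekday")


def parse_cron_line(line):
    """Split a crontab line into its five time fields and the command."""
    parts = line.split(None, 5)  # the command itself may contain spaces
    if len(parts) != 6:
        raise ValueError("expected five time fields followed by a command")
    entry = dict(zip(CRON_FIELDS, parts[:5]))
    entry["command"] = parts[5].strip()
    return entry


# The line from this tutorial: run every 5 minutes.
print(parse_cron_line("*/5 * * * * /home/pi/reuters.py"))

# Another example: run at 08:30 every Monday (weekday 1).
print(parse_cron_line("30 8 * * 1 /home/pi/reuters.py"))
```

An asterisk in a field means "every value", so the second line leaves the day of month and the month unrestricted.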

For more examples check this site.

Note

Make sure that your Raspberry Pi is connected to the internet.

What’s Next?

  • Program your Raspberry Pi so that it emails you every day at 7 am and 7 pm.
  • Explore web scraping on other websites.

Comments (9)

  • George Jackson

    Really cool tutorial! Thanks for going over the basics for this stuff, and for your clear and easy to follow examples. I like the web scraping output, how you listed all of the news articles. That really shows off the power of it. Then making that run via a schedule, and emailing out the results takes it to the next level. Thanks for this! I plan to implement some of this stuff in a Raspberry pi hosted website that I’m building

    January 12, 2020 at 4:42 pm
    • Saeed Hosseini

      That’s great

      January 26, 2020 at 1:13 pm
  • Rafael

    Hey! I followed this tutorial and everything went great! I tried expanding on it and getting the world news. The website sometimes works and other times doesn’t, which is very weird. Can you help?
    https://www.reuters.com/news/world
    I’m afraid I’m being blocked at times.

    May 12, 2020 at 12:19 pm
  • Alex S Fliegel

    Hey, this is really great! Thanks so much for the baseline; it’s great to get a head start in BS4. Whenever you are adding usernames and passwords to a script, it is always best practice to store the username and password (along with other sensitive data) in a separate config.py file and import it. That way, when you decide to share your awesome code on GitHub, you can add config.py to the .gitignore file so no one can get into your accounts!

    Thanks again!

    January 13, 2021 at 2:32 am
    • Mehran Maleki

      Hi. You’re quite welcome! And thanks for the advice. We’ll take that into consideration.

      January 16, 2021 at 11:34 am
  • Dominic

    Hey! Thanks for this Tutorial, it works so far but I encounter a Problem. It can scrape the Text from the website just fine but as soon as I do this line:
    msg += news
    It is unable to send the email. Just a message text alone works but not with the scraped text. Is it somehow limited to a number of characters? I scraped some more data than just headlines.
    Thanks

    July 13, 2021 at 6:49 am
  • Sammy

    Some of the code appears to be missing:
    -At the beginning, when talking about installing the libraries
    -Step 3 and 4 in the “Scheduling” section

    Can you clarify what the codes will be, as well as maybe publishing the full code block for the python program?

    October 5, 2021 at 8:43 pm
    • Mehran Maleki

      Hi!
      There must have been a mistake while publishing the tutorial! Now the article is updated and the codes are added.

      October 6, 2021 at 6:39 am
  • King

    Hi ,

    Could you please guide for javascript dynamic webpage scraping?

    Thanks

    December 26, 2021 at 4:29 am
