Overview
Discover the fundamentals of web scraping with Python on a Raspberry Pi. This tutorial will guide you through the process of automatically extracting desired information from websites. As an illustration, we will extract news headlines from the Tech News page of the Reuters website and send them to a designated email account at specific times.
What You Will Learn
- Web scraping basics in Python
- Implementation of Web scraping projects on the Raspberry Pi board
- How to send an email with Raspberry Pi and Python
- Run an application automatically at specific times on the Raspberry Pi
What is Web Scraping?
Web scraping means extracting required information from a web page using code. Python is one of the best programming languages for web scraping. Using Python libraries, you can extract and process information from a web page with a few lines of code.
This tutorial uses the Beautiful Soup library for web scraping.
Web Scraping with Python & Raspberry Pi
Note
In this tutorial, Python 3 has been used. If you want to use Python 2, you may need to change some commands.
Installing Required Library
To perform web scraping, you will need the following libraries:
- Beautiful Soup 4 (bs4)
- Requests
- Regular Expressions (re), which is part of the Python standard library and needs no installation
You can use the pip command to install the first two libraries (they may already be installed). To do this, enter the following commands in the Raspberry Pi terminal one by one:
pip3 install bs4
pip3 install requests
To check that the libraries are installed correctly, import them in Python IDLE. If no error appears, the installation succeeded.
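The same check can be scripted instead of typed into IDLE. This is a minimal sketch; the helper name check_libraries is hypothetical and not part of the original tutorial. It only asks Python whether each module can be found, without importing it.

```python
import importlib.util


def check_libraries(names):
    """Return a dict mapping each top-level module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # bs4 and requests come from pip; re ships with Python itself.
    for name, found in check_libraries(["bs4", "requests", "re"]).items():
        print(f"{name}: {'OK' if found else 'missing'}")
```

Any library reported as missing can then be installed with pip3 as described above.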
- We want to read a news page and extract its headlines. This tutorial uses the Tech News page from Reuters.
- Right-click on the page and select View page source.
- Look for the title of the first news item you see on the page.
- If you are familiar with HTML structure, you can easily analyze the page source. (Do not worry if you are not! A quick search will teach you the basic concepts.)
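To make the idea of "finding the headline in the page source" concrete, here is a small sketch that runs without installing anything. It uses the standard library's html.parser rather than Beautiful Soup, and the HTML fragment is invented for illustration; only the tag (h3) and class name (story-title) are taken from the tutorial's code, and the real page markup may differ. The names HeadlineParser and extract_headlines are hypothetical.

```python
from html.parser import HTMLParser

# Invented fragment for illustration; the real page markup may differ.
SAMPLE_HTML = """
<article>
  <h3 class="story-title"> First headline </h3>
  <p>Summary text</p>
  <h3 class="story-title">Second headline</h3>
  <h3 class="other">Not a headline</h3>
</article>
"""


class HeadlineParser(HTMLParser):
    """Collect the text of every <h3> element whose class list contains "story-title"."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "story-title" in classes:
            self._inside = True
            self.headlines.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a matching <h3>.
        if self._inside:
            self.headlines[-1] += data

    def handle_endtag(self, tag):
        if tag == "h3" and self._inside:
            self._inside = False
            self.headlines[-1] = self.headlines[-1].strip()


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines


if __name__ == "__main__":
    print(extract_headlines(SAMPLE_HTML))
```

Beautiful Soup does this kind of tag-and-class matching for you in a single call, which is why the tutorial uses it.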
Code
Use the following code to extract the news headlines from the HTML code:
from bs4 import BeautifulSoup
import requests
import re

i = 1
news = ""
# Download the page and parse its HTML
r = requests.get("https://www.reuters.com/news/technology")
soup = BeautifulSoup(r.text, 'html.parser')
# Find every <h3> tag with the class "story-title"
result = soup.find_all('h3', attrs = {'class':'story-title'})
# Build a numbered list of headlines, one per line
for new in result:
    news += ("%s- "%i)
    news += new.text.strip()
    news += "\n"
    i += 1
print(news)
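The counter-and-concatenate loop above can also be written with enumerate. The sketch below wraps it in a hypothetical helper, format_headlines (the name is not from the original), which takes any iterable of headline strings and returns the same numbered text.

```python
def format_headlines(titles):
    """Number each headline from 1 and join them, one per line."""
    lines = []
    for number, title in enumerate(titles, start=1):
        lines.append(f"{number}- {title.strip()}")
    # The trailing newline matches what the tutorial's loop produces.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    print(format_headlines(["  First headline ", "Second headline"]))
```

With the scraping code above, it could be called as news = format_headlines(tag.text for tag in result).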
Sending Email Using Python & Raspberry Pi
To send emails you need the following libraries:
- SMTP Library (smtplib)
- MIME Library (email.mime)
Both are part of the Python standard library, so nothing needs to be installed. The following code emails the scraped headlines to the email addresses you specify:
from bs4 import BeautifulSoup
import requests
import re
import smtplib
from email.mime.multipart import MIMEMultipart

# Scrape the headlines into a numbered list
i = 1
news = ""
r = requests.get("https://www.reuters.com/news/technology")
soup = BeautifulSoup(r.text, 'html.parser')
result = soup.find_all('h3', attrs = {'class':'story-title'})
for new in result:
    news += ("%s- "%i)
    news += new.text.strip()
    news += "\n"
    i += 1

gmail_user = '[email protected]' #Enter your email address
gmail_password = '*******' #Enter your email password
sent_from = gmail_user
to = ['[email protected]', '[email protected]'] #List of destination email addresses

#Email title (the blank line separates the header from the body)
msg = """Subject: Raspberry pi NEWS

"""
#Add news to email message
msg += news

try:
    server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
    server.ehlo()
    server.login(gmail_user, gmail_password)
    server.sendmail(sent_from, to, msg)
    server.close()
    print ('Email sent!')
except Exception as e:
    # Show the actual error to make troubleshooting easier
    print ('Something went wrong...', e)
In the code above, enter the sender's email address and password, and a list of recipient email addresses.
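One thing to be aware of: when sendmail is given a plain string, smtplib encodes it as ASCII, so a headline containing a character such as a curly quote or an accented letter can make the send fail. A possible alternative, sketched below, is to build the message with the standard library's EmailMessage class, which chooses a suitable encoding for the body. The helper name build_news_email and the example.com addresses are assumptions made for illustration, not part of the original tutorial.

```python
from email.message import EmailMessage


def build_news_email(sender, recipients, news, subject="Raspberry Pi NEWS"):
    """Build an email whose body may contain non-ASCII headlines."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    # set_content picks an encoding that can represent the text.
    msg.set_content(news)
    return msg


if __name__ == "__main__":
    # Placeholder addresses; replace them with real ones.
    demo = build_news_email(
        "sender@example.com",
        ["first@example.com", "second@example.com"],
        "1- Caf\u00e9 chain tests \u2018smart\u2019 tables\n",
    )
    print(demo["To"])
```

Inside the try block, server.send_message(build_news_email(gmail_user, to, news)) would then take the place of the server.sendmail(...) call; send_message reads the sender and recipients from the message headers.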
Note
The sender account must be a Gmail account, but the recipients can use any email service provider (Gmail, Yahoo, etc.).
If you get the “Something went wrong” message after running the code, follow these steps:
- Check that you have typed the code correctly and entered the correct email address.
- Check that the port 465 of smtp.gmail.com has not been blocked by your ISP.
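- Go to this link and turn Allow less secure apps ON. If that option is not available for your account, you may need to sign in with an app password created in your Google account settings instead.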
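The checklist above is easier to work through if the script reports which kind of error occurred. The following is a rough sketch, assuming the exception is caught with "except Exception as exc:"; the helper name explain_send_error and the wording of the hints are hypothetical.

```python
import smtplib
import socket


def explain_send_error(exc):
    """Map a caught exception to a hint from the troubleshooting checklist."""
    if isinstance(exc, smtplib.SMTPAuthenticationError):
        return "Login rejected: check the address, password, and account security settings."
    if isinstance(exc, UnicodeEncodeError):
        return "The message contains non-ASCII characters that could not be encoded."
    if isinstance(exc, (socket.gaierror, socket.timeout, ConnectionError, TimeoutError)):
        return "Could not reach smtp.gmail.com on port 465: check the network or ISP blocking."
    return f"Unexpected error: {exc!r}"


if __name__ == "__main__":
    print(explain_send_error(ConnectionRefusedError("connection refused")))
```

In the email script, the except branch could then print explain_send_error(exc) instead of a fixed message.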
Running the Code on Raspberry Pi with Scheduling
If you want to run a program on the Raspberry Pi on a schedule (every hour or every day, for example), follow these steps:
step1. Move the code (saved as reuters.py in this example) to /home/pi/
step2. Add the following line to the beginning of the code:
#!/usr/bin/python3
step3. Make the script executable by entering the following command in the terminal:
chmod +x reuters.py
Then check that the script runs correctly by entering the following command in the terminal:
./reuters.py
step4. Enter the following command in the terminal:
crontab -e
Go to the bottom of the file that opens and add the following line:
*/5 * * * * /home/pi/reuters.py
Then save it and exit.
This line causes your application to run every 5 minutes. You can change the schedule to your liking. Each crontab line has five time fields followed by the command, in this order:
Minute  Hour  Day-of-month  Month  Weekday  Command
For more examples check this site.
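To see how a crontab line maps onto that pattern, the sketch below splits a line into its named fields. It does not validate or schedule anything; split_cron_line and CRON_FIELDS are hypothetical names used only for illustration.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "weekday")


def split_cron_line(line):
    """Split a crontab line into its five time fields and the command."""
    parts = line.strip().split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five time fields followed by a command")
    entry = dict(zip(CRON_FIELDS, parts[:5]))
    entry["command"] = parts[5]
    return entry


if __name__ == "__main__":
    for line in (
        "*/5 * * * * /home/pi/reuters.py",  # every 5 minutes
        "0 * * * * /home/pi/reuters.py",    # at minute 0 of every hour
        "0 8 * * * /home/pi/reuters.py",    # every day at 08:00
    ):
        print(split_cron_line(line))
```

In the first line, */5 in the minute field means "every fifth minute", and the asterisks in the remaining fields mean "every hour, every day, every month, every weekday".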
Note
Make sure that your Raspberry Pi is connected to the internet.
What’s Next?
- Program your Raspberry Pi so that it emails you every day at 7 am and 7 pm.
- Explore web scraping on other websites.
Comments (9)
Really cool tutorial! Thanks for going over the basics for this stuff, and for your clear and easy to follow examples. I like the web scraping output, how you listed all of the news articles. That really shows off the power of it. Then making that run via a schedule, and emailing out the results takes it to the next level. Thanks for this! I plan to implement some of this stuff in a Raspberry pi hosted website that I’m building
That’s great
Hey! I followed this tutorial and everything went great! I tried expanding on it and getting the world news. The website sometimes works and other times doesn’t, which is very weird. Can you help?
https://www.reuters.com/news/world
I’m afraid I’m being blocked at times.
Hey, this is really great! Thanks so much for the baseline; it’s great to get a head start in BS4. Whenever you are adding usernames and passwords to a script, it is always best practice to store the username and password (along with other sensitive data) in a separate config.py file and import it. That way, when you decide to share your awesome code on GitHub, you can add config.py to the .gitignore file so no one can get into your accounts!
Thanks again!
Hi. You’re quite welcome! And thanks for the advice. We’ll take that into consideration.
Hey! Thanks for this Tutorial, it works so far but I encounter a Problem. It can scrape the Text from the website just fine but as soon as I do this line:
msg += news
It is unable to send the email. Just a message text alone works but not with the scraped text. Is it somehow limited to a number of characters? I scraped some more data than just headlines.
Thanks
Some of the code appears to be missing:
-At the beginning, when talking about installing the libraries
-Step 3 and 4 in the “Scheduling” section
Can you clarify what the codes will be, as well as maybe publishing the full code block for the python program?
Hi!
There must have been a mistake while publishing the tutorial! Now the article is updated and the codes are added.
Hi ,
Could you please guide for javascript dynamic webpage scraping?
Thanks