Web scraping and me
Mine all of the data you could ever want with web scraping!
A long time ago, a co-worker showed me a project they had been working on that gathered website data from the major grocery store chains near them and returned the lowest-priced item among the available stores. I thought that was pretty neat, but I didn't spend much time thinking about it. As it turns out, I should have paid closer attention, because web scraping is one of the most powerful tools a dev can use.
Much later, I developed two interesting projects that use web scraping to different degrees. In this post, I want to quickly highlight those projects, then talk more about web scraping and why I'm mad at myself for not adding it to my toolkit earlier in my career.
Project 1
I wanted a load of good, meaningful content delivered to me every day, so I created a cloud function that scrapes the top 20 posts on Hacker News and sends them to my Slack. Why not just use the site? Good question: I don't know. My brain likes operating out of Slack, and with the links arriving every morning, clicking through the articles that interest me has just become part of my routine.
The Code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import functions_framework

# Register an HTTP function with the Functions Framework
@functions_framework.http
def scrape_and_post(request):
    # URL of the page we want to scrape
    url = 'https://news.ycombinator.com/'

    # Send a GET request
    response = requests.get(url)

    # Check the request was successful
    if response.status_code != 200:
        return 'Failed to get page content', 500

    # Parse the body of the site
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the elements with class 'titleline'
    elements = soup.find_all(class_='titleline')

    # We are only interested in the top 20 entries
    elements = elements[:20]

    # Slack webhook URL
    slack_webhook_url = 'SLACK_WEBHOOK_HERE'

    # List to hold all our messages
    messages = []

    for elem in elements:
        # Self-posts (e.g. Ask HN) use relative links, so resolve them against the site URL
        title_link = urljoin(url, elem.a.get('href'))
        title_text = elem.a.text
        # Append each message to our messages list
        messages.append(f'<{title_link}|{title_text}>')

    # Join all our messages into a single string separated by newlines
    message_text = '\n'.join(messages)

    slack_data = {
        'text': message_text,
        'unfurl_links': True,
    }

    response = requests.post(
        slack_webhook_url,
        json=slack_data,
        headers={'Content-Type': 'application/json'}
    )

    if response.status_code != 200:
        return f'Failed to send to Slack with error: {response.content}', 500

    return 'Success!', 200
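If you want to poke at this locally before deploying, the functions-framework package includes a CLI: running functions-framework --target scrape_and_post starts a local server you can hit with a browser or curl. To get the every-morning delivery, the deployed function just needs something like Cloud Scheduler pointed at it.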
Project 2
The other project pulls my local public transit data and displays upcoming train times on an Adafruit MagTag, so I can do my best to stop being late for the train every single morning.
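The MagTag side is its own adventure, but the scraping half looks a lot like Project 1. Here's a minimal sketch, assuming a hypothetical departures page at example.com/departures that lists times in elements with class 'departure-time' (both the URL and the class name are made up for illustration):

import requests
from bs4 import BeautifulSoup

# Hypothetical departures page; swap in your transit agency's schedule URL
URL = 'https://example.com/departures'

def next_departures(limit=3):
    response = requests.get(URL)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Assumes each departure time lives in an element with class 'departure-time'
    times = [el.text.strip() for el in soup.find_all(class_='departure-time')]
    return times[:limit]

if __name__ == '__main__':
    print(next_departures())

On the device itself the same idea applies, just with CircuitPython's adafruit_requests in place of requests.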
What is web scraping?
My definition: Web scraping is the process of mining useful information from a web interface without physically viewing the website.

ChatGPT's definition: Web scraping is the process of automatically extracting information from websites. This is typically done through software that simulates human web surfing to collect specific data from various web pages. It allows for the gathering of data in a structured format, like text, images, or other content, which is otherwise presented in an unstructured format within web pages. Web scraping is widely used for various purposes, such as data analysis, automated testing of websites, gathering information for research, or monitoring changes to web content. Essentially, it turns the information available on web pages into usable data for various applications.
Wow, I'm sort of worried that my definition starts off almost exactly the same as ChatGPT's...
Regardless, the important takeaway here is that web scraping can turn any website into an unwilling API, which means any site on the internet can become a data source. You could build a recipe database and ditch the ads, SEO fluff, and pictures; track prices of a specific item; or send a Slack or Discord message any time your favorite blogger posts a new article.
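To make that last idea concrete, here's a rough sketch of a "new post" notifier. Everything here is a stand-in: the blog URL, the post-title class, and the webhook URL all depend on the site and chat service you point it at.

import time
import requests
from bs4 import BeautifulSoup

BLOG_URL = 'https://example.com/blog'  # hypothetical blog index page
WEBHOOK_URL = 'WEBHOOK_URL_HERE'       # your Slack/Discord incoming webhook

last_seen = None  # title of the newest post we've already announced

def check_for_new_post():
    global last_seen
    soup = BeautifulSoup(requests.get(BLOG_URL).text, 'html.parser')
    # Assumes the newest post title is the first element with class 'post-title'
    newest = soup.find(class_='post-title')
    if newest is None or newest.text == last_seen:
        return
    last_seen = newest.text
    # Slack-style payload; a Discord webhook expects {'content': ...} instead
    requests.post(WEBHOOK_URL, json={'text': f'New post: {newest.text}'})

if __name__ == '__main__':
    while True:
        check_for_new_post()
        time.sleep(600)  # poll every ten minutes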
Further ideas for how to use web scraping
The next project I'd like to use this for is some sort of data collection service, similar to my co-worker's grocery store idea: something that runs daily, gathers web data from an unstructured source, and compiles it into something interesting. I'm also working on a demo that creates interactive game worlds based on site data from Wikipedia. I'm not sure what I'd do with that, but I think it's at least an interesting idea.
If you like our content, please consider subscribing to our weekly newsletter. I'm biased, but it's pretty great.