Introduction: Simple, Powerful Web Scraper in 5 Minutes
We are IoTalabs, a group of Internet of Things enthusiasts who love hacking together different devices. Over the past few months we have learned how to scrape the internet in a really easy way, and we wanted to share our hacking tips with you all. Be sure to check out our current projects at our website: doteverything.co
2) Google Chrome
3) YQL: we will teach you how to use this
The world of online APIs is a tough one. While many websites and services now have their own APIs, these are usually heavily limited in how much you can use them and in what information they expose. An alternative to using a service’s API is to simply scrape its website for the data you want. Scraping involves parsing the HTML of a web page and finding information based on the standardized structure of the site. We have figured out the quickest way to scrape a website, so get ready!
Step 1: Theory Behind Scraping
So say I had a simple website that looked like the following:
We can see that the vital information we want lies in a span with the class “hiInstructables”. (Image 1) It turns out that websites are very consistent when labeling a piece of information. So we can assume that if there were multiple vital pieces of information that we needed, they would all be labeled with the same class, like this: (Image 2)
So this tackles the essence of scraping. Websites use a specific format for labeling their content. If we can figure out what that format is, then we can make a program that automatically looks for those labels in that format to get the data we need.
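To make the idea concrete, here is a minimal Python sketch that pulls out every piece of text labeled with a given class. The sample page is made up for illustration (only the class name “hiInstructables” comes from the example above), and the parser assumes the labeled elements contain plain text with no nested tags.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element that carries a given class.

    Assumption: matching elements hold plain text only (no nested tags).
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False   # True while we are inside a matching element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may list several names separated by spaces.
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.inside = True
            self.matches.append("")

    def handle_endtag(self, tag):
        self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.matches[-1] += data


# Hypothetical page: two values labeled with the same class.
PAGE = """
<html><body>
  <h1>My site</h1>
  <span class="hiInstructables">first value</span>
  <p>filler text we do not care about</p>
  <span class="hiInstructables">second value</span>
</body></html>
"""

collector = ClassTextCollector("hiInstructables")
collector.feed(PAGE)
print(collector.matches)
```

Everything outside the labeled spans is ignored, which is exactly why a consistent label is all a scraper needs.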
Step 2: Your First Scrape: Grabbing the Usernames Out of a Reddit Thread
The first step in building a scraper is always going to be identifying what our key information is labeled under. In this case, we want all the usernames in the comments of a Reddit thread. So we are going to use Google Chrome’s Inspect Element tool to find out what the username is labeled as. (image 1)
This should bring up the developer tools panel with the username highlighted: (image 2)
We see that all usernames in a Reddit thread are links with the class “author”. Now here’s the tricky part: we need some way to sort through all the different web page elements to get to the tags with the class “author”. As you can see, it’s not an easy journey, because these links lie in the:
<div class = "commentarea">
which then drops down into
<div id = "siteTable_t3_3rixq5" class = "sitetable nestedlisting">
which drops into even more html elements. To minimize the
Step 3: YQL (Yahoo Query Language)
So we’ve identified where in the web page our usernames are. We now just need to obtain that information in a traversable format. Normally, scrapers are built by loading the entire web page into a dense, tree-like XML node format. This is a headache. Loading a web page as JSON is much easier because it allows us to access elements directly using the . operator. To get the web page in JSON format, we are going to use Yahoo’s Query Language. Basically, YQL is an open tool built by Yahoo to query web pages into JSON. The actual language is very similar to MySQL. This is the link to the console:
Here's how it looks: (image 1)
So our query is pretty straightforward:
select * from html where url = "https://www.reddit.com/r/arduino/comments/3rixq5/i_programmed_a_robot_arm_to_feed_me_breakfast/" and xpath='//a[contains(@class,"author")]'
"select *" just means select everything from the web page whose url is our Reddit thread.
The xpath basically says: search through the whole page, at any depth, and return each <a> tag whose class contains “author”.
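If you want to see what that xpath does without YQL, here is a rough Python sketch using the standard library's ElementTree. The snippet is a made-up, well-formed stand-in for the thread's markup (only “commentarea”, the “siteTable_t3_3rixq5” div, and “author” come from the real page; the other classes and the usernames are hypothetical). ElementTree only supports a subset of XPath and has no contains(), so the class test is done in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the nested comment markup.
SNIPPET = """
<div class="commentarea">
  <div id="siteTable_t3_3rixq5" class="sitetable nestedlisting">
    <div class="entry">
      <a class="author extra-class" href="/user/example_user_1">example_user_1</a>
      <a class="bylink" href="/permalink">permalink</a>
    </div>
    <div class="entry">
      <a class="author" href="/user/example_user_2">example_user_2</a>
    </div>
  </div>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

# ".//a" plays the role of "//a": every <a> at any depth below the root,
# so we never have to walk through the nested divs by hand.
links = root.findall(".//a")

# Equivalent of contains(@class, "author"): keep links whose class
# attribute contains the substring "author".
authors = [a.text for a in links if "author" in a.get("class", "")]
print(authors)
```

Note that the “bylink” link is skipped even though it sits right next to an author link, because the filter only looks at the class.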
As you can see, the query is successful and returns all the usernames we wanted: (image 2)
To get this result in JSON format, just click the JSON tab: (image 3)
Now to get this from the console into a local variable, all we do is use the REST query (https://en.wikipedia.org/wiki/Representational_sta...) found at the bottom of the page. Our code with the proper async call is below: (image 1)
Using this code you can get all the usernames into an array.
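As a rough Python sketch of the same idea (this is not the code from the image), the snippet below builds a REST query URL and turns a JSON response into an array of usernames. Several things here are assumptions: the endpoint address, the "q" and "format" parameters, and the shape of the response (a "query" object whose "results" hold the matched "a" links, with the link text under "content"). The sample response and its usernames are made up, and no network request is sent.

```python
import json
from urllib.parse import urlencode

# Assumed address of the public YQL REST endpoint; copy the real one
# from the bottom of the console page.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"

THREAD_URL = (
    "https://www.reddit.com/r/arduino/comments/3rixq5/"
    "i_programmed_a_robot_arm_to_feed_me_breakfast/"
)
QUERY = (
    f'select * from html where url = "{THREAD_URL}" '
    'and xpath=\'//a[contains(@class,"author")]\''
)


def build_rest_url(query):
    """URL-encode the query and ask for JSON output (assumed parameters)."""
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": "json"})


def extract_usernames(response_text):
    """Pull the link text out of a response with the assumed shape."""
    results = json.loads(response_text)["query"]["results"] or {}
    links = results.get("a", [])
    if isinstance(links, dict):   # tolerate a single match given as an object
        links = [links]
    return [link["content"] for link in links if "content" in link]


# Made-up response in the assumed shape, standing in for a real reply.
SAMPLE_RESPONSE = json.dumps({
    "query": {
        "count": 2,
        "results": {
            "a": [
                {"class": "author", "href": "/user/example_user_1",
                 "content": "example_user_1"},
                {"class": "author", "href": "/user/example_user_2",
                 "content": "example_user_2"},
            ]
        },
    }
})

print(build_rest_url(QUERY))
print(extract_usernames(SAMPLE_RESPONSE))
```

To use it for real, you would fetch the URL from build_rest_url and pass the response body to extract_usernames, adjusting the keys if the JSON tab in the console shows a different layout.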