Web Scraping With Ruby


A short Q&A about this instructable.

Q: What the #$%* is web scraping and why would someone need it?

A: Most web pages on the internet do not offer a web API, and sometimes you need one. The idea of scraping is to take data from a web page and structure it in a way that your application can use (a script, an executable, a web page or even a database).

Q: Why?

A: Let's see. Say you are looking for an apartment in city X, within a certain area, and it needs to be over Y square meters. You can search with the tools the site provides (though sometimes your criteria are not searchable with those tools), but the results are not presented in the way you need or like. Now think about a script that gets the data for city X in the form that best suits your post-processing, automatically filters for the area you want, and displays only the apartments over Y square meters as a list, sorted with the cheapest first. All of this takes just a double click and works on Windows, Mac or Linux.
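The filtering and sorting step from the apartment example could look roughly like this. It is only a sketch: it assumes the listings were already scraped into an array of hashes, and the field names and sample values are invented for illustration.

```ruby
# Hypothetical listings, as they might look after scraping.
# The keys (:area, :size_m2, :price) and all values are invented for this sketch.
LISTINGS = [
  { title: 'Flat A', area: 'Old Town', size_m2: 55, price: 720 },
  { title: 'Flat B', area: 'Old Town', size_m2: 38, price: 510 },
  { title: 'Flat C', area: 'Harbour',  size_m2: 70, price: 650 },
  { title: 'Flat D', area: 'Old Town', size_m2: 62, price: 690 }
].freeze

# Keep only the listings in the wanted area that are larger than min_size,
# then sort them with the cheapest first.
def matching_listings(listings, area, min_size)
  listings
    .select { |l| l[:area] == area && l[:size_m2] > min_size }
    .sort_by { |l| l[:price] }
end

matching_listings(LISTINGS, 'Old Town', 50).each do |l|
  puts "#{l[:title]}: #{l[:size_m2]} m2, price #{l[:price]}"
end
```

With the sample data, Flat D is listed before Flat A because it is cheaper, and Flat B is dropped because it is too small.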

Q: Is scraping legal?

A: It is not illegal. You don't get data that you are not supposed to get; you just get it in an automated manner, and if you do it right you don't spam the server with unnecessary requests.
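One possible way to avoid unnecessary requests is to reuse the last downloaded copy until it is old enough. The sketch below is an assumption of mine, not part of the original script: the class name PoliteFetcher and the 10 second interval are made up, and the download is replaced by a fake block so the example runs offline.

```ruby
# A tiny cache that only calls the real fetcher when the last copy is
# older than min_interval seconds. The class name and the interval are
# assumptions made for this sketch.
class PoliteFetcher
  def initialize(min_interval, &fetcher)
    @min_interval = min_interval
    @fetcher = fetcher
    @last_time = nil
    @last_body = nil
  end

  # Returns the cached body if it is still fresh, otherwise fetches again.
  def get(now = Time.now)
    if @last_time.nil? || now - @last_time >= @min_interval
      @last_body = @fetcher.call
      @last_time = now
    end
    @last_body
  end
end

# Usage with a fake fetcher that counts how often it is called.
# In a real script the block would download the page instead.
calls = 0
fetcher = PoliteFetcher.new(10) { calls += 1; "page body #{calls}" }
start = Time.now
fetcher.get(start)       # first call: fetches
fetcher.get(start + 3)   # too soon: served from the cache
fetcher.get(start + 12)  # old enough: fetches again
puts "requests made: #{calls}"
```

Three reads result in only two real requests here, because the second read comes too soon after the first.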

Q: Will it always work, like a web API?

A: No. If the web page changes in a way that affects what you read, you will need to adapt your script to the new data layout. It is nothing too big or hard; I can do it in under a minute.
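A rough sketch of how a script can be kept easy to adapt: put the pattern that depends on the page layout in one place and complain when it stops matching. The names ITEM_PATTERN and extract_items and both sample strings are hypothetical, chosen only for this illustration.

```ruby
# The one place that knows about the page layout. If the site changes,
# this pattern is usually the only thing that has to be edited.
ITEM_PATTERN = /<li\b[^>]*>(.*?)<\/li>/m

# Returns the text of every list item, or raises if nothing matched,
# which usually means the layout changed.
def extract_items(html)
  items = html.scan(ITEM_PATTERN).flatten
  raise 'No items found, the page layout may have changed' if items.empty?
  items
end

old_layout = '<ul><li>first</li><li>second</li></ul>'
new_layout = '<div class="item">first</div><div class="item">second</div>'

p extract_items(old_layout)
begin
  extract_items(new_layout)
rescue RuntimeError => e
  puts e.message
end
```

When the second layout shows up, the script fails loudly instead of silently printing nothing, and the fix is limited to the one constant.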

Q: Can I get data that is not supposed to be accessed, like with SQL injection?

A: No, you can't. Scraping is not hacking; it is just a way to get only what you need from one or more websites.

Step 1: Detailed Info and Example

People out there will tell you that you need gem X or Y (like Nokogiri or Mechanize), but in most cases YOU DON'T NEED THEM.

All you need is a normal Ruby install and a text editor (Notepad++, or whatever you like).

I use RubyMine. It is not free, but I like it; it looks and feels like Visual Studio.

Now for the example. I play a game called Warframe (www.warframe.com), and the game has a system that offers one-time missions with nice rewards, but the missions are time limited and appear randomly. The official site has a Twitter account that announces the alert missions, and there are some fan-made sites too, even an Android application. On Windows you need to be logged in to the game or keep a browser window open with Twitter or one of the fan-made sites, because there is no application. Until now :D

I am going to use one of the fan-made sites to get the data needed (http://deathsnacks.com/wf/index.html).

Now for the code (http://pastebin.com/153FFXJf), commented and syntax highlighted.

---------

# http://deathsnacks.com/wf/index.html
require "open-uri"

# Start a new thread so the main thread can wait for Enter
t = Thread.new do
  loop do
    # Kernel#open no longer accepts URLs on Ruby 3.0+, so use URI.open
    page = URI.open('http://deathsnacks.com/wf/index.html').read

    # Grab every <li>...</li> element. The original pattern was lost in
    # formatting; this one is reconstructed from the description after the code.
    table_data = page.scan(/<li\b.*?<\/li>/m)

    table_data_refined = table_data.map do |data|
      # strip the HTML tags
      data = data.gsub(/<.+?>/, '')
      # add space after price
      data.gsub('0cr', '0cr ')
    end

    puts ' '
    puts ' Warframe Alerts by Neumann Gregor'

    table_data_refined.each do |entry|
      # keep only the records that start with a digit
      next unless entry.match?(/\A[[:digit:]]/)
      # insert spaces between lowercase and uppercase letters in string
      puts ' ' + entry.gsub(/(?<=[a-z])(?=[A-Z])/, ' ')
    end

    sleep 10
    Gem.win_platform? ? system("cls") : system("clear")
  end
end

# Wait for Enter, then stop the polling thread
gets
t.kill

---------

As you can see, we read the whole HTML page, then look for the <li> </li> tags and collect them in an array. We then strip the HTML tags, add some spaces for better readability, and print only the records that start with a number. This repeats every 10 seconds until you hit Enter, at which point the script quits.

I have attached the source code as a .rb file, plus an OCRA-generated exe for people who don't have Ruby installed and don't want to install it.
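To try the individual steps without touching the network, here is a small offline sketch. The sample markup, the item names and the method name readable_alerts are invented for this illustration, and the real page may be structured differently.

```ruby
# Invented sample markup standing in for the downloaded page.
SAMPLE_HTML = <<~HTML
  <ul>
    <li><b>7500cr</b><span>RareBlueprint</span></li>
    <li><i>News: server maintenance</i></li>
    <li><b>3000cr</b><span>ShinyHelmet</span></li>
  </ul>
HTML

# Applies the same steps as the script: find the list items, strip the
# tags, add a space after the price, keep the entries that start with a
# digit, and split words that were glued together.
def readable_alerts(html)
  html.scan(/<li\b.*?<\/li>/m)
      .map { |item| item.gsub(/<.+?>/, '') }
      .map { |text| text.gsub('0cr', '0cr ') }
      .select { |text| text.match?(/\A[[:digit:]]/) }
      .map { |text| text.gsub(/(?<=[a-z])(?=[A-Z])/, ' ') }
end

puts readable_alerts(SAMPLE_HTML)
```

For the sample above this prints "7500cr Rare Blueprint" and "3000cr Shiny Helmet"; the news entry is dropped because it does not start with a number.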

Participated in the Hack Your Day Contest

Participated in the Full Spectrum Laser Contest 2016

neumanngregor, 3 years ago:

Feel free to ask about the source code or anything else you want to know, and I will happily respond ASAP. I will update the executable if and when it needs to be fixed because the site changes. I recommend Ruby as an entry point into programming; it is easy and fun to work with. For some tasks it is much better than C++. The idea is to use the correct language for the task, and for me the correct language is the one that gives you short, easy code that does what you need.