Introduction: How to Build a Web Scraper


Many people use personal computers without taking full advantage of them. By learning a few basic principles and using free software, you can start to unlock much more of what a computer has to offer. This tutorial illustrates one way to build a “web-scraping” bot, or crawler. A crawler automatically collects data from web pages, which makes it a powerful tool for anyone who would otherwise copy that data by hand.

Step 1: Required Materials:


1 Personal Computer

- I will be using Windows 10 in this demonstration, but the same code and principles can be applied on other platforms.

Internet Connection

Google Chrome

Step 2: Previous Computer Experience:

While this tutorial does not require prior coding experience, it is recommended that users have a basic understanding of how to use a keyboard (copy and paste) and a mouse.

CAUTION: Always make sure you back up your important files. Improper installation may cause data corruption.

Step 3: Starting the Project


First, we need to download and install a program called Python 2.7.14. Go to “https://www.python.org/downloads/” and click “Download Python 2.7.14”. When the download finishes, run the file and install Python. To check that it installed, look in the C:\ drive for a folder called Python27. If it's there, Python installed successfully. If it's not there, try restarting your computer and running the installation program again.

Step 4:


Now we need to make Windows and Python play nice together.

Open Control Panel and select “System and Security”.

Step 5:


Select “System”.

Step 6:


Go to the left column and select “Advanced System Settings”. A new window should appear.

Step 7:


Click “Environment Variables”.

Step 8:


The window that opens has two sections: one called “User Variables” and another called “System Variables”. Under “System Variables”, click “New...”. (We are going to add two new variables.)

Step 9:


First, add a variable with Variable name: PYTHON and Variable value: C:\Python27\

Step 10:


Second, add a variable with Variable name: Python_Scripts and Variable value: C:\Python27\Scripts

Restart your computer.

Step 11:


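After the restart, open Command Prompt (press the Windows key and type “cmd”).

Enter the command: python

You should see something like this, with the version number of the Python you installed:

Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

If you don't see this, repeat the installation and environment variable steps (Steps 3 to 10), and check that C:\Python27\ and C:\Python27\Scripts also appear in the “Path” system variable, since Command Prompt only finds programs in folders listed there.

Press “Ctrl + Z” and then “Enter” to exit Python and return to the command line.

Close Command Prompt.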
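If the python command is not recognized, one possible way to narrow the problem down is to check which folders are actually on the search path. The following is a minimal sketch, not part of the original crawler: the function name missing_path_entries is made up for illustration, and the two folder names assume the default C:\Python27 install location used in this tutorial. You could save it as a .py file and run it with the full path to the interpreter (C:\Python27\python.exe).

```python
import os
import sys


def missing_path_entries(path_value, required_dirs, sep=os.pathsep):
    """Return the required directories that are absent from a PATH-style string."""

    def normalise(entry):
        # Ignore surrounding spaces, trailing slashes and letter case,
        # because Windows treats C:\Python27 and c:\python27\ as the same folder.
        return entry.strip().rstrip("\\/").lower()

    present = set(normalise(e) for e in path_value.split(sep) if e.strip())
    return [d for d in required_dirs if normalise(d) not in present]


if __name__ == "__main__":
    print("Running Python %d.%d.%d" % sys.version_info[:3])
    # Assumed install folders (the defaults used in this tutorial).
    required = ["C:\\Python27\\", "C:\\Python27\\Scripts"]
    missing = missing_path_entries(os.environ.get("PATH", ""), required)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Both Python folders are on PATH.")
```

Any folder the script reports as missing can be added to the “Path” entry in the same Environment Variables window.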

Step 12:


We officially have Python installed. Now we have to install a couple of small programs for our “crawler” to work.

Open a new Notepad file and copy and paste all the text from “https://bootstrap.pypa.io/get-pip.py”.

Save the text file as “get-pip.py” and move it into your Documents folder.

Open Command Prompt as administrator.

Step 13:


Type “cd documents” and press Enter.

Type “python get-pip.py” and press Enter.

Type “pip install selenium” and press Enter.

After Selenium is successfully installed, move on to the next step.
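To confirm that the installs worked before going further, you can ask Python whether it can import the new packages. This is a small sketch under the assumption that the package names are pip and selenium, as installed above; the helper name is_installed is made up for illustration.

```python
import importlib


def is_installed(module_name):
    """Return True if this Python interpreter can import the named module."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    for name in ("pip", "selenium"):
        status = "installed" if is_installed(name) else "NOT installed"
        print("%s: %s" % (name, status))
```

If either line reports NOT installed, rerun the corresponding command from this step in an administrator Command Prompt.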

Step 14:


Open a new notepad file.

Copy and paste all the code from https://pastebin.com/RbNpyc60 into the Notepad file.

Now the fun begins…

We have to decide what kind of data we want to scrape. For the sake of demonstration, I will use eBay item prices.

Let's say I want to sell my instrument, but I am not sure what the mean price is.

I can use the “crawler” to collect the prices for me.

Step 15:

In the Notepad file, look for a line that says

landing_page_url = 'https://xxxxxxxxxxxxxx.com'

I am going to copy the URL of the page I want to scrape and paste it here.

In this case it will be

landing_page_url = 'https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313.TR3.TRC2.A0.H0.Xmpc+2000xl.TRS0&_nkw=mpc+2000xl&_sacat=0'

This is the eBay search results page for “MPC 2000XL” (the instrument I want to sell).
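The part of that URL that carries the search words is the _nkw parameter, so the same kind of URL can be put together for a different search term. Below is a hedged sketch of that idea: the helper names build_search_url and search_term_of are made up for illustration, and it is an assumption that eBay accepts the shortened URL containing only the _nkw and _sacat parameters copied from the address above.

```python
try:
    from urllib.parse import urlencode, urlparse, parse_qs  # Python 3
except ImportError:
    from urllib import urlencode                             # Python 2
    from urlparse import urlparse, parse_qs

# Base address taken from the search URL shown in this step.
EBAY_SEARCH = "https://www.ebay.com/sch/i.html"


def build_search_url(search_term):
    """Build a short search URL; spaces in the term become '+' signs."""
    # A list of pairs keeps the parameters in a fixed order.
    query = urlencode([("_nkw", search_term), ("_sacat", "0")])
    return EBAY_SEARCH + "?" + query


def search_term_of(url):
    """Read the search words (_nkw) back out of a search URL."""
    query = parse_qs(urlparse(url).query)
    return query.get("_nkw", [""])[0]


landing_page_url = build_search_url("mpc 2000xl")
print(landing_page_url)
print(search_term_of(landing_page_url))
```

Pasting the address straight from the browser, as described above, works just as well; building it in code only helps if you want to scrape several different searches.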

Step 16:


Every single thing you see on a webpage is called an “element”. Each element has its own “address”, or position, within the page. We want the bot to grab and record certain elements but not others, so we have to tell the bot which ones we want it to grab.
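To make the idea of elements and addresses concrete, here is a small sketch that uses only Python's built-in HTML parser, not Selenium. The HTML snippet is a hypothetical, simplified imitation of a results page, and the class name PriceSpanParser is made up for illustration. It picks out only the span elements that sit inside a list item with the class lvprice and ignores everything else.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

# Hypothetical, simplified markup (not copied from a real page).
SAMPLE_HTML = """
<ul class="lvprices left space-zero">
  <li class="lvprice prc"><span>$450.00</span></li>
  <li class="lvshipping"><span>Free shipping</span></li>
</ul>
<ul class="lvprices left space-zero">
  <li class="lvprice prc"><span>$399.99</span></li>
</ul>
"""


class PriceSpanParser(HTMLParser):
    """Collect the text of each <span> inside an <li> whose class includes 'lvprice'."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_price_li = False
        self.in_span = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li":
            # Remember whether the list item we just entered is a price item.
            self.in_price_li = "lvprice" in classes
        elif tag == "span" and self.in_price_li:
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False
        elif tag == "li":
            self.in_price_li = False

    def handle_data(self, data):
        # Only keep text that appears inside a price <span>.
        if self.in_span and data.strip():
            self.prices.append(data.strip())


parser = PriceSpanParser()
parser.feed(SAMPLE_HTML)
print(parser.prices)
```

The two prices are collected and the shipping text is skipped, which is the same kind of filtering a CSS selector performs for the crawler.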

Go to the notepad file and locate a line that says,

Item_price_element_list[] = browser.find_elements_by_css_selector("xxxx") # Find the search box

Now open Chrome and navigate to the landing page URL that you pasted earlier.

Step 17:


Right-click the element you want to scrape and select “Inspect” (labeled “Inspect Element” in some browsers).

A new panel should open showing the page source code.

Step 18:


The element that you clicked on is now being highlighted in the source code window.

Right-click that highlighted portion, hover over “Copy”, and then select “Copy selector”.

Step 19:

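Paste it into the “xxxx” portion of the Item_price_element_list line, like this:

Item_price_element_list[] = browser.find_elements_by_css_selector("#item3f88323b1e > ul.lvprices.left.space-zero > li.lvprice.prc > span") # Find the search box

Note that the copied selector starts with the ID of one specific listing (#item3f88323b1e), so on its own it matches only that listing's price. To match the price of every listing on the page, assuming the other listings use the same structure, delete the ID portion and keep the rest: "ul.lvprices.left.space-zero > li.lvprice.prc > span".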
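As a rough illustration of what this line does when the crawler runs, here is a minimal sketch. It is not the pastebin script: the function names drop_leading_id and collect_prices are hypothetical, it assumes Selenium and a ChromeDriver matching your Chrome version are installed, and it uses Selenium's find_elements(By.CSS_SELECTOR, ...) call instead of the find_elements_by_css_selector shortcut, which newer Selenium releases no longer provide.

```python
def drop_leading_id(css_selector):
    """Remove a leading '#some-id >' step so the selector is not tied to one listing."""
    steps = [step.strip() for step in css_selector.split(">")]
    if steps and steps[0].startswith("#"):
        steps = steps[1:]
    return " > ".join(steps)


def collect_prices(landing_page_url, css_selector):
    """Open the page in Chrome and return the text of every matching element."""
    # Imported inside the function so the helper above works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()  # assumption: ChromeDriver is available
    try:
        browser.get(landing_page_url)
        elements = browser.find_elements(By.CSS_SELECTOR, css_selector)
        return [element.text for element in elements]
    finally:
        browser.quit()  # always close the browser window


copied = "#item3f88323b1e > ul.lvprices.left.space-zero > li.lvprice.prc > span"
print(drop_leading_id(copied))
```

Calling collect_prices(landing_page_url, drop_leading_id(copied)) would then open Chrome and return the price text of each listing, provided the page still uses the class names shown in this step.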

Step 20: Enjoy

Believe it or not, we are done. This program will create a list of prices from the first results page for us.

Save the Notepad file as crawler.py and move it to your Documents folder.

(CAUTION: IF YOU SAVE IT AS .TXT IT WILL NOT RUN)

Now open Command Prompt (it does not have to be in administrator mode).

Type “cd documents” and press Enter.

Type “python crawler.py” and press Enter.

You should see a list of prices.

Now I can find the mean and the median and properly list my instrument for a fair price!
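Working out the mean and median can also be left to Python. The sketch below assumes the crawler printed price strings such as “$450.00”; the sample list is hypothetical (not real eBay data), and the helper names parse_price, mean and median are made up for illustration. It avoids extra libraries so that it also runs on Python 2.7.

```python
import re


def parse_price(text):
    """Convert a price string such as '$1,299.00' to a float, or None if it has no number."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


def median(values):
    """Middle value of a non-empty list (average of the two middle values if even)."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2.0


# Hypothetical sample of scraped strings, including one that is not a price.
scraped = ["$450.00", "$399.99", "$1,200.00", "Free shipping", "$525.50"]
prices = [p for p in (parse_price(s) for s in scraped) if p is not None]
print("mean:   %.2f" % mean(prices))
print("median: %.2f" % median(prices))
```

The median is usually the safer guide for a listing price, because a single unusually expensive listing pulls the mean upward much more than it moves the median.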

Comments

Swansong (author) 2017-10-10

That's a neat setup :)