Introduction: How to Build a Web Scraper
Many people use personal computers without utilizing them to their fullest capabilities. By learning a few basic principles and using free software, you can start to unlock the power and resources a computer has to offer. This tutorial illustrates one method of constructing a “web-scraping” bot, or crawler. These crawlers can automatically collect many kinds of data from web pages, which makes them a powerful tool for any computer user.
Step 1: Required Materials:
1 Personal Computer
- I will be using Windows 10 in this demonstration, but the same code and principles can be applied across all platforms, even mobile.
Step 2: Previous Computer Experience:
While this tutorial does not require prior coding experience, it is recommended that users have a basic understanding of how to use a keyboard (copy and paste) and how to use a mouse.
CAUTION: Always make sure you back up your important files. Improper installation may cause data corruption.
Step 3: Starting the Project
First, we need to download and install a program called Python 2.7.14. Go to “https://www.python.org/downloads/” and click “Download Python 2.7.14”. After it is done downloading, run the file and install Python. To check that it installed, look in the C:\ drive and find a folder called Python27. If it’s there, Python installed successfully. If it’s not there, try restarting your computer and running the installation program again.
Now we need to make Windows and Python play nice together.
Open Control panel and select "System and Security"
Go to the left column and select “Advanced System Settings”. A new window should appear.
Click “Environment Variables”
You will see two sections, one called “User Variables” and another called “System Variables”. Under “System Variables”, click “New...” (we are going to ADD two new variables).
First ADD Variable name: PYTHON Variable value: C:\Python27\
Second ADD Variable name: Python_Scripts Variable value: C:\Python27\Scripts
If the python command is not recognized later on, also add these two folders to the existing “Path” system variable.
Restart your computer.
After the restart, open Command Prompt (hit the Windows key and type “cmd”).
Enter the command: python
You should see something like:
“Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.”
>>>
The version number shown will match the version you installed. If you don’t see this, repeat the installation and environment variable steps above.
Press “Ctrl + Z” and then “Enter” to exit Python and return to the main command line.
Close Command Prompt.
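If you want a check that goes a little further than reading the banner, a tiny script can report which Python is actually running and where it lives. This is only a minimal sketch: the function name describe_python and the file name check_python.py are made up for illustration. Save it as check_python.py in your Documents folder and run it with “python check_python.py”.

```python
import sys


def describe_python(version_info=sys.version_info):
    """Return the major.minor version as a string, for example '2.7'."""
    return "%d.%d" % (version_info[0], version_info[1])


# Show the version and the full path of the interpreter that ran this file.
print("Running Python " + describe_python())
print("Interpreter location: " + sys.executable)
```

If the location printed is inside C:\Python27, Windows is finding the copy of Python installed above.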
We officially have Python installed. Now we have to install a couple of small programs for our “crawler” to work.
Open a new notepad file and copy and paste all the text from “https://bootstrap.pypa.io/get-pip.py”
Save the text file as “get-pip.py” and move it into your documents folder.
Open Command Prompt as administrator
Type “cd documents” and press Enter
Type “python get-pip.py” and press Enter
Type “pip install selenium” and press Enter
After Selenium is successfully installed, move on to the next step
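To confirm that the install worked before moving on, you can ask Python whether it is able to import Selenium. The sketch below is just one way to do that; the helper name is_installed is hypothetical and not part of Selenium or pip.

```python
def is_installed(module_name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


# Prints True when "pip install selenium" succeeded for this Python.
print("selenium installed: %s" % is_installed("selenium"))
```

If this prints False, run “pip install selenium” again and read the messages it prints for errors.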
Open a new notepad file.
Copy and paste all the CODE from https://pastebin.com/RbNpyc60 into the notepad
Now the fun begins…
We have to decide what kind of data we want to scrape. For the sake of demonstration, I will use eBay item prices.
Let’s say I want to sell my instrument, but I am not sure what the mean price is.
I can use the “crawler” to collect the prices for me.
In the notepad file, look for a line that says
landing_page_url = 'https://xxxxxxxxxxxxxx.com'
I am going to copy and paste the URL from the page I want to scrape here.
In this case it will be the eBay search results page for “MPC 2000XL” (the instrument I want to sell).
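A common slip at this point is pasting an address without the “https://” part, which the browser tolerates but the bot may not. The sketch below shows one way to sanity check a pasted address. The function name looks_like_web_url and the example.com address are placeholders for illustration, not part of the crawler code.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def looks_like_web_url(url):
    """Return True if url starts with http:// or https:// and names a host."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Placeholder address; replace it with the URL you copied from your browser.
landing_page_url = 'https://www.example.com/search?q=test'
print(looks_like_web_url(landing_page_url))
```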
Every single thing you see on a webpage is called an “element”. Each element has its own “address” or “position” on the page. We want the bot to grab and record certain elements, but not others. We do this by telling the bot which elements we want it to grab.
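To make the idea concrete, here is a rough sketch of what “telling the bot which elements to grab” can look like with Selenium. It is not the pastebin code, only an illustration under some assumptions: the names collect_texts and scrape_texts are made up, Chrome and a matching driver are assumed to be available, and it uses the find_elements(By.CSS_SELECTOR, ...) spelling, which newer Selenium releases expect in place of find_elements_by_css_selector.

```python
def collect_texts(elements):
    """Return the non-empty visible text of each element, in page order."""
    texts = []
    for element in elements:
        text = element.text.strip()
        if text:
            texts.append(text)
    return texts


def scrape_texts(url, css_selector):
    """Open url in Chrome and return the text of every matching element."""
    # Imported here so collect_texts can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()  # assumption: Chrome and its driver are set up
    try:
        browser.get(url)
        matches = browser.find_elements(By.CSS_SELECTOR, css_selector)
        return collect_texts(matches)
    finally:
        browser.quit()  # always close the browser window, even after an error
```

The address of an element is written as a CSS selector, and the bot keeps only the elements that the selector matches.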
Go to the notepad file and locate a line that says,
Item_price_element_list = browser.find_elements_by_css_selector("xxxx") # Find the search box
Now open Chrome and navigate to the landing page that you pasted earlier.
Right click the element you want to scrape and select “Inspect”.
A new panel should open and you should be able to view the page source code.
The element that you clicked on is now highlighted in the source code panel.
Right click that highlighted portion and hover over “Copy”.
Then select “Copy selector” (some browsers label this “Copy CSS Selector”).
Paste it into the “xxxx” portion of the Item_price_element_list line, like this:
Item_price_element_list = browser.find_elements_by_css_selector("#item3f88323b1e > ul.lvprices.left.space-zero > li.lvprice.prc > span") # Find the search box
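One thing to watch for: the copied selector starts with “#item3f88323b1e”, which is the ID of a single listing, so on its own it only matches the price inside that one item. If your list comes back with just one price, one thing to try is dropping the leading ID so the rest of the selector can match every listing. The sketch below does that with plain string handling. The function name generalize_selector is made up for illustration, and whether the shortened selector matches the page you are scraping is an assumption you will need to check in the browser.

```python
def generalize_selector(selector):
    """Drop a leading '#id' step from a 'a > b > c' style CSS selector."""
    steps = [step.strip() for step in selector.split(">")]
    if len(steps) > 1 and steps[0].startswith("#"):
        steps = steps[1:]  # remove the part that pins the selector to one item
    return " > ".join(steps)


copied = "#item3f88323b1e > ul.lvprices.left.space-zero > li.lvprice.prc > span"
print(generalize_selector(copied))
```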
Step 4: Enjoy
Believe it or not, we are done. This program will create a list of prices from the first result page for us.
Save the notepad file as crawler.py and move it to your Documents Folder
(CAUTION: IF YOU SAVE IT AS .TXT IT WILL NOT RUN)
Now open up CMD (does not have to be in administrative mode)
Type “cd documents”
Type “python crawler.py”
You should see a list of prices
Now I can find the mean and the median and properly list my instrument for a fair price!
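If you would rather not do that arithmetic by hand, a few more lines of Python can do it. This is a minimal sketch that assumes the prices come back as text such as “$450.00”. The sample prices and the function names are invented for illustration, and the mean and median are written out by hand so the code also runs on Python 2.7.

```python
import re


def parse_price(text):
    """Convert text such as '$1,234.50' to 1234.5; return None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


def median(values):
    """Middle value of a non-empty list (average of the two middle values if even)."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2.0


# Invented sample data standing in for the list the crawler prints.
scraped = ["$450.00", "$525.00", "$1,200.00", "Best offer"]
prices = [p for p in (parse_price(s) for s in scraped) if p is not None]
print("mean: %.2f" % mean(prices))
print("median: %.2f" % median(prices))
```

The median is often the safer number to price by, since one unusually expensive listing can pull the mean up.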