Introduction: Spidering an Ajax Website With an Asynchronous Login Form

About: Bilal Ghalib is interested in doing things that surprise him and inspire others. Let's create a future we want to live in together.

The problem: Most spidering tools can't authenticate through an AJAX login form.
This instructable will show you how to log in through an AJAX form using Python and a module called Mechanize.

Spiders are web automation programs that are becoming an increasingly popular way for people to gather data online. They creep around the web gathering precious materials to fuel the most powerful web companies around. Others crawl around and gather specific sets of data to improve decision making, infer what's currently "in", or find the cheapest travel routes.

Spiders (web crawlers, webbots, or screen scrapers) are great for turning HTML goop into some semblance of intelligent data, but we have a problem when it comes to AJAX-enabled webpages: their JavaScript and cookie-based sessions are not navigable with the normal set of spidering tools. In this instructable we will be accessing our own member page at pubmatic.com. These steps will show you a method to follow, but your page will be different.

Have fun!

Step 1: Gather Materials

You will need to start supplementing your programming resources with the following programs. Use their guides to help you install them.

Install Firebug
It's a Firefox addon

Install Python
Go to: python.org

Install the Mechanize Module
Get Mechanize

Other useful Spidering tools:
BeautifulSoup

Step 2: Find the Headers Necessary to Create a Session.

A well crafted spider will access a webpage as if it were a browser being controlled by a human being, keeping clues as to its true origin hidden. Part of the interaction between browsers and servers happens through GET and POST requests, whose details you can find in the headers (this information is rarely displayed in a browser, but is very important). You can view some of this information by pressing Ctrl I (in firefox) to open up the Page Info window. To disguise yourself as a mild mannered browser you must identify yourself using the same credentials.

If you tried to log into pubmatic with javascript disabled in your browser you wouldn't get very far, since the redirects are done through javascript. Most spider browsers don't have javascript interpreters, so we will have to get past the login through an alternative route.

Let's start by getting the header information sent from the browser when you click submit. If this were an ordinary login you would use Mechanize to fill out the form and click submit. Normal login forms are enclosed in a <form> ... </form> tag, and Mechanize would be able to submit this and fetch the next page without trouble. Since we don't have a complete form tag, the submitting is handled by javascript. Let's check pubmatic's submitForm function. To do this, first open the webpage in firefox and turn on firebug by clicking the firefly in the lower right hand corner. Then click the script tab, copy all the code that appears and paste it into your favorite text editor. You can then delete all the code except the function submitForm: keep the line that starts with "function submitForm(theform) {" and everything between it and the function's closing curly bracket "}".
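If you would rather not trim the script by hand, the same cut can be sketched in Python. This is a minimal sketch, not part of the original method: the helper name extract_function and the sample script are made up for illustration, and the brace counting is naive (it assumes no stray braces inside strings or comments).

```python
def extract_function(source, name):
    """Return the text of `function name(...) {...}` from JavaScript source.

    Naive brace counting: braces inside strings or comments are not
    handled, so this only suits simple scripts.
    """
    start = source.find("function " + name + "(")
    if start == -1:
        return None
    open_brace = source.find("{", start)
    if open_brace == -1:
        return None
    depth = 0
    for i in range(open_brace, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:
                # Matching close of the function body found
                return source[start:i + 1]
    return None  # unbalanced braces


# Hypothetical page script, standing in for what Firebug shows
SAMPLE_JS = """
function helper() { return 1; }
function submitForm(theform) {
    if (theform.email.value != "") {
        window.location = "home.jsp";
    }
}
function other() {}
"""

submit_form_source = extract_function(SAMPLE_JS, "submitForm")
print(submit_form_source)
```

Paste your own copied script in place of SAMPLE_JS to try it on the real page.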

A quick analysis of this function shows that some authentication happens and brings back a variable called xmldoc that is parsed as XML. This is a key feature of AJAX: the page has polled the server and brought back an XML document that contains a tree of information. If the authentication was successful, the node session_id contains the session ID. You can tell this by looking at this bit of code: "if (session_id != null) { //login successful".
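To make the idea concrete, here is a rough Python 3 sketch of the same check the javascript performs: parse the XML and look for a session_id node. The sample documents are hypothetical, since the real response layout is not shown here; only the node name session_id comes from the page's script.

```python
import xml.etree.ElementTree as ET


def extract_session_id(xml_text):
    """Return the text of the first session_id node, or None if absent or empty."""
    root = ET.fromstring(xml_text)
    if root.tag == "session_id":
        node = root
    else:
        node = root.find(".//session_id")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


# Hypothetical responses; the real document layout may differ
LOGIN_OK = "<response><session_id>60F194BE2A5D</session_id></response>"
LOGIN_FAILED = "<response><error>bad password</error></response>"

print(extract_session_id(LOGIN_OK))
print(extract_session_id(LOGIN_FAILED))
```

A return value of None plays the role of the javascript's "session_id == null" case, meaning the login failed.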

Now we want to prevent this bit of javascript from taking us anywhere, so we can see what is being posted to the server during authentication. To do this we comment out any window redirects, which look like this: "window.location=...". To comment them out, add double slashes before them like so: "//window.location...". This prevents the code from being run.
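The edit is easy to do by hand, but for a long script it could be automated along these lines. This is only a sketch under one assumption: each redirect sits on its own line. If a redirect shares a line with an "if" or a curly bracket, commenting out the whole line would break the script, so check the result before pasting it into Firebug. The helper name and sample are made up for illustration.

```python
import re

# Matches an assignment to window.location, but not a == comparison
REDIRECT = re.compile(r"window\.location\s*=(?!=)")


def comment_out_redirects(js_source):
    """Prefix every line that assigns window.location with //."""
    edited = []
    for line in js_source.splitlines():
        stripped = line.lstrip()
        if REDIRECT.search(line) and not stripped.startswith("//"):
            # Keep the original indentation in front of the slashes
            indent = line[:len(line) - len(stripped)]
            line = indent + "//" + stripped
        edited.append(line)
    return "\n".join(edited)


SAMPLE = """if (session_id != null) { //login successful
    window.location="http://pubmatic.com/05_homeloggedin.jsp";
}"""

edited_js = comment_out_redirects(SAMPLE)
print(edited_js)
```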

You can download the Javascript file below which has these edits already made.

Copy and paste this edited bit of javascript into the right hand side of the console window and click run. This overrides the javascript function already in the page with our new version. Now when you fill out your credentials and click submit you should see POST and GET header information fill the console, but you won't be going anywhere.

The POST information is what the AJAX functions send to the server. You want your spider's requests to match it as closely as possible, so copy and paste that information into a notepad.
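Once the headers are saved as text, a few lines of Python can turn them into something a script can use. The sketch below assumes the text is in the usual "Name: value" form, one header per line; the exact layout Firebug gives you may differ, and the helper name parse_headers is made up for illustration.

```python
def parse_headers(raw_text):
    """Turn 'Name: value' lines into a dict, skipping lines without a colon."""
    headers = {}
    for line in raw_text.splitlines():
        # Split on the first colon only, so URLs in the value survive
        name, sep, value = line.partition(":")
        if sep and name.strip():
            headers[name.strip()] = value.strip()
    return headers


# Sample text in the assumed layout, using values from this instructable
RAW = """Host: pubmatic.com
Accept-Language: en-us,en;q=0.5
Referer: https://pubmatic.com/04_betasignin.jsp
Content-Length: 60
"""

captured = parse_headers(RAW)
for name in sorted(captured):
    print(name + " = " + captured[name])
```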

Step 3: Prepare the Code

Before we add the new headers we've found, let's create a template Mechanize login script in Python. We're doing this for two reasons: first, so we have a working component to add new stuff to, and second, so you see how you would normally log in to a non-AJAX-y webpage.

Open notepad or equivalent, and copy and paste the following. When you're done save it as yourfilename.py somewhere you can find it.

#!/usr/bin/python
# -*- coding: utf-8 -*-

#Start with your module imports:
from mechanize import Browser

#Create your browser instance through the Browser() function call;
br = Browser()

#Set the browser so that it ignores robots.txt
#Do this carefully, if the webpage doesn't like spiders, they might be upset to find you there

br.set_handle_robots(False)

#Open the page you want to login to
br.open("https://pubmatic.com/04_betasignin.jsp")

#Because I know the form name, I can simply select the form by the name
br.select_form("login")

#Using the names of the form elements, fill in your own login details
br['email'] = "laser+pubmatic@instructables.com"
br['password'] = "Asquid22"

#br.submit() sends out the form and fetches the resulting page
#response below contains the resulting page
response = br.submit()

#This will print the body of the webpage received
#print response.read()
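For a sense of what a plain form submit boils down to, the sketch below builds (but never sends) an equivalent POST request with the Python 3 standard library instead of Mechanize. The credentials are placeholders, and the target URL is an assumption: a real form posts to whatever its action attribute says.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder credentials; the field names match the form in the script above
fields = {"email": "user@example.com", "password": "secret"}

# Encode the fields the way a browser encodes a submitted form
body = urlencode(fields).encode("ascii")

# Assumed target URL, for illustration only; nothing is sent over the network
request = Request(
    "https://pubmatic.com/04_betasignin.jsp",
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(request.get_method())   # a request that carries data is sent as POST
print(body.decode("ascii"))   # the encoded form body
print(len(body))              # the value a Content-Length header would carry
```

The body length printed at the end is the number a browser reports in its Content-Length header for the same form.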

Step 4: Send the Right Signals.

Mechanize makes it easy to add headers to outgoing requests, which lets us appear to be the same browser that you used to access the page the first time. Open up the file with the headers you found using Firebug and edit the text below to match. Replace everything in the quotes with the proper item from your header list:

USER_AGENT = "Mozilla/5.0 (X11; U; Linux i686; tr-TR; rv:1.8.1.9) Gecko/20071102 Pardus/2007 Firefox/2.0.0.9"
HOST = "pubmatic.com"
ACCEPT = "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5"
ACCEPT_LANGUAGE = "en-us,en;q=0.5"
ACCEPT_ENCODING = "gzip,deflate"
ACCEPT_CHARSET = "ISO-8859-1,utf-8;q=0.7,*;q=0.7"
KEEP_ALIVE = "300"
CONNECTION = "keep-alive"
CONTENT_TYPE = "application/x-www-form-urlencoded"
REFERER = "https://pubmatic.com/04_betasignin.jsp"
CONTENT_LENGTH = "60"
COOKIE = "utma=103266945.1970108054.1210113004.1212104087.1212791201.20; KADUSERCOOKIE=EA2C3249-E822-456E-847A-1FF0D4085A85; utmz=103266945.1210113004.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); JSESSIONID=60F194BE2A5D31C3E8618995EB82C3C1.TomcatTwo; utmc=103266945"
PRAGMA = "no-cache"
CACHE_CONTROL = "no-cache"

This creates a set of variables that you can then attach to the browser's requests. Mechanize reads extra headers from the addheaders attribute, which is a single list of (name, value) pairs:
br.addheaders = [
    ("Host", HOST),
    ("User-agent", USER_AGENT),
    ("Accept", ACCEPT),
    ("Accept-Language", ACCEPT_LANGUAGE),
    ("Accept-Encoding", ACCEPT_ENCODING),
    ("Accept-Charset", ACCEPT_CHARSET),
    ("Keep-Alive", KEEP_ALIVE),
    ("Connection", CONNECTION),
    ("Content-Type", CONTENT_TYPE),
    ("Referer", REFERER),
    #Content-Length is left out: it is worked out for each request, and a
    #fixed value can stall a request whose body is a different size
    ("Cookie", COOKIE),
    ("Pragma", PRAGMA),
    ("Cache-Control", CACHE_CONTROL),
]

Now when we call the page open function the headers will be sent to the server as well.
br.open("https://pubmatic.com/04_betasignin.jsp")
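If you parsed your captured headers into a dictionary, a small helper can build the list of (name, value) pairs that header lists such as Mechanize's addheaders attribute expect. This is a sketch: the helper name to_addheaders is made up, and dropping Content-Length is a judgment call made here because that value is worked out for each request.

```python
# Headers the HTTP layer works out for each request; stale copies of
# these can do more harm than good, so this sketch drops them
COMPUTED_PER_REQUEST = {"content-length"}


def to_addheaders(headers):
    """Convert a {name: value} dict into a list of (name, value) pairs."""
    return [
        (name, value)
        for name, value in headers.items()
        if name.lower() not in COMPUTED_PER_REQUEST
    ]


# A few captured values from this instructable
captured = {
    "Referer": "https://pubmatic.com/04_betasignin.jsp",
    "Pragma": "no-cache",
    "Content-Length": "60",
}

pairs = to_addheaders(captured)
for name, value in pairs:
    print(name + ": " + value)
```

The resulting list could then be assigned to the browser in one go, for example br.addheaders = pairs.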

Step 5: Mechanized Cookies

This step needs no new code, because Mechanize automates cookie handling, but it's important to know what's happening:

When the form is submitted you have the right headers, as if you had submitted using the javascript function. If the username and password are correct, the server authenticates them, generates a session ID and saves it in a cookie. The good news is Mechanize automatically eats and regurgitates cookies, so you don't need to worry about sending and receiving the cookie. Once you have a session ID that works, you can enter the members only section of the website.
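To see what "eating and regurgitating" means, here is a small sketch using the Python 3 standard library rather than Mechanize itself. The Set-Cookie value is hypothetical; a real server would send its own session ID and attributes.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie value, as a server might send after a successful login
set_cookie_value = "JSESSIONID=60F194BE2A5D; Path=/"

jar = SimpleCookie()
jar.load(set_cookie_value)   # "eat" the cookie from the response

# "Regurgitate" it: build the Cookie header for the next request
cookie_header = "; ".join(
    name + "=" + morsel.value for name, morsel in jar.items()
)

print(jar["JSESSIONID"].value)
print(cookie_header)
```

Mechanize does the equivalent of both halves for you on every request, which is why the script never touches the session cookie directly.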

Step 6: Key to the Heart

Now that we've acquired a session ID and Mechanize has saved it in its cookies, we can follow the javascript to see where we need to go. Look inside the "if (session_id != null) { //login successful" block to see where the page goes on success. From the window relocation code: "if (adurlbase.search(/pubmatic.com/) != -1) { window.location="http://pubmatic.com/05_homeloggedin.jsp" + "?v=" + Math.random()*10000;" we see that we need to go to the page located at http://pubmatic.com/05_homeloggedin.jsp?v=some random number. So let's just make up a random number and open that page with the same browser instance, keeping the response in a new variable:

response2 = br.open("http://pubmatic.com/05_homeloggedin.jsp?v=2703")
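A hard-coded number works, but if you would rather mimic the javascript more closely, the URL could be built like this. It is only a sketch: the helper name logged_in_url is made up, and the purpose of the v parameter (presumably to defeat caching) is a guess.

```python
import random


def logged_in_url(base="http://pubmatic.com/05_homeloggedin.jsp"):
    """Append ?v=<random number in [0, 10000)>, like the page's JavaScript."""
    return base + "?v=" + str(random.random() * 10000)


url = logged_in_url()
print(url)
```

The result can be passed to br.open() in place of the fixed URL above.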

And that should be it. Your code is now complete: by using the proper headers and Mechanize's cookie handling we can now access the innards of pubmatic.

Open up a terminal, load the python file below and log in away. To do this type python2.5 followed by the file path to the .py file.