Introduction: Beginning Web Page Scraping With PHP.
We have done some web page scraping with bash, and now we want to step up the power of the code with a web scripting language called PHP. That is usually the P in the (W/M/L)AMP stack built around an Apache2 web server. I will show you the results of the script, dissect the script, and then finally give you a hint about script debugging.
Note: This is not just for astrology. You can do this with any text-based pages.
Step 1: A Simple PHP Script to Start.
Using the following script, we can extract a section of a page without having to read the whole page ourselves. Again, we can let the computer act as a secretary or research clerk for us.
s6.php:
[code]
<?php
$data = file_get_contents('http://www.astrologycom.com/virgodaily.php');
$regex = '/(Thu Apr 5:.*)Fri Apr 6/';
preg_match($regex,$data,$match);
// var_dump($match);
echo $match[1];
?>
[/code]
Results:
Thu Apr 5: Top Gear
- You feel as though things are starting to move at last! In sporting or physical activities, take no risks. Not that you will be particularly accident-prone, but you might think you are capable of more than you actually are. With Mars active as Mercury begins to move, you feel ambitious and determined, so will work hard to gain your ends, especially where legal matters and partners are concerned. Fortunate colors are red coral and charcoal grey. Lucky numbers are 29 and 35.
Step 2: What Does the Script Do?
Now let's walk through the script.
This is the standard beginning of a PHP script. It means you need PHP installed; a LAMP web server will have it.
<?php
We need to identify and download the web page to parse for the information we need.
$data = file_get_contents('http://www.astrologycom.com/virgodaily.php');
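The download can fail, and file_get_contents() then returns false instead of a string. Here is a minimal sketch of a check, assuming PHP 7.1 or later. The helper name fetch_page is made up for this example and is not part of s6.php. To keep the demo runnable without a network, it reads a temporary file; for a real URL, PHP's allow_url_fopen setting has to be on.

```php
<?php
// Hypothetical helper (not part of the original s6.php): fetch a page and
// report failure as null instead of passing a false value along.
function fetch_page(string $url): ?string
{
    // The @ suppresses PHP's warning; the failure is handled here instead.
    $data = @file_get_contents($url);
    return $data === false ? null : $data;
}

// file_get_contents() reads local paths as well as URLs, so this demo uses
// a temporary file as a stand-in for a live web page.
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, 'Thu Apr 5: Top Gear');

$page = fetch_page($tmp);
echo $page === null ? "Could not read the page\n" : $page . "\n";

// Once the file is removed, the same call reports the failure as null.
unlink($tmp);
$missing = fetch_page($tmp);
echo $missing === null ? "Could not read the page\n" : $missing . "\n";
```

In the real script you would pass the page URL to the helper and stop early when it returns null.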
Now we need to decide what data to extract. The $regex defines that: what to include at the start, and what marks the end without being included. The parentheses capture everything from the text "Thu Apr 5:" onward, and the match stops at "Fri Apr 6", which sits outside the parentheses and so is left out of the result. Usually what we start or end with is something that never changes on the page. The date markers used here are an exception and only fit the page as it looked on that one day.
$regex = '/(Thu Apr 5:.*)Fri Apr 6/';
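Since the markers are dates, the pattern could be built from a date instead of being typed by hand. This is only a sketch, assuming PHP 7.1 or later and assuming the page keeps the "Thu Apr 5" style of day labels. The helper name build_day_regex and the sample string are made up for the example, and the year 2012 is picked only because April 5 fell on a Thursday that year.

```php
<?php
// Hypothetical helper: build the pattern from a date rather than typing
// the markers by hand. The format 'D M j' produces text like "Thu Apr 5".
function build_day_regex(DateTimeImmutable $day): string
{
    $start = $day->format('D M j') . ':';
    $end   = $day->modify('+1 day')->format('D M j');
    // preg_quote() escapes any regex metacharacters inside the markers.
    return '/(' . preg_quote($start, '/') . '.*)' . preg_quote($end, '/') . '/';
}

// Assumed year: 2012, chosen only because April 5 was a Thursday then.
$regex = build_day_regex(new DateTimeImmutable('2012-04-05'));

// Made-up sample text standing in for the downloaded page.
$sample = 'Wed Apr 4: Old text Thu Apr 5: Top Gear - sample text Fri Apr 6: More';

if (preg_match($regex, $sample, $match) === 1) {
    echo $match[1], "\n";
}
```

For a daily run you would pass new DateTimeImmutable('today') instead of a fixed date.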
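Do the search. preg_match() puts the whole match in $match[0] and the text captured by the parentheses in $match[1].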
preg_match($regex,$data,$match);
This line is commented out with the //, but it is great for debugging to see what is in the variables. Uncommented, it yields:
array(2) { [0]=> string(668) "Thu Apr 5: Top Gear
- You feel as though things are starting to move at last! In sporting or physical activities, take no risks. Not that you will be particularly accident-prone, but you might think you are capable of more than you actually are. With Mars active as Mercury begins to move, you feel ambitious and determined, so will work hard to gain your ends, especially where legal matters and partners are concerned. Fortunate colors are red coral and charcoal grey. Lucky numbers are 29 and 35.
Fri Apr 6" [1]=> string(659) "Thu Apr 5: Top Gear
- You feel as though things are starting to move at last! In sporting or physical activities, take no risks. Not that you will be particularly accident-prone, but you might think you are capable of more than you actually are. With Mars active as Mercury begins to move, you feel ambitious and determined, so will work hard to gain your ends, especially where legal matters and partners are concerned. Fortunate colors are red coral and charcoal grey. Lucky numbers are 29 and 35. " }
// var_dump($match);
Show the result of the extraction.
echo $match[1];
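If the pattern does not match, for example because the page has changed, $match stays empty and reading $match[1] triggers an "Undefined offset" notice. One way to guard against that is to check what preg_match() returns before using the result. This is a sketch, assuming PHP 7.1 or later; the wrapper name extract_section and the sample strings are made up for the example.

```php
<?php
// Hypothetical wrapper around the preg_match() call from s6.php.
// Returns the captured text, or null when the pattern does not match.
function extract_section(string $regex, string $data): ?string
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (preg_match($regex, $data, $match) === 1) {
        return $match[1];
    }
    return null;
}

$regex = '/(Thu Apr 5:.*)Fri Apr 6/';

// Stand-in strings; the real script would pass the downloaded page.
$withMarkers    = 'Thu Apr 5: Top Gear - sample text Fri Apr 6: next day';
$withoutMarkers = 'Mon Oct 1: a page from a different month';

$found  = extract_section($regex, $withMarkers);
$absent = extract_section($regex, $withoutMarkers);

echo $found === null ? "No match\n" : $found . "\n";
echo $absent === null ? "No match\n" : $absent . "\n";
```

The second call prints "No match" instead of raising a notice, which makes it easier to tell a changed page from a broken script.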
End the script.
?>
Step 3:
If you get a blank screen, something went wrong. You can always test the script from the command line on the host machine. PHP5 is the version of PHP we are using.
$ php5 s6.php
PHP Parse error: syntax error, unexpected '>' in /var/www/pport/s6.php on line 7
$
Then you can go back and edit the script to correct it.
In this case, the last line was just ">" instead of "?>". Details count.
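On newer versions of PHP there is another way to see the same kind of message from inside a script. This sketch assumes PHP 7 or later (not the PHP5 used above), where a syntax error in an included file is thrown as a ParseError that can be caught. The helper name find_parse_error and the three-line broken script are made up for the example.

```php
<?php
// Hypothetical helper, assuming PHP 7 or later: include a file and report
// a syntax error as text instead of ending up with a blank screen.
// Note that a file without syntax errors is actually run by the include.
function find_parse_error(string $path): ?string
{
    try {
        include $path;
    } catch (ParseError $e) {
        return $e->getMessage() . ' on line ' . $e->getLine();
    }
    return null;
}

// Recreate the mistake: the last line is ">" instead of "?>".
$broken = "<?php\n\$x = 1;\necho \$x;\n>\n";
$tmp = tempnam(sys_get_temp_dir(), 's6');
file_put_contents($tmp, $broken);

$report = find_parse_error($tmp);
unlink($tmp);

echo $report === null ? "No syntax errors found\n" : $report . "\n";
```

The exact wording of the message differs between PHP versions, so treat the printed text as a pointer to the line to inspect.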
Step 4: A Last Word.
Experiment with your own pages and see what happens. So far, pretty simple. Next time, a little more complication...
6 Comments
8 years ago
DOM is really easy for web scraping beginners. I have been doing web scraping for the last few years. I started web scraping with DOM and PHP. I provide a web scraping service to my clients.
Reply 8 years ago
Sometimes I like to do things that do not require additional software.
9 years ago on Introduction
Nice tutorial for starting scraping with PHP.
SimpleHTML DOM is a really easy library for developing a PHP-based scraper that uses XPath.
Check pdf scraping using php on my blog
Reply 9 years ago on Introduction
Thanx for your comment.
11 years ago on Introduction
Hi!
The code as written produces an "Undefined Offset: 1" error.
If you comment out the "echo" statement it produces the following output:
array(0) { }
So I'm kind of confused. It doesn't produce the output you are showing here.
Reply 11 years ago on Introduction
Probably because the web page that was used is no longer in existence. Besides, it is October now, not April. Astrologycom.com changed their website a while back. I will probably have to update the example. Also, Instructables has a tendency to not print code correctly. Really busy right now. Will try to look at it soon and use another web page.