Introduction: Beginning Web Page Scraping With PHP.

About: computoman.blogspot.com Bytesize articles instead of a trilogy in one post.

We have done some web page scraping with bash, and now we want to step up the power of the code with a web scripting language called PHP. That is usually the P in the (W/M/L)AMP stack of an Apache2 web server. I will show you the results of the script, dissect the script, and finally give you a hint about script debugging.

Note: This is not just for astrology. You can do this with any text based pages.

Step 1: A Simple PHP Script to Start.

Using the following script, we can extract one section of a page without having to read through the whole page ourselves. Again, we can let the computer act as a secretary or research clerk for us.

s6.php:
[code]
<?php
 $data = file_get_contents('http://www.astrologycom.com/virgodaily.php');
 $regex = '/(Thu Apr 5:.*)Fri Apr 6/';
 preg_match($regex,$data,$match);
// var_dump($match);
 echo $match[1];
 ?>
[/code]

Results:

Thu Apr 5: Top Gear

You feel as though things are starting to move at last! In sporting or physical activities, take no risks. Not that you will be particularly accident-prone, but you might think you are capable of more than you actually are. With Mars active as Mercury begins to move, you feel ambitious and determined, so will work hard to gain your ends, especially where legal matters and partners are concerned. Fortunate colors are red coral and charcoal grey. Lucky numbers are 29 and 35.

Step 2: What Does the Script Do?

Now let's walk through the script.

This is the standard beginning of a PHP script. It means you need PHP installed; a LAMP web server will have it.

<?php

We need to identify and download the web page to parse for the information we need.

 $data = file_get_contents('http://www.astrologycom.com/virgodaily.php');
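file_get_contents returns false when the read fails, and a failed download is one common reason for an empty result. Here is a minimal sketch of a check, assuming a hypothetical helper name fetch_page (not part of the original script). A temporary local file stands in for the URL so the example runs without a network connection:

```php
<?php
// fetch_page() is a hypothetical helper name used for illustration only.
// It returns the contents as a string, or null if the read failed.
function fetch_page($source)
{
    // The @ suppresses PHP's own warning so we can report the failure ourselves.
    $data = @file_get_contents($source);
    return ($data === false) ? null : $data;
}

// A temporary local file stands in for the web page here.
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, 'Thu Apr 5: Top Gear');

$good = fetch_page($tmp);
$bad  = fetch_page($tmp . '.missing');   // this file does not exist

echo ($good === null) ? "download failed\n" : "got " . strlen($good) . " bytes\n";
echo ($bad === null)  ? "download failed\n" : "got " . strlen($bad) . " bytes\n";

unlink($tmp);
```

In the real script you would pass the page URL to the helper instead of the temporary file name.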

Now we need to decide what data to extract. The $regex defines that: what to include at the start, and what marks the end without being included. The parentheses capture everything from the text "Thu Apr 5:" onward, and the match stops at "Fri Apr 6", which sits outside the parentheses and so is left out of the result. Usually what we start and end with is something that never changes on the page.

$regex = '/(Thu Apr 5:.*)Fri Apr 6/';
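Two details of this pattern are worth knowing when you adapt it to other pages. By default the dot does not match a newline, so the pattern only works when both markers sit on the same line of the page source; the s modifier lets the dot cross line breaks. Also, .* is greedy and runs to the last occurrence of the end marker, while .*? stops at the first. A small sketch with a made-up sample string (not the real page):

```php
<?php
// Made-up sample text standing in for a downloaded page.
$data = "Mon: intro\nThu Apr 5: Top Gear\nTake no risks.\nFri Apr 6: Next\nFri Apr 6: Repeat\n";

// Original style: no modifier, so "." stops at a newline and nothing matches here.
$plain = preg_match('/(Thu Apr 5:.*)Fri Apr 6/', $data, $m1);

// With the s modifier "." also matches newlines; greedy .* runs to the LAST "Fri Apr 6".
$greedy = preg_match('/(Thu Apr 5:.*)Fri Apr 6/s', $data, $m2);

// Lazy .*? stops at the FIRST "Fri Apr 6".
$lazy = preg_match('/(Thu Apr 5:.*?)Fri Apr 6/s', $data, $m3);

echo "plain matched: $plain\n";
echo "greedy capture length: " . strlen($m2[1]) . "\n";
echo "lazy capture: " . $m3[1];
```

Whether you need the s modifier depends on how the page source is laid out, so check the raw HTML before deciding.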

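Do the search. preg_match puts the whole match in $match[0] and the text captured by the parentheses in $match[1].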

 preg_match($regex,$data,$match);
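preg_match returns 1 when the pattern matches and 0 when it does not, and $match[1] only exists after a successful match. A hedged sketch of guarding the output, using a hypothetical helper named extract_section and made-up sample strings:

```php
<?php
// extract_section() is a hypothetical helper, not part of the original script.
// It returns the captured text, or null when the pattern does not match.
function extract_section($regex, $data)
{
    if (preg_match($regex, $data, $match) === 1) {
        return $match[1];   // [0] is the whole match, [1] the first ( ) group
    }
    return null;
}

$regex = '/(Thu Apr 5:.*)Fri Apr 6/';

$found   = extract_section($regex, 'Thu Apr 5: Top Gear Fri Apr 6: Next');
$missing = extract_section($regex, 'Sat Apr 7: the page has moved on');

echo ($found === null) ? "no match\n" : $found . "\n";
echo ($missing === null) ? "no match\n" : $missing . "\n";
```

The second call shows what happens once the dates on the page no longer match the hard-coded markers.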

This line is commented out with the //, but it is great for debugging to see what is in the variables. Uncommented, it yields:

array(2) { [0]=> string(668) "Thu Apr 5: Top Gear

You feel as though things are starting to move at last! In sporting or physical activities, take no risks. Not that you will be particularly accident-prone, but you might think you are capable of more than you actually are. With Mars active as Mercury begins to move, you feel ambitious and determined, so will work hard to gain your ends, especially where legal matters and partners are concerned. Fortunate colors are red coral and charcoal grey. Lucky numbers are 29 and 35.

Fri Apr 6" [1]=> string(659) "Thu Apr 5: Top Gear
You feel as though things are starting to move at last! In sporting or physical activities, take no risks. Not that you will be particularly accident-prone, but you might think you are capable of more than you actually are. With Mars active as Mercury begins to move, you feel ambitious and determined, so will work hard to gain your ends, especially where legal matters and partners are concerned. Fortunate colors are red coral and charcoal grey. Lucky numbers are 29 and 35.

// var_dump($match);

Show the result of the extraction.

echo $match[1];

End the script.

?>

Step 3: Debugging the Script.

If you get a blank screen, something went wrong. You can always test the script from the command line on the host machine. PHP5 is the version of PHP we are using.

$ php5 s6.php
PHP Parse error:  syntax error, unexpected '>' in /var/www/pport/s6.php on line 7
$

Then you can go back and edit the script to correct it.

In this case, the last line was just ">" instead of "?>". Details count.
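A blank screen can also come from a runtime problem rather than a syntax error, for example reading $match[1] after a failed match. Here is a minimal sketch that turns on error display while you are testing. These settings cannot reveal a parse error in the same file, since that file never starts running, which is why the command-line check above is still needed. The sample input string is made up:

```php
<?php
// Show every notice, warning and error while testing (turn off on a public server).
error_reporting(E_ALL);
ini_set('display_errors', '1');

$match = array();
$matched = preg_match('/(Thu Apr 5:.*)Fri Apr 6/', 'no dates here', $match);

// Guard the array access so a failed match prints a message instead of nothing.
if ($matched === 1) {
    echo $match[1];
} else {
    echo "Pattern not found; check the start and end markers.\n";
}
```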

Step 4: A Last Word.

Experiment with your own pages and see what happens. So far, pretty simple. Next time, a little more complication...