Ian Marsh

21 Dec, 2007

web scraper class - PHP4

Posted by: Ian In: PHP

The other week I found myself needing to populate a database with information found on various spots around the web. Unfortunately none of the information was in a standard format (RSS, XML) so I had to scrape the HTML pages the old-fashioned way. I wrote this little scraper class to help out with the parsing.
Here's a usage example:

PHP:
$scraper = new Scraper();
$scraper->getRemoteText("http://flickr.com/explore/");
$scraper->jumpNextToken('"Interestingness"');
$imgURL = $scraper->scrapeNext('src="', '"');

$imgURL would then contain the URL of the main image found on the Explore page of Flickr. There are of course much better ways to do things like this using XML and RSS when available, but sometimes you've just got to take what you're given.
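
To make the jump-then-scrape idea concrete without touching the network, here is a minimal sketch of what those two calls boil down to, written with plain strpos() and substr(). The helper name scrapeBetween and the sample markup are assumptions made up for illustration; they are not part of the class and not Flickr's real page.

```php
<?php
// Hypothetical helper: return the text between $startToken and $endToken,
// searching forward from $offset. Returns "" when either token is missing.
function scrapeBetween($haystack, $startToken, $endToken, $offset = 0) {
    $start = strpos($haystack, $startToken, $offset);
    if ($start === false) {
        return "";
    }
    $start += strlen($startToken);  // skip past the start token
    $end = strpos($haystack, $endToken, $start);
    if ($end === false) {
        return "";
    }
    return substr($haystack, $start, $end - $start);
}

// Made-up markup standing in for a fetched page
$html = '<h1>"Interestingness"</h1><img src="http://example.com/photo.jpg" alt="">';

// Step 1: move past the landmark text, much like jumpNextToken()
$landmark = '"Interestingness"';
$head = strpos($html, $landmark) + strlen($landmark);

// Step 2: grab what sits between src=" and the next quote, much like scrapeNext()
$imgURL = scrapeBetween($html, 'src="', '"', $head);
echo $imgURL, "\n";
```

Jumping to a landmark first matters: it keeps the src=" search from matching something earlier in the page than the image you are after.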


PHP:
class Scraper
{
    /**
     *  Scraper class, for scraping data out of bodies of text (html)
     *  Author: Ian Marsh
     *  Version 0.1
     **/
    var $haystack;  // the text we will be scraping
    var $head;      // current position in the haystack

    /**
     * Constructor: optionally seed the scraper with text to scan.
     * PHP4-style constructor (PHP 5 and later would name this __construct).
     **/
    function Scraper($text = "") {
        $this->haystack = $text;
        $this->head = 0;
    }

    /**
     * getRemoteText: fetch $url with cURL and use the response as the haystack.
     * Returns true on success, false if the request failed.
     **/
    function getRemoteText($url, $timeout = 5) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        $result = curl_exec($ch);
        curl_close($ch);

        // curl_exec() returns false on failure; keep the haystack a string
        $this->haystack = ($result === false) ? "" : $result;
        $this->head = 0;  // start scanning the new text from the beginning
        return ($result !== false);
    }

    /**
     * setPosition: move the head to an absolute offset in the haystack
     **/
    function setPosition($newPos = 0) {
        $this->head = $newPos;
    }

    /**
     * getPosition: return the current offset of the head
     **/
    function getPosition() {
        return $this->head;
    }

    /**
     * hasMoreTokens: is $token found at or after the current position?
     **/
    function hasMoreTokens($token) {
        // strpos() returns 0 for a match at the very start, so compare strictly
        return strpos($this->haystack, $token, $this->head) !== false;
    }

    /**
     * jumpNextToken: move the head to the next occurrence of $token.
     * Lands just past the token by default, or on its first character
     * when $endOfToken is false. Returns false if the token is not found.
     **/
    function jumpNextToken($token = " ", $endOfToken = true) {
        $find = strpos($this->haystack, $token, $this->head);
        if ($find === false) {
            return false;
        }
        $this->setPosition($endOfToken ? $find + strlen($token) : $find);
        return true;
    }

    /**
     * scrapeNext: return the cleaned-up text between the next $startToken
     * and the $endToken that follows it, or "" if either is missing.
     **/
    function scrapeNext($startToken, $endToken) {
        // without a start token there is nothing to scrape
        if (!$this->jumpNextToken($startToken)) {
            return "";
        }
        $endScrape = strpos($this->haystack, $endToken, $this->head);
        if ($endScrape === false) {
            return "";
        }
        $scrape = substr($this->haystack, $this->head, $endScrape - $this->head);
        $scrape = strip_tags($scrape);
        $scrape = html_entity_decode($scrape);
        $scrape = trim($scrape);

        $this->setPosition($endScrape);
        return $scrape;
    }
}
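
Because the class keeps a moving head, it also lends itself to pulling out every match on a page rather than just the first one. The sketch below shows that loop with plain string functions so it runs on its own; scrapeAll and the sample markup are hypothetical, and with the class itself the same loop would be built from hasMoreTokens() and scrapeNext().

```php
<?php
// Hypothetical helper: collect every substring found between
// $startToken and $endToken, scanning the haystack left to right.
function scrapeAll($haystack, $startToken, $endToken) {
    $results = array();
    $head = 0;  // current position, playing the role of the class's $head
    while (($start = strpos($haystack, $startToken, $head)) !== false) {
        $start += strlen($startToken);
        $end = strpos($haystack, $endToken, $start);
        if ($end === false) {
            break;  // unterminated match: stop rather than guess
        }
        $results[] = substr($haystack, $start, $end - $start);
        $head = $end + strlen($endToken);  // resume after the end token
    }
    return $results;
}

// Made-up markup; note that the first match starts at offset 0
$html = '<img src="a.jpg"><p>text</p><img src="b.jpg">';
print_r(scrapeAll($html, '<img src="', '"'));
```

The first image tag sits at offset 0, which is why the loop compares strpos() against false with !==. A loose comparison would treat offset 0 as "not found" and skip it.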

1 Response to "web scraper class - PHP4"

1 | Larry

January 19th, 2009 at 2:55 am

Hi Ian,

Thanks for the class. It does however fall over at the moment (I realise it’s quite an old post and this may have been fixed).

For example it pulls out the first instance of the javascript not the first image, due to :

Being on the flickr page.
