Web Scraping using PHP


This tutorial shows how to scrape some simple data from a web page with PHP.

As an example, we will extract all the 'h1' headings from the page http://www.ajarunthomas.com/jquery/.

First, we fetch the page content and load it into a DOM document:


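// Fetch the raw HTML; file_get_contents() returns false on failure
$html = file_get_contents("http://www.ajarunthomas.com/jquery");

// Parse the HTML only when something was actually fetched
if(!empty($html)){
    $aj_dom = new DOMDocument();
    $aj_dom->loadHTML($html);
}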
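
The fetch can fail, and $aj_dom only exists when the page came back non-empty, so later steps need a way to tell the two cases apart. Below is a minimal sketch of one way to wrap the step. The helper name aj_load_dom and the 10-second timeout are assumptions made for illustration; they are not part of the original code.

```php
<?php
// Hypothetical helper (not part of the original tutorial): reads a URL or
// file path and returns a DOMDocument, or null when nothing usable came back.
function aj_load_dom(string $source): ?DOMDocument
{
    // Assumed 10-second timeout so a slow server does not hang the script.
    $context = stream_context_create(['http' => ['timeout' => 10]]);

    // file_get_contents() returns false on failure; @ silences its warning.
    $html = @file_get_contents($source, false, $context);
    if ($html === false || trim($html) === '') {
        return null;
    }

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    return $dom;
}
```

A caller can then check the result for null before building the DOMXPath object, instead of relying on $aj_dom being defined.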

 

Next we define the XPath expression. Since we want every h1 heading, the expression is '//h1', which matches h1 elements anywhere in the document:


$aj_xpath = new DOMXPath($aj_dom);
$aj_row = $aj_xpath->query('//h1');
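
To see what the expression selects without touching the network, the same query can be run on a small inline document. This is only a sketch: the sample markup and the class name "title" are invented for illustration.

```php
<?php
// Sample markup invented for this illustration.
$html = '<html><body>'
      . '<h1 class="title">First</h1>'
      . '<div><h1>Second</h1><h2>Not matched</h2></div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// '//h1' selects every <h1> element, at any depth in the document.
$all = $xpath->query('//h1');

// A predicate narrows the match, here to <h1> elements with class="title".
$titled = $xpath->query('//h1[@class="title"]');

echo $all->length, "\n";                  // prints 2
echo $titled->item(0)->textContent, "\n"; // prints First
```

query() returns a DOMNodeList, so the result can be counted with ->length, indexed with ->item(), or iterated with foreach.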

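Now we store all the headings in an array. saveXML() returns the markup of each heading, including its h1 tags:


// Initialise the array so it exists even when no heading is found
$arr = array();

if($aj_row->length > 0){
    foreach($aj_row as $row){
        $arr[] = $aj_dom->saveXML($row);
    }
}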
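
If only the heading text is needed rather than the markup, the textContent property of each node can be collected instead. The following sketch compares the two on an inline sample; the markup and the variable names $markup and $text are assumptions for illustration.

```php
<?php
// Sample markup invented for this illustration.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><h1>Hello <em>world</em></h1><h1>Bye</h1></body></html>');
$xpath = new DOMXPath($dom);

$markup = array(); // headings including their tags
$text   = array(); // headings as plain text

foreach ($xpath->query('//h1') as $node) {
    // saveXML($node) serialises the element together with its children.
    $markup[] = $dom->saveXML($node);
    // textContent concatenates the text of the element and its descendants.
    $text[] = trim($node->textContent);
}

print_r($markup);
print_r($text);
```

Keeping the markup is convenient when the headings are echoed straight into another HTML page; plain text is usually easier to store or compare.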

 

Finally, we display the headings:


// Print each stored heading; skip the loop when nothing was collected
if(!empty($arr)){
    foreach($arr as $heading){
        echo $heading;
    }
}

 

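To keep libxml parser warnings out of the page output, enable internal error handling before calling loadHTML() and clear the collected errors afterwards:


libxml_use_internal_errors(TRUE);   // call before loadHTML()
libxml_clear_errors();              // call after loadHTML() to discard the collected errors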
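
With internal error handling enabled, the warnings are not lost: they can be read with libxml_get_errors() before being cleared. The sketch below assumes an invented sample document; whether the parser reports anything for it depends on the installed libxml version, so the loop may print nothing.

```php
<?php
// Markup invented for this illustration; some libxml versions report the
// HTML5 <section> tag as invalid, others accept it silently.
$html = '<html><body><section><h1>Title</h1></section></body></html>';

// Remember the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each entry is a LibXMLError object with line and message properties.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), "\n";
}

// Empty the error buffer and restore the earlier error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Logging the messages this way can help when a page parses differently from what was expected, while still keeping them out of the visible output.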

 

To conclude, the complete code looks like this:


<?php
// Collect libxml parser warnings internally instead of printing them
libxml_use_internal_errors(TRUE);

$html = file_get_contents("http://www.ajarunthomas.com/jquery");

// Initialise the array so the output loop is safe when nothing is found
$arr = array();

if(!empty($html)){
    $aj_dom = new DOMDocument();
    $aj_dom->loadHTML($html);
    libxml_clear_errors();

    $aj_xpath = new DOMXPath($aj_dom);
    $aj_row = $aj_xpath->query('//h1');
    foreach($aj_row as $row){
        $arr[] = $aj_dom->saveXML($row);
    }
}

// Print every heading that was collected
foreach($arr as $heading){
    echo $heading;
}

?>

 
