The Ultimate Guide to Parsing XML: Part 1 (Using SAX)
Posted on August 6th, 2005

I. The Concept
II. A Brief Explanation of XML
III. SAX – The Event Based Parser


I. The Concept

This tutorial shows how to parse an RSS feed so that you can display its content on your own website. Along the way it shows how XML parsing works and what it is useful for. XML knowledge isn’t required, but it is useful to have.

Here is the end result of this tutorial.

II. A Brief Explanation of XML

XML stands for eXtensible Markup Language. It looks a lot like HTML, although it would be more accurate to say that HTML is an XML-like language. Every element has an opening and a closing tag, and tags can carry attributes with values. The difference is that you make up your own tags for an XML document, hence the extensibility. This lets you define your own file formats and protocols that are portable between systems. A company can, for example, define specific tags for the invoices it uses internally. XML has also spawned other formats: RSS is just an XML file with a specific set of tags and attributes. In every case it is up to the parser, and the code behind it, to interpret the XML file.

The XML file we will be parsing is the official Pixel2Life Latest 20 Tutorials feed.
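
To make the structure concrete, here is a rough sketch of what a small RSS 2.0 document looks like, held in a PHP string and checked for well-formedness. The feed content is made up for illustration (example.com addresses, invented titles) and is not the real Pixel2Life feed; $sampleRss, $checker and $wellFormed are names chosen only for this example.

```php
<?php
// A hypothetical, hand-written RSS 2.0 document. The titles and URLs are
// placeholders, not the contents of the real Pixel2Life feed.
$sampleRss = <<<XML
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>A made-up feed used for illustration.</description>
    <item>
      <title>First tutorial</title>
      <link>http://example.com/tutorials/1</link>
      <description>Short text about the first tutorial.</description>
    </item>
  </channel>
</rss>
XML;

// xml_parse() returns 1 when the data it was given is well-formed.
// The third argument (true) says this string is the whole document.
$checker = xml_parser_create();
$wellFormed = (xml_parse($checker, $sampleRss, true) === 1);

echo $wellFormed ? "well-formed\n" : "not well-formed\n";
```

Notice that both the channel and the item contain title, link, and description elements. That overlap is what the parser in this tutorial has to keep apart.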

III. SAX – The Event Based Parser

The Simple API for XML (also known as SAX) is an event-based parser, meaning it handles elements and attributes as it comes to them. The DOM (Document Object Model), on the other hand, reads the whole file into memory and builds a tree representation of it.
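
A quick way to see what "event-based" means is to log every event the parser reports. The sketch below is not part of the feed parser; the handler names (log_open, log_close, log_text) and the $events array are invented for this demo, and case folding is switched off so the names appear as written.

```php
<?php
// Collect the events in the order the parser reports them.
$events = array();

function log_open($parser, $name, $attributes){
    global $events;
    $events[] = "open " . $name;
}

function log_close($parser, $name){
    global $events;
    $events[] = "close " . $name;
}

function log_text($parser, $data){
    global $events;
    // ignore pieces that are only whitespace
    if(trim($data) !== ""){
        $events[] = "text " . trim($data);
    }
}

$demoParser = xml_parser_create();
xml_parser_set_option($demoParser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($demoParser, "log_open", "log_close");
xml_set_character_data_handler($demoParser, "log_text");

// true = this string is the complete document
xml_parse($demoParser, "<item><title>Hello</title></item>", true);

print_r($events);
```

The opening tags arrive first, then the text, then the closing tags in reverse order. Nothing is kept in memory unless your own functions store it, which is the main difference from the DOM approach.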

PHP 4 comes with its own SAX-style XML parser, which is built on the expat library. Depending on how your copy of PHP was built, you may need to recompile PHP to add support for the library.

We’re going to jump right in here with the basics of the parser.

$counter = 0;
$type = 0;
$tag = "";
$itemInfo = array();
$channelInfo = array();

$xmlParser = xml_parser_create();

xml_parser_set_option($xmlParser, XML_OPTION_CASE_FOLDING, TRUE);
xml_parser_set_option($xmlParser, XML_OPTION_SKIP_WHITE, TRUE);

xml_set_element_handler($xmlParser, "opening_element", "closing_element");
xml_set_character_data_handler($xmlParser, "c_data");


It can be a bit daunting when looking at the code, but it becomes pretty obvious and easy.

First, the code creates the variable $xmlParser and uses the function xml_parser_create() to make that variable a parser. Next we set two options for the parser with xml_parser_set_option(): case folding and skipping white space. Skipping white space is pretty obvious. Case folding means that tag names differing only in case, for example <item>, <Item>, and <ITEM>, are all evaluated the same. When set to true, it makes element names uppercase before they reach our functions. Funnily enough, this goes against the XML specification, which says that names are case-sensitive, so those three tags are different and should be evaluated differently.
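
If you want to see case folding in action, the following sketch parses the same small document twice, once with folding on and once with it off. collect_names() is a helper made up for this demo, and it uses anonymous functions, which need a newer PHP than the version this tutorial was written for.

```php
<?php
// Return the element names the parser reports for $xml, with case folding
// switched on or off. collect_names() is a hypothetical helper for this demo.
function collect_names($xml, $fold){
    $names = array();
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $fold);
    xml_set_element_handler(
        $parser,
        // opening tag: remember the name exactly as the parser reports it
        function($p, $name, $attributes) use (&$names){ $names[] = $name; },
        // closing tag: nothing to do in this demo
        function($p, $name){}
    );
    xml_parse($parser, $xml, true);
    return $names;
}

$folded = collect_names("<rss><Channel/></rss>", true);
$exact  = collect_names("<rss><Channel/></rss>", false);

echo implode(", ", $folded) . "\n";
echo implode(", ", $exact) . "\n";
```

With folding on, the names come back in uppercase no matter how the feed wrote them, which is why the handlers in this tutorial compare against "CHANNEL" and "ITEM".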

The next two calls are what really matter. xml_set_element_handler() sets the two functions used to handle opening and closing tags. These functions are user defined and tell the parser what to do when it encounters an element. Here, opening_element is called for every opening tag in the file and closing_element for every closing tag. xml_set_character_data_handler() sets the user-defined function that handles the character data (the text) in between the element tags, so c_data is the function used for the character data. We’ll get to these functions next.

First we’ll start with the opening_element() function.

function opening_element($xmlParser, $name, $attribute){

    global $tag, $type;

    $tag = $name;

    if($name == "CHANNEL"){
        $type = 1;
    }
    else if($name == "ITEM"){
        $type = 2;
    }

}//end opening element


The opening element function is passed three arguments. The first is the parser, which we defined as $xmlParser. The second is the name of the tag. The third contains all of the attributes of the tag as an associative array.

So first off in our function we declare our global variables. One of the three marks of a programmer is laziness, and you can’t get much more lazy than globals. The string $tag and the integer $type are made global so we can manipulate them inside the function. We set $tag to the name of the tag, which is the $name argument of the function. This will be used later in the parser. Next we check whether the name of the tag is “CHANNEL” or “ITEM”. This is because a channel and an item in an RSS file both contain title, description, and link elements, so we have to keep them apart. Also notice that CHANNEL and ITEM are in uppercase. This is important because, if you remember, case folding is on, so every tag name is reported in uppercase. If the tag is the channel tag the type becomes 1, if it is an item tag the type becomes 2, and for any other tag the type is left alone.

Next we have the character data function. Since SAX is an event-based parser, the character data comes after the opening tag, so we’ll discuss it next.

function c_data($xmlParser, $data){

    global $tag, $type, $channelInfo, $itemInfo, $counter;

    $data = trim(htmlspecialchars($data));

    if($tag == "TITLE" || $tag == "DESCRIPTION" || $tag == "LINK"){
        $key = strtolower($tag);

        if($type == 1){

            // create the entry first so .= never appends to an undefined key
            if(!isset($channelInfo[$key])){
                $channelInfo[$key] = "";
            }
            $channelInfo[$key] .= $data;

        }//end checking channel
        else if($type == 2){

            if(!isset($itemInfo[$counter][$key])){
                $itemInfo[$counter][$key] = "";
            }
            $itemInfo[$counter][$key] .= $data;

        }//end checking for item
    }//end checking tag
}//end cdata funct


The c_data() function is passed two arguments: the parser, of course, and the variable that contains the data. So now, being the lazy programmers we are, we declare our global variables: the tag, the type (channel or item) that the title/description/link belongs to, the array that will contain the channel information, the array that will contain the item info, and the item counter.

Now we get to the meat of this. First we sanitize the data variable by stripping the whitespace from the front and end of the string and converting special HTML characters (such as <, > and &) to their entity equivalents.

Next we check whether the tag is a title, description, or link tag, which skips tags such as channel, item, or rss that hold no text of their own. After this we see whether the info belongs to the channel or to an item. First we check whether the type is 1, a channel. If so, we put the value into the channelInfo array, using it as an associative array. The function strtolower() makes its argument lowercase, so the key becomes “title”, “description”, or “link”. Then the data is appended to that entry.

Items are a bit different. We use the counter variable that was set to 0 at the beginning of the parser. Since array indexes start at 0, this is the first entry. The counter increases after the closing tag for an item, which happens in the next function for closing tags. The data is again stored under the lowercase tag name, this time in the itemInfo array. The . before the equals sign is needed because the parser may deliver the text of one element in several pieces, and without it each piece would overwrite the previous one. Take out the period to see what I mean, then put it back. Try this with your own RSS feed.
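
The sketch below shows why appending matters. It records every call to the character data handler for a title that contains an entity. How many pieces the text arrives in depends on the XML library behind PHP, so the demo prints the count rather than assuming it; collect_text(), $chunks and $title are names made up for this example.

```php
<?php
$chunks = array();   // every piece of text the handler was called with
$title  = "";        // the pieces joined back together

function collect_text($parser, $data){
    global $chunks, $title;
    $chunks[] = $data;   // remember each separate call
    $title   .= $data;   // append instead of overwrite
}

$demoParser = xml_parser_create();
xml_set_character_data_handler($demoParser, "collect_text");

// true = this string is the complete document
xml_parse($demoParser, "<title>Tom &amp; Jerry</title>", true);

echo count($chunks) . " call(s), joined: " . $title . "\n";
```

If the handler used = instead of .=, only the last piece would survive whenever the text arrives in more than one call. Also note that trimming every piece, as c_data() does, can remove the spaces next to a split point, so you may prefer to trim once the closing tag has been reached.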

Almost done with SAX! The only things left to do are the closing element function and the parsing itself. So here’s the closing element function.

function closing_element($xmlParser, $name){

    global $tag, $type, $counter;

    $tag = "";
    if($name == "ITEM"){
        $type = 0;
        $counter++;
    }
    else if($name == "CHANNEL"){
        $type = 0;
    }
}//end closing_element


The closing element is again only passed two arguments, the parser variable, and the name of the tag.

First things first, let’s be lazy and declare our global variables. After this we make tag empty because the tag is now closed. Then we check if the name of this tag is item or channel. If it is, we set the type to 0. And if it is an item, we increase the counter. Since this will be the last thing coming when parsing an item, it makes sense to increase the counter so our itemInfo array will increase sequentially.
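
If the globals bother you, the same three handlers can also live on an object. The sketch below is one possible arrangement and not part of the original tutorial: the class name FeedCollector and its method names are made up, the feed string is invented, and the trim()/htmlspecialchars() step is left out to keep it short. It relies on the handler functions accepting array($object, "method") callbacks.

```php
<?php
// FeedCollector is a hypothetical class that keeps the parser state in
// properties instead of global variables.
class FeedCollector {
    public $channelInfo = array();
    public $itemInfo = array();
    private $tag = "";
    private $type = 0;      // 0 = neither, 1 = channel, 2 = item
    private $counter = 0;   // index of the item being filled

    public function openingElement($parser, $name, $attributes){
        $this->tag = $name;
        if($name == "CHANNEL"){
            $this->type = 1;
        } else if($name == "ITEM"){
            $this->type = 2;
        }
    }

    public function closingElement($parser, $name){
        $this->tag = "";
        if($name == "ITEM"){
            $this->type = 0;
            $this->counter++;
        } else if($name == "CHANNEL"){
            $this->type = 0;
        }
    }

    public function characterData($parser, $data){
        if($this->tag != "TITLE" && $this->tag != "DESCRIPTION" && $this->tag != "LINK"){
            return;
        }
        $key = strtolower($this->tag);
        if($this->type == 1){
            $old = isset($this->channelInfo[$key]) ? $this->channelInfo[$key] : "";
            $this->channelInfo[$key] = $old . $data;
        } else if($this->type == 2){
            $old = isset($this->itemInfo[$this->counter][$key]) ? $this->itemInfo[$this->counter][$key] : "";
            $this->itemInfo[$this->counter][$key] = $old . $data;
        }
    }
}

$collector = new FeedCollector();
$objectParser = xml_parser_create();   // case folding is on by default
xml_set_element_handler($objectParser, array($collector, "openingElement"), array($collector, "closingElement"));
xml_set_character_data_handler($objectParser, array($collector, "characterData"));

$feed = "<rss><channel><title>Demo</title>"
      . "<item><title>One</title></item>"
      . "<item><title>Two</title></item>"
      . "</channel></rss>";
xml_parse($objectParser, $feed, true);

print_r($collector->itemInfo);
```

The logic is the same as in the three functions above. The only change is where the state lives, which makes it possible to parse two feeds on one page without their variables colliding.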

Now we get to parse the file. Let’s just see what we have so far before actually parsing it.

$counter = 0;
$type = 0;
$tag = "";
$itemInfo = array();
$channelInfo = array();

function opening_element($xmlParser, $name, $attribute){

    global $tag, $type;

    $tag = $name;

    if($name == "CHANNEL"){
        $type = 1;
    }
    else if($name == "ITEM"){
        $type = 2;
    }

}//end opening element


function closing_element($xmlParser, $name){

    global $tag, $type, $counter;

    $tag = "";
    if($name == "ITEM"){
        $type = 0;
        $counter++;
    }
    else if($name == "CHANNEL"){
        $type = 0;
    }
}//end closing_element

function c_data($xmlParser, $data){

    global $tag, $type, $channelInfo, $itemInfo, $counter;

    $data = trim(htmlspecialchars($data));

    if($tag == "TITLE" || $tag == "DESCRIPTION" || $tag == "LINK"){
        $key = strtolower($tag);

        if($type == 1){

            // create the entry first so .= never appends to an undefined key
            if(!isset($channelInfo[$key])){
                $channelInfo[$key] = "";
            }
            $channelInfo[$key] .= $data;

        }//end checking channel
        else if($type == 2){

            if(!isset($itemInfo[$counter][$key])){
                $itemInfo[$counter][$key] = "";
            }
            $itemInfo[$counter][$key] .= $data;

        }//end checking for item
    }//end checking tag
}//end cdata funct

$xmlParser = xml_parser_create();

xml_parser_set_option($xmlParser, XML_OPTION_CASE_FOLDING, TRUE);
xml_parser_set_option($xmlParser, XML_OPTION_SKIP_WHITE, TRUE);

xml_set_element_handler($xmlParser, "opening_element", "closing_element");
xml_set_character_data_handler($xmlParser, "c_data");


Whew. That’s a lot. But it will all be rewarded with the parsing.


$fp = file("http://pixel2life.com/feeds/latest_20_tuts.xml");

// file() returns false when the feed cannot be read
if($fp === false){
    die("Could not open feed.");
}

foreach($fp as $line){
    if(!xml_parse($xmlParser, $line)){
        die("Could not parse file.");
    }
}

// tell the parser the input is finished so it can check the document is complete
if(!xml_parse($xmlParser, "", true)){
    die("Could not parse file.");
}

?>
<html>
<head>
<title><?php echo $channelInfo["title"];?></title>
</head>
<body>
<h1><a href="<?php echo $channelInfo["link"];?>"><?php echo $channelInfo["title"]; ?></a></h1>
Description of Feed: <?php echo $channelInfo["description"];?><br /><br />
<?php
foreach($itemInfo as $items){
    echo "<img src='".$items["description"]."' height='40' width='40' alt='".$items["description"]."' />";
    echo "<a href='".$items["link"]."'>".$items["title"]."</a><br /><br />";
}
?>
</body>
</html>


First we open the feed and read it with file(), which returns the file as an array of lines. Then we use a foreach loop to go through each line and parse it, and we check whether a line could not be parsed. The xml_parse() function takes three arguments, two of which are required. The first is the parser we made way back when. The second is the string we are parsing. The optional third is a boolean that tells the parser this is the last piece of data, which lets it check that the document is complete.
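
Here is a small sketch of how that third argument and PHP's XML error functions can work together. parse_chunks() is a helper invented for this example, and the two tiny documents are made up; the functions it calls (xml_get_error_code(), xml_error_string(), xml_get_current_line_number()) are part of the same XML extension.

```php
<?php
// Feed an array of strings to the parser, marking the last one as final.
// Returns an empty string on success or an error message on failure.
// parse_chunks() is a hypothetical helper, not part of the tutorial's parser.
function parse_chunks($parser, $chunks){
    $last = count($chunks) - 1;
    foreach($chunks as $i => $chunk){
        $isFinal = ($i === $last);   // true only for the last piece
        if(!xml_parse($parser, $chunk, $isFinal)){
            return sprintf(
                "XML error: %s at line %d",
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            );
        }
    }
    return "";
}

// A well-formed document, split into lines the way file() would return it.
$good = parse_chunks(
    xml_parser_create(),
    array("<rss>\n", "<channel>\n", "</channel>\n", "</rss>\n")
);

// The channel tag is never closed, so the parser should complain.
$bad = parse_chunks(
    xml_parser_create(),
    array("<rss>\n", "<channel>\n", "</rss>\n")
);

echo ($good === "" ? "first document parsed" : $good) . "\n";
echo $bad . "\n";
```

Compared with a plain die("Could not parse file."), the message tells you what went wrong and roughly where, which helps a lot when a feed you don't control changes.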

Now we get to the display. By the time xml_parse() has run through the feed and called all the handler functions we built, the two arrays $channelInfo and $itemInfo are filled and ready to output to the user. First we escape out of PHP and start our HTML output. We do all of the usual HTML stuff: the head, title, and body. Then we link to the main site of the feed. We make this a header, start our anchor tag, and for the href echo the link held in the $channelInfo array under the key “link”. We add the double quote and > sign to finish the opening tag, then echo the channel title as the link text. After that we just close the link and the header, and finish off the channel’s info with the description.

Next we loop through the items with a foreach loop, which hands us each entry of $itemInfo in turn as $items. So instead of accessing everything with $itemInfo[0]["title"], it’s much easier just to use $items["title"], and the loop automatically goes through every item. For each item we first echo an img tag, ending the string and using the period to concatenate $items["description"] as both the src and the alt text (this assumes the feed stores an image address in each item’s description). Then we echo the link: the a tag with $items["link"] as the href, the closing '>, the title as the link text, the closing </a>, and two line breaks so that the items are a bit separated and easier to read. Finally we end the loop, go back to HTML, and end the body and the HTML file.

Congratulations! You finished parsing an RSS file using PHP’s expat library. Let’s look at the code in full.

<?php

$counter = 0;
$type = 0;
$tag = "";
$itemInfo = array();
$channelInfo = array();

function opening_element($xmlParser, $name, $attribute){

    global $tag, $type;

    $tag = $name;

    if($name == "CHANNEL"){
        $type = 1;
    }
    else if($name == "ITEM"){
        $type = 2;
    }

}//end opening element


function closing_element($xmlParser, $name){

    global $tag, $type, $counter;

    $tag = "";
    if($name == "ITEM"){
        $type = 0;
        $counter++;
    }
    else if($name == "CHANNEL"){
        $type = 0;
    }
}//end closing_element

function c_data($xmlParser, $data){

    global $tag, $type, $channelInfo, $itemInfo, $counter;

    $data = trim(htmlspecialchars($data));

    if($tag == "TITLE" || $tag == "DESCRIPTION" || $tag == "LINK"){
        $key = strtolower($tag);

        if($type == 1){

            // create the entry first so .= never appends to an undefined key
            if(!isset($channelInfo[$key])){
                $channelInfo[$key] = "";
            }
            $channelInfo[$key] .= $data;

        }//end checking channel
        else if($type == 2){

            if(!isset($itemInfo[$counter][$key])){
                $itemInfo[$counter][$key] = "";
            }
            $itemInfo[$counter][$key] .= $data;

        }//end checking for item
    }//end checking tag
}//end cdata funct

$xmlParser = xml_parser_create();

xml_parser_set_option($xmlParser, XML_OPTION_CASE_FOLDING, TRUE);
xml_parser_set_option($xmlParser, XML_OPTION_SKIP_WHITE, TRUE);

xml_set_element_handler($xmlParser, "opening_element", "closing_element");
xml_set_character_data_handler($xmlParser, "c_data");

$fp = file("http://pixel2life.com/feeds/latest_20_tuts.xml");

// file() returns false when the feed cannot be read
if($fp === false){
    die("Could not open feed.");
}

foreach($fp as $line){
    if(!xml_parse($xmlParser, $line)){
        die("Could not parse file.");
    }
}

// tell the parser the input is finished so it can check the document is complete
if(!xml_parse($xmlParser, "", true)){
    die("Could not parse file.");
}

?>
<html>
<head>
<title><?php echo $channelInfo["title"]; ?></title>
</head>
<body>
<h1><a href="<?php echo $channelInfo["link"];?>"><?php echo $channelInfo["title"]; ?></a></h1>
Description of Feed: <?php echo $channelInfo["description"];?><br /><br />
<?php
foreach($itemInfo as $items){
    echo "<img src='".$items["description"]."' height='40' width='40' alt='".$items["description"]."' />";
    echo "<a href='".$items["link"]."'>".$items["title"]."</a><br /><br />";
}
?>
</body>
</html>
Blitz

I'm two months from 21 and these were written when I was 17. Fun how time flies huh?