25
Oct

Parsing malformed RSS feeds with Magpie RSS

I recently discovered that Craigslist’s RSS feeds leave much to be desired in terms of well-formed XML. A large portion of the AdRavage back-end magic works by parsing the RSS results of searches, and the bulk of this is achieved through Magpie RSS. Unfortunately, when a returned feed contains invalid encoding or has structural issues, it becomes virtually impossible to parse, at least by default.

Fortunately, one of my users created a search that returned parsing errors of this nature, which gave me the chance to dig into a solution. Since Craigslist will not be modifying the user input mechanism used to build its feeds, and will certainly ignore my request for better formatting, I took it upon myself to do a little research. By submitting the feed for validation at http://validator.w3.org/feed I was able to find the exact spot where the feed breaks.

I couldn’t tell exactly what the issue was (aside from ‘not well-formed (invalid token)’) until I pasted the contents into Notepad++, which displayed a strangely encoded character within the string. To fix the issue I needed to cleanse the extracted RSS feed just before it gets passed to PHP’s XML parser, and after tracing the execution back I was able to pinpoint that spot in rss_parse.inc, right around line 130 or so.
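If you would rather not rely on a text editor to spot the offending character, PHP's own XML functions can report roughly where parsing stops. The following is only a sketch: find_feed_parse_error() is a hypothetical helper, not part of Magpie RSS, and the byte offset the parser reports should be treated as approximate, which is why the helper dumps a window of bytes around it in hex.

```php
<?php
// Hypothetical helper (not part of Magpie RSS): runs the feed through
// PHP's XML parser and describes the first error it hits, if any.
function find_feed_parse_error($source, $context = 20)
{
    $parser = xml_parser_create();

    // Third argument true: $source is the complete document.
    if (xml_parse($parser, $source, true)) {
        return null; // parsed cleanly, nothing to report
    }

    $offset = xml_get_current_byte_index($parser);
    $start  = max(0, $offset - $context);

    return array(
        'message' => xml_error_string(xml_get_error_code($parser)),
        'line'    => xml_get_current_line_number($parser),
        'offset'  => $offset,
        // Hex dump of the bytes around the reported position, so that
        // non-printable characters become visible.
        'hex'     => bin2hex(substr($source, $start, $context * 2)),
    );
}

// Made-up sample feed containing a control character (0x07).
$broken = "<rss><channel><title>Bad \x07 char</title></channel></rss>";

$error = find_feed_parse_error($broken);
if ($error !== null) {
    printf("%s (line %d, near byte %d)\nhex: %s\n",
        $error['message'], $error['line'], $error['offset'], $error['hex']);
}
```

Scanning the hex output for bytes that do not belong (here, 07) points at the data that needs to be stripped or replaced.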

The line in question:

$status = xml_parse( $this->parser, $source );

At this point, just before parsing, $source holds the raw contents of the RSS feed. The solution was simple: run str_replace() on $source before passing it to the parser.

function MagpieRSS ($source, $output_encoding='ISO-8859-1',
                    $input_encoding=null, $detect_encoding=true)
{
    /*
     * Feed cleansing (encoding replacement stuff) here CLF - 10/25/2008
     * Should execute it on $source (the content extracted from feed)
     */

    $source = str_replace('[bad data]', '', $source);

    /*
     * END - feed cleansing
     */

...

I wasn’t able to show the actual character I needed to replace, but you get the idea: replace [bad data] with the character(s) you want to remove before parsing. If you can’t seem to find it, try pasting the section of bad data into Notepad++, which will display it for you; Notepad and WordPad will not. Invalid encoding or characters is only one way that feeds can be malformed, so if you run into other problems you will need to modify $source accordingly.
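If the bad data turns out to be a stray control character, a slightly more general cleansing step can replace the single str_replace() call. This is a sketch under assumptions, not what I actually used: cleanse_feed_source() is a hypothetical name, it assumes the feed is in an ASCII-compatible encoding such as UTF-8 or ISO-8859-1, and it only covers control characters that XML 1.0 forbids. Other kinds of bad data (invalid byte sequences, broken markup) would need their own handling.

```php
<?php
// Hypothetical helper: strips control characters that XML 1.0 does not
// allow. Assumes an ASCII-compatible encoding (UTF-8, ISO-8859-1, ...).
function cleanse_feed_source($source)
{
    // Below 0x20, XML 1.0 permits only tab (0x09), line feed (0x0A)
    // and carriage return (0x0D); the rest are removed here.
    $cleaned = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $source);

    // preg_replace() returns null on failure; in that case keep the
    // original data rather than handing the parser nothing at all.
    return $cleaned === null ? $source : $cleaned;
}

// Made-up sample: a title with a vertical tab (0x0B) embedded in it.
$source = "<title>Couch\x0B for sale</title>";
echo cleanse_feed_source($source), "\n";
```

Inside the MagpieRSS constructor, the call would go where the str_replace() line sits, for example $source = cleanse_feed_source($source);.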
