At work, I'm importing an XML feed from typepad.com to store it in a database and display it on another portal site. I'm using the PHP library MagpieRSS to import the XML/Atom feed. I store each news item in a database and update old items as they are downloaded again. All of this works great, very straightforward and easy. I got this far in about two hours. I then set out to build a more general Symfony plugin that lets me aggregate multiple feeds. Setting this up took another hour or two, mostly updating the database structure to handle multiple feeds and laying out the plugin file structure. Now the fun begins.
When I enter the second XML feed into the system to be aggregated on the portal page, I notice a bunch of strange characters in the page. Any web developer will have seen these in Firefox before: they are the Microsoft "smart quotes" that appear when somebody pastes content from a Word doc into a web form. These are not regular ASCII characters. That shouldn't be a problem since we can handle different character sets and encodings on the web, right? Well, one of the XML feeds comes in as ISO-8859-1 and the other comes in as a mix of UTF-8 and whatever Windows encoding the crazy quotes are from (most likely Windows-1252). When all of these are combined, the output looks like crap, because a web page can only declare one character encoding. Additionally, when I store the items in the database, MySQL wants them all converted to the same character set. I have to store them as binary data if I don't want them auto-converted to a bunch of question marks.
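One common way to tame a feed like this is to normalize everything to UTF-8 before it reaches the database: decode as UTF-8, and whenever a byte is not valid UTF-8, treat it as a single-byte Windows character instead. Here is a minimal sketch of that idea in Python rather than PHP. It assumes the stray bytes are Windows-1252, which is a guess on my part, and `to_utf8` and `cp1252_fallback` are names made up for this example.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: decode bytes that are not valid UTF-8 as Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Assumption: the stray bytes are Windows-1252 (smart quotes are 0x91-0x94).
    # errors="replace" covers the few byte values cp1252 leaves undefined.
    return bad.decode("cp1252", errors="replace"), err.end


# Register the handler under a hypothetical name so decode() can use it.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as clean UTF-8, repairing non-UTF-8 bytes."""
    text = raw.decode("utf-8", errors="cp1252_fallback")
    return text.encode("utf-8")


if __name__ == "__main__":
    # Valid UTF-8 text with Windows-1252 smart quotes pasted into it.
    mixed = "caf\u00e9 ".encode("utf-8") + b"\x93quoted\x94"
    print(to_utf8(mixed).decode("utf-8"))
```

This is a heuristic, not a guarantee. Because ISO-8859-1 and Windows-1252 agree on the accented letters, it also copes with most of an ISO-8859-1 feed, but a run of single-byte characters that happens to form a valid UTF-8 sequence will be misread. PHP's mbstring offers `mb_check_encoding` and `mb_convert_encoding`, which could be used for a similar check-then-convert pass.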
PHP has an mbstring library that helps, but I eventually gave up on trying to re-encode everything yesterday evening. Maybe the answer will just hit me Monday when I revisit this problem. I just can't believe how difficult XML still is to work with after the world has been using it for so long. XML was supposed to make everything better. Hell, I'm not even doing the difficult part of parsing it; a library is doing the hard part for me. I guess one can't blame the XML spec when people put garbage data into that format. It's just damn frustrating.