XMLHTTP notes: responseXML and IE bug

Currently I'm working on debugging a very complicated script that's supposed to xmlhttprequest a few pages to be shown in a "Dashboard". I already wrote about another aspect of the project in my previous entry, but now that I'm concentrating on the XMLHTTP aspects of this project I found out a few very interesting things about responseXML, as well as a complicated Explorer bug.

Update: this entry was caused by my work on the KLM site, which is now online.

responseXML

Once an xmlhttprequest has come back from the server it's time to read out the server's reply and do something with it. There are two ways of reading out the reply: responseText and responseXML. Not surprisingly, the first gives the reply as plain text, while the second offers an XML document.

responseXML is only available if the server sends back a well-formed XML document served with an XML MIME type such as text/xml, and not if it sends back, for instance, an HTML document.
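In script, the distinction comes down to testing whether responseXML holds a usable document before falling back to responseText. A minimal sketch: readReply() is a hypothetical helper name, not part of the script discussed here, and the exact "empty" value of responseXML (null, or a document without a root element) is assumed to differ per browser.

```javascript
// Hypothetical helper: hands back the reply as a document tree when the
// browser built one, and as plain text otherwise.
function readReply(xmlhttp) {
  var xml = xmlhttp.responseXML;
  // Assumption: an unusable responseXML is either null or a document
  // whose documentElement is null, depending on the browser.
  if (xml && xml.documentElement) {
    return { type: 'xml', data: xml };
  }
  return { type: 'text', data: xmlhttp.responseText };
}
```

Called from the request's readystatechange handler, once readyState is 4, this lets the rest of the script branch on reply.type instead of assuming a document tree exists.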

This is quite annoying, because it invalidates an xmlhttp/accessibility approach I've been considering for some time now.

The simplest way of getting this to work is using HTML pages for both situations. In order to work properly without xmlhttprequest, the HTML page should be complete, i.e. it should contain <html>, <head>, and <body> tags. We should strip all these tags away, though, if the HTML page arrives by xmlhttprequest.

Unfortunately, as we just saw, HTML pages are not available as responseXML, but only as responseText, since HTML pages do not have MIME type text/xml. Therefore their document tree is inaccessible, and we can't change their structure.

overrideMimeType()

A partial solution is given by the overrideMimeType() method of xmlhttp. It basically says that the returned document should be treated as having the given MIME type, no matter what the server says. Use:
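The call itself is a single line, xmlhttp.overrideMimeType('text/xml'), made before the request is sent. A minimal sketch that wraps it in a feature test; forceXml() is a hypothetical helper name and 'page.html' is a placeholder URL.

```javascript
// Asks the browser to treat the reply as text/xml, but only where the
// method exists. Returns true if the override was applied.
function forceXml(xmlhttp) {
  if (typeof xmlhttp.overrideMimeType == 'function') {
    xmlhttp.overrideMimeType('text/xml');
    return true;
  }
  return false;
}

// In a browser (placeholder URL):
//   var xmlhttp = new XMLHttpRequest();
//   forceXml(xmlhttp);
//   xmlhttp.open('GET', 'page.html', true);
//   xmlhttp.send(null);
```

Even with the override in place, the page still has to be well-formed XML for the browser to build a document from it, which is one reason this is only a partial solution.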

Unfortunately the overrideMimeType() method doesn't work in Explorer Windows. It doesn't work in Opera, either, but there responseXML is always available anyway, so we don't need this method.

The Explorer Windows bug

So far so bad. I believe that the original creator of the script I'm working on found out that responseXML didn't work on HTML pages in Explorer, and that he opted for another solution, one that at first sight seems quite useful, but that unfortunately triggers a bug.

The Dashboard should not show the complete HTML page, including style sheets, <body> tags and such. Instead, it should only show the <div id="content"> in the page. In itself this is an excellent idea. Extract this div, move its children to the Dashboard, and the site uses an xmlhttp script while retaining accessibility.

However, in order to use the W3C DOM to extract the correct div, we need an XML document, and as we saw before responseXML doesn't work for HTML pages, least of all in Explorer Windows. Only responseText is available.

We could write an entire XML parser to parse responseText, but that's rather a lot of work. What the creator of the script did instead was load the responseText into an intermediate element, a created <div> that's not part of the document but floats around in the hyperspace of the browser object model. Simplified code:
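The script's own code is not reproduced here, so what follows is only a sketch of the technique as described. extractById() is a hypothetical name, and the document is passed in as a parameter so the function does not depend on a global.

```javascript
// Parses an HTML string inside a detached <div> and returns the first
// <div> with the given id, or null if there is none.
function extractById(doc, html, id) {
  var container = doc.createElement('div'); // never appended to the page
  container.innerHTML = html;               // the browser parses the HTML here
  // A detached element has no getElementById, so walk its divs instead.
  var divs = container.getElementsByTagName('div');
  for (var i = 0; i < divs.length; i++) {
    if (divs[i].id == id) {
      return divs[i];
    }
  }
  return null;
}

// In a browser:
//   var content = extractById(document, xmlhttp.responseText, 'content');
//   if (content) {
//     document.getElementById('dashboard').innerHTML = content.innerHTML;
//   }
```

Note that innerHTML is assigned twice in this flow: once on the detached container and once on the dashboard.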

An interesting approach. Unfortunately it fails when the content contains images.

After a few hours of experimenting I found that Explorer Windows now loads all images twice: once when they are written into container.innerHTML and once more when they are written into document.getElementById('dashboard').innerHTML.

Even worse: there seems to be some kind of glitch in the caching or the management of the (normal) http requests for the images. Every once in a while, especially on slow connections, Explorer Windows refuses to show the images in the Dashboard, because (I guess) it's still busy loading them from the server — or something.

I solved the problem by querying the readyState of all images in DashboardContent, and only allowing the dashboard to receive the data after all images reported they'd finished loading (readyState == 'complete'). This seems to work for the moment.
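A rough sketch of such a check, not the code actually used on the site: allImagesComplete() and whenImagesComplete() are hypothetical names, and the 100 millisecond default polling interval is an arbitrary assumption. It relies on the readyState property that Explorer exposes on images; in browsers without that property the poll would never finish, so it should only run where the workaround is needed.

```javascript
// True when every image in the list reports readyState 'complete'.
function allImagesComplete(images) {
  for (var i = 0; i < images.length; i++) {
    if (images[i].readyState != 'complete') {
      return false;
    }
  }
  return true;
}

// Calls callback once all <img> elements inside container have finished
// loading, polling every `interval` milliseconds until then.
function whenImagesComplete(container, callback, interval) {
  var images = container.getElementsByTagName('img');
  if (allImagesComplete(images)) {
    callback();
    return;
  }
  setTimeout(function () {
    whenImagesComplete(container, callback, interval);
  }, interval || 100);
}
```

The callback is where the content would be handed over to the dashboard, so nothing reaches the visible document while an image is still loading.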

Nonetheless it should be noted that as soon as you load a bit of HTML which includes an <img> tag into any element's innerHTML, Explorer starts downloading the images, even if the element is not attached to the visible document. Once you move the bit of HTML to the visible document, Explorer may decide not to show the images because it's still busy loading them — or something.

Comments

1 Posted by Ced-le-pingouin on 2 September 2005 | Permalink

As I understand it, you - or the guy who originally designed the page - need the intermediate div element created on-the-fly to easily get the "content" div from the loaded file.

And it seems that bugs arise because innerHTML is used twice (once for the intermediate div, once for the final element in the page), and that's what causes the loading error, so you have to add extra code just to prevent that weird bug. Am I right?

At one point, you speak of creating an xml parser to retrieve the wanted div, and you say that's a lot of work (and you're right, especially for such a simple task as getting a div in an html text).

But I was thinking, since you're forced to use responseText because of the responseXML problems, why not retrieve the "content" div using the old dumb and prehistoric way : text search functions ?

I mean, couldn't you use a javascript regexp on responseText to get the "..." part ?

This way you wouldn't need the first innerHTML assignment, and the contained images would load only on the "real" page element assignment, so I guess they would load the usual way, and only once.

ps: I've been reading your site for some time. I find it great and useful, congrats and keep up the good work.

2 Posted by Ced-le-pingouin on 2 September 2005 | Permalink

"..." should be '[div...id="content"...]...[/div]' (with the square brackets replacing html tags symbols)

(maybe I'll get it right eventually, sorry for the triple post, delete my second post if you will)

3 Posted by Ryan on 2 September 2005 | Permalink

What I've done in the past to keep things accessible for non-xmlhttprequest browsers is to append something to the querystring when using the httprequest. That way, when the script sees that $_GET['xml'] = true then it knows not to send all the redundant html code. All it takes is a few more lines of php code or whatever. Of course, you can't do this with static html.

4 Posted by Alex Lein on 2 September 2005 | Permalink

THANK YOU PPK!

I have been working on a prototype webapp for the past 2-3 weeks and I could not figure this out. My initial tests using .xml files were working and when I uploaded to the server, it began using .aspx files (and then stopped working). You have saved me so much trouble I can't begin to thank you! Great timing on the article.

@Ced-le-pingouin: If you used a variable and cut out the [div id='content'] using .substring(), how would you know where the ending [/div] would be? There could be multiple. If that's the case, then you'd have to trim the footer, but if the footer isn't a static length, you'd have no way of trimming it properly... without writing an HTML parser.

5 Posted by Dimitri Glazkov on 2 September 2005 | Permalink

That's what HTTP content negotiation is for.

http://www.w3.org/Protocols/rfc2616/rfc2616-sec12.html

This is not a bug, it's just a very restricted (and precise) way of doing things.

Doing what you were thinking about doing is very much possible and fairly easy to do, once you have content negotiation figured out. I've done it as a demo for one of my seminars this summer:

http://glazkov.com/Resources/Code/ContentType/

Finally, IMHO overrideMimeType is an ugly hack and should not be used unless you are out of other options.

6 Posted by Lon on 2 September 2005 | Permalink

Being a pragmatist I would simply use a regular expression and get on with it.

7 Posted by Michael on 2 September 2005 | Permalink

I probably misunderstood the first problem, because I do not see what the issue is with responseXML being empty. responseXML is a property that holds an XML object. It just so happens that if the headers are not properly set then it is empty. Why not simply do appropriate checks and then do xmlhttp.responseXML.loadXML(xmlhttp.responseText)?
Please correct me if I am misunderstanding the problem.

8 Posted by Memet on 3 September 2005 | Permalink

I haven't tried coding this but here's an idea: why not use the replaceChild DOM method.
Something like this:
var content = doc.getElementById('content');
doc.replaceChild( doc.firstChild, content );

Just an idea. Maybe IE will not reparse using this method.

I've had similar problems of using HTML through HTTPRequest myself, but in different contexts. I did end up having to put a 'isDynamic=true' querystring flag on pages that could be displayed either via regular HTTP, or through the XML object.

9 Posted by Angus Turnbull on 3 September 2005 | Permalink

I ran through almost this exact same beating-head-against-wall scenario when designing my own AJAX-like script :). I even experimented with instantiating ActiveX MSXML DOMDocument objects and attempting to import the content into them!

In the end I sidestepped it and used hidden IFRAME buffers in MSIE (both Win and Mac), Opera 7, and Safari 1.0-1.1, and XMLHttpRequest in Mozilla/Safari1.2+/Opera8+. That approach worked very well for retrieving remote HTML content (which was my design goal too).

10 Posted by Jacob on 3 September 2005 | Permalink

I've had huge problems doing similar things in the past - the best solution I could come up with was to use innerHTML type methods and text search functions (regular expressions are very useful for this sort of thing!).

The other workable alternative I could find uses hidden iframes as a buffer - how I (and probably many others) used to do AJAX type stuff before XMLHttpRequest came along.

Finally - an idea I'm playing with at the moment uses a php script for non-savvy browsers to import the relevant file. It seems to work well for my purposes but you might have to think about server load if the pages need a lot of processing. SSI might be lighter on the server but that's pure speculation, and there could be security issues to beware of there as well.

11 Posted by Dave Johnson on 7 September 2005 | Permalink

Check out this post from about a year ago outlining the image loading / caching problems with innerHTML and some ideas to get around them:
http://www.bazon.net/mishoo/articles.epl?art_id=958

12 Posted by Brennan Stehling on 28 September 2005 | Permalink

Can you post the contents for getHTMLById? I am having a hard time trying to do something very similar and I was expecting getElementById to be available for any XML content, but it seems I was mistaken. Your method would help me out a great deal.

13 Posted by execute on 14 December 2005 | Permalink

xmlhttp.responseXML.loadXML(xmlhttp.responseText)

This won't work. I've also tried loading the XMLDOM for both IE and FF, and THEN trying to parse it. But I don't think something.load or something.XMLload allows text input to be made into an XML document.

There must be a better solution. Hopefully I can figure it out soon.

14 Posted by Pavan Keely on 14 February 2006 | Permalink

xmlhttp.responseXML.loadXML(xmlhttp.responseText) will work without any problems. XML DOM document has a method "loadXML" which takes XML as a string and parses the XML. If it's not working for you, make sure that the XML sent by the server is well-formed.

15 Posted by BVen on 26 April 2006 | Permalink

use xmlhttp.responseXML.xml
instead of xmlhttp.responseXML