An explanation of how to retrieve/scrape website data using MSXML and HTMLDocument instead of the InternetExplorer application object.
The problem with the InternetExplorer.Application object
I have a few blog entries demonstrating how to get website data using InternetExplorer and the HTMLDocument object. However, there are a few problems with that approach:
- Microsoft is moving to Edge. Internet Explorer is no longer officially supported, or soon won't be.
- Internet Explorer often disconnects from the application object, causing the following error:
“The object invoked has disconnected from its clients”
Using MSXML – and some problems
Using the MSXML object is an upgrade and a more generic way to request web data. But it has its own challenges, such as:
- Loss of “getElementsByClassName” functionality on sub (child) elements of the HTMLDocument (the document object model).
- Problems creating sub-element objects to query, in order to reduce the number of items you have to search through.
I’m not going to explain these in detail. But if you’ve found this blog while trying to find a better way to scrape website data, it is likely you’ve run into one or more of these challenges.
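As a minimal sketch of the first challenge (the element id and class name here are placeholders, and this assumes a reference to the Microsoft HTML Object Library):

```vba
Dim hdoc As HTMLDocument
Dim child As IHTMLElement

Set hdoc = New HTMLDocument
hdoc.body.innerHTML = "<div id='main'><p class='row'>one</p></div>"

'Works at the document root:
Debug.Print hdoc.getElementsByClassName("row").Length

'But on a generically typed child element the method is not exposed,
'so a call like the commented line below fails:
Set child = hdoc.getElementById("main")
'child.getElementsByClassName("row")   'not available on IHTMLElement
```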
Use the HTMLObjectElement to solve the problem
The HTMLObjectElement works with the HTMLDocument and lets you create sub/child element objects that keep all the root-level methods of the HTMLDocument object, particularly getElementsByClassName.
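A minimal sketch of the pattern (the element id, class names, and HTML are placeholders, and this assumes a reference to the Microsoft HTML Object Library):

```vba
Dim hdoc As HTMLDocument
Dim chunk As HTMLObjectElement

Set hdoc = New HTMLDocument
hdoc.body.innerHTML = "<div id='main'><p class='row'>one</p><p class='row'>two</p></div>"

'Declaring the child as HTMLObjectElement keeps the root-level methods,
'including getElementsByClassName, available on the smaller chunk of the DOM.
Set chunk = hdoc.getElementById("main")
Debug.Print chunk.getElementsByClassName("row").Length
```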
The following code sample is just that… a sample. It should provide a good starting point for how to use these items together, but it is incomplete – no error checking, etc.
Sub BlogExample()
    Dim wpage As MSXML2.ServerXMLHTTP60
    Dim hdoc As HTMLDocument
    Dim strWebURL As String
    Dim strLinkToRetrieve As String
    Dim dtct As Long, ppgloop As Long
    Dim intRowCt As Long, loop1 As Long

    Set wpage = New MSXML2.ServerXMLHTTP60
    strWebURL = "http://www.theurltoscrape.com" 'just an example URL
    wpage.Open "GET", strWebURL, False
    wpage.send
    'you can add a loop to test readyState - but include an out in case it doesn't reach it

    Set hdoc = New HTMLDocument
    hdoc.body.innerHTML = wpage.responseText

    'Check to make sure the page opened or has the data you need. Check for a "known" element.
    dtct = hdoc.getElementsByClassName("dtable").Length
    ppgloop = 0
    'I'm going to check for this 4 additional times if I do not find it
    While dtct = 0 And ppgloop <= 4 'retry 4 times
        ppgloop = ppgloop + 1
        'Reopen the URL, fill the HTML document's innerHTML, and check for the element again.
        wpage.Open "GET", strWebURL, False
        wpage.send
        hdoc.body.innerHTML = wpage.responseText
        dtct = hdoc.getElementsByClassName("dtable").Length
    Wend

    If dtct > 0 Then
        'Retrieve the document information.
        'Assuming these elements are known - return the first table... we already know it contains rows.
        Dim mytbl As HTMLObjectElement
        Set mytbl = hdoc.getElementById("aform").getElementsByTagName("table")(0)
        intRowCt = mytbl.getElementsByTagName("tr").Length
        For loop1 = 1 To intRowCt
            Dim myrow As HTMLObjectElement
            Set myrow = mytbl.getElementsByTagName("tr")(loop1 - 1)
            'Now I can address and search for a class name or any other tag that is a child of the row.
            'This is assumptive, of course. It assumes that I will find those classes and tags
            'and that the first one contains the data I want.
            strLinkToRetrieve = myrow.getElementsByClassName("codata")(0).getElementsByTagName("a")(0).href
        Next
    Else
        'End the procedure and/or open a different page.
    End If

    Set hdoc = Nothing
    Set wpage = Nothing
End Sub
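On the readyState comment: with a synchronous request (the False argument to Open), send blocks until the response arrives, so the loop is mainly useful if you open the request asynchronously. A hedged sketch of that variant – the URL and the 30-second timeout are arbitrary examples:

```vba
Sub AsyncExample()
    Dim wpage As MSXML2.ServerXMLHTTP60
    Set wpage = New MSXML2.ServerXMLHTTP60

    wpage.Open "GET", "http://www.theurltoscrape.com", True 'True = asynchronous
    wpage.send

    'waitForResponse blocks for up to the given number of seconds and returns
    'False if the response never arrived - the "out" mentioned in the comment.
    If Not wpage.waitForResponse(30) Then
        'handle the timeout - log it, retry, or exit
        Exit Sub
    End If

    'readyState 4 = complete; status 200 = OK
    If wpage.readyState = 4 And wpage.Status = 200 Then
        Debug.Print Len(wpage.responseText) & " characters received"
    End If
End Sub
```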
Of course, other programmatic controls and error checking can be added. But, using this method I just ran a procedure that opened a page with a list of nearly 60 pages of company profiles (20 per page) and opened each company profile page in a new object, retrieving that company information and placing it into a database for my client.
It ran with no errors and took about 35 minutes.
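A run like that can be sketched as a simple pagination loop. The URL pattern, page-count, and class name below are hypothetical, and the database write is left as a comment:

```vba
Sub ScrapeProfilePages()
    Dim wpage As MSXML2.ServerXMLHTTP60
    Dim hdoc As HTMLDocument
    Dim pg As Long, lnk As Long
    Dim links As Object

    Set wpage = New MSXML2.ServerXMLHTTP60
    Set hdoc = New HTMLDocument

    For pg = 1 To 60 'roughly 60 list pages, 20 profiles per page
        wpage.Open "GET", "http://www.theurltoscrape.com/profiles?page=" & pg, False 'hypothetical URL pattern
        wpage.send
        hdoc.body.innerHTML = wpage.responseText

        Set links = hdoc.getElementsByClassName("profile-link") 'hypothetical class name
        For lnk = 0 To links.Length - 1
            'Open each profile URL with another request object here,
            'then write the retrieved fields into the database.
            Debug.Print links(lnk).href
        Next
    Next

    Set hdoc = Nothing
    Set wpage = Nothing
End Sub
```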
Using MSXML is faster and cleaner than using the InternetExplorer application object. It doesn’t “disconnect.” Using the HTMLDocument and HTMLObjectElement allows me to continue to use getElementsByClassName on smaller chunks of the document object model – creating much cleaner and more maintainable code.
Let me know if you get this working.