VBA solution: Scraping website information using MSXML, HTMLDocument, and getElementsByClassName

Summary

An explanation of how to retrieve/scrape website data using MSXML and HTMLDocument but not using the InternetExplorer Application Object.

The problem with the InternetExplorer.Application object

I have a few blog entries demonstrating how to get web site data using InternetExplorer and the HTMLDocument object. However, there are a few problems with this:

  • Microsoft is going to Edge. Internet Explorer is not officially being supported or won’t be
  • Internet Explorer often disconnects from the application object, causing the following error:
    “The object Invoked has disconnected from its clients”

Using MSXML – and some problems

Using the MSMXL object is an upgrade and more generic web application object. But it has it’s own challenges. Such as:

  • Loss of “getElementsByClassName” functionality on sub (child) element of HTMLDocument (the document object model).
  • Problems creating sub-element objects to query – in order to simplify (reduce) the number of items to query.

I’m not going to explain these in detail. But, if you’ve found the blog and been trying to find a better way to scrape website data, it is likely you’ve run into one or more of these challenges.

Use the HTMLObjectElement to solve the problem

The HTMLObject Element works with the HTMLDocument and lets you create sub/child elements and use all the root level methods of the HTMLDocument object. Particularly, GetElementsByClassName.

The following code sample is just that… a sample. But it should provide a good starting point for how to use these items together. They are incomplete – no error checking, etc.


Sub BlogExample()
Dim wpage As MSXML2.ServerXMLHTTP60
Dim hdoc As HTMLDocument
Set wpage = New MSXML2.ServerXMLHTTP60
strWebURL = "http://www.theurltoscrape.com" 'just an example URL
wpage.Open "GET", strWebURL, False
wpage.send

'you can add loop to test readystate - but include an out in case it doesn't reach it
        

Set hdoc = New HTMLDocument
hdoc.body.innerHTML = wpage.responseText

' check to make sure page opened or has the data you need. Check for a "known" element.
dtct = hdoc.getElementsByClassName("dtable").Length
ppgloop = 0

'I'm going to check for this 4 additional times if I do not find it
While dtct = 0 And ppgloop <= 4 'retry 4 times ppgloop = ppgloop + 1 'reopen URL, fill html document innerHTML and check for the element. wpage.Open "GET", strWebURL, False wpage.send hdoc.body.innerHTML = ipage.responseText dtct = ppg.getElementsByClassName("dtable").Length Wend If dtct > 0 Then
    ' Retrieve the document information
    'assuming these elements are known - return the first table.. we already know it contains rows.
    Dim mytbl As HTMLObjectElement
    Set mytbl = hdoc.getElementById("aform").getElementsByTagName("table")(0)
    intRowCt = mytbl.getElementsByTagName("tr").Length
    For loop1 = 1 To intRowCt
        Dim myrow As HTMLObjectElement
        Set myrow = mytbl.getElementsByTagName("tr")(loop1 - 1)
        'now I can address and search for classname or any other tag that is a child of the row
        ' this is assumptive of course. It assumes that I will find those classes and tags and that the first one
        ' contains the data I want.
        strLinkToRetrieve = myrow.getElementsByClassName("codata")(0).getElementsByTagName("a")(0).href
        
    Next

Else
    ' end the procedure and/or open a different page.
End If
Set hdoc = Nothing
Set wpage = Nothing

End Sub

Of course, other programmatic controls and error checking can be added. But, using this method I just ran a procedure that opened a page with a list of nearly 60 pages of company profiles (20 per page) and opened each company profile page in a new object, retrieving that company information and placing it into a database for my client.

It ran with no errors and took about 35 minutes.

Conclusion

Using MSXML is faster and cleaner than using the InternetExplorer application object. It doesn’t “disconnect. Using the HTMLDocument and HTMLObjectElement allows me to continue to use GetElementsByClassName on smaller chunks of the document object model – creating much cleaner and maintainable code.

Let me know if you get this working.

Posted in Consulting, Microsoft Access, Tips and Tools and tagged , .

Leave a Reply

Your email address will not be published. Required fields are marked *