Summary: Our client wanted to extract data from a web portal. We wrote a VBA script to the requested information and store it in an Access database. The example below demonstrates the same technology, but the data is stored in Excel.
Edit: 9/9/2016: I posted a download to the file here.
- Microsoft Access: Data and VBA
- Microsoft Internet Controls
- Microsoft HTML Document Control*
An understanding of the HTML Document Object Model (DOM) will help you A LOT.
Scrape web content using VBA
Screen scraping used to be a way to grab characters off of mainframes and AS/400’s. It is much easier to scrape web site information due to the DOM. Actually, it’s not scraping at all – more of a structured retrieval of data from HTML.
I cannot provide the exact code from the above project as it was a bought and paid for by our client and belongs to them. I’ll explain the methodology and provide sample code to retrieve web content. I’ll also demonstrate the retrieving information from the DOM.
I will provide the source code for you to retrieve information from my blog’s root page – the one listing the most recent 10 blog entries found at the link below:
I’ll explain a couple critical differences in the code provided and the code we built for our client.
Create an Access table: tblWebData
With the following fields:
- wd_uno (autonumber, key)
- ArticleTitle (Short Text, 255)
- ArticleURL (Short Text, 255)
- ArticleAuthor (Short Text, 255)
- ArticleSummary (Long Text)
- DateRetrieved (date/time)
Note, I would size them differently but I was in a hurry. I’ll leave that to you.
Create an Access form (frmWebTest)
I put a single button on the form:btnGetWebData
This is the code for that button plus another sub-routine.
Private Sub btnGetWebData_Click() Dim strURL Dim iPage As InternetExplorer Dim iHTML As HTMLDocument strURL = "http://www.pulseinfomatics.com/blog/" Set iPage = New InternetExplorer iPage.Navigate strURL While iPage.ReadyState &amp;amp;lt;&amp;amp;gt; READYSTATE_COMPLETE ' I'll make a big assumption that it will get to the page, otherwise, you'd want to have a way to catch a timeout DoEvents Wend Set iHTML = iPage.Document 'put the HTML from the navigated webpage into the HTML Document object 'look for the "article" tag and return number of articles on the page intNumA = iHTML.getElementsByTagName("article").length For x = 1 To intNumA 'loop through each article strAURL = iHTML.getElementsByTagName("article").Item(x - 1).getElementsByTagName("header").Item(0).getElementsByTagName("a").Item(0).href strATitle = FixQuote(iHTML.getElementsByTagName("article").Item(x - 1).getElementsByTagName("header").Item(0).getElementsByTagName("a").Item(0).innerText) strAAuthor = iHTML.getElementsByTagName("article").Item(x - 1).getElementsByTagName("header").Item(0).getElementsByTagName("a").Item(1).innerText strASummary = FixQuote(iHTML.getElementsByTagName("article").Item(x - 1).getElementsByTagName("p").Item(0).innerText) SaveWebInfo strAURL, strATitle, strAAuthor, strASummary Next Set iHTML = Nothing Set iPage = Nothing End Sub Sub SaveWebInfo(inURL, inTitle, inAuthor, inSummary) Dim strSQL As String strSQL = "INSERT INTO tblWebData (ArticleURL,ArticleTitle,ArticleAuthor,ArticleSummary,DateRetrieved) " & _ "VALUES ('" & inURL & "','" & inTitle & "', '" & inAuthor & "','" & inSummary & "', #" & Now() & "#)" DoCmd.SetWarnings False DoCmd.RunSQL strSQL DoCmd.SetWarnings True End Sub
Variations of this functionality can be found in various places around the web. I grabbed mine from this blog.
It ensures you don’t get errors when trying to save text with quotes.
' courtesy "dschmidt333", utteraccess.com Function FixQuote(FQText As String) As String On Error GoTo Err_FixQuote FixQuote = Replace(FQText, "'", "''") FixQuote = Replace(FixQuote, """", """""") Exit_FixQuote: Exit Function Err_FixQuote: MsgBox Err.Description, , "Error in Function Fix_Quotes.FixQuote" Resume Exit_FixQuote Resume 0 '.FOR TROUBLESHOOTING End Function
Some DOM elements and nesting explained
My goal is not to teach you how to navigate the DOM. You can start here to learn more about the DOM.
intNumA = iHTML.getElementsByTagName(“article”).length
Returns the number of “article” tags on the page. For my blog, each article is contained in an article tag.
Returns an array of any elements with the tag name (“article”)
You can now loop through each article to retrieve article-specific data.
iHTML.GetElementsByTagName(“article”).Item(x – 1)
Refers to a specific article tag and it’s associated HTML. x started counting 1. Arrays start at 0. x – 1 ensures you are referring to the correct article tag.
Once you refer to a specific article (x – 1) the objects you refer to after that relate only to that specific article. Elements that are not encapsulated (nested) in that “article” tag are ignored. Remember that when reading the href and innerTEXT references below.
The first a tag href is the link to the blog entry.
The second href is the blog author. Subsequent hrefs, if they exist, link to categories. But I am only concerned with item(0) and item(1).
The innerTEXT of each hrefs returns the linked text. The article URL’s innerTEXT is the title. The author URL’s innerTEXT is the author name.
References the only paragraph in each article, the summary text.
That gives you some idea of how to reference items in the HTML Document Object Model.
How my client’s project differed.
In order to retrieve several pages of profiles, I had to:
- navigate to a page
- loop through information from a list
- present the user information and choices as to what they wanted to copy
- retrieve the data from a list
- navigate to sub-pages if they exist
- retrieve information from sub-pages
- move to the next page and repeat
Here’s an example where I navigate the DOM on our website