VBA webscraping, html text to excel: How to extract child element while ignoring parent?

I'm a beginner trying to do a basic web scrape from an HTML table into Excel, and I'm struggling with one specific website's design. From what I understand, most tables are organized with tags like so: (table -> tr -> td), that is, table, then row, then cell.

I can deal with that easily enough, but the 'main-table' I'm trying to extract from has tables and rows contained within a cell, like (main table -> row -> cell -> sub-table -> sub-row -> sub-cell). For the life of me, I can't get a clean Excel sheet.

Here's what the html looks like:

[image: detailed html]

[image: overview html]

What I need to do is extract only the NCI and $392,764 values. So far, though, I'm extracting duplicates of these values using .innerText. I'm hoping someone can help me write a simple macro that grabs only the last child element from each parent cell. Thanks!

Here's what I've got so far...

This is where the trouble comes about:

Sub processhtmlpage(htmlpage As mshtml.HTMLDocument)

Dim htmlTable As mshtml.IHTMLElement
Dim htmlTables As mshtml.IHTMLElementCollection
Dim HTMLRow As mshtml.IHTMLElement
Dim htmlrows As mshtml.IHTMLElementCollection
Dim htmlcell As mshtml.IHTMLElement
Dim rownum As Long, colnum As Integer


Set htmlTables = htmlpage.getElementsByTagName("table")
Set HTMLInnerTables = htmlpage.getElementsByTagName("table")
Set HTMLInnerRows = htmlpage.getElementsByTagName("tr")


    For Each htmlTable In htmlTables
        Worksheets.Add
        Range("a1").Value = htmlTable.className
        Range("b1").Value = Now
        rownum = 2
            For Each HTMLRow In htmlTable.getElementsByTagName("tr")
            colnum = 1
                For Each htmlcell In HTMLRow.getElementsByTagName("td")
                    Cells(rownum, colnum) = htmlcell.innerText
                    colnum = colnum + 1
                Next htmlcell
                rownum = rownum + 1
            Next HTMLRow
    Next htmlTable
End Sub

I've made some crazy attempts to circumvent this using If...Then statements, but after hours, I'm just totally lost. There's got to be a better way. Please help! In case it's helpful, here's the cluster**** I ended up with:

Sub processhtmlpage(htmlpage As mshtml.HTMLDocument)

Dim htmlTable As mshtml.IHTMLElement
Dim htmlTables As mshtml.IHTMLElementCollection
Dim HTMLRow As mshtml.IHTMLElement
Dim htmlrows As mshtml.IHTMLElementCollection
Dim htmlcell As mshtml.IHTMLElement
Dim rownum As Long, colnum As Integer
Dim HTMLInnerTables As mshtml.IHTMLElementCollection
Dim HTMLInnerTable As mshtml.IHTMLElement
Dim HTMLInnerRow As mshtml.IHTMLElement
Dim HTMLInnerows As mshtml.IHTMLElementCollection
Dim innerhtmlcell As mshtml.IHTMLElement

Set htmlTables = htmlpage.getElementsByTagName("table")
Set HTMLInnerTables = htmlpage.getElementsByTagName("table")
Set HTMLInnerRows = htmlpage.getElementsByTagName("tr")

    For Each htmlTable In htmlTables
        If htmlTable.getAttribute("id") <> "main-table" Then
        GoTo line4
        End If

        Worksheets.Add
        Range("a1").Value = htmlTable.className
        Range("b1").Value = Now
        rownum = 2

            For Each HTMLRow In htmlTable.getElementsByTagName("tr")
                If HTMLRow.getAttribute("bgcolor") = "#ffffff" Or HTMLRow.getAttribute("class") = "lop" Then
                    GoTo line6
                End If
                colnum = 1

              For Each htmlcell In HTMLRow.getElementsByTagName("td") [line 6]
                        If htmlcell.getAttribute("nowrap") = "nowrap" Then
                        GoTo line1
                        Else
                        If htmlcell.getAttribute("colspan") = 2 Then
                            Cells(rownum, colnum) = htmlcell.innerText
                            rownum = rownum + 1
                            Call stupidcell
                            Else
                        End If
                        For Each HTMLInnerTable In htmlcell.getElementsByTagName("table")
                            If HTMLInnerTable.getAttribute("id") <> "main-table" Then
                                GoTo line1
                            End If
                        Next HTMLInnerTable
                            For Each HTMLInnerRow In htmlcell.getElementsByTagName("tr")
                                If HTMLInnerRow.getAttribute("bgcolor") = "#ffffff" Then
                                    GoTo line1
                                End If
                            Next HTMLInnerRow [line5]
                        Next HTMLInnerTable
                        Cells(rownum, colnum) = htmlcell.innerText [line2]
                        colnum = colnum + 1
                    Next htmlcell [line1]
                    rownum = rownum + 1
            Next HTMLRow [line3]
        Next htmlTable [line4]
End Sub

3 answers

  • answered 2018-11-08 01:05 ashleedawg

    This is not an answer and so will probably get flagged, but this is the only way to leave a comment including graphics, so maybe the S.O. Gods will let it slide (otherwise, I'll just delete!)


    When I have a complicated set of nested For's/If's, I'll copy it to a text editor (preferably Notepad++) and remove all the code except for the nested parts, and organize it so I can see where my problem is.

    The first procedure is nested properly:
    [image]

    The second procedure has a problem, noted in red:
    [image]

    You're closing the For loop for HTMLInnerTable twice: Next HTMLInnerTable appears a second time after Next HTMLInnerRow.
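
    For readers who cannot see the screenshots, here is a rough reconstruction of that kind of outline for the second procedure. It is built from the code in the question, not from the original image. Only the For/If skeleton is kept, and the nesting is balanced so that it compiles; the comments mark where the question's code differs. NestingSkeleton is a made-up name, and the late-bound Object types are an assumption.

    ```vba
    ' Skeleton only: the loops do no work, they just show the nesting.
    Sub NestingSkeleton(htmlTables As Object)
        Dim htmlTable As Object, HTMLRow As Object, htmlcell As Object
        Dim HTMLInnerTable As Object, HTMLInnerRow As Object

        For Each htmlTable In htmlTables
            For Each HTMLRow In htmlTable.getElementsByTagName("tr")
                For Each htmlcell In HTMLRow.getElementsByTagName("td")
                    If htmlcell.getAttribute("nowrap") = "nowrap" Then
                        ' question: GoTo line1
                    Else
                        For Each HTMLInnerTable In htmlcell.getElementsByTagName("table")
                        Next HTMLInnerTable
                        For Each HTMLInnerRow In htmlcell.getElementsByTagName("tr")
                        Next HTMLInnerRow
                        ' question: a second "Next HTMLInnerTable" sits here,
                        ' with no open For left for it to close
                    End If  ' question: this End If appears to be missing
                Next htmlcell
            Next HTMLRow
        Next htmlTable
    End Sub
    ```

    Laid out this way, every Next lines up under its own For, which makes an extra or missing closer easy to spot.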

  • answered 2018-11-08 01:51 Tim Williams

    Here's a very basic start at extracting the text from the table - you just need to target the cells of interest and perform a little cleanup on the text.

    Definitely not a "web standards" approach, but sometimes a rough and ready method is easier to manage, and there's little point getting too elaborate when a small change in the page layout can ruin the whole thing.

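    Sub NIHTable()
    
        Dim htmlpage As Object, tbl As Object, rw As Object, cl As Object
        Dim rownum As Long, cellNum As Long
    
        'populate htmlpage (it must expose a .document property)
    
        Set tbl = htmlpage.document.getElementById("main-table")
        rownum = 0
        'tbl.Rows and rw.Cells hold only this table's own rows and cells,
        'so cells inside nested sub-tables are not visited a second time
        For Each rw In tbl.Rows
            rownum = rownum + 1
            Debug.Print "------ Row# " & rownum
            cellNum = 0
            For Each cl In rw.Cells
                cellNum = cellNum + 1
                'flatten the line breaks that nested rows leave in innerText
                Debug.Print , cellNum, Trim(Replace(cl.innerText, vbCrLf, ";"))
            Next cl
        Next rw
    
    End Sub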
    

    Sample row output:

    ------ Row# 9
                   1            
                   2            1 R43 CA23616401
                   3            
                   4            DEVELOPMENT OF TARGETED, SAFE AND EFFECTIVE DRUGS AGAINST PANCREATIC DUCTAL ADENOCARCINOMA (PDAC) BY LEVERAGING A NOVEL, COMPREHENSIVE, COMPUTATIONAL DRUG DISCOVERY APPROACH
                   5            HEUER, TIM S.
                   6            TWOXAR, INC.
                   7            2018
                   8            NCI
                   9            ,NCI ,,$225,030
                   10           
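
    The "little cleanup" mentioned above could look something like the sketch below. It splits a cell's innerText on line breaks and keeps only the non-empty pieces, so a cell that wraps a sub-table yields its inner values one by one. NonEmptyParts is a hypothetical helper name, and it assumes the nested values are separated by line breaks, as the sample output suggests.

    ```vba
    ' Hypothetical helper: returns the trimmed, non-empty lines of a cell's text.
    Function NonEmptyParts(ByVal cellText As String) As Collection
        Dim parts() As String, piece As String, i As Long
        Dim result As Collection
        Set result = New Collection

        ' Normalise CRLF and lone CR to LF so one Split handles them all
        cellText = Replace(cellText, vbCrLf, vbLf)
        cellText = Replace(cellText, vbCr, vbLf)

        parts = Split(cellText, vbLf)
        For i = LBound(parts) To UBound(parts)
            piece = Trim$(parts(i))          ' Trim$ strips spaces only
            If Len(piece) > 0 Then result.Add piece
        Next i

        Set NonEmptyParts = result
    End Function
    ```

    Inside the loop over rw.Cells, something like `For Each p In NonEmptyParts(cl.innerText)` (with p declared As Variant) would then visit each value separately. Splitting on line breaks rather than on commas keeps amounts such as $225,030 in one piece.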
    

  • answered 2018-11-08 06:51 QHarr

    Without the actual HTML to use, this is not tested. In particular, I can't see what other elements may match the following CSS selector pattern.

    The items you want have table tagged parents. More specifically, they are in a td element with class lop that has an align attribute, inside a tr element with class lop. Using CSS descendant combinator syntax, that looks like:

    table tr.lop td.lop[align]
    

    You could gather the elements matching this pattern with the querySelectorAll method of HTMLDocument as follows:

    Dim nodeList As Object, i As Long
    Set nodeList = htmlpage.querySelectorAll("table tr.lop td.lop[align]")
    For i = 0 To nodeList.Length - 1
        Debug.Print Trim$(nodeList.item(i).innerText)
    Next i
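
    If the selector turns out to match parent cells as well as the nested ones, one possible way to drop the parents is to skip any matched cell that itself contains a table. This is a sketch of a different technique from the one above and is equally untested against the real page. PrintLeafMatches is a hypothetical name, htmlpage is assumed to be a document object that supports querySelectorAll, and the selector is written with the type selector first (td.lop[align]).

    ```vba
    ' Sketch only: prints matched cells that do not wrap a nested table.
    Sub PrintLeafMatches(htmlpage As Object)
        Dim nodeList As Object, node As Object, i As Long

        Set nodeList = htmlpage.querySelectorAll("table tr.lop td.lop[align]")
        For i = 0 To nodeList.Length - 1
            Set node = nodeList.item(i)
            ' A cell that wraps a sub-table repeats its children's text in
            ' innerText, so keep only cells with no table inside them
            If node.getElementsByTagName("table").Length = 0 Then
                Debug.Print Trim$(node.innerText)
            End If
        Next i
    End Sub
    ```

    Unlike de-duplicating afterwards, this keeps values that legitimately repeat from row to row.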
    

    If duplicated values are still a problem, consider loading the retrieved values into a dictionary as keys, which keeps each value only once, and reading the keys back out at the end:

    Dim nodeList As Object, i As Long, dict As Object
    Set nodeList = htmlpage.querySelectorAll("table tr.lop td.lop[align]")
    Set dict = CreateObject("Scripting.Dictionary")
    
    For i = 0 To nodeList.Length - 1
        dict(Trim$(nodeList.item(i).innerText)) = vbNullString
    Next i
    Dim arr()
    arr = dict.keys '<== retrieve unique values
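
    To get those unique values onto a worksheet, a minimal sketch might look like this. WriteKeysToSheet is a hypothetical helper, and writing to column A starting at row 1 is just an assumption about the desired layout.

    ```vba
    ' Hypothetical helper: writes each dictionary key to column A of ws.
    Sub WriteKeysToSheet(dict As Object, ws As Worksheet)
        Dim uniqueVals As Variant, i As Long

        If dict.Count = 0 Then Exit Sub      ' nothing to write

        uniqueVals = dict.Keys               ' zero-based Variant array
        For i = LBound(uniqueVals) To UBound(uniqueVals)
            ws.Cells(i + 1, 1).Value = uniqueVals(i)
        Next i
    End Sub
    ```

    Called as `WriteKeysToSheet dict, ActiveSheet`, it lists one value per row. Bear in mind that a dictionary also collapses values that legitimately repeat on the page (for example, the same institute code appearing in several rows), so this only suits cases where each value is expected once.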