The context is that I have a boatload of data dictionaries for various data bases strewn across multiple Word document files. Those files have (among other extraneous information) tables that contain the actual code book parts that I need. I want to be able to open a Word doc containing these data dictionaries (one Word doc might have 4-5 tables corresponding to separate data sets), extract the tables to Excel workbook format, depositing the separate tables to sheets. Ultimately, I will do this for 100+ separate Word documents.
The code below is a good start. It gets tables in separate sheets in an Excel workbook file. From there I will be generating a sheet that will act as an index. That index page will contain a list of variable names that link to the appropriate sheet containing said variable for a specific data set. The index sheet will have a lay out that makes for easy sorting (so columns for original data set name, variable name, maybe others).
I have tried, and failed, to get the code below to name each sheet in a way that will facilitate my end goal without manual editing. Right now it spits them out as "Sheet1,,,,SheetN." I need them to be something like "DictionaryName1,,,DictionaryNameN" where "DictionaryName" is based on the Word file from which they originate. That will make automating the eventual creation of the index page much easier.
Right now, this is set up to have a dialog that asks to select the originating file. Since this needs to be done with so many files, the eventual goal is to instead select a directory and have this all happen for each Word doc in that directory (or select a set of Word docs via giving a path, whatever). Ideally, everything will end up in a single Excel workbook (rather than having separate workbooks per each originating Word doc, which is how this is currently geared). Also, I want to leave the separate sheets as close to unformatted as possible so that I would have the option to chop them back out as plain text .csv files that could be easily ingested by R, SAS, or other stats packages (for automating generation of custom data sets composed of variables from multiple source data sets).
There are other details, such as figuring out an intelligent way to discard tables in the Word docs that are not the data dictionaries (e.g. some of the Word docs contain what is intended to be a table of contents that is formatted as a table), but I'm setting that aside until I get at least this much worked out.
Sub CopyTables()
Dim oWord As Word.Application
Dim WordNotOpen As Boolean
Dim oDoc As Word.Document
Dim oTbl As Word.Table
Dim fd As Office.FileDialog
Dim FilePath As String
Dim wbk As Workbook
Dim wsh As Worksheet
' make a prompt to select dictionary file... could automate this to do over all files in a directory instead
Set fd = Application.FileDialog(msoFileDialogOpen)
With fd
.Filters.Clear
.Filters.Add "Word Documents (*.docx)", "*.docx", 1
.Title = "Choose a Word File"
If .Show = True Then
FilePath = .SelectedItems(1)
Else
Beep
Exit Sub
End If
End With
On Error Resume Next
Application.ScreenUpdating = False
' make new workbook
Set wbk = Workbooks.Add(Template:=xlWBATWorksheet)
' start up Word or get it if open
Set oWord = GetObject(Class:="Word.Application")
If Err Then
Set oWord = New Word.Application
WordNotOpen = True
End If
On Error GoTo Err_Handler
' open document up
Set oDoc = oWord.Documents.Open(Filename:=FilePath)
' loop over tables from document
For Each oTbl In oDoc.Tables
' make a sheet
Set wsh = wbk.Worksheets.Add(After:=wbk.Worksheets(wbk.Worksheets.Count))
' copy/paste table into sheet
oTbl.Range.Copy
wsh.Paste
'tried using e.g. oTbl.Title, FilePath, etc. here to name the sheets, but no luck...
Next oTbl
' gets rid of first sheet
Application.DisplayAlerts = False
wbk.Worksheets(1).Delete
Application.DisplayAlerts = True
Exit_Handler:
On Error Resume Next
oDoc.Close SaveChanges:=False
If WordNotOpen Then
oWord.Quit
End If
' clean up object references
Set oTbl = Nothing
Set oDoc = Nothing
Set oWord = Nothing
Application.ScreenUpdating = True
Exit Sub
Err_Handler:
MsgBox "something went wrong... " & Err.Description, vbCritical, "Error: " & Err.Number
Resume Exit_Handler
End Sub
I am a novice with VBA. Barely ever touch it, but was asked to use it for this. Like mentioned above, there are further steps to complete the overall task. Other than what was previously mentioned, another thing that jumps to mind is that some of the tables that are the data code book parts have headers (as in a first row with the name of the specific data set) while others do not (e.g. the row 1 reads like column names, with varname, var_description, type, format, comments). So that will need to be sorted eventually in order for this task to be reproducible in the future from the source Word docs. Changing the workflow is not an option at this point, unfortunately. I have also considered having the end result simply be a single, large table. That could be accomplished by adding a column(s) to denote source data set and combining all of the separate data code books together. Either way, being able to intelligently name the sheets is my current sticking point.