Python – Extracting data from MS Word

Tags: ms-word, python, pywin32, vba

I am looking for a way to extract / scrape data from Word files into a database. Our corporate procedures have Minutes of Meetings with clients documented in MS Word files, mostly due to history and inertia.

I want to be able to pull the action items from these meeting minutes into a database so that we can access them from a web-interface, turn them into tasks and update them as they are completed.

Which is the best way to do this:

  1. VBA macro from inside Word to create CSV and then upload to the DB?
  2. VBA macro in Word with connection to DB (how does one connect to MySQL from VBA?)
  3. Python script via win32com then upload to DB?

The last option appeals to me because the web interface is being built with Django, but I've never used win32com or tried scripting Word from Python.

EDIT: I've started extracting the text with VBA because it makes it a little easier to deal with the Word Object Model. I have a problem though: all the text is in tables, and when I pull the strings out of the cells I want, I get a strange little box character at the end of each string. My code looks like:

sFile = "D:\temp\output.txt"
fnum = FreeFile
Open sFile For Output As #fnum

num_rows = Application.ActiveDocument.Tables(2).Rows.Count

For n = 1 To num_rows
    Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text
    Assign = Application.ActiveDocument.Tables(2).Cell(n, 3).Range.Text
    Target = Application.ActiveDocument.Tables(2).Cell(n, 4).Range.Text
    If Target = "" Then
        ExportText = ""
    Else
        ExportText = Descr & Chr(44) & Assign & Chr(44) & _
            Target & Chr(13) & Chr(10)
        Print #fnum, ExportText
    End If
Next n

Close #fnum

What's up with the little control character box? Is some kind of character code coming across from Word?

Best Solution

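Word ends the text of every table cell with an end-of-cell marker. When you read a cell's Range.Text, the marker comes back as two trailing characters: a carriage return, Chr(13), followed by Chr(7). The Chr(7) is the little box you are seeing.

It is used just like an end-of-paragraph marker in paragraphs: to store the formatting for the entire paragraph.

Just use the Left() function to strip it out, dropping both characters, i.e.

 Target = Left(Target, Len(Target) - 2)

Because of the marker, an empty cell never compares equal to "", so strip it before the If Target = "" test.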
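If you end up doing the clean-up on the Python side instead (your option 3), the same stripping can be sketched as below. This is only an illustration: the names clean_cell_text and rows_to_csv are made up for this example, and the sketch assumes each row arrives as three raw cell strings (description, assignee, target) that still carry the trailing marker characters.

```python
import csv
import io

# Characters Word appends to a table cell's Range.Text:
# a carriage return (\r) followed by the end-of-cell mark (\x07).
CELL_END_CHARS = "\r\x07"


def clean_cell_text(raw):
    """Return the cell text without any trailing \\r or \\x07 characters."""
    # rstrip removes every trailing \r and \x07, so trailing empty
    # paragraphs inside the cell are dropped as well.
    return raw.rstrip(CELL_END_CHARS)


def rows_to_csv(rows):
    """Build CSV text from (description, assignee, target) rows.

    Rows whose target cell is empty after cleaning are skipped,
    mirroring the If Target = "" check in the VBA code.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        descr, assign, target = (clean_cell_text(cell) for cell in row)
        if target:
            writer.writerow([descr, assign, target])
    return buffer.getvalue()
```

Using the csv module rather than joining with commas by hand means a description that itself contains a comma gets quoted instead of breaking the columns.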

By the way, instead of

 num_rows = Application.ActiveDocument.Tables(2).Rows.Count
 For n = 1 To num_rows
      Descr = Application.ActiveDocument.Tables(2).Cell(n, 2).Range.Text

Try this:

 For Each row In Application.ActiveDocument.Tables(2).Rows
      ' Read the second cell of the current row
      Descr = row.Cells(2).Range.Text
 Next row
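Since you mentioned win32com: the same row-by-row loop can be driven from Python. The following is a rough sketch, not something tested against your documents. It assumes pywin32 is installed, that Word is available on the machine, and that the action items sit in the second table in columns 2 to 4, as in your VBA. The function names read_action_rows and extract_actions are invented for this example.

```python
def read_action_rows(table):
    """Yield (description, assignee, target) tuples from a Word table.

    `table` is expected to behave like a Word Table COM object:
    iterable Rows, and row.Cells(n).Range.Text for the cell text.
    """
    for row in table.Rows:
        # Columns 2, 3 and 4 as in the VBA code; strip the trailing
        # carriage return and end-of-cell mark from each cell.
        yield tuple(
            row.Cells(col).Range.Text.rstrip("\r\x07") for col in (2, 3, 4)
        )


def extract_actions(path):
    """Open the document at `path` in Word and return its action rows."""
    # Imported here so the module still loads where pywin32 is missing.
    import win32com.client

    # DispatchEx starts a separate Word instance, so calling Quit()
    # below should not close a Word window the user already has open.
    word = win32com.client.DispatchEx("Word.Application")
    try:
        doc = word.Documents.Open(path)
        try:
            return list(read_action_rows(doc.Tables(2)))
        finally:
            doc.Close(False)  # False: do not save changes
    finally:
        word.Quit()
```

One caveat that applies to the VBA version too: iterating over Rows raises an error when the table contains vertically merged cells, in which case you would have to fall back to addressing cells by index.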