Issue
Is there a reasonable way to extract plain text from a Word file that doesn’t depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform – that’s non-negotiable in this case.)
Antiword seems like a reasonable option, but it may be abandoned.
A Python solution would be ideal, but doesn’t appear to be available.
Solution
I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped them in Python functions, so they are easy to call from the parsing system (which is also written in Python).
    import os

    def doc_to_text_catdoc(filename):
        # os.popen3 exists only in Python 2; it returns (stdin, stdout, stderr).
        (fi, fo, fe) = os.popen3('catdoc -w "%s"' % filename)
        fi.close()
        retval = fo.read()
        erroroutput = fe.read()
        fo.close()
        fe.close()
        # Treat any output on stderr as a failure.
        if not erroroutput:
            return retval
        else:
            raise OSError("Executing the command caused an error: %s" % erroroutput)

    # similar doc_to_text_antiword()
The -w switch to catdoc turns off line wrapping, BTW.
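On Python 3, where os.popen3 is no longer available, the same idea can be sketched with the subprocess module. This is only a rough adaptation, not the original answer's code: the helper name run_text_extractor is made up for illustration, the antiword wrapper assumes that antiword prints the converted text to stdout when given a filename, and the output is decoded with the locale's default encoding.

```python
import subprocess


def run_text_extractor(command):
    """Run an external converter (argument list) and return its stdout as text.

    Hypothetical helper: like the function above, it treats any output on
    stderr as a failure and raises OSError.
    """
    # Passing a list avoids shell quoting problems with odd filenames.
    proc = subprocess.run(command, capture_output=True, text=True)
    if proc.stderr:
        raise OSError("Executing the command caused an error: %s" % proc.stderr)
    return proc.stdout


def doc_to_text_catdoc(filename):
    # -w turns off line wrapping, as in the original function.
    return run_text_extractor(["catdoc", "-w", filename])


def doc_to_text_antiword(filename):
    # Assumes antiword is installed and writes plain text to stdout.
    return run_text_extractor(["antiword", filename])
```

If the converter is not installed, subprocess.run raises FileNotFoundError, which is a subclass of OSError, so callers can catch both kinds of failure with a single except OSError clause.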
Answered By – codeape
This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.