Automating Word Report Generation with Python

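Preface

This is a short note on how to use a Python script to parse an XML file and insert the parsed data, as variables, into a docx file. The main reason for doing this is that at work I need to turn the output of certain tools (.xml files) into docx files and deliver them as reports. Reports like this usually follow a fixed format, and only some of the data has to be filled in dynamically. So we can represent those dynamic fields as variables, have a Python script parse the XML file and insert the data at the fields marked by the variables, and then generate the final document. That way we avoid tedious copy and paste work and save a lot of time.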

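XML Parsing

Step one: how do you parse an XML file with Python?
I currently know of two approaches.
1. Use the SAX (Simple API for XML) parser from the Python standard library. SAX uses an event-driven model: while parsing the XML it fires a series of events and calls user-defined callback functions to process the file. The advantage is that the XML file is read as a stream, which is fast and uses little memory, but the user has to define the callback functions.
2. Use DOM (Document Object Model), which loads the XML file into memory and parses it into a tree; the XML is then processed by operating on that tree. Because all of the XML data has to be mapped into memory, this approach is slower and uses more memory, but it feels more flexible than SAX.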

Parsing XML with SAX

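SAX is event driven, and parsing an XML document with SAX involves two parts: the parser and the event handler.
The parser parses the XML file and sends events to the event handler at key moments, for example when an element starts or ends.
The event handler responds to the events it receives, mainly by calling the callback functions registered by the user.
Some of the key functions are listed below:
1. At the start of the document: startDocument()
2. On reaching the end of the document: endDocument()
3. On an XML start tag: startElement(name, attrs)
4. On an XML end tag: endElement(name)
5. Create a new parser object: xml.sax.make_parser()
6. Create a SAX parser and parse an XML document: xml.sax.parse()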
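To make the order of these callbacks concrete, here is a minimal sketch that only records which callback fires and when. The EventLogger class and the tiny inline XML string are my own illustration, not part of the tool output used below; xml.sax.parseString is the standard library call for parsing from memory.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record the order in which the parser fires its callbacks."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, name, attrs):
        self.events.append("start:" + name)

    def endElement(self, name):
        self.events.append("end:" + name)


handler = EventLogger()
# Hypothetical two-element document, just large enough to show nesting.
xml.sax.parseString(b"<errors><error id='nullPointer'/></errors>", handler)
print(handler.events)
```

The list shows that an empty element such as `<error .../>` still produces both a start and an end event, and that the document events wrap everything else.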
Example:
Target XML file

<?xml version="1.0" encoding="UTF-8"?>
<results version="2">
    <errors>
        <error id="arrayIndexOutOfBounds" severity="error" msg="Array &apos;a[10]&apos; accessed at index 10, which is out of bounds." verbose="Array &apos;a[10]&apos; accessed at index 10, which is out of bounds." cwe="119">
            <location file=" /cpp/arrayindexoutofbounds.cpp" line="11" info="Array index out of bounds"/>
            <location file=" /cpp/arrayindexoutofbounds.cpp" line="7" info="Assignment &apos;max=10&apos;, assigned value is 10"/>
            <symbol>a</symbol>
        </error>
        <error id="returnDanglingLifetime" severity="error" msg="Returning pointer to local variable &apos;sz&apos; that will be invalid when returning." verbose="Returning pointer to local variable &apos;sz&apos; that will be invalid when returning." cwe="562">
            <location file=" /cpp/autovar.cpp" line="5"/>
            <location file=" /cpp/autovar.cpp" line="3" info="Variable created here."/>
            <location file=" /cpp/autovar.cpp" line="5" info="Array decayed to pointer here."/>
        </error>
        <error id="bufferAccessOutOfBounds" severity="error" msg="Buffer is accessed out of bounds: sz" verbose="Buffer is accessed out of bounds: sz">
            <location file=" /cpp/bufferaccessoutofbounds.cpp" line="5"/>
            <symbol>sz</symbol>
        </error>
        <error id="nullPointer" severity="error" msg="Null pointer dereference" verbose="Null pointer dereference" cwe="476">
            <location file=" /cpp/nonthreadsafefunc.cpp" line="5" info="Null pointer dereference"/>
        </error>
        <error id="resourceLeak" severity="error" msg="Resource leak: pFile" verbose="Resource leak: pFile" cwe="775">
            <location file=" /cpp/resourceleak.cpp" line="8"/>
            <symbol>pFile</symbol>
        </error>
        <error id="stlOutOfBounds" severity="error" msg="When ii==foo.size(), foo[ii] is out of bounds." verbose="When ii==foo.size(), foo[ii] is out of bounds." cwe="788">
            <location file=" /cpp/stloutofbounds.cpp" line="7"/>
            <symbol>foo</symbol>
        </error>
    </errors>
</results>

Parsing script:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import xml.sax                          # the xml.sax library must be imported first


class XMLHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.location = ""
        self.symbol = ""

    def startElement(self, tag, attributes):
        # Called for every opening tag; attributes holds that tag's attributes
        self.CurrentData = tag
        if tag == "error":
            print("\n\n\n********** error **********")
            self.echoInfo("id", attributes, "id:")
            self.echoInfo("severity", attributes, "severity:")
            self.echoInfo("msg", attributes, "msg:")
            self.echoInfo("verbose", attributes, "verbose:")
            self.echoInfo("cwe", attributes, "cwe:")

        if tag == "location":
            self.echoInfo("file", attributes, "location.file:")
            self.echoInfo("line", attributes, "location.line:")
            self.echoInfo("info", attributes, "location.info:")

    # Unused draft, kept for reference. characters() receives element text as a
    # plain string, so attribute values such as "file" cannot be read from it.
    # def characters(self, content):
    #     if self.CurrentData == "location":
    #         file = content["file"]
    #         print("location.file:", file)
    #     elif self.CurrentData == "symbol":
    #         self.symbol = content
    # def endElement(self, tag):
    #     if self.CurrentData == "location":
    #         print("location:", self.location)
    #     elif self.CurrentData == "symbol":
    #         print("symbol:", self.symbol)

    def echoInfo(self, tag, attributes, label):
        # Print the attribute value, or "None" when the attribute is absent
        if tag in attributes:
            data = attributes[tag]
            print(label, data)
        else:
            print(label + " None")


if __name__ == "__main__":
    parser = xml.sax.make_parser()          # create an XMLReader
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)  # turn off namespace handling
    Handler = XMLHandler()
    parser.setContentHandler(Handler)
    parser.parse("log.xml")                 # parse the target XML

Output

********** error **********
id: arrayIndexOutOfBounds
severity: error
msg: Array 'a[10]' accessed at index 10, which is out of bounds.
verbose: Array 'a[10]' accessed at index 10, which is out of bounds.
cwe: 119
location.file:  /cpp/arrayindexoutofbounds.cpp
location.line: 11
location.info: Array index out of bounds
location.file:  /cpp/arrayindexoutofbounds.cpp
location.line: 7
location.info: Assignment 'max=10', assigned value is 10
[................................skip...................................]
********** error **********
id: stlOutOfBounds
severity: error
msg: When ii==foo.size(), foo[ii] is out of bounds.
verbose: When ii==foo.size(), foo[ii] is out of bounds.
cwe: 788
location.file:  /cpp/stloutofbounds.cpp
location.line: 7
location.info: None
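Printing is fine for a quick look, but to fill a report the data has to be kept somewhere. Below is a rough sketch of a handler that collects each error into a dictionary instead of printing it, and that also picks up the text of `<symbol>` through characters(). The ErrorCollector name, the dictionary layout and the shortened SAMPLE document are my own assumptions for illustration.

```python
import xml.sax

# Shortened, hypothetical sample in the same shape as log.xml.
SAMPLE = b"""<results version="2">
  <errors>
    <error id="nullPointer" severity="error" msg="Null pointer dereference" cwe="476">
      <location file=" /cpp/nonthreadsafefunc.cpp" line="5" info="Null pointer dereference"/>
    </error>
    <error id="resourceLeak" severity="error" msg="Resource leak: pFile" cwe="775">
      <location file=" /cpp/resourceleak.cpp" line="8"/>
      <symbol>pFile</symbol>
    </error>
  </errors>
</results>"""


class ErrorCollector(xml.sax.ContentHandler):
    """Collect every <error> into a dict instead of printing it."""

    def __init__(self):
        super().__init__()
        self.errors = []
        self._in_symbol = False

    def startElement(self, name, attrs):
        if name == "error":
            record = {key: attrs.get(key) for key in ("id", "severity", "msg", "cwe")}
            record["locations"] = []
            record["symbol"] = ""
            self.errors.append(record)
        elif name == "location" and self.errors:
            self.errors[-1]["locations"].append(
                {key: attrs.get(key) for key in ("file", "line", "info")}
            )
        elif name == "symbol":
            self._in_symbol = True

    def characters(self, content):
        # Element text can arrive in several chunks, so append rather than assign.
        if self._in_symbol and self.errors:
            self.errors[-1]["symbol"] += content

    def endElement(self, name):
        if name == "symbol":
            self._in_symbol = False


collector = ErrorCollector()
xml.sax.parseString(SAMPLE, collector)
for err in collector.errors:
    print(err["id"], err["symbol"], len(err["locations"]))
```

Missing attributes come back as None from attrs.get(), which matches the "None" shown in the output above for a location without info.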

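Parsing XML with DOM

Document Object Model (DOM): when a DOM parser parses an XML file, it loads the whole file into memory at once and keeps every element of the document in a tree structure in memory. The tree can then be walked through the API that DOM provides, with different functions for reading or modifying the document content. One drawback is that this uses a fair amount of memory.
Example:
Same target, different parser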

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import xml.dom.minidom

# Open the target XML file with the minidom parser
DOMTree = xml.dom.minidom.parse("log.xml")
collection = DOMTree.documentElement
# Note: in log.xml "errors" is a child element of <results>, not an attribute,
# so this branch never runs for that file.
if collection.hasAttribute("errors"):
    print("Root element: %s" % collection.getAttribute("errors"))

# Get all <error> elements
error = collection.getElementsByTagName("error")
for err in error:
    print("********* error **********")
    if err.hasAttribute("id"):
        print("id: %s" % err.getAttribute("id"))
    if err.hasAttribute("severity"):
        print("severity: %s" % err.getAttribute("severity"))
    if err.hasAttribute("msg"):
        print("msg: %s" % err.getAttribute("msg"))
    if err.hasAttribute("verbose"):
        print("verbose: %s" % err.getAttribute("verbose"))
    if err.hasAttribute("cwe"):
        print("cwe: %s" % err.getAttribute("cwe"))
    # Each <error> can contain several <location> children
    location = err.getElementsByTagName("location")
    for loc in location:
        if loc.hasAttribute("file"):
            print("file: %s" % loc.getAttribute("file"))
        else:
            print("file: " + 'null')
        if loc.hasAttribute("line"):
            print("line: %s" % loc.getAttribute("line"))
        else:
            print("line: " + 'null')
        if loc.hasAttribute("info"):
            print("info: %s" % loc.getAttribute("info"))
        else:
            print("info: " + 'null')

Output

********* error **********
id: arrayIndexOutOfBounds
severity: error
msg: Array 'a[10]' accessed at index 10, which is out of bounds.
verbose: Array 'a[10]' accessed at index 10, which is out of bounds.
cwe: 119
file:  /cpp/arrayindexoutofbounds.cpp
line: 11
info: Array index out of bounds
file:  /cpp/arrayindexoutofbounds.cpp
line: 7
info: Assignment 'max=10', assigned value is 10
[..................................skip......................................]
********* error **********
id: stlOutOfBounds
severity: error
msg: When ii==foo.size(), foo[ii] is out of bounds.
verbose: When ii==foo.size(), foo[ii] is out of bounds.
cwe: 788
file:  /cpp/stloutofbounds.cpp
line: 7
info: null
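The script above only reads attributes. As a small sketch of the other things the tree gives you, the snippet below reads the root element's own attribute and the text inside `<symbol>`, which lives in a child text node rather than in an attribute. The node_text helper and the shortened SAMPLE string are my own illustration; the minidom calls are the standard ones.

```python
from xml.dom.minidom import parseString

# Shortened, hypothetical sample in the same shape as log.xml.
SAMPLE = """<results version="2"><errors>
<error id="resourceLeak" severity="error" msg="Resource leak: pFile" cwe="775">
<location file=" /cpp/resourceleak.cpp" line="8"/>
<symbol>pFile</symbol>
</error>
</errors></results>"""


def node_text(element):
    """Join the text nodes directly under an element."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


doc = parseString(SAMPLE)
root = doc.documentElement
print(root.tagName, root.getAttribute("version"))

for err in root.getElementsByTagName("error"):
    symbols = [node_text(s) for s in err.getElementsByTagName("symbol")]
    for loc in err.getElementsByTagName("location"):
        # getAttribute returns "" for a missing attribute, so "or" supplies a default.
        info = loc.getAttribute("info") or "null"
        print(err.getAttribute("id"), symbols, loc.getAttribute("line"), info)
```

Using `getAttribute(...) or "null"` is a shorter alternative to the hasAttribute checks, as long as an empty attribute value and a missing one can be treated the same way.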

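Generating the Word Report Automatically

This step uses a Python library called docxtpl. It fills the placeholder fields defined in a given Word template with content, and is typically used to put the results produced by some tool into a Word template to generate a work report.
First install this third-party library with the following command: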

pip install docxtpl

Usage:
Word template file:

Python script file:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import time
from docxtpl import DocxTemplate


class CreateDocx():
    def __init__(self, TemplateFileName, NewFileName):
        self.TemplateFileName = TemplateFileName
        self.NewFileName = NewFileName

    def post(self):
        tpl = DocxTemplate(self.TemplateFileName)           # load the template file
        localtime = time.asctime(time.localtime(time.time()))
        # Keys must match the variable names used in the template
        context = {'date_1': localtime, 'version_1': 'v1.0', 'total_1': '0xFFFF', 'error_1': '0xFFFF', 'warning_1': '0xFFFF', 'style_1': '0xFFFF'}
        tpl.render(context)                                 # fill in the data
        tpl.save(self.NewFileName + '_' + str(time.time()) + '.docx')   # save the output file

    def writeData(self):
        pass


if __name__ == "__main__":
    Docx = CreateDocx('./CodeScan.docx', 'CodeScan')
    Docx.post()
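The script above fills the template with fixed '0xFFFF' values. One way to connect it to the parsed XML is sketched below: count the `<error>` elements by severity and build the context from the counts. This is only a guess at what the fields are for. I am assuming that total_1, error_1, warning_1 and style_1 are counts and that "warning" and "style" are possible severity values; the build_context function and the SAMPLE string are hypothetical.

```python
import time
from collections import Counter
from xml.dom.minidom import parseString

# Hypothetical sample in the same shape as log.xml.
SAMPLE = """<results version="2"><errors>
<error id="nullPointer" severity="error" msg="Null pointer dereference"/>
<error id="resourceLeak" severity="error" msg="Resource leak: pFile"/>
<error id="someWarning" severity="warning" msg="Example warning"/>
</errors></results>"""


def build_context(xml_text, version="v1.0"):
    """Build the template context from the XML (assumed meaning of each key)."""
    root = parseString(xml_text).documentElement
    severities = Counter(
        err.getAttribute("severity") for err in root.getElementsByTagName("error")
    )
    return {
        "date_1": time.asctime(),            # current local time
        "version_1": version,
        "total_1": sum(severities.values()),
        "error_1": severities["error"],      # Counter returns 0 for a missing key
        "warning_1": severities["warning"],
        "style_1": severities["style"],
    }


context = build_context(SAMPLE)
print(context)
# The dictionary can then be passed to tpl.render(context) in place of the
# fixed values in the script above.
```

The template itself has to contain the matching variable names, since docxtpl only replaces the fields it finds there.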

Word result file:

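Summary

Just a quick note for myself; otherwise it feels like I never know what I actually got done each day. Sigh...
Full SAX API reference: https://docs.python.org/3/library/xml.sax.html
Full DOM API reference: https://docs.python.org/3/library/xml.dom.html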