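Two days ago I ran into a project-archiving requirement: the archive requires every Word, Excel, and PowerPoint file to have a matching PDF version.
The project contains several thousand Word, Excel, and PowerPoint files scattered across different directories, so converting them by hand would clearly be slow and tedious.
So I tried driving Microsoft Office through its COM interface to do the batch conversion automatically. Python, MATLAB, PHP, and other languages can all call COM interfaces; I use Python here.
Here is the source code first: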
# office.py
import comtypes.client
import os


def PPTtoPDF(powerpoint, inputFileName, formatType=32):
    # powerpoint = comtypes.client.CreateObject("Powerpoint.Application")
    # powerpoint.Visible = 1
    filename, file_extension = os.path.splitext(inputFileName)
    outputFileName = filename + ".pdf"
    deck = powerpoint.Presentations.Open(inputFileName)
    deck.SaveAs(outputFileName, formatType)  # formatType = 32 for ppt to pdf
    deck.Close()


def WordtoPDF(word, inputFileName, formatType=17):
    # word = comtypes.client.CreateObject("Word.Application")
    # word.Visible = 1
    filename, file_extension = os.path.splitext(inputFileName)
    outputFileName = filename + ".pdf"
    deck = word.Documents.Open(inputFileName)
    deck.SaveAs(outputFileName, formatType)  # formatType = 17 for word to pdf
    deck.Close()


def ExceltoPDF(excel, inputFileName, formatType=0):
    # excel = comtypes.client.CreateObject("Excel.Application")
    # excel.Visible = 1
    filename, file_extension = os.path.splitext(inputFileName)
    outputFileName = filename + ".pdf"
    books = excel.Workbooks.Open(inputFileName)
    # formatType = 0 for excel to pdf; Excel exports instead of using SaveAs
    books.ExportAsFixedFormat(formatType, outputFileName, 0, True, True)
    books.Close()
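One thing the three functions above share is that a failed SaveAs or export leaves the file open in the Office application, because Close() is never reached. A minimal sketch of one way around that, assuming a hypothetical helper named convert_and_close (not part of the original office.py) that receives the "open" and "export" steps as callables and closes the document in a finally block:

```python
import os


def convert_and_close(open_document, input_path, export):
    """Open a document, export it to PDF, and always close it afterwards.

    open_document: callable taking a path and returning a document object
                   with a Close() method (e.g. word.Documents.Open).
    export:        callable taking (document, output_path) that writes the PDF.
    Both parameter names are illustrative, not part of any Office API.
    """
    output_path = os.path.splitext(input_path)[0] + ".pdf"
    doc = open_document(input_path)
    try:
        export(doc, output_path)
    finally:
        # Runs even if the export raised, so the file is not left open.
        doc.Close()
    return output_path
```

With Word, for example, this could be called as convert_and_close(word.Documents.Open, path, lambda d, out: d.SaveAs(out, 17)). Any exception still propagates to the caller, so the logging in the batch script keeps working.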
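The design of the Microsoft Office COM interface is rather odd. The collection that represents files is Documents in Word, Workbooks in Excel, and Presentations in PowerPoint; I don't know why they weren't unified under a single name such as Files.
Another oddity is the SaveAs argument: saving as PDF uses formatType=32 in PowerPoint but formatType=17 in Word, and I don't know why the two weren't unified. Stranger still, Excel's SaveAs offers no PDF option at all; PDF output lives in the ExportAsFixedFormat method instead.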
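Since the three magic numbers are easy to mix up, it can help to give them names and keep them in one table. The sketch below is only an illustration: the constant names follow the VBA enum members (ppSaveAsPDF, wdFormatPDF, xlTypePDF), while PDF_TARGETS and pdf_target are hypothetical names that do not appear in the original code.

```python
import os

# Numeric values as used in the conversion functions above.
PP_SAVE_AS_PDF = 32   # PowerPoint: ppSaveAsPDF, passed to Presentation.SaveAs
WD_FORMAT_PDF = 17    # Word: wdFormatPDF, passed to Document.SaveAs
XL_TYPE_PDF = 0       # Excel: xlTypePDF, passed to Workbook.ExportAsFixedFormat

# Hypothetical lookup table: file extension -> (application, PDF constant).
PDF_TARGETS = {
    ".doc": ("word", WD_FORMAT_PDF),
    ".docx": ("word", WD_FORMAT_PDF),
    ".xls": ("excel", XL_TYPE_PDF),
    ".xlsx": ("excel", XL_TYPE_PDF),
    ".ppt": ("powerpoint", PP_SAVE_AS_PDF),
    ".pptx": ("powerpoint", PP_SAVE_AS_PDF),
}


def pdf_target(path):
    """Return (application, constant) for an Office file, or None otherwise."""
    extension = os.path.splitext(path)[1].lower()
    return PDF_TARGETS.get(extension)
```

Lowercasing the extension means a file named REPORT.DOCX is treated the same as report.docx, which a plain string comparison would miss.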
With the conversion functions in place, a short script does the rest:
import os, csv
import comtypes.client
from office import WordtoPDF, PPTtoPDF, ExceltoPDF

# Start the COM servers
word = comtypes.client.CreateObject("Word.Application")
excel = comtypes.client.CreateObject("Excel.Application")
ppt = comtypes.client.CreateObject("Powerpoint.Application")
word.Visible = 1
excel.Visible = 1
ppt.Visible = 1

# Root directory to scan
workdir = r"D:\0.tem\zdzx\课题2017ZX05049006归档文件-20210811"

# Log file: one row per file (full path, file name, 1 = success / 0 = failure)
f = open('convert_log.csv', 'w', encoding='utf-8', newline="")
csv_write = csv.writer(f)

for root, dirs, files in os.walk(workdir):
    for name in files:
        fileadd = os.path.join(root, name)
        filename, file_extension = os.path.splitext(fileadd)
        # Skip files that already have a PDF next to them
        if not os.path.exists(filename + '.pdf'):
            if file_extension in ('.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx'):
                try:
                    if file_extension in ('.doc', '.docx'):
                        WordtoPDF(word, fileadd)
                    elif file_extension in ('.xls', '.xlsx'):
                        ExceltoPDF(excel, fileadd)
                    else:  # '.ppt' or '.pptx'
                        PPTtoPDF(ppt, fileadd)
                    csv_write.writerow([fileadd, name, 1])
                except Exception:
                    # Record the failure and move on to the next file
                    csv_write.writerow([fileadd, name, 0])

f.close()
word.Quit()
excel.Quit()
ppt.Quit()
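Before launching Office on thousands of files, it can be useful to list what the script would actually convert. The following is a rough sketch of the same directory walk without any COM calls; find_pending and OFFICE_EXTENSIONS are hypothetical names, and two behaviours differ from the script above by assumption: extensions are compared case-insensitively, and names starting with "~$" (the temporary owner files Office creates next to an open document) are skipped.

```python
import os

OFFICE_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


def find_pending(workdir):
    """Return a sorted list of Office files under workdir that have no PDF yet."""
    pending = []
    for root, _dirs, files in os.walk(workdir):
        for name in files:
            if name.startswith("~$"):
                continue  # Office lock/owner file, not a real document
            full_path = os.path.join(root, name)
            base, extension = os.path.splitext(full_path)
            if extension.lower() not in OFFICE_EXTENSIONS:
                continue
            if not os.path.exists(base + ".pdf"):
                pending.append(full_path)
    return sorted(pending)
```

Calling find_pending(workdir) and printing the result gives a dry run; the returned paths can then be fed to the conversion functions one by one.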
The project's GitHub address, mainly for my own records:
rename/covert at master · gouff/rename (github.com)
Finally, here is where to look up the classes and methods of the Office COM interface.
Office Visual Basic for Applications (VBA) reference | Microsoft Docs