This article does not set out to prove that the “Program Text” document is important or necessary, let alone that it needs to be developed as a document someone will actually use. Its purpose is to show the basics of automating the processing of documents in *.doc (*.docx) format using scripts written in Python.
The “Program Text” document (hereinafter, the Document) is part of the set of program documents whose list is defined by GOST 19.101-77… The GOST is undoubtedly quite old, but its requirements still apply in software development, so the Document still has to be produced.
The content of the Document is defined by GOST 19.401-78 and must include either a symbolic notation in the source language, or a symbolic notation in intermediate languages, or a symbolic representation of machine codes (etc.).
Given the volume of modern web applications developed using multiple frameworks, it is rather difficult to prepare this document without using automation.
There are two possible ways to create the document:
the developers collect the content of all project files into one text file and hand it to a technical writer for processing;
the developers give the technical writer an archive with the project files, which the technical writer must process on their own.
The first option depends on the developers' workload. To produce a text file with the program code, someone on the development team has to set aside other work, check out the project code from the repository, and write a program that processes the checked-out code and produces a text file. The file can be anywhere from a couple of megabytes to a couple of hundred megabytes, and it still has to be inserted into the Document and formatted somehow. Inserting that much text into a Microsoft Word file can take from 30 minutes to several hours. Moreover, once the insertion is complete, the Document will be unstructured and unreadable.
The second option depends on the technical skills of the person doing the work, but it is preferable because it does not involve people who are not responsible for documentation. This is the option we will automate.
Automating the Document means transferring the program code into a document with a given formatting and structure, so that it can be used in the future (an unlikely, but viable use case).
A Microsoft Word *.docx file is a ZIP archive of data in XML markup (the older *.doc format is binary). The file has elements such as paragraphs, figures, tables, and styles. To access a file and its elements from other programs, Microsoft once developed the OLE (Object Linking and Embedding) technology. OLE allows you to hand off part of the work from one editing program to another and get the results back.
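As a quick illustration of the XML markup mentioned above: a *.docx file (unlike the older binary *.doc) is a ZIP archive whose members are XML parts, so the standard library is enough to look inside it. The sketch below is not part of the original script; the function name list_docx_parts and the file name template.docx in the usage note are illustrative assumptions.

```python
import zipfile


def list_docx_parts(path):
    """Return the names of the XML parts stored inside a .docx archive.

    Hypothetical helper for illustration only; it relies solely on the
    fact that a .docx file is a ZIP container.
    """
    with zipfile.ZipFile(path) as archive:
        return [name for name in archive.namelist() if name.endswith(".xml")]
```

For a typical document, list_docx_parts("template.docx") should include word/document.xml, the part that holds the body text.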
Libraries for working with Microsoft Office files are available for almost all programming languages. For Python, this is the python-docx library, which works with the XML inside *.docx files directly rather than through OLE and does not support the older binary *.doc format. The library description is available at the link. You can install the library with the command:
!pip install python-docx
The general algorithm for the development of the Document is a sequence of steps:
Document template preparation;
Preparing a build script in Python;
Updating the table of contents and the page count;
Saving to PDF.
The final Document is easier to generate based on a previously prepared template (for example, a file named template.docx). Formatting and attributes (program name, developer, decimal number, annotation, body text) of the Document template must comply with GOST 19.104-78…
In addition to the GOST requirements, the Document template must meet the following requirements (the sequence of steps is described for Microsoft Word 2019):
headings of levels 1, 2, and 3 must be carried over into the table of contents (References → Table of Contents → Custom Table of Contents → Options → select the created styles and assign them a level);
the document template must contain a phrase, formatted in the required style, in place of which the program code will be inserted, for example “<КОД ПРОГРАММЫ>” (Russian for “PROGRAM CODE”).
The project folder contains a significant number of files, some of which are not code. These include image files for the program interface, framework files, and files left behind by tools used during development (Git, SVN, Docker). All of these files need to be filtered out.
A function that accepts only the files that meet the condition looks like this:
def check(string):
    """Return True if the file name has one of the extensions to include."""
    return string.endswith(('.js', '.vue', '.json', '.css'))
To get a list of files that match the filter conditions, you first need to get a list of all directories in the project directory:
import os

# folder_name is the path to the project root
folder = []
for i in os.walk(folder_name):
    folder.append(i)
From the resulting list of directories, we read the names of all files in all directories of the project, and those that meet the filter conditions are added to a separate list:
paths = []
for address, dirs, files in folder:
    for file in files:
        if check(file):
            # os.path.join picks the right separator for the platform
            paths.append(os.path.join(address, file))
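The walk above visits every directory, including service folders such as .git or node_modules that never contain code for the Document. A minimal sketch of one way to skip them during the walk is shown here; the names collect_paths, EXTENSIONS, and SKIP_DIRS, and the particular directories listed, are assumptions rather than part of the original script.

```python
import os

# Assumed values for illustration; adjust to the project at hand.
EXTENSIONS = (".js", ".vue", ".json", ".css")
SKIP_DIRS = {".git", ".svn", "node_modules"}


def collect_paths(root):
    """Return sorted paths of files under root that have a wanted extension."""
    paths = []
    for address, dirs, files in os.walk(root):
        # Editing dirs in place tells os.walk not to descend into them.
        dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
        for file in files:
            if file.endswith(EXTENSIONS):
                paths.append(os.path.join(address, file))
    return sorted(paths)
```

Pruning during the walk avoids reading large dependency trees at all, which matters more than filtering by extension afterwards when the project pulls in many packages.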
Before processing the resulting list of files, you need to prepare a function that reads the contents of a file into a single string while preserving the line breaks. This speeds up the later insertion of the code into Word.
import codecs

def read_file(filename):
    """Read a UTF-8 file (with or without BOM) into one string."""
    with codecs.open(filename, "r", "utf_8_sig") as f:
        return f.read()
Using the read_file function, we read the contents of the filtered files, inserting directory and file separator lines along the way. These lines are needed later to generate the table of contents automatically.
total_code_file = []
catalog = None
for path in paths:
    # folder_name_len is assumed to be the length of the project root path
    relative = path[folder_name_len:]
    current = os.path.dirname(relative)
    if current != catalog:
        # a new directory starts: add a directory separator line
        catalog = current
        total_code_file.append('Каталог ' + catalog + '\n')
    total_code_file.append('Файл ' + relative + '\n')
    total_code_file.append(read_file(path))
    total_code_file.append('\n')
To transfer the resulting program code, we open the document, look for the control phrase, replace it with an empty string, and start inserting the code. If an entry starts with the word “Каталог” (Directory), we format it in the level 2 heading style; if it starts with the word “Файл” (File), we use the level 3 heading style; the rest of the text is formatted in the program code style:
from docx import Document

doc = Document(sample_file)
for p in doc.paragraphs:
    if '<КОД ПРОГРАММЫ>' in p.text:
        p.text = ''  # remove the control phrase
        for line in total_code_file:
            text = line.rstrip()
            # the first word tells separator lines apart from code
            first_word = text.split(None, 1)[0] if text else ''
            if first_word == 'Каталог':
                p.insert_paragraph_before(text, 'ЗАГ_2')
            elif first_word == 'Файл':
                p.insert_paragraph_before(text, 'ЗАГ_3')
            else:
                p.insert_paragraph_before(text, 'КОД')
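The choice of style in the loop above depends only on the first word of each entry, so it can be isolated and checked without opening Word. The helper below is a hypothetical sketch (the name style_for is not from the original script); it assumes the style names ЗАГ_2, ЗАГ_3, and КОД exist in the template.

```python
def style_for(entry):
    """Pick the template style name for one entry of the assembled listing."""
    words = entry.split(None, 1)  # split off only the first word
    first = words[0] if words else ""
    if first == "Каталог":
        return "ЗАГ_2"  # directory separator -> level 2 heading
    if first == "Файл":
        return "ЗАГ_3"  # file separator -> level 3 heading
    return "КОД"  # everything else is program code
```

Splitting off only the first word avoids tokenizing the whole contents of a large source file just to inspect its beginning.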
When finished, save the document. This completes the programmatic processing of the document.
After the program text has been inserted into the Document, open it in Microsoft Word, select all (Ctrl + A), and update the fields (F9). This operation must be performed twice: the fields are updated sequentially, and once the table of contents has been generated, the total number of pages changes.
This operation takes time because Word performs page calculations, sequentially processing the document to the end.
Once the fields have been updated, save the Document in *.pdf format using Word. This fixes the formatting and avoids further work with the file in Word, since Word recalculates the number of pages each time the file is opened or modified. A *.pdf file has no such problem.
A *.pdf file is larger, but it opens easily in any suitable program. Links in the table of contents still work after saving.
The complete project code is available at the link.
The automation described here does not handle errors caused by differing file encodings, and it can be developed further:
handling errors related to file encoding;
automatic download of the project archive;
programmatically triggering the recalculation of the fields.
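The first item in this list could be approached by trying several encodings in turn. The following is a rough sketch under stated assumptions, not the author's implementation: the function name read_file_safe and the candidate list (UTF-8 with an optional BOM, then Windows-1251) are illustrative choices.

```python
def read_file_safe(filename, encodings=("utf_8_sig", "cp1251")):
    """Read a text file, trying each candidate encoding in order.

    Hypothetical variant of read_file; the default encodings are an
    assumption and should match what the project actually uses.
    """
    with open(filename, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the file, replacing bytes that cannot be decoded.
    return raw.decode(encodings[0], errors="replace")
```

Reading the bytes once and decoding in memory means a failed attempt does not reopen the file, and the final fallback ensures one odd file does not stop the whole build.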