Editing Word Documents and PDF Files with Python

Working with DOCX and PDF files in Python

PDF (Portable Document Format) and DOCX (Office Open XML format) are probably two of the most common document formats on the planet. Businesses use them for internal and external communications, contracts, and so on. Job hunters write their CVs in these formats, and recruiters build apps to read CV contents. There are many more use cases where PDF and DOCX files store unstructured information, but you get the idea. In this article, we’ll take an introductory look at how to work with these two formats using Python.

Getting the libraries

PDF and DOCX aren’t simple text files, so you can’t simply read them as text with the open() function. To work with PDF files, we need the PyPDF2 package. To work with DOCX files, we need the python-docx package. So, let’s install those; from a terminal or cmd window, type the following:

pip install python-docx PyPDF2

If everything goes well, pip finishes with a message saying that the packages were successfully installed.
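
To double-check the installation from Python itself, something like the following sketch could help. It only uses the standard library; the helper name check_modules is my own, and the one thing it relies on is that python-docx is imported as docx while PyPDF2 is imported under its own name.

```python
import importlib.util

# Import names differ from the pip package names:
# python-docx is imported as "docx", PyPDF2 as "PyPDF2".
REQUIRED_MODULES = ("docx", "PyPDF2")


def check_modules(module_names=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# Report the status of both libraries without importing them.
for name, available in check_modules().items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

If either line says "missing", rerun the pip command above in the same Python environment that runs your script.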

Working with PDF files

For starters, let’s open a PDF file, find out the total number of pages, and then print some text. That’s not a bad start. To do that, we first import the PyPDF2 package, then open a sample PDF file for reading.

import PyPDF2

pdf_file = open('sample1.pdf', 'rb')

The second statement of the snippet above opens a PDF file named 'sample1.pdf'. It’s one of those free ebooks that I got for subscribing to a newsletter; feel free to use whatever PDF file you have lying around. We open the file for reading with the built-in open() function and pass 'rb' as the second argument because we want to open it in binary mode.
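
To make the binary-mode point a little more concrete, here is a small sketch that is not part of the walkthrough itself. It reads the first few bytes of a file in 'rb' mode and checks for the %PDF- marker that PDF files normally start with. The helper name looks_like_pdf is my own, and the check is only a rough sanity test, not a validation of the whole file.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header bytes."""
    # 'rb' hands back raw bytes; text mode would try to decode them as characters.
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(5)
    return header == b"%PDF-"
```

Calling looks_like_pdf('sample1.pdf') should return True for a regular PDF file. The with block also closes the file automatically once the bytes have been read.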

Next, we create a PDF reader object using the PyPDF2 library, like this:

pdf_reader = PyPDF2.PdfReader(pdf_file)

We pass the pdf_file object to PdfReader (named PdfFileReader in older PyPDF2 releases; the old name was removed in PyPDF2 3.0), and that gives us a PDF reader object. With that object, we can do some useful things already. Let’s get the total number of pages in the PDF document.

num_pages = len(pdf_reader.pages)
print("Number of pages = {0}".format(num_pages))

The reader exposes the pages of the document through its pages property, so the built-in len() function gives us the page count (older releases offered a numPages property for this). In the code sample above, I simply assigned the count to a variable and printed it with some Python string formatting.

Running my code thus far prints the following result:

Number of pages = 253

My sample ebook has a total of 253 pages. Let’s try to print the first page. To do that, we need to add code that extracts the text from a page.

page_object = pdf_reader.pages[0]
content = page_object.extract_text()
print(content)

The first line of the snippet above gets a page object from the PDF reader object. I used index zero, which means the very first page of the PDF document; like list indexes, page indexes in PyPDF2 are zero-based.

The next line invokes the extract_text() method (spelled extractText() in older PyPDF2 releases), which, well, extracts the text of that page, and the last line simply prints the extracted text.

Let’s run it.

Hmm. That’s a bit anticlimactic. Nothing. Nada. A quick inspection of my sample PDF shows the reason for the disappointment: the first page of the PDF is an image, so there was no text to extract. This is a good time to mention that text extraction in PyPDF2, powerful and mighty as it is, does not read images, charts, or other media in PDF files. It only extracts text and returns it as Python strings.

Alright. Let’s adjust our code then.

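# Index 1 is the second page of the document.
page_object = pdf_reader.pages[1]
# extract_text() is the current name of the older extractText() method.
content = page_object.extract_text()
print(content)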

Running it now prints the text of the second page.

That’s more like it. Now my little program can extract text from PDF files. All you have to do now is write a loop that goes through all the PDF pages and extracts the text from each one. You can certainly do that all by yourself now, can’t you? You might need the full code listing of our little sample; here it is.

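#!/usr/bin/env python3

import PyPDF2

# Open the sample PDF in binary mode.
pdf_file = open('sample1.pdf', 'rb')

# PdfReader replaces the older PdfFileReader, which PyPDF2 3.0 removed.
pdf_reader = PyPDF2.PdfReader(pdf_file)

# The number of pages is the length of the reader's page list.
num_pages = len(pdf_reader.pages)
print("Number of pages = {0}".format(num_pages))

# Pages are zero-based, so index 1 is the second page.
page_object = pdf_reader.pages[1]
content = page_object.extract_text()
print(content)

pdf_file.close()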
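
As for the loop over all pages mentioned earlier, here is one possible sketch rather than the definitive way to do it. It assumes a PyPDF2 release from the 2.x or 3.x series, which provide PdfReader, the pages property, and extract_text(). The function names extract_page_texts and read_pdf_text are mine. Pages without extractable text, such as the image-only first page of my sample ebook, simply contribute an empty string.

```python
def extract_page_texts(reader):
    """Return a list with the extracted text of every page in `reader`.

    `reader` is expected to behave like PyPDF2.PdfReader: it needs a
    `pages` sequence whose items have an `extract_text()` method.
    """
    texts = []
    for page in reader.pages:
        # Pages without text may give back an empty result; normalize it to "".
        texts.append(page.extract_text() or "")
    return texts


def read_pdf_text(path):
    """Open the PDF at `path` and return the text of all pages, joined by newlines."""
    # Imported here so extract_page_texts() can be tried with stand-in objects.
    import PyPDF2

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return "\n".join(extract_page_texts(reader))
```

With the sample file from this article, print(read_pdf_text('sample1.pdf')) would print the text of every page in order. Keeping the loop in its own function makes it easy to reuse with a reader object you have already created.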

Working with DOCX

Python also works really well with the Office Open XML format, more popularly known as DOCX. This format is heavily associated with Microsoft Word, but it’s actually an open format, as the name suggests. You can work with it using various editors: MS Word (of course), OpenOffice, LibreOffice, Apple’s Pages, and Google Docs, to name a few.

Reading textual information from a DOCX file is straightforward. Let’s read some data from a sample file. I have a sample DOCX file named “tomorrow3x.docx”, stored in the same folder as my Python script.

To read this file, we can use the following code.

import docx  # (1)
doc = docx.Document('tomorrow3x.docx')  # (2)
print(len(doc.paragraphs))  # (3)
print(doc.paragraphs[0].text)  # (4)

(1) This line imports the python-docx library. Notice that we don’t write import python-docx; we simply write import docx.

(2) This line opens our sample DOCX file and stores the resulting object in the doc variable. You can name the variable anything you like, of course, but I named it doc because it’s easy to type.

(3) The built-in len() function tells us how many paragraphs our document has.

(4) This line prints the content of paragraph zero (as with pages in the PDF sample, paragraph indexes are zero-based).

Running our program prints the number of paragraphs, followed by the text of the first paragraph.

A paragraph in python-docx ends wherever ENTER or RETURN was pressed in the document, so our sample code actually read only the very first line of the DOCX file.

Now that we can extract text from a single paragraph, surely you can find a way to loop through all the paragraphs and extract the text. Right? Okay, since I left you by yourself in the PDF example, just this time I’ll show you the code that gets the full text from a DOCX file. Here it is.

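#!/usr/bin/env python3

import docx

def readText(filename):
    # Open the document and collect the text of every paragraph.
    doc = docx.Document(filename)
    fullText = []
    for par in doc.paragraphs:
        fullText.append(par.text)
    # Join the paragraphs with newlines to rebuild the full text.
    return '\n'.join(fullText)

print(readText('tomorrow3x.docx'))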

Running our code prints the full text of the document, one paragraph per line.
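
Because every press of ENTER starts a new paragraph, blank lines in a document show up as paragraphs with empty text. As an optional variation on readText(), the sketch below skips those. The function names are mine, and the only python-docx features it assumes are docx.Document(), the paragraphs list, and each paragraph’s text attribute, all of which the code above already uses.

```python
def non_empty_paragraphs(doc):
    """Return the text of each paragraph in `doc` that is not blank."""
    # `doc` only needs a `paragraphs` list whose items have a `text` attribute,
    # which is what docx.Document() provides.
    return [par.text for par in doc.paragraphs if par.text.strip()]


def read_non_empty_text(filename):
    """Read a DOCX file and return its non-blank paragraphs joined by newlines."""
    # Imported here so non_empty_paragraphs() can be tried with stand-in objects.
    import docx

    return "\n".join(non_empty_paragraphs(docx.Document(filename)))
```

Calling print(read_non_empty_text('tomorrow3x.docx')) would then print the same text as before, minus the empty lines.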
