Opening and preprocessing text (300 PDFs) in Python

Issue

I need to preprocess about 300 PDFs in a folder: remove punctuation, convert everything to lower case, remove stopwords, and attach extra data from a separate CSV file as metadata. But I cannot even open the files. Searching has not helped, because I do not understand the error message, and the examples I found from other people involved different data types.

This is my code so far:

import PyPDF2
import re

for k in range(1,312):
    # open the pdf file
    object = PyPDF2.PdfFileReader("/Users/n_n/Desktop/Digitalization/reserve" % (k))
    

And this is what happens:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [37], in <cell line: 4>()
      2 import re
      4 for k in range(1,312):
      5     # open the pdf file
----> 6     object = PyPDF2.PdfFileReader("/Users/n_n/Desktop/Digitalization/reserve" % (k))

TypeError: not all arguments converted during string formatting

Solution

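The format string contains no placeholder, so the % operator has nowhere to put k and raises the TypeError. Add a %s placeholder where the file number belongs:

object = PyPDF2.PdfFileReader("/Users/n_n/Desktop/Digitalization/reserve%s" % str(k))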
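
For illustration, here is a minimal sketch of the same idea using only the standard library, so it runs without PyPDF2 or the actual files. It reproduces the error message, builds the numbered paths with an f-string (equivalent to the `%s` version), and outlines the lowercase, punctuation and stopword steps described in the question. The names `BASE`, `pdf_path`, `reproduce_error`, `preprocess` and the tiny `STOPWORDS` set are hypothetical and not part of the original answer. The path prefix is copied from the question as-is, and extracting text from the PDFs is left out.

```python
import string

# Path prefix copied from the question as-is.
BASE = "/Users/n_n/Desktop/Digitalization/reserve"


def pdf_path(k):
    """Return the path for file number k (same result as "...reserve%s" % str(k))."""
    return f"{BASE}{k}"


def reproduce_error(k):
    """Return the message raised when % is applied to a string with no placeholder."""
    try:
        BASE % (k)  # (k) is just k; BASE has no %s to receive it
    except TypeError as exc:
        return str(exc)
    return None


# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def preprocess(text, stopwords=STOPWORDS):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in stopwords]


# Same range as the loop in the question: file numbers 1 through 311.
paths = [pdf_path(k) for k in range(1, 312)]
```

Each entry of `paths` can then be passed to the PDF reader in place of the hand-formatted string, and `preprocess` can be applied to whatever text is extracted from each file. A real project would likely swap in a fuller stopword list.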

Answered By – Jeanne

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
