
I am trying to parse the text of a PDF file using PDFMiner, but the columns in the extracted text get merged. I am using the PDF file from the following link. Any type of output (file or string) is fine with me.

Here is the code that returns the extracted text as a string; for some reason, the columns are merged:

    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFResourceManager, process_pdf
    device = TextConverter(rsrcmgr, retstr, codec=codec)

I have also tried PyPDF2, but faced the same issue. Here is the sample code for PyPDF2:

    from PyPDF2.pdf import PdfFileReader
    pdf = PdfFileReader(open(filename, "rb"))
    extractedText = pdf.getPage(i).extractText()
    content = " ".join(content.replace("\xa0", " ").strip().split())

I have also tried pdf2txt.py, but was unable to get formatted output.

I recently struggled with a similar problem, although my PDF had a slightly simpler structure. PDFMiner uses classes called "devices" to parse the pages in a PDF file. The basic device class is PDFPageAggregator, which simply parses the text boxes in the file. TextConverter, XMLConverter, and HTMLConverter also write the result to a file (or to a string stream, as in your example) and do some more elaborate parsing of the contents. The problem with TextConverter (and PDFPageAggregator) is that they don't recurse deeply enough into the structure of the document to properly extract the different columns. The other two converters (XMLConverter and HTMLConverter) need some information about the structure of the document for display purposes, so they gather more detailed data.
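One possible way to work around the merged columns discussed in this thread is to take the text boxes together with their coordinates and group them by horizontal position yourself. The sketch below is only an illustration and not the method used by any of the converters: `group_into_columns`, `boxes_per_page`, and the `x_tolerance` value are hypothetical names and settings, and `boxes_per_page` assumes the maintained pdfminer.six fork (its `extract_pages` helper and `LTTextContainer` layout class) rather than the older `process_pdf` API used in the question.

```python
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1, text); PDF coordinates start at the bottom-left of the page.
Box = Tuple[float, float, float, float, str]


def group_into_columns(boxes: Iterable[Box], x_tolerance: float = 20.0) -> List[List[str]]:
    """Group text boxes into columns by their left edge (x0).

    Boxes whose x0 lies within x_tolerance of a column's first (leftmost)
    box are treated as belonging to that column. The tolerance is an
    assumed value and will need tuning for a real document.
    """
    columns: List[Tuple[float, List[Box]]] = []
    for box in sorted(boxes, key=lambda b: b[0]):
        if columns and abs(box[0] - columns[-1][0]) <= x_tolerance:
            columns[-1][1].append(box)
        else:
            columns.append((box[0], [box]))
    # Inside each column, read from the top of the page down (larger y1 first).
    return [
        [b[4].strip() for b in sorted(col, key=lambda b: -b[3])]
        for _, col in columns
    ]


def boxes_per_page(path: str) -> List[List[Box]]:
    """Collect the text boxes of each page, assuming pdfminer.six is installed."""
    # Imported lazily so the grouping logic above works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages: List[List[Box]] = []
    for page_layout in extract_pages(path):
        page_boxes: List[Box] = []
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox
                page_boxes.append((x0, y0, x1, y1, element.get_text()))
        pages.append(page_boxes)
    return pages
```

Calling `group_into_columns` on each page returned by `boxes_per_page(filename)` gives one list of strings per column, ordered left to right. Grouping on the left edge alone is a deliberately simple heuristic; indented paragraphs or headings that span several columns would need extra handling.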
