Splitting a PDF document based on font size using Python

I have some PDFs that each contain many articles. Every article starts with a heading set in a larger font size. What I need to do is identify the text between one larger-font heading and the next, and save each such part to a separate txt file.

I'm having trouble identifying the font size, though. Any ideas?

1 answer

  • answered 2019-10-08 03:29 cerofrais

    I'm assuming you have scanned PDFs rather than PDFs generated from Word. If they were generated from Word, convert the PDF back to Word with some converter; I think that will give you the font sizes as well.

    For scanned PDFs, I suggest first converting each page to an image using a Python PDF-to-image module such as pdf2image, then going through the images and drawing bounding boxes/contours around the text. From each box's height and width, divided by the number of characters in the box, you can estimate the font size.
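
    Once you have a size estimate per line, the splitting itself could look roughly like the sketch below. It assumes you already have the lines in reading order as (text, height) pairs, however you got the text and box heights; that input format, the function names and the 1.3 ratio are all made up for illustration. It also uses box height only as the size estimate, which is a simplification of the height/width per character idea above.

```python
from pathlib import Path
from statistics import median


def split_articles(lines, ratio=1.3):
    """Group lines into articles, starting a new one at each heading.

    lines: list of (text, height) tuples in reading order, where height is
           the estimated size of the line's bounding box (hypothetical input).
    ratio: a line counts as a heading if its height is at least `ratio`
           times the median height (arbitrary default, tune per document).
    """
    if not lines:
        return []

    # The median height is taken as the body text size.
    threshold = median(height for _, height in lines) * ratio

    articles = []
    prev_was_heading = False
    for text, height in lines:
        is_heading = height >= threshold
        # Consecutive heading lines are treated as one multi-line heading.
        if (is_heading and not prev_was_heading) or not articles:
            articles.append([])
        articles[-1].append(text)
        prev_was_heading = is_heading

    return ["\n".join(article) for article in articles]


def write_articles(articles, out_dir):
    """Write each article to its own numbered txt file in out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for index, article in enumerate(articles, start=1):
        (out_path / f"article_{index}.txt").write_text(article, encoding="utf-8")
```

    You would call split_articles on the lines of one PDF and pass the result to write_articles. Any text that appears before the first heading ends up in the first file.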

    Hope this helps.