I have multiple PDF documents in a folder that have a certain structure:
Now I want to parse the information out of each PDF. Please note that the paragraphs vary in length.
Obviously I am not asking you to solve the problem for me, but I do need some pointers as to how this can be achieved.
I have used Nokogiri before, and essentially I need something like that, but for PDFs.
So the pseudo result for my example would look like this:
- ItemA
  - Title: ItemA
  - File: 123456789.pdf
  - Image: ImageA.png (the image was stored on disk)
  - Subtitle1: Content for subtitle 1
  - Subtitle2: Content for subtitle 2
  - Subtitle3: Content for subtitle 3
- TitleB
  - [...]
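
To make the target shape concrete, here is a rough Ruby sketch of the grouping step only. It assumes the text has already been pulled out of the PDF as an array of lines (that extraction is the part I am asking about), that the first non-empty line is the item title, and that the subtitle headings are known in advance. The helper name `build_item` and its parameters are made up for illustration, and image extraction is not covered at all.

```ruby
# Hypothetical helper: groups already-extracted text lines under known headings.
# Assumptions (not guaranteed by the PDFs): the first non-empty line is the
# title, and every subtitle heading appears on a line of its own.
def build_item(lines, file:, headings:)
  lines = lines.map(&:strip).reject(&:empty?)
  item = { "Title" => lines.first, "File" => file }
  current = nil

  lines.drop(1).each do |line|
    if headings.include?(line)
      # A known heading starts a new section.
      current = line
      item[current] = +""
    elsif current
      # Any other line is appended to the current section, so paragraphs
      # of any length end up as a single string.
      item[current] << " " unless item[current].empty?
      item[current] << line
    end
  end

  item
end

# Example input standing in for text extracted from one PDF.
lines = [
  "ItemA",
  "Subtitle1",
  "Content for",
  "subtitle 1",
  "Subtitle2",
  "Content for subtitle 2",
]

item = build_item(lines, file: "123456789.pdf",
                  headings: %w[Subtitle1 Subtitle2 Subtitle3])
```

Lines that appear before the first heading (other than the title) are ignored in this sketch, which may or may not be what the real documents need.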