I am currently building a .NET application, and one of the requirements is that it must convert a PDF file to an XML file. Has anyone had success doing this? If so, what did you use?
C# – pdf to xml conversion using .NET
.net c++ pdf xml
Best Solution
I have done this kind of project many times before.
Things you need to do:
1.) Check out the project "Extract Text from PDF in C#". It uses iTextSharp; see its PDFParser class.
2.) Parse the extracted text and create an XML file.
One concern I have run into is PDFs that contain broken links or URLs inside the pages. If this is also a concern for you, regular expressions can handle it, but I suggest you deal with it later.
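What counts as a "broken" link depends on your PDFs, so treat the following only as a starting point. It is a minimal sketch that locates URL-like strings in already extracted text so they can be checked or repaired later. The class name `UrlFinder`, the method `FindUrls`, and the pattern itself are assumptions for illustration, not part of the linked project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlFinder
{
    // Deliberately loose pattern (an assumption): "http://" or "https://"
    // followed by characters that are not whitespace, quotes, or angle brackets.
    private static readonly Regex UrlPattern =
        new Regex(@"https?://[^\s<>""]+", RegexOptions.IgnoreCase);

    // Returns every URL-like substring found in the extracted text.
    public static List<string> FindUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match match in UrlPattern.Matches(text))
        {
            // Drop trailing sentence punctuation that the loose pattern picks up.
            urls.Add(match.Value.TrimEnd('.', ',', ';', ')'));
        }
        return urls;
    }
}
```

Calling `UrlFinder.FindUrls(pageText)` on each page gives you a list to inspect. A URL that was split by a line break in the PDF will show up truncated here, which is usually the case you need to handle by hand.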
Here is some sample code showing how to create an XML file. Understand how it works so that you can adapt it to your own code later.
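A minimal sketch of step 2, assuming the text has already been extracted into one string per page. It uses `System.Xml.Linq` from the .NET base class library. The class name `PdfTextToXml` and the element names `document`, `page`, and `line` are illustrative assumptions; use whatever schema your application needs.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class PdfTextToXml
{
    // Builds <document><page number="1"><line>...</line></page>...</document>
    // from one extracted text string per PDF page.
    public static XDocument Build(IList<string> pageTexts)
    {
        var root = new XElement("document");
        for (int i = 0; i < pageTexts.Count; i++)
        {
            // Page numbers are 1-based to match how PDF pages are usually counted.
            var page = new XElement("page", new XAttribute("number", i + 1));

            // Split the page into non-empty lines; each becomes a <line> element.
            string[] lines = pageTexts[i].Split(
                new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string line in lines)
            {
                // XElement escapes characters such as < and & when serializing.
                page.Add(new XElement("line", line.Trim()));
            }
            root.Add(page);
        }
        return new XDocument(new XDeclaration("1.0", "utf-8", null), root);
    }
}
```

To write the result to disk, call `PdfTextToXml.Build(pages).Save("output.xml")`. Extracted PDF text sometimes contains control characters that are not valid in XML 1.0, so you may need to strip those before saving.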
Hope it helps :)