
Extract the first page's content from a docx file by XML parsing

I need to extract the first page's content from a docx file and save it as a separate document. Everything on the first page (images, tables, text) should be saved as-is in the new docx file.

What I tried: I looked into the XML of the unzipped docx file. Since a Word document is reflowable, I couldn't find a page break where each page ends, so I couldn't locate the end of each page via document.xml.

Is there any way to get the XML content of only the first page of the document using a Java XML DOM parser?


Answer

Do not write a new parser; there are plenty of existing tools for this (e.g., what if your input changes from XML to binary Word files?).

Use Apache POI, for example, as @JFB suggested.
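
Whichever tool reads the file, the end of page one is not stored as such. The markup only carries hints: an explicit break is written as <w:br w:type="page"/>, and Word usually leaves a cached <w:lastRenderedPageBreak/> where a page started the last time it laid the document out. That cached marker can be missing or stale if another program saved the file. The sketch below is only an illustration using the JDK's DOM parser, not POI; the class and method names (FirstPageBreakFinder, firstBreakBlock) are made up for this example. It takes the text of word/document.xml and reports which top-level block of the body contains the first such marker.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Apache POI or any library.
final class FirstPageBreakFinder {
    // Main WordprocessingML namespace used by word/document.xml.
    static final String W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /**
     * Returns the 0-based index of the first child element of w:body
     * (paragraph, table, ...) that contains a page break marker,
     * or -1 if no marker is found.
     */
    static int firstBreakBlock(String documentXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            doc = factory.newDocumentBuilder()
                         .parse(new InputSource(new StringReader(documentXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not parseable as XML", e);
        }
        NodeList bodies = doc.getElementsByTagNameNS(W, "body");
        if (bodies.getLength() == 0) {
            return -1;
        }
        int index = 0;
        for (Node n = bodies.item(0).getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes between blocks
            }
            if (hasPageBreak((Element) n)) {
                return index;
            }
            index++;
        }
        return -1;
    }

    /** True if the block holds an explicit page break or Word's cached layout hint. */
    static boolean hasPageBreak(Element block) {
        if (block.getElementsByTagNameNS(W, "lastRenderedPageBreak").getLength() > 0) {
            return true;
        }
        NodeList breaks = block.getElementsByTagNameNS(W, "br");
        for (int i = 0; i < breaks.getLength(); i++) {
            Element br = (Element) breaks.item(i);
            // A w:br without w:type="page" is an ordinary line break.
            if ("page".equals(br.getAttributeNS(W, "type"))) {
                return true;
            }
        }
        return false;
    }
}
```

Blocks before the returned index lie on the first page. The block at that index may straddle the boundary, because the marker can sit in the middle of a paragraph, so it would need to be split at the marker. This sketch ignores other ways a page can end, such as w:pageBreakBefore on a paragraph or a section break, and copying the blocks into a new docx also means carrying over the images, styles and relationships they reference.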

User contributions licensed under: CC BY-SA