I have a word document which was generated with docx4j, when i unzip the docx file, the contents of folder is
the contents of ./word/document.xml is as below
the relationship xml has below relationship
<Relationship Target="../chunk.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId3"/>
when we unzip chunk.docx it has below file contents
and the ./word/document.xml has below contents
relationship document xml has below contents
<?xml version="1.0" encoding="UTF-8" standalone="yes"?> <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"> <Relationship Target="styles.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Id="rId1"/> <Relationship Target="settings.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Id="rId2"/> <Relationship Target="../chunk.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId3"/> <Relationship Target="../chunk2.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId4"/> <Relationship Target="../chunk3.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId5"/> <Relationship Target="../chunk4.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId6"/> <Relationship Target="../chunk5.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId7"/>
similarly when i unzip the chunk.docx it has below file contents
and ./word/document.xml has below contents
How to read the content of word document via java code
i have tried below way
File docxFile = new File(filePath); WordprocessingMLPackage wordprocessingMLPackage = WordprocessingMLPackage.load(docxFile); MainDocumentPart mainDocumentPart = wordprocessingMLPackage.getMainDocumentPart(); List<Object> textNodes = mainDocumentPart.getJAXBNodesViaXPath(TEXT_NODEX_XPATH, true);
But it is giving 0 textNodes, Can anyone help how can i read this type of word docx using java
Advertisement
Answer
Your docx contains altChunks of type docx.
It contains those because that would’ve been done explicitly when whoever created it did so using docx4j, using code such as https://github.com/plutext/docx4j/blob/VERSION_11_4_7/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/AltChunkAddOfTypeDocx.java
Ordinarily you wouldn’t do that.
Generally, if you want to handle such a docx using approaches like XPath, you’d first convert those altChunks into normal content. Word can do this, as can Docx4j Enterprise.
But if you control the generating application, the best approach would be to revisit it, changing it so it doesn’t create altChunks. At least understand why they wrote it that way.