Read a Word (.docx) file in Java

I have a Word document that was generated with docx4j. When I unzip the .docx file, the contents of the folder are as follows:

The contents of ./word/document.xml are as follows:

The relationships XML contains the following relationship:

<Relationship Target="../chunk.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId3"/>

When I unzip chunk.docx, it contains the following files:

and its ./word/document.xml has the following contents:

Its relationships XML has the following contents:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Target="styles.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Id="rId1"/>
<Relationship Target="settings.xml" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Id="rId2"/>
<Relationship Target="../chunk.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId3"/>
<Relationship Target="../chunk2.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId4"/>
<Relationship Target="../chunk3.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId5"/>
<Relationship Target="../chunk4.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId6"/>
<Relationship Target="../chunk5.docx" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk" Id="rId7"/>

Similarly, when I unzip chunk.docx, it contains the following files:

and its ./word/document.xml has the following contents:

How can I read the content of this Word document from Java code?

I have tried the following:

File docxFile = new File(filePath);
WordprocessingMLPackage wordprocessingMLPackage = WordprocessingMLPackage.load(docxFile);
MainDocumentPart mainDocumentPart = wordprocessingMLPackage.getMainDocumentPart();
List<Object> textNodes = mainDocumentPart.getJAXBNodesViaXPath(TEXT_NODEX_XPATH, true);

But it returns 0 text nodes. Can anyone help me read this type of .docx file using Java?

Answer

Your docx contains altChunks of type docx.

They are there because whoever created the document added them explicitly with docx4j, using code such as https://github.com/plutext/docx4j/blob/VERSION_11_4_7/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/AltChunkAddOfTypeDocx.java

Ordinarily you wouldn’t do that.

Generally, if you want to handle such a docx using approaches like XPath, you’d first convert those altChunks into normal content. Word can do this, as can Docx4j Enterprise.
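
If you only need the raw text and cannot convert the altChunks first, one possible workaround is to treat each embedded chunk as a docx package of its own and read its word/document.xml too. The following is a minimal sketch of that idea using only JDK classes (java.util.zip and the DOM parser), not docx4j. The class name AltChunkTextReader and the method readText are hypothetical, and the sketch assumes that every zip entry ending in .docx is an altChunk, rather than resolving the aFChunk relationships properly.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of docx4j.
final class AltChunkTextReader {

    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the content of every w:t element, descending into embedded .docx entries. */
    static List<String> readText(InputStream docx) throws Exception {
        // Load all zip entries into memory so parts can be looked up by name.
        Map<String, byte[]> parts = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                parts.put(entry.getName(), zip.readAllBytes()); // reads the current entry only
            }
        }

        List<String> text = new ArrayList<>();

        // Text that lives directly in this package's main document part.
        byte[] main = parts.get("word/document.xml");
        if (main != null) {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(main));
            NodeList nodes = doc.getElementsByTagNameNS(W_NS, "t");
            for (int i = 0; i < nodes.getLength(); i++) {
                text.add(nodes.item(i).getTextContent());
            }
        }

        // Assumption: any entry ending in .docx is an embedded altChunk package.
        for (Map.Entry<String, byte[]> part : parts.entrySet()) {
            if (part.getKey().endsWith(".docx")) {
                text.addAll(readText(new ByteArrayInputStream(part.getValue())));
            }
        }
        return text;
    }
}
```

You would call it with something like AltChunkTextReader.readText(new FileInputStream(filePath)). Note that the chunk text is appended after the host document's own text, in zip entry order, so the result does not necessarily match the order in which the altChunk elements appear in the document. It also ignores styles, tables and everything else that is not a w:t element.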

But if you control the generating application, the best approach would be to revisit it and change it so that it doesn’t create altChunks. At the very least, find out why it was written that way.


Answer