
How can I read html tags from within an xml file?

I have an XML file that I am reading with Java code. A fragment of the file and the code that reads it are below:

 <?xml version="1.0" encoding="UTF-8"?>
 <caml:MeasureDoc version="1.0" xsi:schemaLocation="http://lc.ca.gov/legalservices/schemas/caml.1# xca.1.xsd"
     xmlns:caml="http://lc.ca.gov/legalservices/schemas/caml.1#"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     xmlns:xhtml="http://www.w3.org/1999/xhtml"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
     <caml:BudgetItem id="id_6D1BA0B6-8097-43E3-8A48-13249E6CAD6B" num="2240-002-0890">
         <caml:Content>
             <table cellspacing="0" class="Abutted" id="id_8C3F2551-7554-4A16-9256-0B408C6CD7BB" width="416">
                 <tbody>
                     <tr style="keep-together.within-page:always;">
                         <td colspan="7" valign="top" width="336">
                             <p class="Stub">
                                 <caml:NumSpan>2240-002-0890</caml:NumSpan>—For state operations, Department of Housing and Community Development, payable from the Federal Trust Fund.
                                 <span class="DottedLeaders"/>
                             </p>
                          </td>
                          <td align="right" valign="bottom" width="80">0</td>
                      </tr>
                      <tr style="keep-with-next.within-page:always;">
                          <td valign="top" width="24"/>
                          <td colspan="7" valign="top" width="392">Schedule:</td>
                      </tr>
                  </tbody>
             </table>
         </caml:Content>
     </caml:BudgetItem>
 </caml:MeasureDoc>

Java code:

 import javax.xml.parsers.DocumentBuilderFactory; // etc, etc.
 ...
 DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
 DocumentBuilder builder = factory.newDocumentBuilder();
 ByteArrayInputStream input = new ByteArrayInputStream(billXml.getBytes("UTF-8"));
 Document doc = builder.parse(input);
 Element root = doc.getDocumentElement();
 Node bill = LU.subNodeWithName(root, "caml:Bill");
 Node budgetInfoNode = LU.findBudgetInfoNode(bill); // (my helper method)
 Node contentNode = budgetInfoNode.getChildNodes().item(0);
 Node tableNode = contentNode.getChildNodes().item(0);
 System.out.println(tableNode.toString());

Output:

 [table: null]

If I call getTextContent() on the table node, I get:

 2240-002-0890?For state operations, Department of Housing and Community Development, payable from the
 Federal Trust Fund.0Schedule:(1)1665-Financial Assistance Program0Provisions:1.The funds appropriated
 in this item shall be made available to administer the State Rental Assistance Program.2.Upon order of the
 Department of Finance, amounts transferred to this item may be transferred to Schedule (1) of
 Item 2240-102-0890.3.Any amounts transferred to Schedule (1) of this item pursuant to Provision 2 of
 Item 2240-102-0890 shall be available for encumbrance and expenditure until June 30, 2022.

Neither of these is what I want. I want the html within the XML node.

There seems to be no “getRealContent” counterpart to “getTextContent” that keeps the tags. Apologies if I am missing something obvious.

How can I read the table tag and the tags within it?

Bonus if anyone knows the property to set to get this to stop. I am seeing this over and over and over and over again:

JAXP: find factoryId =javax.xml.transform.TransformerFactory
JAXP: found system property, value=org.apache.xalan.processor.TransformerFactoryImpl
JAXP: created new instance of class org.apache.xalan.processor.TransformerFactoryImpl using ClassLoader: null

Unfortunately XMLProperties.ShutTheHeckUpAlready does not exist. More’s the pity.


Answer

This may not be a very intuitive solution, but if we copy the required node into a new document and run that document through a Transformer to serialize it to a string, we get the HTML with its tags intact.

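import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
...
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
ByteArrayInputStream input = new ByteArrayInputStream(billXml.getBytes("UTF-8"));
Document doc = builder.parse(input);
Element root = doc.getDocumentElement();
// The factory is not namespace-aware by default, so match on the prefixed name.
NodeList budgetItem = root.getElementsByTagName("caml:BudgetItem");
for (int temp = 0; temp < budgetItem.getLength(); temp++) {
    Node node = budgetItem.item(temp);
    if (node.getNodeType() == Node.ELEMENT_NODE) {
        Element eElement = (Element) node;
        // First <table> anywhere below this budget item.
        NodeList table = eElement.getElementsByTagName("table");
        Node item = table.item(0);
        if (item == null) {
            continue; // this budget item has no table
        }

        String content = getHTMLContent(factory, item);
        System.out.println(content);
    }
}

// Serializes the node and its whole subtree as HTML markup.
private static String getHTMLContent(DocumentBuilderFactory factory, Node item) throws ParserConfigurationException, TransformerException {
    // Deep-copy the node into a fresh document so it becomes the root element.
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document newDocument = builder.newDocument();
    Node importedNode = newDocument.importNode(item, true);
    newDocument.appendChild(importedNode);

    // A Transformer created without a stylesheet copies its input to the output unchanged.
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "html");

    StreamResult result = new StreamResult(new StringWriter());

    DOMSource source = new DOMSource(newDocument);
    transformer.transform(source, result);
    return result.getWriter().toString();
}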
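
If copying the node into a new document feels heavy, a DOMSource can also wrap the node directly, since its constructor accepts any Node. The following is only a minimal sketch under that assumption, using the JDK's default transformer. The class name NodeMarkupSketch, the method toMarkup, and the sample XML string are made up for illustration and are not part of the original question. It keeps the default XML output method and only suppresses the XML declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class NodeMarkupSketch {

    // Serializes the given node and its subtree, tags included.
    public static String toMarkup(Node node) throws Exception {
        // No stylesheet: the transformer copies the source to the result.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> declaration so only the fragment is returned.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input, much smaller than the file in the question.
        String xml = "<root><content><table width=\"416\">"
                + "<tr><td>Schedule:</td></tr></table></content></root>";

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        Node table = doc.getElementsByTagName("table").item(0);
        System.out.println(toMarkup(table));
    }
}
```

Setting OutputKeys.METHOD to "html", as in the answer above, changes how empty elements and some attributes are written. Which output method fits depends on whether the result is meant for a browser or for another XML parser.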