Filtering out formatting tags from JSoup selectors

JSoup here. I have the following HTML I’m trying to parse:

<html><head>
<title>My Soup Materials</title>
<!--mstheme--><link rel="stylesheet" type="text/css" href="../../_themes/ice/ice1011.css"><meta name="Microsoft Theme" content="ice 1011, default">
</head>
<body><center><table width="92%"><tbody>
<tr>
<td><h2>My Soup Materials</h2>

<table width="100%" cellspacing="0" cellpadding="0">

<tbody>
<tr>
<td align="left"><b>Origin:</b> Belgium</td>
<td align="left"><b>Count:</b> 2 foos</td>
</tr>

<tr>
<td align="left"><b>Supplier:</b> </td>
<td align="left"><b>Must Burninate:</b> Yes</td>
</tr>

<tr>
<td align="left"><b>Type:</b> Fizzbuzz</td>
<td align="left"><b>Add Afterwards:</b> No</td>
</tr>

</tbody>
</table>
<br>
<b><u>Notes</b></u><br>Drink more ovaltine</td>
</tr>

</tbody>
</table>
</center></body>
</html>

Unfortunately, it’s slightly malformed (some closing tags are missing, the closing tags on <b> and <u> are in the wrong order, etc.), but I’m hoping JSoup can handle that. I don’t have control over the HTML.

I have the following Java model/POJO:

@Data // lombok; adds ctors, getters, setters, etc.
public class Material {
  private String name;
  private String origin;
  private String count;
  private String supplier;
  private Boolean burninate;
  private String type;
  private Boolean addAfterwards;
}

I am trying to get JSoup to parse this HTML and provide a Material instance from that parsing.

To grab the data inside the <table> I’m pretty close:

Material material = new Material();

Elements rows = document.select("table").select("tr");
for (Element row : rows) {

    // row 1: origin & count
    Elements cols = row.select("td");
    for (Element col : cols) {
        material.setOrigin(???);
        material.setCount(???);
    }

}

So I’m able to get each <tr>, and for each <tr> get all of its <td> cols. But where I’m hung up is:

<td align="left"><b>Origin:</b> Belgium</td>

So the col.text() for the first <td> would be <b>Origin:</b> Belgium. How do I tell JSoup that I only want the “Belgium”?

Answer

I think you’re looking for tdNode.ownText(). There’s also plain text(), but as the docs state, it combines the text of the node and all of its descendants and normalizes the whitespace. In other words, for the first cell tdNode.text() gives you the string "Origin: Belgium" (the <b> tag itself is never part of the result), tdNode.ownText() gives you just "Belgium", and tdNode.child(0).ownText() gives you just "Origin:".
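To make that concrete, here is a minimal sketch that walks the cells of a trimmed copy of the table from the question and collects each label/value pair. The class name OwnTextDemo, the helper labelledValues, and the idea of keying a map by the label text are illustrative assumptions, not something the question prescribes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; requires the jsoup library on the classpath.
final class OwnTextDemo {

    // Trimmed copy of the inner table from the question.
    static final String HTML =
        "<table><tbody>"
        + "<tr><td align=\"left\"><b>Origin:</b> Belgium</td>"
        + "<td align=\"left\"><b>Count:</b> 2 foos</td></tr>"
        + "<tr><td align=\"left\"><b>Supplier:</b> </td>"
        + "<td align=\"left\"><b>Must Burninate:</b> Yes</td></tr>"
        + "</tbody></table>";

    // Maps each label ("Origin", "Count", ...) to the cell's own text.
    static Map<String, String> labelledValues(String html) {
        Document document = Jsoup.parse(html);
        Map<String, String> values = new LinkedHashMap<>();
        for (Element cell : document.select("td")) {
            // Only handle cells whose first child element is the <b> label.
            if (cell.children().isEmpty()
                    || !cell.child(0).tagName().equals("b")) {
                continue;
            }
            // Label text lives in the <b> child; drop the trailing colon.
            String label = cell.child(0).ownText().replaceAll(":$", "");
            // ownText() ignores the <b> child and returns only the cell's
            // direct text, whitespace-normalized and trimmed.
            values.put(label, cell.ownText());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(labelledValues(HTML));
    }
}
```

From there, filling the POJO is a matter of calls such as material.setOrigin(values.get("Origin")). For the Boolean fields, something like "Yes".equalsIgnoreCase(values.get("Must Burninate")) would work, assuming the page only ever uses Yes and No.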

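You can also use wholeText(), which is non-normalized: it keeps the original spaces and newlines. Like text(), it includes the text of child elements, so it does not strip the label either. I think you want the normalization here, since it mostly amounts to collapsing and trimming whitespace.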
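A small sketch of the difference, using a cell with deliberately messy whitespace. The class name WholeTextDemo and the sample markup are made up for illustration, and Element.wholeText() is assumed to be available, which it is in reasonably recent jsoup releases.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical demo class; requires the jsoup library on the classpath.
final class WholeTextDemo {

    // A cell with extra spaces and a trailing newline in its text.
    static final String HTML =
        "<table><tr><td><b>Origin:</b>   Belgium\n</td></tr></table>";

    // Parses the sample markup and returns its single <td>.
    static Element cell() {
        return Jsoup.parse(HTML).selectFirst("td");
    }

    public static void main(String[] args) {
        Element cell = cell();
        // Normalized, includes the <b> label text.
        System.out.println("[" + cell.text() + "]");
        // Normalized, only the cell's direct text.
        System.out.println("[" + cell.ownText() + "]");
        // Not normalized, includes the <b> label text.
        System.out.println("[" + cell.wholeText() + "]");
    }
}
```

The brackets in the output make it easy to see which calls keep the run of spaces and the trailing newline.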

User contributions licensed under: CC BY-SA