Regex for tree structures?



Are there regular expression equivalents for searching and modifying tree structures? Concise mini-languages (like perl regex) are what I am looking for.

Here is an example that might clarify what I am looking for.

<root>
  <node name="1">
    subtrees ....
  </node>
  <node name="2">
    <node name="2.1">
     data
    </node>
    other subtrees...
  </node>
</root>

An operation that would be possible on the above tree is “move subtree at node 2.1 into the subtree at node 1.” The result of the operation might look something like..

<root>
  <node name="1">
    subtrees ....
    <node name="2.1">
     data
    </node>
  </node>
  <node name="2">
    other subtrees...
  </node>
</root>

Search and replace operations like find all nodes with atleast 2 children, find all nodes whose data starts with “a” and replace it with “b” if the subtrees have atleast 2 other siblings, etc. should be supported.

For strings, where the only dimension is across the length of the string, we can do many of above operations (or their 1D equivalents) using regular expressions. I wonder if there are equivalents for trees. (instead of a single regex, you might need to write a set of transformation rules, but that is ok).

I would like to know if there is some simple mini language (not regex per.se, but something that is as accessible as regex via libraries, etc..). to perform these operations? Preferably, as a python library.

Answer

TSurgeon and Tregex from Stanford is capable of doing that. You can download the library from http://nlp.stanford.edu/software/tregex.shtml



Source: stackoverflow