I am trying to read an XML file into an Apache Beam pipeline. Some elements use a namespace prefix, and the namespace is declared on the root node. I can parse the XML outside of Apache Beam with the standard JAXB parser. However, when I use XmlIO.read() in Beam, I get the following exception:
com.ctc.wstx.exc.WstxParsingException: Undeclared namespace prefix "g".
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<item>
<!-- Basic Product Information -->
<g:id><![CDATA[SAMI9000NAVKIT]]></g:id>
<title><![CDATA[Original Samsung Galaxy S i9000 Navigation Kit]]></title>
<link><![CDATA[https://www.mobileciti.com.au/original-samsung-galaxy-s-i9000-navigation-kit]]></link>
<description><![CDATA[<p>SAMSUNG Galaxy S (i9000) Navigation Kit - Consists of handset cradle, window shield mount and car charger.</p>]]></description>
<g:product_category><![CDATA[Electronics > Communications > Telephony > Mobile Phone Accessories]]></g:product_category>
<g:product_type><![CDATA[Accessories > Car Kits]]></g:product_type>
.
</item>
</channel>
</rss>
Beam code:
.from(<Full file path>)
.withRootElement("rss")
.withRecordElement("item").withRecordClass(Item.class));
XML without namespaces works fine. Any pointers are much appreciated. Thanks.
Answer
Looking at XmlSource code, unfortunately, I don’t think it supports XML namespaces by default if you only specify a root element.
As a workaround, though, you can try something like this:
.withRootElement("rss version=\"2.0\" xmlns:g=\"http://base.google.com/ns/1.0\"")
and it will probably work.
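To illustrate why the declaration has to travel with the root element, here is a minimal sketch that uses the JDK's built-in StAX parser instead of Beam or Woodstox. It assumes that the reader wraps each record in an opening tag built from the root element string, which is my reading of XmlSource and not something this sketch verifies. The class name `NamespacePrefixDemo`, the helper `parseId`, and the shortened record are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class NamespacePrefixDemo {
    // Hypothetical record, shortened from the feed in the question.
    static final String RECORD = "<item><g:id>SAMI9000NAVKIT</g:id></item>";

    // Wraps RECORD in an opening tag built verbatim from rootElement
    // (assumed to mimic what the source does) and returns the text of g:id.
    static String parseId(String rootElement) throws XMLStreamException {
        String doc = "<" + rootElement + ">" + RECORD + "</rss>";
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(doc));
        try {
            while (reader.hasNext()) {
                // The parser is namespace-aware by default, so it rejects
                // <g:id> unless the prefix "g" was declared on an ancestor.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "id".equals(reader.getLocalName())) {
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) {
        String[] roots = {
            "rss",
            "rss version=\"2.0\" xmlns:g=\"http://base.google.com/ns/1.0\""
        };
        for (String root : roots) {
            try {
                System.out.println(root + " -> parsed id: " + parseId(root));
            } catch (XMLStreamException e) {
                System.out.println(root + " -> failed: " + e.getMessage());
            }
        }
    }
}
```

With the bare root name the parser never sees a declaration for the `g` prefix and fails on `<g:id>`. Passing the attributes along inside the root element string puts the declaration back in scope, which is all the workaround relies on.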