
Tag: apache-beam

Reading XML with namespace using Apache Beam XmlIO

I am trying to read an XML file into an Apache Beam pipeline. Some elements have namespaces, and the namespaces are declared on the root node. I am able to parse the XML outside of Apache Beam using the standard JAXB parser. However, when I use XmlIO.read() with Beam, I get the following exception: com.ctc.wstx.exc.WstxParsingException: Undeclared namespace prefix

How to avoid the warning message when reading BigQuery data into a custom data type: Can’t verify serialized elements of type BoundedSource

I defined a custom data type, following the example here: https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java#L127 I read data from BigQuery using the code here: https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java#L375 Warning message: Can’t verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner. This message occurs at step TriggerIdCreation/Read(CreateSource)/Read(CreateSource)/Read(BoundedToUnboundedSourceAdapter)/StripIds.out0 I tried to add an equals() method to the custom data type

Dataflow: string to pubsub message

I’m trying to write unit tests for Dataflow. For the test, I will start with a simple hardcoded string. The problem is that I need to transform that string into a Pub/Sub message. I have the following code to do that: But I get the following error: How should I create the Pub/Sub message from the

Beam – Error while branching PCollections

I have a pipeline that reads data from Kafka. It splits the incoming data into processing and rejected outputs. Data from Kafka is read into the custom class MyData, and the output is produced as KV<byte[], byte[]>. I define two TupleTags of MyData. InvalidDataDoFn has the application logic that splits MyData into processing and rejected outputs. OutputDoFn converts MyData into KV<byte[], byte[]>. While running

How do I use MapElements and KV together in Apache Beam?

I wanted to do something like: Where User is a custom data type with an Avro coder and a constructor that takes a string. However, I get the following error: Cannot select from parameterized type I tried to change it to TypeDescriptor.of(KV.class) instead, but then I get: Incompatible types; Required PCollection> but ‘apply’ was inferred to OutputT: no instance(s) of

Sending PubSub message manually in Dataflow

How can I send a PubSub message manually (that is to say, without using PubsubIO) in Dataflow? Importing google-cloud-dataflow-java-sdk-all 2.5.0 via Maven already pulls in a version of com.google.pubsub.v1 in which I was unable to find an easy way to send messages to a Pubsub topic (this version does not, for instance, allow manipulating Publisher instances, which is the

Apache Beam framework – sort in descending order

How do you sort in descending order using the Apache Beam framework? I managed to create a word count pipeline that sorts the output alphabetically by word, but I could not figure out how to invert the sort order. Here is the code: This code produces a list of words sorted alphabetically with their respective counts: Edit 1: After some debugging
