I am trying to read an XML file into an Apache Beam pipeline. Some elements have namespaces, and the namespace declarations appear on the root node. I am able to parse the XML outside of Apache Beam using the standard JAXB parser. However, when I use the XmlIO.read() function with Beam I get the following exception: com.ctc.wstx.exc.WstxParsingException: Undeclared namespace prefix
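One common workaround is to skip XmlIO's record splitting, which parses each record element in isolation and therefore out of scope of the root's namespace declarations, and instead unmarshal whole documents with JAXB inside a DoFn. A minimal sketch, assuming hypothetical MyDocument/MyRecord JAXB classes, a `pipeline` variable, and a placeholder input path:

```java
import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

PCollection<MyRecord> records =
    pipeline
        .apply(FileIO.match().filepattern("gs://bucket/input/*.xml"))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, MyRecord>() {
          @ProcessElement
          public void process(@Element FileIO.ReadableFile file,
                              OutputReceiver<MyRecord> out) throws Exception {
            // Parse the whole document so root-level namespace
            // declarations stay in scope for every element.
            String xml = file.readFullyAsUTF8String();
            Unmarshaller u =
                JAXBContext.newInstance(MyDocument.class).createUnmarshaller();
            MyDocument doc = (MyDocument) u.unmarshal(new StringReader(xml));
            doc.getRecords().forEach(out::output); // getRecords() is assumed
          }
        }));
```

The trade-off is that each file is parsed as a unit, so you lose XmlIO's ability to split one large file across workers.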
How to avoid a warning message when reading BigQuery data into a custom data type: Can’t verify serialized elements of type BoundedSource
I defined a custom data type following the document here: https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java#L127 and read data from BigQuery using the code below: https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java#L375 Warning message: Can’t verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner. This message occurs at the step TriggerIdCreation/Read(CreateSource)/Read(CreateSource)/Read(BoundedToUnboundedSourceAdapter)/StripIds.out0. I tried to add an equals() method to the custom data type
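That warning is emitted when elements are encoded with SerializableCoder and their class does not override equals(); note that the step name here points at the wrapped BoundedSource objects rather than the user type, so the message may persist even after the change. A minimal sketch of a value-equal custom type, assuming a hypothetical MyData class, that also sidesteps SerializableCoder by registering AvroCoder:

```java
import java.util.Objects;
import org.apache.avro.reflect.Nullable;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;

// Hypothetical element type; AvroCoder avoids Java serialization,
// and equals()/hashCode() make element equality well defined.
@DefaultCoder(AvroCoder.class)
public class MyData {
  @Nullable String id;
  long count;

  public MyData() {} // no-arg constructor required by AvroCoder

  public MyData(String id, long count) {
    this.id = id;
    this.count = count;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof MyData)) {
      return false;
    }
    MyData other = (MyData) o;
    return count == other.count && Objects.equals(id, other.id);
  }

  @Override
  public int hashCode() {
    return Objects.hash(id, count);
  }
}
```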
Dataflow writing a PCollection of GenericRecords to Parquet files
In an Apache Beam step I have a PCollection of KV<String, Iterable<KV<Long, GenericRecord>>>. I want to write all the records in the Iterable to the same Parquet file, deriving the file name from the key of the KV. My code snippet is given below. Answer I found
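FileIO.writeDynamic() combined with ParquetIO.sink() can route each key's records to its own file. A minimal sketch, assuming the Avro `schema` is available, `grouped` is the PCollection from the question, and the output path is a placeholder:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

grouped
    // Flatten each group back into (key, record) pairs.
    .apply(ParDo.of(
        new DoFn<KV<String, Iterable<KV<Long, GenericRecord>>>,
                 KV<String, GenericRecord>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            String key = c.element().getKey();
            for (KV<Long, GenericRecord> kv : c.element().getValue()) {
              c.output(KV.of(key, kv.getValue()));
            }
          }
        }))
    .apply(FileIO.<String, KV<String, GenericRecord>>writeDynamic()
        .by(KV::getKey) // one destination per key
        .withDestinationCoder(StringUtf8Coder.of())
        .via(Contextful.fn(KV::getValue), ParquetIO.sink(schema))
        .to("gs://bucket/output/")
        .withNaming(key -> FileIO.Write.defaultNaming(key, ".parquet"))
        .withNumShards(1)); // a single file per destination
```

withNumShards(1) forces one file per key at the cost of write parallelism for that destination.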
Dataflow: string to pubsub message
I’m trying to do unit testing in Dataflow. For that test, at the beginning, I will start with a simple hardcoded string. The problem is that I need to transform that string into a Pub/Sub message. I got the following code to do that: But I get the following error: How should I create the Pub/Sub message from the
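A PubsubMessage can be built directly from a string's bytes. A minimal sketch for a test pipeline, with hardcoded inputs as placeholders (the explicit coder line is only needed where inference fails):

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;

PCollection<PubsubMessage> messages =
    pipeline
        .apply(Create.of("hello", "world"))
        .apply(MapElements
            .into(TypeDescriptor.of(PubsubMessage.class))
            // Wrap each string's UTF-8 bytes with empty attributes.
            .via(s -> new PubsubMessage(
                s.getBytes(StandardCharsets.UTF_8),
                Collections.emptyMap())));
// Coder inference can fail for PubsubMessage on some SDK versions;
// if so, set it explicitly:
messages.setCoder(PubsubMessageWithAttributesCoder.of());
```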
Beam – Error while branching PCollections
I have a pipeline that reads data from Kafka. It splits the incoming data into processing and rejected outputs. Data from Kafka is read into the custom class MyData, and the output is produced as KV<byte[], byte[]>. I define two TupleTags with MyData. InvalidDataDoFn has the application logic that splits MyData into processing and rejected outputs; OutputDoFn converts MyData into KV<byte[], byte[]>. While running
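A frequent cause of errors in this setup is creating TupleTags without the anonymous-subclass braces, which erases the element type and breaks coder inference. A minimal sketch of the branching pattern, assuming an `input` PCollection<MyData> and a hypothetical isValid() standing in for the application logic:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Anonymous subclasses ({} at the end) preserve the element type
// through erasure so Beam can infer coders for both outputs.
final TupleTag<MyData> processedTag = new TupleTag<MyData>() {};
final TupleTag<MyData> rejectedTag = new TupleTag<MyData>() {};

PCollectionTuple branches =
    input.apply("Validate",
        ParDo.of(new DoFn<MyData, MyData>() {
          @ProcessElement
          public void process(ProcessContext c) {
            MyData d = c.element();
            if (isValid(d)) {           // hypothetical validation
              c.output(d);              // main output: processing
            } else {
              c.output(rejectedTag, d); // tagged output: rejected
            }
          }
        }).withOutputTags(processedTag, TupleTagList.of(rejectedTag)));

PCollection<MyData> processed = branches.get(processedTag);
PCollection<MyData> rejected = branches.get(rejectedTag);
```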
Apache Beam: Kafka consumer restarted over and over again
I have this very simple Beam pipeline that reads records from a Kafka topic and writes them to a Pulsar topic: From my understanding this should create exactly one Kafka consumer that pushes its values down the pipeline. Now for some reason the pipeline seems to restart over and over again, creating multiple Kafka consumers and multiple Pulsar producers. Here
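Repeated restarts usually mean the pipeline is crashing and being respawned by the runner, and a producer-per-bundle pattern multiplies Pulsar producers on every retry. A minimal sketch that pins one client and producer per DoFn instance via @Setup/@Teardown (broker URLs and topic names are placeholders):

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

pipeline
    .apply(KafkaIO.<String, String>read()
        .withBootstrapServers("kafka:9092")
        .withTopic("input-topic")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata())
    .apply(ParDo.of(new DoFn<KV<String, String>, Void>() {
      private transient PulsarClient client;
      private transient Producer<byte[]> producer;

      @Setup
      public void setup() throws Exception {
        // One client/producer per DoFn instance, not per element,
        // so workers don't spawn a new producer for every bundle.
        client = PulsarClient.builder()
            .serviceUrl("pulsar://pulsar:6650").build();
        producer = client.newProducer().topic("output-topic").create();
      }

      @ProcessElement
      public void process(ProcessContext c) throws Exception {
        producer.send(c.element().getValue().getBytes());
      }

      @Teardown
      public void teardown() throws Exception {
        if (producer != null) producer.close();
        if (client != null) client.close();
      }
    }));
```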
No repackaged dependencies when building Apache Beam Cassandra JAR
Trying to compile and use the snapshot of the Apache Beam Cassandra JAR. It seems the build does not package the Guava dependencies within the JAR. This causes compilation to fail when the JAR is used by other code – see the following exception: I couldn’t find any way to make the Gradle build package the required dependencies within the JAR. Building using
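One way to get a self-contained artifact is to build a fat JAR yourself and relocate Guava, rather than relying on Beam's own build. A minimal build.gradle sketch, assuming the Shadow plugin (the plugin version, Beam version, and relocation prefix are illustrative):

```groovy
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '5.2.0'
}

dependencies {
    implementation 'org.apache.beam:beam-sdks-java-io-cassandra:2.16.0'
}

shadowJar {
    // Bundle Guava and move it to a private package so it cannot
    // clash with whatever Guava the consuming code brings along.
    relocate 'com.google.common', 'repackaged.com.google.common'
}
```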
How do I use MapElements and KV together in Apache Beam?
I wanted to do something like: Where User is a custom datatype with an Avro coder and a constructor that takes a string. However, I get the following error: Cannot select from parameterized type I tried to change it to TypeDescriptor.of(KV.class) instead, but then I get: Incompatible types; Required PCollection> but ‘apply’ was inferred to OutputT: no instance(s) of
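TypeDescriptor.of(KV.class) cannot carry the type parameters, but TypeDescriptors.kvs(...) can. A minimal sketch, assuming an input PCollection<String> named `lines` and the User(String) constructor from the question:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

// TypeDescriptors.kvs spells out the full parameterized type
// KV<String, User>, which TypeDescriptor.of(KV.class) cannot.
PCollection<KV<String, User>> users =
    lines.apply(MapElements
        .into(TypeDescriptors.kvs(
            TypeDescriptors.strings(),
            TypeDescriptor.of(User.class)))
        .via(line -> KV.of(line, new User(line))));
```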
Sending PubSub message manually in Dataflow
How can I send a Pub/Sub message manually (that is to say, without using PubsubIO) in Dataflow? Importing (via Maven) google-cloud-dataflow-java-sdk-all 2.5.0 already imports a version of com.google.pubsub.v1 for which I was unable to find an easy way to send messages to a Pub/Sub topic (this version doesn’t, for instance, allow manipulating Publisher instances, which is the
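Adding the google-cloud-pubsub client library as its own Maven dependency gives access to Publisher. A minimal sketch, with project and topic names as placeholders:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.ProjectTopicName;
import com.google.pubsub.v1.PubsubMessage;

public class ManualPublish {
  public static void publishOnce() throws Exception {
    ProjectTopicName topic = ProjectTopicName.of("my-project", "my-topic");
    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8("hello"))
          .build();
      publisher.publish(message).get(); // block until the broker acks
    } finally {
      publisher.shutdown();
    }
  }
}
```

The client library's protobuf/gRPC versions may need to be aligned with those pulled in by the Dataflow SDK to avoid dependency clashes.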
Apache Beam framework – sort in descending order
How do you sort in descending order using the Apache Beam framework? I managed to create a word-count pipeline that sorts the output alphabetically by word, but I did not figure out how to invert the sorting order. Here is the code: This code produces a list of words sorted alphabetically with their respective counts: Edit 1: After some debugging
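Beam has no global sort, so ordering is usually imposed by a combiner such as Top.of(), which emits its result list "largest"-first under whatever comparator you give it; inverting the comparator therefore inverts the order. A minimal sketch over KV<String, Long> word counts (the count of 100 and the `wordCounts` name are placeholders):

```java
import java.io.Serializable;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.transforms.Top;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Orders word-count pairs alphabetically by word; must be
// Serializable so Beam can ship it to workers.
class ByWord implements Comparator<KV<String, Long>>, Serializable {
  @Override
  public int compare(KV<String, Long> a, KV<String, Long> b) {
    return a.getKey().compareTo(b.getKey());
  }
}

// Top.of keeps the 100 elements that are largest under ByWord and
// emits them largest-first, i.e. in reverse alphabetical order.
PCollection<List<KV<String, Long>>> reversed =
    wordCounts.apply(Top.of(100, new ByWord()));
```

To sort descending by count instead, compare on getValue() rather than getKey().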