I am trying to read an XML file into an Apache Beam pipeline. Some elements have namespaces, and the namespace declarations appear on the root node. I am able to parse the XML outside of Apache Beam using the standard JAXB parser. However, when I use the XmlIO.read() function with Beam I get the following exception: com.ctc.wstx.exc.WstxParsingException: Undeclared namespace prefix
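One common workaround is to skip XmlIO's record splitting, which parses each record element in isolation and therefore out of scope of the root's namespace declarations, and instead unmarshal whole documents with JAXB inside a DoFn. A minimal sketch, assuming hypothetical MyDocument/MyRecord JAXB classes, a `pipeline` variable, and a placeholder input path:

```java
import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

PCollection<MyRecord> records =
    pipeline
        .apply(FileIO.match().filepattern("gs://bucket/input/*.xml"))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new DoFn<FileIO.ReadableFile, MyRecord>() {
          @ProcessElement
          public void process(@Element FileIO.ReadableFile file,
                              OutputReceiver<MyRecord> out) throws Exception {
            // Parse the whole document so root-level namespace
            // declarations stay in scope for every element.
            String xml = file.readFullyAsUTF8String();
            Unmarshaller u =
                JAXBContext.newInstance(MyDocument.class).createUnmarshaller();
            MyDocument doc = (MyDocument) u.unmarshal(new StringReader(xml));
            doc.getRecords().forEach(out::output); // getRecords() is assumed
          }
        }));
```

The trade-off is that each file is parsed as a unit, so you lose XmlIO's ability to split one large file across workers.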
How to avoid a warning message when reading BigQuery data into a custom data type: Can’t verify serialized elements of type BoundedSource
I defined a custom data type following the document here: https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java#L127 and read data from BigQuery using the code below: https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/snippets/Snippets.java#L375 Warning message: Can’t verify serialized elements of type BoundedSource have well defined equals method. This may produce incorrect results on some PipelineRunner. This message occurs at the step TriggerIdCreation/Read(CreateSource)/Read(CreateSource)/Read(BoundedToUnboundedSourceAdapter)/StripIds.out0. I tried to add an equals() method to the custom data type
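That warning is emitted when elements are encoded with SerializableCoder and their class does not override equals(); note that the step name here points at the wrapped BoundedSource objects rather than the user type, so the message may persist even after the change. A minimal sketch of a value-equal custom type, assuming a hypothetical MyData class, that also sidesteps SerializableCoder by registering AvroCoder:

```java
import java.util.Objects;
import org.apache.avro.reflect.Nullable;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;

// Hypothetical element type; AvroCoder avoids Java serialization,
// and equals()/hashCode() make element equality well defined.
@DefaultCoder(AvroCoder.class)
public class MyData {
  @Nullable String id;
  long count;

  public MyData() {} // no-arg constructor required by AvroCoder

  public MyData(String id, long count) {
    this.id = id;
    this.count = count;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof MyData)) {
      return false;
    }
    MyData other = (MyData) o;
    return count == other.count && Objects.equals(id, other.id);
  }

  @Override
  public int hashCode() {
    return Objects.hash(id, count);
  }
}
```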
Dataflow writing a PCollection of GenericRecords to Parquet files
In an Apache Beam step I have a PCollection of KV<String, Iterable<KV<Long, GenericRecord>>>. I want to write all the records in the Iterable to the same Parquet file, deriving the file name from the key of the KV. My code snippet is given below. Answer I found
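FileIO.writeDynamic() combined with ParquetIO.sink() can route each key's records to its own file. A minimal sketch, assuming the Avro `schema` is available, `grouped` is the PCollection from the question, and the output path is a placeholder:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.transforms.Contextful;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

grouped
    // Flatten each group back into (key, record) pairs.
    .apply(ParDo.of(
        new DoFn<KV<String, Iterable<KV<Long, GenericRecord>>>,
                 KV<String, GenericRecord>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            String key = c.element().getKey();
            for (KV<Long, GenericRecord> kv : c.element().getValue()) {
              c.output(KV.of(key, kv.getValue()));
            }
          }
        }))
    .apply(FileIO.<String, KV<String, GenericRecord>>writeDynamic()
        .by(KV::getKey) // one destination per key
        .withDestinationCoder(StringUtf8Coder.of())
        .via(Contextful.fn(KV::getValue), ParquetIO.sink(schema))
        .to("gs://bucket/output/")
        .withNaming(key -> FileIO.Write.defaultNaming(key, ".parquet"))
        .withNumShards(1)); // a single file per destination
```

withNumShards(1) forces one file per key at the cost of write parallelism for that destination.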
Dataflow: string to pubsub message
I’m trying to do unit testing in Dataflow. For that test, at the beginning, I will start with a simple hardcoded string. The problem is that I need to transform that string into a Pub/Sub message. I got the following code to do that: But I get the following error: How should I create the Pub/Sub message from the
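A PubsubMessage can be built directly from a string's bytes. A minimal sketch for a test pipeline, with hardcoded inputs as placeholders (the explicit coder line is only needed where inference fails):

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessageWithAttributesCoder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;

PCollection<PubsubMessage> messages =
    pipeline
        .apply(Create.of("hello", "world"))
        .apply(MapElements
            .into(TypeDescriptor.of(PubsubMessage.class))
            // Wrap each string's UTF-8 bytes with empty attributes.
            .via(s -> new PubsubMessage(
                s.getBytes(StandardCharsets.UTF_8),
                Collections.emptyMap())));
// Coder inference can fail for PubsubMessage on some SDK versions;
// if so, set it explicitly:
messages.setCoder(PubsubMessageWithAttributesCoder.of());
```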
Beam – Error while branching PCollections
I have a pipeline that reads data from Kafka. It splits the incoming data into processing and rejected outputs. Data from Kafka is read into the custom class MyData, and the output is produced as KV<byte[], byte[]>. I define two TupleTags with MyData. InvalidDataDoFn has the application logic that splits MyData into processing and rejected outputs; OutputDoFn converts MyData into KV<byte[], byte[]>. While running
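A frequent cause of errors in this setup is creating TupleTags without the anonymous-subclass braces, which erases the element type and breaks coder inference. A minimal sketch of the branching pattern, assuming an `input` PCollection<MyData> and a hypothetical isValid() standing in for the application logic:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Anonymous subclasses ({} at the end) preserve the element type
// through erasure so Beam can infer coders for both outputs.
final TupleTag<MyData> processedTag = new TupleTag<MyData>() {};
final TupleTag<MyData> rejectedTag = new TupleTag<MyData>() {};

PCollectionTuple branches =
    input.apply("Validate",
        ParDo.of(new DoFn<MyData, MyData>() {
          @ProcessElement
          public void process(ProcessContext c) {
            MyData d = c.element();
            if (isValid(d)) {           // hypothetical validation
              c.output(d);              // main output: processing
            } else {
              c.output(rejectedTag, d); // tagged output: rejected
            }
          }
        }).withOutputTags(processedTag, TupleTagList.of(rejectedTag)));

PCollection<MyData> processed = branches.get(processedTag);
PCollection<MyData> rejected = branches.get(rejectedTag);
```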
Apache Beam: Kafka consumer restarted over and over again
I have this very simple Beam pipeline that reads records from a Kafka topic and writes them to a Pulsar topic: From my understanding this should create exactly one Kafka consumer that pushes its values down the pipeline. Now for some reason the pipeline seems to restart over and over again, creating multiple Kafka consumers and multiple Pulsar producers. Here
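Repeated restarts usually mean the pipeline is crashing and being respawned by the runner, and a producer-per-bundle pattern multiplies Pulsar producers on every retry. A minimal sketch that pins one client and producer per DoFn instance via @Setup/@Teardown (broker URLs and topic names are placeholders):

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

pipeline
    .apply(KafkaIO.<String, String>read()
        .withBootstrapServers("kafka:9092")
        .withTopic("input-topic")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata())
    .apply(ParDo.of(new DoFn<KV<String, String>, Void>() {
      private transient PulsarClient client;
      private transient Producer<byte[]> producer;

      @Setup
      public void setup() throws Exception {
        // One client/producer per DoFn instance, not per element,
        // so workers don't spawn a new producer for every bundle.
        client = PulsarClient.builder()
            .serviceUrl("pulsar://pulsar:6650").build();
        producer = client.newProducer().topic("output-topic").create();
      }

      @ProcessElement
      public void process(ProcessContext c) throws Exception {
        producer.send(c.element().getValue().getBytes());
      }

      @Teardown
      public void teardown() throws Exception {
        if (producer != null) producer.close();
        if (client != null) client.close();
      }
    }));
```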
No repackaged dependencies when building Apache Beam Cassandra JAR
Trying to compile and use the snapshot of the Apache Beam Cassandra JAR. It seems the build does not package the Guava dependencies within the JAR. This causes compilation to fail when the JAR is used by other code – see the following exception: I couldn’t find any way to make the Gradle build package the required dependencies within the JAR. Building using
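One way to get a self-contained artifact is to build a fat JAR yourself and relocate Guava, rather than relying on Beam's own build. A minimal build.gradle sketch, assuming the Shadow plugin (the plugin version, Beam version, and relocation prefix are illustrative):

```groovy
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '5.2.0'
}

dependencies {
    implementation 'org.apache.beam:beam-sdks-java-io-cassandra:2.16.0'
}

shadowJar {
    // Bundle Guava and move it to a private package so it cannot
    // clash with whatever Guava the consuming code brings along.
    relocate 'com.google.common', 'repackaged.com.google.common'
}
```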
How do I use MapElements and KV together in Apache Beam?
I wanted to do something like: Where User is a custom datatype with an Avro coder and a constructor that takes a string. However, I get the following error: Cannot select from parameterized type I tried to change it to TypeDescriptor.of(KV.class) instead, but then I get: Incompatible types; Required PCollection> but ‘apply’ was inferred to OutputT: no instance(s) of
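TypeDescriptor.of(KV.class) cannot carry the type parameters, but TypeDescriptors.kvs(...) can. A minimal sketch, assuming an input PCollection<String> named `lines` and the User(String) constructor from the question:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

// TypeDescriptors.kvs spells out the full parameterized type
// KV<String, User>, which TypeDescriptor.of(KV.class) cannot.
PCollection<KV<String, User>> users =
    lines.apply(MapElements
        .into(TypeDescriptors.kvs(
            TypeDescriptors.strings(),
            TypeDescriptor.of(User.class)))
        .via(line -> KV.of(line, new User(line))));
```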
Sending PubSub message manually in Dataflow
How can I send a Pub/Sub message manually (that is to say, without using PubsubIO) in Dataflow? Importing (via Maven) google-cloud-dataflow-java-sdk-all 2.5.0 already imports a version of com.google.pubsub.v1 for which I was unable to find an easy way to send messages to a Pub/Sub topic (this version doesn’t, for instance, allow manipulating Publisher instances, which is the
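Adding the google-cloud-pubsub client library as its own Maven dependency gives access to Publisher. A minimal sketch, with project and topic names as placeholders:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.ProjectTopicName;
import com.google.pubsub.v1.PubsubMessage;

public class ManualPublish {
  public static void publishOnce() throws Exception {
    ProjectTopicName topic = ProjectTopicName.of("my-project", "my-topic");
    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8("hello"))
          .build();
      publisher.publish(message).get(); // block until the broker acks
    } finally {
      publisher.shutdown();
    }
  }
}
```

The client library's protobuf/gRPC versions may need to be aligned with those pulled in by the Dataflow SDK to avoid dependency clashes.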
Apache Beam framework – sort in descending order
How do you sort in descending order using the Apache Beam framework? I managed to create a word-count pipeline that sorts the output alphabetically by word, but I did not figure out how to invert the sorting order. Here is the code: This code produces a list of words sorted alphabetically with their respective counts: Edit 1: After some debugging
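Beam has no global sort, so ordering is usually imposed by a combiner such as Top.of(), which emits its result list "largest"-first under whatever comparator you give it; inverting the comparator therefore inverts the order. A minimal sketch over KV<String, Long> word counts (the count of 100 and the `wordCounts` name are placeholders):

```java
import java.io.Serializable;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.transforms.Top;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Orders word-count pairs alphabetically by word; must be
// Serializable so Beam can ship it to workers.
class ByWord implements Comparator<KV<String, Long>>, Serializable {
  @Override
  public int compare(KV<String, Long> a, KV<String, Long> b) {
    return a.getKey().compareTo(b.getKey());
  }
}

// Top.of keeps the 100 elements that are largest under ByWord and
// emits them largest-first, i.e. in reverse alphabetical order.
PCollection<List<KV<String, Long>>> reversed =
    wordCounts.apply(Top.of(100, new ByWord()));
```

To sort descending by count instead, compare on getValue() rather than getKey().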