
Tag: apache-spark

Apache Spark Streaming with Java & Kafka

I’m trying to run the Spark Streaming example from the official Spark website. These are the dependencies I use in my pom file: This is my Java code: When I try to run it from Eclipse I get the following exception: I run this from my IDE (Eclipse). Do I have to create and deploy the JAR into Spark to make it…
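The excerpt omits the original pom and code, so as a rough reconstruction, here is a minimal sketch of a Java Spark Streaming consumer using the spark-streaming-kafka-0-10 integration; the broker address, topic name, and group id are placeholder assumptions:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class KafkaStreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        // local[2] or higher is required: one thread to receive, one to process.
        SparkConf conf = new SparkConf()
                .setAppName("KafkaStreaming")
                .setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "example-group");           // assumption
        kafkaParams.put("auto.offset.reset", "latest");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        jssc,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(
                                Collections.singletonList("test-topic"), kafkaParams));

        // Print the message payloads of each micro-batch.
        stream.map(ConsumerRecord::value).print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```

As for the question itself: you don’t necessarily need to build and deploy a JAR. A frequent cause of exceptions when launching from Eclipse is that the Spark dependencies are declared with provided scope in the pom; with compile-scoped dependencies and a local[*] master, running straight from the IDE is usually fine for experimentation.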

Data type mismatch while transforming data in spark dataset

I created a parquet structure from a CSV file using Spark: I’m reading the parquet structure and trying to transform the data in a dataset: Unfortunately I get a data type mismatch error. Do I have to assign the data types explicitly? 17/04/12 09:21:52 INFO SparkSqlParser: Parsing command: SELECT *, md5(station_id) as hashkey FROM tmpview Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve…
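The AnalysisException suggests md5 is being applied to a non-string column: in Spark SQL, md5 expects a binary (or string) argument, so an integer column such as station_id needs an explicit cast before hashing. A minimal sketch of that fix (the parquet path is a hypothetical placeholder):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class Md5HashkeySketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Md5Hashkey")
                .master("local[*]")
                .getOrCreate();

        // Assumption: the parquet output from the CSV conversion lives here.
        Dataset<Row> df = spark.read().parquet("/tmp/stations.parquet");
        df.createOrReplaceTempView("tmpview");

        // Cast station_id to string so md5 receives a type it can hash.
        Dataset<Row> hashed = spark.sql(
                "SELECT *, md5(CAST(station_id AS STRING)) AS hashkey FROM tmpview");
        hashed.show();

        spark.stop();
    }
}
```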

How to use join with a gt condition in Java?

I want to join two dataframes based on the following condition: df1.col("name") equals df2.col("name") and df1.col("starttime") is greater than df2.col("starttime"). The first part of the condition is fine; I use the "equalTo" method of the Column class in Spark SQL. But for the "greater than" condition, when I use the following syntax in Java: it does not work. It seems "gt"…
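Column objects in the Java API can’t be compared with operators, but the Column class provides gt for exactly this, and conditions are chained with and. A minimal sketch of the combined join condition, using the column names from the question:

```java
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class GreaterThanJoinSketch {
    // Joins on equal name and df1.starttime > df2.starttime.
    // gt() compares the column *values*; comparing Column objects
    // with > is not possible in Java.
    public static Dataset<Row> joinWithGt(Dataset<Row> df1, Dataset<Row> df2) {
        Column condition = df1.col("name").equalTo(df2.col("name"))
                .and(df1.col("starttime").gt(df2.col("starttime")));
        return df1.join(df2, condition);
    }
}
```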

NoSuchMethodError in shapeless seen only in Spark

I am trying to write a Spark connector to pull Avro messages off a RabbitMQ message queue. When decoding the Avro messages, a NoSuchMethodError occurs, but only when running in Spark. I could not reproduce the Spark code exactly outside of Spark, but I believe the two examples are sufficiently similar. I think this is the smallest…
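A NoSuchMethodError that appears only inside Spark usually points to a classpath conflict: Spark’s runtime classpath can carry a different version of shapeless (pulled in transitively) than the one the connector was compiled against, and Spark’s copy wins. One quick diagnostic, as a sketch (the class name is an assumption; substitute the one from your stack trace):

```java
// Prints which jar a class was actually loaded from -- useful for
// confirming that Spark's classpath shadows your own dependency.
public class WhichJarSketch {
    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> clazz = Class.forName("shapeless.HList"); // assumption
        // getCodeSource() can be null for JDK bootstrap classes,
        // but library classes report their jar location.
        System.out.println(
                clazz.getProtectionDomain().getCodeSource().getLocation());
    }
}
```

If the printed jar belongs to Spark rather than your build, the usual workarounds are shading/relocating the conflicting dependency in your build, or setting spark.driver.userClassPathFirst=true (and the executor equivalent) so your version is preferred.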

How to convert JavaRDD&lt;Row&gt; to JavaRDD&lt;List&lt;String&gt;&gt;?

I tried to do it with this code, but I get a WrappedArray. How do I do it correctly? Answer You can use the getList method: where lemmas is the name of the column with the lemmatized text. If there is only one column (it looks like this is the case) you can skip the select. If you know the index of the column you can…
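A sketch of the getList approach the answer describes, assuming the dataframe has a single array-typed column named lemmas as in the question:

```java
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class RowToListSketch {
    // getList unwraps the underlying Scala WrappedArray into a
    // java.util.List, which is what the question is after.
    public static JavaRDD<List<String>> toLists(Dataset<Row> df) {
        return df.select("lemmas")          // optional if lemmas is the only column
                 .javaRDD()
                 .map(row -> row.<String>getList(0)); // 0 = index of the column
    }
}
```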

Spark (Java) – dataframe groupBy with multiple aggregations?

I’m trying to write a groupBy on Spark with Java. In SQL this would look like: But what is the Spark/Java equivalent of this query? Let’s say the variable table is a dataframe, to see the relation to the SQL query. I’m thinking of something like: which is obviously incorrect, since you can’t use aggregate functions like .count or .max…
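In the DataFrame API, multiple aggregations go inside a single agg call after groupBy, using the static functions from org.apache.spark.sql.functions rather than methods on the grouped data. A minimal sketch, with key and value as placeholder column names standing in for the omitted SQL query:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

public class GroupByAggSketch {
    // Rough equivalent of:
    //   SELECT key, COUNT(*) AS cnt, MAX(value) AS max_value
    //   FROM table GROUP BY key
    public static Dataset<Row> aggregate(Dataset<Row> table) {
        return table.groupBy(col("key"))
                .agg(count(lit(1)).alias("cnt"),      // COUNT(*)
                     max(col("value")).alias("max_value"));
    }
}
```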
