If I have a list/Seq of columns in Scala, I can easily use it in partitionBy or groupBy. But if I want to do the same thing in the Spark Java API, what should I do? Answer partitionBy has two signatures, so you may choose between the two. Let's say that partitions is a list of String.
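A minimal Java sketch under that assumption (df, the paths and the column names are illustrative, and the write-side partitionBy is one plausible reading of the question):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

public class PartitionByFromList {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();
        Dataset<Row> df = spark.read().parquet("/tmp/input");   // illustrative input path

        List<String> partitions = Arrays.asList("year", "month");

        // partitionBy(String... colNames): expand the list into a varargs array.
        df.write()
          .partitionBy(partitions.toArray(new String[0]))
          .parquet("/tmp/output");

        // groupBy(Column... cols): map the names to Column objects first.
        Column[] groupCols = partitions.stream()
                .map(functions::col)
                .toArray(Column[]::new);
        df.groupBy(groupCols).count().show();

        spark.stop();
    }
}
```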
Symbol ‘type scala.package.Serializable’ is missing from the classpath
My classpath is missing the Serializable and Cloneable classes, and I am not sure how to fix this. I have an sbt application which looks like this; when I do an sbt build I am getting the error below. My dependency tree only shows jars, but this seems to be a class/package conflict or a missing dependency. Answer You're using an incompatible Scala version (2.13.6). From the
Read values from Java Map using Spark Column using java
I have tried the code below to get Map values via a Spark column in Java, but I am getting a null value where I expect the exact value from the Map for the searched key. The Spark Dataset contains one column named KEY; the dataset is named dataset1. Values in the dataset: Java Code – Current Output is: Expected Output: Please help me get this expected output.
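One hedged way to do this in Java, assuming dataset1 has the single KEY column described above and the map contents are illustrative, is to turn the Java map into a literal map column and read it with element_at (available since Spark 2.4):

```java
import static org.apache.spark.sql.functions.*;

import java.util.*;

import org.apache.spark.sql.*;

public class MapLookupExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // Stand-in for the poster's dataset1 with a single KEY column.
        Dataset<Row> dataset1 = spark
                .createDataset(Arrays.asList("A", "B", "C"), Encoders.STRING())
                .toDF("KEY");

        // Illustrative Java map to look values up in.
        Map<String, String> lookup = new LinkedHashMap<>();
        lookup.put("A", "Apple");
        lookup.put("B", "Banana");

        // Turn the Java map into a literal MapType column: map(lit(k1), lit(v1), ...).
        List<Column> entries = new ArrayList<>();
        for (Map.Entry<String, String> e : lookup.entrySet()) {
            entries.add(lit(e.getKey()));
            entries.add(lit(e.getValue()));
        }
        Column mapCol = map(entries.toArray(new Column[0]));

        // element_at returns the value for each row's KEY, or null when the key is absent.
        dataset1.withColumn("VALUE", element_at(mapCol, col("KEY"))).show();

        spark.stop();
    }
}
```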
Compare schema of dataframe with schema of other dataframe
I have schemas from two datasets read from an HDFS path, defined as below: val df = spark.read.parquet("/path") df.printSchema() Answer Since your schema file seems to be a CSV: use isSchemaMatching for further logic
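A hedged Java rendering of that comparison (the question itself uses Scala; df1, df2 and the paths are stand-ins, and isSchemaMatching mirrors the flag named in the answer):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class SchemaCompare {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // Illustrative paths; the poster reads both datasets from HDFS.
        Dataset<Row> df1 = spark.read().parquet("/path/one");
        Dataset<Row> df2 = spark.read().parquet("/path/two");

        StructType schema1 = df1.schema();
        StructType schema2 = df2.schema();

        // StructType implements equals(), so this compares field names, types and nullability.
        boolean isSchemaMatching = schema1.equals(schema2);

        if (!isSchemaMatching) {
            // Print both schema trees to see where they diverge.
            System.out.println(schema1.treeString());
            System.out.println(schema2.treeString());
        }

        spark.stop();
    }
}
```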
Why is my Maven sub-dependency version for the Spark connector package different from others?
I am trying to use a pom file from an existing project and I am getting the error "Cannot resolve org.yaml:snakeyaml:1.15". What I found out about this error is that com.datastax.spark:spark-cassandra-connector_2.11:2.5.0 uses a couple of dependencies, and a couple of levels down it uses snakeyaml:1.15, which is quarantined by the company proxy. Is there a way to specify for a given
UnsupportedOperationException while creating a dataset manually using Java SparkSession
I am trying to create a Dataset from Strings, as below, in my JUnit test, but I am seeing the error below: What am I missing here? My main method works fine, but this test is failing. It looks like something is not being read from the classpath correctly. Answer I fixed it by excluding the dependency below from all dependencies related
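For context, a minimal sketch of creating a Dataset from Strings in a local-mode JUnit test; the class name and values are illustrative, and it assumes compatible Spark test dependencies are on the classpath:

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.junit.Test;

public class CreateDatasetTest {

    @Test
    public void createsDatasetFromStrings() {
        SparkSession spark = SparkSession.builder()
                .appName("dataset-test")
                .master("local[2]")
                .getOrCreate();

        // createDataset takes a java.util.List plus an Encoder for the element type.
        Dataset<String> ds = spark.createDataset(
                Arrays.asList("alpha", "beta", "gamma"),
                Encoders.STRING());

        assertEquals(3L, ds.count());
        spark.stop();
    }
}
```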
Spark UDF function fails on Standalone Spark
I have a Spring Boot Java application, myapp.jar, with a UDF function. SparkConfuration.java ToIntegerUdf.java sparkJars contains the path to myJar.jar. The application is built with Maven. The Spark library version is 3.0.2 and the Scala version is 2.12.10. When I run the application on Spark Standalone 3.0.2 I get an error: In the Spark worker log I see the worker fetch myJar: 21/03/23 19:33:24 INFO Executor: Fetching spark://demo.phoenixit.ru:39597/jars/myJar.jar with
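The original SparkConfuration.java and ToIntegerUdf.java are not shown, so the sketch below is only a guess at what such a UDF and its registration typically look like; the names and body are illustrative, and on a standalone cluster the jar containing the UDF class still has to reach the executors via spark.jars/sparkJars:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class UdfRegistrationExample {

    // Illustrative stand-in for the poster's ToIntegerUdf: parse a String into an Integer.
    public static class ToIntegerUdf implements UDF1<String, Integer> {
        @Override
        public Integer call(String value) {
            return value == null ? null : Integer.valueOf(value.trim());
        }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("udf-example")
                .getOrCreate();

        // Register the UDF; the jar containing ToIntegerUdf must be shipped to the
        // executors, otherwise the workers cannot deserialize the function.
        spark.udf().register("toInteger", new ToIntegerUdf(), DataTypes.IntegerType);

        spark.sql("SELECT toInteger('42') AS n").show();
        spark.stop();
    }
}
```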
How to compile spark-testing-base in Java project built with maven?
I don’t have a lot of experience with Java, but I built a Spark application using Java. I want to write some unit tests for my Spark application. I saw that spark-testing-base is very useful for that purpose. I have added the following to my pom.xml: I’m using the JUnit framework and my tests fail when trying to reach jsc(). My
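For reference, a minimal test built on spark-testing-base's Java support might look like the sketch below; it assumes an artifact matching your Spark and Scala versions is on the test classpath, with jsc() coming from extending SharedJavaSparkContext:

```java
import static org.junit.Assert.assertEquals;

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.junit.Test;

import com.holdenkarau.spark.testing.SharedJavaSparkContext;

// SharedJavaSparkContext creates a JavaSparkContext for the test and exposes it via jsc().
public class WordCountTest extends SharedJavaSparkContext {

    @Test
    public void countsElements() {
        JavaRDD<String> rdd = jsc().parallelize(Arrays.asList("a", "b", "c"));
        assertEquals(3L, rdd.count());
    }
}
```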
Caused by: java.lang.ClassNotFoundException: play.api.libs.functional.syntax.package
I am getting the following error (Caused by: java.lang.ClassNotFoundException: play.api.libs.functional.syntax.package) while trying to run my code. I have the right dependencies and added the right Jar …
Issue with Spark Big Query Connector with Java
Getting the issue below with the Spark BigQuery connector in a Dataproc cluster with the following configuration. Image: 1.5.21-debian10 Spark Version: 2.4.7 Scala Version: 2.12.10 This works fine locally but fails when I deploy it in the Dataproc cluster. Can someone suggest some pointers for this issue? pom.xml: Here is the sample code: Answer Can you please replace the Spark BigQuery connector
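For orientation, a minimal Java read through the connector looks roughly like the sketch below; the table name is illustrative, and it assumes a connector build matching the cluster's Scala version (2.12 here) is on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BigQueryReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bigquery-read")
                .getOrCreate();

        // "bigquery" is the data source name registered by the connector;
        // the public table below is only an example.
        Dataset<Row> df = spark.read()
                .format("bigquery")
                .option("table", "bigquery-public-data.samples.shakespeare")
                .load();

        df.show(10);
        spark.stop();
    }
}
```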