If run / fork := true is removed from sbt, then: Caused by: java.io.FileNotFoundException: /Users/ajitkumar/Downloads/flice/sensor-nws/target/bg-jobs/sbt_4be36759/target/135c9252/81ecd14d/hadoop-client-api-3.3.1.jar (No such file or directory). If it is not removed, the code results in the behaviour described above. Answer The problem gets solved after adding run / connectInput := true to build.sbt. More on this: https://github.com/sbt/sbt/issues/229
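A minimal build.sbt sketch of that fix, assuming the run task is the one launching the Spark job (only connectInput is the addition the answer describes; fork stays as it was):

```scala
// Keep run forked so the staged jars under target/bg-jobs resolve,
// and forward this process's stdin to the forked JVM (the reported fix).
run / fork := true
run / connectInput := true
```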
Tag: apache-spark
Running unit tests with Spark 3.3.0 on Java 17 fails with IllegalAccessError: class StorageUtils cannot access class sun.nio.ch.DirectBuffer
According to the release notes, and specifically the ticket Build and Run Spark on Java 17 (SPARK-33772), Spark now supports running on Java 17. However, using Java 17 (Temurin-17.0.3+7) with Maven (3.8.6) and maven-surefire-plugin (3.0.0-M7), running a unit test that uses Spark (3.3.0) fails with: java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x1e7ba8d9) cannot access class sun.nio.ch.DirectBuffer (in module java.base) …
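This is JPMS encapsulation of sun.nio.ch on Java 17. A hedged sketch of the usual remedy: pass the --add-exports/--add-opens flags Spark itself uses (see its JavaModuleOptions) to the test JVM via surefire's argLine; the single flag below targets the DirectBuffer error specifically:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <version>3.0.0-M7</version>
  <configuration>
    <!-- Export sun.nio.ch to unnamed modules so StorageUtils can reach DirectBuffer -->
    <argLine>--add-exports java.base/sun.nio.ch=ALL-UNNAMED</argLine>
  </configuration>
</plugin>
```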
How to create a struct column from a list of column names in Spark with Java?
I have a DataFrame with multiple columns. I also have a list of the column names which corresponds to bowling stats: List<String> bowlingParams = new ArrayList<>(Arrays.asList("bowlingAvg", "bowlingSR", "wickets")); Expected schema: … I can do it like this: … However, I want to use the list to dynamically select the columns for the struct. I know we can do it like this in …
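A sketch of the dynamic version in Java, assuming df is the DataFrame above and "bowlingStats" is a hypothetical name for the struct column:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Map each name in the list to a Column, then pass the array to struct().
List<String> bowlingParams = Arrays.asList("bowlingAvg", "bowlingSR", "wickets");
Column[] cols = bowlingParams.stream().map(functions::col).toArray(Column[]::new);
Dataset<Row> result = df.withColumn("bowlingStats", functions.struct(cols));
```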
Using createOrReplaceTempView to replace a temp view not working as expected
I have a dataset something similar to this. My Spark code is … I am trying to replace the people view by calling createOrReplaceTempView, but I get the following error: … How do I replace the view in Spark? Answer So I got the solution to the above question with the following code …
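The asker's solution code is not reproduced above; a minimal sketch of the pattern, assuming a spark session and a hypothetical people.json source. Each createOrReplaceTempView call simply rebinds the name to a new Dataset, so derive the replacement from the underlying Dataset rather than from a query on the view being replaced:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

Dataset<Row> people = spark.read().json("people.json"); // hypothetical source
people.createOrReplaceTempView("people");

// Build the replacement from the Dataset, not from "SELECT ... FROM people"
Dataset<Row> adults = people.filter(functions.col("age").geq(18));
adults.createOrReplaceTempView("people"); // rebinds the view name
spark.sql("SELECT * FROM people").show(); // now reads the filtered data
```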
Bigrams in Spark using Java
I already have the sentences in an RDD, and the output looks like: RT @DougJ7777: If Britain wins #Eurovision then we have to rejoin the EU. It’s in the rules. #Eurovision2018 RT @Mystificus: Of course I’ll watch #eurovision tonight. After all, 200 million people can’t be wrong, can they? Er… RT @KlNGNEUER: Me when Europeans make fun of Eurovision VS
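A sketch of one way to produce bigrams with the Java RDD API, assuming sentences is the JavaRDD<String> of tweets shown above:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

// Split each sentence on whitespace and emit each adjacent token pair.
JavaRDD<String> bigrams = sentences.flatMap(sentence -> {
    String[] tokens = sentence.split("\\s+");
    List<String> pairs = new ArrayList<>();
    for (int i = 0; i < tokens.length - 1; i++) {
        pairs.add(tokens[i] + " " + tokens[i + 1]);
    }
    return pairs.iterator();
});
```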
Spark Java: how to select a newly added column using withColumn
I am trying to create a Java Spark program and am trying to add a new column using … When I try to select it, it says Cannot resolve column name newColumn. Can someone please help me with how to do this in Java? Answer qdf is the dataframe from before you added the newColumn, which is why you are unable to select it.
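In other words, withColumn returns a new Dataset rather than mutating qdf. A sketch, with a hypothetical literal expression standing in for the original one:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// withColumn returns a NEW Dataset; select from the returned value.
Dataset<Row> ndf = qdf.withColumn("newColumn", functions.lit("value"));
ndf.select("newColumn").show();   // works
// qdf.select("newColumn")        // would still fail: qdf is unchanged
```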
Scala No Method Found Exception
I am using … and getting the below error: … My POM: … Any help regarding this? Answer Check mvn dependency:tree. All your Scala libs will be suffixed with the major Scala version: … All of them need to be the same major version, otherwise you’ll get binary-incompatible libs at runtime. Your Maven POM should have all Scala libs …
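A sketch of the usual POM convention for keeping those suffixes aligned, with _2.12 as an illustrative choice of Scala major version:

```xml
<properties>
  <scala.binary.version>2.12</scala.binary.version>
</properties>

<dependencies>
  <!-- Every Scala artifact carries the same binary-version suffix -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>3.3.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>3.3.0</version>
  </dependency>
</dependencies>
```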
A Parquet file of a dataset with a String field containing leading zeroes returns that field without the leading zeroes if the dataset is partitioned by that field
I have a Dataset gathering information about French cities, and the field that is troubling me is the department one (codeDepartement). When the Dataset isn’t partitioned by this String field codeDepartement, everything works well: when that function runs without attempting to partition the dataset (the statements required for partitioning are commented out here), everything goes fine. The content …
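What likely happens here is partition-column type inference on read: values such as "01" in the codeDepartement=01 directory names are parsed back as integers. A hedged sketch of the standard knob for this, assuming spark is the session used for reading and the path is hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Disable partition-column type inference so codeDepartement ("01", "2A", ...)
// is read back as a string, keeping its leading zeroes.
spark.conf().set("spark.sql.sources.partitionColumnTypeInference.enabled", "false");
Dataset<Row> cities = spark.read().parquet("/data/cities"); // hypothetical path
cities.printSchema(); // codeDepartement: string
```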
Unable to connect to a database using JDBC within Spark with Scala
I’m trying to read data from JDBC in Spark with Scala. Below is the code, written in Databricks. I’m getting the following error message: … Could someone please let me know how to resolve this issue? Answer The certificate used by your host is not trusted by Java. Solution 1 (easy, not recommended): disable certificate checking and always trust the certificate provided by the server.
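A sketch of Solution 1, assuming the Microsoft SQL Server driver (whose URL accepts a trustServerCertificate flag); host, database, table, and credentials are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// encrypt=true;trustServerCertificate=true skips certificate validation:
// convenient for development, not recommended for production.
Dataset<Row> jdbcDf = spark.read()
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb;"
        + "encrypt=true;trustServerCertificate=true")
    .option("dbtable", "dbo.my_table")
    .option("user", "my_user")
    .option("password", "my_password")
    .load();
```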
Spark Dataset Foreach function does not iterate
Context: I want to iterate over a Spark Dataset and update a HashMap for each row. Here is the code I have: … Issue: My issue is that the foreach doesn’t iterate at all; the lambda is never executed and I don’t know why. I implemented it as indicated here: How to traverse/iterate a Dataset in Spark Java? At the end, …
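What usually explains this: foreach runs the lambda on the executors, so a driver-side HashMap captured in the closure is serialized, mutated remotely, and discarded; the driver's map never changes even when the lambda does run. A sketch of a driver-side alternative, with a hypothetical column name:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.Row;

// Bring the rows to the driver before touching driver-local state.
Map<String, Long> counts = new HashMap<>();
for (Row row : ds.collectAsList()) {          // or ds.toLocalIterator() for large data
    String key = row.getAs("someColumn");     // hypothetical column
    counts.merge(key, 1L, Long::sum);
}
```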