I have a CSV file with about 5000 rows and 950 columns. First I load it into a DataFrame. After that I find all string columns and want to index them, so I create a StringIndexer for each string column and combine them into a pipeline. But when I try to transform my initial DataFrame with this pipeline, I get a StackOverflowError. What am I doing wrong?
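The question's own code is not included, but here is a minimal sketch of the setup it describes, assuming Java and Spark's ML Pipeline API; the file path, column suffix, and session setup are placeholders:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;

public class IndexAllStrings {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("index-all-strings")
                .master("local[*]")
                .getOrCreate();

        // Load the CSV into a DataFrame; "data.csv" is a placeholder path
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data.csv");

        // One StringIndexer per string column, as the question describes
        List<PipelineStage> stages = new ArrayList<>();
        for (StructField field : df.schema().fields()) {
            if (field.dataType().equals(DataTypes.StringType)) {
                stages.add(new StringIndexer()
                        .setInputCol(field.name())
                        .setOutputCol(field.name() + "_idx"));
            }
        }

        // A single pipeline holding all of the indexers at once
        Pipeline pipeline = new Pipeline()
                .setStages(stages.toArray(new PipelineStage[0]));
        PipelineModel model = pipeline.fit(df);
        Dataset<Row> indexed = model.transform(df);
        indexed.show(5);

        spark.stop();
    }
}
```

With hundreds of string columns, each indexer stacks another projection onto the query plan, so the plan becomes extremely deep; a StackOverflowError during transform is a plausible symptom of that depth, and a larger driver stack (e.g. `spark.driver.extraJavaOptions=-Xss4m`) or splitting the work across fewer stages per pipeline are common workarounds.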
Tag: apache-spark
How can I access values in a scala.collection.mutable.WrappedArray of WrappedArrays in Java
I am parsing a JSON file with Spark SQL in Java and need to access coordinates that come back as what appears to be a WrappedArray of WrappedArrays. Here is the code:

OUTPUT: WrappedArray(WrappedArray(30.74806, 40.79944))

file.json

Answer: Spark SQL's Row has a getList method that returns a Java List instead of a WrappedArray. So, in the above example, one can read the coordinates through getList rather than unwrapping the WrappedArrays by hand.
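The answer's code is cut off above, so here is a hedged sketch of the getList approach it points to; the column name "coordinates" is an assumption, and file.json comes from the question:

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import scala.collection.Seq;

public class ReadCoordinates {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-coordinates")
                .master("local[*]")
                .getOrCreate();

        // "coordinates" is an assumed column name for the nested array
        Dataset<Row> df = spark.read().json("file.json");
        Row first = df.select("coordinates").first();

        // getList converts the outer WrappedArray into a java.util.List;
        // each element is still a Scala sequence (a WrappedArray at runtime)
        List<Seq<Double>> points = first.getList(0);
        Seq<Double> point = points.get(0);

        Double lon = point.apply(0); // 30.74806
        Double lat = point.apply(1); // 40.79944
        System.out.println(lon + ", " + lat);

        spark.stop();
    }
}
```

Going through scala.collection.Seq rather than the concrete WrappedArray class keeps the Java code independent of which wrapper implementation a given Scala/Spark version returns, since WrappedArray implements Seq.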
Spark read file from S3 using sc.textFile("s3n://...")
Trying to read a file located in S3 using spark-shell. The IOException: No FileSystem for scheme: s3n error occurred with:

- Spark 1.3.1 or 1.4.0 on a dev machine (no Hadoop libs)
- Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.6.0), which integrates Spark 1.2.1 out of the box
- Using the s3:// or s3n:// scheme

What is the cause of this error? A missing dependency, a missing configuration, or a misuse of sc.textFile()?
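The shell session itself is not shown, but here is a minimal Java sketch of one common fix, assuming the hadoop-aws module (and its AWS SDK dependency) is on the classpath, e.g. via spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.7; the bucket, file, and credential values are placeholders:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3Read {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("s3-read")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Register the s3n filesystem implementation and credentials explicitly;
        // these key names are the standard Hadoop properties for the s3n connector
        sc.hadoopConfiguration().set("fs.s3n.impl",
                "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        // Placeholder bucket and key
        JavaRDD<String> lines = sc.textFile("s3n://some-bucket/some-file.txt");
        System.out.println(lines.count());

        sc.close();
    }
}
```

The "No FileSystem for scheme: s3n" message typically means the class behind fs.s3n.impl is not on the classpath at all: in Hadoop 2.6+, NativeS3FileSystem lives in the separate hadoop-aws artifact rather than in hadoop-common, which is why a bare Spark build or sandbox install fails until that jar is added.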