Spark read file from S3 using sc.textFile ("s3n://…)

Question

Trying to read a file located in S3 using spark-shell. The error "IOException: No FileSystem for scheme: s3n" occurred with: Spark 1.3.1 or 1.4.0 on a dev machine (no Hadoop libs); running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.6.0), which integrates Spark 1.2.1 out of the box; using the s3:// or s3n:// scheme. W…

Accepted Answer

Confirmed that this is related to the Spark build against Hadoop 2.6.0. I just installed Spark 1.4.0 "Pre-built for Hadoop 2.4 and later" (instead of the Hadoop 2.6 build), and the code now works.

sc.textFile("s3n://bucketname/Filename") now raises a different error:

    java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

The code below uses the S3 URL format to show that Spark can read an S3 file, using a dev machine (no Hadoop libs).

    scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
    lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21

    scala> lyrics.count
    res1: Long = 9

Even better: the code above, with AWS credentials inline in the S3N URI, will break if the AWS Secret Key contains a forward slash ("/"). Configuring the AWS credentials in the SparkContext fixes it. The code works whether the S3 file is public or private.

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
    val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
    myRDD.count
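To illustrate why a "/" in the secret key is a problem for the inline-credentials form, here is a minimal sketch that parses two such URIs with plain java.net.URI. It can be pasted into spark-shell or the Scala REPL and needs no S3 access. The credential strings are placeholders, and "My/SecretKey" is a made-up value chosen only to contain a slash. This shows how the JDK splits the string; it is an assumption that Hadoop's s3n handling fails for the same reason, since its internals are not inspected here.

```scala
import java.net.URI

// Placeholder credentials; "My/SecretKey" is hypothetical and only there
// to put a "/" inside the secret.
val plain   = new URI("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
val slashed = new URI("s3n://MyAccessKeyID:My/SecretKey@zpub01/SafeAndSound_Lyrics.txt")

// No slash in the secret: the credentials are parsed as user info,
// the bucket as the host, and the object key as the path.
println(s"userInfo=${plain.getUserInfo} host=${plain.getHost} path=${plain.getPath}")

// Slash in the secret: the authority part ends at the first "/", so the
// remainder of the secret and the bucket name end up inside the path.
println(s"userInfo=${slashed.getUserInfo} host=${slashed.getHost} path=${slashed.getPath}")
```

Credentials set through sc.hadoopConfiguration are passed as plain property values and never go through URI parsing, which is consistent with that approach tolerating a "/" in the secret.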

Answer