Trying to read a file located in S3 using spark-shell:
scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
myRdd: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD at textFile at <console>:12

scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    ... etc ...
The IOException: No FileSystem for scheme: s3n error occurred with:
What is the cause of this error? A missing dependency, missing configuration, or misuse of
Or maybe this is due to a bug that affects the Spark build specific to Hadoop 2.6.0, as this post seems to suggest. I am going to try the Spark build for Hadoop 2.4.0 to see whether that solves the issue.
Confirmed: this is related to the Spark build against Hadoop 2.6.0. I installed Spark 1.4.0 “Pre-built for Hadoop 2.4 and later” (instead of the Hadoop 2.6 build), and the code now works.
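A "No FileSystem for scheme" error generally means Hadoop could not map the URI scheme to an implementation class. One way to narrow this down is to check whether that class is visible on the classpath of the Spark build in use. The sketch below assumes the s3n scheme is implemented by org.apache.hadoop.fs.s3native.NativeS3FileSystem, the class name used in Hadoop 2.x. The object S3nClasspathCheck and its methods are hypothetical helpers, not part of Spark or Hadoop. A missing class would be consistent with the S3 filesystems living in the separate hadoop-aws module from Hadoop 2.6 onward.

```scala
object S3nClasspathCheck {
  // Assumption: this is the class that backs the s3n scheme in Hadoop 2.x.
  val S3nFileSystemClass = "org.apache.hadoop.fs.s3native.NativeS3FileSystem"

  // Returns true if the named class can be located on the current classpath.
  def classAvailable(name: String): Boolean =
    try {
      // initialize = false: only locate the class, do not run its static initializers.
      Class.forName(name, false, Thread.currentThread().getContextClassLoader)
      true
    } catch {
      case _: ClassNotFoundException | _: LinkageError => false
    }

  // Builds a one-line summary that can be printed from spark-shell.
  def report(): String =
    if (classAvailable(S3nFileSystemClass))
      s"$S3nFileSystemClass is on the classpath"
    else
      s"$S3nFileSystemClass is missing; the s3n scheme cannot be resolved"
}

// In spark-shell:
//   println(S3nClasspathCheck.report())
```

If the class is reported missing, the problem is a dependency issue rather than a misuse of sc.textFile.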
sc.textFile("s3n://bucketname/Filename") now raises another error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
The code below embeds the credentials in the S3N URL to show that Spark can read an S3 file. It was run on a dev machine (no Hadoop libraries installed).
scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey@zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at textFile at <console>:21

scala> lyrics.count
res1: Long = 9
Even better: the code above, with AWS credentials inline in the S3N URI, breaks if the AWS secret key contains a forward slash (“/”). Setting the credentials on the SparkContext's Hadoop configuration fixes this, and the code works whether the S3 file is public or private.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count
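To avoid typing the keys into the shell, the same two properties could be filled in from environment variables. This is only a sketch: the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY follow the usual AWS convention but are an assumption here, and S3nCredentials and fromEnv are hypothetical helper names, not a Spark or Hadoop API.

```scala
object S3nCredentials {
  // Property names taken from the error message above.
  val AccessKeyProp = "fs.s3n.awsAccessKeyId"
  val SecretKeyProp = "fs.s3n.awsSecretAccessKey"

  // Assumption: the credentials are exported under these environment variable names.
  val AccessKeyEnv = "AWS_ACCESS_KEY_ID"
  val SecretKeyEnv = "AWS_SECRET_ACCESS_KEY"

  // Returns the Hadoop properties to set, or a message naming what is missing.
  // The secret is passed through unchanged, so it may contain "/".
  def fromEnv(env: Map[String, String] = sys.env): Either[String, Map[String, String]] =
    (env.get(AccessKeyEnv), env.get(SecretKeyEnv)) match {
      case (Some(id), Some(secret)) if id.nonEmpty && secret.nonEmpty =>
        Right(Map(AccessKeyProp -> id, SecretKeyProp -> secret))
      case _ =>
        Left(s"Set both $AccessKeyEnv and $SecretKeyEnv before starting spark-shell")
    }
}

// In spark-shell, where sc is already defined:
//   S3nCredentials.fromEnv() match {
//     case Right(props) => props.foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }
//     case Left(msg)    => sys.error(msg)
//   }
//   val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
```

Taking the environment as a parameter keeps the helper easy to check without real credentials.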