Spark – Transforming Complex Data Types

Question

Goal

The goal I want to achieve is to read a CSV file (this works) and encode it to a Dataset<Person>, where the Person object has a nested Address[] field (this throws an exception).

The Person CSV file

In a file called person.csv, there is the following data describing some persons:

The first line is the schema, and address is a nested structure.

Data classes

Accepted Answer

After trying a lot of different approaches and spending some hours researching on the Internet, I came to the following conclusions:

UserDefinedFunction is good, but it belongs to the old world. It can be replaced by a simple map() function wherever we need to transform an object from one type to another.

The simplest way is the following:

    SparkSession spark = SparkSession.builder()
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    Encoder<FileFormat> fileFormatEncoder = Encoders.bean(FileFormat.class);
    Dataset<FileFormat> rawFile = spark.read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true") // first line has headers
        .load("src/test/resources/encoding-tests/persons.csv")
        .as(fileFormatEncoder);

    LOG.info("=============== Print schema =============");
    rawFile.printSchema();
    LOG.info("================ Print data ==============");
    rawFile.show();
    LOG.info("================ Print name ==============");
    rawFile.select("name").show();

    // when
    final SerializableFunction<String, List<Address>> asAddress = (String text) -> Arrays
        .stream(text.split(Pattern.quote("||"), -1))
        .map(object -> object.split("~"))
        .map(Address::fromArgs)
        .map(a -> a.orElse(null))
        .collect(Collectors.toList());

    final MapFunction<FileFormat, Person> personMapper =
        (MapFunction<FileFormat, Person>) row -> new Person(row.name, row.age, asAddress.apply(row.address));
    final Encoder<Person> personEncoder = Encoders.bean(Person.class);
    Dataset<Person> persons = rawFile.map(personMapper, personEncoder);
    persons.show();

    // then
    assertThat(persons.isEmpty(), is(false));
    assertThat(persons.count(), is(2L));

    final List<String> names = persons.select("name").as(Encoders.STRING()).collectAsList();
    assertThat(names, hasItems("name1", "name2"));

    final List<Integer> ages = persons.select("age").as(Encoders.INT()).collectAsList();
    assertThat(ages, hasItems(10, 20));

    final Encoder<Address> addressEncoder = Encoders.bean(Address.class);
    final MapFunction<Person, Address> firstAddressMapper =
        (MapFunction<Person, Address>) person -> person.addresses.get(0);
    final List<Address> addresses = persons.map(firstAddressMapper, addressEncoder).collectAsList();
    assertThat(addresses, hasItems(new Address("streetA", "cityA"), new Address("streetC", "cityC")));
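The address-parsing step can be tried on its own, without Spark. The following is only a minimal sketch under stated assumptions: the answer does not show SerializableFunction, Address, or the exact cell format, so the interface below (a Function that is also Serializable), the two-field Address with fromArgs, the class name AddressParsingSketch, and the sample input "streetA~cityA||streetB~cityB" are all hypothetical reconstructions inferred from the delimiters and assertions in the code above.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of the original answer.
class AddressParsingSketch {

    // Same pipeline as asAddress in the answer: split the cell on "||",
    // split each entry on "~", then turn each pair into an Address.
    static final SerializableFunction<String, List<Address>> AS_ADDRESS = text -> Arrays
            .stream(text.split(Pattern.quote("||"), -1))
            .map(entry -> entry.split("~"))
            .map(Address::fromArgs)
            .map(a -> a.orElse(null)) // malformed entries become null, as in the answer
            .collect(Collectors.toList());

    public static void main(String[] args) {
        System.out.println(AS_ADDRESS.apply("streetA~cityA||streetB~cityB"));
    }
}

// Assumption: a Function that is also Serializable, so that a lambda of this
// type can be captured by a Spark MapFunction and shipped to executors.
interface SerializableFunction<T, R> extends Function<T, R>, Serializable {
}

// Assumption: a simple street/city value class; the real class is not shown.
class Address implements Serializable {
    private final String street;
    private final String city;

    Address(String street, String city) {
        this.street = street;
        this.city = city;
    }

    // Returns an Address only when exactly two fields are present.
    static Optional<Address> fromArgs(String[] parts) {
        return parts.length == 2
                ? Optional.of(new Address(parts[0], parts[1]))
                : Optional.empty();
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Address)) {
            return false;
        }
        Address that = (Address) other;
        return street.equals(that.street) && city.equals(that.city);
    }

    @Override
    public int hashCode() {
        return Objects.hash(street, city);
    }

    @Override
    public String toString() {
        return "Address(" + street + ", " + city + ")";
    }
}
```

Pattern.quote matters here because | is the regex alternation operator; an unquoted "||" would match the empty string and split the text between every character. Note also that an entry without exactly two fields ends up as a null element in the list, which mirrors the orElse(null) call in the answer and may or may not be what you want.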


Answer