
Fetching specific fields from an S3 document

I am using the AWS Java SDK in my application to talk to one of my S3 buckets, which holds objects in JSON format.

A document may look like this:

{
    "a" : dataA,
    "b" : dataB,
    "c" : dataC,
    "d" : dataD,
    "e" : dataE
} 

Now, for a certain document, let's say document1, I need to fetch only the values of fields a and b instead of fetching the entire document.

This sounds like something that wouldn't be possible, because S3 buckets can hold any type of object, not just JSON documents.

Is this achievable, though?


Answer

That’s actually doable with S3 Select. You can run selects like the one you’ve described, but only on objects in particular formats: JSON, CSV, or Parquet.

Suppose there is a data.json file in the so67315601 bucket in eu-central-1:

{
  "a": "dataA",
  "b": "dataB",
  "c": "dataC",
  "d": "dataD",
  "e": "dataE"
}
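
S3 Select is driven by an SQL expression that names the fields to return; for the document above, fields a and b are selected with SELECT s.a, s.b FROM s3object s. As a minimal sketch, such an expression could be assembled from a list of field names. The class SelectExpression and its method forFields are hypothetical helper names, not part of the AWS SDK, and the sketch assumes top-level fields with plain identifier names that are not SQL reserved words.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SelectExpression {
    // Assumption: only plain identifiers are accepted, so no quoting or escaping is attempted
    private static final Pattern SIMPLE_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Builds "SELECT s.f1, s.f2 FROM s3object s" for the given top-level fields. */
    public static String forFields(List<String> fields) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("at least one field is required");
        }
        for (String field : fields) {
            if (!SIMPLE_NAME.matcher(field).matches()) {
                throw new IllegalArgumentException("unsupported field name: " + field);
            }
        }
        // Prefix every field with the table alias and join them into one expression
        return fields.stream()
            .map(field -> "s." + field)
            .collect(Collectors.joining(", ", "SELECT ", " FROM s3object s"));
    }

    public static void main(String[] args) {
        // prints: SELECT s.a, s.b FROM s3object s
        System.out.println(forFields(List.of("a", "b")));
    }
}
```

Rejecting anything that is not a plain identifier keeps the sketch from building a malformed expression out of untrusted input.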

First, try the query in the S3 Console to see how selecting fields works: choose the object, then use “Object Actions” → “Query with S3 Select”.


AWS Java SDK 1.x

Here is the code to do the select with AWS Java SDK 1.x:

@ExtendWith(S3.class)
class SelectTest {
    @AWSClient(endpoint = Endpoint.class)
    private AmazonS3 client;

    @Test
    void test() throws IOException {
        // LINES: Each line in the input data contains a single JSON object
        // DOCUMENT: A single JSON object can span multiple lines in the input
        final JSONInput input = new JSONInput();
        input.setType(JSONType.DOCUMENT);

        // Configure input format and compression
        final InputSerialization inputSerialization = new InputSerialization();
        inputSerialization.setJson(input);
        inputSerialization.setCompressionType(CompressionType.NONE);

        // Configure output format
        final OutputSerialization outputSerialization = new OutputSerialization();
        outputSerialization.setJson(new JSONOutput());

        // Build the request
        final SelectObjectContentRequest request = new SelectObjectContentRequest();
        request.setBucketName("so67315601");
        request.setKey("data.json");
        request.setExpression("SELECT s.a, s.b FROM s3object s LIMIT 5");
        request.setExpressionType(ExpressionType.SQL);
        request.setInputSerialization(inputSerialization);
        request.setOutputSerialization(outputSerialization);

        // Run the query
        // Run the query; the result holds an open connection, so close it when done
        try (SelectObjectContentResult result = client.selectObjectContent(request);
             InputStream stream = result.getPayload().getRecordsInputStream()) {
            // Copy the returned records to standard output
            IOUtils.copy(stream, System.out);
        }
    }
}

The output is:

{"a":"dataA","b":"dataB"}
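
When the query matches more than one record, the output holds one JSON object per record. The sketch below splits such a payload into individual records so each can be handed to a JSON parser. It assumes the records are separated by newlines, which is the default record delimiter for JSON output as far as I know, and that no record contains a raw newline. RecordSplitter is a hypothetical helper name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class RecordSplitter {
    /** Splits a newline-delimited payload into its non-empty records. */
    public static List<String> split(String payload) {
        return Arrays.stream(payload.split("\n"))
            .map(String::trim)                    // drop stray whitespace such as "\r"
            .filter(line -> !line.isEmpty())      // ignore the empty tail after the last delimiter
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String payload = "{\"a\":\"dataA\",\"b\":\"dataB\"}\n{\"a\":\"x\",\"b\":\"y\"}\n";
        // prints each of the two records on its own line
        split(payload).forEach(System.out::println);
    }
}
```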

AWS Java SDK 2.x

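The code for the AWS Java SDK 2.x is more involved, because the operation is exposed on the asynchronous client and the results arrive as an event stream. Refer to this ticket for more information.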

@ExtendWith(S3.class)
class SelectTest {
    @AWSClient(endpoint = Endpoint.class)
    private S3AsyncClient client;

    @Test
    void test() throws Exception {
        final InputSerialization inputSerialization = InputSerialization
            .builder()
            .json(JSONInput.builder().type(JSONType.DOCUMENT).build())
            .compressionType(CompressionType.NONE)
            .build();

        final OutputSerialization outputSerialization = OutputSerialization.builder()
            .json(JSONOutput.builder().build())
            .build();

        final SelectObjectContentRequest select = SelectObjectContentRequest.builder()
            .bucket("so67315601")
            .key("data.json")
            .expression("SELECT s.a, s.b FROM s3object s LIMIT 5")
            .expressionType(ExpressionType.SQL)
            .inputSerialization(inputSerialization)
            .outputSerialization(outputSerialization)
            .build();
        final TestHandler handler = new TestHandler();

        client.selectObjectContent(select, handler).get();

        // Results may be split across several RECORDS events, so join all payloads
        final String records = handler.receivedEvents.stream()
            .filter(e -> e.sdkEventType() == SelectObjectContentEventStream.EventType.RECORDS)
            .map(e -> ((RecordsEvent) e).payload().asUtf8String())
            .collect(Collectors.joining());

        System.out.println(records);
    }

    private static class TestHandler implements SelectObjectContentResponseHandler {
        private SelectObjectContentResponse response;
        private List<SelectObjectContentEventStream> receivedEvents = new ArrayList<>();
        private Throwable exception;

        @Override
        public void responseReceived(SelectObjectContentResponse response) {
            this.response = response;
        }

        @Override
        public void onEventStream(SdkPublisher<SelectObjectContentEventStream> publisher) {
            publisher.subscribe(receivedEvents::add);
        }

        @Override
        public void exceptionOccurred(Throwable throwable) {
            exception = throwable;
        }

        @Override
        public void complete() {
        }
    }
}
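
A larger result can arrive as several RECORDS events, and a chunk boundary may fall in the middle of a record, possibly even in the middle of a multi-byte UTF-8 character. A cautious way to handle this is to collect the raw bytes of every chunk and decode them once at the end. The sketch below shows only that step in plain Java; ChunkAssembler is a hypothetical helper name, and it assumes the chunks were taken from each RecordsEvent in arrival order, for example via payload().asByteArray().

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class ChunkAssembler {
    /** Concatenates the chunks in order and decodes the whole buffer as UTF-8. */
    public static String decode(List<byte[]> chunks) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            buffer.write(chunk, 0, chunk.length);
        }
        // Decoding once avoids corrupting a character that spans two chunks
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] all = "{\"a\":\"caf\u00e9\"}".getBytes(StandardCharsets.UTF_8);
        // Split inside the two-byte encoding of the accented character
        byte[] first = Arrays.copyOfRange(all, 0, 10);
        byte[] second = Arrays.copyOfRange(all, 10, all.length);
        System.out.println(decode(List.of(first, second)));
    }
}
```

Decoding each chunk separately would turn the split character into replacement characters, which is why the bytes are joined first.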

As you can see, it’s possible to run S3 selects programmatically!

You might be wondering what @AWSClient and @ExtendWith(S3.class) are.

They come from aws-junit5, a small library that injects AWS clients into your tests and greatly simplifies them. I am the author. The usage is really simple, so try it in your next project!

User contributions licensed under: CC BY-SA