I am using AWS Java SDK in my application to talk to one of my S3 buckets which holds objects in JSON format.
A document may look like this:
{ "a" : dataA, "b" : dataB, "c" : dataC, "d" : dataD, "e" : dataE }
Now, for a certain document, let's say document1, I need to fetch the values corresponding to fields a and b instead of fetching the entire document.
This sounds like something that wouldn’t be possible because S3 buckets can have any type of documents in them and not just JSONs.
Is this something that is achievable though?
Answer
That’s actually doable with S3 Select. You can run selects like the one you’ve described, but only on objects in particular formats: JSON, CSV, and Parquet.
Imagine having a data.json file in the so67315601 bucket in eu-central-1:
{ "a": "dataA", "b": "dataB", "c": "dataC", "d": "dataD", "e": "dataE" }
First, learn how to select the fields via the S3 Console. Use “Object Actions” → “Query with S3 Select”.
AWS Java SDK 1.x
Here is the code to do the select with AWS Java SDK 1.x:
@ExtendWith(S3.class)
class SelectTest {
    @AWSClient(endpoint = Endpoint.class)
    private AmazonS3 client;

    @Test
    void test() throws IOException {
        // LINES: Each line in the input data contains a single JSON object
        // DOCUMENT: A single JSON object can span multiple lines in the input
        final JSONInput input = new JSONInput();
        input.setType(JSONType.DOCUMENT);

        // Configure input format and compression
        final InputSerialization inputSerialization = new InputSerialization();
        inputSerialization.setJson(input);
        inputSerialization.setCompressionType(CompressionType.NONE);

        // Configure output format
        final OutputSerialization outputSerialization = new OutputSerialization();
        outputSerialization.setJson(new JSONOutput());

        // Build the request
        final SelectObjectContentRequest request = new SelectObjectContentRequest();
        request.setBucketName("so67315601");
        request.setKey("data.json");
        request.setExpression("SELECT s.a, s.b FROM s3object s LIMIT 5");
        request.setExpressionType(ExpressionType.SQL);
        request.setInputSerialization(inputSerialization);
        request.setOutputSerialization(outputSerialization);

        // Run the query
        final SelectObjectContentResult result = client.selectObjectContent(request);

        // Parse the results
        final InputStream stream = result.getPayload().getRecordsInputStream();
        IOUtils.copy(stream, System.out);
    }
}
The output is:
{"a":"dataA","b":"dataB"}
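If the fields you need vary from call to call, the expression string can be assembled from a list of names instead of being hard-coded. Here is a minimal sketch of that idea. SelectExpressions and projection are hypothetical helper names, not part of the AWS SDK, and the sketch assumes plain top-level keys that are not SQL reserved words (anything else is rejected rather than quoted):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the AWS SDK.
class SelectExpressions {
    // Only plain identifiers are accepted, so nothing unexpected ends up in the SQL text.
    private static final Pattern SIMPLE_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Builds an expression such as "SELECT s.a, s.b FROM s3object s".
    static String projection(List<String> fields) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("At least one field is required");
        }
        for (final String field : fields) {
            if (!SIMPLE_NAME.matcher(field).matches()) {
                throw new IllegalArgumentException("Unsupported field name: " + field);
            }
        }
        final String columns = fields.stream()
                .map(field -> "s." + field)
                .collect(Collectors.joining(", "));
        return "SELECT " + columns + " FROM s3object s";
    }

    public static void main(String[] args) {
        System.out.println(projection(Arrays.asList("a", "b")));
        // prints: SELECT s.a, s.b FROM s3object s
    }
}
```

The resulting string could then be passed to request.setExpression(...) in the test above.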
AWS Java SDK 2.x
The code for the AWS Java SDK 2.x is more involved. Refer to this ticket for more information.
@ExtendWith(S3.class)
class SelectTest {
    @AWSClient(endpoint = Endpoint.class)
    private S3AsyncClient client;

    @Test
    void test() throws Exception {
        final InputSerialization inputSerialization = InputSerialization
            .builder()
            .json(JSONInput.builder().type(JSONType.DOCUMENT).build())
            .compressionType(CompressionType.NONE)
            .build();

        final OutputSerialization outputSerialization = OutputSerialization.builder()
            .json(JSONOutput.builder().build())
            .build();

        final SelectObjectContentRequest select = SelectObjectContentRequest.builder()
            .bucket("so67315601")
            .key("data.json")
            .expression("SELECT s.a, s.b FROM s3object s LIMIT 5")
            .expressionType(ExpressionType.SQL)
            .inputSerialization(inputSerialization)
            .outputSerialization(outputSerialization)
            .build();

        final TestHandler handler = new TestHandler();

        client.selectObjectContent(select, handler).get();

        RecordsEvent response = (RecordsEvent) handler.receivedEvents.stream()
            .filter(e -> e.sdkEventType() == SelectObjectContentEventStream.EventType.RECORDS)
            .findFirst()
            .orElse(null);

        System.out.println(response.payload().asUtf8String());
    }

    private static class TestHandler implements SelectObjectContentResponseHandler {
        private SelectObjectContentResponse response;
        private List<SelectObjectContentEventStream> receivedEvents = new ArrayList<>();
        private Throwable exception;

        @Override
        public void responseReceived(SelectObjectContentResponse response) {
            this.response = response;
        }

        @Override
        public void onEventStream(SdkPublisher<SelectObjectContentEventStream> publisher) {
            publisher.subscribe(receivedEvents::add);
        }

        @Override
        public void exceptionOccurred(Throwable throwable) {
            exception = throwable;
        }

        @Override
        public void complete() {
        }
    }
}
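The test above prints only the first RECORDS event, which is enough for a tiny document. For larger results S3 Select may deliver the records in several events, and a record can be cut at an arbitrary byte. One way to handle that is to concatenate the raw bytes of every RECORDS payload and decode once at the end. Here is a rough sketch using only the JDK. RecordChunks, join and records are hypothetical names, and the sketch assumes the payloads have already been pulled out of the events as byte arrays (for example with payload().asByteArray()) and that the output uses the newline record delimiter:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the AWS SDK.
class RecordChunks {
    // Concatenate the raw bytes first and decode once at the end, so a
    // multi-byte UTF-8 character split across two chunks stays intact.
    static String join(List<byte[]> chunks) {
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (final byte[] chunk : chunks) {
            buffer.write(chunk, 0, chunk.length);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Split the joined payload into one JSON text per record,
    // assuming records are separated by a newline.
    static List<String> records(String payload) {
        final List<String> result = new ArrayList<>();
        for (final String line : payload.split("\n")) {
            if (!line.isEmpty()) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        final byte[] all = "{\"a\":\"dataA\"}\n{\"a\":\"\u00e9\"}\n".getBytes(StandardCharsets.UTF_8);

        // Simulate two events by cutting in the middle of the two-byte character.
        int cut = 0;
        while (all[cut] != (byte) 0xC3) {
            cut++;
        }
        cut++;
        final List<byte[]> chunks = Arrays.asList(
                Arrays.copyOfRange(all, 0, cut),
                Arrays.copyOfRange(all, cut, all.length));

        for (final String record : records(join(chunks))) {
            System.out.println(record);
        }
    }
}
```

Decoding each chunk separately with asUtf8String() would corrupt the character that straddles the cut, which is why the bytes are joined before decoding.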
As you see, it’s possible to make S3 selects programmatically!
You might be wondering what @AWSClient and @ExtendWith(S3.class) are. They come from aws-junit5, a small library for injecting AWS clients into your tests. It can greatly simplify your tests. I am the author. The usage is really simple, so try it in your next project!