Skip to content

Streaming large JSON from input stream efficiently in Java

In order to save memory and avoid an OOM error, I want to stream a large JSON from an input stream and extract the desired things from it. More exactly, I want to extract and save some strings from that JSON:

  1. files.content.fileContent.subList.text = “some text in file”
  2. files.content.fileContent.subList.text = “some text in file2”

and save them into a String variable:

String result = "some text in file rnsome text in file2"

I tried to parse the JSON using Jackson:

        JsonFactory jsonFactory = new JsonFactory();

        StringBuilder result = new StringBuilder();
        try (JsonParser jsonParser = jsonFactory.createParser(jsonAsInputStream)) {
            String fieldName;
            while (jsonParser.nextToken() != JsonToken.END_OBJECT) {
                jsonParser.nextToken();
                fieldName = jsonParser.getCurrentName();
                if ("files".equals(fieldName)) {

                    while (true) {
                        jsonParser.nextToken();
                        fieldName = jsonParser.getCurrentName();
                        if ("content".equals(fieldName)) {
                            jsonParser.nextToken();
                            fieldName = jsonParser.getCurrentName();
                            while (true) {
                                if ("text".equals(fieldName)) {
                                    result.append(jsonParser.getText());
                                }
                            }
                        }
                    }
                }
            }
            LOGGER.info("result: {}", result);
        } catch (JsonParseException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

The above is not working at all, that solution gets complicated. Is there any simple way to parse the JSON inputStream and extract some text from it?

Below is the JSON attached:

{
"id": "1",
"name": "TestFile.xlsx",
"files": [
    {
        "id": "1",
        "fileName": "TestFile.xlsx",
        "types": {
            "fileId": "1",
            "context": [
                {
                    "id": 1,
                    "contextId": "xyz",
                    "metadata": {
                        "abc": "1"
                    }
                },
                {
                    "id": 2,
                    "contextId": "abc"
                }
            ],
            "fileSettings": [
                {
                    "id": 1,
                    "settingId": 1
                },
                {
                    "id": 2,
                    "settingId": 2
                }
                
            ],
            "fileAttachments": [
                {
                    "id": 1,
                    "canDelete": true,
                    "canAttach": []
                }
            ],
            "replacements": [
                {
                    "id": 1,
                    "replacementText": "xcv"
                }
            ]
        },
        "content": [
            {
                "id": "1",
                "contextList": [
                    1,
                    2,
                    3
                ],
                "fileContent": {
                    "contentType": "text",
                    "subList": [
                        {
                            "id": "1",
                            "subList": [
                                {
                                    "id": "1",
                                    "text": "some text in file",
                                    "type": "text"
                                }
                            ]
                        }
                    ]
                },
                "externalContent": {
                    "id": "1",
                    "children": [
                        {
                            "id": "1",
                            "contentType": "text corrupted",
                            "children": []
                        }
                    ]
                }
            },
            {
                "id": "2",
                "contextList": [
                    1,
                    2
                ],
                "fileContent": {
                    "contentType": "text",
                    "subList": [
                        {
                            "id": "2",
                            "subList": [
                                {
                                    "id": "1",
                                    "text": "some text in file2",
                                    "type": "text"
                                }
                            ]
                        }
                    ]
                },
                "externalContent": {
                    "id": "2",
                    "children": [
                        {
                            "id": "2",
                            "contentType": "text corrupted2",
                            "children": []
                        }
                    ]
                }
            }
        ]
    }
]

}

Answer

In short,

  • your code does not work because it implements a wrong algorithm;
  • JsonPath, as it has been suggested, seems to be a good DSL implementation, but it uses a DOM approach collecting entire JSON tree into memory, therefore you’ll run into OOM again.

You have two solutions:

  • implement a proper algorithm within your current approach (and I agree you were on a right way);
  • try implementing something similar to what JsonPath implements breaking down the problem to smaller ones supporting really streaming approach.

I wouldn’t document much of my code since it’s pretty easy to understand and adapt to other libraries, but you can develop a more advanced thing of the following code using Java 17 (w/ preview features enabled) and javax.json (+ some Lombok for Java boilerplate):

@RequiredArgsConstructor(access = AccessLevel.PRIVATE)
public final class PathJsonParser
        implements JsonParser, Iterator<JsonParser.Event> {

    private static final int DEFAULT_PATH_LENGTH = 32;

    private final JsonParser jsonParser;
    private final AbstractPathElement[] path;
    private int last;

    public static PathJsonParser create(final JsonParser jsonParser) {
        final int maxPathLength = DEFAULT_PATH_LENGTH;
        final PathJsonParser pathJsonParser = new PathJsonParser(jsonParser, new AbstractPathElement[maxPathLength]);
        pathJsonParser.path[0] = AbstractPathElement.Root.instance;
        for ( int i = 1; i < maxPathLength; i++ ) {
            pathJsonParser.path[i] = new AbstractPathElement.Container();
        }
        return pathJsonParser;
    }

    @Override
    public Event next() {
        final Event event = jsonParser.next();
        switch ( event ) {
        case START_ARRAY -> {
            path[last].tryIncreaseIndex();
            path[++last].reset(JsonValue.ValueType.ARRAY);
        }
        case START_OBJECT -> {
            path[last].tryIncreaseIndex();
            path[++last].reset(JsonValue.ValueType.OBJECT);
        }
        case KEY_NAME -> path[last].setKeyName(jsonParser.getString());
        case VALUE_STRING -> path[last].tryIncreaseIndex();
        case VALUE_NUMBER -> path[last].tryIncreaseIndex();
        case VALUE_TRUE -> path[last].tryIncreaseIndex();
        case VALUE_FALSE -> path[last].tryIncreaseIndex();
        case VALUE_NULL -> path[last].tryIncreaseIndex();
        case END_OBJECT -> --last;
        case END_ARRAY -> --last;
        default -> throw new AssertionError(event);
        }
        return event;
    }

    public boolean matchesRoot(final int at) {
        @Nullable
        final AbstractPathElement e = tryElementAt(at);
        return e != null && e.matchesRoot();
    }

    public boolean matchesIndex(final int at, final IntPredicate predicate) {
        @Nullable
        final AbstractPathElement e = tryElementAt(at);
        return e != null && e.matchesIndex(predicate);
    }

    public boolean matchesName(final int at, final Predicate<? super String> predicate) {
        @Nullable
        final AbstractPathElement e = tryElementAt(at);
        return e != null && e.matchesName(predicate);
    }

    // @formatter:off
    @Override public boolean hasNext() { return jsonParser.hasNext(); }
    @Override public String getString() { return jsonParser.getString(); }
    @Override public boolean isIntegralNumber() { return jsonParser.isIntegralNumber(); }
    @Override public int getInt() { return jsonParser.getInt(); }
    @Override public long getLong() { return jsonParser.getLong(); }
    @Override public BigDecimal getBigDecimal() { return jsonParser.getBigDecimal(); }
    @Override public JsonLocation getLocation() { return jsonParser.getLocation(); }
    @Override @SuppressWarnings("MethodDoesntCallSuperMethod") public JsonObject getObject() { return jsonParser.getObject(); }
    @Override @SuppressWarnings("MethodDoesntCallSuperMethod") public JsonValue getValue() { return jsonParser.getValue(); }
    @Override @SuppressWarnings("MethodDoesntCallSuperMethod") public JsonArray getArray() { return jsonParser.getArray(); }
    @Override @SuppressWarnings("MethodDoesntCallSuperMethod") public Stream<JsonValue> getArrayStream() { return jsonParser.getArrayStream(); }
    @Override @SuppressWarnings("MethodDoesntCallSuperMethod") public Stream<Map.Entry<String, JsonValue>> getObjectStream() { return jsonParser.getObjectStream(); }
    @Override @SuppressWarnings("MethodDoesntCallSuperMethod") public Stream<JsonValue> getValueStream() { return jsonParser.getValueStream(); }
    @Override @SuppressWarnings("MethodDoesntCallSuperMethod") public void skipArray() { jsonParser.skipArray(); }
    @Override @SuppressWarnings("MethodDoesntCallSuperMethod") public void skipObject() { jsonParser.skipObject(); }
    @Override public void close() { jsonParser.close(); }
    // @formatter:on

    @Nullable
    private AbstractPathElement tryElementAt(final int at) {
        final int pathAt;
        if ( at >= 0 ) {
            pathAt = at;
        } else {
            pathAt = last + at + 1;
        }
        if ( pathAt < 0 || pathAt > last ) {
            return null;
        }
        return path[pathAt];
    }

    private abstract static sealed class AbstractPathElement
            permits AbstractPathElement.Root, AbstractPathElement.Container {

        abstract void reset(JsonValue.ValueType valueType);

        abstract void setKeyName(String keyName);

        abstract void tryIncreaseIndex();

        abstract boolean matchesRoot();

        abstract boolean matchesIndex(IntPredicate predicate);

        abstract boolean matchesName(Predicate<? super String> predicate);

        @RequiredArgsConstructor(access = AccessLevel.PRIVATE)
        private static final class Root
                extends AbstractPathElement {

            private static final AbstractPathElement instance = new Root();

            @Override
            void reset(final JsonValue.ValueType valueType) {
                throw new UnsupportedOperationException();
            }

            @Override
            void setKeyName(final String keyName) {
                throw new UnsupportedOperationException();
            }

            @Override
            void tryIncreaseIndex() {
                // do nothing
            }

            @Override
            boolean matchesRoot() {
                return true;
            }

            @Override
            boolean matchesIndex(final IntPredicate predicate) {
                return false;
            }

            @Override
            boolean matchesName(final Predicate<? super String> predicate) {
                return false;
            }

        }

        @RequiredArgsConstructor(access = AccessLevel.PACKAGE)
        private static final class Container
                extends AbstractPathElement {

            private static final String NO_KEY_NAME = null;
            private static final int NO_INDEX = -1;

            private JsonValue.ValueType valueType;
            private String keyName = NO_KEY_NAME;
            private int index = NO_INDEX;

            @Override
            void reset(final JsonValue.ValueType valueType) {
                this.valueType = valueType;
                keyName = NO_KEY_NAME;
                index = NO_INDEX;
            }

            @Override
            void setKeyName(final String keyName) {
                this.keyName = keyName;
            }

            @Override
            void tryIncreaseIndex() {
                if ( valueType == JsonValue.ValueType.ARRAY ) {
                    index++;
                }
            }

            @Override
            boolean matchesRoot() {
                return false;
            }

            @Override
            boolean matchesIndex(final IntPredicate predicate) {
                return switch ( valueType ) {
                    case ARRAY -> index != NO_INDEX && predicate.test(index);
                    case OBJECT -> false;
                    case STRING, NUMBER, TRUE, FALSE, NULL -> throw new AssertionError(valueType);
                };
            }

            @Override
            boolean matchesName(final Predicate<? super String> predicate) {
                return switch ( valueType ) {
                    case ARRAY -> false;
                    case OBJECT -> !Objects.equals(keyName, NO_KEY_NAME) && predicate.test(keyName);
                    case STRING, NUMBER, TRUE, FALSE, NULL -> throw new AssertionError(valueType);
                };
            }

        }

    }

}

Example of use:

public final class PathJsonParserTest {

    // $.files.0.content.0.fileContent.subList.0.subList.0.text
    private static boolean matches(final PathJsonParser parser) {
        return parser.matchesName(-1, name -> name.equals("text"))
                && parser.matchesIndex(-2, index -> true)
                && parser.matchesName(-3, name -> name.equals("subList"))
                && parser.matchesIndex(-4, index -> true)
                && parser.matchesName(-5, name -> name.equals("subList"))
                && parser.matchesName(-6, name -> name.equals("fileContent"))
                && parser.matchesIndex(-7, index -> true)
                && parser.matchesName(-8, name -> name.equals("content"))
                && parser.matchesIndex(-9, index -> true)
                && parser.matchesName(-10, name -> name.equals("files"))
                && parser.matchesRoot(-11);
    }

    @Test
    public void test()
            throws IOException {
        try ( final PathJsonParser parser = PathJsonParser.create(JsonParsers.openFromResource(PathJsonParserTest.class, "input.json")) ) {
            for ( ; parser.hasNext(); parser.next() ) {
                if ( matches(parser) ) {
                    parser.next();
                    System.out.println(parser.getValue());
                }
            }
        }
    }

}

Of course, not that cool-looking as JsonPath is, but you can do the following:

  • implement a matcher builder API to make it look nicer;
  • implement a JSON Path-compliant parser to build matchers;
  • wrap the for/if/next() pattern into a generic algorithm (similar to what BufferedReader.readLine() implements or wrap it for Stream API);
  • implement some kind of simple JSON-to-objects deserializer.

Or, if possible, find a good code generator that can generate a streamed parser having as small runtime cost as possible (its outcome would be very similar to yours, but working). (Ping me please if you are aware of any.)