Gremlin Driver blocks while initializing ConnectionPool with multiple endpoints

Tags: , , ,



We are running a neptune DB in AWS. We have one writer and 3 reader instances. A few weeks ago, we found out, that the load balancing does not work as expected. We figured out, that our software instance is connecting to just one reader and keeps this connection until EOL. So the other reader instances were never be taken. Considering following link https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-endpoints.html. There is described, that for neptune load balancing, you have to do it client side and one precondition is, that you have to disable DNS cache. The client side implementation is described here https://docs.amazonaws.cn/en_us/neptune/latest/userguide/best-practices-gremlin-java-multiple.html respectively https://docs.aws.amazon.com/neptune/latest/userguide/best-practices-gremlin-java-separate.html because we handle the writer and reader cluster separately. Our software is written in java. So we implemented the described problem as follows:

disbale DNS cache in jvm:

java.security.Security.setProperty("networkaddress.cache.ttl", "0");

pom.xml looks like:

<properties>
    <gremlin.version>3.4.10</gremlin.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.tinkerpop</groupId>
        <artifactId>gremlin-driver</artifactId>
        <version>${gremlin.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tinkerpop</groupId>
        <artifactId>tinkergraph-gremlin</artifactId>
        <version>${gremlin.version}</version>
    </dependency>
    <dependency>
        <!-- aws neptune db -->
        <groupId>org.apache.tinkerpop</groupId>
        <artifactId>gremlin-core</artifactId>
        <version>${gremlin.version}</version>
    </dependency>
</dependencies>

Connecting to database via gremlin driver:

    Cluster.Builder writer = Cluster.build().port(8182)
            .maxInProcessPerConnection(32).maxSimultaneousUsagePerConnection(32).maxContentLength(4 * 1024 * 1024)
            .serializer(Serializers.GRAPHBINARY_V1D0)
            .addContactPoint("some aws instance enpoint -- 1 --");

    Cluster.Builder reader = Cluster.build().port(8182)
            .maxInProcessPerConnection(32).maxSimultaneousUsagePerConnection(32).maxContentLength(4 * 1024 * 1024)
            .serializer(Serializers.GRAPHBINARY_V1D0)
            .addContactPoint("some aws instance enpoint -- 2 --")
            .addContactPoint("some aws instance enpoint -- 3 --");

    final Cluster writerCluster = writer.create();
    final Cluster readerCluster = reader.create();

    DriverRemoteConnection writerConn = DriverRemoteConnection.using(writerCluster);
    DriverRemoteConnection readerConn = DriverRemoteConnection.using(readerCluster);

    gWriter = AnonymousTraversalSource.traversal().withRemote(writerConn);
    gReader = AnonymousTraversalSource.traversal().withRemote(readerConn);

    for(int i = 0; i < 10; i++){
        NeptuneAdapter.getInstance().setGraph(gWriter);
        System.out.println(gWriter.addV("TestVertex" + i + 1).iterate());
        System.out.println("Vertex added, now: " + gWriter.V().count().next().toString());
        NeptuneAdapter.getInstance().setGraph(gReader);
        System.out.println(gReader.V().count().next().toString());
        System.out.println(gReader.V().count().next().toString());
        System.out.println(gReader.V().count().next().toString());
        System.out.println(gReader.V().count().next().toString());
        System.out.println(gReader.V().count().next().toString());
        System.out.println(gReader.V().count().next().toString());
        Thread.sleep(1000);
    }

Problem is, while running this code, nothing happens at the first time of getting the graph. After some debugging we found out that in the constructor of ConnectionPool is the blocking code. In it, dependent on the minPoolSize, there is a CompletableFuture created for each Connection. In it, the Connection is established via a Host. While execution through the Clusters Manager ScheduledExecutor, the ConnectionPool constructor is joining all futures. As described here I want do something as future done order in CompletableFuture List the implementations seems to be right. But there must be happen something blocking. After checking out the gremlin-driver and comment the joining-code-line out and set up a simple Thread.sleep(), the code does work as expected. And now, the load balancing thing is working too. After adding some outputs, the output of the executed code above looks like:

CONNECTION_POOL --- constructor --- poolLabel: {address=endpoint -- 1 -- /IP:PORT}
Opening connection pool
LoadBalancingStrategy adding host: Host{address=endpoint -- 1 -- /IP:PORT} host size is now 1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 1 -- /IP:PORT} for next Query
[RemoteStep(DriverServerConnection-address=endpoint -- 1 -- /IP:PORT [graph=g])]
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 1 -- /IP:PORT} for next Query
Vertex added, now: 1
CONNECTION_POOL --- constructor --- poolLabel: {address=endpoint -- 2 -- /IP:PORT}
CONNECTION_POOL --- constructor --- poolLabel: {address=endpoint -- 3 -- /IP:PORT}
Opening connection pool
LoadBalancingStrategy adding host: Host{address=endpoint -- 2 -- /IP:PORT} host size is now 1
Opening connection pool
LoadBalancingStrategy adding host: Host{address=endpoint -- 3 -- /IP:PORT} host size is now 2
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 2 -- /IP:PORT} for next Query
1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 3 -- /IP:PORT} for next Query
1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 2 -- /IP:PORT} for next Query
1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 3 -- /IP:PORT} for next Query
1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 2 -- /IP:PORT} for next Query
1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 3 -- /IP:PORT} for next Query
1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 1 -- /IP:PORT} for next Query
[RemoteStep(DriverServerConnection-address=endpoint -- 1 -- /IP:PORT [graph=g])]
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 1 -- /IP:PORT} for next Query
Vertex added, now: 2
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 2 -- /IP:PORT} for next Query
1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 3 -- /IP:PORT} for next Query
1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 2 -- /IP:PORT} for next Query
1
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 3 -- /IP:PORT} for next Query
2
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 2 -- /IP:PORT} for next Query
2
CONNECTION_POOL --- borrowConnection --- host: Host{address=endpoint -- 3 -- /IP:PORT} for next Query
2

The question is now, are we using the gremlin driver in a wrong way or is this a bug and we should add an issues to the tinkerpop-master repository? Or is there some other magic we do not understand?

Answer

We had hit this issue with Neptune load balancing for reader nodes in the past. We addressed it by making use of

https://github.com/awslabs/amazon-neptune-tools/tree/master/neptune-gremlin-client/gremlin-client

and we had to tweak our reader client a bit in order to handle load balancing at client side.

The updated way of creating a reader client looks something like this:

GremlinClient client;
GremlinCluster cluster;
ClusterEndpointsRefreshAgent clusterEndpointRefreshAgent;
String clusterId = "<your_cluster_id>";

     private void createReaderClient(boolean isIAMAuthEnabled) {
            EndpointsSelector endpointsSelector = EndpointsType.ReadReplicas;
            clusterEndpointRefreshAgent = new ClusterEndpointsRefreshAgent(clusterId, endpointsSelector);
            Collection<String> addresses = clusterEndpointRefreshAgent.getAddresses().get(endpointsSelector);
            if (isIAMAuthEnabled) {
                cluster = createNeptuneGremlinClusterBuilder(addresses);
            } else {
                cluster = createGremlinClusterBuilder(addresses);
            }
    
            client = cluster.connect();
            clusterEndpointRefreshAgent.startPollingNeptuneAPI(
                addrs -> client.refreshEndpoints(addrs.get(endpointsSelector)), 300,
                TimeUnit.SECONDS);
        }
    
     private GremlinCluster createGremlinClusterBuilder(Collection<String> addresses) {
            GremlinClusterBuilder builder = GremlinClusterBuilder.build().port(8182)
                .addContactPoints(addresses).enableSsl(true);
            //set other required properties of GremlinCluster
            return builder.create();
        }
    
     private GremlinCluster createNeptuneGremlinClusterBuilder(Collection<String> addresses) {
            NeptuneGremlinClusterBuilder builder = NeptuneGremlinClusterBuilder.build()
                .port(8182).addContactPoints(addresses)
                .enableSsl(true).enableIamAuth(true);
            // set other required properties of NeptuneGremlinClusterBuilder
            return builder.create();
        }

And this reader client can be created before creating the GraphTraversalSource something like this:

    GraphTraversalSource g;
    GraphTraversalSource getGraphTraversalSource(boolean isIAMAuthEnabled) {
        if (g == null) {
            createReaderClient(isIAMAuthEnabled);
            g = AnonymousTraversalSource.traversal().withRemote(DriverRemoteConnection.using(client));
        }
        return g;
    }


Source: stackoverflow