How to fix Ignite performance issues?

Tags: ,



We use Ignite 2.7.6 in both server and client modes: two server and six clients.

At first, each app node with client Ignite inside had 2G heap. Each Ignite server node had 24G offheap and 2G heap.

With last app update we introduced new functionality which required about 2000 caches of 20 entires (user groups). Cache entry has small size up to 10 integers inside. These caches are created via ignite.getOrCreateCache(name) method, so they have default cache configurations (off-heap, partitioned).

But in an hour after update we got OOM error on a server node:

[00:59:55,628][SEVERE][sys-#44759][GridDhtPartitionsExchangeFuture] Failed to notify listener: o.a.i.i.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$2@3287dcbd
java.lang.OutOfMemoryError: Java heap space

Heaps are increased now to 16G on Ignite server nodes and to 12G on app nodes.

As we can see, all server nodes have high CPU load about 250% now (20% before update) and long G1 Young Gen pauses up to 5 millisecond (300 microseconds before update).

Server config is:

<beans xmlns="http://www.springframework.org/schema/beans" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation=" http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">
  <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="workDirectory" value="/opt/qwerty/ignite/data"/>
    <property name="gridLogger">
      <bean class="org.apache.ignite.logger.log4j2.Log4J2Logger">
        <constructor-arg type="java.lang.String" value="config/ignite-log4j2.xml"/>
      </bean>
    </property>
    <property name="dataStorageConfiguration">
      <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="defaultDataRegionConfiguration">
          <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
            <property name="maxSize" value="#{24L * 1024 * 1024 * 1024}"/>
            <property name="pageEvictionMode" value="RANDOM_LRU"/>
          </bean>
        </property>
      </bean>
    </property>
    <property name="discoverySpi">
      <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
        <property name="localAddress" value="host-1.qwerty.srv"/>
        <property name="ipFinder">
          <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
            <property name="addresses">
              <list>
                <value>host-1.qwerty.srv:47500</value>
                <value>host-2.qwerty.srv:47500</value>
              </list>
            </property>
          </bean>
        </property>
      </bean>
    </property>
    <property name="communicationSpi">
      <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
        <property name="localAddress" value="host-1.qwerty.srv"/>
      </bean>
    </property>
  </bean>
</beans>

In memory dump of an Ignite server node we see a lot of org.apache.ignite.internal.marshaller.optimized.OptimizedObjectStreamRegistry$StreamHolder of 21Mb

Memory leak report shows:

Problem Suspect 1

One instance of "org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager" loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100" occupies 529 414 776 (10,39 %) bytes. The memory is accumulated in one instance of "java.util.LinkedList" loaded by "<system class loader>".

Keywords
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100
java.util.LinkedList
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager

Problem Suspect 2

384 instances of "org.apache.ignite.thread.IgniteThread", loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100" occupy 3 023 380 000 (59,34 %) bytes. 

Keywords
org.apache.ignite.thread.IgniteThread
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100

Problem Suspect 3

1 023 instances of "org.apache.ignite.internal.processors.cache.CacheGroupContext", loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100" occupy 905 077 824 (17,76 %) bytes. 

Keywords
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x400000100
org.apache.ignite.internal.processors.cache.CacheGroupContext

The question is what’s wrong we have done? What can we tune? Maybe the problem in our code, but how to identify where it is?

Answer

2000 caches is a lot. One cache probably takes up to 40M in data structures.

I recommend at least using the same cacheGroup for all caches of the similar purpose and composition, to share some of these data structures.



Source: stackoverflow