
_XReply() terminates app with _XIOError()

We’re developing a complex application that consists of a Linux binary integrated with Java through JNI calls (the JVM is created inside the Linux binary) into our custom-made .jar file. All GUI work is implemented in the Java part. Each time a GUI property has to be changed or the GUI has to be repainted, it is done through a JNI call into the JVM.

The complete display/GUI is repainted (or refreshed) as fast as the JVM/Java can handle it. This is done iteratively and frequently, a few hundred or a few thousand iterations per second.

After a fairly exact amount of time, the application is terminated with exit(1), which I caught with gdb being called from _XIOError(). The termination is reproducible after a more or less exact period, e.g. after about 15 h on a dual-core 2.5 GHz x86. On a slower computer it takes longer, as if the time were proportional to CPU/GPU speed. One conclusion would be that some part of Xorg runs out of some resource.

Here is my backtrace:

#0  0xb7fe1424 in __kernel_vsyscall ()
#1  0xb7c50941 in raise () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#2  0xb7c53d72 in abort () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
#3  0xb7fdc69d in exit () from /temp/bin/liboverrides.so
#4  0xa0005c80 in _XIOError () from /usr/lib/i386-linux-gnu/libX11.so.6
#5  0xa0003afe in _XReply () from /usr/lib/i386-linux-gnu/libX11.so.6
#6  0x9fffee7b in XSync () from /usr/lib/i386-linux-gnu/libX11.so.6
#7  0xa01232b8 in X11SD_GetSharedImage () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#8  0xa012529e in X11SD_GetRasInfo () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt_xawt.so
#9  0xa01aac3d in Java_sun_java2d_loops_ScaledBlit_Scale () from /usr/lib/jvm/jre1.8.0_20/lib/i386/libawt.so

I made my own exit() in liboverrides.so and loaded it with LD_PRELOAD so that gdb could catch the exit() call via abort()/SIGABRT. After some debugging of libX11 and libxcb, I noticed that _XReply() got a NULL reply (the response from xcb_wait_for_reply()), which causes the call to _XIOError() and exit(1). Digging deeper into xcb_wait_for_reply() in libxcb, I noticed that one of the reasons it can return a NULL reply is that it detects a broken or closed socket connection, which could be my situation.

For test purposes, if I change xcb_io.c to ignore _XIOError(), the application doesn’t work any more. And if I repeat the request inside _XReply(), it fails every time, i.e. every xcb_wait_for_reply() returns NULL.

So my questions are: why does this uncontrolled termination via _XReply() -> _XIOError() -> exit(1) happen, and how can I find out what happened so that I can fix it or work around it?

To reproduce the problem, as I wrote above, I have to wait about 15 h, but I’m currently very short on time for debugging and can’t find the cause of the termination. We also tried reorganising the Java part that handles the GUI/display refresh, but that didn’t solve the problem.

Some SW facts:
– Java JRE 1.8.0_20 (the problem also reproduces with Java 7)
– libX11.so 1.5.0
– libxcb.so 1.8.1
– debian wheezy
– kernel 3.2.0


Answer

This is likely a known issue in libX11 regarding the handling of request numbers used for xcb_wait_for_reply.

At some point after libxcb v1.5, code was introduced to use 64-bit sequence numbers internally everywhere, and logic was added to widen sequence numbers on entry to those public APIs that still take 32-bit sequence numbers.

Here is a quote from the submitted libxcb bug report (actual emails removed):

We have an application that does a lot of XDrawString and XDrawLine. After several hours the application is exited by an XIOError.

The XIOError is called in libX11 in the file xcb_io.c, function _XReply. It didn’t get a response from xcb_wait_for_reply.

libxcb 1.5 is fine, libxcb 1.8.1 is not. Bisecting libxcb points to this commit:

  commit ed37b087519ecb9e74412e4df8f8a217ab6d12a9
  Author: Jamey Sharp
  Date:   Sat Oct 9 17:13:45 2010 -0700

      xcb_in: Use 64-bit sequence numbers internally everywhere.

      Widen sequence numbers on entry to those public APIs that still take
      32-bit sequence numbers.

      Signed-off-by: Jamey Sharp <jamey@xxxxxx.xxx>

Reverting it on top of 1.8.1 helps.

Adding traces to libxcb I found that the last request numbers used for xcb_wait_for_reply are these: 4294900463 and 4294965487 (two calls in the while loop of the _XReply function), half a second later: 63215 (then XIOError is called). The widen_request is also 63215, I would have expected 63215+2^32. Therefore it seems that the request is not correctly widened.

The commit above also changed the compares in poll_for_reply from XCB_SEQUENCE_COMPARE_32 to XCB_SEQUENCE_COMPARE. Maybe the widening never worked correctly, but it was never observed, because only the lower 32bits were compared.
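
To make the term "widening" in the quote concrete, here is a minimal sketch in C. It is not copied from libxcb: the function name `widen_sequence` and its `newest` parameter are made up for illustration, and the sketch only assumes that a 32-bit number is widened by borrowing the upper 32 bits of the newest 64-bit sequence number the library knows about.

```c
#include <stdint.h>

/* Hypothetical helper, not libxcb's actual code: widen a 32-bit sequence
 * number by borrowing the upper 32 bits of `newest`, the most recent
 * 64-bit sequence number the library is assumed to track. */
uint64_t widen_sequence(uint64_t newest, uint32_t seq32)
{
    const uint64_t window = UINT64_C(1) << 32;
    uint64_t widened = (newest & UINT64_C(0xffffffff00000000)) | seq32;

    /* A request cannot be newer than the newest one, so a larger result
     * must belong to the previous 2^32 window (if there is one). */
    if (widened > newest && widened >= window)
        widened -= window;
    return widened;
}
```

Under these assumptions, `widen_sequence(4294967296 + 70000, 63215)` gives 63215 + 2^32, the value the reporter expected. With a reference counter that is still below 2^32, for example `widen_sequence(4294965487, 63215)`, the result stays 63215, which matches the number seen in the traces. That is consistent with a reference counter that had not advanced past 2^32, but the sketch alone does not prove that this is what happened.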

Reproducing the issue

Here’s the original code snippet from the submitted bug report which was used to reproduce the issue:

  for(;;) {
    XDrawLine(dpy, w, gc, 10, 60, 180, 20);
    XFlush(dpy);
  }

and apparently the issue can be reproduced with even simpler code:

  for(;;) {
    XNoOp(dpy);
  }

According to the submitted libxcb bug report, these conditions are needed to reproduce the issue (assuming the reproducer code is in xdraw.c):

  • libxcb >= 1.8 (i.e. includes the commit ed37b08)
  • compiled as 32-bit: gcc -m32 -o xdraw xdraw.c -lX11 (the library is listed after the source file so that linkers which resolve symbols in order can find it)
  • the sequence counter wraps.

Proposed patch

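The proposed patch is shown below. Note that it modifies src/xcb_io.c, which the bug report above locates in libX11, so it is applied to libX11 while running against libxcb 1.8.1: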

diff --git a/src/xcb_io.c b/src/xcb_io.c
index 300ef57..8616dce 100644
--- a/src/xcb_io.c
+++ b/src/xcb_io.c
@@ -454,7 +454,7 @@ void _XSend(Display *dpy, const char *data, long size)
        static const xReq dummy_request;
        static char const pad[3];
        struct iovec vec[3];
-       uint64_t requests;
+       unsigned long requests;
        _XExtension *ext;
        xcb_connection_t *c = dpy->xcb->connection;
        if(dpy->flags & XlibDisplayIOError)
@@ -470,7 +470,7 @@ void _XSend(Display *dpy, const char *data, long size)
        if(dpy->xcb->event_owner != XlibOwnsEventQueue || dpy->async_handlers)
        {
                uint64_t sequence;
-               for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)
+               for(sequence = dpy->xcb->last_flushed + 1; (unsigned long) sequence <= dpy->request; ++sequence)
                        append_pending_request(dpy, sequence);
        }
        requests = dpy->request - dpy->xcb->last_flushed;
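
Why changing the types helps on a 32-bit build can be checked with plain integer arithmetic. The sketch below is not Xlib code: `uint32_t` stands in for a 32-bit `unsigned long`, and both function names are made up for illustration. It only covers the boundary case where `last_flushed` is 0xffffffff and the 32-bit request counter has just wrapped to 0.

```c
#include <stdint.h>

/* Stand-in for `requests = dpy->request - dpy->xcb->last_flushed` when
 * `requests` is a 32-bit unsigned long (modelled here as uint32_t). */
uint32_t patched_request_count(uint32_t request, uint64_t last_flushed)
{
    /* The subtraction is done in 64 bits; the store keeps the low 32. */
    return (uint32_t)(request - last_flushed);
}

/* Stand-in for the patched loop: counts the sequence numbers it visits.
 * Caution: with request == 0xffffffff the casted comparison is always
 * true, so this sketch must not be called with that value. */
unsigned patched_loop_iterations(uint32_t request, uint64_t last_flushed)
{
    unsigned visited = 0;
    uint64_t sequence;

    for (sequence = last_flushed + 1; (uint32_t)sequence <= request; ++sequence)
        visited++;
    return visited;
}
```

For `request = 0` and `last_flushed = 0xffffffff`, both functions return 1: one request is counted and one sequence number is appended. In the ordinary case `request = 10`, `last_flushed = 7`, both return 3.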

Detailed technical explanation

Please find below a detailed technical explanation by Jonas Petersen (also included in the aforementioned bug report):

Hi,

Here’s two patches. The first one fixes a 32-bit sequence wrap bug. The second patch only adds a comment to another relevant statement.

The patches contain some details. Here is the whole story for who might be interested:

Xlib (libx11) will crash an application with a “Fatal IO error 11 (Resource temporarily unavailable)” after 4 294 967 296 requests to the server. That is when the Xlib internal 32-bit sequence wraps.

Most applications probably will hardly reach this number, but if they do, they have a chance to die a mysterious death. For example the application I’m working on did always crash after about 20 hours when I started to do some stress testing. It does some intensive drawing through Xlib using gtkmm2, pixmaps and gc drawing at 40 frames per second in full hd resolution (on Ubuntu). Some optimizations did extend the grace to about 35 hours but it would still crash.
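
To put the numbers above in perspective, here is a small arithmetic sketch in C. The function name is made up, and it assumes a constant request rate, which a real application will not have.

```c
#include <stdint.h>

/* Average number of requests per second needed to issue 2^32 requests
 * within `hours` hours. Returns 0 for hours == 0 to avoid dividing by 0. */
uint64_t requests_per_second_to_wrap(uint64_t hours)
{
    const uint64_t wrap = UINT64_C(1) << 32;   /* 4 294 967 296 requests */

    if (hours == 0)
        return 0;
    return wrap / (hours * 3600);              /* integer division */
}
```

Under that assumption, a crash after 20 hours corresponds to about 59652 requests per second, or roughly 1491 requests per frame at 40 frames per second. The 15 hours reported in the question correspond to about 79536 requests per second.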

What then followed was some frustrating weeks of digging and debugging to realize that it’s not in my application, nor in gtkmm, gtk or glib but that it’s this little bug in Xlib which exists since 2006-10-06 apparently.

It took a while to turn out that the number 0x100000000 (2^32) has some relevance. (Much) later it turned out it can be reproduced with Xlib only, using this code for example:

  while(1) {
    XDrawPoint(display, drawable, gc, x, y);
    XFlush(display);
  }

It might take one or two hours, but when it reaches the 4294 million it will explode into a “Fatal IO error 11”.

What I then learned is that even though Xlib uses internal 32bit sequence numbers they get (smartly) widened to 64bit in the process so that the 32bit sequence may wrap without any disruption in the widened 64bit sequence. Obviously there must be something wrong with that.

The Fatal IO error is issued in _XReply() when it’s not getting a reply where there should be one, but the cause is earlier in _XSend() in the moment when the Xlib 32-bit sequence number wraps.

The problem is that when it wraps to 0, the value of ‘last_flushed’ will still be at the upper boundary (e.g. 0xffffffff). There are two locations in _XSend() (xcb_io.c) that fail in this state because they rely on those values being sequential all the time. The first location is:

requests = dpy->request - dpy->xcb->last_flushed;

In case of request = 0x0 and last_flushed = 0xffffffff it will assign 0xffffffff00000001 to ‘requests’ and then pass it to XCB as a number (amount) of requests. This is the main killer.
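
The value 0xffffffff00000001 can be reproduced outside Xlib. The following is a minimal sketch, assuming a 32-bit build where `dpy->request` is 32 bits wide and `last_flushed` is 64 bits wide; the function name is hypothetical.

```c
#include <stdint.h>

/* Models `requests = dpy->request - dpy->xcb->last_flushed` on a 32-bit
 * build: `request` is 32 bits, `last_flushed` and the result are 64 bits. */
uint64_t unpatched_request_count(uint32_t request, uint64_t last_flushed)
{
    /* `request` is converted to 64 bits before the subtraction, so after
     * a wrap the result underflows modulo 2^64 instead of modulo 2^32. */
    return request - last_flushed;
}
```

`unpatched_request_count(0, 0xffffffff)` yields 0xffffffff00000001, while the non-wrapped case `unpatched_request_count(10, 7)` yields the expected 3.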

The second location is this:

for(sequence = dpy->xcb->last_flushed + 1; sequence <= dpy->request; ++sequence)

In case of request = 0x0 (less than last_flushed) there is no chance to enter the loop ever and as a result some requests are ignored.
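
The skipped loop can be shown with the same stand-in types. This is again only a sketch with a made-up function name, where `uint32_t` models the 32-bit `dpy->request`.

```c
#include <stdint.h>

/* Counts how often the body of the unpatched loop would run. */
uint64_t unpatched_loop_iterations(uint32_t request, uint64_t last_flushed)
{
    uint64_t visited = 0;
    uint64_t sequence;

    /* `request` is promoted to 64 bits for the comparison, so a wrapped
     * request of 0 is smaller than every candidate sequence number. */
    for (sequence = last_flushed + 1; sequence <= request; ++sequence)
        visited++;
    return visited;
}
```

With `request = 0` and `last_flushed = 0xffffffff` the function returns 0, so nothing is appended for the request that caused the wrap. With `request = 10` and `last_flushed = 7` it returns 3.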

The solution is to “unwrap” dpy->request at these two locations and thus retain the sequence related to last_flushed.

uint64_t unwrapped_request = ((uint64_t)(dpy->request < dpy->xcb->last_flushed) << 32) + dpy->request;

It creates a temporary 64-bit request number which has bit 32 set if ‘request’ is less than ‘last_flushed’. It is then used in the two locations instead of dpy->request.

I’m not sure if it might be more efficient to use that statement in place, instead of using a variable.
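
Here is a sketch of how the unwrapped value behaves at the two locations. The expression is the one quoted above; the wrapper functions and their names are hypothetical, and `uint32_t` again models the 32-bit `dpy->request`.

```c
#include <stdint.h>

/* The quoted expression: add 2^32 when `request` has wrapped past
 * `last_flushed`, otherwise leave it unchanged. */
uint64_t unwrap_request(uint32_t request, uint64_t last_flushed)
{
    return ((uint64_t)(request < last_flushed) << 32) + request;
}

/* First location: number of requests to hand over. */
uint64_t fixed_request_count(uint32_t request, uint64_t last_flushed)
{
    return unwrap_request(request, last_flushed) - last_flushed;
}

/* Second location: number of sequence numbers the loop visits. */
uint64_t fixed_loop_iterations(uint32_t request, uint64_t last_flushed)
{
    uint64_t unwrapped = unwrap_request(request, last_flushed);
    uint64_t visited = 0;
    uint64_t sequence;

    for (sequence = last_flushed + 1; sequence <= unwrapped; ++sequence)
        visited++;
    return visited;
}
```

For `request = 0`, `last_flushed = 0xffffffff` the unwrapped value is 0x100000000 and both helpers return 1. For `request = 2`, `last_flushed = 0xfffffffe` both return 4, and without a wrap (`request = 10`, `last_flushed = 7`) both return 3.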

There is another line in require_socket() that worried me at first:

dpy->xcb->last_flushed = dpy->request = sent;

That’s a 64-bit, 32-bit, 64-bit assignment. It will truncate ‘sent’ to 32-bit when assigning it to ‘request’ and then also assign the truncated value to the (64-bit) ‘last_flushed’. But it seems intended. I have added a note explaining that for the next poor soul debugging sequence issues… 🙂
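
The truncating double assignment can be tried in isolation. The struct and function below are hypothetical stand-ins with the same field widths as described (32-bit ‘request’, 64-bit ‘last_flushed’ and ‘sent’); they are not the Xlib structures.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two fields involved. */
struct seq_state {
    uint32_t request;       /* models the 32-bit dpy->request */
    uint64_t last_flushed;  /* models the 64-bit dpy->xcb->last_flushed */
};

struct seq_state assign_sent(uint64_t sent)
{
    struct seq_state s;

    /* The inner assignment truncates `sent` to 32 bits; its value (the
     * truncated one) is then what the outer assignment stores. */
    s.last_flushed = s.request = sent;
    return s;
}
```

With `sent = 0x100000005`, both `request` and `last_flushed` end up as 5, so the two fields stay in step even though `last_flushed` could have held the full 64-bit value.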

- Jonas

Jonas Petersen (2):
  xcb_io: Fix Xlib 32-bit request number wrapping
  xcb_io: Add comment explaining a mixed type double assignment

 src/xcb_io.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

--
1.7.10.4

Good luck!

User contributions licensed under: CC BY-SA