Haha, this is not JDK's bug, its our code's problem.
we use fdopen on a socket fd twice, and fclose it twice which will close the fd as well.
when the fd is close once, system may reuse it for another socket, and we close other socket's same fd num again. brings the problem.
====================================================================
causing our file close exception, and mix different fd's content with each other, put what I write to a file into a socket!
we upgrade to JDK 1.6, hoping those bugs are fixed and not bringing in some new bugs.
seems some one has met the same problem:
http://256.com/gray/docs/misc/java_bad_file_descriptor_close_bug.shtml
Java Bad File Descriptor Close Bug
We have been fighting for a number of weeks with a bad file descriptor bug on a search project at work. Although we initially assumed it was our buggy software to blame, a closer look at the problem seems to point elsewhere. Our current research seems to indicate that a bug in Sun's JDK under Linux (maybe others) is at fault here.
Background
We have been seeing a large number of the 'Bad file descriptor' exceptions in various places around our software. We initially thought that we were someone double closing or otherwise improperly using file handles. However, after a more aggressively review of the code in question as well as the JDK, we came to the conclusion that aside from creating input or output streams using the FileDescriptor constructor (which we were not doing) there should be no way to get these problems. In addition, many of the places which were throwing the exceptions (see below) were in 3rd party libraries including Lucene and Berkeley Database JE.
Our application is a search system which is using ActiveMQ JMS to receive, index, and store documents. We use 10s of JMS queues and topics to coordinate a distributed collection of search nodes. This translates into 10s of sockets which are mostly TCP but some UDP and multicast packets. Local disk descriptors to standard SCSI disks or hardware/software raid configurations. No network filesystems are being used. Mostly EXT3, some XFS, and Some temporary filesystems but otherwise nothing exotic.
Possible Solution
As of February 26, 2007, more testing has revealed that our problems seem to be coming from concurrency issues in Sun's 1.5 JDK (through v.11) combined with Lucene v2.0's calling close() in FSIndexInput.finalize() and FSIndexOutput.finalize() methods -- a real no-no. Looks like v2.1 has 2 bug fixes that would resolve this (669 and 750). We think that the problem is a race condition between a user thread which has caused close and the finalizer which is also calling close. The fact that the FileDescriptor has already been closed is not properly synchronized between the 2 threads so a double close is performed, closing a file that was just opened.
This problem also seems to be fixed in Sun's v1.6 JDK. Looking through the change logs, we see that a number of garbage collection bugs have been fixed. Right now our speculation is that the problems stem from a synchronization bug in 1.5 versions between the GC (maybe just the finalizer parts) and user threads. The finalizer thread does not know that a user thread had closed a file descriptor -- or vice versa. Although we are seeing 30%+ worse performance with 1.6, we have not seen any Bad File Descriptor exceptions. If they fixed some problems by adding synchronization or memory barriers to the GC, this also may account for the performance losses.
JDK Analysis
One of the things that we noticed in looking at the JDK is that if the close() system call fails, the FileDescriptor is not reset to -1 even though the system's file descriptor is invalid (see Linux kernel source). This could mean that if the close syscall fails, it might cause a close of either an invalid file descriptor, or a valid descriptor which happened to call open and get the latest file descriptor that was just closed. This may be the ultimate problem although we are unsure of when, aside from NFS which we are not using, close returns failure. This bug was not in Sun's 1.4 JDK, but was introduced in 1.5 when the error code from close was detected. The bug in is all 1.5 versions through 1.5.11 as well as 1.6. Since the Linux version is using the Solaris code for close, we wonder if the Solaris kernel does something different of it the Solaris version has a bug as well.
We decided initially to copy and override the FileDescriptor class and the classes which open them so we can instrument, and hopefully protect against, the problems. We also wanted to put some synchronized blocks around the closes to make sure that we don't have a concurrency problem. Initial takes at this seemed to have reduced but not removed the problems. As of February 2007 we are working on rev 3 of these classes however these were unable to fix the problem although they gave us more information of the call graphs at fault.
Linux Kernel
Below is the source for the close() syscall from the 2.6.18.3 Linux kernel. As you can see, if the file descriptor is valid, there is no way to leave this routine without it being closed and made available for the next call to open(). If filep_close returns an error, and the JDK does not set the file descriptor to -1, a double close will most likely happen in the future.
asmlinkage long sys_close(unsigned int fd)
{
struct file * filp;
struct files_struct *files = current->files;
struct fdtable *fdt;
spin_lock(&files->file_lock);
fdt = files_fdtable(files);
if (fd >= fdt->max_fds)
goto out_unlock;
filp = fdt->fd[fd];
if (!filp)
goto out_unlock;
rcu_assign_pointer(fdt->fd[fd], NULL);
FD_CLR(fd, fdt->close_on_exec);
__put_unused_fd(files, fd);
spin_unlock(&files->file_lock);
return filp_close(filp, files);
out_unlock:
spin_unlock(&files->file_lock);
return -EBADF;
}
Updated news:
I replaced the JNI part with pure Java socket, and it works fine now. Oops.
And here comes another challenge, how to stop those complicate sub-thread safely.
1 comment:
I can login your bloger lar.
Mark down the timestamp: 2007-04-28
Post a Comment