
When an Object in a Set is not in the Set
December 2, 2006I have a Jukebox object with a Set of Song objects. I wrote a test to verify that the Songs were cascading properly when I updated and saved the Jukebox. Retrieving an empty Jukebox via Hibernate (Hibernate–keep that in mind), I added two Songs, saved the Jukebox, retrieved the Jukebox from the database and verified it had two Songs. So far, so good. Then I removed one of the Songs, saved the Jukebox, retrieved it again and my problems began. There were still had two Songs in the Jukebox. After some serious head scratching, I modified the test to look like this:
Song song = (Song) jukebox.getSongs().iterator().next();
assertTrue(jukebox.getSongs().contains(song));
Seems simple enough but the assertion failed. So the problem existed outside the persistence framework–or so I thought. After more digging, I (re)discovered that an object’s hashcode is used in an algorithm to determine whether a Set contains that particular object. The interesting thing to note is that the hashcode is run through “a ’supplemental hash function,’ which defends against poor quality hash functions”. The hashcode is actually modified before it’s used. As long as that “supplemental hash function” is also used for adding the object to the Set everything should be fine.
I suspect (and I have not yet investigated this) Hibernate does not implement (or implements differently) the “supplemental hash function” when adding an object to org.hibernate.collection.PersistentSet. If the object’s hash function is of poor quality, the modified hashcode used in the contains() method may not match the hashcode used when the object was added to the Set. The mismatched hashcodes then may prevent locating or removing the object.
The supplemental hash function is designed to protect against a poor quality hashcode() implementations, NOT against incorrect ones.
A HashSet works by dividing its contents up into “hash buckets”. When you put an object in a set, it is assigned to one of the available buckets based on its hashcode(). When you look for an object using contains(), the set will find out which bucket to search based on its hashcode(), then find out whether the object is in that bucket. If it finds one with an exactly matching hashcode, it then checks if the two objects are equal(), and if so that’s a match.
So if your hashcode() and equals() methods don’t agree, there’s a good chance that HashSet will look in the wrong bucket, and even if it looks in the right bucket, you won’t find the object you’re after.
The supplemental hash function protects you against a special case of problem. If you base your hashcode() method on a bad hashing function, it’s possible that all your objects will end up in the SAME hash bucket inside the set. Which would mean that no matter how many objects you put in the set, finding them would involve a linear search through every object. The supplemental hash function is applied to hashcodes to make sure they are evenly distributed across the buckets.
All this protects you against is bad performance, not incorrect behaviour. The supplemental hash function still doesn’t help you with a hashcode() method that doesn’t agree with equals(). It’s a deterministic function, so if the hashes don’t match coming in, they won’t match after they’ve gone through the supplemental function either.