Java中的String.hashCode()方法可能有问题？ - linux编程基础

TOP

Java中的String.hashCode()方法可能有问题？(二)

2018-08-20 09:23:21 【大中小】浏览:202次

sp; int maxhc=0, maxsize=0;
for(Map.Entry<Integer,Set> e : collisions.entrySet()) {
Integer hc=e.getKey();
Set bucket=e.getValue();
if(bucket.size() > maxsize) {
maxhc = hc.intValue();
maxsize = bucket.size();
}
}

System.out.println("Elapsed time: "+elapsed+"ns");
System.out.println("Total unique lines: "+lines.size());
System.out.println("Time per hashcode: "+String.format("%.4f", 1.0*elapsed/lines.size())+"ns");
System.out.println("Total unique hashcodes: "+collisions.size());
System.out.println("Total collisions: "+(lines.size()-collisions.size()));
System.out.println("Collision rate: "+String.format("%.8f", 1.0*(lines.size()-collisions.size())/lines.size()));
if(maxsize != 0)
System.out.println("Max collisions: "+maxsize+" "+collisions.get(maxhc));
}
}

我们使用短字符串（words.txt，链接见文末）作为输入：

$ cat words.txt | java HashTest
Loading lines from stdin...
Warming up..........
Computing collisions...
Elapsed time: 49117411ns
Total unique lines: 466544
Time per hashcode: 105.2793ns
Total unique hashcodes: 466188
Total collisions: 356
Collision rate: 0.00076306
Max collisions: 3 [Jr, KS, L4]

在这些英文短字符串中，总共有466,544个哈希，出现356次冲突。从理论上讲，“公平”的哈希函数应该只会产生25.33次冲突。因此，String.hashCode()产生的冲突是公平哈希函数的14.05倍：
356.0 / 25.33 ≈ 14.05

不过，每10,000个哈希出现8次冲突的概率仍然是个不错的成绩。

那么长字符串值的结果怎样？使用莎士比亚全集中的句子（链接见文末）会产生以下输出：
$ cat shakespeare.txt | java HashTest
Loading lines from stdin...
Warming up..........
Computing collisions...
Elapsed time: 24106163ns
Total unique lines: 111385
Time per hashcode: 216.4220ns
Total unique hashcodes: 111384
Total collisions: 1
Collision rate: 0.00000897
0.00076306
Max collisions: 2 [ There's half a dozen sweets., PISANIO. He hath been search'd among the dead and living,]

在这些较长的英语字符串中，总共有111,385个哈希，出现1次冲突。“公平”哈希函数将在这些数据上产生1.44次冲突，因此String.hashCode()优于公平哈希函数，冲突可能性是公平哈希函数的69.4％：
1 / 1.44 ≈ 0.694

也就是说，每100,000个哈希产生不到1个冲突，这个成绩是极好的。

一点解释

显然，String.hashCode()不具备唯一性，它也不可能具备唯一性。对于短字符串，它与理论平均值差得比较远，但其实做得还算不错。对于长字符串，它可以轻松打败平均理论值。

总得来看，它对于预期字符串而言是具备唯一性的，可以将字符串很好地分布在哈希表中。

最后，我还是认为String.hashCode()是具备唯一性的，至少它足够“好”。

延伸阅读

如果你对这个问题感兴趣，我强烈建议你看一看Stack Overflow上的答案（https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed#answer-145633），它深入探讨了哈希函数冲突的问题。

重要链接：

查看英文原文：http://sigpwned.com/2018/08/10/string-hashcode-is-plenty-unique/

首页上一页 1 2 下一页尾页 2/2/2
【大中小】【打印】【繁体】【投稿】【收藏】【推荐】【举报】【评论】【关闭】【返回顶部】

上一篇：JavaScript数据类型及判断	下一篇：Java Object类的方法