[Repost] Java's hashCode is not safe for distributed systems

2019-03-02 23:44 | Source: the web

Reposting it here because the original can only be viewed by getting around the firewall.

http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html

As you probably know, hash functions serve many different purposes:

  1. Network and storage systems use them (in the guise of checksums) to detect accidental corruption of data.
  2. Cryptographic systems use them to detect malicious corruption of data and to implement signatures.
  3. Password authentication systems use them to make it harder to extract plaintext passwords from a database.
  4. Programming languages use them for hash maps, to determine in which hash bucket a key is placed.
  5. Distributed systems use them to determine which worker in a cluster should handle a part of a large job.

All those purposes have different requirements, and different hash functions exist for the various purposes. For example, CRC32 is fine for detecting bit corruption in Ethernet, as it’s really fast and easy to implement in hardware, but it’s useless for cryptographic purposes. SHA-1 is fine for protecting the integrity of a message against attackers, as it’s cryptographically secure and also reasonably fast to compute; but if you’re storing passwords, you’re probably better off with something like bcrypt, which is deliberately slow in order to make brute-force attacks harder.

Anyway, that’s all old news. Today I want to talk about points 4 and 5, and why they are also very different from each other.

Hashes for hash tables

We use hash tables (dictionaries) in programming languages all the time without thinking twice. When you insert an item into a hash table, the language computes a hash code (an integer) for the key, uses that number to choose a bucket in the hash table (typically mod n for a table of size n), and then puts the key and value in that bucket in the table. If there’s already a value there (a hash collision), a linked list typically takes care of storing the keys and values within the same hash bucket. In Ruby, for example:

$ ruby --version
ruby 1.8.7 (2011-06-30 patchlevel 352) [i686-darwin11.0.0]

$ pry
[1] pry(main)> hash_table = {'answer' => 42}
=> {"answer"=>42}
[2] pry(main)> 'answer'.hash
=> -1246806696
[3] pry(main)> 'answer'.hash
=> -1246806696
[4] pry(main)> ^D

$ pry
[1] pry(main)> 'answer'.hash
=> -1246806696
[2] pry(main)> "don't panic".hash
=> -464783873
[3] pry(main)> ^D

When you add the key 'answer' to the hash table, Ruby internally calls the #hash method on that string object. The method returns an arbitrary number, and as you see above, the number is always the same for the same string. A different string usually has a different hash code. Occasionally you might get two keys with the same hash code, but it’s extremely unlikely that you get a large number of collisions in normal operation.
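To make the bucket mechanics concrete, here is a minimal separate-chaining sketch in Java (illustrative names, not production code): the key’s hash code, taken mod the table size, picks a bucket, and colliding entries are chained within that bucket.

import java.util.ArrayList;
import java.util.List;

// Minimal separate-chaining hash table sketch (illustrative only).
class TinyHashTable<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<List<Entry<K, V>>> buckets;

    TinyHashTable(int size) {
        buckets = new ArrayList<>(size);
        for (int i = 0; i < size; i++) buckets.add(new ArrayList<>());
    }

    // hashCode() mod table size chooses the bucket; floorMod handles negative hash codes.
    private List<Entry<K, V>> bucketFor(K key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    void put(K key, V value) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        bucketFor(key).add(new Entry<>(key, value)); // collision: chained in the same bucket
    }

    V get(K key) {
        for (Entry<K, V> e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}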

The problem with the Ruby sessions above: when I quit Ruby (^D) and start it again, and compute the hash for the same string, I still get the same result. But why is that a problem, you say, isn’t that what a hash function is supposed to do? – Well, the problem is that I can now put on my evil genius hat, and generate a list of strings that all have the same hash code:

$ pry
[1] pry(main)> "a".hash
=> 100
[2] pry(main)> "\0a".hash
=> 100
[3] pry(main)> "\0\0a".hash
=> 100
[4] pry(main)> "\0\0\0a".hash
=> 100
[5] pry(main)> "\0\0\0\0a".hash
=> 100
[6] pry(main)> "\0\0\0\0\0a".hash
=> 100

Any server in the world running the same version of Ruby will get the same hash values. This means that I can send a specially crafted web request to your server, in which the request parameters contain lots of those strings with the same hash code. Your web framework will probably parse the parameters into a hash table, and they will all end up in the same hash bucket, no matter how big you make the hash table. Whenever you want to access the parameters, you now have to iterate over a long list of hash collisions, and your swift O(1) hash table lookup is suddenly a crawling slow O(n).

I just need to make a small number of these evil requests to your server and I’ve brought it to its knees. This type of denial of service attack was already described back in 2003, but it only became widely known last year, when Java, Ruby, Python, PHP and Node.js all suddenly scrambled to fix the issue.
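Java is no different: String.hashCode() uses a fixed, documented formula, so colliding strings can be generated offline. For instance, "Aa" and "BB" both hash to 2112, and concatenating colliding blocks produces more collisions. A small sketch that enumerates 2^n such strings:

import java.util.ArrayList;
import java.util.List;

// Generates 2^n strings that all share the same String.hashCode() value.
// "Aa" and "BB" collide, and String's polynomial hash formula makes any
// concatenation of colliding blocks collide as well.
public class JavaHashCollisions {
    public static void main(String[] args) {
        int blocks = 4; // 2^4 = 16 colliding strings
        List<String> colliding = new ArrayList<>();
        colliding.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : colliding) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            colliding = next;
        }
        for (String s : colliding) {
            System.out.println(s + " -> " + s.hashCode()); // all print the same hash
        }
    }
}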

The solution is for the hash code to be consistent within one process, but to be different for different processes. For example, here is a more recent version in Ruby, in which the flaw is fixed:

$ ruby --version
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]

$ pry
[1] pry(main)> 'answer'.hash
=> 968518855724416885
[2] pry(main)> 'answer'.hash
=> 968518855724416885
[3] pry(main)> ^D

$ pry
[1] pry(main)> 'answer'.hash
=> -150894376904371785
[2] pry(main)> ^D

When I quit Ruby and start it again, and ask for the hash code of the same string, I get a completely different answer. This is obviously not what you want for cryptographic hashes or checksums, since it would render them useless — but for hash tables, it’s exactly right.

Hashes for distributed systems

Now let’s talk about distributed systems — systems in which you have more than one process, probably on more than one machine, and they are talking to each other. If you have something that’s too big to fit on one machine (too much data to fit on one machine’s disks, too many requests to be handled by one machine’s CPUs, etc), you need to spread it across multiple machines.

How do you know which machine to use for a given request? Unless you have some application-specific partitioning that makes more sense, a hash function is a simple and effective solution: hash the name of the thing you’re requesting, mod number of servers, and that’s your server number. (Though if you ever want to change the number of machines, consistent hashing is probably a better bet.)
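As a minimal sketch of that scheme in Java (hypothetical names; CRC32 stands in for any hash that is stable across processes):

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Route a request key to one of n servers: stable hash of the key, mod n.
public class Partitioner {
    static int serverFor(String key, int numServers) {
        CRC32 crc = new CRC32(); // any hash that is stable across processes would do
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numServers); // CRC32 value is non-negative
    }

    public static void main(String[] args) {
        System.out.println(serverFor("answer", 8)); // same result on every machine
    }
}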

For this setup you obviously don’t want a hash function in which different processes may compute different hash codes for the same value, because you’d end up routing requests to the wrong server. You can’t use the same hash function as the programming language uses for hash tables.

Unfortunately, this is exactly what Hadoop does. Storm, a stream processing framework, does too. Both use the Java Virtual Machine’s Object.hashCode() method.

I understand the use of hashCode() — it’s very tempting. On strings, numbers and collection classes, hashCode() always returns a consistent value, apparently even across different JVM vendors. It’s like that despite the documentation for hashCode() explicitly not guaranteeing consistency across different processes:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

And once in a while, a bold library comes along that actually returns different hashCode() values in different processes – Protocol Buffers, for example – and people get quite confused.

The problem is that although the documentation says hashCode() doesn’t provide a consistency guarantee, the Java standard library behaves as if it did provide the guarantee. People start relying on it, and since backwards-compatibility is rated so highly in the Java community, it will probably never ever be changed, even though the documentation would allow it to be changed. So the JVM gets the worst of both worlds: a hash table implementation that is open to DoS attacks, but also a hash function that can’t always safely be used for communication between processes. :(

Therefore…

So what I’d like to ask for is this: if you’re building a distributed framework based on the JVM, please don’t use Java’s hashCode() for anything that needs to work across different processes. Because it’ll look like it works fine when you use it with strings and numbers, and then someday a brave soul will use (e.g.) a protocol buffers object, and then spend days banging their head against a wall trying to figure out why messages are getting sent to the wrong servers.

What should you use instead? First, you probably need to serialize the object to a byte stream (which you need to do anyway if you’re going to send it over the network). If you’re using a serialization that always maps the same values to the same sequence of bytes, you can just hash that byte stream. A cryptographic hash such as MD5 or SHA-1 would be ok for many cases, but might be a bit heavyweight if you’re dealing with a really high-throughput service. I’ve heard good things about MurmurHash, which is non-cryptographic but lightweight and claims to be well-behaved.
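For instance, here is a hedged sketch using the standard java.security.MessageDigest: hash the (assumed deterministic) serialized bytes with SHA-1 and fold the first four bytes of the digest into a partition number.

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Derive a partition from the serialized form of a message, not from hashCode().
public class StableKeyHash {
    static int partitionFor(byte[] serialized, int numPartitions) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(serialized);
        int h = ByteBuffer.wrap(digest).getInt(); // first 4 bytes of the digest as an int
        return Math.floorMod(h, numPartitions);   // floorMod: h may be negative
    }
}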

If your serialization doesn’t always produce the same sequence of bytes for a given value, then you can still define a hash function on the objects themselves. Just please don’t use hashCode(). It’s ok for in-process hash tables, but distributed systems are a different matter.
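If you take that route, define the hash explicitly over the object’s fields so every process computes the same value. Here is a sketch using 32-bit FNV-1a over the UTF-8 bytes of two hypothetical fields (the separator guards against ambiguous field boundaries):

import java.nio.charset.StandardCharsets;

// An application-defined hash: an explicit formula over the object's fields,
// independent of any JVM's Object.hashCode() implementation.
public class UserKey {
    final String tenant;
    final String name;

    UserKey(String tenant, String name) { this.tenant = tenant; this.name = name; }

    int stableHash() {
        int h = 0x811C9DC5; // FNV-1a 32-bit offset basis
        for (byte b : (tenant + "\u0000" + name).getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193; // FNV-1a 32-bit prime
        }
        return h;
    }
}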

(Oh, and in case you were wondering: it looks like the web servers affected by Java’s hashCode collisions fixed the problem not by changing to a different hash function, but simply by limiting the number of parameters: Tomcat, Jetty.)


Reposted from: http://www.cnblogs.com/mianlaoshu/articles/3017959
