Kademlia: A study | Roborooter.com

This week I'm at the Recurse Center as part of the "Mini 2 Batch". I'll be spending my time researching and implementing the Kademlia Distributed Hash Table. DHT's are a building block of decentralized systems. They allow "efficient" storage and access of data that is both produced by many untrusted computers and stored by many untrusted computers. I use quotes around "efficient" because while it's absolutely more efficient than previous systems that solved a similar problem, if you happen to control (or at least identify in advance) all the computers you can build a much faster system. We call those faster systems "databases".

A Kademlia DHT powers BitTorrent and uses it to find peers and essentially load the "torrent file" from other computers seeding that torrent. The community there has made many improvements (in both performance and security) to the system over the years. Something I'm considering also researching.

I started by reading the whitepaper. It's dense and I had to lookup a lot of terms, and symbols. I was able to find a small group of people to whiteboard each section with so we could see what the algorithms looked like with data. I found drawing what the "distance" metric looked like and computing values for a simple 4 bit DHT and associated "k-bucket" peer buffers looked like incredibly helpful. Thanks to everyone who talked me through it.

I'm going to leave my notes on each section here to help cement what I've learned. (And if you're reading this, maybe you too!)

One note, I use the term "people", "peers" and "nodes" pretty interchangeably in my notes.

Section Overviews

Section 1: Introduction

It's a pretty nice overview. You'll want to understand what a binary XOR operation is and hint for later it's sometimes written as ⓧ.

Section 2: System Overview

You have nodes in a system, each node has an address, and then lots of "trees" where you know at least one node in other sub trees. In my second read through I finally get what they're getting at. If you were a node, in this system you'd know at least one person in your building, your block, your city, your state, country, etc. And while trees are a correct way to explain this, I like my analogy better. I made a fake 4 bit DHT and scribbled a log of events of what a node lookup would consist of. I'll be using this 4 bit DHT as an example a lot.

Node looksups work as follows. The purpose is to find the contact information of an ID. So you ask yourself "Who do I know who's "closest" (we'll get to what that means) to who I want to talk to?" So you look in your k-bucket and find the closest person and ask them if they know the ID. If they don't know it, they tell you the closest people they know, and then you'll go ask them. Eventually you'll find the node you're looking for or you'll give up. There's more to it but that comes later.

Section 2.1: XOR Metric

This should have been called "Distance" or even "the bits between us" because it's all about making a metric for "distance" so you can determine who you care about making friends (I mean "peers") with. The distance between two nodes is the XOR of both IDS taken as an unsigned integer. They wax philosophic about XOR for reasons I'm sure I'll appreciate when I'm trying to implement this but didn't seem relevant now. Also d(x, z) is how they've chosen to write XOR for most but not all of the section. (ⓧ is also sometimes used for XOR, and if you see ∀ it means "for all")

One part still confuses me, so I recruited a few volunteers to whiteboard it.

"When a tree is not fully populated, the closest leaf to an ID x is the leaf whose ID shares the longest common preﬁx of x. If there are empty branches in the tree, there might be more than one leaf with the longest common preﬁx. In that case, the closest leaf to x will be the closest leaf to ID x̃ produced by ﬂipping the bits in x corresponding to the empty branches of the tree."

What is x̃? Is it ~x the negation of x? Probably. But what are "the bits in x corresponding to the empty branches of the tree"? I don't know. I do know however that the XOR of two IDs is the distance between them, so I'll move on.

2.2 Node State

This section describes k-buckets. K-Buckets are where you put your friends contact information. So maybe it should have been called "how to keep your address list"?

In Kademlia instead of remembering everyone you ever met, you remember 160 buckets of 𝑘 people (where 𝑘 is a system wide setting, suggested to be 20). You put each person in a bucket based on how far away they are from you. This way you only ever bother 3,200 people when you're looking for someone you don't know instead of a possible 2^160 people (which is a lot of people). In Kademlia unlike real life if you meet someone new and their bucket is full, you ping the person you've seen least recently to see if they're online. If they're not you forget their number and welcome your new friend. And if they are you forget about the new person. Harsh.

(A programmer's note, this makes me think we have have a queue of new nodes for each bucket as pings take some time and you might learn about more nodes in the meantime. Also I think in later sections we remember people for different reasons, so you need to have a queue of people to add and why you're adding them?)

I had to whiteboard to fully grok what it meant to have buckets of distance. And you later find out that 𝑘 is 20. Why should it be twenty? I don't know. Stefan Saroiu (and peers) studied Gnutella and figured out that if you've been online for an hour you're likely to stay online another hour. Which is neat. I wonder why this happens? I wonder if it matters why you're online, eg, would a chat DHT have the same characteristics as a file sharing DHT?

2.3 Kademlia protocol

Given the previous information about the system we live in. This is how we act in that system. You can interact with others in 4 ways. You can ping them, you can ask them to store info (no recommendations on what/why/how much etc), you can ask people to help you find_node and find_value which are similar. As you might guess this section introduces the hash table part of of DHT. Our authors don't really talk about the keys and values of the DHT directly but more about the behaviors what you do when you see them. I guess this is a white paper on the algorithm not an API doc but after such a solid intro on peer discovery, I was surprised.

In any case this is what I've gathered about the section;

Every message includes a 160 bit "nonce" that the reply must include.
Keys are 160 bit values and while they're not IDs they're usually stored near nodes with IDs close to the key (using the XOR distance metric) no recommendation about picking an ID, but implied choice is a 160bit hash of the value.
Values are whatever, that's up to you and your distributed network of friends. No size limits.
Key Value Pairs (lets call them a KV) are only remembered for a while (they have a time to live or TTL) so if you're publishing something you have to keep doing it. This is a system wide setting (that is probably gamed/ignored by clients who do what they want)
Responses can include a ping request. This makes sense since it's suggested that UDP be used there are no connections or retransmissions, so if you want to make sure your peer is still there, ask for a reply. (something you get for free with TCP)

"The most important procedure a Kademlia participant must perform is to locate the 𝑘 closest nodes to some given node ID." I'm guessing 𝑘 is still 20 (and that it's the same 𝑘 as before). You use your address book and find the closest 20 people to the ID you're looking up and call them α at a time. But what is α (besides a greek alpha)? I googled the meaning of the character which was hard because it's impossible to copy and paste from the whitepaper's pdf. Eventually I found some math resources after searching for "greek math fish symbol" and there doesn't seem to be any special meaning. Eventually I learned that α means 3 and proceeded to laugh out loud causing stares from my study friends who didn't see the humor in it.

find_node and find_value get a response of the node's connection info, the value, or a list of 𝑘 nodes that are closest to the ID you're looking for from the peer's own address book.

It appears that you keep α (3) connections going asking people if they've seen the IDs. You then select from the responses and your own address book the next "closet" peer you haven't already asked and keep going until you've got a set of 𝑘 (20) closest nodes that don't have what you're looking for. It's unclear to me if you add these new nodes to your k-buckets address book or not.

When doing a find_value you stop as soon as you get your value, and if you learned about anybody closer to the key who didn't have it, you store it with the closest person and remember it yourself. To keep from "over caching" you figure out the distance between you and the closest person you're aware of to the key who has the key and make your TTL inversely proportional to the distance. This might mean (1 - (distance / 2^160)) * 1 hour but I wonder if a minimum cap on that TTL would be helpful.

Lastly this section covers bucket refreshes and initial population of buckets. When you haven't seen any node from a bucket in the last hour you pick a random ID in that bucket's range and do a lookup. When you first join the network, you'll have at least one friend. Do a node lookup for yourself (unclear if you'll get 𝑘 results or your own information), and then lookup a random id in every bucket farther away from your closest friend. It makes sense to me not to do these in parallel, and to wait to add more people to your address book so you can a wider distribution of peers, but that isn't specified.

2.4 Routing Table

Or "How to improve your buckets" sometimes due to randomness, you'll have an incredibly imbalanced tree. The splitting of buckets based upon 160 prefixes can't really deal with this ID space and most of your buckets will be empty. I believe this section is a replacement for section 2.2. The idea is you start with a set of 1 bucket (no prefix). When it's full and you try to insert new contact information you split the bucket that would include your own ID in half (add one more bit to the prefix). If you try to add the new contact again and the new bucket is still full you drop the information. The end of this exercise is a bucket list that looks like 2.2.

The next part of this section talks about one more special case. Say you're all alone and your bucket is mostly empty. If you try to add someone to a bucket and the bucket is full but you don't have 𝑘 people that are close to you, and this new person is closer to you than anyone else. Split the bucket and remember them. It's good to have close friends. (Also this is mostly a guess, I've read this paragraph a ton of times now and I think this is what they mean, but I could be wrong!)

2.5 Efficient key re-publishing

Since we seek the closest person to a key when doing a find_value we could have a situation where a new node that joins the network has a closer ID but doesn't know the KV, they might get our find_value request and not be able to answer it. If this happens 𝑘 times there's a good chance that even though the network has the data, you can't find it. Likewise the closest nodes to the data might just leave, also causing you to not find the data even if it's still out there.

To work around this nodes should republish KVs every hour to ensure it's available on the 𝑘 nodes closest to the key. To keep things efficient, if you got a republish of a key within the last hour, don't republish it yours self. Also don't do a node lookup again, your bucket refreshes should have that covered.

When combined with a new behavior of "Hey I thought I was the closest node ot this key, but I just learned about you and you're the closest node so here have the KV" stuff sticks around without too much wasted effort.

3 Sketch of Proof

This is a big overview saying they did the math of the probabilities for how well communication happens, how likely it is we'll loose KVs in general and edge cases. They don't show the math which is common for papers I think. I did learn of one new behavior.

When publishing a KV you submit it to the closes 𝑘 nodes to the key, and republish every hour. I'm pretty sure this hasn't been mentioned in full prior to this overview, but I'm glad they say it now.

4 Implementation notes

Nice to have a little overview of what's to come. Succinct and to the point, they made a Kademlia and figured out more ways to make it faster.

4.1 Optimized contact accounting

This section is great, it builds on 2.2 and 2.4 by relaxing some rules around address book management and throwing my programming note out the window.

Since keeping your k-bucket address book up to date every time you learn about a new peer will send a torrent of ping requests to everyone you know it's suggested that you keep a "replacement cache" of peers and in the course of events that you need a new peer to fill your address book, pull from this cache.

Additionally since networks are unreliable, it's totally valid to lose UDP packets, this doesn't mean you should tear up a contact because they didn't reply once. Instead they suggest marking the node "stale" after failing to respond to 5 messages, and only removing them from a bucket if the bucket is full and you have valid peers in your replacement cache. This should keep you from destroying your ability to function if you disconnect temporarily from the network.

4.2 Accelerated lookups

This happens by growing our k-buckets even more. In general (section 2.4) we are splitting buckets only when we're in them so we can keep more information about peers that are closer to us. In practice they found it reduced the number of lookups (from log₂n to log₂ᵦn) and they suggest β = 5. They can't prove how efficient other DHTs are but they think this makes theirs the best in a general sense.

5 Summary

I mean I think this is the best DHT in the general sense so that's got to count for something. (Having only studied this particular one...) But it's probably best to close with their own summary.

"With its novel XOR-based metric topology, Kademlia is the first peer-to-peer system to combine provable consistency and performance, latency-minimizing routing, and a symmetric, unidirectional topology. Kademlia furthermore intro-duces a concurrency parameter, α, that lets people trade a constant factor in bandwidth for asynchronous lowest-latency hop selection and delay-free fault recovery. Finally, Kademlia is the first peer-to-peer system to exploit the fact that node failures are inversely related to uptime."

What's next?

Now I got to make the thing!

If you notice anything I got wrong or would like to add, please do let me know! I'm @reconbot on twitter and @reconbot@toot.cafe on mastodon!