The 'crowning achievement' of a few months ago, within World, was updates to the pathfinding system. The fixes and speedups here literally made the game work, compared to the terrible game breaking behaviors beforehand. A result of this update was a particularly rare exception thrown on the server. This exception resulted in an invalid path, but when recomputed raw (without the path cache) the answer was correct. I've run for months with the bug in play, but just ignored. The game runs fine without it being fixed; at least, as far as is obvious to see. Nonetheless an outstanding bug (a crash bug) in this glorious system could not be sustained.
First an explanation of the pathing and mapping system. Part of me wants to say "it's not very complicated", but I'll leave that analysis to others. The map, as far as the server cares, is a connected directed graph. Each vertex is a possible location for an entity to exist. Edges are walkable/movable paths, the length of which determines the time to walk the edge. The graph is built up as a hexagonal grid, but in actuality the hexagonal limitation is not implied by the graph, it's merely a consequence of the graph's construction. As far as the mapping system cares, it's a generic graph with no basis on the game geometry, terrain, or environment.
Finding paths, then, is a fairly straightforward application of well known computer science algorithms. Any game programmer worth their sodium knows the A* pathfinding algorithm. It's a natural fit in this case, and so it is used.
The game, by its nature, requests a ridiculous number of paths per second. Due to constant formation and range adjustments, as well as navigating around other entities, paths are needed very frequently. Compute time is important in this case. I opted to build a caching system for common types of paths. Turns out the cache hit rate is about 99%, so it's well worth it.
The bug in question is a caching bug, so a bit more detail there is relevant. Each map-vertex stores a set of known target cells, and which neighbor to move to to get to it. That is, a cell 'A' may remember paths to 'B', 'C', and 'D', which lead through its neighbor 'B', 'C', and 'C' respectively. The paths are incremental. Rebuilding a path from cache is a simple linear sequence of hashtable lookups to determine the next step.
When a path is not available from the cache, A* is executed to build a path. However, at each A* vertex inspection, if there's a path directly to the end target, we can shortcut the remainder of the A* traversal and simply append the remaining known path to the newly calculated derived path. At this point the derived path's entries' caches are updated with both the tail of the derived path and the known path. Thereby the cache is augmented in the case of a found partial path.
That's it. Where's the bug?
The bug is that somehow, through some sequence of queries, a new path is requested that is broken. The cache implies a sequence to reach the target, but one of the vertices along the way has no cache entry to the target. What's missing?
Throughout the last months when the exception would pop up I'd inspect the cache data. However, inspecting 10000 vertices in a small map isn't exactly tractable. Confirming the cache coherency after all changes would make the game unplayably slow, and so never exhibit the bug. Then, some perfectly reasonable person suggested I just brute force the damn thing and find the problematic sequence.
So I rigged up a unit test to build a 10x10 map cell grid and exhaustively checked all possible sequences of 3 queried paths, and then test the cache coherency. 2 hours later I realized that I'd just executed an O(n^6) algorithm with n=100, broke into the debugger, and realized that I'd be waiting a few years for results, and slapped myself in the face.
So I rigged up a unit test to build a 10x1 map cell grid and bla bla bla. It ran, and it.. didn't crash. Figures.
So I rigged up a unit test to build a 4x4 map cell grid instead. 35 minutes later an exception came up. I checked the pattern of queries. It was not what I expected to see. I was expecting a bug in the partial-path re-caching behavior. It's the most complicated thing in the system and therefore the most likely to fail. What I saw took another 30 minutes to just understand. I made pictures.
Those are the relevant nodes. Actually the node above b is relevant too but I forgot when I made the image. The first queried path was from a to h:
The second path was from b to i:
When I first saw this path I was briefly confused. It seemed curious that the path through e to h would differ between the two. However, given A* on this graph it is reasonable because of the differing endpoint. g is closer in cartesian distance to its target than f, so the path is subtly different. It seemed weird but possibly a problem. I didn't know how yet though.The third path was from c to h:
This path has a partial-cache hit at d. That path is d, e, g, h, the purple path. However, you'll note that the path obtained (in orange) is d, e, f, h! Well, it turns out the cache is a single store, and so only the first path from e to h is stored. d knows that it can get to h through e (and historically this path went through g). However, when the path resolution steps into e, e says the next step is f, because that was the first path from e to h stored.
This bit of inconsistency result in some confusion when the breaking query fires, from c to f:
The queried path should be entirely in cache, since it lies along the orange, previously queried and cached, path. However, when the path cache steps into d, d has no known paths to f. d has only ever pathed through g, and so the cache read has failed.
Why did this fail? It's because of a combination of three assumptions that are not always correct:
- Given a point A and B, there will only ever be one path determined from A to B. This is true if B is the final destination, but not if a vertex beyond B is the target.
- A partial cache hit, already being written to the cache, does not need to be written to the cache. This is true if the earlier (incorrect) assumption is, but since the known path can differ internally, the newly found path needs to be written to the cache as well. When this is written the c to f path is correctly written when the c to g path is finalized.
Bug fixed. :)
Next time, the new terrain engine. Here's a preview.