Database Reference
In-Depth Information
The short answer is that this is a very, very hard problem for graph databases to solve. So
hard, in fact, that this problem falls into a category of “really hard mathematical problems”
that have been given their own name: “NP-hard problems” ( http://en.wikipedia.org/wiki/
NP-hard ) .
Traditional sharding typically involves splitting a full data set such that different portions
are stored on different instances. Provided certain conditions hold true, this approach is a
way to scale large databases while at the same time maintaining a predictable level of per-
formance as the data grows.
The key to the preceding statement is “provided certain conditions hold true.” In most
shard-enabled databases, the keys against which data is held in a shard are stable and pre-
dictable. Additionally, the access patterns generally used are such that you can look up the
data you need from one shard (with a key), look up data in another, and then have the ap-
plication take care of wiring it back together if required.
The sweet spot for graphs, however, and for Neo4j in particular, lies in local graph quer-
ies—starting at a given point and exploring the data and immediate connection patterns
around it. This may involve traversing very different types of data—the types of data that
would typically be spread across different shards. Say you had a query that said, “Given
customer A, tell me which stores are within 1 km of where she lives where any of her im-
mediate friends have purchased a product greater than £100 in the last month.” If you de-
cidedtopartition datasuchthatusersandfriendswerestoredinoneshard,andretail stores
in a separate area, with purchase history in yet another, a built-in graph query would end
up following data spread across physical servers, resulting in unpredictable and slow data-
retrieval times as the network is crossed and navigated.
You may argue that we could try to partition the data based on the pattern described, which
may or may not be possible. This would, however, potentially negate other traversals that
you may also want to perform—ones that start from some other node within the broader
data set, not specifically customer A, and explore from there.
Until such time as we can perform such traversals across network boundaries in an accept-
ableandpredictablemanner,thebestsolutionforsuchcasesinvolvingNeo4jintheinterim
is to make use of cache sharding. The sharding of Neo4j is still actively being looked into
as a problem in its own right, but it's not a problem that looks like it will be solved in the
immediate future.
Search WWH ::




Custom Search