Neo4j in production - Neo4j in Action

Database Reference

In-Depth Information

The short answer is that this is a very, very hard problem for graph databases to solve. So

hard, in fact, that this problem falls into a category of “really hard mathematical problems”

that have been given their own name: “NP-hard problems” ( http://en.wikipedia.org/wiki/

NP-hard ) .

Traditional sharding typically involves splitting a full data set such that different portions

are stored on different instances. Provided certain conditions hold true, this approach is a

way to scale large databases while at the same time maintaining a predictable level of per-

formance as the data grows.

The key to the preceding statement is “provided certain conditions hold true.” In most

shard-enabled databases, the keys against which data is held in a shard are stable and pre-

dictable. Additionally, the access patterns generally used are such that you can look up the

data you need from one shard (with a key), look up data in another, and then have the ap-

plication take care of wiring it back together if required.

The sweet spot for graphs, however, and for Neo4j in particular, lies in local graph quer-

ies—starting at a given point and exploring the data and immediate connection patterns

around it. This may involve traversing very different types of data—the types of data that

would typically be spread across different shards. Say you had a query that said, “Given

customer A, tell me which stores are within 1 km of where she lives where any of her im-

mediate friends have purchased a product greater than £100 in the last month.” If you de-

cidedtopartition datasuchthatusersandfriendswerestoredinoneshard,andretail stores

in a separate area, with purchase history in yet another, a built-in graph query would end

up following data spread across physical servers, resulting in unpredictable and slow data-

retrieval times as the network is crossed and navigated.

You may argue that we could try to partition the data based on the pattern described, which

may or may not be possible. This would, however, potentially negate other traversals that

you may also want to perform—ones that start from some other node within the broader

data set, not specifically customer A, and explore from there.

Until such time as we can perform such traversals across network boundaries in an accept-

ableandpredictablemanner,thebestsolutionforsuchcasesinvolvingNeo4jintheinterim

is to make use of cache sharding. The sharding of Neo4j is still actively being looked into

as a problem in its own right, but it's not a problem that looks like it will be solved in the

immediate future.

Search WWH ::

Custom Search

Home