06-28-2020 05:00 PM
In two previous articles, we covered aspects of graph data modeling such as categorical variables, and how relationships work. In this article, let’s address how to identify things in your graph with keys.
![](upload://o30BNltBDdqpqCeT2nSVv3TEmdl.png)

A graph key is a property or set of properties that helps you identify a node or relationship in a graph.
Keys are frequently used as the starting point for a graph traversal, or as a condition to constrain a query.
Before we get into different options for keys in Neo4j, let’s list the attributes of what makes for a really great database key.
Smart keys are usually compound values which encode information into a key. Imagine we had an ordering system and we identified a customer order as 2020-06-19-VA-9912. It might seem convenient that we’ve encoded the order date (2020-06-19), the state of the order (Virginia) and the order number (9912) into a single key. In practice though, smart keys usually end up a disaster, for several reasons:
The key thing to notice about smart keys is that they always have low opacity; that’s the point of them.
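To make the low-opacity point concrete, here is a minimal Python sketch. The function name `parse_smart_key` and the exact key layout (date, state code, order number, separated by hyphens) are assumptions for illustration only. It shows that anyone who sees such a key can recover the embedded facts with a single string split.

```python
from datetime import date


def parse_smart_key(key: str) -> dict:
    """Split a smart key of the assumed form YYYY-MM-DD-ST-NNNN into its parts."""
    year, month, day, state, number = key.split("-")
    return {
        "order_date": date(int(year), int(month), int(day)),
        "state": state,
        "order_number": int(number),
    }


# The example key from the text, written with plain hyphens.
parsed = parse_smart_key("2020-06-19-VA-9912")
print(parsed)
```

If any of the encoded facts changes (say the order moves to another state), the key itself would have to change too, which is part of why this style of key tends to cause trouble.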
In relational databases, it’s typical to define compound keys of two or more attributes, but in my view that never makes sense in a graph. A common reason to use a compound key is a dependency between columns. For example, maybe customer code and state code together are what uniquely identify a record. But since graphs let you have as many nodes as you want, this Cypher code:
MATCH (r:Record { ccode: "X", scode: "Y" })
RETURN r;
Will usually be worse than this:
MATCH (:A { scode: "Y" })-[:LINKED_TO]->(r:Record { ccode: "X" })
RETURN r;
The point is that in most cases, a good data model can eliminate the need for a compound key.
Every node and relationship gets its own “internal” identifier which you can access with the id() function.
![](upload://8Q7VVFtSReqLBU6NVEYTEiUMvYP.png)
Internal ID of a node

The advantage of these IDs is that they’re always guaranteed to be there for you. And lookup by ID is very fast in Neo4j because of the way the graph storage in memory works. But internal node IDs (in my view) make for very bad application identifiers, for a number of reasons:
Basically, the only guarantee you get is that they are unique within that graph (not across the whole DBMS). While these IDs are opaque, the authority for the identifier is the database (not your application), and the uniqueness context is scoped to a single graph on a single system only.
Using APOC’s built-in UUID function, you can create UUIDs on the fly like this:
CREATE (m:Thing { id: apoc.create.uuid() });

![](upload://nFCz0tJlxnS7HZwbnjURnVKTw8l.png)
An APOC-generated UUID
These are quite good, because you are the authority and manage them yourself. They are stable and never need to change. They are effectively unique across all contexts, and they’re very opaque. They are 128-bit numbers that are pseudo-randomly generated. Practically speaking, you don’t have to worry about collisions: the space is so large that if you generate 103 trillion identifiers this way (and we’re pretty sure you’re going to be under that), your chances of a collision are still about one in a billion. Good enough.
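The collision figure can be checked with the standard birthday-bound approximation. The sketch below is in Python rather than Cypher so it can run without a database. It assumes the identifiers are version-4 (random) UUIDs, which carry 122 random bits out of 128; the helper name `collision_probability` is made up for this example.

```python
import math
import uuid

RANDOM_BITS = 122  # assumption: version-4 UUIDs, where 6 of the 128 bits are fixed


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n^2 / (2 * 2^bits))."""
    # expm1 keeps precision when the exponent is very close to zero.
    return -math.expm1(-(n * n) / (2 * 2**bits))


# 103 trillion identifiers, as in the text.
p = collision_probability(103 * 10**12)
print(f"collision probability: {p:.2e}")

# Python's standard library can generate the same kind of identifier.
new_id = str(uuid.uuid4())
print(new_id, len(new_id))
```

The result comes out very close to one in a billion. The second half of the snippet also shows the cost side: the usual text form of a UUID is 36 characters long.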
They come with downsides though.
Let’s face it, our source data usually comes from somewhere else. If we’re importing tweets from Twitter into a graph, all of those tweets have existing IDs. So a good approach will often be to adopt someone else’s ID scheme that came with your data import.
It’s tough to say what the pros of this approach are, because they depend on what the identifier is. The best we can do is go back to the principles we’re looking for (opacity, uniqueness, etc.) and evaluate an ID against them.
We can talk about specific negatives of adopting someone else’s identifiers though:
As a general recommendation, always store any upstream identifier that you can get your hands on, but don’t use it as your own identifier. Use it for correlation with your upstream system. There’s nothing wrong with choosing your own ID in addition to storing a remote identifier.
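One way to picture this recommendation is as a small Python sketch of the properties you might attach to an imported node. The function name `make_tweet_properties`, the property names `id` and `upstream_id`, and the sample tweet ID are all hypothetical; the point is only that the key you control and the key you inherited live side by side.

```python
import uuid


def make_tweet_properties(upstream_tweet_id: str) -> dict:
    """Build node properties: our own key plus the upstream ID for correlation."""
    return {
        "id": str(uuid.uuid4()),           # identifier we generate and control
        "upstream_id": upstream_tweet_id,  # stored only to correlate with the source
    }


# Made-up upstream value, purely for illustration.
props = make_tweet_properties("1277362511234567890")
print(props)
```

If the upstream system ever renumbers or reformats its identifiers, only `upstream_id` needs to be updated; anything that refers to your own `id` stays valid.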
Another common approach is to use an auto-incrementing number. Neo4j doesn’t support this out of the box, but it’s available in other libraries and is a familiar technique in the relational world. It’s usually not the best approach though, because:
That being said, this approach is still opaque (good), controlled by you (good), and compact and storage-efficient.
As of Neo4j 4.1.0, the database does not have regular b-tree relationship property indexes (it does support full-text indexes on relationship properties though). This has important consequences: it’s not possible to look up individual relationships quickly by an ID, because the database simply doesn’t store things that way. The way you find relationships is by looking up one (or both) of the incident nodes, like so:
MATCH (a:Person { id: 1 })-[r:KNOWS]->(b:Person { id: 2 })
RETURN r;
In this scenario, we’re effectively using the “from” and “to” nodes as the relationship key. The id() function still exists for relationships, and they all have internal Neo4j IDs, but we typically never need to assign property IDs to relationships. Not only are they locatable in this other way, but without property indexes, lookup by key wouldn’t be efficient anyway.
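The idea of reaching a relationship through its incident nodes can be mimicked with a toy in-memory structure in Python. This is only an analogy, not a description of how Neo4j stores data; the names `outgoing`, `add_relationship`, and `find_relationships` and the `since` property are invented for the example.

```python
from collections import defaultdict

# Toy graph: relationships are stored only under their start node's id,
# so there is no separate index for finding a relationship by its own key.
outgoing = defaultdict(list)  # start node id -> [(type, end node id, properties)]


def add_relationship(start_id, rel_type, end_id, **props):
    """Attach a relationship to its start node."""
    outgoing[start_id].append((rel_type, end_id, props))


def find_relationships(start_id, rel_type, end_id):
    """Look up the start node first, then filter its relationships."""
    return [
        props
        for (t, e, props) in outgoing.get(start_id, [])
        if t == rel_type and e == end_id
    ]


add_relationship(1, "KNOWS", 2, since=2015)
add_relationship(1, "KNOWS", 3, since=2018)

matches = find_relationships(1, "KNOWS", 2)
print(matches)
```

The lookup returns a list rather than a single item because nothing stops two nodes from being connected by more than one relationship of the same type, so the “from” and “to” pair narrows the search rather than guaranteeing a unique hit.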
Graph Data Modeling: Keys was originally published in Neo4j Developer Blog on Medium.