

Nodes with lots of data

Hello,

this is a best-practice / performance question:

I have a lot of nodes, and each node holds a large amount of data.
One architectural solution I have considered is keeping the nodes "thin":
each node would hold only the minimum data needed to maintain a graph of nodes and relationships.
The rest of each node's data would be saved in a different storage location (a different DB) and fetched on demand.
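
To make the idea concrete, here is a minimal in-memory sketch in Python. It is not tied to Neo4j or any real driver: `ThinNode`, `PayloadStore`, and `friends_of_friends` are hypothetical names invented for illustration, and the dictionary-backed store only stands in for whatever separate DB would hold the heavy data.

```python
from dataclasses import dataclass, field


@dataclass
class ThinNode:
    """Graph node holding only what a traversal needs (hypothetical type)."""
    node_id: str
    label: str
    neighbors: set = field(default_factory=set)  # ids of related nodes


class PayloadStore:
    """Stand-in for the separate database that holds the heavy data."""

    def __init__(self):
        self._data = {}
        self.fetch_count = 0  # counts how often the external store is hit

    def put(self, node_id, payload):
        self._data[node_id] = payload

    def get(self, node_id):
        self.fetch_count += 1
        return self._data[node_id]


def friends_of_friends(graph, start_id):
    """Two-hop traversal that uses only the thin nodes (no payload access)."""
    direct = graph[start_id].neighbors
    result = set()
    for friend_id in direct:
        result |= graph[friend_id].neighbors
    return result - direct - {start_id}


graph = {
    "u1": ThinNode("u1", "User", {"u2"}),
    "u2": ThinNode("u2", "User", {"u1", "u3"}),
    "u3": ThinNode("u3", "User", {"u2"}),
}
store = PayloadStore()
for uid in graph:
    store.put(uid, {"pictures": [f"{uid}_photo.jpg"]})

# The traversal touches no payloads; heavy data is fetched only for the result.
candidates = friends_of_friends(graph, "u1")
payloads = {uid: store.get(uid) for uid in candidates}
```

In this sketch the traversal never reads the heavy data, and the external store is queried only for the nodes the query actually returns.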

An example I thought about is Facebook:
Let's assume that there are nodes for Facebook users.
Each node is connected to lots of other nodes (users, posts, pages, likes...).
I guess that a user node doesn't hold all of its data, for example all of the user's pictures; those pictures are saved in a different storage location (a different DB).
Assuming Facebook's architecture really does work this way, the reason for it would be the inefficiency of holding large amounts of data in the nodes.

My question is: when nodes carry large amounts of data, should the Facebook approach described above be adopted? If not necessarily, does that mean nodes with large amounts of data in them can still deliver good performance?

Thanks in advance,
Boris

2 REPLIES

Hi Boris,

This is a good question. One approach we have taken where I work, and that I have adopted in some side projects, is to store the raw data in the graph itself. How much data are you thinking of?

I have some nodes with approximately 140 properties on them in one of my projects. Storing the raw data in a separate "raw data node" means you can keep your data in the database without tying it to the graph structure. Then you can query the raw data nodes as needed to build or add to your graph structure.
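
A rough sketch of that pattern, written as plain Python over an in-memory dictionary rather than real Neo4j calls. The labels (`Raw`, `Person`, `City`), the `LIVES_IN` relationship, and the helper names are all made up for illustration; in a real database both steps would be queries against the graph.

```python
# Hypothetical in-memory graph: node id -> {"labels": set, "props": dict}
nodes = {}
edges = set()  # (source_id, relationship_type, target_id)


def add_node(node_id, labels, props):
    """Create the node, or overwrite it if the id already exists."""
    nodes[node_id] = {"labels": set(labels), "props": dict(props)}


def import_raw(records):
    """Import once: each source record becomes one unconnected Raw node."""
    for i, record in enumerate(records):
        add_node(f"raw-{i}", {"Raw"}, record)


def build_structure():
    """Read the Raw nodes and derive Person/City nodes plus LIVES_IN edges."""
    for node in list(nodes.values()):
        if "Raw" not in node["labels"]:
            continue
        props = node["props"]
        person_id = f"person-{props['name']}"
        city_id = f"city-{props['city']}"
        add_node(person_id, {"Person"}, {"name": props["name"]})
        add_node(city_id, {"City"}, {"name": props["city"]})
        edges.add((person_id, "LIVES_IN", city_id))


import_raw([
    {"name": "Ann", "city": "Oslo", "notes": "long free text ..."},
    {"name": "Bob", "city": "Oslo", "notes": "more free text ..."},
])
build_structure()
```

The raw nodes keep every original field, while the derived nodes carry only what the model needs, so the structure can be rebuilt or extended later without importing again.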

The main tradeoff is storage vs. speed. Generally it is faster to import once and then work within the graph itself to build things. But this does mean you will use a lot more storage, because you are also holding the raw data. If you store the data in a separate location, you then have to rely on the speed of importing new data as needed.

This also comes down to how you plan to add to your graph model. If you have enough RAM and storage, then storing the raw data may not be a big issue. However, if there is too much data for you to afford keeping it in the graph, then keeping it separate can work.

I have personally gravitated toward storing raw data in the graph and using that to build out my graph model. That means I only import data once.

I hope that helps.

Also, if you are using Neo4j 4.x with Fabric, you can compartmentalize your raw data in its own database. That is something on my radar to try.