cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Creating relationships and querying data across multiple nodes for a full outer join

Hi,
I am pretty new to this and I have been failing on trying to find a solution
Let's say I have 3 data sets

Values Table
L1,L2,Value
BU,5,1200
BV,12,1400
BQ,J,1000
BI,J,12

Values Mapping
ID, L1, L2
3,BU,5
2,BV,13
4,BU,J
5,BV,J

Start/End Dates
ID,StartDate,EndDate
1,2020,2021
2,2021,2022
3,2022,2023
4,2023,2024

Goal is to get following result, effectively doing a full outer join on the 3 tables

ID, L1, L2, StartDate, EndDate, Value
1, null, null, 2020, 2021, null
2, BV, 13, 2021, 2022, null
3, BU, 5, 2022, 2023, 1200
4, BU, J, 2023, 2024, null
5, BV, J, null, null, null
null, BV, 12, null, null, 1400
null, BQ, J, null, null, 1000
null, BI, J, null, null, 12

Question is whether this behavior is achievable through a graph database and how would i go about implementing it in neo4j

3 REPLIES 3

Hi ShadyNawara,

welcome to Neo4j and the graph universe!

My question to your query would be: How would you like the information to be stored in the graph? Would you like to keep the 3 datasets separate or would you like the joint information to be on nodes of the graph? What would be the relationships? I can offer you solutions to both modeling approaches if you tell me which one you prefer

Regards,
Elena

Hi Elena,
Thank for replying and helping me through this.
The data sets come from different files and I think I would like to keep them separate.
The relationships would be between equal properties for example:
Values Table <-> Values Mapping (when both fields Match)
L1 <-> L1
L2 <-> L2
in the example only "BU,5,1200" would link to "3,BU,5"

Values Mapping <-> Start/End Dates
ID <-> ID when IDs match (2<->2, 3<->3 and so on)

Hi ShadyNawara,

ok, that is of course possible. However, I would recommend to think about a better modelling way. Your current suggestion has several cons:

  • the graph you obtain is not fully connected. There are only some nodes connected.
  • you have the same information listed out several times (e.g. the L1 value is present in more than 1 node). The beauty of the graph database is that you do not need these "foreign" keys but can model everything as an edge.
  • For your query (or any query) you are not able to use the relationships that you envision which is the strength of graph databases.

But to show you how you could implement your model:

  1. load in the data (I assumed for now these are csv files):
LOAD CSV WITH HEADERS FROM 'file:///valuesTable.csv' as row
CREATE (v:ValuesTable {L1:row.L1, L2:row.L2, Value:row.Value});

LOAD CSV WITH HEADERS FROM 'file:///valuesMapping.csv' as row
CREATE (v:ValuesMapping {ID:row.ID, L1:row.L1, L2:row.L2});

LOAD CSV WITH HEADERS FROM 'file:///startEndDates.csv' as row
CREATE (v:StartEndDate {ID:row.ID, StartDate:row.StartDate, EndDate:row.EndDate});
  1. create your desired relationships:
MATCH (a:ValuesTable), (b:ValuesMapping) 
WHERE a.L1 = b.L1 AND a.L2 = b.L2 
CREATE (a)-[:IS_SAME_VALUE_PAIR]->(b);

MATCH (a:ValuesMapping), (b:StartEndDate) 
WHERE a.ID = b.ID 
CREATE (a)-[:HAS_SAME_ID_AS]->(b)
  1. Finally the table you want to have is not really possible to get with one query. At least not that I can think of. What you could do is find 3 tables that join all information, i.e. for the first table
MATCH (a:ValuesTable) 
OPTIONAL MATCH (a)-[:IS_SAME_VALUE_PAIR]->(b:ValuesMapping)
OPTIONAL MATCH (b)-[:HAS_SAME_ID_AS]->(c:StartEndDate) 
RETURN b.ID, a.L1, a.L2,c.StartDate, c.EndDate, a.Value;

which gives you:
Table 1:

b.ID a.L1 a.L2 c.StartDate c.EndDate a.Value
"3" "BU" "5" "2022" "2023" "1200"
null "BV" "12" null null "1400"
null "BQ" "J" null null "1000"
null "BI" "J" null null "12"

So, I guess long story short is that you would really need a better structure within your graph to have all information accessible.
Of course, I do not know your data well, but what I can suggest from what I see are different structures which might or might not be similarly useful based on the queries you want to run against them:

  1. Collect all information in a single node when you read in the data. But then you are also not using the graph relationships.

  2. Join the first two tables when reading in the data and have the year dates as separate nodes which you then connect to.

a) load in the values:

LOAD CSV WITH HEADERS FROM 'file:///valuesTable.csv' as row
CREATE (v:Value {L1:row.L1, L2:row.L2, Value:row.Value});

b) merge the values from the Values Mapping:

LOAD CSV WITH HEADERS FROM 'file:///valuesMapping.csv' as row
MERGE (v:Value {L1:row.L1, L2:row.L2}) 
SET v.ID = row.ID;

c) load in the dates:

'file:///startEndDates.csv' as row
MERGE (b:Date {year: row.StartDate})
MERGE (c:Date {year: row.EndDate})
MERGE (a:Value {ID: row.ID})
MERGE (a)-[:STARTS]->(b)
MERGE (a)-[:ENDS]->(c)

d) get the table

MATCH (a:Value)
OPTIONAL MATCH (DateEnd:Date)<-[:ENDS*0..1]-(a)-[:STARTS*0..1]->(DateStart:Date)
RETURN a.ID,a.L1,a.L2,DateEnd.Year, DateStart.Year, a.Value ORDER BY a.ID

This gives you exactly the table you are looking for:

a.ID a.L1 a.L2 DateEnd.Year DateStart.Year a.Value
"1" null null null null null
"2" "BV" "13" null null null
"3" "BU" "5" null null "1200"
"4" "BU" "J" null null null
"5" "BV" "J" null null null
null "BQ" "J" null null "1000"
null "BI" "J" null null "12"
null "BV" "12" null null "1400"

This way you can query your graph based on "meaningful" relationships. The graph looks like this:

However, you still have values showing up several times, like "BU". So, maybe it is worth having separate nodes for L1 and L2 additionally?

I hope this helps and I could show you several ways around the first steps in Neo4j 😉

Regards,
Elena