Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
05-12-2020 12:54 AM
Hi,
I am pretty new to this and I have been failing on trying to find a solution
Let's say I have 3 data sets
Values Table
L1,L2,Value
BU,5,1200
BV,12,1400
BQ,J,1000
BI,J,12
Values Mapping
ID, L1, L2
3,BU,5
2,BV,13
4,BU,J
5,BV,J
Start/End Dates
ID,StartDate,EndDate
1,2020,2021
2,2021,2022
3,2022,2023
4,2023,2024
Goal is to get following result, effectively doing a full outer join on the 3 tables
ID, L1, L2, StartDate, EndDate, Value
1, null, null, 2020, 2021, null
2, BV, 13, 2021, 2022, null
3, BU, 5, 2022, 2023, 1200
4, BU, J, 2023, 2024, null
5, BV, J, null, null, null
null, BV, 12, null, null, 1400
null, BQ, J, null, null, 1000
null, BI, J, null, null, 12
Question is whether this behavior is achievable through a graph database and how would i go about implementing it in neo4j
05-12-2020 02:15 AM
Hi ShadyNawara,
welcome to Neo4j and the graph universe!
My question to your query would be: How would you like the information to be stored in the graph? Would you like to keep the 3 datasets separate or would you like the joint information to be on nodes of the graph? What would be the relationships? I can offer you solutions to both modeling approaches if you tell me which one you prefer
Regards,
Elena
05-12-2020 02:36 AM
Hi Elena,
Thank for replying and helping me through this.
The data sets come from different files and I think I would like to keep them separate.
The relationships would be between equal properties for example:
Values Table <-> Values Mapping (when both fields Match)
L1 <-> L1
L2 <-> L2
in the example only "BU,5,1200" would link to "3,BU,5"
Values Mapping <-> Start/End Dates
ID <-> ID when IDs match (2<->2, 3<->3 and so on)
05-12-2020 07:13 AM
Hi ShadyNawara,
ok, that is of course possible. However, I would recommend to think about a better modelling way. Your current suggestion has several cons:
But to show you how you could implement your model:
LOAD CSV WITH HEADERS FROM 'file:///valuesTable.csv' as row
CREATE (v:ValuesTable {L1:row.L1, L2:row.L2, Value:row.Value});
LOAD CSV WITH HEADERS FROM 'file:///valuesMapping.csv' as row
CREATE (v:ValuesMapping {ID:row.ID, L1:row.L1, L2:row.L2});
LOAD CSV WITH HEADERS FROM 'file:///startEndDates.csv' as row
CREATE (v:StartEndDate {ID:row.ID, StartDate:row.StartDate, EndDate:row.EndDate});
MATCH (a:ValuesTable), (b:ValuesMapping)
WHERE a.L1 = b.L1 AND a.L2 = b.L2
CREATE (a)-[:IS_SAME_VALUE_PAIR]->(b);
MATCH (a:ValuesMapping), (b:StartEndDate)
WHERE a.ID = b.ID
CREATE (a)-[:HAS_SAME_ID_AS]->(b)
MATCH (a:ValuesTable)
OPTIONAL MATCH (a)-[:IS_SAME_VALUE_PAIR]->(b:ValuesMapping)
OPTIONAL MATCH (b)-[:HAS_SAME_ID_AS]->(c:StartEndDate)
RETURN b.ID, a.L1, a.L2,c.StartDate, c.EndDate, a.Value;
which gives you:
Table 1:
b.ID | a.L1 | a.L2 | c.StartDate | c.EndDate | a.Value |
---|---|---|---|---|---|
"3" | "BU" | "5" | "2022" | "2023" | "1200" |
null | "BV" | "12" | null | null | "1400" |
null | "BQ" | "J" | null | null | "1000" |
null | "BI" | "J" | null | null | "12" |
So, I guess long story short is that you would really need a better structure within your graph to have all information accessible.
Of course, I do not know your data well, but what I can suggest from what I see are different structures which might or might not be similarly useful based on the queries you want to run against them:
Collect all information in a single node when you read in the data. But then you are also not using the graph relationships.
Join the first two tables when reading in the data and have the year dates as separate nodes which you then connect to.
a) load in the values:
LOAD CSV WITH HEADERS FROM 'file:///valuesTable.csv' as row
CREATE (v:Value {L1:row.L1, L2:row.L2, Value:row.Value});
b) merge the values from the Values Mapping:
LOAD CSV WITH HEADERS FROM 'file:///valuesMapping.csv' as row
MERGE (v:Value {L1:row.L1, L2:row.L2})
SET v.ID = row.ID;
c) load in the dates:
'file:///startEndDates.csv' as row
MERGE (b:Date {year: row.StartDate})
MERGE (c:Date {year: row.EndDate})
MERGE (a:Value {ID: row.ID})
MERGE (a)-[:STARTS]->(b)
MERGE (a)-[:ENDS]->(c)
d) get the table
MATCH (a:Value)
OPTIONAL MATCH (DateEnd:Date)<-[:ENDS*0..1]-(a)-[:STARTS*0..1]->(DateStart:Date)
RETURN a.ID,a.L1,a.L2,DateEnd.Year, DateStart.Year, a.Value ORDER BY a.ID
This gives you exactly the table you are looking for:
a.ID | a.L1 | a.L2 | DateEnd.Year | DateStart.Year | a.Value |
---|---|---|---|---|---|
"1" | null | null | null | null | null |
"2" | "BV" | "13" | null | null | null |
"3" | "BU" | "5" | null | null | "1200" |
"4" | "BU" | "J" | null | null | null |
"5" | "BV" | "J" | null | null | null |
null | "BQ" | "J" | null | null | "1000" |
null | "BI" | "J" | null | null | "12" |
null | "BV" | "12" | null | null | "1400" |
This way you can query your graph based on "meaningful" relationships. The graph looks like this:
However, you still have values showing up several times, like "BU". So, maybe it is worth having separate nodes for L1 and L2 additionally?
I hope this helps and I could show you several ways around the first steps in Neo4j 😉
Regards,
Elena
All the sessions of the conference are now available online