Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
09-06-2018 11:22 AM
Community,
Apologies first, I'm new to GraphDB and Cypher. I did research and try several attempts to resolve this issue.
I have a self-referencing hierarchy from a legacy SQL data source which contains levels 4 - 1 with lvl4 being highest and lvl1 being the lowest. They each contain id and name properties. Each record results in a distinct path.
lvl4[A]
lvl3 [A of B]
lvl2 [A of B of C]
lvl1 [A-D]
lvl2 [A of B of C]
lvl2 [A of B of C]
lvl2 [A of B]
lvl2 [A of B]
lvl1 [G]...
Data Example
dim_id | lvl1_id | lvl1_name | lvl2_id | lvl2_name | lvl3_id | lvl3_name | lvl4_id | lvl4_name | |
---|---|---|---|---|---|---|---|---|---|
1000 | 428227 | Customer Item Generic | 300348 | Customer Item | 200692 | Customer Enterprise | 100255 | Customer Services | |
1001 | 431077 | Small Customer Item Latin America | 300348 | Customer Item | 200692 | Customer Enterprise | 100255 | Customer Services | |
1002 | 428138 | Small Customer Item US | 300348 | Customer Item | 200692 | Customer Enterprise | 100255 | Customer Services | |
1003 | 431073 | Customer Item Priority | 300348 | Customer Item | 200692 | Customer Enterprise | 100255 | Customer Services | |
1004 | 431078 | Customer Item Latin America | 300348 | Customer Item | 200692 | Customer Enterprise | 100255 | Customer Services |
Can someone help me with the cypher to create a hierarchy relationship for this data?
lvl1 to lvl2 relation called tg
lvl2 to lvl3 relation called pg
lvl3 to lvl4 relation called sg
-if necessary-
lvl4 to root relation called dg
09-06-2018 11:47 AM
In your example data levl4, levl3, and lvl2 are all the same for all rows. Is that true only for the example, or are those expected to change?
Do you know if the resulting data is meant to be a tree?
If two different rows share the same name and id at a certain level, can we be assured that all levels above are also in common? For example, this example data has all rows with lvl2, lvl3, and lvl4 in common for all rows. Would there ever be a situation where multiple rows have the same lvl2 or lvl3 but the levels above them would be different? I'm asking because I'm wondering if items with the same id are meant to refer to the same node in the graph (and thus id would be unique for nodes at that level) or if we could have duplicate nodes at a given level with the same id (the id property not being unique for nodes at that level) because the nodes above them in the hierarchy are different.
It would help if you could provide the labels (node types) you want to use in the graph as well, and maybe a quick visual of the overall hierarchy.
For example, something like this, but replacing the more generic labels here with something that is a bit more specific to your domain if possible:
(:lvl1)-[:tg]->(:lvl2)-[:pg]->(:lvl3)-[:sg]->(:lvl4)
09-06-2018 12:07 PM
In order to be brief I only included one roll up of a tree hierarchy. In actuality the dataset contains more than 10k unique paths. So there is a lot of duplication in the higher levels. Lvl4 (div) will continue 14 unique names duplicated across 10k records. With each level containing more unique names than its parent.
Lvl4 = Div
Lvl3 = Seg
Lvl2 = Pdg
Lvl1 = Ctg
Thanks for the help so far. I hope I explained this better this time.
09-06-2018 12:21 PM
Thanks for the labels, that will help, though the biggest thing that's missing is information on uniqueness of nodes per level (or per label, now).
With graph databases we tend to model in a normalized fashion, so for example at the top level, for :Div nodes, we would only have 14 nodes and not do any duplication, with relationships down to nodes at the next level.
The question then is if the ids here are meant to be unique across nodes at that level (nodes with that label), or if there's going to be duplication based upon differing nodes at the hierarchy above.
As an example, would we ever have a situation where we have two (or more) identical level 3 nodes, but with different nodes at level 4? Or a situation with identical level 2 nodes, but a difference at level 3 or 4?
Or put another way, is this meant to be a tree structure, where every node in the tree only has a single parent and each id is unique to a node at the same level? Or would we have duplicates of the same node at a given level, because they have a different hierarchy above them?
09-06-2018 12:52 PM
It is a tree structure. If you combined lvls 4+3+2+1 it's unique. Lvl1 is completely unique, each 1 has a single parent 2 who has a single parent 3 etc. Top down 4 is the smallest set of node names but it replicated the most across the dataset.
There is a unique code associated with each row the dim_id property. That is used to identify the "relationship" from other data. I modified my earlier example to bridge multiple segments, products, ps_groups. The tree example below should match the data grid where lvl4 repeats for all 5 rows, lvl3 repeats 2x, lvl2 has only 1 repeat, and lvl1 is all unique.
dim_id | lvl1_id | lvl1_name | lvl2_id | lvl2_name | lvl3_id | lvl3_name | lvl4_id | lvl4_name | |
---|---|---|---|---|---|---|---|---|---|
row_id | psg_id | psg_name | ps_id | ps_name | seg_id | seg_name | div_id | div_name | |
1000 | 428227 | Customer Item Generic | 300348 | Customer Item 1 | 200692 | Customer Enterprise 1 | 100255 | Customer Services | |
1001 | 431077 | Small Customer Item Latin America | 300348 | Customer Item 1 | 200692 | Customer Enterprise 1 | 100255 | Customer Services | |
1002 | 428138 | Small Customer Item US | 300349 | Customer Item 2 | 200692 | Customer Enterprise 1 | 100255 | Customer Services | |
1003 | 431073 | Customer Item Priority | 300349 | Customer Item 3 | 200693 | Customer Enterprise 2 | 100255 | Customer Services | |
1004 | 431078 | Customer Item Latin America | 300350 | Customer Item 4 | 200693 | Customer Enterprise 2 | 100255 | Customer Services |
DIV (Customer Services) SEG (Customer Enterprise 1) PS (Customer Item 1) PSG(Customer Item Generic) PSG(Small Customer Item Latin America) PS (Customer Item 2) PSG(Small Customer Item US) SEG (Customer Enterprise 2) PS (Customer Item 3) PSG(Customer Item Priority) PS (Customer Item 4) PSG(Customer Item Latin America)
09-06-2018 01:13 PM
Great, so now that we know this is a tree structure, and have labels for the nodes, and know that ids are unique per label, we can create the constraints we need, then do the import.
Using the node labels you provided, let's create the following constraints (run each separately)
CREATE CONSTRAINT ON (n:Div) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT ON (n:Seg) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT ON (n:Pdg) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT ON (n:Ctg) ASSERT n.id IS UNIQUE
That will prevent us from accidentally created duplicate nodes, and we get an index with each of those so lookup by any of these nodes by id should be fast.
Now for the actual import. Provided this is a CSV file (in the import
directory of your Neo4j home directory) we can do the following:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'myImportFile.csv' as line
MERGE (c:Ctg {id: toInteger(line.lvl1_id)})
SET c.name = line.lvl1_name
MERGE (p:Pdg {id: toInteger(line.lvl2_id)})
ON CREATE SET p.name = line.lvl2_name
MERGE (s:Seg {id: toInteger(line.lvl3_id)})
ON CREATE SET s.name = line.lvl3_name
MERGE (d:Div {id: toInteger(line.lvl4_id)})
ON CREATE SET d.name = line.lvl4_name
MERGE (c)-[:tg]->(p)
MERGE (p)-[:pg]->(s)
MERGE (s)-[:sg]->(d)
You'll want to double-check that the labels and relationship types (and directions) are correct, but the output should be your full hierarchy tree.
09-11-2018 08:07 AM
Thank you for the help. The constraints and merge worked perfectly for what I needed. Sorry it's been a while but other work items took priority. I'm back in the GraphDB game again and this really helps with my understanding.
All the sessions of the conference are now available online