Neo4j

golsombe · ‎09-06-2018

Community,

Apologies first, I'm new to GraphDB and Cypher. I did research and try several attempts to resolve this issue.

I have a self-referencing hierarchy from a legacy SQL data source which contains levels 4 - 1 with lvl4 being highest and lvl1 being the lowest. They each contain id and name properties. Each record results in a distinct path.
lvl4[A]
lvl3 [A of B]
lvl2 [A of B of C]
lvl1 [A-D]
lvl2 [A of B of C]
lvl2 [A of B of C]
lvl2 [A of B]
lvl2 [A of B]
lvl1 [G]...

Data Example

dim_id	lvl1_id	lvl1_name	lvl2_id	lvl2_name	lvl3_id	lvl3_name	lvl4_id	lvl4_name
1000	428227	Customer Item Generic	300348	Customer Item	200692	Customer Enterprise	100255	Customer Services
1001	431077	Small Customer Item Latin America	300348	Customer Item	200692	Customer Enterprise	100255	Customer Services
1002	428138	Small Customer Item US	300348	Customer Item	200692	Customer Enterprise	100255	Customer Services
1003	431073	Customer Item Priority	300348	Customer Item	200692	Customer Enterprise	100255	Customer Services
1004	431078	Customer Item Latin America	300348	Customer Item	200692	Customer Enterprise	100255	Customer Services

Can someone help me with the cypher to create a hierarchy relationship for this data?
lvl1 to lvl2 relation called tg
lvl2 to lvl3 relation called pg
lvl3 to lvl4 relation called sg
-if necessary-
lvl4 to root relation called dg

Neo4j Desktop 1.0.20
No API or Driver necessary

andrew_bowman · ‎09-06-2018

In your example data levl4, levl3, and lvl2 are all the same for all rows. Is that true only for the example, or are those expected to change?

Do you know if the resulting data is meant to be a tree?

If two different rows share the same name and id at a certain level, can we be assured that all levels above are also in common? For example, this example data has all rows with lvl2, lvl3, and lvl4 in common for all rows. Would there ever be a situation where multiple rows have the same lvl2 or lvl3 but the levels above them would be different? I'm asking because I'm wondering if items with the same id are meant to refer to the same node in the graph (and thus id would be unique for nodes at that level) or if we could have duplicate nodes at a given level with the same id (the id property not being unique for nodes at that level) because the nodes above them in the hierarchy are different.

It would help if you could provide the labels (node types) you want to use in the graph as well, and maybe a quick visual of the overall hierarchy.

For example, something like this, but replacing the more generic labels here with something that is a bit more specific to your domain if possible:

(:lvl1)-[:tg]->(:lvl2)-[:pg]->(:lvl3)-[:sg]->(:lvl4)

golsombe · ‎09-06-2018

In order to be brief I only included one roll up of a tree hierarchy. In actuality the dataset contains more than 10k unique paths. So there is a lot of duplication in the higher levels. Lvl4 (div) will continue 14 unique names duplicated across 10k records. With each level containing more unique names than its parent.
Lvl4 = Div
Lvl3 = Seg
Lvl2 = Pdg
Lvl1 = Ctg

Thanks for the help so far. I hope I explained this better this time.

andrew_bowman · ‎09-06-2018

Thanks for the labels, that will help, though the biggest thing that's missing is information on uniqueness of nodes per level (or per label, now).

With graph databases we tend to model in a normalized fashion, so for example at the top level, for :Div nodes, we would only have 14 nodes and not do any duplication, with relationships down to nodes at the next level.

The question then is if the ids here are meant to be unique across nodes at that level (nodes with that label), or if there's going to be duplication based upon differing nodes at the hierarchy above.

As an example, would we ever have a situation where we have two (or more) identical level 3 nodes, but with different nodes at level 4? Or a situation with identical level 2 nodes, but a difference at level 3 or 4?

Or put another way, is this meant to be a tree structure, where every node in the tree only has a single parent and each id is unique to a node at the same level? Or would we have duplicates of the same node at a given level, because they have a different hierarchy above them?

golsombe · ‎09-06-2018

It is a tree structure. If you combined lvls 4+3+2+1 it's unique. Lvl1 is completely unique, each 1 has a single parent 2 who has a single parent 3 etc. Top down 4 is the smallest set of node names but it replicated the most across the dataset.

There is a unique code associated with each row the dim_id property. That is used to identify the "relationship" from other data. I modified my earlier example to bridge multiple segments, products, ps_groups. The tree example below should match the data grid where lvl4 repeats for all 5 rows, lvl3 repeats 2x, lvl2 has only 1 repeat, and lvl1 is all unique.

dim_id	lvl1_id	lvl1_name	lvl2_id	lvl2_name	lvl3_id	lvl3_name	lvl4_id	lvl4_name
row_id	psg_id	psg_name	ps_id	ps_name	seg_id	seg_name	div_id	div_name
1000	428227	Customer Item Generic	300348	Customer Item 1	200692	Customer Enterprise 1	100255	Customer Services
1001	431077	Small Customer Item Latin America	300348	Customer Item 1	200692	Customer Enterprise 1	100255	Customer Services
1002	428138	Small Customer Item US	300349	Customer Item 2	200692	Customer Enterprise 1	100255	Customer Services
1003	431073	Customer Item Priority	300349	Customer Item 3	200693	Customer Enterprise 2	100255	Customer Services
1004	431078	Customer Item Latin America	300350	Customer Item 4	200693	Customer Enterprise 2	100255	Customer Services

DIV (Customer Services)
    SEG (Customer Enterprise 1)
        PS (Customer Item 1)
           PSG(Customer Item Generic)
           PSG(Small Customer Item Latin America)
        PS (Customer Item 2)
           PSG(Small Customer Item US)    
    SEG (Customer Enterprise 2)
       PS (Customer Item 3)
           PSG(Customer Item Priority)
       PS (Customer Item 4)
           PSG(Customer Item Latin America)

andrew_bowman · ‎09-06-2018

Great, so now that we know this is a tree structure, and have labels for the nodes, and know that ids are unique per label, we can create the constraints we need, then do the import.

Using the node labels you provided, let's create the following constraints (run each separately)

CREATE CONSTRAINT ON (n:Div) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT ON (n:Seg) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT ON (n:Pdg) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT ON (n:Ctg) ASSERT n.id IS UNIQUE

That will prevent us from accidentally created duplicate nodes, and we get an index with each of those so lookup by any of these nodes by id should be fast.

Now for the actual import. Provided this is a CSV file (in the import directory of your Neo4j home directory) we can do the following:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'myImportFile.csv' as line
MERGE (c:Ctg {id: toInteger(line.lvl1_id)})
SET c.name = line.lvl1_name
MERGE (p:Pdg {id: toInteger(line.lvl2_id)})
ON CREATE SET p.name = line.lvl2_name
MERGE (s:Seg {id: toInteger(line.lvl3_id)})
ON CREATE SET s.name = line.lvl3_name
MERGE (d:Div {id: toInteger(line.lvl4_id)})
ON CREATE SET d.name = line.lvl4_name

MERGE (c)-[:tg]->(p)
MERGE (p)-[:pg]->(s)
MERGE (s)-[:sg]->(d)

You'll want to double-check that the labels and relationship types (and directions) are correct, but the output should be your full hierarchy tree.

golsombe · ‎09-11-2018

Thank you for the help. The constraints and merge worked perfectly for what I needed. Sorry it's been a while but other work items took priority. I'm back in the GraphDB game again and this really helps with my understanding.

Neo4j

Defining a hierarchy relationship from multiple properties