cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Multiple Possible Nodes from One CSV Column

I've a CSV file of businesses and business owners that looks something like this:

S/n Business Business_ID Owned_By Gender
1 Aries Pte Ltd 12345 James Bond Male
2 Taurus Pte Ltd 23456 Daenerys Female
3 Gemini Pte Ltd 34567 Taurus Pte Ltd NA

I would like to create a graph with the businesses as a node, the owner as a node, and a relationship between the business and the owner. For s/n 1 and 2, this is very straight forward:

LOAD CSV WITH HEADERS FROM "....csv" AS file
MERGE (b:Business {ID: file.Business_ID) 
WITH file, b
MERGE (p:Person {Name: file.Owned_by, Gender: file.Gender}) 
WITH file, b, p
MERGE (b)-[:owned_by] ->(p) 

But note that a business could also be owned by another business (s/n 3), and that businesses and persons contains different properties (Business_ID and gender). Using the above syntax, I would get two distinct relationships for s/n 2 and 3:

(Taurus Pte Ltd)-[:owned_by]->(Daenerys)
(Gemini Pte Ltd)-[:owned_by]->(Taurus Pte Ltd)

But what I would like is a single relationship showing the entire chain of ownership:

(Taurus Pte Ltd)-[:owned_by]->(Gemini Pte Ltd)-[:owned_by] ->(Daenerys)

How should I go about doing this?

1 ACCEPTED SOLUTION

Hi,

welcome to the Neo4j community!

I think you could approach it from different angles. Let me give you some suggestions:

  1. You could have different csvs that you load in: one with businesses, one with people and a third one with the edges. That way you can clearly differentiate between people and business and have the correct properties on every node.

  2. If you want to keep your single file and the load command that you have, you could have an additional "postprocess" query. Something like:

MATCH (b1:Business)-[r:owned_by]->(p:Person), (b2:Business) WHERE p.name = b2.name
CREATE (b1)-[s:owned_by]->(b2)
SET s = r
DETACH DELETE p

This will basically find all nodes with the label "Person" that are actually business nodes and "re-route" the existing relationship to the correct node while deleting the "wrong" Person node.

I hope this helps you. You can check back if you have problems with any of the above approaches.

Regards,
Elena

View solution in original post

3 REPLIES 3

Hi,

welcome to the Neo4j community!

I think you could approach it from different angles. Let me give you some suggestions:

  1. You could have different csvs that you load in: one with businesses, one with people and a third one with the edges. That way you can clearly differentiate between people and business and have the correct properties on every node.

  2. If you want to keep your single file and the load command that you have, you could have an additional "postprocess" query. Something like:

MATCH (b1:Business)-[r:owned_by]->(p:Person), (b2:Business) WHERE p.name = b2.name
CREATE (b1)-[s:owned_by]->(b2)
SET s = r
DETACH DELETE p

This will basically find all nodes with the label "Person" that are actually business nodes and "re-route" the existing relationship to the correct node while deleting the "wrong" Person node.

I hope this helps you. You can check back if you have problems with any of the above approaches.

Regards,
Elena

Hi Elena,

Thanks a lot for your help! I'm inclined to chose the 2nd method because that would mean that as my csv file gets updated and expands in the future, I can just refresh my graph database simply by running the same syntax again without requiring a man in the loop to separate the csv file. The data isn't wrong per se, it was just that the system from which we extracted the data from wasn't design with this particular concern in mind.

Your method does work on my trial run on a very small dataset of about 20 rows. I'll try it out next on my actual dataset of over 6 million rows. I hope it'll work as well!

I was just thinking though, is it possible to solve this problem in the "pre-processing" phase rather than "post-processing"? Will it be less resource intensive give that neo4j will not have to establish relationships only to have to comb through it, find the wrong ones, establish a new relationship and delete the old one?

I'm thinking from a more "python-ic" manner, but I'm not sure if this is achievable on neo4j (still very fresh to this!) :

  • put the unique values under the "business" column in a list
  • if the values under the "owned_by" column is in the list, do not do anything. Else, create a node for each unique "business" and "owned by" entity
  • if values under "owned_by" column is in the list, create (business)-[owned_by]->(business), else, (business)-[owned_by]->(person)

Hi,

yes, I completely get what you want to achieve. However, I think that this is not so easy in Cypher. The problem is always that you can't really "save" anything for further processing unless you put it on nodes.

I don't know your setup but what you can always do is have either some kind of "wrapper" programme (e.g. in Python or Java) that does the preprocessing of your csv before writing the nodes or you could write a user-defined procedure (in Java) that does the preprocessing for you which you can then call from Neo4j.

Regards,
Elena

Nodes 2022
Nodes
NODES 2022, Neo4j Online Education Summit

All the sessions of the conference are now available online