

My long importing query never ends

Hello everybody,

Could you please help me with my thesis project?
I have to import data from a large CSV file (1.5 GB). Here is a sample from the source file and the query I'm trying to execute:


CSV.txt (2.9 KB)

I waited an entire day but the query never stops loading. Any help? Thanks.

26 REPLIES

12kunal34
Graph Fellow

Please post your query here so we can help you

Yes, sorry, I chose the text file for organization since the query is pretty long. Anyway:

LOAD CSV WITH HEADERS FROM
"file:///C:/opere_prova.csv"
AS line FIELDTERMINATOR ';'
MERGE (o:Work {
shortTitle:COALESCE(line.ShortTitle,$defaultValue),
longTitle:COALESCE(line.LongTitle,$defaultValue),
academyName:COALESCE(line.ItacAcademyItem,$defaultValue),
academyId:COALESCE(toInteger(line.AcademyId),$defaultInteger),
format:COALESCE(line.Format,$defaultValue),
cityPublicationNameIt:COALESCE(line.CityPublicationPlaceItalianName,$defaultValue),
cityPublicationNameLat:COALESCE(line.CityPublicationPlaceLatinName,$defaultValue),
cityPublicationNameEng:COALESCE(line.CityPublicationPlaceEnglishName,$defaultValue),
cityPublicationPlaceFictitiousName:COALESCE(line.CityPublicationPlaceFictitiousName,$defaultValue),
cityId:COALESCE(toInteger(line.CityId),$defaultInteger),
publicationYear:COALESCE(line.PublicationYear,$defaultValue),
dedicationPlaceDate:COALESCE(line.DedicationPlaceDate,$defaultValue),
shelfmark:COALESCE(line.Shelfmark,$defaultValue),
content:COALESCE(line.Content,$defaultValue),
subjects:COALESCE(line.Subjects,$defaultValue),
language:COALESCE(line.Language,$defaultValue),
illustration:COALESCE(line.Illustration,$defaultValue),
pagination:COALESCE(line.Pagination,$defaultValue),
marginalia:COALESCE(line.Marginalia,$defaultValue),
authorInfo:COALESCE(line.AuthorsInfo,$defaultValue),
authorPersonId:COALESCE(toInteger(line.AuthorsPersonId),$defaultInteger),
edition:COALESCE(line.Edition,$defaultValue),
contributorInfo:COALESCE(line.ContributorsInfo,$defaultValue),
contributorPersonId:COALESCE(toInteger(line.ContributorsPersonId),$defaultInteger),
censorInfo:COALESCE(line.CensorsInfo,$defaultValue),
censorPersonId:COALESCE(toInteger(line.CensorsPersonId),$defaultInteger),
censorAgreement:COALESCE(line.CensorsAgreement,$defaultValue),
dedicateeInfo:COALESCE(line.DedicateesInfo,$defaultValue),
dedicateePersonId:COALESCE(toInteger(line.DedicateesPersonId),$defaultInteger),

editorInfo:COALESCE(line.EditorsInfo,$defaultValue),
editorPersonId:COALESCE(toInteger(line.EditorsPersonId),$defaultInteger),
artistInfo:COALESCE(line.ArtistsInfo,$defaultValue),
artistPersonId:COALESCE(toInteger(line.ArtistsPersonId),$defaultInteger),
illustratorInfo:COALESCE(line.IllustratorsInfo,$defaultValue),
illustratorPersonId:COALESCE(toInteger(line.IllustratorsPersonId),$defaultInteger),
printerInfo:COALESCE(line.PrintersInfo,$defaultValue),
printerPersonId:COALESCE(toInteger(line.PrintersPersonId),$defaultInteger),
printerOrnament:COALESCE(line.PrinterOrnament,$defaultValue),
engraverReference:COALESCE(line.EngraverReference,$defaultValue),
essayTitle:COALESCE(line.EssayTitle,$defaultValue),
essayContributorInfo:COALESCE(line.EssayContributorInfo,$defaultValue),
essayContributorPersonId:COALESCE(toInteger(line.EssayContributorPersonId),$defaultInteger),
citation:COALESCE(line.Citation,$defaultValue),
workId:COALESCE(toInteger(line.WorkId),$defaultInteger),
notes:COALESCE(line.Notes,$defaultValue)
})
return count(o)

MERGEs like this won't be performant. You want to MERGE only on the properties that uniquely identify a node, make sure the MERGE is backed by an index, and then use ON CREATE SET to set the remaining properties. Run an EXPLAIN of the query; if you see it using a NodeByLabelScan for the MERGE, then you don't have the appropriate indexes in place on the set of properties that uniquely identify a :Work node.

Also, you should be using USING PERIODIC COMMIT here, and you shouldn't perform an aggregation (count()) in this query.
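
Roughly this shape (just a sketch; I'm assuming for illustration that WorkId is what uniquely identifies a work, that an index or unique constraint exists on :Work(workId), and I'm only showing a couple of the properties):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM
"file:///C:/opere_prova.csv"
AS line FIELDTERMINATOR ';'
MERGE (o:Work {workId: toInteger(line.WorkId)})
ON CREATE SET
o.shortTitle = COALESCE(line.ShortTitle, $defaultValue),
o.longTitle = COALESCE(line.LongTitle, $defaultValue)
// ...and so on for the remaining properties, with no RETURN count(o) at the end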

Could you please write me a little example? I'm new to Neo4j and it was difficult to build that query, so I'm not sure I understand how to write it correctly. Thanks a lot.

First you need to tell us what property or properties should uniquely identify a :Work node.

Also, whether all of the lines in your CSV should become unique nodes, or whether some of them will refer to the same node.

Lastly, whether there are any :Work nodes in your graph already.

Thanks for the answer, so:

  1. the unique property should be workId

  2. yes, all the lines with the same workId should refer to the same node (then I'll use "CALL apoc.refactor.mergeNodes(node2Merge) YIELD node" to merge nodes with the same workId)

  3. No :Work nodes yet

Okay, make sure you create a unique constraint on :Work(workId).
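
For example, with the 3.x syntax (adjust for your version):

CREATE CONSTRAINT ON (w:Work) ASSERT w.workId IS UNIQUE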

Next, change your MERGE so that you only MERGE on workId, and use ON CREATE SET to set the remaining properties.

If you do an EXPLAIN of the query you should see a plan that uses NodeIndexSeek, and not a NodeByLabelScan.

Can you provide more details about when you need to use apoc.refactor.mergeNodes()? If you're handling this correctly at the start, you shouldn't have any duplicate nodes to merge together. Note that we're talking about two different usages of the term here. In Cypher, MERGE refers to creating the node if it doesn't exist, otherwise matching to the node. The APOC refactoring proc is meant to take multiple nodes and collapse them into a single node, often as a means of correcting a mistake of duplicate nodes.
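
For reference, a typical use of that procedure (just a sketch, assuming APOC is installed) is to collect accidental duplicates by workId and collapse each group into one node:

MATCH (w:Work)
WITH w.workId AS id, collect(w) AS node2Merge
WHERE size(node2Merge) > 1
CALL apoc.refactor.mergeNodes(node2Merge) YIELD node
RETURN count(node)

But if the MERGE on workId is done correctly from the start, you shouldn't have any duplicates to clean up in the first place.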

About your first sentence: do you mean to execute a query like this first?

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
"file:///C:/opuere.csv"
AS line FIELDTERMINATOR ';'
MERGE(o:Work{workId:COALESCE(toInteger(line.WorkId),$defaultInteger)})
ON CREATE SET
o.shortTitle = COALESCE(line.ShortTitle,$defaultValue),
o.longTitle=COALESCE(line.LongTitle,$defaultValue),
o.academyName=COALESCE(line.ItacAcademyItem,$defaultValue),
o.academyId=COALESCE(toInteger(line.AcademyId),$defaultInteger),
o.format=COALESCE(line.Format,$defaultValue),
o.cityPublicationNameIt=COALESCE(line.CityPublicationPlaceItalianName,$defaultValue),
o.cityPublicationNameLat=COALESCE(line.CityPublicationPlaceLatinName,$defaultValue),
o.cityPublicationNameEng=COALESCE(line.CityPublicationPlaceEnglishName,$defaultValue),
o.cityPublicationPlaceFictitiousName=COALESCE(line.CityPublicationPlaceFictitiousName,$defaultValue),
o.cityId=COALESCE(toInteger(line.CityId),$defaultInteger),
o.publicationYear=COALESCE(line.PublicationYear,$defaultValue),
o.dedicationPlaceDate=COALESCE(line.DedicationPlaceDate,$defaultValue),
o.shelfmark=COALESCE(line.Shelfmark,$defaultValue),
o.content=COALESCE(line.Content,$defaultValue),
o.subjects=COALESCE(line.Subjects,$defaultValue),
o.language=COALESCE(line.Language,$defaultValue),
o.illustration=COALESCE(line.Illustration,$defaultValue),
o.pagination=COALESCE(line.Pagination,$defaultValue),
o.marginalia=COALESCE(line.Marginalia,$defaultValue),
o.authorInfo=COALESCE(line.AuthorsInfo,$defaultValue),
o.authorPersonId=COALESCE(toInteger(line.AuthorsPersonId),$defaultInteger),
o.edition=COALESCE(line.Edition,$defaultValue),
o.contributorInfo=COALESCE(line.ContributorsInfo,$defaultValue),
o.contributorPersonId=COALESCE(toInteger(line.ContributorsPersonId),$defaultInteger),
o.censorInfo=COALESCE(line.CensorsInfo,$defaultValue),
o.censorPersonId=COALESCE(toInteger(line.CensorsPersonId),$defaultInteger),
o.censorAgreement=COALESCE(line.CensorsAgreement,$defaultValue),
o.dedicateeInfo=COALESCE(line.DedicateesInfo,$defaultValue),
o.dedicateePersonId=COALESCE(toInteger(line.DedicateesPersonId),$defaultInteger),
o.editorInfo=COALESCE(line.EditorsInfo,$defaultValue),
o.editorPersonId=COALESCE(toInteger(line.EditorsPersonId),$defaultInteger),
o.artistInfo=COALESCE(line.ArtistsInfo,$defaultValue),
o.artistPersonId=COALESCE(toInteger(line.ArtistsPersonId),$defaultInteger),
o.illustratorInfo=COALESCE(line.IllustratorsInfo,$defaultValue),
o.illustratorPersonId=COALESCE(toInteger(line.IllustratorsPersonId),$defaultInteger),
o.printerInfo=COALESCE(line.PrintersInfo,$defaultValue),
o.printerPersonId=COALESCE(toInteger(line.PrintersPersonId),$defaultInteger),
o.printerOrnament=COALESCE(line.PrinterOrnament,$defaultValue),
o.engraverReference=COALESCE(line.EngraverReference,$defaultValue),
o.essayTitle=COALESCE(line.EssayTitle,$defaultValue),
o.essayContributorInfo=COALESCE(line.EssayContributorInfo,$defaultValue),
o.essayContributorPersonId=COALESCE(toInteger(line.EssayContributorPersonId),$defaultInteger),
o.notes=COALESCE(line.Notes,$defaultValue),
o.citation=COALESCE(line.Citation,$defaultValue)

Let me understand this first and then I'll try to explain what I meant with apoc.refactor.
Thanks a lot.

Following this query I have a lot of problems; here's an example:

In the CSV file there is more than one record of dedicateePersonId for each workId, but the query stored just one value per node. Any idea?
It should be, for example: 5099, 5100, 6798; I mean, more than one value per property.

The only way to have more than one value per property is for that property to be a list that you add to over time.

If each line can only add one value to the property, then you should be adding to the list property (increasing the size of the list) each time.

Note that you cannot query by index on an element of a list property. You may want to consider whether your data requires a richer model. Right now it looks like you're using a single node type as a data dump for all kinds of data that should have their own dedicated nodes (academies, publications, cities, artists, authors). And if you only ever plan to return this data but never query by it, maybe that's fine.

But for elements that are meant to be common between works, you really should think about whether those elements should be their own separate nodes, with relationships to the works that they apply to.

Note that graph databases are at their most useful when there are relationships and patterns to query over. If :Work nodes are the only thing in your db, with no relationships, then Neo4j probably isn't the right tool to use; or perhaps you just haven't yet figured out how to adapt your use case to a graphy kind of modeling.
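
As a sketch of what that richer model could look like for the dedicatee example (the :Person label and the DEDICATED_TO relationship type are hypothetical names, and the ids here are made up; pick whatever fits your domain):

MERGE (w:Work {workId: 42})
MERGE (p:Person {personId: 5099})
MERGE (w)-[:DEDICATED_TO]->(p)

Then a work with dedicatees 5099, 5100 and 6798 is simply a work with three DEDICATED_TO relationships, and you can query from either direction instead of digging through a list property.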

I also have to add :Person and :Academy nodes in order to link everything through some IDs (after all this).

I was thinking about these three kinds of nodes because I have three different CSV files, one for each type, but if you suggest splitting them into more than 3 node types I'll consider it.

Can you tell me how to modify my query with a list, please?

Sure, but I don't know what your default value is for this one. Just looking at the dedicateePersonId for this example, you may have:

...
ON CREATE SET
o.dedicateePersonId = [coalesce(line.DedicateesPersonId, $defaultValue)]
...
ON MATCH SET
o.dedicateePersonId = o.dedicateePersonId + coalesce(line.DedicateesPersonId, $defaultValue)
...

The idea is that the property value on creation must be a list, and on subsequent matches you're only adding on to that list.

defaultValue is a string, 'No Entry', and defaultInteger (I created this for a specific reason but it's not needed, to be honest) is 0.

I'll try the complete query, be right back. Thanks Andrew.

Typo in my example: the value we're setting in ON CREATE SET should be set within a literal list. Just fixed it.

I'm trying, it's still running, keeping my fingers crossed!

It's still running. How can I know if there's a problem, or do I just have to wait?

May just have to wait.

You can check if the number of :Work nodes is growing over time, though that won't show cases where it's matching to existing nodes.
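
For example, running something like this every so often in another session (assuming the import is the only thing writing :Work nodes) will show whether it's still creating nodes:

MATCH (w:Work) RETURN count(w)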

To double-check, can you provide an EXPLAIN plan (with all elements expanded) of your query?

It created 803 nodes out of 1010, but it hasn't progressed for hours. Here are the EXPLAIN pictures; anyway, the nodes it created are correct. Thanks.

The query plan looks expected, shouldn't be any issues with it.

Can you clarify "created 803 nodes out of 1010"? What is 1010 here?

Also, by "progress" are you referring to the number of nodes created? If this is a 1.5 GB file, I imagine it may take a long time to fully process. How many rows is this thing? And how large are you expecting these lists to grow? You may want to reconsider using list properties for this. As noted previously, it looks like you're trying to store several nodes' worth of information in a single node type.

Sorry for my bad explanation.
1010 is the max value of workId, so with the uniqueness constraint on that property I would expect 1010 nodes in total.

This is the max number of rows:

Should I consider something different in my query?
Is there a way to reduce the processing time?

Try just doing a MERGE on the :Work nodes by the workId as before, but for this run don't consider the properties that you were appending to lists previously, so don't do ON MATCH SET operations. Just get the nodes created.

The ids of the items that you were going to have as lists: were you planning on adding specific nodes for them? If so, get those nodes created next (and make sure you have indexes and/or unique constraints to support MERGE operations on them). Most likely those correspond to the elements you were going to save as lists.

Once all nodes are created, then go about subsequent runs to MATCH to existing nodes and create the relationships between them. That should be faster than trying to do list appending one at a time for 1 million nodes.
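
As a rough sketch of that final relationship pass (again using a hypothetical :Person label keyed by personId and a hypothetical DEDICATED_TO relationship type, both backed by unique constraints):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM
"file:///C:/opere_prova.csv"
AS line FIELDTERMINATOR ';'
WITH line WHERE line.DedicateesPersonId IS NOT NULL
MATCH (w:Work {workId: toInteger(line.WorkId)})
MATCH (p:Person {personId: toInteger(line.DedicateesPersonId)})
MERGE (w)-[:DEDICATED_TO]->(p)

Since both MATCHes are index-backed, each row becomes two seeks plus a relationship MERGE, which is much cheaper than rewriting a growing list property for every line.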

Thanks Andrew, I think I figured it out. I'll let you know soon; let's see if the query is going to finish first.

Glad to hear it, let us know what you figured out when you get the chance.

ofer_bar
Node Link

Hi,

I just want to share my own experience with importing CSV data into a Neo4j database.
I also tried doing that using my own queries that built the nodes, properties and relationships from CSV files.
For small-sized data this is fine. But once you go larger (for example 10^5 nodes and higher) you can take a vacation after running the query and still be waiting for a while once you get back home.

A much better approach, and it is described in the docs, is to use the import tool:

bin/neo4j-admin import

There are a few cons to using this tool:

  1. You will need to convert your CSV files to the very specific format the import tool expects.
  2. The import tool creates a new database, so you can't keep existing data and import on top of it.

Still, the import process itself is a blast and finishes very quickly, so I think it was well worth the time I spent converting my files for the import tool.

Good luck!

Hi guys,

after 4 days of continuous running it finally ended!
Before going on I also want to thank ofer_bar for the suggestion; I will definitely retry following what you told me (after a giant dump of my DB, of course!). I'm not sure I understood exactly how to do it, but I'm going to learn and try to understand, and I'll let you know.
May I ask something else about my project here, or do I need to open new posts?

Hi,

Sure, you can ask any question.

You can start with the docs; they're pretty detailed and I managed to figure it out from there:

https://neo4j.com/docs/operations-manual/3.5/tutorial/import-tool/

Don't forget to pick your database version in the docs (3.5/4.0), although I'm not sure it changed with 4.0...

I hope this helps