Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
10-03-2021 10:51 AM
Hi,
Newbie here. For a story I'm trying to import 330.000 json documents on academic papers, with relationships between papers, authors, institutions, funding and patents. I'm using apoc load json for this, but it's terribly slow, it's starts out pretty slow already, but crawls almost to a halt. A 125M json file with 12.000 records takes about 16 hours on my 16 GB RAM machine.
What I have tried:
Coming from Python I assume that my code is slow, using multiple FOREACH loops. Is there any way to make this code more efficient? Or am I missing something else entirely?
This is my query:
UNWIND ['file:///CH.json'] AS filename
CALL apoc.load.json(filename) YIELD value as v
MERGE (p:Paper {title: v.title}) ON CREATE SET p.abstract = COALESCE(v.abstract, 'NULL'),
p.lensId = v.lens_id,
p.datePublished = COALESCE(v.date_published.date, 'NULL'),
p.publicationType = COALESCE(v.publication_type, 'NULL'),
p.scholarlyCitationsCount = v.scholarly_citations_count,
p.patentCitationsCount = v.patent_citations_count
FOREACH (fund in v.funding |
MERGE (f:Funding {name: COALESCE(fund.org, 'NULL')}) ON CREATE SET f.country = COALESCE(fund.country, 'NULL'),
f.fundingId = COALESCE(fund.funding_id, 'NULL')
MERGE (f)-[:FUNDED]->(p))
FOREACH (author in v.authors |
MERGE (a:Author {name: COALESCE(author.first_name + ' ' + author.initials + ' ' + author.last_name, 'NULL')})
ON CREATE SET a.firstName = author.first_name,
a.initials = author.initials,
a.lastName = author.last_name,
a.nameInstitution = COALESCE(author.first_name + '_' + author.initials + '_' + author.last_name + '_' + author.affiliations[0].name, 'NULL')
MERGE (a)-[:AUTHORED]->(p)
FOREACH (affiliation in author.affiliations |
MERGE (i:Institution {name: COALESCE(affiliation.name, 'NULL')})
ON CREATE SET i.gridID = COALESCE(affiliation.grid.id, 'NULL')
MERGE (a)-[:WORKS_AT]->(i)))```
The JSON records are like this:
{
"abstract": "Exogenous application of methyl jasmonate (MeJA) has been extensively used to study jasmonate-dependent signaling events triggered by biotic stresses. MeJA application leads to complex jasmonate-dependent physiological responses, including changes in stomatal openness and induction of emissions of a multitude of volatile compounds. Whether the alterations in stomatal conductance and emissions of MeJA-induced volatiles are quantitatively associated with MeJA dose, and whether the induced volatile emissions are regulated by modifications in stomatal conductance had been poorly known until recently. Our latest studies highlighted a biphasic kinetics of jasmonate-dependent volatile emissions induced by MeJA treatment in the model species cucumber (Cucumis sativus), indicating induction of an immediate stress response and subsequent gene-expression level response. Both the immediate and delayed responses were MeJA dose-dependent. The studies further demonstrated that stomata modulated the kinetics of emissions of water-soluble volatiles in a MeJA dose-dependent manner. These studies contribute to understanding of plant short- and long-term responses to different biotic stress severities as simulated by treatments with a range of MeJA doses corresponding to mild to acute stress.",
"authors":
[
{
"affiliations":
[
{
"grid":
{
"addresses":
[
{
"country_code": "CN"
}
],
"id": "grid.27871.3b"
},
"name": "Nanjing Agricultural University"
},
{
"grid":
{
"addresses":
[
{
"country_code": "EE"
}
],
"id": "grid.16697.3f"
},
"name": "Estonian University of Life Sciences"
}
],
"collective_name": null,
"first_name": "Yifan",
"initials": "Y",
"last_name": "Jiang"
},
{
"affiliations":
[
{
"grid":
{
"addresses":
[
{
"country_code": "EE"
}
],
"id": "grid.16697.3f"
},
"name": "Estonian University of Life Sciences"
}
],
"collective_name": null,
"first_name": "Jiayan",
"initials": "J",
"last_name": "Ye"
},
{
"affiliations":
[
{
"grid":
{
"addresses":
[
{
"country_code": "EE"
}
],
"id": "grid.418882.f"
},
"name": "Estonian Academy of Sciences"
},
{
"grid":
{
"addresses":
[
{
"country_code": "EE"
}
],
"id": "grid.16697.3f"
},
"name": "Estonian University of Life Sciences"
}
],
"collective_name": null,
"first_name": "Ülo",
"initials": "Ü",
"last_name": "Niinemets"
}
],
"chemicals":
[],
"date_published":
{
"date": "2021-04-21T00:00:00+00:00"
},
"end_page": null,
"fields_of_study":
[
{
"name": "Biophysics"
},
{
"name": "Biotic stress"
}
],
"funding":
[
{
"country": null,
"funding_id": null,
"org": "European Regional Development Fund"
},
{
"country": null,
"funding_id": null,
"org": "by European Commission through the European Research Council"
},
{
"country": null,
"funding_id": null,
"org": "by European Commission through the European Research Council"
}
],
"issue": "7",
"keywords":
[
{
"value": "Dose–response"
},
{
"value": "MeJA"
},
{
"value": "green leaf volatiles"
},
{
"value": "lipoxygenase pathway volatiles"
}
],
"lens_id": "000-420-903-932-767",
"mesh_terms":
[],
"patent_citations_count": 0,
"publication_type": "journal article",
So I'm trying to create relationships between all entities in the records. Each records is a paper with, most of the time, several authors with (sometimes multiple) affiliations, funding, etc.
Some pointers would be greatly appreciated.
10-03-2021 11:34 PM
Hi Dimitry
Please check the reference from feb 2020 similar to the case you describe.
Regards
10-04-2021 12:46 AM
Thanks Refeaeli, I did read that and tried some of the suggestions. One of the cases is pretty similar to mine and I rewrote part of the query, but to no avail.
10-04-2021 02:15 PM
what indexes are on
:Paper
:Funding
:Author
:Institution
10-04-2021 11:36 PM
Hi Dana,
I see that I have one duplicate index on author, but here's the rest.
authorInstitutionNameIndex BTREE NONUNIQUE NODE [ "Author" ] [ "nameInstitution" ] ONLINE
authors BTREE NONUNIQUE NODE [ "Author" ] [ "name", "nameInstitution" ] ONLINE
constraint_1439150b BTREE UNIQUE NODE [ "Paper" ] [ "title" ] ONLINE
constraint_350d222c BTREE UNIQUE NODE [ "Institution" ] [ "name" ] ONLINE
constraint_41e068e1 BTREE UNIQUE NODE [ "Patent" ] [ "patentLensId" ] ONLINE
constraint_8fed31ab BTREE UNIQUE NODE [ "Funding" ] [ "fundingOrg" ] ONLINE
constraint_c18c113c BTREE UNIQUE NODE [ "Paper" ] [ "lensId" ] ONLINE
constraint_e20ec68d BTREE UNIQUE NODE [ "Institution" ] [ "gridId" ] ONLINE
institutions BTREE NONUNIQUE NODE [ "Institution" ] [ "name", "gridId" ] ONLINE
papers BTREE NONUNIQUE NODE [ "Paper" ] [ "title", "lensId" ] ONLINE
patents BTREE NONUNIQUE NODE [ "Patent" ] [ "title", "lensId" ] ONLINE
Constraints
ON ( paper:Paper ) ASSERT (paper.title) IS UNIQUE
ON ( institution:Institution ) ASSERT (institution.name) IS UNIQUE
ON ( patent:Patent ) ASSERT (patent.patentLensId) IS UNIQUE
ON ( funding:Funding ) ASSERT (funding.fundingOrg) IS UNIQUE
ON ( paper:Paper ) ASSERT (paper.lensId) IS UNIQUE
ON ( institution:Institution ) ASSERT (institution.gridId) IS UNIQUE
10-05-2021 09:28 AM
unless mistaken I'm not seeing an index on :Funding(name)
which would be of benefit for MERGE (f:Funding {name: COALESCE(fund.org, 'NULL')}) ON CREATE SET ... ...
10-06-2021 09:01 AM
Yes, that's true, thank you. I'm trying that now, but it still is very slow....
10-15-2021 08:59 AM
If you have all your indexes correct and that is not what is slowing down the merges, then I would think the problem is you are bringing the entire JSON into memory and committing it in one go. To that end I would recommend you use apoc.periodic.iterate to create batches.
so something like:
CALL apoc.periodic.iterate(
"UNWIND ['file:///CH.json'] AS filename
CALL apoc.load.json(filename) YIELD value as v",
"MERGE (p:Paper {title: v.title}) ON CREATE SET p.abstract = COALESCE(v.abstract, 'NULL'),
p.lensId = v.lens_id,
p.datePublished = COALESCE(v.date_published.date, 'NULL'),
p.publicationType = COALESCE(v.publication_type, 'NULL'),
p.scholarlyCitationsCount = v.scholarly_citations_count,
p.patentCitationsCount = v.patent_citations_count
FOREACH (fund in v.funding |
MERGE (f:Funding {name: COALESCE(fund.org, 'NULL')}) ON CREATE SET f.country = COALESCE(fund.country, 'NULL'),
f.fundingId = COALESCE(fund.funding_id, 'NULL')
MERGE (f)-[:FUNDED]->(p))
FOREACH (author in v.authors |
MERGE (a:Author {name: COALESCE(author.first_name + ' ' + author.initials + ' ' + author.last_name, 'NULL')})
ON CREATE SET a.firstName = author.first_name,
a.initials = author.initials,
a.lastName = author.last_name,
a.nameInstitution = COALESCE(author.first_name + '_' + author.initials + '_' + author.last_name + '_' + author.affiliations[0].name, 'NULL')
MERGE (a)-[:AUTHORED]->(p)
FOREACH (affiliation in author.affiliations |
MERGE (i:Institution {name: COALESCE(affiliation.name, 'NULL')})
ON CREATE SET i.gridID = COALESCE(affiliation.grid.id, 'NULL')
MERGE (a)-[:WORKS_AT]->(i))) "
,{batchSize:100, parallel:false})
10-17-2021 11:08 AM
Dimitri,
I would try creating a temporary node for each document with all relevant JSON fields imported as properties of the node. Once the data is imported into neo4j, it's much faster to perform the parsing functions you have listed to create the specific nodes you need rather than performing these functions as you import the data.
10-18-2021 04:54 AM
thanks for the suggestion.
I discovered a small error in my index and decided to load the json through python and that worked a lot quicker. In the end, it still took 7 hours, but that is manageble for now. I will try your suggestion to shave off some extra hours though.
All the sessions of the conference are now available online