Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
07-03-2019 08:52 AM
I am importing several datasets of which a couple of large ones (500K-1.5M documents of the type shown below), and I have an index created with the command
"CREATE CONSTRAINT ON (n:_) ASSERT n.id IS UNIQUE;"
Indexing time goes from a couple of ms per document at the beginning to currently 1s per document (import still running).
Is this normal? How can this be prevented?
PROFILE
or EXPLAIN
with boxes expanded (lower right corner)"parameters": {
"operation": "add",
"dataset": "bag",
"type": "hg:Building",
"id": "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
"data": {
"uri": "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
"type": "hg:Building",
"validSince": [
"1977-01-01",
"1977-12-31"
],
"geometry": "{\\"type\\":\\"Polygon\\",\\"coordinates\\":[[[5.70891547486678,52.8504181579206],[5.70891457428957,52.8504168123607],[5.70894999504419,52.8504089883029],[5.70896652550447,52.8504363528222],[5.70900994956621,52.8504267996935],[5.70906180219923,52.8505135601466],[5.70903749603102,52.8505188379876],[5.70903433968836,52.8505135444138],[5.70898068889875,52.850525281936],[5.70896446255978,52.8504989051244],[5.70896638900196,52.8504984505955],[5.70893317961297,52.8504437219559],[5.70893125317285,52.8504441764843],[5.70891547486678,52.8504181579206]]]}",
"dataset": "bag",
"validSinceTimestamp": 220924800
},
"structure": "node"
},
"statement": "MERGE (n:_ {id: {id}})\\nON CREATE\\n SET n = {data},\\n n.created = timestamp(),\\n n.id = {id},\\n n.dataset = {dataset},\\n n:`hg:Building`\\nON MATCH\\n SET n = {data},\\n n.accessTime = timestamp(),\\n n.counter = coalesce(n.counter, 0) + 1,\\n n.id = {id},\\n n.dataset = {dataset},\\n n:`hg:Building`\\n REMOVE n:_VACANT\\nRETURN n\\n"
},
07-03-2019 01:54 PM
I tried to run this from the browser by setting the parameter (has to be run on one line). I couldn't get the n={data} to work with the error TypeError: Property values can only be of primitive types or arrays thereof
, but IMO that is a red herring, if your query is working, there must be some difference in the param usage in Browser vs. the HTTP call you are doing.
I think when you change the Label from _ or Vacant to hg:building
you don't have an index or constraint on hg:building
. Try adding that constraint to UNIQUE to hg:building.id.
When the label with the constraint is used, the index is used
When I get hg:building it is not. There is a node scan.
This will cause things to get slower as the number of nodes for the non-indexed label increases.
Merge I used
MERGE (n:Problem1 {id: {id}})
ON MATCH SET n.accessTime = timestamp(), n.counter = coalesce(n.counter, 0) + 1, n.id = {id}, n.dataset = {dataset}
,n.uri=$data.uri,n:hg:Building
RETURN n
Params:
:params {operation: "add",dataset: "bag",type: "hg:Building",id: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",data: {uri: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",type: "hg:Building",validSince: ["1977-01-01","1977-12-31"],geometry: {type:"Polygon",coordinates:[[[5.70891547486678,52.8504181579206],[5.70891457428957,52.8504168123607],[5.70894999504419,52.8504089883029],[5.70896652550447,52.8504363528222],[5.70900994956621,52.8504267996935],[5.70906180219923,52.8505135601466],[5.70903749603102,52.8505188379876],[5.70903433968836,52.8505135444138],[5.70898068889875,52.850525281936],[5.70896446255978,52.8504989051244],[5.70896638900196,52.8504984505955],[5.70893317961297,52.8504437219559],[5.70893125317285,52.8504441764843],[5.70891547486678,52.8504181579206]]]},dataset: "bag",validSinceTimestamp: 220924800},structure: "node"}
07-10-2019 09:27 AM
Hi David,
Thank you very much for your reply. Inspired from what you did I have also run an EXPLAIN query on my DB
EXPLAIN MERGE (n:_ {id: "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084"})
ON CREATE
SET n.id = "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
n.dataset = "bag",
n:`hg:Building`
ON MATCH
SET n.id = "https://bag.basisregistraties.overheid.nl/bag/id/pand/0082100000158084",
n.dataset = "bag",
n:`hg:Building`
REMOVE n:_VACANT
RETURN n
and here is the result:
Only I do not understand where the NodeByLabelScan comes from unless something is seriously wrong, since I just want to remove the label of the found node, and not of any arbitrary node in the DB.
Do I have an error in the query?
And further, this problem was not showing up before we changed the id from a short string to a full URL, can this be related to the increased indexing time?
Thanks,
Stefano
07-22-2019 01:33 PM
Hi Stefano,
Try creating an index on the other Labels you are accessing id with. E.g _
or _Vacant
. The table scan indicates that the query doesn't have an index for that label.
David
07-24-2019 03:08 AM
Hi David,
Thank you for your reply, I was convinced I had already created an index but the script that was supposed to do so did not run correctly.
Thanks again for your support.
Stefano
12-09-2020 11:13 PM
Hi Stefano,
One key point, while creating indexes AND on the same time importing data.
If data on which you are creating index is HUGE data then the process of populating the index data may take some time to complete. So its better to check the status of index (ONLINE, POPULATING or FAILED) make sure its ONLINE once all data is indexed.
On PROFILE your query, if NodeByLabelScan comes (even tough you had indexes) then that means, either indexing not yet complete or it may have failed. Always best practice to see failureMessage in verbose mode.
All the sessions of the conference are now available online