Difficult query not working

massung
Node Clone

Right now I have a graph with ~70 M nodes of a given type and have just finished uploading ~20 M nodes of another type. These node types can be thought of like so:

// name is unique, index on position
CREATE (:A { name: string, position: int })

// index on start, end guaranteed > start
CREATE (:B { start: int, end: int })
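
For reference, the uniqueness constraint and indexes described in the comments above would look roughly like this (a sketch assuming the Neo4j 3.x schema syntax that the rest of the thread appears to be using):

CREATE CONSTRAINT ON (a:A) ASSERT a.name IS UNIQUE;
CREATE INDEX ON :A(position);
CREATE INDEX ON :B(start);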

Now, what I need to do is create links from (a)-[:IN]->(b) where the position falls within the given range.

I've tried multiple times using various forms of apoc.periodic.iterate (streaming over all B's or all A's):

call apoc.periodic.iterate(
  "MATCH (b:B) RETURN b",  // stream all B's
  "MATCH (a:A) WHERE a.position >= b.start AND a.position < b.end AND NOT ((a)-[:IN]->(b)) MERGE (a)-[:IN]->(b)",
  {
    // i've tried parallel true & false, various batch sizes, etc.
  }
)

And apoc.periodic.commit (admittedly, I'm not sure if commit even makes sense for this or where in the query the limit should be applied):

call apoc.periodic.commit(
  "MATCH (a:A) MATCH (b:B) WHERE a.position >= b.start AND a.position < b.end AND NOT ((a)-[:IN]->(b)) MERGE (a)-[:IN]->(b) WITH a, b LIMIT {limit} MERGE (a)-[:IN]->(b)",
  { limit: 1000 }
)

Regardless of how I run this, one of a few outcomes happens:

  1. The query returns almost immediately and does nothing.
  2. The query runs for ~2 hours, the connection is lost, nothing has been updated.
  3. The query runs for a little while (~1 hour), ends "successfully", nothing has been updated.

My "last hope" option is to actually perform the link up when loading B's into the database as part of the LOAD CSV query. But this is sub-optimal, works, but is abysmally slow (at it's current rate it would take upwards of 6 days to complete).

Any help solving this would be a huge help. Thanks!

5 REPLIES

jsmccrumb
Graph Buddy

I would think periodic iterate would be able to do this, but make sure there is an index on a.position.

You don't have to check that the :IN relationship already exists since you are using MERGE, and depending on your memory, make sure to keep the batch size low.

Do all three outcomes happen with each type of query? #2 looks like a query we have at work that tries to do too much and ends up running out of memory, so tweaking the query to cut down on the amount of work in a single transaction might help.
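
In other words, something along these lines, with the NOT check dropped and an explicit, modest batch size (the batchSize and parallel values are just placeholders to tune against your memory):

call apoc.periodic.iterate(
  "MATCH (b:B) RETURN b",
  "MATCH (a:A) WHERE a.position >= b.start AND a.position < b.end MERGE (a)-[:IN]->(b)",
  { batchSize: 1000, parallel: false }
)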

massung
Node Clone

Thanks for the thoughts. Yes, a.position does have an index. Good to know about checking :IN; I was worried that I might end up producing multiple relationships if I had to run it many times during testing.

massung
Node Clone

So, removing the [:IN] test actually helped considerably (still using periodic.iterate). However, it consistently times out at the 2-hour mark regardless.

This is running from Scala (so the Java driver), and I don't see any obvious settings to adjust that would allow the connection to remain active while the query keeps running. At the rate it's going it would need ~8 hours to complete, which would be fine if the connection stayed open.

https://neo4j.com/docs/driver-manual/current/client-applications/

Is there a setting I'm missing or misreading one of the existing ones? My current code is:

import java.util.concurrent.TimeUnit
import org.neo4j.driver.v1.{AuthTokens, Config, Driver, GraphDatabase}

val driver: Driver = {
  val auth = AuthTokens.basic(user, password)

  // set custom connection settings
  val settings = Config.build
    .withMaxConnectionLifetime(24, TimeUnit.HOURS)
    .withConnectionLivenessCheckTimeout(30, TimeUnit.MINUTES)

  GraphDatabase.driver(config.url, auth, settings.toConfig)
}

I would think if one of the settings would do it, it'd be the connection lifetime, but regardless of the value, 2 hours (literally, to the second) is when the connection pool dies:

INFO: Closing connection pool towards xxx.org:7687
ERROR [o.b.d.a.a.Main$] UploadRegions terminated: Connection to the database failed
org.neo4j.driver.v1.exceptions.ServiceUnavailableException: Connection to the database failed
        at org.neo4j.driver.internal.util.Futures.blockingGet(Futures.java:122)
        at org.neo4j.driver.internal.NetworkSession.close(NetworkSession.java:136)
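
For context, the query is being issued roughly like this (a simplified sketch against the 1.x driver API visible in the stack trace above; linkQuery is just a placeholder for the apoc.periodic.iterate statement, and the real code does more error handling):

val session = driver.session()
try {
  // linkQuery holds the apoc.periodic.iterate statement shown earlier
  session.run(linkQuery).consume()
} finally {
  session.close()
}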

jsmccrumb
Graph Buddy

My best guess is that it is trying to do too much at once.

Another angle to try is adding a temporary label to :B and using that with periodic commit (a temporary label is helpful because it's easier to match on for periodic commit):

call apoc.periodic.commit(
    "MATCH (b:B) WHERE NOT b:ProcessInRelationship WITH b LIMIT $limit SET b:ProcessInRelationship",
    { limit: 10000 }
)
and then:

call apoc.periodic.commit(
    "MATCH (b:ProcessInRelationship)
    WITH b LIMIT $limit
    REMOVE b:ProcessInRelationship
    WITH b
    MATCH (a:A)
    WHERE a.position >= b.start AND a.position < b.end
    MERGE (a)-[:IN]->(b)",
    { limit: 100 }
)
Not sure how many A's are :IN each B on average... if this has the same problem, drop the limit even lower and see how it works. The temporary label lets you quickly pick up on each iteration where you left off, and the index on A should let you quickly find the matching A nodes.

Just to note, in nearly all cases you want to be using apoc.periodic.iterate() instead of apoc.periodic.commit().

This is because apoc.periodic.commit() executes against the same data set over and over, one batch at a time, so it keeps looking at nodes that were already processed and has to do extra work to figure out that they've already been handled. That's extra work you don't need to do.

By contrast apoc.periodic.iterate() should be processing a stream of results, so per batch it will never have to revisit any previously processed nodes.

You may want to add CYPHER runtime=slotted as a prefix to the first (driving) query within apoc.periodic.iterate(), as there was an issue at one point which caused a bad query plan that didn't work well with the intended batching approach.
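
Concretely, that prefix would go on the driving query like so (a sketch, reusing the placeholder batch settings from earlier):

call apoc.periodic.iterate(
  "CYPHER runtime=slotted MATCH (b:B) RETURN b",
  "MATCH (a:A) WHERE a.position >= b.start AND a.position < b.end MERGE (a)-[:IN]->(b)",
  { batchSize: 1000, parallel: false }
)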

The fact that nothing was being changed in the graph after a long period of time doesn't sound right. Since this is a batching procedure, we should have seen some number of batches being successfully processed. If not even a single batch was processed, then one (or both) of those queries needs another look, or the batch size may be too large.
