
Improving apoc.periodic.iterate performance for MATCH/CREATE

tms
Graph Buddy

I'm ingesting about 68K items from a CSV, and I'm wondering if I can make the process go faster. At the moment, it's taking over an hour (1:16:41), or about 14.8 items per second.

I'm using "apoc.periodic.iterate" with a batch size of 1000.

The items query answers a list of labeled nodes ("Datapoint").

Each instance of "Datapoint" has an indexed property ("fipsID") that uniquely identifies another labeled node ("FIPSItem").

The "fipsID" property is a (unique) constraint on "FIPSItem".

The apply query uses the value of "fipsID" on each datapoint to collect the "FIPSItem" instance with that value of "fipsID".

Once "f" is in the cypher context, the apply cypher uses CREATE to instantiate a labeled relationship (:FEATURE) from the "Datapoint" instance to the "FIPSItem" instance.

The apply query therefore has a single MATCH that uses two indexed property values followed by a single CREATE. The apply query is called 69 times to handle all 68,097 Datapoint instances.

I'm seeing an elapsed time of 76:41 -- 4,601 seconds -- for this batch query. That's about 67 msec per item, or about 14.8 items per second.

That is slower than I expect (but perhaps my expectations are unrealistic). I'm running Neo4j Enterprise v4.4.6 on a robust AWS EC2 instance. I've done the usual configuration following the guidance of neo4j-admin memrec.

I wonder if the batchSize (1000) is too small for the 68K items this query is handling. According to PROFILE, a single batch of 1000 items without the CREATE uses about 5 MB and takes 72 msec. The .png of that profile is attached.
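
If the batch size turns out to be the bottleneck, the experiment is cheap -- the same call with only the config map changed (the 10000 below is an arbitrary guess, not a recommendation):

call apoc.periodic.iterate(
    "
        MATCH (dp:Datapoint)-[:DATASET]->(dataset:Dataset {datasetID: 'nytimes.20220301.us-counties'})-[:SOURCE]->(source:Source {sourceID: 'nytimes.us-counties'})
        RETURN dp ORDER BY dp.stateCountyFIPSCode
    ",
    "
        MATCH (f:FIPSItem {fipsID: dp.fipsID})
        CREATE (dp)-[:FEATURE]->(f)
    ",
    {batchSize:10000}  // 10x the current value; chosen arbitrarily for the experiment
)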

The database has a "node_label_lookup_index" and "rel_type_lookup_index" as per the most recent documentation with an indexProvider of "token-lookup-1.0".

The Datapoint index is named "Datapoint_fipsIDIndex", and is a "NONUNIQUE" index on the "fipsID" property of "Datapoint". Its indexProvider is "native-btree-1.0".

The FIPSItem index is named "FIPSItem_fipsIDIndex", and is a "UNIQUE" index on the "fipsID" property of "FIPSItem". Its index provider is "native-btree-1.0".
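
For reference, here is a sketch of how these would be declared in Neo4j 4.4 syntax (names and kinds as listed above; my actual DDL may differ slightly):

// UNIQUE constraint on FIPSItem.fipsID; backs the FIPSItem_fipsIDIndex index
CREATE CONSTRAINT FIPSItem_fipsIDIndex IF NOT EXISTS
FOR (f:FIPSItem) REQUIRE f.fipsID IS UNIQUE;

// NONUNIQUE btree index on Datapoint.fipsID
CREATE INDEX Datapoint_fipsIDIndex IF NOT EXISTS
FOR (dp:Datapoint) ON (dp.fipsID);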

Here are the log entries (in query.log) for the start and stop of the batch query:

2022-08-10 15:24:06.036+0000 INFO  Query started: id:3 - 90 ms: 0 B - bolt-session	bolt	neo4j-python/4.4.1 Python/3.9.6-final-0 (linux)		client/172.30.2.147:40924	server/172.30.2.147:7687>	tms-covid-a - tms_neo4j - call apoc.periodic.iterate("
MATCH(dp:Datapoint)-[:DATASET]->(dataset:Dataset {datasetID: 'nytimes.20220301.us-counties'})-[:SOURCE]->(source:Source {sourceID: 'nytimes.us-counties'})
  RETURN dp ORDER BY dp.stateCountyFIPSCode
", "
MATCH (f:FIPSItem {fipsID: dp.fipsID})
CREATE (dp)-[:FEATURE]->(f)
", {batchSize:1000}) - {} - runtime=null - {}

...

2022-08-10 16:40:47.504+0000 INFO  id:3 - 4601558 ms: 176 B - bolt-session	bolt	neo4j-python/4.4.1 Python/3.9.6-final-0 (linux)		client/172.30.2.147:40924	server/172.30.2.147:7687>	tms-covid-a - tms_neo4j - call apoc.periodic.iterate("
MATCH(dp:Datapoint)-[:DATASET]->(dataset:Dataset {datasetID: 'nytimes.20220301.us-counties'})-[:SOURCE]->(source:Source {sourceID: 'nytimes.us-counties'})
  RETURN dp ORDER BY dp.stateCountyFIPSCode
", "
MATCH (f:FIPSItem {fipsID: dp.fipsID})
CREATE (dp)-[:FEATURE]->(f)
", {batchSize:1000}) - {} - runtime=pipelined - {}

Is 15 items per second about as good as can be done by Neo4j Enterprise?

I invite and appreciate any guidance about how I can improve the performance of operations like this.

1 ACCEPTED SOLUTION

tms
Graph Buddy

The solution was to "pivot" the nodes answered by the items and apply queries. I changed the original batch query so that the items query answers each feature (no sorting needed!), and the apply query finds the Datapoint instances associated with a given feature and binds each one to the feature with an instance of the :FEATURE relationship.

Each feature has one datapoint for each date, so if the ingestor is loading 200 days of data, there will be 200 Datapoint instances for each feature.

In the original form of the query, the MATCH operation (to collect the feature) was being run hundreds of times with an identical result for each feature.

With each feature returned from the items query, just one indexed MATCH per Datapoint instance is needed. The CREATE operation that adds the new :FEATURE relationship is very fast.
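
In outline, the pivoted call looks something like this (a sketch, not the verbatim production query; the dataset/source filter from the original items query may need to move into the apply MATCH if other datasets share these features):

call apoc.periodic.iterate(
    "
        MATCH (f:FIPSItem)
        RETURN f
    ",  // items query: ~3K features, no sorting
    "
        MATCH (dp:Datapoint {fipsID: f.fipsID})
        CREATE (dp)-[:FEATURE]->(f)
    ",  // apply query: one indexed lookup per datapoint
    {batchSize:1000}
)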

It appears that this demonstrates a possible "best practices" pattern for such situations -- when one side of the pattern varies less (here, the ~3K features), return it from the items query; when the other side varies more (here, the 68K datapoints), match it in the apply query using the results of the first.

I appreciate your attention. This exchange helped me find and solve the issue.


6 REPLIES

Can you post the code?

Here, for your convenience, is the "apoc.periodic.iterate" cypher excerpted from the log file above:

call apoc.periodic.iterate(
    "
        MATCH (dp:Datapoint)-[:DATASET]->(dataset:Dataset {datasetID: 'nytimes.20220301.us-counties'})-[:SOURCE]->(source:Source {sourceID: 'nytimes.us-counties'})
        RETURN dp ORDER BY dp.stateCountyFIPSCode
    ",  // items query
    "
        MATCH (f:FIPSItem {fipsID: dp.fipsID})
        CREATE (dp)-[:FEATURE]->(f)
    ",  // apply query
    {batchSize:1000}  // configuration
)

Let me know if there is other code you'd like to see.

The above code is invoked from Python (v3.9.6, per the query.log entries above) using the standard Neo4j Python driver.

I assume you have the following indexes created:

create index dataset_on_dataset for (n:Dataset) on (n.datasetID);
create index source_on_source_id for (n:Source) on (n.sourceID);

I don't think you need to extend the match pattern from dataset to source, since you are specifying a specific dataset node. If you match just on datapoints related to that dataset, you will get all the datapoints related to it, regardless of the dataset's connection to the specific source. As such, I think you can simplify the first query to match only on (datapoint)->(dataset).

If the intent is to ensure that there is an existing relationship between the specific dataset node and the specific source node, you can use an 'exists' predicate instead, avoiding a hash join between the two patterns, i.e. (datapoint)->(dataset) and (dataset)->(source). Remove the 'where exists' line in the proposed query below if this is not required.

The rest of the query looks good, except I did not understand the need to order the datapoint nodes. Does it matter in which order the second phase of the iterate query processes them? Sorting can take time when the data set is large, so I removed it in the proposed query below:

call apoc.periodic.iterate(
    "
    match (dp:Datapoint)-[:DATASET]->(dataset:Dataset {datasetID: 'nytimes.20220301.us-counties'})
    where exists ( (dataset)-[:SOURCE]->(:Source {sourceID: 'nytimes.us-counties'}) )
    return dp
    ",
    "
    MATCH (f:FIPSItem {fipsID: dp.fipsID})
    CREATE (dp)-[:FEATURE]->(f)
    ",
    {batchSize:1000}
)

Not sure the changes will be impactful, but putting it out there in case.

Yes, those indexes exist. They are created as constraints, so they are unique.

> I don't think you need to extend the match pattern from dataset to source, since you are specifying a specific dataset node ...

I added these because experimentation with `PROFILE` shows that the plan with the extra step uses less memory and goes faster (at least according to PROFILE).
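
For example, I compared plans along these lines (the count() wrapper here is my shorthand, so that PROFILE measures the match itself without streaming all 68K rows):

PROFILE
MATCH (dp:Datapoint)-[:DATASET]->(dataset:Dataset {datasetID: 'nytimes.20220301.us-counties'})
RETURN count(dp);

PROFILE
MATCH (dp:Datapoint)-[:DATASET]->(dataset:Dataset {datasetID: 'nytimes.20220301.us-counties'})-[:SOURCE]->(source:Source {sourceID: 'nytimes.us-counties'})
RETURN count(dp);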

I agree that the order probably doesn't matter. I specified it because I know that each feature has on the order of several hundred Datapoint instances (one per day for each day since early 2020). Now that I write this, I realize that the query might go faster by answering the ~3K features from the items query, then processing each Datapoint for a given feature.

If each cypher phrase in the apply query is performed for each item answered by the items query, then inverting the order (so that I MATCH Datapoint instances from the features answered by the items query) might make a difference.

That's worth an experiment.

I realize that my own solution that I added a few days ago did not adequately attribute my success to this comment from  @glilienfield.

It was this comment that provoked my experiment, and that experiment provided the solution.

The query is now lightning-fast.
