Neo4j

tms · ‎05-12-2022

I'm suddenly seeing EXTREMELY slow times for LOAD CSV queries that have been fine and fast for months. I'm using neo4j-enterprise v4.4.6 on an AWS EC2 instance running a current Rocky Linux system.

According to "SHOW INDEXES", the database in question has an index on all relevant properties and labels.

Here is the query:

LOAD CSV WITH HEADERS FROM 'file://my/local/file.csv' AS row
CALL {
    WITH row
    WITH
        row WHERE row.county_fips <> row.neighbor_fips
    WITH
        row,
        row.county_fips + '000000000000000' AS fipsIDCounty,
        row.neighbor_fips + '000000000000000' AS fipsIDNeighbor
    MATCH (countyFeature:FIPSItem { fipsID: fipsIDCounty})
    MATCH (neighbor:FIPSItem { fipsID: fipsIDNeighbor})
    WITH
        countyFeature,
        neighbor
    MERGE (countyFeature)-[:INFECTS]->(neighbor)
} IN TRANSACTIONS OF 1000 ROWS

The use of CALL {} IN TRANSACTIONS OF 10000 ROWS is a recent step in my vain attempt to make this work at all.

Earlier this month (4-May-2022), the following earlier version of the same query finished in under 2 minutes:

LOAD CSV WITH HEADERS FROM 'file://my/local/file.csv' AS row
MATCH (fipsItem:FIPSItem { fipsID: row.county_fips + '000000000000000'})
MATCH (neighbor:FIPSItem { fipsID: row.neighbor_fips + '000000000000000'})
WHERE row.county_fips <> row.neighbor_fips
MERGE (fipsItem)-[r:INFECTS]-(neighbor)

The file being loaded has 16,892 records and has not changed.

I notice that three "INFO" entries like the following have been added to debug.log while the most recent version is running:

2022-05-12 19:13:51.996+0000 INFO  [o.n.c.i.ExecutionEngine] [tms-covid-b/11303b5a] Discarded stale query from the query cache after 102 seconds. Reason: CardinalityByLabelsAndRelationshipType(Some(LabelId(0)),Some(RelTypeId(9)),Some(LabelId(0))) changed from 4700.0 to 1.0, which is a divergence of 0.9997872340425532 which is greater than threshold 0.7325330418736514. Query id: 468

I have no clue what this means or if it's relevant. I see no other log entries of interest.

I've configured the instance with memrec, and that configuration hasn't changed. There is lots of memory, lots of disk, and lots of CPU.

This query that used to reliably finish in under 2 minutes is now adding 1,000 bindings each minute. That's at least an order of magnitude slower.

I invite guidance about what is going on.

ameyasoft · ‎05-12-2022

Try this:

LOAD CSV WITH HEADERS FROM 'file://my/local/file.csv' AS row
    WITH
        row WHERE row.county_fips <> row.neighbor_fips

CALL {

    WITH
        row,
        row.county_fips + '000000000000000' AS fipsIDCounty,
        row.neighbor_fips + '000000000000000' AS fipsIDNeighbor
MATCH (neighbor:FIPSItem { fipsID: fipsIDNeighbor})
MATCH (countyFeature:FIPSItem { fipsID: fipsIDCounty})
  WITH
         countyFeature, 
        neighbor
    MERGE (countyFeature)-[:INFECTS]->(neighbor)
} IN TRANSACTIONS OF 1000 ROWS

Neo4j

Extremely degraded LOAD CSV query!