Neo4j

calkhoavu · 03-03-2020

Hi,

I'm trying to export the results of a query to CSV files. The query calls the Jaccard similarity procedure, then writes the results to a CSV file using apoc.export.csv.query. That approach worked for smaller datasets, but we ran into a customer for whom it was taking days to produce a single month's output file.

I'm wondering whether it's possible to combine apoc.periodic.iterate with apoc.export.csv.query and algo.similarity.jaccard.stream, so that the Jaccard results exported by apoc.export.csv.query are written out in batches to multiple files.

I've been trying, but I keep running into syntax issues. I just want to know if it's possible.

The general idea:

		CALL apoc.periodic.iterate("MATCH (s:Identity)-[:REL_2019_12]->(e:Ent) WITH {item:id(s), categories: collect(id(e))} as entitlements WITH collect(entitlements) as data RETURN data",
        "CALL apoc.export.csv.query('CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.5})
        YIELD item1, item2, count1, count2, intersection, similarity
        RETURN algo.getNodeById(item1).id +'-'+'2019-12' AS `:START_ID(IdenAG-ID)`,
               algo.getNodeById(item2).id +'-'+'2019-12' AS `:END_ID(IdenAG-ID)`,
               similarity AS weight, 'SIM_50_2019_12' AS `:TYPE`
        ORDER BY similarity', '/AG/AG-SIM-50-2019-12' + $_count + '.csv', {quotes: false})",
        {batchSize:20000, iterateList:true, parallel:true}
        )

Thanks

calkhoavu · 03-09-2020

First of all, thank you @benjamin.squire for taking the time to help me with the problem and for offering multiple ideas until I arrived at the solution.

CALL apoc.periodic.iterate(
"MATCH (s:IdStateAG)-[:REL_2019_12]->(e:EntitlementAG)
WITH {item:id(s), categories: collect(id(e))} AS entitlements
WITH collect(entitlements) AS data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff:0.5})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).id, algo.getNodeById(item2).id, similarity, 'SIM-50-2019-12' AS `:TYPE`
ORDER BY similarity",
"CALL apoc.export.csv.query(
'UNWIND $_batch AS row
RETURN row.`algo.getNodeById(item1).id` AS `:START_ID(IdentityAG-ID)`,
row.`algo.getNodeById(item2).id` AS `:END_ID(IdentityAG-ID)`,
row.similarity AS weight, row.`:TYPE` AS `:TYPE`',
'/TEST/AG-SIM-50-2019-12-' + $_count + '.csv',
{quotes: false, params:{_batch:$_batch, _count:$_count, _relType:'SIM-50-2019-12'}})
YIELD nodes RETURN sum(nodes)",
{batchSize:1000000, iterateList:true, parallel:true})

This writes the Jaccard output to multiple files; how many depends on the size of the data and the batch size. It combines apoc.periodic.iterate, apoc.export.csv.query, and algo.similarity.jaccard.stream.
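
For anyone who wants to try the batching part on its own first, here is a minimal sketch of the same pattern with the Jaccard step swapped for a synthetic stream of 10 rows. I have not run this exact query. The file prefix 'demo-' and the column name n are placeholders I made up for illustration, and it assumes file export is enabled in the APOC configuration. The point it shows is that the export query runs as its own query, so the current batch has to be handed to it through the params config.

```cypher
// Sketch only: a synthetic row stream stands in for the Jaccard results.
// 'demo-' and the column name n are hypothetical placeholders.
CALL apoc.periodic.iterate(
  "UNWIND range(1, 10) AS n RETURN n",
  "CALL apoc.export.csv.query(
     'UNWIND $_batch AS row RETURN row.n AS n',
     'demo-' + $_count + '.csv',
     {params: {_batch: $_batch}})
   YIELD rows RETURN rows",
  // batchSize 5 over 10 input rows should give two batches, one file each
  {batchSize: 5, iterateList: true})
```

As in the full query above, $_count is only used to give each batch its own file name.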

calkhoavu · 03-06-2020

Anyone? The main question is: is it possible to do a batched export with apoc.periodic.iterate, using apoc.export.csv.query with Jaccard on the inside?

Thanks in advance.

benjamin_squire · 03-06-2020

I think something like this would work.

CALL apoc.periodic.iterate(
"MATCH (s:IdStateAG)-[:HAS_ENT_2019_12]->(e:EntitlementAG)
WITH {item:id(s), categories: collect(id(e))} AS entitlements
WITH collect(entitlements) AS d RETURN d",
"CALL apoc.export.csv.query('UNWIND $_batch AS row WITH row.d AS data CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.5}) YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).id AS `:START_ID(IdentityAG-ID)`, algo.getNodeById(item2).id AS `:END_ID(IdentityAG-ID)`, similarity AS weight ORDER BY similarity', '/AG/AG-SIM-50-2019-12' + $_count + '.csv', {params:{_batch:$_batch}})", {batchSize:10000, iterateList:true})
