03-03-2020 02:40 PM
Hi,
I'm trying to export the results of a query to CSV files. The query calls the Jaccard similarity procedure, then writes the results to a CSV file using apoc.export.csv.query. That approach worked for smaller datasets, but we ran into a customer dataset where producing the output file for a single month was taking days.
I'm wondering whether it's possible to combine apoc.periodic.iterate with apoc.export.csv.query and algo.similarity.jaccard.stream, so that the Jaccard results are still exported by apoc.export.csv.query but are split across multiple files by apoc.periodic.iterate.
I've been trying but running into syntax issues. I just want to know if it's possible.
The general idea:
CALL apoc.periodic.iterate("MATCH (s:Identity)-[:REL_2019_12]->(e:Ent) WITH {item:id(s), categories: collect(id(e))} as entitlements WITH collect(entitlements) as data RETURN data",
"CALL apoc.export.csv.query('CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.5})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).id +'-'+'2019-12' AS `:START_ID(IdenAG-ID)`,
algo.getNodeById(item2).id +'-'+'2019-12' AS `:END_ID(IdenAG-ID)`,
similarity AS weight, 'SIM_50_2019_12' AS `:TYPE`
ORDER BY similarity', '/AG/AG-SIM-50-2019-12' + $_count + '.csv', {quotes: false})",
{batchSize:20000, iterateList:true, parallel:true}
)
Thanks
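For readers unfamiliar with the similarity step, here is a minimal Python sketch (not Neo4j code) of what a pairwise Jaccard computation over `{item, categories}` maps looks like. The function names `jaccard` and `jaccard_stream` are hypothetical, and treating the cutoff as inclusive is an assumption rather than confirmed behaviour of algo.similarity.jaccard.stream. It also hints at why large inputs get slow: every pair of items is compared, so the work grows roughly with the square of the number of items.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_stream(data, similarity_cutoff=0.0):
    """Yield (item1, item2, similarity) for every pair at or above the cutoff.

    Assumption: the cutoff is inclusive (>=).
    """
    for left, right in combinations(data, 2):
        sim = jaccard(left["categories"], right["categories"])
        if sim >= similarity_cutoff:
            yield left["item"], right["item"], sim


# Same shape as the maps built in the query: {item: id(s), categories: [id(e), ...]}
data = [
    {"item": 1, "categories": [10, 11, 12]},
    {"item": 2, "categories": [10, 11, 13]},
    {"item": 3, "categories": [20]},
]
pairs = list(jaccard_stream(data, similarity_cutoff=0.5))
```

Items 1 and 2 share two of four distinct categories, so they are the only pair that reaches the 0.5 cutoff in this toy input.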
03-09-2020 01:54 PM
First of all, thank you @benjamin.squire for taking the time to help me with the problem and for suggesting multiple ideas until I arrived at the solution. This is the query that worked:
CALL apoc.periodic.iterate(
"MATCH (s:IdStateAG)-[:REL_2019_12]->(e:EntitlementAG)
WITH {item:id(s), categories: collect(id(e))} AS entitlements
WITH collect(entitlements) AS data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff:0.5})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).id, algo.getNodeById(item2).id, similarity, 'SIM-50-2019-12' AS `:TYPE`
ORDER BY similarity",
"CALL apoc.export.csv.query(
'UNWIND $_batch AS row
RETURN row.`algo.getNodeById(item1).id` AS `:START_ID(IdentityAG-ID)`,
row.`algo.getNodeById(item2).id` AS `:END_ID(IdentityAG-ID)`,
row.similarity AS weight,
row.`:TYPE` AS `:TYPE`',
'/TEST/AG-SIM-50-2019-12-' + $_count + '.csv',
{quotes: false, params:{_batch:$_batch, _count:$_count, _relType:'SIM-50-2019-12'}})
YIELD nodes RETURN sum(nodes)",
{batchSize:1000000, iterateList:true, parallel:true})
This writes the Jaccard output to multiple files; how many depends on the size of the data and the batch size. It combines apoc.periodic.iterate, apoc.export.csv.query, and algo.similarity.jaccard.stream.
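To make the relationship between batch size and file count concrete, here is a small Python sketch (again, not Neo4j code) that splits result rows into numbered CSV files. The helper `export_in_batches` is hypothetical, and starting the file counter at 0 is an assumption about how `$_count` behaves, so check it against your APOC version.

```python
import csv
import os
import tempfile


def export_in_batches(rows, batch_size, out_dir, prefix):
    """Write rows to numbered CSV files, batch_size rows per file; return the paths."""
    paths = []
    for count, start in enumerate(range(0, len(rows), batch_size)):
        # Assumption: the per-batch counter starts at 0.
        path = os.path.join(out_dir, f"{prefix}{count}.csv")
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            # Header names mirror the column aliases used in the export query.
            writer.writerow(
                [":START_ID(IdentityAG-ID)", ":END_ID(IdentityAG-ID)", "weight", ":TYPE"]
            )
            writer.writerows(rows[start:start + batch_size])
        paths.append(path)
    return paths


# Five made-up similarity rows, two rows per file.
rows = [(f"a{i}", f"b{i}", 0.5, "SIM-50-2019-12") for i in range(5)]
out_dir = tempfile.mkdtemp()
paths = export_in_batches(rows, batch_size=2, out_dir=out_dir, prefix="AG-SIM-50-2019-12-")
```

With five rows and a batch size of two, this produces three files, the last one holding the single leftover row. In general the number of files is the row count divided by the batch size, rounded up.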
03-06-2020 08:15 AM
Anyone? The main question is: is it possible to do a batch export with apoc.periodic.iterate, using apoc.export.csv.query with Jaccard on the inside?
Thanks in advance.
03-06-2020 01:58 PM
I think something like this would work.
CALL apoc.periodic.iterate(
"MATCH (s:IdStateAG)-[:HAS_ENT_2019_12]->(e:EntitlementAG)
WITH {item:id(s), categories: collect(id(e))} as entitlements
WITH collect(entitlements) as data RETURN data",
"CALL apoc.export.csv.query('UNWIND $_batch AS row WITH row.data AS data CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.5}) YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).id AS `:START_ID(IdentityAG-ID)`, algo.getNodeById(item2).id AS `:END_ID(IdentityAG-ID)`, similarity AS weight ORDER BY similarity', '/AG/AG-SIM-50-2019-12' + $_count + '.csv', {params:{_batch:$_batch}}) YIELD nodes RETURN sum(nodes)", {batchSize:10000, iterateList:true})