
Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

How to fetch millions of data faster?

I am using the dataset from Kaggle.
As I had received an error prompting me to increase the heap memory size, I assigned a heap of 16GB and a page cache of 1GB, as I have 32GB of RAM in total. I left this query running for an hour but got no result, so I terminated the process.

This is my query:

    OPTIONAL MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i:Installments_Payment)
    WITH toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry,
         COUNT(i) AS TotalInstallments
    OPTIONAL MATCH (a)-[:HAS_BUREAU]->(bureau:Bureau)
    WITH size(FILTER(x IN collect(bureau.CREDIT_ACTIVE) WHERE x="Active")) AS TotalActiveBureau,
         a.SK_ID_CURR AS ApplicationID, COUNT(bureau) AS TotalBureau, TotalInstallments, DaysInstallMinusEntry
    RETURN ApplicationID, TotalInstallments, DaysInstallMinusEntry, TotalBureau,
           toFloat(TotalActiveBureau)/toFloat(TotalBureau) AS ActiveBureauByTotalBureauRatio
    ORDER BY ActiveBureauByTotalBureauRatio DESC

Can you attach a picture of the query profile?
Secondly, have you tried running with apoc.cypher.parallel?

https://neo4j-contrib.github.io/neo4j-apoc-procedures/

Hi, I tried using PROFILE but it was still processing after an hour. I will try apoc.cypher.parallel. Thank you very much.

Could you please give me an example of how to use apoc.cypher.parallel? The APOC user guide doesn't give any description of its use.

Editing in progress; it seems the function signature has changed since I last used it.

Hello,
You need to put your query between " ", like call apoc.cypher.parallel("match (n) return n limit 1").
What about PROFILE? Can you make a print-screen or an export? (Sometimes the export does not work.)

I would also save all calculations as new attributes on the nodes rather than compute them inside the query. You can also use virtual nodes if you don't want to alter your graph, meaning: do the calculation and create a new virtual node for each of your nodes, then run the query on the resulting graph.

Let's not look at parallel before we've optimized the base query.

Thank you for the suggestion. I will look into virtual nodes, as adding new attributes to millions of nodes would be time-consuming.

Do you have your database available somewhere for download or testing?
How big is the db in terms of nodes and relationships?

You can use EXPLAIN instead of profile to get an idea of the query complexity.

Please format your query properly next time. It's not polite to expect others to make sense of a mess of code.

Usually, you would write such a query step by step, always having a look at PROFILE and EXPLAIN and often just returning the total count(*) from a pattern so that you always have a good understanding of the cost.

Tip: turn bureau.CREDIT_ACTIVE = "Active" into a label.

Problem: you missed `a` in the first WITH.

I formatted and simplified it:

  • replaced the OPTIONAL MATCH with a pattern comprehension, which computes all statuses as a list of true/false
  • saved the second count by just filtering that list
  • added `a` to the first `WITH`
  • only access properties as late as possible

Here is the statement:

MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i:Installments_Payment) 

WITH a, toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry,
     count(i) AS TotalInstallments

WITH *, [ (a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = "Active"] as bureauStatus
WITH DaysInstallMinusEntry, TotalInstallments, 
     size([status in bureauStatus WHERE status]) AS TotalActiveBureau, 
     size(bureauStatus) as TotalBureau, a

RETURN a.SK_ID_CURR AS ApplicationID, TotalInstallments,DaysInstallMinusEntry,TotalBureau, 	
       toFloat(TotalActiveBureau)/toFloat(TotalBureau) AS ActiveBureauByTotalBureauRatio 

ORDER BY ActiveBureauByTotalBureauRatio DESC

58mil+ nodes / 42+GB on HDD, and not all relationships are created yet.
erikazosi, have you created indexes on all those SK keys?

Hi, thank you for the detailed information. I apologize for the badly formatted query, as I am new to complex queries.

I have a total of 52,695,506 nodes and 52,339,251 relationships.
I will also make the database available for download and provide you the link.

The reason I used OPTIONAL MATCH is that an Application node may or may not have a relationship to a Bureau node, and likewise a Previous_Application to Installments_Payment nodes.

The result of using EXPLAIN on the query from my question (screenshot of the plan attached):

Can you also try my updated query?
I really think it's the omitted `a` from the first WITH.

I received the same error from both your first query and the updated query. Am I missing something I should mention?

Seems to be a bug; if you replace the * with the explicit variables, it works for me:

MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i:Installments_Payment) 

WITH a, toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry,
     count(i) AS TotalInstallments

WITH a,DaysInstallMinusEntry,TotalInstallments, [ (a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = "Active"] as bureauStatus
WITH DaysInstallMinusEntry, TotalInstallments, 
     size([status in bureauStatus WHERE status]) AS TotalActiveBureau, 
     size(bureauStatus) as TotalBureau, a

RETURN a.SK_ID_CURR AS ApplicationID, TotalInstallments,DaysInstallMinusEntry,TotalBureau, 	
       toFloat(TotalActiveBureau)/toFloat(TotalBureau) AS ActiveBureauByTotalBureauRatio 

ORDER BY ActiveBureauByTotalBureauRatio DESC

I ran this query twice, and on both executions I hit a black screen in Neo4j Desktop after 7 minutes.

Hmmm, not related to your query, but I also often get black screens when running long queries on AWS.
Are you running it on your personal computer?
Secondly, I'm running Michael's query.
Third, is your heap still at 1G? dbms.memory.heap.max_size=1G

Don't run such a large query in Neo4j browser.

Use a client program that streams the data and e.g. consumes it or writes it into a file.
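A minimal sketch of such a client, assuming Python as suggested later in the thread: the CSV-writing part is factored into a helper that works with any iterable of dict-like rows, and the driver URI, credentials, and query shown in comments are placeholders, not taken from the thread.

```python
import csv

def stream_to_csv(records, path, fieldnames):
    """Write an iterable of dict-like rows to a CSV file, one row at a time,
    so the full result never has to fit in memory at once."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        n = 0
        for record in records:
            writer.writerow({k: record[k] for k in fieldnames})
            n += 1
        return n

# Hypothetical driver usage (requires a running Neo4j instance):
# from neo4j.v1 import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))
# result = driver.session().run("MATCH (a:Application) RETURN a.SK_ID_CURR AS ApplicationID")
# stream_to_csv(result, "applications.csv", ["ApplicationID"])
```

Because the driver result is consumed lazily, rows are flushed to disk as they arrive instead of accumulating in the Browser.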

Yes, the heap should probably be increased too, depending on the profile output.

I'll check with your db.

Michael

Thank you for your time. I am planning to use Python as the client program. Even though I increased the heap, it didn't help this query finish.

So I spent two hours working through your query; after all, it is a gigantic full-graph query that touches almost all entities in the graph.

I did a bunch of optimizations:

  • 9G page-cache, 8G heap
  • store the delta directly in the node (as numeric value)
  • use a label for tagging active bureaus
  • load the installment with their delta before expanding
  • turn the optional match into a pattern comprehension
  • store the status in the list so you save the 2nd expand to bureaus
  • and then just filter the list for the status

for doing the initial updates:

call apoc.periodic.iterate("MATCH (i:Installments_Payment) RETURN i",
"SET i.DaysInstallMinusEntry = toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT)"
,{batchSize:10000,iterateList:true,parallel:true,concurrency:12});

// takes 88s

call apoc.periodic.iterate("MATCH (bureau:Bureau) WHERE bureau.CREDIT_ACTIVE = 'Active' RETURN bureau",
"SET bureau:Active"
,{batchSize:10000,iterateList:true,parallel:true,concurrency:12});
// takes 12s
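Conceptually, the batching that apoc.periodic.iterate performs server-side is just splitting a large stream of items into fixed-size chunks and committing each chunk as its own transaction. A rough Python analogy of the chunking (the batch size here is illustrative, not an APOC default):

```python
from itertools import islice

def batches(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Each chunk would correspond to one committed transaction of `batchSize` updates.
total_txs = sum(1 for _ in batches(range(25_000), 10_000))  # chunks of 10k, 10k, 5k
```

With parallel:true, APOC additionally runs these chunks across multiple threads (concurrency:12 above).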

The runtime on my computer is between 5 and 7 minutes, here is the final result:

from neo4j.v1 import GraphDatabase  # Neo4j Python driver 1.x API
import time


driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "test"))

statement = """
MATCH (i:Installments_Payment) 
WITH i, toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry
MATCH (a:Application)-[:HAS_PREVIOUS]->(p:Previous_Application)-[:HAS_PREV_INSTALLMENTS]->(i) 

WITH a, DaysInstallMinusEntry, count(i) AS TotalInstallments
WITH a,DaysInstallMinusEntry,TotalInstallments, [ (a)-[:HAS_BUREAU]->(bureau) | bureau:Active] as bureauStatus

WITH DaysInstallMinusEntry, TotalInstallments, 
     size([status in bureauStatus WHERE status]) AS TotalActiveBureau, 
     size(bureauStatus) as TotalBureau, a

RETURN a.SK_ID_CURR AS ApplicationID, TotalInstallments,DaysInstallMinusEntry,TotalBureau, 	
	       toFloat(TotalActiveBureau)/toFloat(TotalBureau) AS ActiveBureauByTotalBureauRatio 
ORDER BY ActiveBureauByTotalBureauRatio DESC
"""

start = time.time()
rows = 0
value = 0
# stream the result and aggregate client-side instead of collecting all rows
for record in driver.session().run(statement):
    rows = rows + 1
    value = value + record["ActiveBureauByTotalBureauRatio"]

print("rows {rows}".format(rows=rows))
print("took {time}".format(time=time.time() - start))


rows 5339018
took 450.4561972618103 -> about 7.5 minutes

here are all the details

https://www.dropbox.com/s/dpyk60ziitmy9fn/erikazosi-queries.cypher?dl=0

Yes, it's on my personal computer, which is running with a 16G heap and a 1G page cache.

So, the black screen is because the query has actually finished and the Browser is trying to retrieve the result stream, which, being too big, crashes the Browser.
Hence I've wrapped all results in aggregating expressions (count, sum, avg, etc.) so that the result stream is a single row.

Managed to run the query in 8+ minutes (AMD Threadripper 1900X, 8 cores x 3.9GHz; I've seen that it actually runs on a single core, which boosts to 4GHz / 32GB RAM 3200 CL14, M.2 NVMe).

The trick is to move the work outside of index scans, meaning:

match (a:Application)
with collect(a) as aa
unwind aa as a
MATCH (a:Application)-[:HAS_PREVIOUS]->...

I actually wonder how long it takes to output the entire result.

Neo4j browser is not meant to retrieve millions of rows.

See my code, it takes 7-8 minutes from python.

You just need to stream and write the data somewhere or analyze it further e.g. with pandas.
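A minimal sketch of the pandas route, with column names taken from the query above; the sample rows below are made up for illustration, while in practice they would come from the streamed driver result:

```python
import pandas as pd

# In practice: pd.DataFrame([dict(r) for r in driver.session().run(statement)])
rows = [
    {"ApplicationID": 100002, "TotalInstallments": 19, "TotalBureau": 8,
     "ActiveBureauByTotalBureauRatio": 0.375},
    {"ApplicationID": 100003, "TotalInstallments": 25, "TotalBureau": 4,
     "ActiveBureauByTotalBureauRatio": 0.25},
]
df = pd.DataFrame(rows)

# Further analysis then happens in memory, off the database.
mean_ratio = df["ActiveBureauByTotalBureauRatio"].mean()
top = df.sort_values("ActiveBureauByTotalBureauRatio", ascending=False).head(10)
```

This keeps Browser out of the loop entirely: the database only streams rows, and all aggregation or ranking runs client-side.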

Yes, it is currently a single-processor query; this might change in the future with a newer runtime.

Your "trick" has no effect here.

If you look at the profile you don't save any db-hits.

Please do provide the link.

This dataset is not really intended for processing in a graph database, as it is a minimal hierarchy with a lot of non-hierarchy-related data superimposed over and over again, millions of times.
But that's it, there's no connection between these hierarchies, or at least not yet.
@michael.hunger - would it be better for memory consumption to break a node with a lot of attributes into multiple nodes, so that the node loaded into memory contains only the minimal set of data?
From what I can see, the model does not fit into RAM.
Secondly, is there any benefit in linking out all categorical values instead of keeping them as attributes?
And third, related to 2 but extended: would a value graph help?

Not sure how many attributes there are.
Sometimes it helps to split out less frequently used attributes,
but we also try to minimize how many records we load, as property records are separate from node records.

Many value-graph elements turn into dense nodes, which can be detrimental for updates and for querying across them (as you have to post-filter by the end node of those relationships).

Michael, Erika, what kind of operation is this: with ... [ (a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = "Active" ] as bureauStatus? I haven't seen it before.

A different approach, worse timing - 11 min:

MATCH (a:Application) WITH a
CALL apoc.path.subgraphAll(a, {relationshipFilter:'HAS_PREVIOUS|HAS_PREV_INSTALLMENTS|HAS_BUREAU'}) YIELD nodes, relationships
WITH a,
     [ (a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = "Active" and bureau in filter(n in nodes where n:Bureau) ] as bureauStatus,
     filter(n in nodes where n:Installments_Payment) as ii
WITH size([status in bureauStatus WHERE status]) as TotalActiveBureau, ii, bureauStatus, a
UNWIND ii as i
WITH size(bureauStatus) as TotalBureau, a,
     toInteger(i.DAYS_INSTALMENT)-toInteger(i.DAYS_ENTRY_PAYMENT) AS DaysInstallMinusEntry,
     count(i) AS TotalInstallments, TotalActiveBureau
RETURN count(a.SK_ID_CURR) AS ApplicationID, avg(TotalInstallments), avg(DaysInstallMinusEntry), avg(TotalBureau),
       avg(toFloat(TotalActiveBureau)/toFloat(TotalBureau)) AS ActiveBureauByTotalBureauRatio
ORDER BY ActiveBureauByTotalBureauRatio DESC

It is a pattern comprehension, almost like a subquery: it can contain patterns with new variables, WHERE clauses, and result expressions, and it evaluates to a list of values.
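In Python terms, a pattern comprehension behaves roughly like a list comprehension over a node's :HAS_BUREAU neighbours. A loose analogy, with the Bureau nodes stood in by plain dicts for illustration:

```python
# Stand-in for the Bureau nodes reachable from one Application node.
bureaus = [
    {"CREDIT_ACTIVE": "Active"},
    {"CREDIT_ACTIVE": "Closed"},
    {"CREDIT_ACTIVE": "Active"},
]

# [ (a)-[:HAS_BUREAU]->(bureau:Bureau) | bureau.CREDIT_ACTIVE = "Active" ]
bureau_status = [b["CREDIT_ACTIVE"] == "Active" for b in bureaus]

# size([status IN bureauStatus WHERE status]) and size(bureauStatus)
total_active = sum(1 for s in bureau_status if s)
ratio = total_active / len(bureau_status)
```

The single traversal produces the true/false list, and both counts are then derived from it, which is exactly how the optimized query avoids a second expand to the Bureau nodes.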

Hi,
I need to run a simple match query:

match (n:label)-[:has_example]->(m:label2)
with m, count(n.property) as c, collect(n.property) as c2
where c > 1000
return m.property, c, c2 order by c desc

on 45 million nodes...

My server is not able to run it. How can I make it run, via parallel or any other way?

Please post a separate Question and then you can link it here.