Neo4j

Zapp · ‎08-27-2020

I'm currently processing dozens of questionnaires, each one has dozens of questions, of which some are grid questions. Each question has potentially hundreds of answers (lets say up to 400. those are grid questions, so for a grid of 20x20, that means 400 unique answers). For those grid questions there is a relation to a node which acts as a "pair", pairing an answer in each of the dimensions. Needless to say the amount of relations between that pair and people who have chosen it is very big. With my current setup, count for all of those links shows that the number is nearing 200 000 000. Querying this data set takes a long time, so I'm looking for ideas on how to make that model more efficient. Any suggestions would be appreciated.

Thanks!

asperlinga · ‎08-30-2020

Hi Zapp

Can you print the image of graph model, please ?

CALL db.schema.visualization()

Thank You

Alessio

asperlinga · ‎08-31-2020

Hi Zapp

Meanwhile I got the idea that the graphic model depends on the type of question you ask. In Neo4j you can draw a graph for each point of view.

I was inspired by these articles:

and intelligent use of apoc function in:

.

ciao

Alessio

Zapp · ‎09-17-2020

Hi @asperlinga,

Thanks for your response, sorry I got back so late.

Thanks for the links, however it seems they do not mention grid questions specifically, only single axis questions. They also talk about representing the questionnaire flow and storing progress etc., but in my use case, the database is mostly read-only. Sometimes new questionnaires are imported but they are already filled, no need to store questionnaire flow etc.

Here is a simplified schema (I used apoc for it as the Neo4j's built-in one shows erroneous relations):

All those nodes apart from Brand are only connected to a single questionnaire. Here is a sample query:

MATCH (Questionnaire:Questionnaire)-[:ASKS]->(Question:Question)<-[:ANSWERS]-(XAnswer:Answer)<-[:SELECTS]-(Selection:Selection)<-[PickedYes:PICKED_YES]-(Panelist:Panelist)
MATCH (Questionnaire:Questionnaire)-[:ASKS]->(Question:Question)<-[:ANSWERS]-(YAnswer:Answer)<-[:SELECTS]-(Selection:Selection)<-[PickedYes:PICKED_YES]-(Panelist:Panelist)
MATCH (Questionnaire:Questionnaire)-[:ASKS]->(AgeQuestion:Question)<-[:ANSWERS]-(AgeAnswer:Answer)<-[:SELECTS]-(:Selection)<-[:PICKED_YES]-(Panelist:Panelist)
WHERE
	Questionnaire.Guid = 'fef238b4-c211-43a6-a31a-9a552adaad55'
AND
	Question.Text = 'Time spent on the internet'
AND
	XAnswer.Axis = 'X'
AND
	YAnswer.Axis = 'Y'
AND
	YAnswer.Code <> 1
AND
	YAnswer.Code <> 4
AND
	AgeQuestion.Text = 'Age'
AND
	(Questionnaire:Questionnaire)-[:ASKS]->(:Question { Text: 'Awareness' })<-[:ANSWERS]-(:Answer { Text: 'Foo' })<-[:SELECTS]-(:Selection)<-[:PICKED_NO]-(Panelist:Panelist)
AND
	// filter by grid question. Selection.UniqueId used because it's practically always faster than going through Question - Answer - Selection - Panelist
	// UniqueId is a unique (duh) string which identifies a selection across all Questionnaires. it consists of question text, answer pair and the questionnaire id
	(:Selection { UniqueId: 'Consideration_1_3_fef238b4-c211-43a6-a31a-9a552adaad55' })<-[:PICKED_ANY]-(Panelist:Panelist)
RETURN XAnswer AS XAnswer, AgeAnswer AS AgeAnswer, COUNT(Panelist) AS PanelistCount
ORDER BY XAnswer.Code

Here's a longer running one (30 minutes):

MATCH (Questionnaire:Questionnaire)-[:ASKS]->(Question:Question)-[:HAS_SELECTION]->(Selection:Selection)<-[:PICKED_YES]-(Panelist:Panelist)
WHERE
	Questionnaire.Guid IN [/* dozens of questionnaires here */]
AND
	Question.Text IN [/* dozens of questions here */]
AND
(
		(Panelist:Panelist)-[:PICKED_YES]->(:Selection { CrossQuestionnaireUniqueId: 'Gender_1_Male' })
	AND
		(
			(Panelist:Panelist)-[:PICKED_YES]->(:Selection { CrossQuestionnaireUniqueId: 'Age_2_27-30' })
		OR
			(Panelist:Panelist)-[:PICKED_YES]->(:Selection { CrossQuestionnaireUniqueId: 'Age_4_44-47' })
		)
)
WITH Questionnaire AS Questionnaire, Selection AS Selection, COUNT(Panelist) AS PanelistCount
RETURN Questionnaire.Guid AS QuestionnaireId, Selection.CrossQuestionnaireUniqueId AS SelectionCrossQuestionnaireUniqueId, PanelistCount AS PanelistCount

Here there wasn't even any point in joining in the answers in the query because it was taking so long, so I'm doing it after retrieving the results, manually. The query itself is not even that complicated, nor does it produce a ton of rows. The issue is the sheer amount of selection - panelist relations it processes. In a test case, each of the 3 filter sub queries makes ~1 billion DB hits and produces ~50 million intermediate rows, with a final query row count of 4000. Sometimes even overflowing the profiling counters and visually bugging out Neo4j Browser...

Most of the queries I use boil down to this: take a questionnaire (or many of them) and its panelists, apply multiple (dynamic) filters to those panelists and then get the count of remaining panelists, usually grouped by something.

Neo4j

Help creating more efficient model to store questionnaires