My first post and question: Is DISTINCT going to hurt my query performance in the long run?

Hello World!

I'm the CTO of a cyber-security company. Frustrated with the developer experience and performance of Postgres, we decided to give a graph database a shot. I'm incredibly excited our company got accepted to the Neo4j Startup Program today as well!

We are building an Attack Surface Management product. Give us your domain and we will find all connected Internet Assets.

The plan is to incrementally move data from Postgres to Neo4j, starting with Domains and Subdomains. I came up with the following schema:

(:Domain {name: "neo4j.com"})-[:HAS_SUBDOMAIN {found_by: "scanner1"}]->(:Subdomain {name: "community.neo4j.com"})
(:Domain {name: "neo4j.com"})-[:HAS_SUBDOMAIN {found_by: "scanner2"}]->(:Subdomain {name: "community.neo4j.com"})

To get all subdomains for a domain, I execute:

MATCH (d:Domain)-[:HAS_SUBDOMAIN]->(s:Subdomain)
WHERE d.name = 'neo4j.com'
RETURN DISTINCT s.name

Please note the DISTINCT. I wonder whether DISTINCT is going to hurt my performance in the long run as the dataset grows. Or maybe it will just be annoying to query like this in the future? Would it be better to model found_by as a new node/label, so that a Domain and a Subdomain are connected by only one relationship instead of several?

I hope I was able to express my question in a way that makes sense.

Thank you,
Matt

2 REPLIES

Can you tell us more about your data, so we can get a better idea of the data model and its scale? The scale will largely decide how you should proceed. I would say yes to modeling the scanner as a node, to avoid the well-known supernode problem, where a node has too many relationships to scan. But it depends on how much data you have and which questions you want to ask.

That way you can ask questions like "which subdomains were found by scanner X?", which for now is really hard to answer according to the Neo4j data accessibility principles.
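As a rough sketch of what "scanner as a node" could look like: the `Scanner` label and the `FOUND_BY` relationship type below are assumptions made up for illustration, not something from the original schema, and attaching `FOUND_BY` to the subdomain is only one possible way to model it.

```cypher
// Hypothetical model: one HAS_SUBDOMAIN relationship per pair,
// plus one FOUND_BY relationship per scanner that reported the subdomain.
MERGE (d:Domain {name: 'neo4j.com'})
MERGE (s:Subdomain {name: 'community.neo4j.com'})
MERGE (sc:Scanner {name: 'scanner1'})
MERGE (d)-[:HAS_SUBDOMAIN]->(s)
MERGE (s)-[:FOUND_BY]->(sc);

// "Which subdomains were found by scanner1?"
MATCH (:Scanner {name: 'scanner1'})<-[:FOUND_BY]-(s:Subdomain)
RETURN s.name;
```

With this shape, the original subdomain query should no longer need DISTINCT, because each Domain/Subdomain pair is connected only once. Keep in mind that a busy scanner node would collect one FOUND_BY relationship per subdomain, so its degree is worth watching as the data grows.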

You can use the arrows.app website (made by Neo4j) to draw your model and give us an idea of it.

Cypher, the query language used by Neo4j, is far more intuitive and manages complexity far better than SQL, if that is what you mean by developer experience. But Neo4j and SQL databases do not serve exactly the same purposes. Neo4j tends to be a little slower than a SQL database for updates, and it is decent but not especially strong at something like inventory management.

But if you have to deal with complex questions, it's an impressive tool for business. I love it.

In my opinion, based on some profiling I just did, the DISTINCT clause changes nothing: not a single DB hit differs from the non-DISTINCT version of the query. It does seem to require more memory, though, and I guess a tiny bit more CPU.

Since DB hits are the most important factor in performance, you should be fine at a larger scale, as long as you are not running on a Raspberry Pi. Just kidding 😉
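For anyone who wants to repeat that comparison on their own data, a minimal sketch is to prefix both versions of the original query with PROFILE and compare the DB hits and memory reported in the two plans. Run each statement separately; the numbers will depend on your dataset and Neo4j version.

```cypher
// Version without DISTINCT: returns one row per HAS_SUBDOMAIN relationship.
PROFILE
MATCH (d:Domain)-[:HAS_SUBDOMAIN]->(s:Subdomain)
WHERE d.name = 'neo4j.com'
RETURN s.name;

// Version with DISTINCT: same traversal, plus a deduplication step in the plan.
PROFILE
MATCH (d:Domain)-[:HAS_SUBDOMAIN]->(s:Subdomain)
WHERE d.name = 'neo4j.com'
RETURN DISTINCT s.name;
```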

clem
Graph Steward

How frequently do you expect to access the found_by property in your queries? In particular, will you MATCH on the found_by property (vs. just reporting found_by)?

If the answer is hardly ever or never, then consider making the found_by property a Cypher list of strings instead of a single string.

(:Domain {name: "neo4j.com"})-[:HAS_SUBDOMAIN {found_by: ["scanner1", "scanner2"]}]->(:Subdomain {name: "community.neo4j.com"})

Then you won't need DISTINCT. The downside is that updating found_by, or searching by it, will be slower.
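To make that trade-off concrete, here is a minimal sketch of how the list version might be written and searched. The scanner names are just the ones from the example above, and the CASE expression is one possible way to avoid appending the same scanner twice.

```cypher
// Record that scanner1 found the subdomain, keeping a single relationship.
MERGE (d:Domain {name: 'neo4j.com'})
MERGE (s:Subdomain {name: 'community.neo4j.com'})
MERGE (d)-[r:HAS_SUBDOMAIN]->(s)
SET r.found_by = CASE
  WHEN 'scanner1' IN coalesce(r.found_by, []) THEN r.found_by
  ELSE coalesce(r.found_by, []) + 'scanner1'
END;

// Searching by scanner has to inspect the list on every matched relationship.
MATCH (d:Domain {name: 'neo4j.com'})-[r:HAS_SUBDOMAIN]->(s:Subdomain)
WHERE 'scanner2' IN r.found_by
RETURN s.name;
```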