cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

When I add a Skip/Limit to the query I start getting duplicates

mike2
Node Clone

NEO4j 4.0.5 in both Desktop and on a standalone Enterprise single instance in an EC2 VM.

When I run the following query with the SKIP and LIMIT in them I get some duplicates on subsequent pages. (groups of 50).

call{
                MATCH (u:User {user_id:218})
                OPTIONAL MATCH (u)-[:USER]-(mf:Image)
                with u, collect(mf.version_history) as myInts
                WITH u, REDUCE(output = [], i IN myInts| output + i) AS my_files
                MATCH (u)-[:FAMILY_FRIEND]-(:User)-[ul:USER]-(f:Image)
                with collect({node: f, deleted: exists(ul.deleted), vh: f.version_history, vd: f.version_date}) as files, my_files
                unwind files as fset
                with fset['node'] as f, fset['deleted'] as deleted, fset['vh'] as vh, fset['vd'] as vd, files, my_files
                WHERE none(x in files WHERE apoc.coll.intersection(vh, x['vh']) and x['vd'] > vd)
                AND not apoc.coll.intersection(vh, my_files) AND not deleted
                return f
                union
                MATCH (:User {user_id:218})-[ul:USER]-(f:Image) 
                WHERE not exists(ul.deleted)
                return f
            }
            WITH f
            ORDER BY f.version_date DESC
            SKIP 0 // or 50
            LIMIT 50
            return f.uid

I tried adding a distinct f.
My understanding is that the new call subquery would remove the issue of skip and limit not working on both parts of the UNION
If I remove the SKIP and LIMIT and return all the results there are no duplicates returned.
The duplicates are not necessarily the end node and first node of the set (It's not an off by 1 issue), It seems to be 1 or mode nodes on each page that are duplicated.

Any ideas?

1 ACCEPTED SOLUTION

The timestamps of some of the items are the same (They must have been changed as part of the same TXN when they were in Postgres and thus have the same timestamp.) Adding a second order by clause removes the duplicates on the subsequent pages. (Still seems like strange behavior, but at least I know why now.

View solution in original post

11 REPLIES 11

mike2
Node Clone

This is holding me up from getting my software working, is there more information I can give to someone to help with this? Is this a bug in 4.0.5?

When I remove the skip and limit it does not return duplicates, but returning thousands of results at a time will not work, they need to be paged, what can I do?

Hi Mike,

As union gives non duplicate results you do not need distinct keyword.
However I feel like duplicate records are due to variable name f. In the return it is not hitting combined f.

I am surprised how come SKIP and LIMIT is giving you duplicate results when f itself if distinct result set.

Once you come online we can discuss on this in more detail.

So I did a quick test this morning and this works as it should:

call {
	UNWIND [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20] as data1
    return data1
    UNION
    UNWIND [7,8,9,17,18,19] as data1
    return data1
}
WITH data1
order by data1
skip 10
limit 10
return data1

So, I removed all the logic in the first part of my union to try and eliminate where the problem was coming from, but this still has duplicates:
call{
MATCH (u:User {user_id:'218'})
OPTIONAL MATCH (u)-[:USER]-(f:Image)
return f
union
MATCH (:User {user_id:'218'})-[ul:USER]-(f:Image)
WHERE not exists(ul.deleted)
return f
}
WITH f
ORDER BY f.version_date DESC
SKIP 50 // or 50
LIMIT 50
return f.uid

mike2
Node Clone

Cypher Manual 3.18.1 -- https://neo4j.com/docs/cypher-manual/4.0/clauses/union/

To keep all the result rows, use UNION ALL . Using just UNION will combine and remove duplicates from the result set.

Yeah I reread the doc, that's why I deleted my post

So I have made the dumbest possibly query (Which you would never do in real life, but to test the issue It was useful.) This returns duplicates. Once again, the duplicates seem random.

call{
      MATCH(f:Image)
      return f
      union
      MATCH (f:Image)
      return f
}
WITH f
ORDER BY f.version_date DESC
SKIP $skip
LIMIT $limit
RETRUN f.uid

mike2
Node Clone

Here is one of the items returning duplicates:

Results from: MATCH (f:File{uid:'4147'}) RETURN f

{
  "identity": 7247,
  "labels": [
    "File",
    "Image"
  ],
  "properties": {
    "has_people": true,
    "version_date": "2020-02-06T18:38:00.740160999Z",
    "version_history": [
      4147
    ],
    "file_size": 0,
    "uid": "4147",
    "system": "facebook",
    "filename": "10155585636624547",
    "width": 721,
    "thumbnail_system": "wasabi-us-east-2",
    "thumbnail_key": "6ced12aa-8b93-44b4-a1e8-42f4ef008cc9.jpg",
    "hash": "ee6137545d2e4902e4124a59a5471b75e91c1925a37c0459decbf85886083c71",
    "upload_date": "2019-12-18T19:56:46.047438000Z",
    "height": 960
  }
}

mike2
Node Clone

So on a whim I changed the order by from the version_date to the uid and suddenly no duplicates.

The timestamps of some of the items are the same (They must have been changed as part of the same TXN when they were in Postgres and thus have the same timestamp.) Adding a second order by clause removes the duplicates on the subsequent pages. (Still seems like strange behavior, but at least I know why now.

As a followup I reran the query without the union or anything and the duplicates came up, the problem is only related to the dates.

            MATCH(f:Image)
            WITH f
            ORDER BY f.version_date DESC//, f.uid DESC
            SKIP $skip
            LIMIT $limit
            RETURN f.uid, f.version_date
Nodes 2022
Nodes
NODES 2022, Neo4j Online Education Summit

All the sessions of the conference are now available online