Scrapy Pipeline to Neo4j Bulk Store Very Slow

I need help speeding up the insertion of items from a Scrapy pipeline into Neo4j. I am working on a project that scrapes data for about a million patents and stores their information and connections in Neo4j. Each patent has on average 10 connections, including assignees, inventors, classifications, and, most importantly, other patents.

Neo4j Server version: 4.0.4 (community)
Neo4j Browser version: 4.0.8
Py2Neo Version: 5.0b1

I have tried storing these items in Neo4j from Python using py2neo and UNWIND queries, but it is far too slow: each item takes several seconds. Any suggestions on how to speed this up? Here's an example snippet from my code:

def assignee(item):
    """Yield an (assignee, location) pair of dicts for each assignee of the item."""
    user = item.get("user")
    for raw_assignee in user['assignees']:
        assignee_user = parse_user(raw_assignee)

        # No trailing commas after the values: they would turn each one into a 1-tuple.
        assignee = {
            "fullname": assignee_user.get('fullname', ''),
            "first_name": assignee_user.get('first_name', ''),
            "last_name": assignee_user.get('last_name', '')
        }

        # Default to an empty location (the status 0 case) so that
        # `location` is always defined, whatever the status value.
        location = {
            "city": None,
            "state": None,
            "country": None
        }

        if assignee_user['status'] == 3:
            # Status 3: city, state, and country are all read.
            location = {
                "city": assignee_user['city'],
                "state": assignee_user['state'],
                "country": assignee_user['country']
            }

        elif assignee_user['status'] == 2:
            # Status 2: city and country only, no state.
            location = {
                "city": assignee_user['city'],
                "state": None,
                "country": assignee_user['country']
            }

        yield assignee, location



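params = []
# Unpack into `person` rather than `assignee`, so the assignee() function is not shadowed.
for person, location in assignee(item):
    params.append({
        'fullname': person['fullname'],
        'first_name': person['first_name'],
        'last_name': person['last_name'],
        'city': location['city'],
        'state': location['state'],
        'country': location['country']
    })

# Both $document_number and $datas must be supplied as parameters when the
# query is run. Parameters are written $name; the {$datas} form is not valid
# Cypher, and concatenating document_number into the string gives every
# patent a different query text.
q = """
    MATCH (patent:Patent {document_number: $document_number})
    UNWIND $datas AS data
    MERGE (assignee:User {fullname: data.fullname})
    SET assignee.first_name = data.first_name,
        assignee.last_name = data.last_name
    MERGE (patent)-[:ASSIGNEE]->(assignee)
    // MERGE on a null property value raises an error, so skip rows without a city.
    WITH assignee, data WHERE data.city IS NOT NULL
    MERGE (city:City {name: data.city})
    MERGE (assignee)-[:LOCATED_IN]->(city)
"""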
2 REPLIES

sameerG
Graph Buddy

Thanks Sameer, I'll take a look at it
