07-24-2020 08:46 AM
I need help speeding up the process of inserting items from a Scrapy pipeline into Neo4j. I am working on a project that scrapes data for about a million patents and stores their information and connections in Neo4j. Each patent has on average 10 connections, including assignees, inventors, classifications and, most importantly, other patents.
Neo4j Server version: 4.0.4 (community)
Neo4j Browser version: 4.0.8
Py2Neo Version: 5.0b1
I have tried storing these items in Neo4j from Python using py2neo and UNWIND queries, but it takes way too long (several seconds per item). Any suggestions on how to speed up this process? Here's an example snippet from my code:
def assignee(item):
    """Yield an (assignee, location) pair for each assignee of the item."""
    user = item.get("user")
    for raw_assignee in user['assignees']:
        assignee_user = parse_user(raw_assignee)
        # No trailing commas after the values: they would turn each one into a tuple.
        person = {
            "fullname": assignee_user.get('fullname', ''),
            "first_name": assignee_user.get('first_name', ''),
            "last_name": assignee_user.get('last_name', '')
        }
        # Default (also covers status 0): no location known.
        location = {"city": None, "state": None, "country": None}
        if assignee_user['status'] == 3:
            location = {
                "city": assignee_user['city'],
                "state": assignee_user['state'],
                "country": assignee_user['country']
            }
        elif assignee_user['status'] == 2:
            location = {
                "city": assignee_user['city'],
                "state": None,
                "country": assignee_user['country']
            }
        yield person, location
params = []
# Unpack into `person` so the loop does not rebind the name of the assignee() generator.
for person, location in assignee(item):
    params.append({
        'fullname': person['fullname'],
        'first_name': person['first_name'],
        'last_name': person['last_name'],
        'city': location['city'],
        'state': location['state'],
        'country': location['country']
    })
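# Pass values as query parameters instead of concatenating them into the string,
# so the query text stays constant and quoting cannot break it.
q = """
MATCH (patent:Patent {document_number: $document_number})
UNWIND $datas AS data
MERGE (assignee:User {fullname: data.fullname})
SET assignee.first_name = data.first_name,
    assignee.last_name = data.last_name
MERGE (patent)-[:ASSIGNEE]->(assignee)
// MERGE fails on a null property value, so skip rows that have no city
WITH assignee, data
WHERE data.city IS NOT NULL
MERGE (city:City {name: data.city})
MERGE (assignee)-[:LOCATED_IN]->(city)
"""
# Assumption: `graph` is a py2neo Graph; the original snippet does not show how q is run.
graph.run(q, document_number=document_number, datas=params)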
07-24-2020 09:44 AM
07-24-2020 02:08 PM
Thanks Sameer, I'll take a look at it