02-13-2019 08:24 PM
Hi everybody,
I am reading the ebook "A Comprehensive Guide to Graph Algorithms in Neo4j". Following the book, I downloaded the Yelp dataset to experiment with some of the algorithms. However, I cannot import the data into my Neo4j server.
I followed the guides on GitHub (linked in the book), but ran into errors. Can anybody here help, or does someone have a Yelp graph database they could upload somewhere?
Thanks for your help,
Harvey Nguyen
02-13-2019 09:57 PM
Can you share the error messages?
There is also a Cypher based import script for the same data here: https://neo4j.com/docs/graph-algorithms/current/yelp-example/#yelp-import
02-13-2019 10:14 PM
Thanks for your reply,
There are several different errors. One of them is:
Traceback (most recent call last):
  File "json_to_csv.py", line 51, in <module>
    with open("dataset/businessLocations.json") as business_locations_json, \
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/businessLocations.json'
After extracting the dataset archive, there is no file named "businessLocations.json". I think this Python code is out of date. There is also no such file on the Yelp website.
I will try with your suggestion.
Thanks a lot.
02-13-2019 10:10 PM
Please share the error message.
02-14-2019 01:34 AM
Hey,
That file is generated by running this command:
python lat_long_expansion.py
Or did you try that already and it didn't work?
02-14-2019 01:41 AM
@mark.needham really?
I followed the instructions on GitHub.
After extracting, I ran
python json_to_csv.py
So we need to change the order?
02-14-2019 02:06 AM
Yeah, I must have those instructions in the wrong order.
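The ordering problem discussed above can be checked before running anything. The following is a minimal sketch, assuming the layout implied by the traceback (a dataset/ directory that contains businessLocations.json once lat_long_expansion.py has run). The helper name missing_inputs is hypothetical and not part of the repository.

```python
from pathlib import Path

# Assumption: json_to_csv.py reads this file and lat_long_expansion.py writes it,
# as suggested by the traceback and the reply in this thread.
REQUIRED_BEFORE_CSV = ["dataset/businessLocations.json"]


def missing_inputs(paths, root="."):
    """Return the subset of `paths` that does not exist as a file under `root`."""
    return [p for p in paths if not (Path(root) / p).is_file()]


if __name__ == "__main__":
    missing = missing_inputs(REQUIRED_BEFORE_CSV)
    if missing:
        print("Run `python lat_long_expansion.py` first; missing:", missing)
    else:
        print("Inputs present; `python json_to_csv.py` can be run.")
```

Run from the repository root, this only reports which generated inputs are absent; it does not run either script itself.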
05-09-2019 03:32 PM
Hi @mark.needham, I tried the Python scripts as well as loading the JSON files using Cypher, and the import gets stuck loading user.json. I'm loading it in Neo4j Desktop.
05-10-2019 12:16 AM
Is there an error message / more info that you can share?
05-10-2019 07:31 AM
Hi @mark.needham, thanks for your response! I am working with @shivanandiyer on a project to analyse the Yelp dataset with Neo4j. I initially tried using the following APOC procedure to load the JSON, but it ran for more than a day and I had to abort it:
CALL apoc.load.json("path") YIELD value AS user
MERGE (u:User {user_id: user.user_id})
SET u.name = user.name,
u.review_count = user.review_count,
u.average_stars = user.average_stars,
u.fans = user.fans
I later came across your GitHub repository to convert JSON to CSV and directly import the data file with relations (https://github.com/mneedham/yelp-graph-algorithms). I followed the steps and got this error while running the import.sh script file:
Available resources:
Total machine memory: 16.00 GB
Free machine memory: 2.81 GB
Max heap memory : 3.56 GB
Processors: 8
Configured max memory: 11.20 GB
High-IO: true
Import starting 2019-05-11 00:24:07.432+1000
Estimated number of nodes: 2.83 M
Estimated number of node properties: 8.24 M
Estimated number of relationships: 1.82 G
Estimated number of relationship properties: 0.00
Estimated disk space usage: 58.61 GB
Estimated required memory usage: 1.03 GB
InteractiveReporterInteractions command list (end with ENTER):
c: Print more detailed information about current stage
i: Print more detailed information
(1/4) Node import 2019-05-11 00:24:07.487+1000
Estimated number of nodes: 2.83 M
Estimated disk space usage: 967.71 MB
Estimated required memory usage: 1.03 GB
.......... .......... .......... .......... .......... 5% ∆1s 822ms
.......... .......... .......... .......... .......... 10% ∆403ms
.......... .......... .......... .......... .......... 15% ∆403ms
.......... .......... .......... .......... .......... 20% ∆458ms
.......... .......... .......... .......... .......... 25% ∆1s 407ms
.......... .......... .......... .......... .......... 30% ∆1s 804ms
.......... .......... .......... ........-. .......... 35% ∆236ms
.......... .......... .......... .......... .......... 40% ∆0ms
.......... .......... .......... .......... .......... 45% ∆0ms
.......... .......... .......... .......... .......... 50% ∆605ms
.......... .......... .......... .......... .......... 55% ∆0ms
.......... .......... .......... .......... .......... 60% ∆202ms
.......... .......... .......... .......... .......... 65% ∆202ms
.......... .......... .......... .......... .......... 70% ∆0ms
.......... .......... .......... .......... .......... 75% ∆2s 410ms
.......... .......... .......... .......... .......... 80% ∆0ms
.......... .......... .......... .......... .......... 85% ∆1ms
.......... .......... .......... .......... .......... 90% ∆0ms
.......... .......... .......... .......... .......... 95% ∆0ms
.......... .......... .......... .......... .........Exception in thread "Thread-50" java.lang.RuntimeException: org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException: Id '#NAME?' is defined more than once in group 'Review'
at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:155)
at org.neo4j.unsafe.impl.batchimport.staging.AbstractStep.issuePanic(AbstractStep.java:147)
at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:59)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException: Id '#NAME?' is defined more than once in group 'Review'
at org.neo4j.unsafe.impl.batchimport.input.BadCollector$NodesProblemReporter.exception(BadCollector.java:278)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collect(BadCollector.java:168)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:135)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:606)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:522)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:239)
at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:56)
at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep.lambda$receive$0(LonelyProcessingStep.java:53)
... 1 more
IMPORT FAILED in 13s 620ms.
Data statistics is not available.
Peak memory usage: 1.02 GB
Duplicate input ids that would otherwise clash can be put into separate id space, read more about how to use id spaces in the manual: https://neo4j.com/docs/operations-manual/3.5/tools/import/file-header-format/#import-tool-id-spaces
Caused by:Id '#NAME?' is defined more than once in group 'Review'
WARNING Import failed. The store files in /Users/abishekarunachalam/Downloads/NEO4J_HOME/data/databases/yelp.db are left as they are, although they are likely in an unusable state. Starting a database on these store files will likely fail or observe inconsistent records so start at your own risk or delete the store manually
unexpected error: Id '#NAME?' is defined more than once in group 'Review'
I checked the review.csv file and found '#NAME?' repeated multiple times in the first column.
Considering we are beginners, any guidance on what could have gone wrong, or on any other way to import the Yelp data into Neo4j efficiently, would be much appreciated. Thank you!
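The duplicate-ID error above can be investigated directly in the CSV file. '#NAME?' is the error text that spreadsheet applications such as Excel show for a cell they could not evaluate as a formula, so one possible explanation (an assumption; nothing in this thread confirms it) is that review.csv was opened and re-saved in a spreadsheet, turning IDs that begin with "-" or "=" into the same literal string. The sketch below counts repeated values in one column; the function name duplicate_ids and the column position are illustrative assumptions.

```python
import csv
from collections import Counter


def duplicate_ids(csv_path, id_column=0, has_header=True):
    """Return {value: count} for values occurring more than once in one CSV column."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            if len(row) > id_column:
                counts[row[id_column]] += 1
    return {value: n for value, n in counts.items() if n > 1}
```

Calling duplicate_ids("review.csv") on the file produced by the conversion step, before it is opened in any other tool, would show whether the duplicates come from the conversion script or were introduced afterwards.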
05-10-2019 07:23 AM
Hi @mark.needham, this is the error I'm getting while running the JSON-to-CSV script:
Traceback (most recent call last):
  File "json_to_csv.py", line 109, in <module>
    for category in item["categories"]:
TypeError: 'NoneType' object is not iterable
I also tried installing the libraries from requirements.txt and got errors for pkg-resources and PyYAML:
Could not find a version that satisfies the requirement pkg-resources==0.0.0 (from -r requirements.txt (line 10)) (from versions: )
No matching distribution found for pkg-resources==0.0.0 (from -r requirements.txt (line 10))
Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
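The TypeError above means that item["categories"] was None for at least one business record. A minimal sketch of a guard follows, assuming the field may be None, a comma-separated string, or a list depending on the dataset release. The helper parse_categories is hypothetical and is not the code in json_to_csv.py.

```python
def parse_categories(raw):
    """Normalise a Yelp `categories` value to a list of category names.

    Assumption: the value may be None, a comma-separated string, or a list.
    """
    if raw is None:
        return []
    if isinstance(raw, str):
        return [part.strip() for part in raw.split(",") if part.strip()]
    return list(raw)


# Hypothetical record shaped like one line of business.json.
item = {"business_id": "b1", "categories": None}
for category in parse_categories(item["categories"]):
    print(category)  # never reached for this record: the list is empty
```

Wrapping the value this way lets the loop skip businesses without categories instead of raising.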
05-14-2019 07:55 PM
Hi @mark.needham ,
We managed to sort out some of those issues for now by loading the data using Cypher instead of Python.
How long did it take you to load the complete Yelp dataset? Loading business.json took me around 7 hours with the heap size configured to 12G and the page cache size to 6GB. I'm running Neo4j Desktop on my laptop (4 cores, 32GB).
This is what I ran. I am wondering whether setting the batch size and parallel = true would have made a difference.
CALL apoc.load.json('file:///business.json')
YIELD value
WITH value
MERGE (b:Business {id:value.business_id})
SET b += apoc.map.clean(value, ['attributes','hours','business_id','categories','address','postal_code'], [])
WITH b,value.categories as categories
UNWIND categories as category
MERGE (c:Category{name:category})
MERGE (b)-[:IN_CATEGORY]->(c);
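On the batch-size question: a MERGE on a property that has no index or uniqueness constraint has to scan every node with that label for each input row, and a single large transaction keeps all changes in memory until commit. Whether either was the cause here is not established in the thread. The sketch below shows one way to split line-delimited JSON into batches that could each be sent as the rows parameter of an UNWIND query. It assumes business.json holds one JSON object per line, uses Neo4j 3.x constraint syntax, and leaves out the driver call entirely; the names batches, CREATE_CONSTRAINT, and MERGE_BATCH are illustrative.

```python
import json
from itertools import islice

# Cypher text only; sending it needs a Neo4j driver session, which is omitted here.
# Neo4j 3.x constraint syntax, so that MERGE can use an index lookup on Business.id.
CREATE_CONSTRAINT = "CREATE CONSTRAINT ON (b:Business) ASSERT b.id IS UNIQUE"
MERGE_BATCH = (
    "UNWIND $rows AS row "
    "MERGE (b:Business {id: row.business_id}) "
    "SET b.name = row.name"
)


def batches(lines, size):
    """Yield lists of at most `size` parsed JSON records from an iterable of lines."""
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        chunk = list(islice(records, size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk would be committed in its own transaction. Running such batches in parallel is a separate decision: concurrent MERGE operations on shared nodes such as Category can contend for locks, so sequential batches are the more conservative starting point.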
10-07-2019 11:42 AM
Hey, I just came upon this thread and I'm also trying to import the Yelp data. How long did the import end up taking you? So far everything appears to be working for me using just the APOC commands here https://neo4j.com/docs/graph-algorithms/current/yelp-example/, but it is taking a while.
11-02-2020 04:38 PM
Hi @mark.needham. The following couple of edits was required for me to successfully run the Python files:
If there are any more updates, will add to this list...