
Analysis of big data and use of ETL

I am currently working on a project that involves a large amount of data. Using the sizing guidance provided by experts in the community, I evaluated Neo4j and found that the Community Edition may not be able to handle this task. The project mainly analyzes relationships in data held in a relational database. My idea is to automatically import part of the data from the relational database into Neo4j for analysis, but the total volume is too large to import all at once. My question is: instead of a full import, could I use the neo4j-etl tool to automatically import a portion of the relational data, run the analysis, export the results, delete the data from Neo4j, and then use the ETL tool to import and analyze the remaining data in batches? Would such a process be efficient, or would it spend too much time on I/O? Is there a better way?
Thank you very much!


12kunal34
Graph Fellow

Hi @ruanlovelin

If you are working with a large dataset, you can do it in multiple ways.

One way is to use neo4j-ETL.
A second way is to use the APOC JDBC procedures.
The second way is better because you can specify the schema yourself, and you don't need to worry about CSV files as you do with neo4j-ETL.
Simply download the JDBC driver and put it into the plugins folder; after that, register the driver and call the procedure according to your needs.
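
For reference, here is a minimal sketch of that APOC approach, assuming an Oracle source; the connection string, table, column names, and the :Account label are placeholders, not something taken from your actual schema:

// Register the Oracle JDBC driver (the ojdbc jar must already be in the plugins folder).
CALL apoc.load.driver('oracle.jdbc.driver.OracleDriver');

// Stream rows from the relational source and merge them into the graph.
CALL apoc.load.jdbc(
  'jdbc:oracle:thin:user/password@//dbhost:1521/SERVICE',
  'SELECT account_id, account_name FROM accounts'
) YIELD row
MERGE (a:Account {id: row.ACCOUNT_ID})
SET a.name = row.ACCOUNT_NAME;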

The volume of data doesn't matter; the only thing that matters is your Neo4j configuration. Neo4j is capable of handling big data, so whenever you load a table with a huge volume of data, make sure it is written in batches of 10k rows max.
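
As a sketch of what that batching could look like with apoc.periodic.iterate (again, the JDBC URL, table, columns, labels, and relationship type are only illustrative):

// Pull transaction rows from the relational source and write them in batches of 10k,
// so no single transaction has to hold millions of rows in memory.
CALL apoc.periodic.iterate(
  "CALL apoc.load.jdbc('jdbc:oracle:thin:user/password@//dbhost:1521/SERVICE',
     'SELECT src_account, dst_account, amount, tx_time FROM transactions') YIELD row RETURN row",
  "MERGE (s:Account {id: row.SRC_ACCOUNT})
   MERGE (d:Account {id: row.DST_ACCOUNT})
   CREATE (s)-[:TRANSACTION {amount: row.AMOUNT, time: row.TX_TIME}]->(d)",
  {batchSize: 10000, parallel: false}
);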

Thank you very much! I also studied some APOC material earlier, and I found a way to solve my problem. I will take another look, and then I will ask you again.

How much data are we talking about? Roughly how much model complexity (i.e., number of node labels and relationship types)?

What are the daily throughput amounts like? When you say delete the data, export the analysis, and do another batch, how frequent are the batches and how big are they?

I am dealing with a large securities trading system, and I need to analyze the trading activity of the accounts I am interested in. The data is stored in my existing relational database, DB2 or Oracle. There are more than 100 schemas in the source database, and each schema has about 200 million nodes. My application scenario is: given an account number, query the transaction chain of that account at a certain moment, or after a given transaction, in the shortest possible time. For simplicity, there are 4 node labels and 1 relationship type. The data is not updated daily, only monthly. In a monthly update the number of nodes does not increase much, but the number of transactions does; on average each schema adds about 800 million transactions per month. I wanted to use ETL to import the transaction data into Neo4j for analysis, but after the evaluation that does not seem feasible. So I want to import and analyze the data of one schema at a time, get the corresponding transaction chains, and then import the data of the next schema for analysis. Not all transactions involve nodes in all schemas, so I hope this method can improve efficiency. But in this case the frequent I/O interactions will also take a lot of time. Therefore, I hope to get your help and guidance. Thank you very much!

Sorry, I made a mistake. I have not found a way to solve this problem.

Could you please check it with the JDBC driver?

From what you said, I don't need ETL; is APOC alone enough to solve my problem? I have already downloaded APOC and tried it, but I have not found a corresponding solution. My scenario is: there are a lot of transaction records; how can I find the trading chain for an account? Please help me solve this problem. Thank you very much!

As I said, you can try APOC as well if neo4j-ETL is not giving the desired output.
You want to migrate data from DB2 or Oracle to Neo4j, and after that you want to find the trading chain for an account.
Getting the trading chain is not that big an issue. Once you have all the data in Neo4j, you can easily use Cypher to get it (see the sketch below).
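
For example, once the accounts and transactions are loaded, a trading chain can be explored with a variable-length path query. This is only a sketch; the Account label, TRANSACTION relationship type, and property names are assumptions based on your description (4 node labels, 1 relationship type):

// Follow up to 5 hops of transactions out of the given account,
// optionally restricted to transactions after a given time.
MATCH path = (start:Account {id: $accountId})-[:TRANSACTION*1..5]->(:Account)
WHERE all(t IN relationships(path) WHERE t.time >= $fromTime)
RETURN path
LIMIT 100;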

Could you please let us know what approach you followed for the data migration?

So far, I have not migrated data from DB2 or Oracle. I am still evaluating the feasibility of various approaches, including Neo4j, Spark GraphX, Elasticsearch, and so on. I hope to retrieve the trading chain as quickly as possible given the large data volume. I evaluated Neo4j earlier; if all of the transaction data is imported (which would probably require ETL), Neo4j may not be able to handle it. So I thought of a way to import and export data in batches, one schema at a time, and then build up the query of the transaction chain step by step. Do you think my idea is correct? Thank you very much!

You've got an interesting situation here. I've reached out to one of our field engineers in China, who will follow up directly with you. I think we can help, but it might be better for that engineer to engage you in a separate email or phone conversation to get the details and step through some options. If you don't hear back from him in a day or two, please ping back and we'll try to get you further help.

Sorry, my English is relatively poor. First of all, thank you very much for your help. I hope you can help me with the following:
1. I need to export data from DB2 or Oracle to Neo4j. Which is more suitable for this requirement, APOC or ETL?
2. If it is APOC, can you give me some links to documentation? I will study it myself and ask you if I run into difficulties.
3. If it is ETL, can you give me some links to documentation and to the license of the ETL tool? I will study it myself and ask you if I run into difficulties.
Thank you very much!