Steven Haines
Contributor

Big data analytics with Neo4j and Java, Part 1

how-to
Mar 13, 2018 · 18 mins

Graph databases like Neo4j are ideal for modeling complex relationships--and they move through big data at lightspeed

Credit: Matt Biddulph

Relational databases have dominated data management for decades, but they’ve recently lost ground to NoSQL alternatives. While NoSQL data stores aren’t right for every use case, they are generally better for big data, which is shorthand for systems that process massive volumes of data. Four types of data store are used for big data:

  • Key/value stores such as Memcached and Redis
  • Document-oriented databases such as MongoDB, CouchDB, and DynamoDB
  • Column-oriented data stores such as Cassandra and HBase
  • Graph databases such as Neo4j and OrientDB

This tutorial introduces Neo4j, which is a graph database used for interacting with highly related data. While relational databases are good at managing relationships between data, graph databases are better at managing n-th degree relationships. As an example, take a social network, where you want to analyze patterns involving friends, friends of friends, and so on. A graph database would make it easy to answer a question like, “Given five degrees of separation, what are five movies popular with my social network that I have not yet seen?” Such questions are common for recommendation software, and graph databases are perfect for solving them. Additionally, graph databases are good at representing hierarchical data, such as access controls, product catalogs, movie databases, or even network topologies and organization charts. When you have objects with multiple relationships, you’ll quickly find that graph databases offer an elegant, object-oriented paradigm for managing those objects.

The case for graph databases

As the name suggests, graph databases are good at representing graphs of data. This is especially useful for social software, where every time you connect with someone, a relationship is defined between you. In your last job search, you probably picked a few companies that you were interested in and then searched your social networks for connections to them. While you might not know anyone working for one of those companies, someone in your social network likely does. Solving a problem like this is easy at one or two degrees of separation (your friend or a friend of a friend), but what happens when you start extending the search across your network?

In their book, Neo4j in Action, Aleksa Vukotic and Nicki Watt explore the differences between relational databases and graph databases for solving social network problems. I’m going to draw on their work for the next few examples, in order to show you why graph databases are becoming an increasingly popular alternative to relational databases.

Modeling complex relationships: Neo4j vs MySQL

From a computer science perspective, when we think about modeling relationships between users in a social network, we might draw a graph like the one in Figure 1.


Figure 1. Graphing relationships in a social network

A user has IS_FRIEND_OF relationships with other users, and those users have IS_FRIEND_OF relationships with other users, and so forth. Figure 2 shows how we’d represent this in a relational database.


Figure 2. Modeling a social graph in a relational database

The USER table has a one-to-many relationship with the USER_FRIEND table, which models the “friend” relationship between two users. Now that we’ve modeled the relationships, how would we query our data? Vukotic and Watt measured the query performance for counting the number of distinct friends going out to a depth of five levels (friends of friends of friends of friends of friends). In a relational database the queries would look as follows:


# Depth 1
select count(distinct uf.user_2) from user_friend uf where uf.user_1 = ?

# Depth 2
select count(distinct uf2.user_2) from user_friend uf1
  inner join user_friend uf2 on uf1.user_2 = uf2.user_1
  where uf1.user_1 = ?

# Depth 3
select count(distinct uf3.user_2) from user_friend uf1
  inner join user_friend uf2 on uf1.user_2 = uf2.user_1
  inner join user_friend uf3 on uf2.user_2 = uf3.user_1
  where uf1.user_1 = ?

# And so on...

What is interesting about these queries is that each time we go out one more level, we are required to join the USER_FRIEND table with itself. Table 1 shows what Vukotic and Watt found when they inserted 1,000 users with approximately 50 relationships each (50,000 relationships) and ran the queries.

Table 1. MySQL query response time for various depths of relationships

Depth   Execution time (seconds)   Count result
2       0.028                      ~900
3       0.213                      ~999
4       10.273                     ~999
5       92.613                     ~999

MySQL does a great job of joining data up to three levels away, but performance degrades rapidly after that. The reason is that each time the USER_FRIEND table is joined with itself, MySQL must compute the cartesian product of the table, even though the majority of the data will be thrown away. For example, when performing that join five times, the cartesian product results in 50,000^5 = (5*10^4)^5 rows, or 312.5*10^21 rows. That’s a waste when we are only interested in 1,000 of them!
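To make the self-join pattern concrete, here is a minimal sketch that runs the depth 1 and depth 2 counts against a tiny, made-up friendship table. It uses Python's built-in sqlite3 module rather than MySQL, so it only illustrates the shape of the queries and says nothing about MySQL's timings. The table and column names follow the queries above; the sample rows and the helper function names are assumptions for illustration.

```python
import sqlite3


def build_db(pairs):
    """Create an in-memory user_friend table from (user_1, user_2) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_friend (user_1 INTEGER, user_2 INTEGER)")
    conn.executemany("INSERT INTO user_friend VALUES (?, ?)", pairs)
    return conn


def count_depth1(conn, user_id):
    # Direct friends: a single pass over user_friend.
    sql = "SELECT COUNT(DISTINCT uf.user_2) FROM user_friend uf WHERE uf.user_1 = ?"
    return conn.execute(sql, (user_id,)).fetchone()[0]


def count_depth2(conn, user_id):
    # Friends of friends: user_friend joined with itself once.
    # Each friend (uf1.user_2) is linked to that friend's own rows (uf2.user_1).
    sql = """
        SELECT COUNT(DISTINCT uf2.user_2)
        FROM user_friend uf1
        INNER JOIN user_friend uf2 ON uf1.user_2 = uf2.user_1
        WHERE uf1.user_1 = ?
    """
    return conn.execute(sql, (user_id,)).fetchone()[0]


# Made-up sample data: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4, 3 -> 5
conn = build_db([(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)])
print(count_depth1(conn, 1))  # direct friends of user 1
print(count_depth2(conn, 1))  # friends of friends of user 1
```

Every additional level of depth adds one more INNER JOIN of the same table, which is where the cost described above comes from.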

Next, Vukotic and Watt tried executing the same type of queries against Neo4j. These entirely different results are shown in Table 2.

Table 2. Neo4j response time for various depths of relationships

Depth   Execution time (seconds)   Count result
2       0.04                       ~900
3       0.06                       ~999
4       0.07                       ~999
5       0.07                       ~999

The takeaway from these execution comparisons is not that Neo4j is better than MySQL. Rather, when traversing these types of relationships, Neo4j’s performance depends on the number of records retrieved, whereas MySQL’s performance depends on the number of records in the USER_FRIEND table. Thus, as the total number of relationships grows, the response times for MySQL queries grow with it, whereas the response times for Neo4j queries stay roughly the same as long as the number of relationships a given query touches does not change.
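One way to see why traversal cost tracks the records visited rather than the size of the whole data set is a breadth-first walk over adjacency lists. The sketch below is a plain in-memory illustration, not Neo4j's storage engine or API; the `friends_within` function and the sample graph are hypothetical.

```python
from collections import deque


def friends_within(adjacency, start, max_depth):
    """Return (distinct users reachable within max_depth hops, edges examined)."""
    seen = {start}
    frontier = deque([(start, 0)])
    edges_examined = 0
    while frontier:
        user, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in adjacency.get(user, ()):
            edges_examined += 1
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    seen.discard(start)
    return seen, edges_examined


# A small connected component plus a large, unrelated chain of users.
graph = {"ann": ["bob", "cat"], "bob": ["dan"], "cat": ["dan"], "dan": []}
for i in range(10_000):
    graph[f"other{i}"] = [f"other{i + 1}"]

reachable, examined = friends_within(graph, "ann", 2)
print(sorted(reachable), examined)
```

The 10,000 unrelated edges are never examined: the work done depends only on the edges reachable from the starting user within the requested depth.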

Scaling Neo4j for big data

Taking this experiment one step further, Vukotic and Watt next created a million users with 50 million relationships between them. Table 3 shows results for that data set.

Table 3. Neo4j response time for 50 million relationships

Depth   Execution time (seconds)   Count result
2       0.01                       ~2,500
3       0.168                      ~110,000
4       1.359                      ~600,000
5       2.132                      ~800,000

Needless to say, I am indebted to Aleksa Vukotic and Nicki Watt and highly recommend checking out their work. I extracted all the tests in this section from the first chapter of their book, Neo4j in Action.

Getting started with Neo4j

You’ve seen that Neo4j is capable of querying massive amounts of highly related data very quickly, and there’s no doubt it’s a better fit than MySQL (or any relational database) for certain kinds of problems. If you want to understand more about how Neo4j works, the easiest way is to interact with it through the web console.

Start by downloading Neo4j. For this article, you’ll want the Community Edition, which as of this writing is at version 3.2.3.

  • On a Mac, download a DMG file and install it as you would any other application.
  • On Windows, either download an EXE and walk through an installation wizard or download a ZIP file and decompress it on your hard drive.
  • On Linux, download a TAR file and decompress it on your hard drive.
  • Alternatively, use a Docker image on any operating system.

Once you have installed Neo4j, start it up and open a browser window to the following URL:

http://127.0.0.1:7474/browser/

Log in with the default username of neo4j and the default password of neo4j. You should see a screen similar to Figure 3.


Figure 3. Web interface for Neo4j

Nodes and relationships in Neo4j

Neo4j is designed around the concept of nodes and relationships:

  • A node represents a thing, such as a user, a movie, or a book.
  • A node contains a set of key/value pairs, such as a name, a title, or a publisher.
  • A node’s label defines what type of thing it is–again, a User, a Movie, or a Book.
  • Relationships define associations between nodes and are of specific types.

As an example, we might define Character nodes such as Iron Man and Captain America; define a Movie node named “Avengers”; and then define an APPEARS_IN relationship between Iron Man and Avengers and Captain America and Avengers. All of this is shown in Figure 4.


Figure 4. Nodes and relationships

Figure 4 shows three nodes (two Character nodes and one Movie node) and two relationships (both of type APPEARS_IN).
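As a rough mental model, the pieces above map onto two small record types. The sketch below only illustrates the idea of labeled nodes and typed relationships, using hypothetical `Node` and `Relationship` classes; it is not how Neo4j stores data, and the `name` and `title` property keys are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                       # e.g. {"Character"} or {"Movie"}
    properties: dict = field(default_factory=dict)    # key/value pairs


@dataclass
class Relationship:
    type: str                                         # e.g. "APPEARS_IN"
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


# The three nodes and two relationships from Figure 4.
iron_man = Node({"Character"}, {"name": "Iron Man"})
captain_america = Node({"Character"}, {"name": "Captain America"})
avengers = Node({"Movie"}, {"title": "Avengers"})

relationships = [
    Relationship("APPEARS_IN", iron_man, avengers),
    Relationship("APPEARS_IN", captain_america, avengers),
]

# Who appears in the Avengers movie?
cast = [r.start.properties["name"] for r in relationships if r.end is avengers]
print(cast)
```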

Modeling and querying nodes and relationships

Similar to how a relational database uses Structured Query Language (SQL) to interact with data, Neo4j uses Cypher Query Language to interact with nodes and relationships.

Let’s use Cypher to create a simple representation of a family. At the top of the web interface, look for the dollar sign. This indicates a field that allows you to execute Cypher queries directly against Neo4j. Enter the following Cypher query into that field (I’m using my family as an example, but feel free to change the details to model your own family if you like):

CREATE (person:Person {name: "Steven", age: 45}) RETURN person

The result is shown in Figure 5.


Figure 5. Creating a Person with Cypher Query Language

In Figure 5 you can see a new node with the label Person and the name Steven. If you hover your mouse over the node in your web console, you will see its properties at the bottom. In this case, the properties are ID: 19, name: Steven, and age: 45. Now let’s break down the Cypher query:

  • CREATE: The CREATE keyword is used to create nodes and relationships. In this case, we pass it a single argument, which is a Person enclosed in parentheses, so it is meant to create a single node.
  • (person: Person {…}): The lower case “person” is a variable name through which we can access the person being created, while the capital “Person” is the label. Note that a colon separates the variable name from the label.
  • {name: "Steven", age: 45}: These are the key/value properties that we’re defining for the node we’re creating. Neo4j does not require you to define a schema before creating nodes, and each node can have its own set of properties. (Most of the time you define nodes with the same label to have the same properties, but it is not required.)
  • RETURN person: After the node is created, we ask Neo4j to return it back to us. This is why we saw the node appear in the user interface.

The CREATE command (which is case insensitive) is used to create nodes and can be read as follows: create a new node with the Person label that contains name and age properties; assign it to the person variable and return it back to the caller.

Querying with Cypher Query Language

Next we want to try some querying with Cypher. First, we’ll need to create a few more people, so that we can define relationships between them.


    CREATE (person:Person {name: "Michael", age: 16}) RETURN person
    CREATE (person:Person {name: "Rebecca", age: 7}) RETURN person
    CREATE (person:Person {name: "Linda"}) RETURN person

Once you’ve created your four people, you can either click on the Person button under the Node Labels (visible if you click on the database icon in the upper left corner of the web page) or execute the following Cypher query:

MATCH (person: Person) RETURN person

Cypher uses the MATCH keyword to find things in Neo4j. In this example, we are asking Cypher to match all nodes that have a label of Person, assign those nodes to the person variable, and return the value that is associated with that variable. As a result you should see the four nodes that you’ve created. If you hover over each node in your web console, you will see each person’s properties. (You might note that I excluded my wife’s age from her node, illustrating that properties do not need to be consistent across nodes, even of the same label. I am also not foolish enough to publish my wife’s age.)

We can extend this MATCH example a little further by adding conditions to the nodes we want returned. For example, if we wanted just the “Steven” node, we could retrieve it by matching on the name property:

MATCH (person: Person {name: "Steven"}) RETURN person

Or, if we wanted to return all of the children we could request all people having an age under 18:

MATCH (person: Person) WHERE person.age < 18 RETURN person

In this example we added the WHERE clause to the query to narrow our results. WHERE works very similarly to its SQL equivalent: MATCH (person: Person) finds all nodes with the Person label, and then the WHERE clause filters values out of the result set.
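One detail worth noticing is what happens to Linda, whose node has no age property. In Cypher a missing property evaluates to null, and a comparison against null is not true, so the WHERE clause drops that row. The sketch below mimics that filtering over plain Python dictionaries; it is an analogy for the query's behavior rather than Neo4j code, and the `where_age_under` helper is hypothetical.

```python
# The four Person nodes created so far, as plain dictionaries.
people = [
    {"name": "Steven", "age": 45},
    {"name": "Michael", "age": 16},
    {"name": "Rebecca", "age": 7},
    {"name": "Linda"},  # no age property
]


def where_age_under(rows, limit):
    """Keep rows whose age is present and below limit; a missing age never matches."""
    return [row for row in rows if row.get("age") is not None and row["age"] < limit]


children = where_age_under(people, 18)
print([person["name"] for person in children])
```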

Modeling direction in relationships

We have four nodes, so let’s create some relationships. First of all, let’s create the IS_MARRIED_TO relationship between Steven and Linda:

MATCH (steven:Person {name: "Steven"}), (linda:Person {name: "Linda"}) CREATE (steven)-[:IS_MARRIED_TO]->(linda) return steven, linda

In this example we match two Person nodes with the names Steven and Linda, and we create a relationship of type IS_MARRIED_TO from Steven to Linda. The format for creating the relationship is as follows:

(node1)-[relationshipVariable:RELATIONSHIP_TYPE]->(node2)

The relationshipVariable is optional, but it’s required if you want to be able to access the relationship in your RETURN statement (or in a WHERE clause). The arrows, ()-[]->(), denote the direction of the relationship, which Cypher requires when you create a relationship. If you wanted to express that Linda is married to Steven, then you could write the relationship in the other direction as follows: ()<-[]-(). If you wanted to create a bi-directional relationship, showing that Linda and Steven are married to each other, then you would need to create two separate relationships. While Cypher requires that you define a direction when creating a relationship, you can query either with a direction or without a direction.

The following query finds all the people in this family who are married (note the lack of any direction in the query):

MATCH (p1:Person)-[:IS_MARRIED_TO]-(p2:Person) RETURN p1, p2

The result is shown in Figure 6.
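Because the relationship is stored once, from Steven to Linda, while the pattern has no arrowhead, the match succeeds in both orientations and the query returns two rows. The following is a small in-memory sketch of that behavior; the `match_directed` and `match_undirected` helpers are hypothetical and only imitate the pattern matching.

```python
# Each relationship is stored once, with a direction: (start, type, end).
relationships = [("Steven", "IS_MARRIED_TO", "Linda")]


def match_directed(rels, rel_type):
    """Rows for the pattern (p1)-[:rel_type]->(p2)."""
    return [(start, end) for start, t, end in rels if t == rel_type]


def match_undirected(rels, rel_type):
    """Rows for the pattern (p1)-[:rel_type]-(p2): each relationship matches both ways."""
    rows = []
    for start, t, end in rels:
        if t == rel_type:
            rows.append((start, end))
            rows.append((end, start))
    return rows


print(match_directed(relationships, "IS_MARRIED_TO"))
print(match_undirected(relationships, "IS_MARRIED_TO"))
```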

Now let’s create a few more relationships:


MATCH (michael:Person {name: "Michael"}), (rebecca:Person {name: "Rebecca"}) CREATE (michael)-[:IS_SIBLING]->(rebecca) return michael, rebecca
MATCH (steven:Person {name: "Steven"}), (michael:Person {name: "Michael"}) CREATE (steven)-[:HAS_CHILD]->(michael) return steven, michael
MATCH (steven:Person {name: "Steven"}), (rebecca:Person {name: "Rebecca"}) CREATE (steven)-[:HAS_CHILD]->(rebecca) return steven, rebecca
MATCH (linda:Person {name: "Linda"}), (michael:Person {name: "Michael"}) CREATE (linda)-[:HAS_CHILD]->(michael) return linda, michael
MATCH (linda:Person {name: "Linda"}), (rebecca:Person {name: "Rebecca"}) CREATE (linda)-[:HAS_CHILD]->(rebecca) return linda, rebecca

We can now see all people and their relationships with the following query:

MATCH (p:Person) RETURN p

The result is shown in Figure 7.

Traversing the social graph

To really explore the power of graph databases, we’ll need to expand our social graph. To start, let’s add some FRIEND relationships:


        MATCH (michael:Person {name: "Michael"}) CREATE (michael)-[:FRIEND]->(charlie:Person {name: "Charlie", age: 16}) RETURN michael, charlie
        MATCH (michael:Person {name: "Michael"}) CREATE (michael)-[:FRIEND]->(koby:Person {name: "Koby"}) RETURN michael, koby
        MATCH (michael:Person {name: "Michael"}) CREATE (michael)-[:FRIEND]->(grant:Person {name: "Grant"}) RETURN michael, grant
        MATCH (rebecca:Person {name: "Rebecca"}) CREATE (rebecca)-[:FRIEND]->(jordyn:Person {name: "Jordyn"}) RETURN rebecca, jordyn
        MATCH (rebecca:Person {name: "Rebecca"}) CREATE (rebecca)-[:FRIEND]->(katie:Person {name: "Katie"}) RETURN rebecca, katie
    

Something interesting about these relationships is that the friend nodes are created at the same time as the FRIEND relationships. For example, the “Charlie” Person node does not exist when the first statement is executed, but the statement creates a FRIEND relationship from the existing “Michael” Person node to a new Person node with the name “Charlie”. You can pull up all Person nodes and verify that the node was created as shown in Figure 8.

We have a pretty good social graph started, so let’s try writing a more involved query to find all the friends of my children:

MATCH (steven:Person {name:"Steven"})-[:HAS_CHILD]-(:Person)-[:FRIEND]-(friend:Person) RETURN friend

The results are shown in Figure 9.

In this query, we start with the Person node with the name “Steven”, traverse across all HAS_CHILD relationships to Person nodes, traverse across all of those Person nodes’ FRIEND relationships, and return the list of friends. We could have included directional relationships, but omitting the arrowhead allows us to traverse both directions.
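The same two-hop walk can be sketched over a plain list of relationships. This is an in-memory analogy, not Neo4j code: the `neighbors` and `friends_of_children` helpers are hypothetical, and only the two relationship types the query touches are included.

```python
# Directed relationships as (start, type, end), mirroring the statements above.
relationships = [
    ("Steven", "HAS_CHILD", "Michael"),
    ("Steven", "HAS_CHILD", "Rebecca"),
    ("Linda", "HAS_CHILD", "Michael"),
    ("Linda", "HAS_CHILD", "Rebecca"),
    ("Michael", "FRIEND", "Charlie"),
    ("Michael", "FRIEND", "Koby"),
    ("Michael", "FRIEND", "Grant"),
    ("Rebecca", "FRIEND", "Jordyn"),
    ("Rebecca", "FRIEND", "Katie"),
]


def neighbors(rels, person, rel_type):
    """People connected to person by rel_type, ignoring direction."""
    found = []
    for start, t, end in rels:
        if t != rel_type:
            continue
        if start == person:
            found.append(end)
        elif end == person:
            found.append(start)
    return found


def friends_of_children(rels, parent):
    """First hop over HAS_CHILD, second hop over FRIEND."""
    friends = []
    for child in neighbors(rels, parent, "HAS_CHILD"):
        friends.extend(neighbors(rels, child, "FRIEND"))
    return friends


print(friends_of_children(relationships, "Steven"))
```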

Key/value pairs in the social graph

In addition to defining a relationship between two nodes, relationships themselves can have key/value pairs. For example, we might decide to create Movie nodes, then create HAS_SEEN relationships between people and movies they have seen. In those HAS_SEEN relationships we could also add a “rating” property. The following code creates a Movie with the title Avengers and then creates a HAS_SEEN relationship between Michael and the movie Avengers, with a rating of 5.


        CREATE (movie:Movie {title:"Avengers"}) RETURN movie
        MATCH (michael:Person {name:"Michael"}), (avengers:Movie {title:"Avengers"}) CREATE (michael)-[:HAS_SEEN {rating:5}]->(avengers) return michael, avengers
    

Figure 10 shows the results.

Graph analytics in Java

For our final example before getting into Java code, let’s try a simple experiment with graph analytics. We’ll add a few movies to my children’s friends, set the gender of my children, and then query for movies that one of my children (Michael) might like to see. The results are shown in Figure 11.


        CREATE (movie:Movie {title:"Batman"}) RETURN movie
        CREATE (movie:Movie {title:"Gone with the Wind"}) RETURN movie
        CREATE (movie:Movie {title:"Spongebob Square Pants"}) RETURN movie
        CREATE (movie:Movie {title:"Avengers 2"}) RETURN movie
        MATCH (charlie:Person {name:"Charlie"}), (movie:Movie {title:"Batman"}) CREATE (charlie)-[:HAS_SEEN {rating:4}]->(movie) return charlie, movie
        MATCH (charlie:Person {name:"Charlie"}), (movie:Movie {title:"Gone with the Wind"}) CREATE (charlie)-[:HAS_SEEN {rating:0}]->(movie) return charlie, movie
        MATCH (koby:Person {name:"Koby"}), (movie:Movie {title:"Batman"}) CREATE (koby)-[:HAS_SEEN {rating:4}]->(movie) return koby, movie
        MATCH (koby:Person {name:"Koby"}), (movie:Movie {title:"Avengers 2"}) CREATE (koby)-[:HAS_SEEN {rating:5}]->(movie) return koby, movie
        MATCH (grant:Person {name:"Grant"}), (movie:Movie {title:"Spongebob Square Pants"}) CREATE (grant)-[:HAS_SEEN {rating:1}]->(movie) return grant, movie
        MATCH (jordyn:Person {name:"Jordyn"}), (movie:Movie {title:"Spongebob Square Pants"}) CREATE (jordyn)-[:HAS_SEEN {rating:5}]->(movie) return jordyn, movie
        MATCH (michael:Person {name: "Michael"}) SET michael.gender = "male" RETURN michael
        MATCH (rebecca:Person {name: "Rebecca"}) SET rebecca.gender = "female" RETURN rebecca
        MATCH (steven:Person {name:"Steven"})-[:HAS_CHILD]-(child:Person)-[:FRIEND]-(friend:Person)-[hasSeen:HAS_SEEN]-(movie:Movie) WHERE child.gender = "male" AND hasSeen.rating > 3 RETURN DISTINCT movie.title
    

The first four statements above create four movies. The next six statements create HAS_SEEN relationships between friends of my children and the movies they’ve seen, with different ratings. The next two statements add a gender to my children, which is accomplished by finding the Person node by name and then calling SET childName.gender = "male|female". In Cypher, the SET statement allows you to change an existing property, add a new property, or delete a property by setting the value to NULL.

The final query takes a little work to understand. We start with the Person with the name “Steven”, follow his HAS_CHILD relationships to children Person nodes, follow those Person nodes to FRIEND Person nodes, follow those friend Person nodes to Movie nodes through HAS_SEEN relationships, and then add a WHERE clause that checks both the gender of Steven’s child and the value of the HAS_SEEN rating property.

Finally, because more than one friend has seen the same movie (Batman), we want to return only DISTINCT movie titles. In this case we do not return the movie node, but rather the movie’s title property, which is why the output is presented in a table. An observant reader might notice that we could simplify this a little by adding the gender to the child node pattern, as follows:

MATCH (steven:Person {name:"Steven"})-[:HAS_CHILD]-(child:Person {gender:"male"})-[:FRIEND]-(friend:Person)-[hasSeen:HAS_SEEN]->(movie:Movie) WHERE hasSeen.rating > 3 RETURN DISTINCT movie.title
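To check your reading of the query, it can help to walk the same path by hand. The sketch below reproduces the traversal over plain Python data that mirrors the statements above. It is an analogy, not Neo4j code: the dictionary layout and the `recommend` function are hypothetical.

```python
has_child = {"Steven": ["Michael", "Rebecca"]}
gender = {"Michael": "male", "Rebecca": "female"}
friends = {
    "Michael": ["Charlie", "Koby", "Grant"],
    "Rebecca": ["Jordyn", "Katie"],
}
# HAS_SEEN relationships with their rating property: (person, movie title, rating).
has_seen = [
    ("Michael", "Avengers", 5),
    ("Charlie", "Batman", 4),
    ("Charlie", "Gone with the Wind", 0),
    ("Koby", "Batman", 4),
    ("Koby", "Avengers 2", 5),
    ("Grant", "Spongebob Square Pants", 1),
    ("Jordyn", "Spongebob Square Pants", 5),
]


def recommend(parent, child_gender, min_rating):
    """Distinct titles rated above min_rating by friends of the parent's matching children."""
    titles = []
    for child in has_child.get(parent, []):
        if gender.get(child) != child_gender:
            continue  # filter on the child's gender
        for friend in friends.get(child, []):
            for person, title, rating in has_seen:
                # filter on the rating property; skip titles already collected (DISTINCT)
                if person == friend and rating > min_rating and title not in titles:
                    titles.append(title)
    return titles


print(recommend("Steven", "male", 3))
```

With the data created in this section, the walk ends at Batman and Avengers 2, which is what the Cypher query should return as well. Cypher does not guarantee row order without an ORDER BY clause.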

Conclusion to Part 1

Cypher is a different way of thinking about writing queries and I encourage you to read through the formal documentation to learn more. Once you have a handle on writing Cypher queries, the Java programming will be the easy part! We’ll pick that up in the second half of this introduction to graphing data and relationships with Neo4j.

Steven Haines

Steven Haines is a senior technologist, accomplished architect, author, and educator. He currently is a principal software engineer at Veeva Systems, where he builds Spring-based Java applications. Steven previously worked on two startups: Chronosphere, where he helped customers design and implement large-scale observability strategies, and Turbonomic, where he was a principal software architect for cloud optimization products. He's also worked for Disney as a technical architect and lead solution architect, building out the next generation of Disney's guest experience and other Disney solutions. Steven specializes in performance and scalability, cloud-based architectures, high-availability, fault tolerance, business analytics, and integration with new and emerging technologies.

As an author, he has written two books on Java programming and more than 500 articles for publications such as InfoWorld, InformIT.com (Pearson Education), JavaWorld, and Dr. Dobb's Journal. He has also written over a dozen white papers and ebooks on performance management and cloud-based architectures.

Steven has taught computer science and Java programming at Learning Tree University and the University of California, Irvine. He also maintains a personal website dedicated to helping software developers and architects grow in their knowledge: www.geekcap.com (by Geeks for Geeks).
