Posts in this series
- Pythonicus
- Pythonicus Addendum
- Pythonicus Finitum
This is the last part of the three-part “Pythonicus” series. In this part, I show how I used Neo4j, a graph database, to plot the different categories and the databases that belong to each category.
Getting Graphical
I have no particular reason to use Neo4j other than that it was the first graph database I came across. Since then, I’ve also heard of OrientDB. I’ve also seen some material on Safari Books Online about Neo4j, particularly one title about using Python with Neo4j. When it comes to practicing new stuff, it’s crucial to have access to plenty of good tutorials. I regularly use Safari Books Online thanks to the library card issued by the Toronto Public Library.
I’ve been itching to practice what little I have learned about Neo4j, so this project is a good candidate. I’m not going to say much about how graph theory works; it’s enough to say that categories and databases will be nodes, and there will be lines connecting each category to its databases. The result will be clusters, or islands, each made up of a category and its databases. This may not seem very useful, since graph databases are mainly used to look at the relationships between different entities, but this is me getting my feet wet with the product.
Imagine Facebook: you have a list of friends, and each friend has their own list of friends. In this ever-growing list of lists, some of the people, let’s reduce them to “nodes” or round shapes, will be friends of yours too. Between certain friend nodes there is a line connecting the two friends; this line defines a relationship. Another way of visualizing this is a world map showing each country’s trade relationships with other countries. One country may export the same product to two or more countries, and those countries may also have a separate trade relationship with each other. So two nodes can be connected by one or many relationships, of the same or of different types.
Depending on your example or your reason for using a graph database, your definition of a relationship will change. It may very well be the distance between nodes, or how often a product appears in different lists. In my case, the only relationship I have between a database and a category is “belongs to”, so let’s see how it is done.
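Before the Neo4j version, here is a minimal sketch of the same idea in plain Python, with no graph database involved. It treats each “belongs to” line as a pair and then groups the connected nodes into islands. The sample pairs and the function names build_adjacency and find_islands are my own illustration, not part of the scraper.

```python
from collections import defaultdict

# Illustrative sample data only, not the scraped output.
BELONGS_TO = [
    ("Neo4j", "Graph Databases"),
    ("OrientDB", "Graph Databases"),
    ("MongoDB", "Document Stores"),
    ("CouchDB", "Document Stores"),
]


def build_adjacency(edges):
    """Map every node to the set of nodes it is directly connected to."""
    adjacency = defaultdict(set)
    for database, category in edges:
        adjacency[database].add(category)
        adjacency[category].add(database)
    return adjacency


def find_islands(adjacency):
    """Group nodes into islands (connected components) with a depth-first walk."""
    seen = set()
    islands = []
    for start in list(adjacency):
        if start in seen:
            continue
        island = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in island:
                continue
            island.add(node)
            stack.extend(adjacency[node] - island)
        seen |= island
        islands.append(island)
    return islands


islands = find_islands(build_adjacency(BELONGS_TO))
for island in islands:
    print(sorted(island))
```

With only “belongs to” relationships, every island is one category plus its databases, which is the picture I expect to see in Neo4j.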
import urllib.request
import json
from bs4 import BeautifulSoup
from py2neo import *

with urllib.request.urlopen("http://nosql-database.org/") as response:
    html = BeautifulSoup(response.read(), "html.parser")

nosql_databases = {}

#Start the Graph database by authentication and cleaning up old data
authenticate("localhost:7474", "neo4j", "old4j")
graph = Graph()
graph.cypher.execute("MATCH (n) DETACH DELETE n")

#each h2 is a group of NoSQL database
categories = html.find_all("h2")
for category in categories:
    category_id = str(category.text).replace(" ","")
    nosql_databases[category_id] = {"name":category.text, "entries":[]}

    #plot the node for category
    category_node = Node("Category", name=category.text)

    '''
    By first finding h2 tags in html we actually got rid of the extra section at the beginning
    Therefore we have to go up one level with 'parent' and seek article tags
    Another alternative would be finding the sibling of h2 or category
    but there might be more tags other than article so this method is safer
    '''
    databases = category.parent.find_all("article")
    for database in databases:
        #some entries don't have much details than just a mention to a product name
        if database.find("h3"):
            database_name = database.find("h3").text
        else:
            continue

        database_url = database.find("a").get('href')

        #by getting rid of h3 completely the whole article tag is now holding the content for the database
        database.h3.extract()
        database_content = database.text.strip()

        entry = {'name':database_name, 'url':database_url, 'content':database_content}
        nosql_databases[category_id]["entries"].append(entry)

        #plot the entry-category relationship
        entry_node = Node("Database", name=database_name)
        entry_category_relationship = Relationship(entry_node, "BELONGS TO", category_node)

        #for each entry we must register the relationship in the graph
        graph.create(entry_category_relationship)

#write the collected data out and close the file when done
with open("dump.json","w") as dump_file:
    json.dump(nosql_databases, dump_file, indent="\t", sort_keys=True)
I highlighted the changes since my last post. I’m using the py2neo library to interact with Neo4j. The authenticate, Graph() and graph.cypher.execute calls near the top authenticate me with the local Neo4j instance and delete any data left over from an earlier run. From there we follow the same logic we used for creating categories and databases in the JSON output. For each category that gets a new Python dictionary (a JSON object in the output), we must also create a new Node, which is what the category_node assignment does.
Then, in the inner for loop, where we deal with the database items and attach them to the current category, we create an individual Node for each database. This is also the point where we create a relationship so that the Database node and the Category node can be connected. The last three statements of the inner loop do this, in order: create a database node, create a “BELONGS TO” relationship from that node to the current (outer loop) category node, and finally register the relationship in the graph with graph.create.
Finally, if you open the Neo4j instance in your browser, you can visualize all this by typing a command similar to this: MATCH (n) RETURN n LIMIT 55. I took a screenshot of the result and moved the islands around a bit so they would fit closely together.
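The LIMIT in that query caps how many nodes come back, so if the graph holds more nodes than the limit, some islands will show up incomplete. Here is a rough sketch for working out a suitable limit from the same dictionary that goes into dump.json. The function name count_graph_nodes and the sample data are made up for illustration, and the count rests on my assumption that a category node only reaches the graph when at least one relationship that uses it is created.

```python
def count_graph_nodes(nosql_databases):
    """Estimate how many nodes the script writes to the graph."""
    category_count = 0
    database_count = 0
    for category in nosql_databases.values():
        entries = category["entries"]
        # Assumption: a category without entries is never part of a
        # created relationship, so it never becomes a node in the graph.
        if entries:
            category_count += 1
        database_count += len(entries)
    return category_count + database_count


# Made-up sample in the same shape as the dump.json structure.
sample = {
    "FirstCategory": {
        "name": "First Category",
        "entries": [
            {"name": "AlphaDB", "url": "http://example.com/alpha", "content": ""},
            {"name": "BetaDB", "url": "http://example.com/beta", "content": ""},
        ],
    },
    "EmptyCategory": {"name": "Empty Category", "entries": []},
}

node_total = count_graph_nodes(sample)
query = "MATCH (n) RETURN n LIMIT {}".format(node_total)
print(query)
```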
Where To Go From Here
I may actually try my hand at each one of the databases listed at http://nosql-database.org/. This might be an ambitious task, but what better way to learn what the deal is with every one of these? I know that, at some point and for one reason or another, it may not be feasible or possible to install and use a product because of its dependencies, but I’m willing to at least look at the documentation and feature page.
Additionally, I wanted to keep the scope of this introductory series limited to creating basic nodes for categories and databases. My ultimate goal with this exercise was to create relationships between different databases. I’ve noticed that my current code structure is too flat and does not take advantage of separation of concerns. Ideally, I should have a class method, or at least a function, that outputs the data first. Then I can move on to preparing the graph in a different function. Perhaps the last step would be to create relationships between databases, again in a separate function. This might actually be necessary, because it’s practically impossible to look ahead and know about a database node that does not exist yet and will only be created later because it’s listed in a category that comes later in the HTML code.
All in all, this exercise has helped me investigate different options. I tried to submit my code regularly to GitHub. I’ll most likely write another article once I modify my code and make it more modular. In that version, I’m sure I’ll have figured out how to plot the relationships between databases.
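To make that idea concrete, here is one possible shape for the split, as a sketch rather than the final design. It assumes the scraping step has already produced the nosql_databases dictionary, and it works on plain tuples instead of py2neo objects so that it runs without a Neo4j instance. The “MENTIONS” relationship, the function names and the sample entries are all hypothetical, and the substring match is deliberately naive.

```python
def database_names(nosql_databases):
    """First pass: collect every database name before creating any relationship."""
    return {
        entry["name"]
        for category in nosql_databases.values()
        for entry in category["entries"]
    }


def belongs_to_edges(nosql_databases):
    """Second pass: one (database, "BELONGS TO", category) triple per entry."""
    edges = []
    for category in nosql_databases.values():
        for entry in category["entries"]:
            edges.append((entry["name"], "BELONGS TO", category["name"]))
    return edges


def mentions_edges(nosql_databases):
    """Third pass: link a database to any other database named in its content."""
    names = sorted(database_names(nosql_databases))
    edges = []
    for category in nosql_databases.values():
        for entry in category["entries"]:
            content = entry["content"].lower()
            for other in names:
                # Naive substring match; a hypothetical rule for illustration.
                if other != entry["name"] and other.lower() in content:
                    edges.append((entry["name"], "MENTIONS", other))
    return edges


# Made-up entries: AlphaDB refers to BetaDB, which is listed in a later category.
sample = {
    "FirstCategory": {
        "name": "First Category",
        "entries": [
            {"name": "AlphaDB", "url": "http://example.com/alpha",
             "content": "Often compared with BetaDB."},
        ],
    },
    "SecondCategory": {
        "name": "Second Category",
        "entries": [
            {"name": "BetaDB", "url": "http://example.com/beta",
             "content": "A made-up entry."},
        ],
    },
}

print(belongs_to_edges(sample))
print(mentions_edges(sample))
```

Because all the names are collected before any database-to-database relationship is built, it no longer matters that BetaDB appears later in the HTML than AlphaDB. Each triple could then be turned into py2neo Node and Relationship objects in its own function.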