Pythonicus Addendum

Posts in this series
  1. Pythonicus
  2. Pythonicus Addendum
  3. Pythonicus Finitum

In my previous Pythonicus article I briefly talked about my need to parse a list of NoSQL databases at http://nosql-database.org. My solution back then, as presented in that article, used regular expressions to find child and sibling nodes. It was a bit crude, but it got the job done.

I fight the “get’er done” mentality daily, so this time I’ll try to get the job done right by using BeautifulSoup. That’s where I left the previous article off, so let’s see how tasty the soup is!

It Ain’t jQuery But…

As I mentioned before, I needed something similar to jQuery for Python. BeautifulSoup’s syntax isn’t exactly the same, but it’s intuitive enough, and after working with jQuery for several years the concepts aren’t hard to grasp. The documentation is also clear and full of code examples.

At first, I was tempted to reach for jQuery syntax to find siblings of siblings, but this quickly turned into a mental exercise: “you are on different turf, buddy, get used to it!” Since I often loop over matched elements in jQuery, find_all is probably the method I use most. I also had to use the parent attribute, because my first selector was already too deep in the tree. This was partly due to how the website I was parsing is structured.
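To make the jQuery comparison concrete, here is a minimal sketch of find_all and parent on a small inline document. The markup, the example.com URLs and the variable names are made up for illustration and only loosely follow a section/h2/article layout.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real page.
SAMPLE_HTML = """
<section>
  <h2>Key Value Stores</h2>
  <article><h3><a href="http://example.com/a">Alpha</a></h3> A store.</article>
  <article><h3><a href="http://example.com/b">Beta</a></h3> Another store.</article>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching descendant, much like $("h2") in jQuery.
headings = soup.find_all("h2")

# .parent steps up one level, comparable to jQuery's .parent().
section = headings[0].parent

# Searching from the parent reaches the articles that sit beside the h2.
article_names = [article.h3.text for article in section.find_all("article")]
print(section.name, article_names)
```

Note that parent is an attribute rather than a method call, which is one of the small differences to get used to.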

Additionally, I used the extract method to get rid of the <h3> tag inside each article. I needed the content next to these <h3> titles, but that content isn’t wrapped in a tag of its own. Again, looking at the website’s source code will give you a very good idea of how to access the elements and use them in a loop. So, here is where the project stands:

import urllib.request
import json
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://nosql-database.org/") as response:
    html = BeautifulSoup(response.read(), "html.parser")

nosql_databases = {}

# each h2 marks a group of NoSQL databases
categories = html.find_all("h2")

for category in categories:

    category_id = category.text.replace(" ", "")
    nosql_databases[category_id] = {"name": category.text, "entries": []}

    # Starting from the h2 tags skips the extra section at the beginning of
    # the page, so go up one level with 'parent' and look for article tags.
    # An alternative would be to walk the siblings of the h2, but there may be
    # tags other than article among them, so this approach is safer.
    databases = category.parent.find_all("article")
    for database in databases:

        # some entries are only a mention of a product name, with no h3 title
        title = database.find("h3")
        if title is None:
            continue
        database_name = title.text

        # guard against entries that have no link at all
        link = database.find("a")
        database_url = link.get("href") if link else None
        # with the h3 removed, the article tag holds only the description
        title.extract()
        database_content = database.text.strip()

        entry = {'name':database_name, 'url':database_url, 'content':database_content}

        nosql_databases[category_id]["entries"].append(entry)


# the with block makes sure the file is flushed and closed
with open("dump.json", "w") as output_file:
    json.dump(nosql_databases, output_file, indent="\t", sort_keys=True)

The core of the code is finding the main categories, the h2 tags, which are grouped in section tags. Beside each h2 there are one or more article tags, each one a separate database entry. So, the task is to split the page by h2 and then parse each article next to that h2. As I pick out each h2, I register a unique id for its category, which becomes a key in the dictionary.
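The comment in the script mentions walking the siblings of each h2 as an alternative to going through parent. A rough sketch of that alternative follows, using find_next_siblings on made-up markup; the sample HTML and the grouped dictionary are assumptions for illustration. Passing a tag name filters out other sibling tags, so on markup like this both approaches should end up with the same articles.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an extra <p> between the h2 and its articles.
SAMPLE_HTML = """
<section>
  <h2>Document Stores</h2>
  <p>A short intro paragraph.</p>
  <article><h3>Alpha</h3></article>
  <article><h3>Beta</h3></article>
</section>
<section>
  <h2>Graph Databases</h2>
  <article><h3>Gamma</h3></article>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

grouped = {}
for heading in soup.find_all("h2"):
    # Only siblings that follow this h2 inside the same parent are returned,
    # and the "article" filter skips other tags such as the <p> above.
    siblings = heading.find_next_siblings("article")
    grouped[heading.text] = [article.h3.text for article in siblings]

print(grouped)
```

One difference worth keeping in mind: find_next_siblings only sees direct siblings, while parent.find_all also picks up articles nested deeper inside the section.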

The inner loop simply parses the database entries that are siblings of an h2. This job became trivial thanks to the jQuery-like syntax. I had to add a special case for the h3 tags, which hold the database names, since there were some experimental entries. Later, I found out that different people have been contributing to this list and some of them added placeholder database names. I guess they will follow up with more content. To handle entries that have no h3 tag, a simple if block skips to the next entry.
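The h3 check and the extract call can be pulled into a small helper. This is only a sketch: parse_article is a hypothetical function name, and the sample markup and example.com URL are invented to show one titled entry and one placeholder.

```python
from bs4 import BeautifulSoup


def parse_article(article):
    """Return a dict for one <article>, or None when it has no <h3> title."""
    title = article.find("h3")
    if title is None:
        return None  # placeholder entry, nothing to parse yet
    link = article.find("a")
    name = title.text.strip()
    url = link.get("href") if link else None
    # extract() detaches the <h3> from the tree, so the text left in the
    # article is only the free-form description.
    title.extract()
    return {"name": name, "url": url, "content": article.text.strip()}


# Hypothetical markup: one complete entry and one placeholder.
SAMPLE_HTML = """
<article><h3><a href="http://example.com/alpha">Alpha</a></h3> A fast store.</article>
<article>Placeholder without a title</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
parsed = [parse_article(article) for article in soup.find_all("article")]
entries = [entry for entry in parsed if entry is not None]
print(entries)
```

The name and URL are read before extract is called, since the link in this sample lives inside the h3 that gets removed.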

Once I have collected an entry and built its dictionary item, it’s time to append it to its proper category in the main dictionary, nosql_databases. Finally, the code dumps the contents of the nosql_databases dictionary into a JSON file with tab indentation and sorted keys.
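To see what the indentation and sorting options do without writing a file, json.dumps accepts the same indent and sort_keys arguments as json.dump. The dictionary below is made-up sample data in the same shape as nosql_databases.

```python
import json

# Hypothetical data in the same shape as the scraped dictionary.
nosql_databases = {
    "KeyValueStores": {"name": "Key Value Stores", "entries": []},
    "DocumentStores": {
        "name": "Document Stores",
        "entries": [
            {"name": "Alpha", "url": "http://example.com/alpha", "content": "A store."}
        ],
    },
}

# indent="\t" indents each nesting level with one tab,
# sort_keys=True orders the keys of every object alphabetically.
serialized = json.dumps(nosql_databases, indent="\t", sort_keys=True)
print(serialized)

# Loading the string back gives an equal dictionary.
restored = json.loads(serialized)
```

Sorting the keys keeps the dump stable between runs, which makes it easier to diff two scrapes of the same page.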

Finitum?

This project has one more step left: plotting the data from this JSON file in a graph database. After all, it would be appropriate, if not circular, to store a list of databases in a database! Ad infinitum? Perhaps.

See you on the next one.