Posts in this series
- Pythonicus
- Pythonicus Addendum
- Pythonicus Finitum
I started to learn Python in May and finished two introductory courses about it on Coursera. Then, after a summer full of indecision and frustration about where to take my skills, I came to the conclusion that becoming a data scientist would be the right move. I guess that was around the last week of October. Since then, I've finished a third Python course, and the fourth one is starting pretty soon.
I can already feel that I've opened the doors to a whole other world of computing and programming. My feelings about Data Science deserve an article of their own, so I'll focus only on Python in this one: more specifically, how I approached a specific problem and how I've been able to produce the same results with better tools.
No Woman No SQL
I have long been getting emails from http://dbweekly.com but, to be honest, I never really had time to look at their stuff closely. I guess signing up for their newsletter was the result of a typical compulsive behavior: "I must hit that sign-up button for no good reason, because I want to clutter my email client with stuff I won't read, but I'll admit to myself that I haven't read it and, worse, I'll promise myself that I'll allocate some time for it in the future".
As I tweeted on October 24th,
For now, I’m done with ShitML, ShitSS, ShitScript. Bye Web. Hi Python, Data Visualization, Data Analytics and Big Data.
— Kumsal Obuz (@kubarium) October 24, 2015
I've embarked on a journey to the dark side, so it became more important to read DB Weekly; after all, the path to Data Science would surely take me to many database pit stops for refueling, regrouping, or readjusting the course to salvation. So I followed the first link in their 78th issue, A List of Over 200 NoSQL Databases. What I found was a list of databases grouped by categories like Document Stores, Key-Value Stores, Graph Databases, etc.
Since I'm the curious type, always keeping an eye on different resources, and I had already come across a graph database called Neo4j, I decided to take this list from the website and represent it in a graph database format. Of course, the graph part will be the final destination; the first checkpoint is to parse this data with Python. The code I'm going to be sharing in this and the next article might change, but you can follow the changes in the GitHub project.
RegEx to the Rescue
My first iteration of reading and parsing the data was the following:
import urllib.request
import re

with urllib.request.urlopen("http://nosql-database.org/") as response:
    html = response.read()

# clean up the html by removing \n, \t and double spaces
html = html.decode().replace('\n', '').replace('\t', '').replace('  ', '')

# each h2 is a group of NoSQL databases
categories = re.findall("<h2>(.*?)</h2>", html)

for category in categories:
    print(category)
    # each category is followed by a number of articles, and the articles
    # are closed off by a section tag before the next category
    category_content = re.findall(
        "<h2>" + category + "</h2>(<article>.*?</article>)</section>", html)
    articles = re.findall("(<article>.*?</article>)", str(category_content).strip())
    for article in articles:
        print(article)
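Printing raw article HTML isn't very useful on its own, so as a small, hypothetical extension (assuming each article starts with a link whose text is the database name, which may not hold for every entry on the page), the name and homepage could be pulled out like this:

# hypothetical extension to the loop above: pull the name and homepage
# from each article, assuming it begins with an <a href="...">Name</a>
for article in articles:
    match = re.search('<a href="(.*?)"[^>]*>(.*?)</a>', article)
    if match:
        link, name = match.groups()
        print(name, "->", link)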
I believe the code has enough comments, but basically it's reading the HTML from the site and doing a few basic regex calls to create lists of categories and articles. During this process I noticed that there was an error in the source HTML, so I decided to contact the website's owner. Luckily, his website's source code was hosted on GitHub, so I created an issue and he responded with a fix pretty quickly.
Now that I was finally able to print out the list, the next phase was perhaps to put this data in a JSON file. As I mentioned previously, the third course I was taking was Python Network Data on Coursera. My task of parsing this database list aligned so well with the course material, because the topics were about accessing web data, XML and JSON, reading lines, etc. Since I've been doing web development for the last 15 years, most of the course was trivial and the Python parts were really straightforward. However, one thing was really helpful: BeautifulSoup.
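As a rough idea of what that JSON step might look like, here is a minimal sketch built on the same regex approach as above; the nested structure and the nosql.json file name are my own placeholders, not a final format:

import json
import re
import urllib.request

with urllib.request.urlopen("http://nosql-database.org/") as response:
    html = response.read().decode().replace('\n', '').replace('\t', '').replace('  ', '')

# collect each category's raw article HTML into a dict, then write it out
data = {}
for category in re.findall("<h2>(.*?)</h2>", html):
    # re.escape guards against regex metacharacters in the category name
    content = re.findall(
        "<h2>" + re.escape(category) + "</h2>(<article>.*?</article>)</section>", html)
    data[category] = re.findall("(<article>.*?</article>)", str(content).strip())

with open("nosql.json", "w") as f:
    json.dump(data, f, indent=2)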
Summary
After finishing the basic attempt I shared in the previous code example, I started to wonder if there was an easier way to read HTML tags. It almost felt like Python desperately needed what I was used to doing with jQuery to select tags in a front-end project. I could have easily looked for a library that would do this job, but since I'm taking 5 more courses and other stuff got in the way, I didn't investigate it. Maybe it was just good enough to parse what I needed with what I knew; it was a good excuse to practice regular expressions anyway. So, now that I know what BeautifulSoup can do, I'll try the same exercise again.
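To give a taste of where that retry might start, here is a minimal BeautifulSoup sketch of the same idea. I'm assuming the page keeps its h2/article structure; the traversal details are mine, not a finished solution:

import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen("http://nosql-database.org/") as response:
    soup = BeautifulSoup(response.read(), "html.parser")

# each <h2> names a category; walk the articles that follow it
for h2 in soup.find_all("h2"):
    print(h2.get_text(strip=True))
    for article in h2.find_all_next("article"):
        # stop once we cross into the next category
        if article.find_previous("h2") is not h2:
            break
        print(article.get_text(" ", strip=True))

No hand-rolled regular expressions over raw markup; the tag selection reads much closer to what jQuery spoiled me with on the front end.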