My Little Ruby 18-10-2011

As a software engineer and as a programming enthusiast, I am always trying to learn, to add new tools to my Utility Belt. Since a great portion of programming is, you know, writing code, new programming languages are of particular educational interest. And as a person with a seemingly unquenchable innate curiosity, I like to stay tapped into the world around me. This includes a concentrated focus on items related to technology, as well as those recommended by other technological enthusiasts. My desire to learn new programming languages led me to Ruby; my desire to stay up-to-date with technology led me to Hacker News.

As someone who had just read why's Poignant Guide to Ruby, I was feeling pretty antsy to dive on in and write some code. Somewhat simultaneously, I was beginning to more thoroughly analyze my Hacker News usage patterns. During an average work day, I would typically check out the HN front page every hour or so, with my eyes naturally scanning for headlines that had certain telltale characteristics, such as some fuzzy, arbitrary threshold for points or comments or relative freshness (did I already see this article during my last scan through?). Though this did not have any measurable or qualitative effect on my productivity, I did have the nagging itch that this was a fairly static, redundant way to view a website: like clockwork, point the browser to the URL, scan my eyes down the list of articles, searching for roughly the same qualities that I always look for, open up new tabs to the article itself and the HN comments for anything that meets my vague filtering system to read later, and then go back to my work. Clearly, this could be automated! So, why not combine the two aforementioned interests and write a Ruby script to bring Hacker News submissions to me?

One of the things that attracted me to learn Ruby in particular (aside from the fact that I had begun to write some Ruby code at work) was that its design philosophy reminded me of that of my beloved Python: human-first. Part of why I enjoy writing Python code so much is that it feels like a natural extension of my thoughts and of my languages. From mind to machine via fingertips, ideas are transmuted into the digital, seamlessly. Unlike some other languages that I have written my fair share of code in, Python tries its best not to force me to jump through hoops. There are few arbitrary obstacles to writing good, clean code. For me, it is just plain fun. What I initially read about (and in) Ruby gave me optimism that I would experience the same aesthetic.

So I sat down to design this script. I wanted something that I could set up a basic cron job to run periodically, with a relatively flexible filter to pick out front-page submissions based on a number of variables. It would take in this customized filter from command line arguments, parse the front page of Hacker News to grab the first thirty headlines, apply the filter to pick out the articles that I am most likely to be interested in, and give them to me. This last phrase is a tad vague, because the method of delivery is mostly arbitrary. Because I was designing this for me, I decided to shoot a quick email to my inbox: I always have my Mail app up and running and the familiar red badge and ding sound gets my attention.

This design provided a pretty basic flow to hammer out, which aids greatly in the process of writing one's first non-trivial code in a new language. It allows for code which is straightforward and procedural: take in some arguments, follow some simple steps, return some articles. Good to go, does what I need. No monstrous design patterns necessary.

And this simplicity proved to be both a gift and a curse. It was a gift in that it allowed me to ease into Ruby and write some code with functionality and utility. But it was a curse in that it failed to present me the opportunity to really explore all that the language has to offer. But before we get into subjective evaluations, let's take a look at some actual code.

The Code

The script, as she is written:

One of the first things that I learned about Ruby was the awesome-ness of RubyGems. With a few "gem install X" statements in bash and a few "require X" statements in my script, I was able to abstract a ton of code out of my script. 'open-uri' makes the GET request to retrieve the Hacker News front page in JSON format. 'json' parses the JSON text into a data structure that is easy to iterate over. 'pony' allows me to fire off my prepared HN email with a single statement. The use of gems simplifies the process of putting together a functional Ruby script by letting you build on the contributions of your Rubyist peers.

(Note that the "-rubygems" in the shebang is present for backwards compatibility: RubyGems was not made part of Ruby's standard library until version 1.9.)

Another tool that I took advantage of to simplify the script is the Unofficial Hacker News API. Originally, I began writing my own code to parse the HTML of news.ycombinator.com for the article metadata using why's Hpricot. This quickly started to balloon and it felt like I was writing a Hacker News HTML parser instead of a Hacker News headline filterer/retriever. But the API allows me to make a quick GET call to http://api.ihackernews.com/page and receive some much more easily handled JSON. The returned JSON maps very handily to a Ruby hash, which I can then quickly convert to my Headline object.
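To make the JSON-to-hash mapping a little more concrete, here is a minimal sketch. The payload is a hand-written sample, and its field names (items, title, url, id, points, commentCount, postedAgo) are assumptions about the shape of the API's response rather than a documented contract. In the live script the string would come from the GET request instead.

```ruby
require 'json'

# Hand-written sample standing in for the API response. The field names
# are assumptions, not a documented contract.
SAMPLE_RESPONSE = <<-PAYLOAD
{
  "items": [
    {"title": "ruby lessons right in your browser",
     "url": "http://rubymonk.com",
     "id": 3127635,
     "points": 91,
     "commentCount": 18,
     "postedAgo": "3 hours ago"}
  ]
}
PAYLOAD

# In the real script this string would come from open-uri, along the lines of
#   require 'open-uri'
#   raw = open('http://api.ihackernews.com/page').read   # URI.open on newer Rubies
page = JSON.parse(SAMPLE_RESPONSE)

# JSON objects become Ruby hashes and JSON arrays become Ruby arrays.
first = page['items'].first
puts "#{first['title']} (#{first['points']} points)"
```

From there, each element of the items array is an ordinary hash that can be handed to whatever object wraps a headline.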

Perhaps above all else, though, the greatest simplification agent at play was Ruby itself. As you step through the code, you find artifacts of Ruby's usage of and adherence to the Principle of Least Astonishment. Everything just feels organic. Syntactically, there is a stark lack of cruft and boilerplate. Semantically, code means, far more often than not, exactly what it appears to mean. Much like Python, Ruby has a tendency to look and read like natural written language, insofar as any sufficiently high-level programming language can.

The class and function definitions are nothing to write home about: basic, simple. We have an initializer and a to_s method in the Headline class: no accessors or mutators required, instead opting to use Ruby's object orientation as a way to collect article metadata and output it in a readable, customized format. The to_minutes function allows us to use standard inline regular expressions to pick out what time unit we are converting from and get the minute equivalent. But after these simplicities, we move on to much more interesting fare.
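For readers who want something to hold on to, a class and helper along these lines would fit the description. This is a sketch under assumptions, not the script's actual code: the constructor arguments, the output layout (modelled on the sample email further down) and the to_minutes signature are all guesses.

```ruby
# A sketch of a Headline-style class: it only collects metadata and knows
# how to print itself, so no accessors or mutators are defined.
class Headline
  def initialize(title, url, posted_ago, points, comments, id)
    @title = title
    @url = url
    @posted_ago = posted_ago
    @points = points
    @comments = comments
    @id = id
  end

  # Render the headline in a two-line, email-friendly format.
  def to_s
    "#{@title} [#{@url}] - #{@posted_ago} | #{@points} points\n" \
      "#{@comments} comments [http://news.ycombinator.com/item?id=#{@id}]"
  end
end

# Convert an amount in a given time unit to minutes, using inline regex
# matching to recognise the unit (singular or plural).
def to_minutes(amount, unit)
  if unit =~ /^minute/
    amount
  elsif unit =~ /^hour/
    amount * 60
  elsif unit =~ /^day/
    amount * 60 * 24
  else
    raise ArgumentError, "unknown time unit: #{unit}"
  end
end

headline = Headline.new('ruby lessons right in your browser',
                        'http://rubymonk.com', '3 hours ago', 91, 18, 3127635)
puts headline
puts to_minutes(5, 'hour')
```

Because puts calls to_s on its argument, printing a Headline needs no extra plumbing.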

The Metaprogramming

The ARGV special variable contains all passed in command line arguments, in order. We use this array to iterate through each argument, matching flags with what should be the immediately following values. Using the each_with_index method, we iterate through each item in ARGV while maintaining access to its index in the array, allowing us to look forward toward the next values and check for proper formatting. Again, using inline regex matching, and the handy-dandy .nil? method, we can quickly verify and process our filter arguments, displaying some help text if an exception is caught (rescued).

To give an idea of what the command line arguments look like when entered, here is a sample that specifies a filter of at least 75 points, more than 10 comments and no older than 5 hours:

./hn_email.rb -p '75,>=' -c '10,>' -t '5,hour,<='

What may seem unusual, however, is the manner that we store the command line argument values. We have a hash, named filters, where each key is the name of the argument, such as points or comments, and the value is a length-three array with a String-formatting placeholder "%d", a comparator (less than, greater than, equal, etc.) and a value. So after parsing the previous sample command line arguments, we are left with a hash that looks something like this:

{"points" => ["%d", ">=", 75], "comments" => ["%d", ">", 10], "time" => ["%d", "<=", 300]}
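The parsing described above could be sketched roughly as follows. The flag names and value formats come from the sample invocation; everything else (the constant names, the parse_filters helper, the usage text, the supported time units) is hypothetical and only meant to show each_with_index, the look-ahead to the next argument, and .nil? checks working together.

```ruby
# Map each flag to the filter name it controls (flags taken from the
# sample invocation; the constant names are hypothetical).
FLAGS = { '-p' => 'points', '-c' => 'comments', '-t' => 'time' }
COMPARATOR = /\A(<|<=|>|>=|==)\z/

# Minutes per unit: a simplified stand-in for a to_minutes helper.
UNIT_MINUTES = { 'minute' => 1, 'hour' => 60, 'day' => 1440 }

def parse_filters(args)
  filters = {}
  args.each_with_index do |arg, i|
    name = FLAGS[arg]
    next if name.nil?                 # not a flag, so it is some flag's value
    value = args[i + 1]               # look ahead to the flag's value
    raise ArgumentError, "missing value for #{arg}" if value.nil?

    parts = value.split(',')
    comparator = parts.last
    if (comparator =~ COMPARATOR).nil?
      raise ArgumentError, "bad comparator in #{value}"
    end

    amount = Integer(parts.first)     # raises ArgumentError on non-numbers
    if name == 'time'
      factor = UNIT_MINUTES[parts[1]]
      raise ArgumentError, "bad time unit in #{value}" if factor.nil?
      amount *= factor
    end
    filters[name] = ['%d', comparator, amount]
  end
  filters
end

begin
  p parse_filters(ARGV)
rescue ArgumentError => e
  warn "#{e.message}\nusage: hn_email.rb [-p 'N,OP'] [-c 'N,OP'] [-t 'N,UNIT,OP']"
end
```

Checking the comparator against a fixed pattern matters more than it might seem, since the comparator string later ends up inside an eval.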

Hopefully, you are beginning to notice the pattern here. But let's continue onward to some more Ruby-enabled fun!

After a scant few lines, we have snagged our JSON-formatted Hacker News data and organized it into a neat little hash. Then we can begin to apply our filters. We iterate through each article from the front page and retrieve the points, comments and time values from the JSON hash. With these values on hand, we make a pass through our filters, substituting these variables (dynamically chosen by calling eval on the matching filter keys) into the index-0 "%d" placeholders in the filter values. Now we can take the three values in the filter arrays and pass them into placeholders in the eval("%d %s %d") statement.

So say a given article had 90 points, 45 comments and was posted four hours (240 minutes) ago and we used the aforementioned filter arguments. We can iterate through these statements and, by starting with a true boolean value (named valid here), determine if a given article matches all of the filters. Its first filter evaluation, for points, ends up as eval("90 >= 75"), which evaluates to true, and this is AND'ed with valid. valid remains true and we progress to its second evaluation, comments, which is eval("45 > 10"), also true. We AND it with valid, which remains true, and progress to the third and final evaluation, eval("240 <= 300"). Again we see that this statement is true and, AND'ed with valid, maintains valid as true. After passing through each filter, we see that our article is indeed ready for prime time, and so we gather the rest of the article metadata, create a Headline object around it and append it to the array of filtered headlines. Easy!
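The walk-through above can be reproduced with a few lines. This is a simplified sketch, not the script itself: it looks the article's values up in a hash rather than eval-ing the filter keys into local variable names, and the passes? helper is a hypothetical name.

```ruby
# Filters in the shape shown earlier: [placeholder, comparator, threshold].
FILTERS = {
  'points'   => ['%d', '>=', 75],
  'comments' => ['%d', '>', 10],
  'time'     => ['%d', '<=', 300]
}

# Returns true only if the article satisfies every filter.
def passes?(article, filters)
  valid = true
  filters.each do |name, (placeholder, comparator, threshold)|
    actual = placeholder % article[name]          # "%d" becomes e.g. "90"
    statement = '%s %s %d' % [actual, comparator, threshold]
    valid &&= eval(statement)                     # e.g. eval("90 >= 75")
  end
  valid
end

article = { 'points' => 90, 'comments' => 45, 'time' => 240 }
puts passes?(article, FILTERS)                    # prints true
```

Since eval runs whatever string it is given, a sketch like this is only reasonable when the comparators have been validated beforehand.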

With just the slightest bit of metaprogramming, we have an elegant way to dynamically construct and evaluate statements for filtering our articles! On the fly! With minimal code! Awesome.
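For comparison, the same check can be written without eval at all. This is a swapped-in alternative rather than what the script does: comparison operators in Ruby are ordinary methods, so a whitelisted comparator can be sent to the value with public_send. The ALLOWED list and the passes_without_eval? name are hypothetical.

```ruby
# Only these comparison methods may be sent to an article's values.
ALLOWED = ['<', '<=', '>', '>=', '==']

def passes_without_eval?(article, filters)
  filters.all? do |name, (_placeholder, comparator, threshold)|
    unless ALLOWED.include?(comparator)
      raise ArgumentError, "bad comparator: #{comparator}"
    end
    # Equivalent to writing, say, 90 >= 75
    article[name].public_send(comparator, threshold)
  end
end

filters = {
  'points'   => ['%d', '>=', 75],
  'comments' => ['%d', '>', 10],
  'time'     => ['%d', '<=', 300]
}
puts passes_without_eval?({ 'points' => 90, 'comments' => 45, 'time' => 240 }, filters)
```

The trade is a little less string-building magic in exchange for never executing text that arrived from the command line.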

This filtering system allows the user to specify any combination of the three built-in filters (including the option of forgoing any filtering and just receiving the entire front page), determine the nature of the comparison for each (be it less than (or equal), greater than (or equal) or equivalent) and set the threshold values for each (allowing for specification of precision on the time filter). However, it could certainly be improved upon. But more on that later; let us wrap up what we have going on here.

The Rest

After applying filters to each of the thirty front page Hacker News submissions, they are ready for delivery. As previously mentioned, the Pony gem allows us to rapidly fire off an email to whomever we wish: we just need the credentials to do so. Sure, these could be hardcoded in the script, but that is self-evidently lame (or preposterous, or absurd, or whatever other derogatory adjective happens to be floating around my head at the time). So we are afforded the opportunity to check out Ruby's file I/O chops.

With a quick call to File.open, we open up a "credentials.txt" file and iterate through, line by line. If the file is properly formatted, the username, password, address and port for the email server are retrieved. Using some more regex matching and some hacky String splitting, we are good to go. A correctly formatted version of this file is included in the repository with the script's source, but here is a sample for reference's sake:

email:my.email@gmail.com

password:myPassword

address:smtp.gmail.com

port:587
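Reading a file in that format might look something like the sketch below. The read_credentials helper is a hypothetical name, and the demonstration writes the sample to a temporary file so that nothing real is touched.

```ruby
require 'tempfile'

# Parse "key:value" lines into a hash. Splitting on the first colon only
# keeps any colons inside the value (in a password, say) intact.
def read_credentials(path)
  credentials = {}
  File.open(path) do |file|
    file.each_line do |line|
      # Skip blank lines and anything that is not a known key.
      next unless line =~ /\A(email|password|address|port):/
      key, value = line.chomp.split(':', 2)
      credentials[key] = value
    end
  end
  credentials
end

# Demonstrate with a throwaway file so that no real credentials are needed.
Tempfile.open('credentials') do |tmp|
  tmp.write("email:my.email@gmail.com\npassword:myPassword\n" \
            "address:smtp.gmail.com\nport:587\n")
  tmp.flush
  p read_credentials(tmp.path)
end
```

Passing a block to File.open closes the file automatically once the block finishes, even if a line fails to parse.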

You may notice the strange looking parameter that is passed into File.open: File.join(File.dirname(__FILE__), 'credentials.txt'). What this bit of code does is find the directory that contains the script file itself, and join it up with "credentials.txt" to form the path to a file sitting right beside the script. This effectively allows us to execute the script without being in the script's directory; simply passing "credentials.txt" into File.open would cause the method to attempt to open up a file by that name in the current working directory. So if you have this script executing from a cron job and the credentials file is not in cron's working directory, the file will not be found and an exception will be your demise. Avoid this like the plague!
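A tiny sketch shows the difference between the two lookups. The credentials_path_for helper is a hypothetical name used only to make the idea testable.

```ruby
# Build the path to credentials.txt in the same directory as a script.
def credentials_path_for(script_file)
  File.join(File.dirname(script_file), 'credentials.txt')
end

# __FILE__ is the path of the file containing this line, so the result
# points beside the script no matter where it was launched from.
puts File.expand_path(credentials_path_for(__FILE__))

# By contrast, a bare file name is resolved against the current working
# directory, which under cron is typically not the script's directory.
puts File.expand_path('credentials.txt')
```

Run from the script's own directory, both lines print the same path; run from anywhere else, they differ. On Ruby 2.0 and later, __dir__ is a shorthand for the directory of the current file.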

With that information, we hand our message off to the Pony Express, tell it where to go and how to get there, and wait for the email to appear in our inbox. And when it does?
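Handing the message over could look roughly like this. The option keys follow Pony's documented SMTP options, but the subject line, the choice to mail the report to oneself, and the mail_options helper are illustrative assumptions. The actual send is left commented out, since it needs the pony gem and real credentials.

```ruby
# Build the options hash for Pony.mail from the parsed credentials.
def mail_options(credentials, body)
  {
    :to      => credentials['email'],
    :from    => credentials['email'],
    :subject => 'Hacker News headlines',     # hypothetical subject line
    :body    => body,
    :via     => :smtp,
    :via_options => {
      :address              => credentials['address'],
      :port                 => credentials['port'],
      :user_name            => credentials['email'],
      :password             => credentials['password'],
      :authentication       => :plain,
      :enable_starttls_auto => true
    }
  }
end

credentials = { 'email' => 'my.email@gmail.com', 'password' => 'myPassword',
                'address' => 'smtp.gmail.com', 'port' => '587' }
options = mail_options(credentials, "headline one\n\nheadline two")

# Sending requires the pony gem and working credentials:
#   require 'pony'
#   Pony.mail(options)
puts options[:via_options][:address]
```

Keeping the hash-building separate from the send makes it easy to inspect exactly what would go out before any mail is fired off.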

24 year old student lights match: Europe versus Facebook [http://www.identityblog.com/?p=1201] - 5 hours ago | 283 points
117 comments [http://news.ycombinator.com/item?id=3127185]

How 1 HN post compelled me to leave Intuit and create a new startup for startups [http://www.copyhackers.com/2011/10/18/how-1-hn-post-compelled-me-to-leave-intuit-create-new-startup-for-startups/] - 4 hours ago | 117 points
63 comments [http://news.ycombinator.com/item?id=3127550]

ruby lessons right in your browser [http://rubymonk.com] - 3 hours ago | 91 points
18 comments [http://news.ycombinator.com/item?id=3127635]

Python For The Web [http://gun.io/blog/python-for-the-web/] - 5 hours ago | 110 points
61 comments [http://news.ycombinator.com/item?id=3127215]

Mailgun API 2.0: forget MIME [http://blog.mailgun.net/post/11622797058/mailgun-api-2-0-forget-mime] - 5 hours ago | 85 points
39 comments [http://news.ycombinator.com/item?id=3127059]

Ta-da! Fresh, curated, Hacker News goodness right to your doorstep. Using a few basic Ruby gems, the flexibility afforded by the standard library, some standard pattern matching with regular expressions and only a modicum of Ruby's powerful metaprogramming capabilities, we were able to automate the process of scanning the HN front page for intriguing headlines. Now instead of wandering on over to news.ycombinator.com and manually crawling the screen with your eyes, a neat and tidy report of the content that meets your requirements will be sent to your inbox regularly. And with cron, the possibilities are manifold.

The cron Job

Say you wanted to retrieve articles that met the filtering seen in the examples above, every hour on the hour. We could edit our user crontab (try crontab -e; entries in the system-wide /etc/crontab take an extra user field before the command) to include this line:

0 * * * * ./hn_email.rb -p '75,>=' -c '10,>' -t '5,hour,<='

This line roughly translates to every hour, every day of the month, every month, every day of the week, as long as it is the 0th minute of the hour. Now say we wanted the entire front page, unfiltered, but only every eight hours, at the thirtieth minute of those hours, starting at 6:30am:

30 6,14,22 * * * ./hn_email.rb

How about only articles that have reached the 300 point barrier, regardless of comment count or age, once a day at midnight?

0 0 * * * ./hn_email.rb -p '300,>='

And of course, you can forgo cron and just execute the script in the midst of a frantic daily workflow. You're writing code, you're cruising along, you open up your terminal, you run a Maven clean install (skip tests!) on an upstream artifact, you switch tabs, you SSH into the dev server, you tail a log file, you switch back, Maven's done, BUILD SUCCESSFUL, you fire up the debugger in your IDE, you set some breakpoints, still cruising, you start stepping through the code, variables are flying around, things are downright chaotic, you switch back to the terminal to check on that log, you switch right back to the debugger, you step in, you step out, an exception gets thrown, your mind races, neural pathways ablaze, you spot the problem, you stop the debugger, you refactor that suspicious code, you switch back to the terminal, you open up another tab, you pound in ./hn_email.rb -p '500,>' -c '100,>' because you demand only the best, you watch your Mail app ding with a new message, you switch to it, pop it open, race through what's inside, and you dive right back into your important work.

You can't be stopped. You eat Hacker News submissions for brinner. All because of this script.

The Deficiencies

But, hey, no one's perfect. Not even this wonderful utility that has sped up your daily workflow by at least a minute or two. So what could be improved upon? The filtering, of course!

One key feature missing from the current filtering capability is the ability to search by words. Having the ability to pass in a command line argument such as "-w 'ruby'" and only retrieve articles that feature the word "ruby" in the Hacker News headline would immediately bolster the value of this script. In fact, it is my next aim with this itty, bitty project.
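A word filter along those lines need not be much code. In this sketch the mentions? helper is a hypothetical name, and matching whole words case-insensitively is just one possible design choice.

```ruby
# True if the title contains the word, ignoring case and matching whole
# words only (so "ruby" does not match "rubymonk").
def mentions?(title, word)
  !(title =~ /\b#{Regexp.escape(word)}\b/i).nil?
end

titles = ['ruby lessons right in your browser', 'Python For The Web']
puts titles.select { |title| mentions?(title, 'ruby') }
```

Regexp.escape keeps any punctuation in the search word from being read as regex syntax.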

Another feature would be the ability to specify more complex boolean structures in the filter. Say you only wanted those submissions with at least 250 points or with no more than 25 points. Say you wanted those with at least 300 points, as long as they are no more than 12 hours old, otherwise you only wanted those with at least 500 points. Combining this functionality with the previously mentioned word-based filtering: retrieve only those articles that mention the word "Apple", as long as they do not feature the word "TV", have at least 100 comments, and are at least 3 hours old or come from Daring Fireball. Allowing arbitrarily complex filtering would be quite the task and would definitely push the limits of Ruby's meta-power. But it would be absolutely fantastic to have the capability to get so in-depth.
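One way such compound filters might be modelled, leaving aside the much harder question of a command line syntax for them, is as small lambdas glued together by combinators. Every name in this sketch (at_least, at_most, all_of, any_of) is hypothetical.

```ruby
# Each filter is a lambda from an article hash to true or false.
def at_least(key, n)
  lambda { |article| article[key] >= n }
end

def at_most(key, n)
  lambda { |article| article[key] <= n }
end

# Combinators that build bigger filters out of smaller ones.
def all_of(*filters)
  lambda { |article| filters.all? { |f| f.call(article) } }
end

def any_of(*filters)
  lambda { |article| filters.any? { |f| f.call(article) } }
end

# "At least 300 points if no more than 12 hours old, otherwise at least 500."
FRESH_OR_HUGE = any_of(
  all_of(at_least('points', 300), at_most('time', 12 * 60)),
  at_least('points', 500)
)

puts FRESH_OR_HUGE.call({ 'points' => 350, 'time' => 600 })   # prints true
puts FRESH_OR_HUGE.call({ 'points' => 350, 'time' => 900 })   # prints false
```

Because the combinators return filters themselves, they nest to any depth, which is the property an "arbitrarily complex" filter would need.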

Filtering dreams aside, the ability to specify delivery method would be quite helpful. At present, the filtered articles are emailed to a specified inbox and nothing else is permitted. However, being able to specify an output file or to simply print to standard output would be nice.
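Since standard output, files and in-memory buffers all respond to puts, one hedged way to get there is to pass the destination in as an argument. The deliver helper and the headlines.txt file name are hypothetical.

```ruby
require 'stringio'

# Write the report to any IO-like object: $stdout, an open File, a StringIO.
def deliver(headlines, out = $stdout)
  out.puts headlines.join("\n\n")
end

headlines = ['first headline', 'second headline']
deliver(headlines)                       # standard output by default

buffer = StringIO.new                    # an in-memory stand-in for a file
deliver(headlines, buffer)

# For a real file instead:
#   File.open('headlines.txt', 'w') { |file| deliver(headlines, file) }
```

Email would then be just one more destination, chosen by a command line flag.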

And of course there is the fact that this script relies on an API entirely out of my control. As nice as the Unofficial Hacker News API is, it does return HTTP errors with non-trivial frequency, and there are at least a couple of bugs that I have run into with my own rudimentary manual testing (I'm in the process of hammering out exactly how these arise in order to report them). But, the vast majority of the time, it does work perfectly fine. Balancing simplicity against performance is a fine line that we all must walk time and again. Here, with my very first Ruby script, I chose simplicity.
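One inexpensive way to soften those intermittent HTTP errors, sketched here as an assumption rather than something the script does, is to retry the request a few times before giving up. The with_retries helper is a hypothetical name.

```ruby
require 'open-uri'

# Run the block, retrying on HTTP errors up to `attempts` times in total
# and pausing `pause` seconds between tries.
def with_retries(attempts, pause = 0)
  tries = 0
  begin
    tries += 1
    yield
  rescue OpenURI::HTTPError
    raise if tries >= attempts     # out of attempts: let the error through
    sleep pause
    retry                          # re-run the begin block from the top
  end
end

# In the script, the block would wrap the GET request, for example:
#   raw = with_retries(3, 5) { URI.open('http://api.ihackernews.com/page').read }
```

With hourly cron runs, the next scheduled run acts as a further retry, so a small attempt count is probably enough.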

From a more general perspective, it is entirely probable that there are better ways to implement what I have written here. There are almost certainly places where I stray from idiomatic Ruby conventions, or where there is something basic and simple that virgin Ruby eyes cannot decode. I look forward to rooting these out later on in my Ruby life.

The major issue, however, is conceptual: part of the fun of Hacker News, as with any web site, is the very act of browsing. Taking the time to manually peer through headlines, letting impulse drive your clicks or taps, forgoing objective determination of what is "interesting" or otherwise worthy of one's time, can be immensely fun. Automating the process takes that enjoyment right out of the process, turning the act of reading HN into pure consumption (it could of course be argued that this is what surfing the web already is, but I would be hesitant to make that argument). However, the script does serve a purpose for the user who is looking for a way to retrieve content in a regular, automatic way, for whatever reason there may be. So there's that.

The Impressions

Ruby is perrrrrrrty. Conceptually, Ruby represents a language created with the hope of avoiding the unnecessary. When writing Ruby code, I felt the overwhelming sensation that my code was straight to the point, and not for some natural Ruby brilliance that resides in my head. Ruby's engineer-first focus enables programmers to write what they think, not what they think a machine would expect. Not only does this make programming a more cerebral and less mundane, banal task, it enhances the pure fun of it all.

This latter point is further helped by Ruby's aesthetic perrrrrrrty-ness. It feels generally clean and readable, and is not overpopulated with superfluous constructs and structures, placing emphasis on concision and elegance. This makes Ruby code easier to maintain and debug, and the decrease in required development time leaves the programmer to write more wonderful code, something we all love.

These qualities -- evident meaning and succinct syntax -- make Ruby highly enjoyable for me to wander around in. Though I love forests in real life, I do not always love them in my code: sometimes I have a vision for where I want to go and I want to execute upon that vision without getting lost in a language's unique set of rules and constraints. Ruby, seemingly, will allow me to do that.

Hopefully, I will have many more gems in my coding future.