Writing a URL Shortener in Python 29-03-2011

I often find that I have miniature curiosity itches that I want to scratch, questions about how something works that I need to find the answer to. Recently, I realized that I had never given any thought to how URL shorteners work. I have been using TinyURL since at least 2006 and more recently have used bit.ly a fair deal and I suddenly realized this past week that I wanted to know how one goes about building a shortening service. For many of us, the best way to learn something is to get our hands dirty and just build.

And as it turns out, a basic URL shortener is not terribly difficult or intimidating. I had fun building a little imaginary shortener, though I have no intention of placing it online because there are just way too many of these already. So I figured I would document some steps in building a basic URL shortening engine for anyone who may have this same curiosity some day.

How it Works

This little proof of concept is quite basic and does not feature some of the frills of today's popular shorteners, such as personalized links or more randomized codes. Essentially, what we do is take in a URL, insert it into a database, get the auto-incrementing row id of this insert (an integer, of base 10) and perform a base conversion on this to base 56. We specifically convert to base 56 because of our URL code alphabet, which consists of all lower case letters, all upper case letters and the digits 0-9, except for o, O, 0, I, l and 1 due to the similarity in appearance for these characters. Once we convert the decimal id into a combination of these characters, we append the new code to the end of our imaginary URL, "http://shr.tn/" (let's pretend .tn is a Top Level Domain) and return this to the user. So something like "http://jessex.ch/blog/writing-a-url-shortener-in-python.html" can become something like "http://shr.tn/j8Kin3".

Some effects of the design of this system include:

shortened URLs will increase in length as more URLs are added into the system, with the first 56 URLs having only one character in the code, the next 56*56 URLs after that having only two characters in the code, etc.
identical URLs will have one shortened link, which is to say if someone attempts to shorten "http://example.com/" some three months after someone else already did this, they will receive the same link as the first person
9,223,372,036,854,775,808 distinct URLs can be shortened by our service without altering its structure, due to the fact that the largest possible auto-incrementing row id in SQLite (our database of choice) is 9,223,372,036,854,775,807

When a user enters a shr.tn link into their browser and reaches our server, we will do the inverse of our shortening action: convert the base 56 code to a base 10 integer and look up the row in our database with that integer as its id. This is the original, "long" URL which our service will issue a 301 re-direct to. Easy squeezy.

Implementation

Writing this basic system does not take a great deal of work [when we completely disregard the front end]. The source code for the backend, written entirely in Python and with plenty of commenting and white space, takes only ~175 lines of code. We separate the engine into two separate modules (shrtn.py and database.py) which handle the actual shortening/lengthening and database operations, respectively.

Implementing a database adapter for our service is made much easier by the sqlite3 module, which provides the basics for connecting to our database and executing our queries. The first thing we do to work with our database is to establish a connection:

This function will attempt to open the SQLite database file, stored at the path pointed to by "location", and will return a Connection object, which essentially acts as our database within the system. We will use this object for all database operations, including searches for row id's (when we lengthen shortened URLs) and searches for already entered URLs (when we shorten long URLs):

In these two functions, we enter very basic queries and call the execute() method of the Connection object. Here is one trick where the sqlite3 module comes in handy: in most database APIs, you use a Cursor object, taken from the database object, to execute queries (something which you can do in this module, too). However, the module allows us to call these shortcut methods for execution directly on the Connection object: it creates a cursor implicitly and returns it from the execute() call. This allows us to a) stay concise by avoiding superfluous code, b) stay as modular as possible while avoiding having to pass around a Cursor object into our functions as parameters and c) iterate over query results more efficiently.

With both functions, we construct a query, execute it directly on the Connection object and retrieve a list of results, each of which is a 2-tuple corresponding to the table columns (row id and original URL). If there are no results, we just return false. Otherwise, we take the first result tuple (because we should only have one result for these queries due to the design of the system) and unpack the value that we want. Note that in search_id() we pass the URL into the str() function because it is stored in SQLite as Unicode.

However, these database functions are useless for our purposes without being called by our actual shortening engines. When a user enters a URL to shorten, we pass it into the shorten_url() function in shrtn.py, along with our Connection object, which cleans the URL up a bit, if necessary, and checks for its existence in our database. If it is already entered, we use that row id for our conversion. Otherwise, we insert the URL into our database (with the insert_url() function in database.py) and then use that row id. Both functions are shown below:

Lengthening a url is accomplished when a user points their browser to one of our shortened URLs. When the request reaches the server, our lengthen_url() function is called with our Connection object:

With this function, we convert the code to a decimal row id and search for it in the database, returning the original, "long" URL found at that row. If that row was not found in the database, then the code was never actually generated and the link is incorrect: thus, we return a 404 error. We do the same in the event that the shortened URL is improperly formatted. But none of these functions address the actual conversions between decimal row id and shortened base 56 code.

Shortening and Lengthening URLs

The actual shortening of URLs is simply a conversion between a decimal integer and a base 56 string of text from our alphabet of characters.

We pass in the row id and, optionally, an alphabet of eligible characters. If omitted, we simply use our own default alphabet: "abcdefghijkmnpqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789". The process is essentially to continually divide our id down by our base of 56, mapping the symbols of the conversion to the alphabet. For instance, to convert a row id of 150, we notice that 56 * 2 = 112 with 38 to go before 150. This gives us symbols of [2] and [38] which point to the characters of 'c' and 'Q' in our alphabet. Thus, 150 converts to "cq" in our system. We generate these characters in reverse order, so we reverse the list and convert it to a string to return.

The inverse of this operation, resolving a code to a decimal id, involves reading the characters of the code one by one and mapping each to its own individual decimal value. These are summed up and the total id is returned. This is shown below:

Conclusions

And that's it. The process of creating and reading back our shortened URL codes requires only two functions, with only some ten lines of source code. The meat and potatoes of the process for this basic URL shortener was really that easy. Do a little conversion. Execute some queries on a one table database. Sip some lemonade and watch the URLs shorten before your eyes. Curiosity sated.

The rest of our little engine is rounded out with some functions for things like URL validation (I use that term extremely lightly) and some always exciting regular expressions. The full source code can be examined on Github which includes a test rig (adequately named testrig.py) that has a handful of unit tests. A read me file is included for anyone interested in setting the engine up locally.