Latest posts.

RSSCloud vs. PubSubHubbub

Progress, get out of the way

Progress (cc photo by lawmoment)

Dave Winer and Brad Fitzpatrick are going at it now, politely arguing about their respective visions for the future of the real-time web. Winer is resurrecting a portion of the RSS spec from 2001 called RSSCloud, and Fitzpatrick (among others) is working on a new system called PubSubHubbub (PuSH). Both systems are meant to give users instant notification when an RSS or Atom feed is updated, rather than requiring client software to regularly poll the feed. After working with both systems, I think PuSH is superior to RSSCloud because of two killer features: third-party subscription and heavy notification.

PuSH allows a user to subscribe to a feed by having notifications sent to any arbitrary server, while RSSCloud requires that notifications be sent back to the computer that sent the subscription request. PuSH wins here, because third-party subscriptions are fundamental to the user experience of the internet today. Imagine if email didn’t allow third-party subscriptions: to give my email address to example.com, I would have to get my mail server to send my info to them. Similarly, I shouldn’t have to subscribe to feeds through my feed reader. I should be able to tell the feed hub that Uncle Sergey is handling my notifications today, or have it send updates from feeds relevant to my business to Spiral16*.
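To make the third-party part concrete, here is a rough sketch of the form body a subscriber would POST to a hub. The parameter names are taken from the PubSubHubbub 0.3 draft as I understand it; the helper name and both URLs are placeholders I made up for illustration. The point is that hub.callback does not have to be the machine making the request.

```python
from urllib.parse import urlencode


def build_subscription_request(topic_url, callback_url, verify="sync"):
    """Return the form-encoded body of a PuSH subscription request.

    Hypothetical helper: parameter names follow the PubSubHubbub 0.3 draft,
    and the URLs used below are placeholders, not real endpoints.
    """
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed being subscribed to
        "hub.callback": callback_url,  # where the hub should deliver updates
        "hub.verify": verify,          # how the hub confirms the subscription
    })


# The callback is a third-party server, not the machine sending this request.
body = build_subscription_request(
    topic_url="http://blog.example.com/feed.atom",
    callback_url="http://reader.example.net/push/callback",
)
print(body)
```

Sending that body to the hub is an ordinary HTTP POST, so any script, browser form, or service can set up a subscription on behalf of any callback.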

The other big advantage of PuSH is that hubs include feeds’ new content in the update notifications they send to subscribers, while RSSCloud simply notifies subscribers that a feed has updated, leaving them to get content on their own. While RSSCloud’s lightweight notifications are better for the hub, they do nothing to save on the feed’s bandwidth, and may actually be more harmful.

It’s interesting that RSSCloud’s lightweight pings seem counterproductive to Winer’s subtly stated goal of tearing Twitter away from Twitter. The best way to handle streams of short messages like that seems to be fat notifications. Otherwise the network overhead for every update at least doubles, as all subscribers are notified and then all subscribers request the new content. It would actually be much worse than double if the server sends the entire feed with every request.
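Some back-of-the-envelope arithmetic shows the difference. This is only a sketch: the subscriber count, entry size, and feed length are numbers I picked for illustration, not measurements, and the model ignores headers and ping size entirely.

```python
def fat_notification_cost(subscribers, entry_bytes):
    """Hub pushes the new entry itself: one request per subscriber."""
    requests = subscribers
    payload = subscribers * entry_bytes
    return requests, payload


def light_notification_cost(subscribers, fetched_bytes):
    """Hub pings, then every subscriber fetches the content from the feed."""
    requests = 2 * subscribers             # one ping plus one fetch each
    payload = subscribers * fetched_bytes  # pings carry no content
    return requests, payload


# Assumed numbers, purely illustrative.
subscribers = 1000
entry_bytes = 200               # one short, tweet-sized entry
feed_bytes = 20 * entry_bytes   # a feed document holding 20 such entries

fat = fat_notification_cost(subscribers, entry_bytes)
light = light_notification_cost(subscribers, feed_bytes)

print("fat:  ", fat)
print("light:", light)
```

Under these assumptions the light approach doubles the number of requests and moves twenty times the content, because every subscriber re-downloads the whole feed to get one new entry.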

I think RSSCloud would have been a great system in its time, but it solved a problem that didn’t exist. Now that Twitter has shown the world what instant, asynchronous, multicast messaging looks like, PuSH is the spec that will open it up.

* See what I did there?

Scalability #1: Kestrel

A clever visual pun.

Kestrel: small, fast, deadly (image: Property#1)

As our business grows, scalability has become an important issue. Not only are we pushing more and more data through our pipeline, we’re adding systems as our analysis becomes more sophisticated. We’ll be covering some of the tools and methods we use to reduce the burden of scalability on the dev team, starting with Kestrel, a tiny system for managing distributed, persistent queues.

Our semantic analysis and sentiment extraction systems can process web pages concurrently with no communication between instances, so we run them as a cluster with several servers each running several instances. We first tried to let Spark push data into the cluster through a load balanced web server, but we had a number of problems. Variation in demand meant the cluster had to either expand and contract, or sit mostly idle. The possibility of network errors, server crashes, and bugs meant that Spark had to track items it sent to the cluster and guess if they were lost or delayed. This was not the way to put data in a cluster.

We were able to simplify the interface by using queues in a Kestrel server. Spark pushes data into the queue, and the cluster pulls data from the queue when it has a free instance. We can expand or contract the cluster at any time, we don’t have to worry about lost data, and there is minimal information shared between the cluster and Spark.
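The push/pull pattern looks roughly like this. It is a minimal sketch, not our production code: WorkQueue and InMemoryClient are names invented for the example, and the in-memory client only imitates the set-to-enqueue, get-to-dequeue behavior Kestrel exposes over the memcache protocol so the snippet runs without a server.

```python
from collections import defaultdict, deque


class InMemoryClient:
    """Stand-in for a memcache client so the sketch runs without a server.

    Imitates Kestrel's queue semantics: set() appends an item to the named
    queue, get() removes and returns the oldest item, or None when empty.
    """

    def __init__(self):
        self._queues = defaultdict(deque)

    def set(self, key, value):
        self._queues[key].append(value)
        return True

    def get(self, key):
        queue = self._queues[key]
        return queue.popleft() if queue else None


class WorkQueue:
    """Hypothetical thin wrapper: producers put(), consumers take()."""

    def __init__(self, client, name):
        self.client = client
        self.name = name

    def put(self, item):
        self.client.set(self.name, item)

    def take(self):
        # Returns None when there is no work waiting.
        return self.client.get(self.name)


pages = WorkQueue(InMemoryClient(), "pages_to_analyze")

# Producer side: push work and move on.
for url in ["http://example.com/a", "http://example.com/b"]:
    pages.put(url)

# Consumer side: each free instance pulls the next item when it is ready.
processed = []
while True:
    url = pages.take()
    if url is None:
        break
    processed.append(url)

print(processed)
```

Against a real Kestrel server, the stand-in would be replaced by a memcache client such as python-memcached's memcache.Client pointed at the Kestrel host and port; neither the producer nor the consumer code needs to know the other exists.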

Kestrel has a number of advantages over simple queues in Python or C. First, it is persistent, so we can skip a lot of logic to detect and recover from data loss due to server crashes. Second, it is a server system that can easily be on a different box than either the consumer or producer process. In our systems we use Kestrel as an interface layer between the producers and the consumers: each side only needs to know about the single Kestrel server, and each can keep expanding indefinitely. Once we need more throughput than one instance of Kestrel can provide, we can add more servers to that layer while maintaining the well-defined boundaries.

Another feature is especially important for dev teams like ours which mix many different programming languages and operating systems with reckless abandon. Kestrel runs in a JVM and speaks the memcache protocol, which means it can be deployed anywhere and used in any language. Additionally, building Kestrel with ant generates a handy tarball complete with an init.d script to daemonize Kestrel on compatible Unices. However, if you’re using the same queue with multiple languages be aware of how your memcache library serializes values. For example, the python-memcached library will use cPickle by default, but can use a user-defined serializer.
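One way to sidestep the pickle problem is to serialize items yourself into a format every language can read, and hand the client a plain string. The sketch below uses JSON. The function names and the sample item are made up for the example, and it assumes your memcache library passes plain strings through untouched, which is worth verifying for the library you use.

```python
import json


def encode_item(item):
    """Serialize to a JSON string that any language can decode."""
    return json.dumps(item, sort_keys=True)


def decode_item(raw):
    """Accept str or bytes, since client libraries differ in what they return."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)


# Hypothetical queue item; the field names are placeholders.
item = {"url": "http://example.com/post/1", "source": "crawler"}
wire = encode_item(item)

# Round-trips whether the consumer receives text or raw bytes.
assert decode_item(wire) == item
assert decode_item(wire.encode("utf-8")) == item
print(wire)
```

A consumer written in Java, Ruby, or anything else then only needs a JSON parser, not an understanding of Python's pickle format.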

Kestrel can make a great interface to a cluster, isolating the gritty details of scalability from other systems. Adding it to our cluster has made data submission essentially fire and forget.

Bonus: A Python class for using Kestrel through python-memcached.

the switch to git

GitHub Mascot

View the GitHub talk via Yahoo!

Recently we’ve begun to use git as our preferred method for source control. For ages we’ve been fighting Visual SourceHell and Subversion with the different workflows we use in our internal development. The most common problem we’ve run into surrounds releasing new versions of any project we work on. Developers typically need to make changes after the release of a major revision, since it’s inevitable that bugs will be introduced into any project. Unfortunately this poses a problem with SVN or VSS, since their method of tagging or branching is rather cumbersome and never allows true control over a project. We can’t have our tools hinder our flow anymore. Git, here we come!

The trouble with switching to any new tool on a development team is the adoption and migration of developers and existing processes to the shiny new method. Sometimes there is even fear! Never fear though: the folks at GitHub (probably my favorite open source site ever) gave a rather lengthy talk at the Yahoo! developer meet-up. I think they have done the best job of explaining the benefits of git over any existing system, and they remained fairly objective while doing so. Sure, they poked fun at old systems being the equivalent of a shack, but the truth hurts sometimes. We will use this video as a training resource internally for any new developers we bring on who are unfamiliar with git.

Oh, and the best part about git: it’s written by Linus Torvalds, so you know it’s fast and as close to logical as a system can get.

tr.im is back!

Stop Over Logging before its too late!

As you may know, the URL shortening service tr.im announced on Sunday evening that it would be shutting down. Users were overwhelmingly disappointed by the decision. Many felt that the shutdown was premature, and that tr.im was being whiny about bit.ly being picked as the default on Twitter.

But never fear: they received so much feedback that the service will live on indefinitely (read their post on the decision). Apparently they decided that throwing in the towel this early was a bad move for their already failing business. This brings up the question: how can a company make money from a short URL service?

One method would be to capture data to determine behavioral models for a given social network. This requires a large installed user base, which could be used to statistically derive a model tracking attention span and interest for a social network. Getting to this point is a daunting task for any of the 50 services which have popped up in the last year with the growth of Twitter and Facebook. It’s also important to note that the statistical sample would need to span the whole community and not just a particular network within the whole, otherwise the data gleaned may not be as relevant. I remain unconvinced that this data provides enough value to justify so many companies running their own systems. This is one case where having a giant company reselling this data at a cheap price would be best.

The other argument I have against URL shortening services is the disservice they do to the Internet. If there is one thing this whole tr.im debacle has taught me, it is that when one of these services shuts down it will break a large portion of the Web we all rely on. I would like to think that search engines handle these redirects properly and will not index these URLs as the primary URL for the content, but I doubt that is the case across the board. So how do we deal with the mess created when a service shuts down and all its links go with it? Does a service such as 301works.com, which indexes and stores all of these links in a system similar to the Internet Archive, make sense? Not really in the long run, since this too could be shut down due to cost overruns.

Luckily we had thought of these things when designing much of the Spark backend and have already put in place measures to recover from issues like this.  We archive every url reference for a page we encounter and will store that data indefinitely.

I will keep monitoring this space and see what develops since it is important to some of the work I do here.  Hopefully the Internet doesn’t break as badly as I imagine it might.

How we use SphinxSearch

Sphinx Search

Internally, our backend systems make heavy use of Sphinx Search, which provides us with full-text indexing on all of the content we discover. Our biggest use of this software package is extracting posts from the RSS and Atom feeds we’ve indexed while crawling. I wanted to highlight some of the features that are coming soon and how we might put them to use.

Version 0.9.10 is still under development, but two big features are string attribute support and the common subtree cache. The string attributes feature roughly translates into the ability to index metadata surrounding each post, such as Author and Categories, without having to build a massive set of cross-reference tables for our DB. While normalization of data is important, we have found that there are many upcoming extensions to the RSS/Atom formats that we’re excited to begin indexing. One of those is the Atom Threading Extensions, which have started to gain steam across the web. We could index this data easily as a string attribute.

Another feature I am excited about is the Common Subtree Cache.  The backend search process will be able to identify common parts of multiple queries and cache a subset of the data shared amongst each query.  This will help us in some of the automated query optimization research we’ve got underway.  Our current research direction is the extraction of surrounding keywords which appear within a word proximity threshold to the user’s base query.  Being able to optimize our search backend using this new feature would be a great benefit to our data acquisition teams.
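To give a feel for what "keywords within a word proximity threshold" means, here is a toy sketch. It is not our research code: the function name, the tokenizer, the window size, and the sample text are all assumptions chosen to keep the example small.

```python
import re
from collections import Counter


def nearby_keywords(text, query, window=3, stopwords=frozenset()):
    """Count words that appear within `window` tokens of any query term."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    query_terms = set(query.lower().split())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token not in query_terms:
            continue
        # Look at the tokens on either side of this query-term occurrence.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for neighbor in tokens[start:end]:
            if neighbor not in query_terms and neighbor not in stopwords:
                counts[neighbor] += 1
    return counts


# Made-up sample text, for illustration only.
sample = "The new phone battery drains fast. Battery life on this phone is poor."
counts = nearby_keywords(
    sample,
    query="battery",
    window=2,
    stopwords=frozenset({"the", "on", "this", "is"}),
)
print(counts.most_common())
```

The words that show up most often near the base query are candidates for expanding or refining it. This naive version lets the window cross sentence boundaries, which a real implementation would probably want to avoid.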

I anticipate the next version of Sphinx will be a big boost for the entire community.

Welcome

Mad Scientist

Welcome to the new Spiral16 Labs blog.  We’re going to be using this to share interesting technologies we run into while developing our main product Spark.  Some of the posts we write may have juicy details on the inner workings of our systems, and some might be general technology related topics.

You might have noticed that we have a large gap in the content on this blog. There is a good reason, I promise! Long ago (okay, only 7 months) we began writing content for ComplexAwareness.com, a data visualization website we were experimenting with. Since that project has been shelved, we thought it would be a good idea to pull the posts into our new Labs blog. Hope that clears up any confusion.

Beautiful Graph from the New York Times

Beautiful New York Times Box Office Visualization

The perfect data visualization should present data in a manner that is quickly understandable, be visually striking, and allow the person looking at it to pick up patterns they would not otherwise see. 

This graph from the New York Times accomplishes all of the above. Not only does it provide powerful insight into the lifespan of movies in the box office in an easy to read form, it is so beautiful that I would gladly frame it on my wall if I could strip out the strangely ugly black overlay of text. This should serve as a powerful example to anyone working with data visualization.

Blog : FlowingData

FlowingData - Strength in Numbers

Anyone interested in how to better present or understand data should check out the blog at FlowingData. From their about page:

“FlowingData explores how designers, statisticians, and computer scientists are using data to understand ourselves better – mainly through data visualization. Money spent, reps at the gym, time you waste, and personal information you enter online are all forms of data. How can we understand these data flows? Data visualization lets non-experts make sense of it all.”

Sprint’s “Plug Into Now” Dashboard

Sprint's Imaginative "Plug Into Now" Dashboard

To advertise their new network broadband cards, Sprint created this interesting dashboard filled with widgets that show live data from a variety of sources.  While some of the widgets are faked, and there is no real congruency or purpose to the different sources they picked other than entertainment and aesthetics, it is still interesting to see an advertising agency catch on to the idea of data mashups and live data feeds.

Currently, widgets (or gadgets, depending on who you ask) are useful to house and present very discrete units of data. As demonstrated here, most of the cross-widget analytics that can be done with this system comes from pasting different gadgets next to each other on a page.

Soon, however, I envision a change where data APIs and widgets such as these begin to “talk” to each other, enabling a wider range of possibilities for comparing, overlaying and combining a variety of disparate data sources. In Sprint’s example, you could tie the days-left-till-Christmas counter to the money-being-spent-online counter to measure the correlation between proximity to Christmas and online spending. Or match the feeds from Fox, CNN, Boing Boing and others with the hot Google searches of the day to find which news stories people are researching most online.

What the Sprint dashboard really shows off then is not the future of where we are going on the web, but rather that the direction we are going is being adopted and understood by the general population.  Very cool.

Kevin Kelly on the Future of the Web and Semantic Data Sharing

The Web is changing quickly, and as it does, data and the tools to interpret it become ever more important. In the above speech presented at the recent Web 2.0 Conference, Kevin Kelly of Wired discusses the history of the web and explains his vision of its future evolution. The web, he explains, has evolved from linking computers, to linking pages of content, and is now moving towards linking data. Soon, the web will grow into a World Wide Database of information, allowing users to restructure that data in new and interesting ways.

For that process to happen, the existing information online, stored in pages, will have to be de-structured and broken down into its elemental units to enable machines to “understand” and process that data more efficiently. This concept is not new; it is just a restating of the goals of the semantic web, a change that has been steadily on the horizon for the last several years. However, what interests me about these changes is not how or why or when they are happening, but rather what we can do with this new abundance of information. How can we reshape, interpret, and use this information to expand what is currently possible?

It is unlikely that we will know the full extent of the changes that the web is bringing until they happen, but by paying close attention, we will have an opportunity to shape it into something incredible. Kevin Kelly closes his speech by urging people to “get better at believing the impossible,” and looking back on how incredible the growth of the web has been so far, it is obvious that what we currently believe is possible is only a shadow of where we will be in the near future.