Collaborative Filtering
"Expert opinons" without the experts

Collaborative filtering (CF) is a common Web technique for generating personalized recommendations. Examples of its use include Amazon, iTunes, Netflix, LastFM, StumbleUpon, and Delicious.

Thanks to collaborative filtering, a site like Amazon can say, "People who bought books A and B also bought book C." Amazon needs no understanding of people or books to generate this recommendation. All Amazon needs is a database of who has bought what, so that it can calculate who are "people like you." Then any time a "person like you" buys a book that you have not bought, Amazon can "personally" recommend that book to you.

Collaborative filtering needs no built-in subject knowledge to generate recommendations. Many other Web sites do rely on built-in expert subject knowledge to generate recommendations, and so do not use collaborative filtering. For example, Pandora hires music experts to listen to each song in its inventory. These experts code the "genome" of each song. Pandora then generates personalized recommendations by determining what musical "genes" each individual user likes and recommending songs with those "genes."

Even less related to collaborative filtering are pseudo-personalized recommendations: "Thanks, JOE SMITH, for buying Web hosting from us. Add Identity-Protector to your shopping cart for just $10 per month."

Delicious: example of CF

We will use Delicious to illustrate the basic principles of collaborative filtering.

Recommendations generated by Delicious are less personal than those of most other CF sites. But the database maintained by Delicious is more transparent, and that database is not just on the Web, it is about the Web (not about books, songs, or movies). So Delicious makes a good case study.

In the frame below, Delicious shows popular sites tagged "webdesign" on the left; and on the right it recommends other tags that are related to webdesign. Delicious hires no experts to inform these recommendations; they are generated by collaborative filtering.

Bookmarks: content of Delicious

We model the content of Delicious as a set of bookmarks. For example, below are five Delicious bookmarks pertaining to WordPress:


Delicious allows many users to bookmark the same page, and each user can specify her own set of tags for that page. Other bookmarking sites (such as Digg and Webwhompers) allow a page to be bookmarked only once; the first person to bookmark a page is the only person who can specify tags for that page; other people can vote for a bookmarked page but cannot specify tags for that page.

Whether hosted by Delicious or Digg, each bookmark includes the following elements:

  • User: the ID of the person who created the bookmark (e.g., MardiW)
  • Page: the URL of the Web page being bookmarked (e.g.,
  • Tags: the set of tags associated by that user with that page (e.g., {blog, blogging, wordpress})
Tuples: content of CF

A tuple is an ordered list. The term tuple is used when the length of the ordered list and the type of element in each position of the ordered list are predetermined. We say n-tuple to signify a tuple of length n.

For example:

  • A graph such as G=(V,E) is a 2-tuple. It is an ordered list of length two. The first element of the ordered list is always a set of nodes. The second element of the ordered list is always a set of edges.
  • A path in a graph is an ordered list but would not normally be considered a tuple. Paths have many different lengths.

We represent the user, page, and tags of a Delicious bookmark as a 3-tuple: (u,p,T)

  • u = the user who created the bookmark
  • p = the page being bookmarked
  • T = the set of tags associated by that user with that page

For example, the bookmarks illustrated above include these 3-tuples:

  • (valentinote,, {webdesign, blog, programming})
  • (Beth Castle,, {tools, wordpress, blog, software, free})
  • (Pisces6,, {blog, software})
Bipartite graphs: structure of CF

Users of Delicious not only bookmark the WordPress home page (as shown above) but also use "wordpress" as a tag for all sorts of Web pages. Below is a summary of bookmarks using the tag "wordpress":

wordpress bookmarks

The above summary features a right sidebar that recommends tags related to wordpress. Delicious needs no understanding of users, pages, or tags to generate these recommendations. All Delicious needs is a database of which tags have been used to describe which pages. Collaborative filtering does the rest.

A graph can be used to represent which tags have been used to describe which pages. For example, in the graph below

  • Each red rectangle represents a tag used on Delicious
  • Each blue circle represents a page bookmarked on Delicious
  • A tag and a page are joined by an edge if that tag has been used (by any bookmark) to describe that page

Because of the way it is constructed, the above graph has two notable properties

  • Each node is colored with one of two colors (red or blue)
  • Each edge joins two nodes of different colors. (In other words, no two nodes of the same color are adjacent.)

This is an example of a bipartite graph. Another way to describe a bipartite graph is by drawing the red nodes on one side of the picture and the blue nodes on the other side of the picture, just as above. Then each edge crosses from the left side to the right side. No two nodes on the same side are adjacent.

Most graphs are non-bipartite. A graph is non-bipartite when there is:
  • No way to color its vertices red and blue without creating red-red or blue-blue edges
  • No way to draw its vertices in two columns so that edges only cross from column to column and never join vertices in the same column
Note that the above two condidtions are both saying the same thing. For example, the graph below is non-bipartite:
Non-bipartite graph
Suppose we color the first node red; then we must color the second node blue. But then there is no way to color the third node red or blue without creating either a red-red edge or a blue-blue edge.
Structural equivalence: computation of CF

Finding clusters is one way to discover meaningful groups of nodes in a graph. Another way is based on the idea that nodes x and y are similar to the extent that they have similar network neighborhoods -- even if x and y are not adjacent to each other. This is the idea behind structural equivalence.

We define the structural equivalence of two nodes x and y as the similarity of their neighborhoods, as measured by the Jaccard index:

SE(x,y) = J(N(x),N(y)) = |N(x) ∩ N(y)| / |N(x) ∪ N(y)|.

This equals 1 when x and y have identical neighborhoods and 0 when the neighborhoods of x and y have no nodes in common.

Example: Consider the bipartite graph drawn below. The yellow nodes below left are tags on delicious and the orange nodes below right are some of the URLs associated with these tags:
Bipartite Tags
The structural equivalence of YouTube and Video is

|N(YouTube) ∩ N(Video)| / |N(YouTube) ∪ N(Video)|

= |{A, B, C, F} ∩ {A, B, C, D, E, F}| / |{A, B, C, F} ∪ {A, B, C, D, E, F}|

= |{A, B, C, F}| / |{A, B, C, D, E, F}|

= 4 / 6

= 2/3


The structural equivalence of MySpace and Video is

|N(MySpace) ∩ N(Video)| / |N(MySpace) ∪ N(Video)|

=|{H, I, K} ∩ {A, B, C, D, E, F}| / |{H, I, K} ∪ {A, B, C, D, E, F}|

=|{ }| / |{A, B,C, D, E, F, H, I, K}|

= 0 / 9

= 0

CF-Delicious: micro-synthetic summary

The content, structure, and computation discussed above give us a simple framework for an abbreviated impersonal version of collaborative filtering (adapted from Principia Cybernetica):

The two steps of impersonal collaborative filtering:

  1. A large set of affiliation data is collected. For Delicious, we represent this as a set of 3-tuples, from which we derive a bipartite graph indicating which tags are affiliated with which Web pages. 
  2. Given a tag of interest and the bipartite graph from step 1, structural equivalence is used to compute the similarity of other tags to that tag of interest. A subgroup of "Related Tags" is identified and presented in rank order.
The four steps of collaborative filtering

Our description of Delicious provides an abbreviated example of impersonal collaborative filtering. For a full-blown example of personalized collaborative filtering, imagine that you are a habitual Amazon customer. All you have to do is log in and you will see personalized recommendations informed by your history of purchases. The following four steps (again adapted from Principia Cybernetica) describe how this works.

The four steps of personalized collaborative filtering:

  1. A large set of affiliation data is collected. For Amazon and most other CF sites, the relevant bipartite graph indicates which users have bought (downloaded, used, recommended, etc.) which items.
  2. You log in. Given the bipartite graph from step 1, structural equivalence is used to compute the similarity of other users to you. A subgroup of "people like you" is identified.
  3. Using the subgroup derived in step 2, an average preference function is determined. This function describes what "people like you" typically buy (download, etc.) on the site.
  4. Using the preference function derived in step 3, the site displays the most preferred items that you have not yet bought.

Joomla Templates by Joomlashack