Why a Kaggle competition

I have always been interested in data. When I was a kid I was an avid football fan. I once went to a game. Normally I would sit in the cheaper seats, very far from the field, but this time I was in a very good spot and could see everything perfectly. At the end of the game I was completely mad at the referee: I had seen some blatant injustices committed against my team. The same evening, no longer so high on passion, I happened to see the same game on television. I watched the very incidents that had outraged me and, to my surprise, the referee had been right.

I think it was then that I began to mistrust gut feeling and started to love data.

As a business consultant and BI expert I always try to back up my recommendations with data, so now that data science is the word of the moment I am very happy. In the last few years I have investigated data science: trying R, reading, and learning (more) statistics and machine learning. My employer, BBF, is very interested in this; we specialize in data and database technology, so they helped me pursue this dream. Together with a couple of colleagues, fellow participants in the Data Science Specialization at Johns Hopkins University, we developed our skills in R and data science.

I was honored to be one of the selected few in the first Capstone Project (a dry run), where we helped tune the course for the next participants. It was very challenging, but that is a topic for another post.

An obvious way to test old and new skills is to participate in a competition. Kaggle is a site that hosts all sorts of data science competitions and attracts the best data scientists around the world. I thought that if I could finish in the upper half of a Kaggle competition I could consider myself a real data scientist.

Learning social circles in networks

The competition was chosen, among those available, because it was interesting, deep, and did not involve too much data. Many competitions use tons of data, and I did not want to embark on what, for the most part, would have been an exercise in optimization. Kaggle describes the competition as follows:

Learning social circles in networks (image from the McAuley and Leskovec paper)

Social Circles help users organize their personal social networks. These are implemented as “circles” on Google+, and as “lists” on Facebook and Twitter. Each circle consists of a subset of a particular user’s friends. Such circles may be disjoint, overlap, or be hierarchically nested.

The goal of this competition is to automatically infer users’ social circles. You are provided a set of users, each of whose circles must be inferred. To do this, participants have access to each user’s ego network and a set of anonymized profile features.

The details are (somewhat tersely) summarized on this page on Kaggle. We were provided with a small number of ego networks (around a hundred) and had to work out which social circles are present in each of them. The networks were real Facebook networks that a number of volunteers manually labeled with their real social circles. Anonymized features of each network’s center were also provided (attributes pertaining to the volunteer: school, work, city, etc.).

The most crucial characteristic of this problem was that the circles could overlap: a friend in the “University” circle could also be a member of the “Football fans” circle. This is a more challenging problem than simply finding a partition of the nodes (where each node belongs to exactly one circle). Network theory draws the line here between a partition (non-overlapping circles) and a cover (overlapping circles).

A Gephi representation of a network

The original research from Stanford University (see the paper by Julian McAuley and Jure Leskovec) devised a complex machine learning algorithm to infer the circles from the network and the features. I urge you to read the paper, as it is really interesting. The gist of it is that they model the probability of forming a circle given the network and the features. Once the model is in place, the best circles are found using gradient descent (which in this case requires some hairy math, as you can see in the paper).

The solution is completely unsupervised and performs well. Features are taken into account through a similarity function between two nodes that is proportional to the dot product of the nodes’ feature vectors. The proportionality parameter is one of the parameters learned by the machine learning algorithm.

Choosing the tools.

I started by trying to learn as much as possible about network theory, and that required some time. The science of networks is surprisingly vast. I found the best starting point to be this paper by Santo Fortunato, which is fairly comprehensive (and also more than 100 pages long!).

I did not want to replicate the Stanford solution: they found a direct machine learning solution and coded a new algorithm. In my view, data scientists are divided into two categories. There are the “core” data scientists, stars who work for academia or very big corporations and who devise new algorithms and break new ground. The second category is made up of the “applied” data scientists (or, more modestly, data analysts), who apply this stellar knowledge to the real world, usually reusing the algorithms devised by the geniuses. I unfortunately cannot aspire to be in the first category, but I am convinced that I can be a member of the second one.

Had McAuley and Leskovec written an R package I would have tried it, but they chose instead to write a specialized piece of software, SNAP, which is very interesting but too specialized for my taste. In the Kaggle forum I saw that some people tried to install and use it; I don’t know how that went in the end.

My objective was to do the best I could with plain vanilla R code, with help from the libraries already available on CRAN (the repository of most R libraries), and this is what I did. In the following paragraphs I will explain my approach, without going into too much code detail.

My solution.

The first step of the work is to load the ego networks and the features. The networks are quite easy: just read each one with readLines(), interpret it with strsplit() and load it into a data frame. Features are more complex. It turns out (I did not understand this at the beginning) that every category of features (there are 56 of them) is multivalued. For example, the category “work;employer;id” could have N values for the same node (person) in a network, because the person has worked for many employers over the years.
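To give an idea, a minimal version of such a loader could look like the sketch below. It assumes each ego-network file contains lines of the form “node: friend1 friend2 …”; the exact file layout and the error handling are simplified here.

    # a possible loadEgo(): read one ego-network file and return a
    # two-column edge list (from, to) as a data frame
    loadEgo <- function(egodir, myNode) {
        lines <- readLines(file.path(egodir, paste0(myNode, ".egonet")))
        edges <- lapply(lines, function(line) {
            parts <- strsplit(line, ":", fixed = TRUE)[[1]]
            if (length(parts) < 2) return(NULL)        # node with no listed friends
            node    <- trimws(parts[1])
            friends <- unlist(strsplit(trimws(parts[2]), "[[:space:]]+"))
            friends <- friends[friends != ""]
            if (length(friends) == 0) return(NULL)
            data.frame(from = node, to = friends, stringsAsFactors = FALSE)
        })
        do.call(rbind, edges)
    }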

My first plan was to mimic the Stanford researchers and use a dot product between features as the measure of feature similarity between two nodes. But the feature relation is many-to-many, so this is impossible. Instead I used a feature list composed of non-normalized (in the database theory sense) comma-separated feature values. For example, two nodes could have the feature “work;employer;id” = “1, 2, 41” (first node) and “work;employer;id” = “2, 41” (second node). I wrote a routine naiveSimilarity(node1, node2) that assigns one point for every feature value the two nodes have in common (in this example the result would be 2 for the “work;employer;id” feature). This is the closest I could get to the dot product used by the researchers at Stanford.
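In sketch form, assuming the features are kept in a per-node named list of comma-separated strings (called featureList here; the real bookkeeping code is a bit longer), the routine could look like this:

    # hypothetical structure: featureList[[node]][[category]] is a comma-separated
    # string of values, e.g. featureList[["7"]][["work;employer;id"]] == "1, 2, 41"
    naiveSimilarity <- function(node1, node2) {
        categories <- intersect(names(featureList[[node1]]), names(featureList[[node2]]))
        score <- 0
        for (category in categories) {
            v1 <- trimws(unlist(strsplit(featureList[[node1]][[category]], ",")))
            v2 <- trimws(unlist(strsplit(featureList[[node2]][[category]], ",")))
            score <- score + length(intersect(v1, v2))   # +1 per value in common
        }
        score
    }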

At this point I had all the pieces and could start experimenting. I tried several packages for partitioning networks and in the end settled on the very good igraph package. I tried different methods, thinking that a clique percolation method would be best in this situation (because it can generate overlapping circles), but in the end the infomap algorithm yielded the best results.

Features were used as edge weights to help the infomap algorithm find the best circles. In the end the kernel of the algorithm is surprisingly simple (housekeeping code omitted for brevity):

    # igraph supplies both the graph structure and the infomap implementation
    library(igraph)

    # load the ego network for this node as a two-column edge list
    ego <- loadEgo(egodir, myNode)

    # build an undirected graph from the edge list
    g <- graph.edgelist(as.matrix(ego), directed = FALSE)

    # assign edge weights based on node similarity, calculated on the features
    ll   <- get.edgelist(g)
    weig <- numeric(nrow(ll))
    for (j in 1:nrow(ll)) { weig[j] <- naiveSimilarity(ll[j, 1], ll[j, 2]) }

    # launch the infomap algorithm
    infomap <- infomap.community(g, e.weights = weig, v.weights = NULL,
                                 nb.trials = 10, modularity = TRUE)

    # turn the infomap result into a data frame for building the submission
    df <- data.frame(node = infomap$names, group = infomap$membership,
                     stringsAsFactors = FALSE)

I was quite surprised that clique percolation, or another method based on cliques (see for example this one), did not give better results. I think a partial reason (other than my inexperience in tuning the parameters) is that the metric Kaggle uses to evaluate results heavily penalizes solutions with too many circles, and clique percolation (in the R implementation, at least) tends to produce more circles. I also tried to reduce this effect by purging the small circles, but I could not get a good result.
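For the record, the purging is just a filter on the df data frame from the snippet above; a minimal version, with an arbitrary minSize threshold, could look like this:

    # drop circles below a minimum size before building the submission
    minSize <- 3
    keep    <- names(which(table(df$group) >= minSize))
    df      <- df[df$group %in% keep, ]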

There are many more ideas to test (the field has a fascinating complexity), but I was limited by my decision not to implement new algorithms (and a new implementation would have been very time consuming anyway).

I tried some variations using the features: should I enhance or soften a double feature concordance within the same category (say two nodes have two cities or two workplaces in common)? In the end I think it does not matter much, but this feeling could be very wrong, as I did not test it rigorously. I settled on the naiveSimilarity(node1, node2) routine that simply gives one point (+1) for each concordance.
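To make the two variations concrete, here is a rough sketch (reusing the hypothetical featureList structure from the earlier snippet); neither variant made it into the final run:

    # per-category count of shared feature values between two nodes
    perCategoryMatches <- function(node1, node2) {
        categories <- intersect(names(featureList[[node1]]), names(featureList[[node2]]))
        vapply(categories, function(category) {
            v1 <- trimws(unlist(strsplit(featureList[[node1]][[category]], ",")))
            v2 <- trimws(unlist(strsplit(featureList[[node2]][[category]], ",")))
            as.numeric(length(intersect(v1, v2)))
        }, numeric(1))
    }

    # "soften": at most one point per category, however many values are shared
    softSimilarity   <- function(node1, node2) sum(pmin(perCategoryMatches(node1, node2), 1))
    # "enhance": a double concordance in one category counts more than two single ones
    strongSimilarity <- function(node1, node2) sum(perCategoryMatches(node1, node2)^2)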

The peril of optimization

A fair amount of time was spent trying to understand how to use the solutions that were given for some networks. The objective was to use the features provided to learn something about the real circles given as a training set. I loaded the networks in Gephi and tried to find a clue about the role features play in the formation of circles. Unfortunately I did not find anything useful.

Looking back, this was probably a lucky thing. I disregarded the circles provided and treated the problem as a fully unsupervised one (imitating, out of necessity, what McAuley and Leskovec had done). In the end this proved very fruitful. The real circles seem to be only tenuously correlated with the features. Moreover, features in the real world are necessarily very dirty (try to remember the last time you corrected your lists on Facebook). Tuning your parameters on dubious data increases the danger of overfitting to the training set.

The Results

Life interfered with the project and I could not work on it for the last two months. There are some things I left untried (of course I could have made things worse, as the public leaderboard is a poor indicator of the final result).

At the end of the competition’s allotted period I was a little past 100th place on the public leaderboard (there were slightly more than 200 teams). Particularly bad for my placement, and for my target of finishing in the upper half, was the arrival of a group of bright Russian students from the course “Applied Problems of Data Mining” at Lomonosov Moscow State University.

When it was over and the final results were published, the outcome was way beyond my expectations: I was 13th. Not bad at all. Without the Russians (five of them finished ahead of me) I would have made it into the top ten.

Not bad

Evidently many contestants had overfitted their models; with an unsupervised model I avoided that risk.

Conclusions, lessons learned

I learned a lot of things. Most of all, this was my introduction to an interesting and deep field, and there is still so much to learn. igraph is a beautiful package and very handy to use; the documentation could be better, but that is to be expected.

I found Gephi fascinating and very useful: you can analyze, visualize, and print networks in a very simple way. I have just scratched the surface of it, but this is one of the things I want to get better at. I was also very impressed with the performance of the infomap algorithm.

I also learned that you can go a long way with just R and standard packages. And there are a lot of people out there who are very kind and will answer your questions, in forums and via email, without hesitation.

It was very important for me to have an employer that holds training in new technologies in high regard: BBF was kind enough to devote some of my office time to this competition.