Microblogging and Advertising Data Mining Challenge

Completed • $8,000

KDD Cup 2012, Track 1

Mon 20 Feb 2012 – Fri 1 Jun 2012

Important Dates

Mar 1, 2012 Datasets Available
Mar 15, 2012 Competition Begins
Jun 1, 2012 Competition Ends
Jun 8, 2012 Winners Notified
Aug 12, 2012 Workshop

Questions? Contact us at kddcup_questions@kdd2012.com

Predict which users (or information sources) one user might follow in Tencent Weibo.


Online social networking services have become tremendously popular in recent years, with sites like Facebook, Twitter, and Tencent Weibo adding thousands of enthusiastic new users each day to their existing billions of actively engaged users. Since its launch in April 2010, Tencent Weibo, one of the largest micro-blogging websites in China, has become a major platform for building friendships and sharing interests online. Currently, there are more than 200 million registered users on Tencent Weibo, generating over 40 million messages each day. This scale benefits Tencent Weibo users, but it can also flood them with huge volumes of information and hence put them at risk of information overload. Reducing this risk is a priority for improving the user experience, and it also presents opportunities for novel data mining solutions. Thus, capturing users’ interests and accordingly serving them with potentially interesting items (e.g. news, games, advertisements, products) is a fundamental and crucial feature of social networking websites like Tencent Weibo.


The prediction task involves predicting whether or not a user will follow an item that has been recommended to the user. Items can be persons, organizations, or groups and will be defined more thoroughly below. 


First, we define some terms:

“Item”: An item is a specific user in Tencent Weibo, which can be a person, an organization, or a group, that was selected and recommended to other users. Typically, celebrities, famous organizations, or some well-known groups were selected to form the ‘items set’ for recommendation. This set contains about 6K items in the dataset.

Items are organized in categories. Each category belongs to a parent category, and together the categories form a hierarchy. For example, the item for the VIP user Dr. Kaifu LEE

VIP user: http://t.qq.com/kaifulee (Wikipedia: http://en.wikipedia.org/wiki/Kai-Fu_Lee)

is represented as

  • science-and-technology.internet.mobile

Categories at different levels are separated by a dot ‘.’, and the category information about an item can help improve your model's predictions. For example, if a user Peter follows kaifulee, he may be interested in the other items of the category that kaifulee belongs to, and might also be interested in the items of the parent category of kaifulee’s category.
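The parent-category idea can be sketched in a few lines of Python. This is a minimal illustration, not part of the competition materials; the function name `category_ancestors` is hypothetical.

```python
def category_ancestors(category: str) -> list[str]:
    """Return the category followed by each of its parent categories,
    from most specific to most general."""
    parts = category.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


levels = category_ancestors("science-and-technology.internet.mobile")
# levels[0] is the item's own category; later entries are its parents.
for level in levels:
    print(level)
```

Each returned prefix could serve as a separate feature, so that two items sharing only a parent category still look related to the model.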

“Tweet”: a “tweet” is the action of a user posting a message to the microblog system, or the posted message itself. When a user tweets, his/her followers see the tweet.

“Retweet”: a user can repost a tweet, optionally appending a comment, to share it with more people (the user’s own followers).

“Comment”: a user can add comments to a tweet. Unlike tweets and retweets, the contents of a comment are not automatically pushed to the user’s followers, but appear in the ‘comment history’ of the commented tweet.

“Followee/follower”: if User B is followed by User A, then B is a followee of A, and A is a follower of B.

We describe the datasets as follows:

The dataset represents a sampled snapshot of Tencent Weibo users’ preferences for various items: the recommendations of items to users and the users’ ‘following’ history. It is larger in scale than other publicly available datasets released to date. It also provides richer information in multiple domains, such as user profiles, the social graph, and item categories, which may inspire new ideas and methods.

The users in the dataset, numbering in the millions, are provided with rich information (demographics, profile keywords, follow history, etc.) for building a good prediction model. To protect the privacy of the users, the IDs of both the users and the recommended items are anonymized as random numbers such that no identity is revealed. Furthermore, their information, when in Chinese, is encoded as random strings or numbers, so contestants who understand Chinese gain no advantage. Timestamps of the recommendations are given for performing session analysis.

The two datasets are provided in 7 downloadable text files:

a) Training dataset: some fields are in the file rec_log_train.txt

b) Testing dataset: some fields are in the file rec_log_test.txt

Format of the above 2 files:


Result: values are 1 or -1, where 1 means that the user UserId accepted the recommendation of item ItemId and followed it (i.e., added it to his/her social network), and -1 means that the user rejected the recommended item.

We provide the true values of the ‘Result’ field in rec_log_train.txt, whereas in rec_log_test.txt the true values of the ‘Result’ field are withheld (for simplicity, in the file they are always 0). Another difference is that repeated recommended (UserId, ItemId) pairs were removed from rec_log_test.txt.
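A minimal Python sketch of reading such a log and dropping repeated (UserId, ItemId) pairs follows. The column order (UserId, ItemId, Result, timestamp, tab-separated) is an assumption, as are the names `RecLogRow`, `parse_rec_log`, and `dedupe_pairs`. The sample rows are made up, and keeping the first occurrence of a repeated pair is an arbitrary choice for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class RecLogRow(NamedTuple):
    user_id: int
    item_id: int
    result: int      # 1 = followed, -1 = rejected, 0 = withheld (test file)
    timestamp: int


def parse_rec_log(lines: Iterable[str]) -> Iterator[RecLogRow]:
    # ASSUMPTION: four tab-separated columns in this order.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user_id, item_id, result, timestamp = line.split("\t")
        yield RecLogRow(int(user_id), int(item_id), int(result), int(timestamp))


def dedupe_pairs(rows: Iterable[RecLogRow]) -> Iterator[RecLogRow]:
    """Keep only the first row seen for each (user_id, item_id) pair."""
    seen = set()
    for row in rows:
        key = (row.user_id, row.item_id)
        if key not in seen:
            seen.add(key)
            yield row


# Made-up sample rows, for illustration only.
sample = [
    "100\t200\t1\t1318348785",
    "100\t200\t-1\t1318348790",
    "100\t201\t-1\t1318348800",
]
rows = list(dedupe_pairs(parse_rec_log(sample)))
print(len(rows))
```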

c) More fields of the training and the testing datasets about the users and the items are in the following 5 files:

    i. User profile data: user_profile.txt

Each line contains the following information of a user: the year of birth, the gender, the number of tweets and the tag-Ids. It is important to note that information about the users to be recommended is also in this file.



Year of birth is selected by the user at registration.

Gender has an integer value of 0, 1, or 2, which represents “unknown”, “male”, or “female”, respectively.

Number-of-tweet is an integer that represents the number of tweets the user has posted.

Tags are selected by users to represent their interests. If a user likes mountain climbing and swimming, he/she may select "mountain climbing" or "swimming" as his/her tags. Some users select nothing. The original tags in natural language are not used here; each unique tag is encoded as a unique integer.

Tag-Ids are in the form “tag-id1;tag-id2;...;tag-idN”. If a user doesn’t have tags, Tag-Ids will be "0".
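The Gender and Tag-Ids encodings can be decoded with a couple of small helpers. This is a sketch only; `parse_tag_ids` and `gender_label` are hypothetical names, and mapping unexpected gender codes to "unknown" is an assumption of the sketch.

```python
GENDER_LABELS = {0: "unknown", 1: "male", 2: "female"}


def parse_tag_ids(field: str) -> list[int]:
    """Decode a Tag-Ids field; the literal "0" means the user has no tags."""
    field = field.strip()
    if field == "0":
        return []
    return [int(tag) for tag in field.split(";") if tag]


def gender_label(code: int) -> str:
    # ASSUMPTION: codes outside 0..2 are treated as "unknown".
    return GENDER_LABELS.get(code, "unknown")


print(parse_tag_ids("12;34;56"))
print(parse_tag_ids("0"))
print(gender_label(2))
```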

    ii. Item data: item.txt

Each line contains the following information of an item: its category and keywords.



Item-Category is a string “a.b.c.d”, where the categories in the hierarchy are delimited by the character “.” and ordered top-down (i.e., category ‘a’ is a parent category of ‘b’, category ‘b’ is a parent category of ‘c’, and so on).

Item-Keyword contains the keywords extracted from the corresponding Weibo profile of the person, organization, or group. The format is a string “id1;id2;…;idN”, where each unique keyword is encoded as a unique integer such that no real term is revealed.

    iii. User action data: user_action.txt

The file user_action.txt contains the statistics about the ‘at’ (@) actions between the users in a certain number of recent days.


(UserId)\t(Action-Destination-UserId)\t(Number-of-at-action)\t(Number-of-retweet)\t(Number-of-comment)

If user A wants to notify another user about his/her tweet/retweet/comment, he/she uses an ‘at’ (@) action, such as ‘@tiger’ (here the user to be notified is ‘tiger’).

For example, if user A has retweeted user B 5 times, has “at”-ed B 3 times, and has commented on user B’s tweets 6 times, then there is one line “A   B     3     5     6” in user_action.txt.
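The line format given for user_action.txt maps directly onto a small record type. The following Python sketch parses one line; the names `UserAction` and `parse_user_action` are hypothetical, and user IDs are kept as strings so that the "A"/"B" example parses unchanged.

```python
from typing import NamedTuple


class UserAction(NamedTuple):
    user_id: str
    dest_user_id: str
    n_at: int
    n_retweet: int
    n_comment: int


def parse_user_action(line: str) -> UserAction:
    """Parse one tab-separated line of user_action.txt."""
    user_id, dest, n_at, n_retweet, n_comment = line.rstrip("\n").split("\t")
    return UserAction(user_id, dest, int(n_at), int(n_retweet), int(n_comment))


action = parse_user_action("A\tB\t3\t5\t6")
# Total number of interactions from A to B across the three action types.
total_interactions = action.n_at + action.n_retweet + action.n_comment
print(action, total_interactions)
```

Summing the three counts is just one simple way to turn the record into an interaction-strength feature; weighting the action types differently is equally plausible.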

    iv. User sns data: user_sns.txt

The file user_sns.txt contains each user’s follow history (i.e., the history of following another user). Note that the following relationship can be reciprocal.



    v. User key word data: user_key_word.txt

The file user_key_word.txt contains the keywords extracted from the tweet/retweet/comment by each user.



Keywords are in the form “kw1:weight1;kw2:weight2;…kw3:weight3”.

Keywords are extracted from the tweets/retweets/comments of a user, and can be used as features to better represent the user in your prediction model. The greater the weight, the more interested the user is in the keyword.

Every keyword is encoded as a unique integer, and the keywords of the users are from the same vocabulary as the Item-Keyword. 
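Because user keywords and Item-Keyword share one vocabulary, a simple user-item affinity feature is the summed weight of the user's keywords that also appear on the item. The Python sketch below illustrates one way to compute it; the function names and sample values are made up, and this feature is only one possible use of the data, not a method prescribed by the organizers.

```python
def parse_user_keywords(field: str) -> dict[int, float]:
    """Decode "kw1:weight1;kw2:weight2;..." into {keyword_id: weight}."""
    weights = {}
    for pair in field.strip().split(";"):
        if not pair:
            continue
        keyword, weight = pair.split(":")
        weights[int(keyword)] = float(weight)
    return weights


def parse_item_keywords(field: str) -> set[int]:
    """Decode an Item-Keyword field "id1;id2;...;idN" into a set of ids."""
    return {int(k) for k in field.strip().split(";") if k}


def keyword_overlap_score(user_kw: dict[int, float], item_kw: set[int]) -> float:
    """Sum the user's weights over keywords shared with the item."""
    return sum(w for k, w in user_kw.items() if k in item_kw)


# Made-up values, for illustration only.
user_kw = parse_user_keywords("183:0.5;2105:0.25;71:1.0")
item_kw = parse_item_keywords("71;183;999")
score = keyword_overlap_score(user_kw, item_kw)
print(score)
```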


Teams’ scores and ranks on the leaderboard are based on a metric calculated from the predictions in the submitted result file and held-out ground truth. Until the last day of the competition (June 1, 2012), the ground truth is that of a validation dataset, a fixed set of instances sampled from the testing dataset at the beginning of the competition. After that, the scores and associated ranks on the leaderboard are based on the predictions and ground truth for the rest of the testing dataset. The top-3 ranked teams at the time the competition ends are the winners. The log used to form the training dataset covers an earlier period than the log used for the testing dataset.

The evaluation metric is average precision. For a detailed definition of the metric, please refer to the tab ‘Evaluation’. 
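For orientation, here is a sketch of one common definition of average precision over a ranked list, truncated at a cutoff `k`. The official definition on the ‘Evaluation’ tab may differ in its cutoff and in its normalization, so the cutoff and the choice to divide by the number of items the user actually followed are both assumptions of this sketch, and the function names are hypothetical.

```python
def average_precision_at_k(ranked_items, followed_items, k):
    """Average precision of a ranked recommendation list, cut off at k.

    ASSUMPTION: normalized by the number of items the user followed;
    a user who followed nothing scores 0.
    """
    followed = set(followed_items)
    if not followed:
        return 0.0
    hits = 0
    total = 0.0
    seen = set()
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in followed and item not in seen:
            hits += 1
            total += hits / rank   # precision at this rank
        seen.add(item)
    return total / len(followed)


def mean_average_precision(cases, k):
    """Mean of per-user average precision over (ranked, followed) pairs."""
    if not cases:
        return 0.0
    return sum(average_precision_at_k(r, f, k) for r, f in cases) / len(cases)


# Hits at ranks 1 and 3: (1/1 + 2/3) / 2
ap = average_precision_at_k(["a", "b", "c"], ["a", "c"], k=3)
print(ap)
```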


The prizes for the 1st-, 2nd-, and 3rd-place winners of Track 1 are US$5,000, US$2,000, and US$1,000, respectively.


Started: 12:01 am, Monday 20 February 2012 UTC
Ended: 11:59 pm, Friday 1 June 2012 UTC (102 total days)
Points: this competition awarded standard ranking points
Tiers: this competition counted towards tiers