SybilRank: Aiding the Detection of Fake Accounts
in Large Scale Social Online Services

Overview

Online social networking (OSN) services are vulnerable to Sybil attacks, in which a malicious user creates multiple fake OSN accounts. Fake (Sybil) accounts are detrimental to both OSN providers and OSN users, because OSN services are designed for real users. Sybils enable attackers to abuse an OSN's messaging system to distribute spam, and to waste an OSN advertising customer's budget by making them pay for ad clicks or impressions from or to fake profiles. They can also be used to harvest users' private contact lists and personal information. Furthermore, Sybils can be used to manipulate search engine results or to pollute crowdsourcing results. The goal of this project is to design a scalable and effective Sybil defense that enables an OSN with millions of users to detect a substantial fraction of Sybils.

Motivation

Because they are created for a multitude of reasons, OSN Sybils exhibit numerous and diverse profile features and activity patterns. Automated feature-based Sybil detection (e.g., based on machine learning) suffers from high false positive and false negative rates due to the large variety and unpredictability of both legitimate and malicious OSN users' behavior. This is reminiscent of how machine learning has not been very effective in intrusion detection. Because automated account suspension is therefore not applicable, OSNs employ CAPTCHAs and photo-based social authentication to rate-limit suspected users, or manually inspect the features of accounts reported as abusive (flagged) by other users. The manual inspection involves matching profile photos to the stated age or address, understanding natural language in posts, examining the friends of a user, and so on. The inspectors also use simple tools to parse the activity and IP statistics of suspicious accounts. However, these tasks require human intelligence and intuition, making them hard to automate and scale. In addition, flagged accounts have by definition already caused annoyance to some users. Therefore, the need arises for a method that lets human verifiers focus on accounts that are very likely to be fake, so that these accounts can be disabled in a timely manner.

Figure 1: SybilRank as a part of an OSN's Sybil defense toolchain

Main idea

SybilRank is based on the assumption that Sybils have limited social connections to real users. It relies on the observation that an early-terminated random walk starting from a non-Sybil node in a social network has a higher degree-normalized landing probability (landing probability divided by the node's degree) at non-Sybil nodes than at Sybil nodes. Intuitively, this observation holds because the limited number of attack edges forms a narrow passage from the non-Sybil region to the Sybil region of the social network. When a random walk is sufficiently long, it has a uniform degree-normalized probability of landing at any node, a property referred to as the convergence of the random walk. However, an early-terminated random walk originating from a non-Sybil node tends to stay within the non-Sybil region of the network, as the walk is unlikely to traverse one of the relatively few attack edges. The short sketch after Figure 2 illustrates this effect on a toy graph.
Figure 2: Early-terminated random walks
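
To make the intuition concrete, here is a minimal sketch (ours, not the deployed implementation) that builds a toy graph with a non-Sybil clique and a Sybil clique joined by a single attack edge, and compares the degree-normalized landing probabilities of a short walk versus a long one. All node names and sizes are invented for illustration.

from collections import defaultdict

def landing_probabilities(edges, start, steps):
    """Degree-normalized landing probabilities of a 'steps'-step random walk."""
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        graph[v].append(u)
    prob = {node: 0.0 for node in graph}
    prob[start] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in graph}
        for node, p in prob.items():
            for neighbor in graph[node]:
                nxt[neighbor] += p / len(graph[node])  # split the mass evenly over edges
        prob = nxt
    # Degree-normalize: divide each landing probability by the node's degree.
    return {node: prob[node] / len(graph[node]) for node in graph}

# Non-Sybil region and Sybil region are two small cliques; a single attack
# edge ("h1", "s1") is the only bridge between them.
honest, sybils = ["h1", "h2", "h3", "h4"], ["s1", "s2", "s3", "s4"]
edges = [(a, b) for region in (honest, sybils)
         for i, a in enumerate(region) for b in region[i + 1:]]
edges.append(("h1", "s1"))

short_walk = landing_probabilities(edges, start="h2", steps=3)
long_walk = landing_probabilities(edges, start="h2", steps=200)
print({n: round(p, 4) for n, p in sorted(short_walk.items())})
print({n: round(p, 4) for n, p in sorted(long_walk.items())})

After only a few steps, the non-Sybil nodes receive clearly higher degree-normalized probability than the Sybils; after many steps, the values become nearly uniform across all nodes, matching the convergence property described above.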

Our key insight is to formulate Sybil detection as scalable user ranking, and to derive a quality ranking from the landing probabilities of early-terminated random walks that start from non-Sybil nodes. We screen out low-ranked nodes as potential fakes. SybilRank's user ranking can be computed efficiently on a parallel computing framework, resulting in a practical solution for large OSNs.
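
The following is a hedged, sequential sketch of that ranking pipeline under the assumptions just stated: trust is seeded at known non-Sybil nodes, propagated with an early-terminated power iteration (on the order of log2(n) steps), and users are then ranked by degree-normalized trust, lowest first. The function and parameter names are ours, not those of the deployed system. Each iteration is an independent per-node sum over neighbors, which is what makes it straightforward to distribute on a parallel computing framework.

import math
from collections import defaultdict

def rank_users_by_trust(edges, trusted_seeds, num_iterations=None):
    """Rank users by degree-normalized trust, lowest (most suspicious) first."""
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        graph[v].append(u)
    n = len(graph)
    if num_iterations is None:
        num_iterations = math.ceil(math.log2(n))  # early termination

    # Distribute the total trust evenly among the trusted non-Sybil seeds.
    trust = {node: 0.0 for node in graph}
    for seed in trusted_seeds:
        trust[seed] = n / len(trusted_seeds)

    for _ in range(num_iterations):
        new_trust = {}
        for node in graph:  # independent per node, hence easy to parallelize
            new_trust[node] = sum(trust[nb] / len(graph[nb]) for nb in graph[node])
        trust = new_trust

    # Lowest degree-normalized trust first: likely fakes for human inspection.
    return sorted(graph, key=lambda node: trust[node] / len(graph[node]))

The head of the returned list (the lowest-ranked users) is what would be handed to human verifiers, as in the deployment described below.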

Figure 3: Total execution time as a function of the graph size on an EC2 cluster.

Scalability

We evaluate the scalability of our prototype by processing large synthetic graphs with up to 160 million nodes on an Amazon EC2 cluster. The cluster consists of 11 m1.large instances, one of which serves as the master and the other 10 as slaves. The synthetic graphs are generated with exponentially increasing sizes from 10M to 160M nodes, i.e., 10M, 20M, ..., 160M. As Figure 3 shows, the total execution time increases almost linearly with graph size. For the largest graph (160M nodes), our implementation finishes in less than 33 hours on this 11-node cluster.

SybilRank in the real world

We deployed SybilRank at Tuenti, the largest online social network in Spain. We ran SybilRank on a snapshot of Tuenti's complete social friendship graph, obtained in August 2011; the complete graph has 11 million nodes and 1.4 billion edges. Tuenti's security team inspected 2K users at the bottom of the resulting ranked list, and all of them were fake (Sybils). We further examined the ranked list by inspecting the one million lowest-ranked users: we randomly selected 100 users out of each 50K-user interval for inspection. Among the 200K lowest-ranked users, around 90% are fake, as opposed to the ~5% hit rate of Tuenti's abuse-report-based approach in 2011. This suggests an 18-fold increase in the efficiency with which Tuenti can process suspected accounts.

We also studied the social connections among the detected Sybils. As shown in Figure 4, the fake accounts exhibit various formations: we captured Sybils that are only loosely connected to one another, Sybils that are tightly connected among themselves, and even full-mesh connection patterns among some of the detected Sybils.
Figure 4: Sybils in a 20K-user Tuenti community. Dark nodes are detected Sybils and grey nodes are real accounts. Edges represent social connections among users.

Publications

People

Contact

qiangcao AT cs DOT duke DOT edu