Christine Chai, Research on Facebook, A21
Project: Technical Presentations
Speech Date: 2012/09/28
Not until Tuesday did I know that I had already signed up for today’s speech, due to a misunderstanding between the EVP team and me. I thought the sign-up was for the self-introduction at the mentor-mentee symposium. I originally wanted to cancel this speech, but since Bang-Ruei is going to join the army next Monday and Angel made time to come here despite a busy schedule, I decided to give it. I would like to talk about where I was all summer. I went to the Institute of Information Science at Academia Sinica as a summer intern, working on Facebook account misuse detection.
Due to the increasing popularity of social networks such as Facebook, account security plays an important role in society. The Facebook server approves a user’s login as long as the account name and password are correct. However, this implies that anyone who gets YOUR password can successfully log in to YOUR Facebook account. Is there a way to find out who is actually using an account?
Currently, Facebook records the IP address, operating system, and web browser of the user. When the user logs in from a different IP address or place, the server may ask him/her to recognize some pictures of his/her friends. This sounds logical, but some people tag their friends on pictures of cartoon characters or even objects. As a result, it becomes more difficult for the real account owner to pass the test. On the other hand, if someone logs in to a friend’s account, it is very probable that he/she knows many of the account’s friends in real life. This makes it easier for acquaintances to pass the verification.
In my research team, we directly record the statistical behavior of accounts (the webpages visited, to be exact) with a plug-in program that runs beside the browser. By observing how the account user clicks “like”, comments on statuses, views webpages, and so on, we found that people behave differently when using their own account than when using someone else’s. If one is using his/her own account, he/she is more active in commenting and clicking “like”, and he/she views fewer personal messages and private clubs.
We conducted an experiment with two pairs of subjects, and all the four subjects had to do was use their own or someone else’s Facebook account. The subjects in the same pair knew each other well, while people in different pairs were strangers. The experiment consisted of three rounds, so every subject had the chance to use his/her own, the partner’s, and a stranger’s account. In this way, we acquired a lot of account misuse data.
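As a rough illustration only, the three-round rotation described above could be written down as follows. The subject labels (A1, A2, B1, B2), the function name, and the order of the rounds are assumptions made for this sketch; the speech only says that every subject used his/her own, the partner's, and a stranger's account.

```python
from typing import Dict, List, Tuple

# Hypothetical labels: A1 and A2 know each other, and so do B1 and B2.
pairs = [("A1", "A2"), ("B1", "B2")]


def build_schedule(pairs: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Return one mapping per round: subject -> owner of the account used."""
    (a1, a2), (b1, b2) = pairs
    own = {s: s for s in (a1, a2, b1, b2)}        # everyone uses their own account
    partner = {a1: a2, a2: a1, b1: b2, b2: b1}    # swap inside each pair
    stranger = {a1: b1, b1: a1, a2: b2, b2: a2}   # swap across the two pairs
    return [own, partner, stranger]


schedule = build_schedule(pairs)
for round_number, assignment in enumerate(schedule, start=1):
    print(f"Round {round_number}: {assignment}")
```

In every round each account is used by exactly one subject, so rounds two and three both produce sessions in which the user is not the owner.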
Before analyzing the data, we had to do some cleanup. Some subjects did not take the experiment seriously because they did not concentrate on using Facebook. For example, they might fall asleep or watch a long movie on YouTube from a link shared by friends. As a result, we had to eliminate the account traces that were idle for more than five minutes.
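The cleanup rule above can be sketched in a few lines of Python. This is a minimal illustration, not the research team's actual program. It assumes that a trace is simply a list of event times in seconds, and that a whole trace is dropped when any pause between two consecutive events is longer than five minutes; the trace names and numbers are made up.

```python
from typing import Dict, List

IDLE_LIMIT_SECONDS = 5 * 60  # the five-minute threshold from the speech


def longest_idle_gap(timestamps: List[float]) -> float:
    """Return the longest pause, in seconds, between consecutive events."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return max(gaps, default=0.0)


def drop_idle_traces(traces: Dict[str, List[float]],
                     limit: float = IDLE_LIMIT_SECONDS) -> Dict[str, List[float]]:
    """Keep only the traces whose longest pause does not exceed the limit."""
    return {name: times for name, times in traces.items()
            if longest_idle_gap(times) <= limit}


# Hypothetical traces: event times in seconds since the session started.
traces = {
    "session_a": [0, 40, 95, 200, 260],  # longest pause is 105 seconds
    "session_b": [0, 30, 900, 930],      # one pause of 870 seconds
}
kept = drop_idle_traces(traces)
print(sorted(kept))
```

Here only session_a survives, because session_b contains a pause far longer than five minutes.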
The events generated by the subjects mainly fall into three categories: act, expand, and view. Each category is further divided into several subsets. These are the independent variables, and we hope to find a line separating self-users from non-self-users. As there are many variables, the line lies in such a high-dimensional space that we need a mathematical program to generate it.
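To show what "finding a separating line" means, here is a small sketch that uses a perceptron, one of the simplest ways to learn such a line. The speech does not say which algorithm or software the team used, so the perceptron is a stand-in chosen for brevity. The feature values (counts of act, expand, and view events per session) are invented, and only three features are used instead of the many subsets mentioned above.

```python
from typing import List, Sequence, Tuple

# (feature vector, label): label is +1 for the owner, -1 for someone else.
Sample = Tuple[Sequence[float], int]


def train_perceptron(samples: List[Sample],
                     epochs: int = 100) -> Tuple[List[float], float]:
    """Learn weights w and bias b so that the sign of w.x + b matches the labels."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            if label * score <= 0:  # wrong side of the line, or exactly on it
                weights = [w + label * x for w, x in zip(weights, features)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # every training sample is on the correct side
            break
    return weights, bias


def predict(weights: Sequence[float], bias: float,
            features: Sequence[float]) -> int:
    """Return +1 (owner) or -1 (someone else) for one session."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else -1


# Invented sessions: (act, expand, view) event counts.
samples: List[Sample] = [
    ((12, 5, 3), 1), ((9, 4, 2), 1), ((15, 6, 4), 1),      # owner sessions
    ((2, 5, 10), -1), ((1, 3, 12), -1), ((3, 6, 9), -1),   # non-owner sessions
]
weights, bias = train_perceptron(samples)
print(predict(weights, bias, (10, 5, 3)))   # many "act" events, few "view"
print(predict(weights, bias, (2, 4, 11)))   # few "act" events, many "view"
```

With real data the two groups rarely separate perfectly, which is why the result is reported as an accuracy rather than as a perfect split.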
Finally, we achieved an accuracy of around 80%, and the 20% misclassifications can be divided into false positives and false negatives. A false positive means that we claim an account is being used by someone else while it is actually being used by its owner; a false negative is the opposite. Which kind of error is more serious? False negatives are dangerous because they leave some account hackers undetected, so it is more important to minimize them. The good news is that most of our misclassifications are not of this kind. However, false positives matter too, because for many account owners, overly frequent verification is annoying. Imagine that every time you log in to Facebook, the system asks you for your cell phone number again and again. Would you enjoy being treated like this?
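The two kinds of error can be counted with a short sketch like the one below. It follows the convention in the speech, where "positive" means the system flags a session as used by someone else. The labels are made up so that the arithmetic is easy to follow; they are not the experiment's data.

```python
from typing import Dict, List


def tally(actual_misuse: List[bool], flagged: List[bool]) -> Dict[str, int]:
    """Count the four outcomes; "positive" means flagged as used by someone else."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for is_misuse, was_flagged in zip(actual_misuse, flagged):
        if was_flagged:
            counts["TP" if is_misuse else "FP"] += 1
        else:
            counts["FN" if is_misuse else "TN"] += 1
    return counts


def accuracy(counts: Dict[str, int]) -> float:
    """Share of sessions that were classified correctly."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Made-up labels for 20 sessions: 10 by someone else, then 10 by the owner.
actual_misuse = [True] * 10 + [False] * 10
flagged = [True] * 9 + [False] * 1 + [True] * 3 + [False] * 7

counts = tally(actual_misuse, flagged)
print(counts)
print(f"accuracy = {accuracy(counts):.2f}")
```

In this toy example, 16 of 20 sessions are classified correctly. Of the four mistakes, three are false positives (owners asked to verify needlessly) and one is a false negative (a misuse that slips through).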
At present, we are trying to improve the accuracy by finding new features and cutting unrelated ones for the two types of Facebook account behavior. I hope to succeed in the project and make it the basis of my Master’s thesis.
It was a really difficult summer. I started learning everything from scratch, from machine learning algorithms to the mathematical software. What’s more, I was preparing for the TOEFL test at the same time, so I felt like a third-year student in senior high school. It was a tiring but meaningful experience. Now I can remotely access the data at Academia Sinica from campus, so I don’t need to take public transportation to the research center very often. I sincerely hope that my efforts will pay off in the future, and I wish the same for all of you.
After sharing some thoughts, I would like to share a conclusion: be aware of what you do online. Researchers and even attackers can get a lot of information from Facebook pages. According to another paper, “Facebook: What’s in a Name,” the user name alone reveals one’s gender, because most users put their real names, or at least part of them, on their profiles. The research group obtained the popular baby name list from the US government website and added some local information, such as who is already in a relationship. Please note that every source of this information was public. Among all the Facebook profiles in the same city with gender specified, the researchers used half for training and the other half for testing. To my surprise, the proposed scheme correctly identified the sex of more than 90% of account users in a different place.
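The idea of training on half of the profiles and testing on the other half can be illustrated with a deliberately simple sketch. It is not the method from the paper: it only memorizes which gender was most often reported with each first name. The profiles, the names in them, and the function names are all invented for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Profile = Tuple[str, str]  # (displayed name, self-reported gender)


def first_name(display_name: str) -> str:
    """Take the first word of the displayed name, in lower case."""
    return display_name.split()[0].lower()


def train(profiles: List[Profile]) -> Dict[str, str]:
    """Map each first name to the gender most often reported with it."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for name, gender in profiles:
        votes[first_name(name)][gender] += 1
    return {name: tally.most_common(1)[0][0] for name, tally in votes.items()}


def guess(model: Dict[str, str], display_name: str) -> Optional[str]:
    """Return the learned gender, or None if the first name was never seen."""
    return model.get(first_name(display_name))


# Invented profiles; the first half is used for training, the rest for testing.
profiles: List[Profile] = [
    ("Mary Lin", "F"), ("John Wu", "M"), ("Linda Ho", "F"), ("David Tsai", "M"),
    ("Mary Chen", "F"), ("John Lee", "M"), ("Linda Kuo", "F"), ("Alex Yang", "M"),
]
half = len(profiles) // 2
training_half, testing_half = profiles[:half], profiles[half:]

model = train(training_half)
correct = sum(guess(model, name) == gender for name, gender in testing_half)
print(f"{correct} of {len(testing_half)} test profiles guessed correctly")
```

The one miss here is a first name that never appeared in the training half, which hints at why a large public name list is useful.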
To sum up, Facebook is an abundant data pool because it keeps track of everything we have done on the server. But don’t worry too much. Even professors doing research on information security use Facebook themselves, so we just need to be responsible for the information we reveal.
Toastmasters of the evening.