The Graduand: Big Data

Summary
Whoever said that “information is power” could well have been talking about big data. No one could have predicted that social media sites created for mere entertainment and social purposes would become a treasure chest of valuable data about one’s very existence. However, while everyone knows what big data is, that it exists and where it is held, there is much confusion as to what to do with it. Researchers have come to realize that their old methods of collecting and analyzing data are becoming less and less useful, and individuals are becoming afraid that their personal information can be used against them. Many industries are also grappling with the fact that they are not equipped to handle the sheer volume of big data.

Despite the confusion, big data is being used in various industries in various ways, although reports suggest that so far only about eight per cent of big data is being extracted and put to use. In their article, “Six Provocations for Big Data,” Danah Boyd and Kate Crawford put big data into perspective by identifying six challenges of which social scientists need to be aware when dealing with big data. This paper therefore endeavors to delve deeper into each of the provocations to better understand the confusion surrounding big data.

Introduction

As more and more data is collected, curated and analysed, the term “Big Data” is becoming the buzzword in every industry and institution today. In simple terms, big data comprises datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyse (McKinsey and Company, 2011). However, the term “Big Data” does not merely refer to the size or volume of datasets. The International Data Corporation (IDC) defines big data as:

“Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis” (Gantz & Reinsel, 2011)

Hence, the definition of Big Data also concerns a computational turn in thought and research. With the advent of platforms like Twitter, Facebook and Google Earth, datasets have expanded to include maps and images along with text and numbers. Facebook, for example, allows its estimated 800 million users to upload an endless number of photos and videos (Hancock, 2012), and Twitter publishes about 500 million tweets per day, organising them and storing them in perpetuity (Raftery, 2012); both therefore contribute significantly to creating Big Data. Technological advancements also allow just about anyone to access large datasets that were previously available exclusively to academic and scientific institutions. For instance, a wide range of data is accessible at data.gov to anyone with an internet connection (Weigel, 2012).

While proponents of Big Data celebrate it as creating value by enhancing productivity and improving organizational transparency and forecasting models, sceptics see it as creating new social problems. In their article, “Six Provocations for Big Data,” Danah Boyd (Boyd) and Kate Crawford (Crawford) identify areas in which Big Data may create new issues in social science research. Using those six provocations as a framework, this paper will explore the challenges that big data presents to social scientists.

Automating Research Changes the Definition of Knowledge

The scientific method is the process by which scientists, collectively and over time, endeavor to construct a reliable, consistent and non-arbitrary representation of the world (Wolfs, n.d.). To do this, scientists built the scientific method around testable hypotheses, which are then tested to either falsify or confirm theoretical models of how the world works. However, with the availability of Big Data, the theories developed through the scientific method could become obsolete, and in effect change the way the world is understood. Big Data and advanced technological tools can potentially generate more accurate results than the specialists or domain experts who traditionally craft carefully targeted hypotheses and research strategies (Anderson, 2008). But in order to get more accurate results from these data, the scientific method as a whole needs to change. This is because, under the scientific methods currently employed, a larger amount of data means more false correlations. For instance, with Big Data, one can easily find spurious correlations such as, “On Mondays, people who drive to work are more likely to get the flu.” While such a correlation may genuinely appear in the data, it does not explain why it holds or whether there is any causality. Thus, scientists now need to step outside their laboratories to develop new methods of testing causality in settings where they can no longer rely on controlling for variables (Edge.org, 2012). But because data is being created in real time, the reality is that knowledge is going to outpace understanding: data users are now able to predict outcomes, but without understanding what drove those outcomes (Weinberger, 2012).
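The point about false correlations can be illustrated with a small simulation. The sketch below is not taken from any of the cited sources; it assumes purely random data, 30 observations per variable, and a cut-off of |r| > 0.36 (roughly the conventional five per cent significance level for that sample size). The function names are invented for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(n_variables, n_observations=30, threshold=0.36, seed=0):
    """Count random variables whose |r| with a random target exceeds the threshold.

    Every series is independent noise, so any "correlation" found is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_observations)]
    hits = 0
    for _ in range(n_variables):
        noise = [rng.gauss(0, 1) for _ in range(n_observations)]
        if abs(pearson(noise, target)) > threshold:
            hits += 1
    return hits


few = count_spurious(n_variables=5)
many = count_spurious(n_variables=2000)
print(f"5 random variables: {few} appear correlated with the target")
print(f"2000 random variables: {many} appear correlated with the target")
```

None of the variables has any real relationship with the target, yet the number that pass the cut-off grows with the number of variables examined, which is the sense in which more data produces more false correlations.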

Claims to Objectivity and Accuracy Are Misleading

As computational scientists begin to study society using quantitative methods, there is a danger that the results of their research will be accepted as facts rather than being open to interpretation. This is a concern because the quantitative method comes with limitations, particularly in its use in social science. Since quantitative researchers need to design their study carefully even before data is collected, they need to know in advance what they are looking for. With Big Data, however, because of its volume and because it includes both unstructured and semi-structured data, it is impossible to know in advance what to look for. Since quantitative analysis deals with data in the form of numbers and statistics, it also ignores contextual detail, in which Big Data is rich.

Bigger Data Is Not Always Better Data

Bigger data is not necessarily better data, simply because not all data is worth exploiting. With more data, there is an increased possibility of redundant data as well. Take, for instance, Twitter updates and shares in the form of hashtags. When a particular update gets retweeted again and again, it creates a high volume of data in terms of the number of tweets. However, what one gets is the same data repeated to create Big Data, while the amount of information gathered from those data remains the same. Therefore, the value of information derived from Big Data is not dependent on the data size. One may argue, however, that while a retweet itself gives no further data because it is a copy of another tweet, the number of tweets and retweets offers considerable insight, as many marketers know, into the popularity or importance of something.
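The distinction between volume and information can be made concrete with a toy example. The tweets below are invented, and the assumption that a retweet starts with "RT @user: " is a simplification for illustration rather than a description of how Twitter's data is actually structured.

```python
import re
from collections import Counter

# Hypothetical sample of tweets; the "RT @user: " prefix marks a retweet.
tweets = [
    "Big data is the new oil #bigdata",
    "RT @alice: Big data is the new oil #bigdata",
    "RT @bob: Big data is the new oil #bigdata",
    "RT @carol: Big data is the new oil #bigdata",
    "Sampling still matters #statistics",
]

RETWEET_PREFIX = re.compile(r"^RT @\w+: ")


def original_text(tweet):
    """Strip a leading retweet marker so that copies collapse onto the original."""
    return RETWEET_PREFIX.sub("", tweet)


# Count how often each distinct message appears once retweets are collapsed.
popularity = Counter(original_text(t) for t in tweets)

volume = len(tweets)                 # how much data was collected
distinct_messages = len(popularity)  # how much distinct content it holds

print(f"{volume} tweets, {distinct_messages} distinct messages")
print(popularity.most_common(1))
```

Five tweets reduce to two distinct messages, so the volume overstates the content. The counts themselves, however, still show which message was shared most, which is the marketers' counter-argument.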

There is also the issue of limitations inherent in sampling. No matter how much data one can gather, it will never be the entire population. This is because Big Data is being created in real time and therefore, at any point in time, it is impossible to collect the entire population. Hence, researchers are always going to be dealing with a sample, even though it may be a relatively large one. This brings one to the next question: “Does a bigger sample size mean better information?” Not necessarily. While a larger sample may provide better estimates with a lower standard error, its inherent bias may skew the results, and this may be the case even if the samples were picked randomly. Suppose, for example, that a company operating in a certain industry has collected 'big data' on its customers in a particular country. If it wants to use that data to make assertions about its existing customers in that country, then its data may yield accurate results. If, however, it wants to draw conclusions about a larger population (potential as well as existing customers, or customers in another country), then it becomes important to consider to what extent the customers about whom data has been collected are representative, perhaps in income, age, gender, education and so on, of the larger population (Brown, 2012).
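The customer example can be sketched as a simulation. All of the figures below are invented for illustration: it is assumed that existing customers have a higher average income than potential customers, and the company's 'big data' covers only the former.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: existing customers earn more on average than
# potential customers. All figures are invented for illustration.
existing = [rng.gauss(60, 10) for _ in range(100_000)]
potential = [rng.gauss(40, 10) for _ in range(100_000)]
population = existing + potential
true_mean = statistics.fmean(population)

# A very large sample drawn only from existing customers (biased) ...
big_biased = rng.sample(existing, 50_000)
# ... versus a far smaller sample drawn at random from the whole population.
small_random = rng.sample(population, 500)

biased_error = abs(statistics.fmean(big_biased) - true_mean)
random_error = abs(statistics.fmean(small_random) - true_mean)

print(f"Error of 50,000 biased records: {biased_error:.2f}")
print(f"Error of 500 random records:    {random_error:.2f}")
```

Under these assumptions the large sample estimates the income of existing customers very precisely, but it is a poor estimate for the wider population; the small representative sample is noisier yet lands much closer to the true figure.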

Not All Data Is Equal

Once again, it is simplistic to assume that bigger amounts of data will provide better information using the same analytical tools used with small data. This is because data is not always interchangeable and, with Big Data, context is critical. Traditionally, datasets have generally been structured, so the “context” of those data can be found within the records or files which contain them. However, Big Data includes high volumes of unstructured and semi-structured data. In fact, while on average sixty per cent of the data created today is unstructured, businesses are currently able to capture just eight per cent of it. This is because the tools and techniques that have proved successful in transforming structured data into business intelligence and actionable information simply do not work when it comes to unstructured data (Bank of America Corporation, 2012). For example, researchers in social network analysis study networks through data traces in articulated and behavioural networks. But the problem with this kind of data is that, even though it provides valuable information, it does not necessarily represent the nature and complexity of social behaviours (Boyd & Crawford, 2011).

Just Because It’s Accessible Doesn’t Make It Ethical

Before tackling the issue of ethics in Big Data, it is important to understand what ethics means. The Oxford dictionary defines ethics as,

“the branch of knowledge that deals with moral principles, governing a person’s behaviour or the conducting of an activity.”

The definition itself highlights the main issue with the ethics of Big Data: the conflicting moral principles of the parties involved, mainly the people who create the data and the users of those data. A huge amount of Big Data is created by people who do not understand how information about themselves is being used and for what reasons (Boyd & Crawford, 2011). While Big Data is ethically neutral, the use of Big Data is not. In his book, “Ethics of Big Data: Balancing Risk and Innovation,” Kord Davis (Davis) identifies four aspects of Big Data ethics: identity, privacy, ownership and reputation. This paper will thus touch on those four aspects to understand the issue of ethics in Big Data.

Identity

Most social media sites, such as Facebook and Google, function on the basis that each user has only one identity and hence portrays a mirror image of themselves in the virtual world. The founder and CEO of Facebook, Mark Zuckerberg, believes in a singular identity so strongly that he said,

“Having two identities for yourself is an example of a lack of integrity… The days of you having a different image for your work friends or co-workers and for the other people you know are probably coming to an end pretty quickly” (Newman, 2011).

Indeed, Facebook and Google have real-name policies, which make it mandatory for their users to use only their true names online, failing which they may have their accounts deleted. For the purposes of using Big Data, real names are thought to be more useful, for example in advertising, where advertisers can target relevant advertising if they know whom they are dealing with. It also helps researchers gather more meaningful and accurate data if they know the real identity of a particular user. However, problems arise on Twitter, for instance, where users are allowed pseudonyms and can have multiple accounts. Users of the data will find it difficult to trace who the account holders are in order to derive any kind of useful information.

However, proponents of anonymity claim that real-name policies deny users freedom of expression and thus place people such as political activists and abuse survivors at risk (Heussner, 2012). One such proponent is Chris Poole, the founder of 4chan and Canvas, who argues that identity is prismatic and that one acts out various “selves” to different groups of people anyway; permitting a user to have multiple identities therefore allows for more accurate information about people in general. Supporting this notion is Universal McCann’s former vice president, David Cohen, who said,

“There’s the use of pseudonyms to mask behaviour that we wouldn’t condone in the world of marketing and the use of masking that is simply another personification of your persona, along the lines of interests and passions.”

The latter, he says, could create additional consumer insights (Heussner, 2012).

Privacy

The next aspect of ethics in Big Data is privacy, which, unlike the identity issue discussed above, is a black-and-white issue. In fact, Paul Ohm, an associate professor at the University of Colorado Law School, describes the issue best by saying that data can be either useful or perfectly anonymous, but never both (Ohm, 2009). The harvesting of large datasets from social media platforms clearly raises privacy concerns. It is still unclear how major social media sites such as Twitter and Facebook use the data created by their users, although their friend-suggestion tools and targeted advertising make it obvious that the data is being used in some way. Before recent technological advancements, one could rely on anonymity, for example by using pseudonyms, for some assurance that one’s privacy was maintained when data about oneself was being collected. This is because organisations used methods of de-identification such as anonymisation, pseudonymisation, encryption, key coding and data sharding to distance data from real identities, allowing analysis to proceed while maintaining privacy (Polonetsky & Tene, 2012). However, research by computer scientists has found that anonymised data can often be re-identified and attributed to specific individuals, thus causing huge privacy concerns (Narayanan & Shmatikov, 2008).
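One simple way in which re-identification can happen is by linking an "anonymised" dataset to a public one through attributes that both share. The sketch below is a toy illustration only: the records, names and field names are all invented, and it is not the algorithm from the cited paper, which deals with sparse and noisy data.

```python
# Hypothetical "anonymised" records: names removed, other attributes kept.
anonymised = [
    {"id": "u1", "postcode": "2000", "birth_year": 1985, "gender": "F", "purchase": "medication A"},
    {"id": "u2", "postcode": "2000", "birth_year": 1990, "gender": "M", "purchase": "book B"},
    {"id": "u3", "postcode": "3000", "birth_year": 1985, "gender": "F", "purchase": "ticket C"},
]

# Hypothetical public profiles that carry the same attributes plus a name.
public_profiles = [
    {"name": "Alice Example", "postcode": "2000", "birth_year": 1985, "gender": "F"},
    {"name": "Bob Example", "postcode": "2000", "birth_year": 1990, "gender": "M"},
    {"name": "Dana Example", "postcode": "4000", "birth_year": 1970, "gender": "F"},
]

# Attributes that are not names but can still single a person out in combination.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def key(record):
    """Combine the quasi-identifiers of a record into one lookup key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymised_records, profiles):
    """Map anonymised ids to names wherever the quasi-identifiers match uniquely."""
    names_by_key = {}
    for profile in profiles:
        names_by_key.setdefault(key(profile), []).append(profile["name"])

    matches = {}
    for record in anonymised_records:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # only an unambiguous match counts as re-identification
            matches[record["id"]] = names[0]
    return matches


matches = reidentify(anonymised, public_profiles)
print(matches)  # {'u1': 'Alice Example', 'u2': 'Bob Example'}
```

Removing names was not enough here: two of the three records are attributed to named individuals, along with what they purchased, because the remaining attributes were unique in combination.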

Ownership

The third issue Davis identifies is the ownership of collected data, and a good example of the complexity of the issue is Wikileaks. To whom does the information released on Wikileaks rightfully belong? Does it belong to the governments, the general public, Wikileaks or the newspapers which published it? Facebook, for instance, has its privacy policy set such that any information that anyone puts on its site belongs to Facebook (Facebook.com, 2012). Basically, this means that people hold the rights to their information or data until they choose to put it up on the internet, which then changes the ownership altogether. While this may seem like a new problem, it is in fact an old one. If someone decides to confide some personal information in another person, then that other person is being given the right to do whatever he wants with that piece of information. He may choose to pass the information on to other people, he may combine it with other information that he already knows about that person to arrive at some sort of conclusion, or he may simply choose to forget it. It is the same problem, but on a different platform and on a bigger scale. This is why many pundits argue that those who are afraid of their private data being misused, say on Facebook, should simply either not sign up for the service or not put up posts that may get “leaked.”

Reputation

The final aspect of ethics in big data has to do with reputation, which raises the question,

“Is data trustworthy?”

In March 2012, TheMotleyFool.com published its list of the top ten “Big Data Stocks,” in which LinkedIn.com (LinkedIn) was placed third (Reeves, 2012). However, in June that year LinkedIn fell victim to hackers, which put at risk the data and privacy of more than six million of its users and jeopardised the website’s reputation (Barlow, 2012). With the advent of big data, data security becomes a major issue for data mining companies like LinkedIn. Another problem is that, once LinkedIn lost control of their data, its users were stripped of their private identities on the site, putting their reputations at risk as well. This is all the more significant because LinkedIn is known to be the preferred social network for government agencies (Williams, 2012).

Gone are the days when one could act one way towards a certain group of people and another way towards a different group without each group finding out. Now, as more and more information is collected about someone, organisations and even groups of people that the person does not know can make assumptions about him or her. For instance, in 2011, Kurt Nordland was seeking workers' compensation from his company, and the insurance company paying his claim decided to look him up on Facebook, where it found pictures of him drinking beer and relaxing at the beach. Based on those pictures, the insurance company cancelled his payments and cut off his medical benefits, and Nordland had to delay surgery to repair torn cartilage in his shoulder (Romero, 2011).

Limited Access to Big Data Creates New Digital Divides

When pundits talk about the digital divide, they are generally referring to the issue of accessibility of technology. Boyd and Crawford argue that big data is not easily accessible, because only social media companies have access to these data and they have no obligation to make them available. For example, Twitter has created different grades of access to its data. The Gardenhose access level grants ten per cent of public statuses for free but requires case-by-case approval by Twitter, while the most popular level, Spritzer, provides any user with just one per cent of public statuses. If money is not an issue, then the highest level of access, the Firehose, grants users access to all of Twitter's data, although Twitter varies the price from time to time (Gannes, 2012). This grading system alone shows that not everyone gets equal access to data.

However, the new era of big data also heralds a new kind of digital divide: a divide between the people who are able to use data and those who are not. Even if a researcher has access to big data, he or she must have the technical knowledge to collect and analyse the data, a skill that many in the social sciences lack; this inadvertently favours computational scientists (Boyd & Crawford, 2011).

Conclusion

Using the article by Boyd and Crawford as a framework, this paper has made a deeper analysis of each of their provocations. Although big data is generally identified by its volume, it is the quality of the data that truly matters. To be able to sift out quality data, however, one needs the technical skills and financial capability to collect and analyse it. In the Twitter example, to have access to the Firehose, one needs to be able to afford to purchase the data in the first place. In addition, because big data includes a high percentage of unstructured and semi-structured data, such as images and text, the research methodologies traditionally used in the social sciences need to change to suit the new type of data. Researchers also have to contend with correlations in data without understanding the causality behind them: because data is created in real time, scientists do not have the luxury of time to understand causality. Privacy concerns, of course, present a major roadblock to the availability and collection of those data. Creators of big data often do not understand how information about them is used, and more and more people are pushing for better privacy policies on social media sites. Another problem is that many people use pseudonyms to hide their real identities so that information about them cannot be traced easily. However, technological advancements have allowed pseudonymous and anonymised identities to be de-anonymised, revealing people's true selves. The digital divide presents yet another problem in that most of the critical data is available only to the financially and politically elite, with sites like Twitter charging a huge amount of money for most of their data.

Although their article highlights major challenges with big data, one cannot deny that big data has significantly improved people’s lives. The IDC reported a potential increase of three hundred billion dollars in annual revenue for the American healthcare sector and a hundred-billion-dollar increase in revenue for telecommunications service providers. Technology research giant Gartner reported a potential sixty per cent increase in net margin in the retail industry while predicting a potential annual value of two hundred and fifty billion euro to Europe’s public sector (DataArt, 2012). Therefore, while the potential of big data is enormous, the task of overcoming the six provocations identified by Boyd and Crawford is equally big.

References

Anderson, C., 2008. The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. [Online] Available at: http://www.wired.com/science/discoveries/magazine/16-07/pb_theory [Accessed 5 November 2012].

Bank of America Corporation, 2012. Tech's 'Big Data' Drive. [Online] Available at: http://wealthmanagement.ml.com/wm/Pages/ArticleViewer.aspx?title=techs-big-data-drive [Accessed 28 October 2012].

Barlow, R., 2012. LinkedIn Hacking: What You Need to Know. [Online] Available at: http://www.bu.edu/today/2012/linkedin-hacking-what-you-need-to-know/ [Accessed 3 November 2012].

Boyd, D. and Crawford, K., 2011. ‘Six Provocations for Big Data’, paper given at the Oxford Internet Institute’s ‘A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society’. Oxford, England.

Brown, M., 2012. Big Data Blasphemy: Why Sample?. [Online] Available at: http://smartdatacollective.com/node/47591 [Accessed 5 November 2012].

DataArt, 2012. Industries Set to Benefit from Big Data. [Online] Available at: http://www.dataart.com/software-outsourcing/big-data/industry-benefits [Accessed 4 November 2012].

Davis, K. (ed.), 2012. Ethics of Big Data: Balancing Risk and Innovation. California: O'Reilly Media, Inc.

Edge.org, 2012. Reinventing Society in the Age of Big Data. [Online] Available at: http://www.edge.org/conversation/reinventing-society-in-the-wake-of-big-data [Accessed 4 November 2012].

Facebook, 2012. Statement of Rights and Responsibilities. [Online] Available at: https://www.facebook.com/legal/terms [Accessed 4 November 2012].

Gannes, L., 2012. Twitter Firehose Too Intense? Take a Sip from the Gardenhose or Sample the Spritzer. [Online] Available at: http://allthingsd.com/20101110/twitter-firehose-too-intense-take-a-sip-from-the-garden-hose-or-sample-the-spritzer/ [Accessed 4 November 2012].

Gantz, J. and Reinsel, D., 2011. The 2011 Digital Universe Study: Extracting Value from Chaos, Massachusetts: IDC Go-to-Market Services.

Hancock, C., 2012. Broadband is important but BIG DATA is the future!. [Online] Available at: http://www.abc.net.au/technology/articles/2012/04/16/3478362.htm [Accessed 22 October 2012].

Heussner, K.H., 2012. The Internet Identity Crisis. [Online] Available at: http://www.adweek.com/news/technology/internet-identity-crisis-137991 [Accessed 26 October 2012].

McKinsey and Company, 2011. Big Data: The next frontier for innovation, competition and productivity, s.l.: McKinsey Global Institute.

Narayanan, A. and Shmatikov, V., 2008. ‘Robust De-anonymization of Large Sparse Datasets’, paper given at the IEEE Symposium on Security and Privacy. Washington, United States of America.

Newman, J., 2011. Facebook comments Expose a Flaw in Zuckerberg's Vision. [Online] Available at: http://technologizer.com/2011/03/07/facebook-comments-zuckerberg-vision/ [Accessed 4 November 2012].

Ohm, P., 2009. Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. UCLA Law Review, 57, p. 1701.

Polonetsky, J. and Tene, O., 2012. Privacy in the Age of Big Data: A Time for Big Decisions. Stanford Law Review, 64, p. 63.

Raftery, T., 2012. Sustainability, social media and big data. [Online] Available at: http://theenergycollective.com/tom-raftery/138166/sustainability-social-media-and-big-data [Accessed 3 November 2012].

Reeves, J., 2012. Top 10 Big Data Stocks: LinkedIn. [Online] Available at: http://www.fool.com/investing/general/2012/03/06/top-10-big-data-stocks-linkedin.aspx [Accessed 4 November 2012].

Romero, R., 2011. Are insurance companies spying on your Facebook page?. [Online] Available at: http://abclocal.go.com/kabc/story?section=news/consumer&id=8422388 [Accessed 2 November 2012].

Weigel, M., 2012. What is Big Data? Research roundup, reading list. [Online] Available at: http://journalistsresource.org/studies/economics/business/what-big-data-research-roundup [Accessed 9 October 2012].

Weinberger, D., 2012. Understanding big data vs. theory. [Online] Available at: http://www.kmworld.com/Articles/Column/David-Weinberger/Understanding-big-data-vs.-theory-85784.aspx [Accessed 29 October 2012].

Williams, M., 2012. Social Media Still Has Skeptics in Government. [Online] Available at: http://www.govtech.com/policy-management/Social-Media-Skeptics-Government.html [Accessed 1 November 2012].

Wolfs, F., n.d. Introduction to the Scientific Method. [Online] Available at: https://sites.google.com/site/kapostase/home/fyr/use-of-the-scientific-method [Accessed 30 October 2012].


Tuesday, December 25, 2012
