Summary
Whoever said that “information is power” could well have been talking about big data. Few could have predicted that social media sites created for mere entertainment and social purposes would become a treasure chest of valuable data about people's lives. However, while most people know that big data exists and where it is held, there is much confusion as to what to do with it. Experts in the field of research have come to realize that their old methods of collecting and analyzing data are becoming less and less useful, and individuals are becoming afraid that their personal information could be used against them. Many industries are also grappling with the fact that they are not equipped to handle the sheer volume of big data.
Despite the confusion, big data is already being used in
various industries in various ways, although reports suggest that so far only
about eight per cent of big data is being extracted and put to use. In their
article, “The 6 Provocations of Big Data,” Danah Boyd and Kate Crawford put big
data into perspective by identifying six challenges which social scientists
need to be aware of when dealing with big data. This paper therefore delves
deeper into each of the provocations to better understand the
confusion surrounding big data.
Introduction
As more and more data is being
collected, curated and analysed, the term “Big Data” is becoming the buzzword
in every industry and institution today. In simple terms, big data comprises
datasets whose size is beyond the ability of typical database software tools to
capture, store, manage and analyse (McKinsey and Company, 2011). However, the term
“Big Data” does not merely refer to the size or volume of datasets. The
International Data Corporation (IDC) defines big data as:
“Big data technologies describe a new generation of technologies and architectures, designed
to economically extract value from very large volumes of a wide variety of
data, by enabling high-velocity capture, discovery, and/or analysis” (Gantz & Reinsel, 2011).
Hence, the definition of Big Data also
concerns a computational turn in thought and research. With the advent of
platforms like Twitter, Facebook and Google Earth, datasets have
expanded to include maps and images along with text and numbers. Facebook, for
example, allows its estimated 800 million users (Hancock, 2012) to upload an
endless amount of photos and videos, while Twitter publishes about 500 million
tweets per day, organising them and storing them in perpetuity (Raftery, 2012).
Both therefore contribute significantly to creating Big Data. Technological advancements
also allow just about anyone to access large datasets which were
previously available exclusively to academic and scientific institutions.
For instance, a wide range of public datasets is accessible at data.gov to anyone
with an internet connection (Weigel, 2012).
While proponents of Big Data
celebrate it as creating value by enhancing productivity, improving
organizational transparency and improving forecasting models, sceptics see it as
creating new social problems. In their article, “The 6 Provocations for Big
Data,” Danah Boyd (Boyd) and Kate Crawford (Crawford) identify areas in which
Big Data may create new issues in social science research. Using those six
provocations as a framework, this paper explores the challenges that big
data presents to social scientists.
Automating Research Changes the Definition of Knowledge
The scientific
method is the process by which scientists, collectively and over time, endeavor
to construct a reliable, consistent and non-arbitrary representation of the
world (Wolfs, n.d.). To do this,
scientists built the scientific method around testable hypotheses, which are
then tested to either falsify or confirm theoretical models of how the world
works. However, with the availability of Big Data, the theories developed
through the scientific method could become obsolete, and in effect change the
way the world is understood. The availability of Big Data and advanced
technological tools can potentially generate more accurate results than
specialists or domain experts who traditionally craft carefully targeted
hypotheses and research strategies (Anderson, 2008). But in order to get
more accurate results from these data, the scientific method as a whole needs
to change. This is because a larger amount of data means more false correlations
under the scientific methods currently employed: when enough variables are
compared, some will move together purely by chance. For instance, with Big
Data, one can easily get correlations such as, “On Mondays, people who
drive to work are more likely to get the flu.” While such a pattern may hold
in the data, the data alone will not explain why it holds or whether
there is any causality. Thus, scientists now need to step outside their
laboratories to develop new methods of testing causality where they can no
longer rely on controlling for variables (Edge.org, 2012). But because data is
being created in real time, knowledge is going to outpace
understanding: data users are now able to predict outcomes
without understanding what drove those outcomes (Weinberger, 2012).
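The point about false correlations can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the variables are purely random numbers invented for the example, the function names are hypothetical, and thirty observations per variable is an arbitrary choice. It generates variables that have nothing to do with one another and reports the strongest correlation found between any pair.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_variables, n_observations, seed=0):
    """Largest |r| among all pairs of independently generated random variables."""
    rng = random.Random(seed)
    # Each column is an unrelated variable: independent Gaussian noise.
    columns = [[rng.gauss(0, 1) for _ in range(n_observations)]
               for _ in range(n_variables)]
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))


if __name__ == "__main__":
    for n_variables in (5, 50, 200):
        r = strongest_chance_correlation(n_variables, n_observations=30)
        print(f"{n_variables:>3} unrelated variables -> strongest |r| = {r:.2f}")
```

Because the number of pairs grows much faster than the number of variables, the strongest chance correlation tends to rise as more variables are added, even though none of them is causally related to any other.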
Claims to Objectivity and Accuracy Are Misleading
As computational
scientists begin to study society using quantitative methods, there is a
danger that the results of their research will be accepted as facts rather than
being open to interpretation. This matters because the quantitative method comes with
limitations, particularly in its use in social science. Since quantitative researchers need to carefully design their study before any data is
collected, they need to know in advance what they are looking for. With Big
Data, however, because of its volume and because it includes both unstructured and
semi-structured data, it is impossible to know in advance what to look for.
Since quantitative analysis deals with data in the form of numbers and
statistics, it also ignores contextual detail, in which Big Data is rich.
Bigger Data Is Not Always Better Data
Bigger data is
not necessarily better data, simply because not all data is worth exploiting.
With more data, there is also an increased possibility of redundant data.
Take, for instance, Twitter updates shared in the form of hashtags. When a
particular update gets retweeted again and again, it creates a high volume of
data in terms of the number of tweets. However, what one gets is the same data
being repeated to create Big Data, while the amount of information gathered
from those data remains the same. Therefore, the value of information derived
from Big Data does not depend on the size of the data. However, one may argue that while a retweet
itself gives no further data, as it is a copy of another tweet, the number of
tweets and retweets offers insight, as many marketers know, into the
popularity or importance of something.
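Both sides of this argument can be seen in a few lines of code. The sketch below uses a hypothetical handful of statuses and assumes, purely for illustration, that a retweet is marked with an "RT " prefix. It separates the volume of data collected from the number of distinct messages, while keeping the retweet counts that marketers care about.

```python
from collections import Counter

# Hypothetical sample of collected statuses; "RT " marks a retweet here.
statuses = [
    "Big data conference starts today #bigdata",
    "RT Big data conference starts today #bigdata",
    "RT Big data conference starts today #bigdata",
    "New privacy policy announced #privacy",
    "RT Big data conference starts today #bigdata",
]


def original_text(status):
    """Strip the retweet marker so copies map back to the original message."""
    return status[3:] if status.startswith("RT ") else status


def summarise(statuses):
    """Return (volume, number of distinct messages, copies per message)."""
    counts = Counter(original_text(s) for s in statuses)
    return len(statuses), len(counts), counts


volume, distinct, counts = summarise(statuses)
print(volume, "statuses collected,", distinct, "distinct messages")
print("Most repeated:", counts.most_common(1))
```

The volume is more than double the number of distinct messages, yet the count attached to each message is itself a signal of popularity.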
There is also
the issue of limitations inherent in sampling. No matter how much data one can
gather, it will never be the entire population. This is because Big Data is
being created in real time and therefore, at any point in time, it is
impossible to collect data on the entire population. Hence, researchers are always
going to be dealing with a sample, even though it may be a relatively
large one. This raises the next question: “Does a
bigger sample size mean better information?” This is not necessarily the case.
While a larger sample may provide estimates with a lower standard
error, a biased sample will skew the results no matter how large it is, and this may be
the case even if records were picked randomly from the data collected. Suppose, for example, that a
company operating in a certain industry has collected 'big data' on its
customers in one country. If it wants to use that data to make assertions
about its existing customers in that country, then its data may yield accurate results. If, however, it wants to draw conclusions about a larger population,
such as potential as well as existing customers, or customers in another country, then
it becomes important to consider to what extent the customers about whom data
has been collected are representative (perhaps in income, age, gender,
education, etc.) of the larger population (Brown, 2012).
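The customer example can be sketched as a simulation. Everything in it is an assumption made for illustration: the income distribution, the rule that people with above-median income are three times as likely to be customers, and the sample sizes. It compares a very large but unrepresentative dataset (all customers) with a small random sample of the whole population.

```python
import random
import statistics


def simulate(seed=1):
    """Compare a large biased sample with a small random one (hypothetical data)."""
    rng = random.Random(seed)
    # Hypothetical population of 100,000 incomes with a skewed distribution.
    population = [rng.lognormvariate(10, 0.5) for _ in range(100_000)]
    true_mean = statistics.fmean(population)

    # "Big data": every customer, where people with above-median income
    # are assumed to be three times as likely to be customers.
    cutoff = statistics.median(population)
    customers = [income for income in population
                 if rng.random() < (0.6 if income > cutoff else 0.2)]

    # Small data: 500 people drawn at random from the whole population.
    random_sample = rng.sample(population, 500)

    return {
        "true_mean": true_mean,
        "customers_n": len(customers),
        "customers_mean": statistics.fmean(customers),
        "random_n": len(random_sample),
        "random_mean": statistics.fmean(random_sample),
    }


if __name__ == "__main__":
    result = simulate()
    for name, value in result.items():
        print(f"{name:>15}: {value:,.0f}")
```

Under these assumptions the customer dataset is many times larger than the random sample, yet its average income sits well above the population average, because collecting more customers does nothing to make customers resemble non-customers.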
Not all Data is Equal
Once again, it
is simplistic to assume that bigger amounts of data will provide better
information using the same analytical tools used with small data. This is
because data is not always interchangeable and, with Big Data, context is
critical. Traditionally, datasets have generally been structured, so the
“context” of those data can be found within the records or the files which
contain those data. However, Big Data includes high volumes of unstructured and
semi-structured data. In fact, while on average sixty per cent of the data
created today is unstructured, businesses are currently able to capture
just eight per cent of the unstructured data. This is because the tools and
techniques that have proved successful in transforming structured data into
business intelligence and actionable information simply do not work when it
comes to unstructured data (Bank of America Corporation, 2012). For example,
researchers of social network analysis study networks through data traces in
articulated and behavioural networks. But the problem with this kind of data is
that even though it provides valuable information, it does not necessarily
represent the nature and complexity of social behaviours (Boyd & Crawford,
2011).
Just Because It’s Accessible Doesn't Make It Ethical
Before tackling
the issue of ethics in Big Data, it is important to understand what ethics
means. The Oxford dictionary defines ethics as,
“the branch of
knowledge that deals with moral principles, governing a person’s behaviour or
the conducting of an activity.”
The definition
itself highlights the main issue with the ethics of Big Data: the
conflicting moral principles of the
parties involved, mainly the people who create the data and the users of those
data. A huge amount of Big Data is created by people who do not understand how
information about themselves is being used and for what reasons (Boyd &
Crawford, 2011). While Big Data is ethically neutral, the use of Big Data is
not. In his book, “Ethics of Big Data – Balancing Risk and Innovation,” Kord
Davis (Davis) identifies four aspects of Big Data ethics: Identity, Privacy,
Ownership and Reputation. This paper touches on those four aspects to understand the issue of ethics in Big Data.
Identity
Most social
media sites, such as Facebook and Google, function on the basis that each user
has only one identity and hence is portraying a mirror image of themselves in
the virtual world. The founder and CEO of Facebook, Mark Zuckerberg, believes in a singular identity so strongly that he said,
“Having two identities
for yourself is an example of a lack of integrity…The days of you having
a different image for your work friends or co-workers and for the
other people you know are probably coming to an end pretty
quickly” (Newman, 2011).
Indeed, Facebook
and Google have real-name policies, which make it mandatory for their users to
use only their true names online, failing which they may have their accounts
deleted. For the purposes of using Big Data, real names are thought to be
more useful, for example in advertising, where advertisers can
target relevant advertising if they know who they are dealing with. Real names also help
researchers gather more meaningful and accurate data, since they know the real identity
of a particular user. However, problems arise on Twitter, for instance, where
users are allowed pseudonyms and can have multiple accounts. Users of data will
find it difficult to trace who the account holders are in order to derive any kind of useful
information.
But proponents
of anonymity claim that real-name policies deny users freedom of expression and
thus place people such as political activists and abuse survivors at risk
(Heussner, 2012). One such proponent is Chris Poole, the founder of 4chan and
Canvas, who argues that identity is prismatic, and that one chooses to act out
various “selves” to different groups of people anyway. Permitting a user to have
multiple identities therefore allows for more accurate information about people in general. Supporting
this notion is Universal McCann’s former Vice President, David Cohen, who said,
“There’s the use
of pseudonyms to mask behaviour that we wouldn’t condone in the world of
marketing and the use of masking that is simply another personification of your
persona, along the lines of interests and passions.”
The latter, he
says, could create additional consumer insights (Heussner, 2012).
Privacy
The next aspect
of ethics in Big Data is privacy, which, unlike identity as discussed above,
is a black and white issue. In fact, Paul Ohm, an Associate Professor at the
University of Colorado Law School, describes the issue best by saying
that data can either be useful or perfectly anonymous but never both (Ohm,
2009). The harvesting of large datasets
from social media platforms clearly raises privacy concerns. It is still
unclear how major social media sites such as Twitter and Facebook use data
created by their users, although their friend suggestion tools and
targeted advertising make it obvious that it is being used in some way. Before
these technological advancements, one could rely on anonymity or
pseudonyms for some assurance that one’s privacy was maintained when
data about oneself was being collected. This is because organisations used methods
of de-identification such as anonymisation, pseudonymisation, encryption, key
coding and data sharding to distance data from real identities, allowing
analysis to proceed while maintaining privacy (Polonetsky & Tene, 2012).
However, research by computer scientists has found that anonymised data can often
be re-identified and attributed to specific individuals, thus causing huge
privacy concerns (Narayanan & Shmatikov, 2008).
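How re-identification can happen is easiest to see with a toy example. The sketch below is not the statistical technique used in the research cited above; it is a much simpler linkage on shared attributes, and every record, name and field in it is invented for illustration. It joins a dataset from which names have been removed to a public listing that still carries names, using attributes that appear in both.

```python
# Hypothetical "anonymised" records: names removed, other attributes kept.
anonymised = [
    {"postcode": "2000", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "2000", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "3051", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public listing (for example, a membership roll) with names.
public = [
    {"name": "Alice Example", "postcode": "2000", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "postcode": "2000", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "postcode": "3051", "birth_year": 1975, "sex": "F"},
    {"name": "Dana Example", "postcode": "3051", "birth_year": 1975, "sex": "F"},
]

# Attributes present in both datasets that can act as a fingerprint.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def key(record):
    """Combination of quasi-identifier values for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymised, public):
    """Name each anonymised record whose key matches exactly one listed person."""
    names_by_key = {}
    for person in public:
        names_by_key.setdefault(key(person), []).append(person["name"])
    matches = {}
    for record in anonymised:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match singles out one individual
            matches[names[0]] = record["diagnosis"]
    return matches


print(reidentify(anonymised, public))
```

Two of the three records are tied to a single named person. The third stays ambiguous only because two people in the listing happen to share the same postcode, birth year and sex.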
Ownership
The third issue
Davis identifies is the ownership of collected data and a good example to
demonstrate the complexity of the issue is Wikileaks. Who does the information
released on Wikileaks rightfully belong to? Does it belong to the governments,
the general public, Wikileaks or the newspapers which published them? Facebook
for instance has its privacy policy set such that any information that anyone
puts on their site belongs to them (Facebook.com, 2012). Basically, this means
that one holds rights to their information or data until they choose to put
them up on the internet, which then changes the ownership altogether. While this
may seem like a new problem, it is in fact an old one. If someone decides to
confide some personal information in another person, then that person is
effectively given the ability to do whatever he wants with that piece of
information. He may repeat the information to other people,
he may combine it with other
information that he already knows about that person to arrive at some sort of
conclusion, or he may simply forget it. It is the same
problem, but on a different platform and on a bigger scale. This is why many
pundits argue that anyone who is afraid of their private data being misused, say
on Facebook, should simply either not sign up for the
service or not put up posts that may get “leaked.”
Reputation
The final aspect of ethics in big data has to do
with reputation, which raises the question,
“Is data trustworthy?”
In March 2012, TheMotleyFool.com published its
list of the top ten “Big Data Stocks,” in which LinkedIn.com (LinkedIn) was placed
third (Reeves, 2012). However, in June that year
LinkedIn fell victim to hackers, putting at risk the data and privacy of
more than six million of its users and jeopardising the website’s reputation
(Barlow, 2012). With the advent of big data, data security becomes a major
issue for data mining companies like LinkedIn. Another problem this brings is
that by losing control of their data, LinkedIn users were stripped of their
private identities on that site, thus putting their reputation at risk as well.
In fact, LinkedIn is known to be the preferred social network for government
agencies (Williams, 2012).
Gone are the days when one could act one way towards a
certain group of people and another way towards a different group without each
group finding out. Now, as more and more information gets collected about
a person, organisations and even groups of people that the person does not know can make assumptions
about them. For instance, in 2011, Kurt Nordland was seeking workers'
compensation from his company, and the insurance company paying his claim
decided to look him up on Facebook, where they found pictures of him drinking
beer and relaxing at the beach. Based on those pictures, the insurance company
cancelled his payments and cut off his medical benefits, and Nordland had to delay
surgery to repair torn cartilage in his shoulder (Romero, 2011).
Limited Access to Big Data Creates New Digital Divides
When
pundits talk about the digital divide, they are generally referring to the issue of
accessibility of technology. Boyd and Crawford argue that big data is not easily
accessible. This is because only social media companies have access to these data
and they have no obligation to make them available. For example, Twitter has
created different grades of access to its data. The Gardenhose access level grants ten per cent of public statuses for free,
subject to case-by-case approval by Twitter, while the most popular level, Spritzer,
provides any user with just one per cent of public statuses. If money
is not an issue, then the highest level of access, called the Firehose, grants
users access to all of its data, although Twitter varies the price from time to
time (Gannes, 2012). This grading system alone shows that not everyone gets
equal access to data.
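To put these grades of access in perspective, the short calculation below combines the percentages above with the figure of about 500 million tweets per day cited in the Introduction. It assumes, for simplicity, that every one of those tweets is a public status, so the numbers are rough orders of magnitude rather than actual counts, and the function name is hypothetical.

```python
DAILY_TWEETS = 500_000_000  # approximate figure cited in the Introduction

# Percentage of public statuses available at each access level.
ACCESS_LEVELS = {"Spritzer": 1, "Gardenhose": 10, "Firehose": 100}


def visible_per_day(level, daily_tweets=DAILY_TWEETS):
    """Approximate number of statuses a researcher can see per day at a level."""
    return daily_tweets * ACCESS_LEVELS[level] // 100


for level in ACCESS_LEVELS:
    print(f"{level:<10} {visible_per_day(level):>12,} statuses per day")
```

On these assumptions, a researcher on the free Spritzer level works with roughly five million statuses a day while missing the other ninety-nine per cent.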
However, the new era of big data also heralds a new kind of
digital divide: a divide between the people who are able to use
data and those who cannot. Even if
a researcher has access to big data, he must have the technical knowledge to
collect and analyse the data, a skill that many in the social sciences lack.
This inadvertently favours computational scientists (Boyd & Crawford,
2011).
Conclusion
Using
the article by Boyd and Crawford as a framework, this paper has analysed
each of their provocations in more depth. Although big data is generally
identified by its volume, it is the quality of the data that truly
matters. To be able to sift out quality data,
however, one needs the technical skills and financial capability to collect
and analyse them. In the Twitter example, to have access to its Firehose,
one needs to be able to afford to purchase the data in the first place. In
addition, because big data includes a high percentage of unstructured and
semi-structured data, such as images and texts, the research methodologies
traditionally used in the social sciences need to change to suit the new type of
data. Researchers also have to contend with correlations in data without
understanding the causality behind them: because data is created in real time,
scientists do not have the luxury of time to understand causality. Privacy
concerns, too, present a major roadblock to the availability and
collection of those data. Creators of big data often do not understand how
information about them is used, and more and more people are pushing for better
privacy policies on social media sites.
Another problem is that many people use pseudonyms to hide their
real identity so that information about them cannot be traced easily. However,
technological advancements have allowed supposedly anonymous data to be
de-anonymised, revealing the people behind it.
The digital divide presents yet another problem, in that most of the
critical data is only available to the financially and politically elite, with
sites like Twitter charging a huge amount of money for most of their
data.
Although
their article highlights major challenges with big data, one cannot deny that
big data has significantly improved people’s lives. The IDC reported a potential
increase of three hundred billion dollars in annual revenue for
American healthcare and a hundred billion dollar increase in revenue for
telecommunications service providers. Technology research giant Gartner
reported a sixty per cent potential increase in net margin in the retail
industry while predicting two hundred and fifty billion euros in potential annual value to
Europe’s public sector (DataArt, 2012). Therefore, while the potential of big
data is enormous, the task of overcoming the six provocations identified by
Boyd and Crawford is equally big.
References
Anderson,
C., 2008. The End of Theory: The Data Deluge Makes the Scientific Method
Obsolete. [Online] Available at: http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
[Accessed 5 November 2012].
Bank of America
Corporation, 2012. Tech's 'Big
Data' Drive. [Online]
Available at: http://wealthmanagement.ml.com/wm/Pages/ArticleViewer.aspx?title=techs-big-data-drive.
[Accessed 28 October 2012].
Barlow, R., 2012. LinkedIn Hacking:
What You Need to Know [Online]. Available at: http://www.bu.edu/today/2012/linkedin-hacking-what-you-need-to-know/
[Accessed 3 November 2012].
Boyd, D. and Crawford, K., 2011. ‘6
Provocations of Big Data’, paper given at Oxford Internet Institute Decade
in Time Symposium on the Dynamics of the Internet and Society. Oxford, England.
Brown, M.,
2012. Big Data Blasphemy: Why
Sample?. [Online]
Available at: http://smartdatacollective.com/node/47591
[Accessed 5 November 2012].
DataArt, 2012. Industries Set to Benefit from Big Data. [Online]
Available at: http://www.dataart.com/software-outsourcing/big-data/industry-benefits
[Accessed 4 November 2012]
Davis,
K. (ed). 2012. Ethics of Big Data: Balancing Risk and Innovation. California:
O'Reilly Media, Inc.
Edge.org, 2012. Reinventing Society in the Age of Big
Data. [Online]
Available at: http://www.edge.org/conversation/reinventing-society-in-the-wake-of-big-data
[Accessed 4 November 2012].
Facebook, 2012. Statement of Rights
and Responsibilities [Online]. Available at: https://www.facebook.com/legal/terms
[Accessed 4 November 2012].
Gannes, L., 2012. Twitter Firehose Too Intense? Take a Sip from the
Gardenhose or Sample the Spritzer [Online]. Available
at: http://allthingsd.com/20101110/twitter-firehose-too-intense-take-a-sip-from-the-garden-hose-or-sample-the-spritzer/
[Accessed 4 November
2012].
Gantz, J. & Reinsel, D., 2011. The 2011 Digital Universe Study:
Extracting Value from Chaos, Massachusetts: IDC Go-to-Market Services.
Hancock, C., 2012. Broadband is important but BIG DATA is the future!.
[Online]
Available at: http://www.abc.net.au/technology/articles/2012/04/16/3478362.htm
[Accessed 22 October 2012].
Heussner, K.H., 2012. The Internet Identity Crisis [Online]
Available at: http://www.adweek.com/news/technology/internet-identity-crisis-137991
[Accessed 26 October 2012].
McKinsey and Company, 2011. Big Data: The next frontier for
innovation, competition and productivity, s.l.: McKinsey Global Institute.
Narayanan, A. and Shmatikov, V., 2008. ‘Robust De-anonymization of Large
Sparse Datasets’, paper given at IEEE Symposium on Security and
Privacy. Washington, United States of America.
Newman, J., 2011. Facebook comments
Expose a Flaw in Zuckerberg's Vision [Online]. Available at: http://technologizer.com/2011/03/07/facebook-comments-zuckerberg-vision/
[Accessed 4 November 2012].
Ohm, P., 2009.
Broken Promises of Privacy: Responding to the Surprising Failure of
Anonymization. UCLA Law Review. 57,
pp.1701.
Polonetsky, J.
and Tene, O., 2012. Privacy in the Age of Big Data: A Time for Big Decisions. Stanford Law Review. 64, pp.63.
Raftery, T., 2012. Sustainability, social media and big data. [Online]
Available at: http://theenergycollective.com/tom-raftery/138166/sustainability-social-media-and-big-data
[Accessed 3 November 2012].
Reeves, J., 2012. Top 10 Big Data
Stocks: LinkedIn [Online] Available at: http://www.fool.com/investing/general/2012/03/06/top-10-big-data-stocks-linkedin.aspx
[Accessed 4 November 2012].
Romero, R., 2011. Are insurance companies spying on your Facebook page? [Online] Available at: http://abclocal.go.com/kabc/story?section=news/consumer&id=8422388
[Accessed 2 November 2012].
Weigel, M., 2012. What is Big Data? Research roundup, reading list. [Online]
Available at: http://journalistsresource.org/studies/economics/business/what-big-data-research-roundup
[Accessed 9 October 2012].
Weinberger, D., 2012. Understanding big data vs. theory. [Online]
Available at: http://www.kmworld.com/Articles/Column/David-Weinberger/Understanding-big-data-vs.-theory-85784.aspx
[Accessed 29 October 2012].
Williams, M., 2012. Social Media Still Has
Skeptics in Government [Online].
Available at: http://www.govtech.com/policy-management/Social-Media-Skeptics-Government.html
[Accessed 1 November 2012].
Wolfs, F., n.d. Introduction to the Scientific Method. [Online]
Available at: https://sites.google.com/site/kapostase/home/fyr/use-of-the-scientific-method
[Accessed 30 October 2012].