On data aggregation: Benefits and Issues

Wednesday, April 21, 2010 Posted by Cecilia Loureiro-Koechlin 1 comments
My work at the moment has to do with digital repositories, registries and data aggregation. I work in a university. I am a project analyst and have the privilege of witnessing state-of-the-art technology development and, most importantly, users’ reactions. Developers around me use semantic web technologies to create systems that harvest and update data about research from a variety of sources, store and record provenance, and preserve, give access to and display these data. The result is a registry that mirrors data which can have a variety of uses.

Data about research is data that describes research activities and researchers. Most of these data come from sources that are already publicly available: departmental and project websites. These sources number in the hundreds (lots of URLs to remember!). They are dispersed and disconnected. The point of collecting all these data is to have them in one place and to build connections between data objects which were not originally connected. Data objects are things like researcher, project, grant, etc. For example, we can find researcher A’s biography in website “one,” a list of his publications in website “two,” his name in three project websites, his name in grants in a research council website and a list of research interests in a group website. We can put everything together and investigate whether all these data belong to the same person. If so, we can present a much more complete picture of this researcher.
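To make the idea concrete, here is a minimal Python sketch of putting scattered facts together under one researcher while remembering where each fact came from. It is not the registry's actual code: the record layout, the source names and the `normalise` and `aggregate` helpers are all made up for illustration, and matching on a tidied-up name is a deliberately naive assumption. A real system would need much stronger evidence before deciding that two records belong to the same person.

```python
from collections import defaultdict

# Hypothetical harvested records: each one is a single fact found on a single website.
RECORDS = [
    {"source": "website-one", "name": "A. Researcher", "field": "biography",
     "value": "Studies digital repositories."},
    {"source": "website-two", "name": "A. Researcher", "field": "publication",
     "value": "Paper on data aggregation"},
    {"source": "project-site", "name": "a. researcher ", "field": "project",
     "value": "Registry project"},
    {"source": "group-site", "name": "B. Scholar", "field": "interest",
     "value": "privacy"},
]


def normalise(name):
    """Naive matching key (an assumption): lower-case and collapse whitespace."""
    return " ".join(name.lower().split())


def aggregate(records):
    """Group facts by normalised name, keeping the source (provenance) of each fact."""
    profiles = defaultdict(lambda: defaultdict(list))
    for rec in records:
        key = normalise(rec["name"])
        profiles[key][rec["field"]].append((rec["value"], rec["source"]))
    return profiles


profiles = aggregate(RECORDS)

# Show the combined picture of one researcher, fact by fact, with its origin.
for field, facts in profiles["a. researcher"].items():
    for value, source in facts:
        print(f"{field}: {value} (from {source})")
```

Keeping the source next to every value means a fact can always be traced back to the website it was harvested from, and removed again if that source turns out to be wrong.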

The benefits of data aggregation are obvious, at least to us. We can create improved pictures of researchers and their research activities at individual, departmental, university and field levels. Having all these connections can facilitate discovery of research opportunities or trends. We can identify connections between researchers who do not know each other, for example if they have similar research interests. We can build connections between research groups or identify research islands. Instead of having to navigate through a huge number of websites (via Google), users can access this information in one place. It can also help the inexperienced (e.g. students) to find information.
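As a rough illustration of the "similar research interests" example, the Python sketch below links any two researchers who share at least one interest and lists those who share none with anybody, which is only a crude stand-in for a "research island". The names, the interests and both helper functions are hypothetical, and exact keyword overlap is an assumption made to keep the example short.

```python
from itertools import combinations

# Hypothetical research interests, as they might look once aggregated.
INTERESTS = {
    "Researcher A": {"semantic web", "digital repositories", "provenance"},
    "Researcher B": {"provenance", "privacy"},
    "Researcher C": {"medieval history"},
}


def shared_interests(interests):
    """Return {(name1, name2): common interests} for every pair that overlaps."""
    links = {}
    for a, b in combinations(sorted(interests), 2):
        common = interests[a] & interests[b]
        if common:
            links[(a, b)] = common
    return links


def islands(interests, links):
    """Researchers who share no interest with anyone else in the data."""
    connected = {name for pair in links for name in pair}
    return sorted(set(interests) - connected)


links = shared_interests(INTERESTS)
isolated = islands(INTERESTS, links)

for (a, b), common in links.items():
    print(f"{a} and {b} share: {', '.join(sorted(common))}")
print("No overlap with anyone:", isolated)
```

Comparing every pair is fine for a department-sized list; a larger registry would more likely index researchers by interest first.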

To give you an idea of how this happens: users can see these data via a registry explorer (a search engine), or use them via APIs to create websites, widgets, etc. All of these have been developed in the office. The list of benefits is much longer than this, but I think the above gives you a good idea.

With all these fantastic benefits one could think no one could resist data aggregation. Everyone would prefer to access aggregated data rather than the individual, disconnected sources. Everyone would like to be aggregated so they can have a nice online profile. Well, that is not entirely true. While some people (geeks!) love the idea, some people think data aggregation raises many issues and brings with it many risks. Risks which they think are not worth taking.

--> This bit is a more general discussion
Having read some general literature on this topic I can summarise the main issues here. Data aggregation:
  • Threatens individuals’ privacy: one aspect of privacy is controlling information about oneself. I decide what information about myself to disclose (or not disclose), and where. Do we have the right to take data from sources which are not ours, store them, aggregate them and display them, even though these data are already public? Even if we publish data in a public place, we have the right to their privacy. It doesn’t mean we want everyone to read it.
  • Allows systems of surveillance: people may choose to disclose some of their private data in bits in different places, but by aggregating their data we are not only exposing those bits but also creating a more comprehensive picture of people's activities and interests. Aggregated data can play a big role in big-brother monitoring of people. That is invasion of privacy, isn’t it?
  • Can lead to security problems: since data aggregation makes it easier to identify people – people can be identified through bits of anonymous data put together – it can help identity theft and other kinds of crimes.
  • Can mislead people: aggregated data is not always comprehensible or true. How reliable are data aggregators and their sources? How do we know if the data presented is correct and belongs to the same person? One can get their profile mixed with someone else’s and that can lead to serious misunderstandings.
  • Does not always follow the original intentions of the data’s creators. Can we use data as we wish, for uses which are different from those originally intended by their owners? Would this be ethical? How can we enforce principles like the use limitation principle and the purpose specification principle? (van Wel and Royakkers, 2004)
  • Can violate contextual integrity, in other words, can de-contextualise data, changing its original meaning: the process of collecting and aggregating data involves moving information from its original (appropriate) context to different ones which are not necessarily appropriate. Some people will find this morally offensive (Nissenbaum, 1998).
The above are general issues and apply mostly to online data aggregators, which are spreading rapidly over the web (e.g., http://www.nodalbits.com/bits/spokeo-latest-personal-data-aggregator-exposing-data-privacy-fears/). These aggregators are hungry machines: they pick up everything they can (with or without permission) and offer their data (?) to a variety of businesses and users.

Nissenbaum (1997) warns us about two misleading (but common) assumptions:
  • Erroneous assumption 1: There is a realm of public information about persons to which no privacy norms apply.
  • Erroneous assumption 2: An aggregation of information does not violate privacy if its parts, taken individually, do not.
--> end of general discussion

These issues can be partly related to the work we are doing with information about research.

While we aggregate data in a much smaller, limited and controlled universe, we are facing some challenges as well. We are using research data, not personal data, from official, public websites in the university, and we make sure we always ask for consent from our contributors. If someone does not want to be in the registry we do not take their data. Simple.

Although we are not dealing with data of an intimate, personal nature, we are exposing the work of researchers. Whereas some researchers would like the publicity, others would consider this information private (to themselves or a small circle of colleagues), at least at the early stages of their work.

In some way we are creating a system of surveillance where others can monitor performance. Again, not everyone likes to be watched.

There are other things people have raised, things like:
  • How complete and accurate a picture can we build of their department or university if some people choose not to contribute and if we do not have control of the sources? How useful can an incomplete registry be?
  • How many errors or gaps can be identified or made more evident once data are aggregated? Can they be corrected?
  • Sharing: Do we need to share what we are doing? We do not want everyone to see what we are doing. (Research-data/Research activity privacy?)
  • Duplication: Is it going to replace our official websites? Why do you duplicate them?
  • Coverage: I am only interested in my field of research and I know where I can find relevant information. Aggregating research data is not useful.
Interesting, isn’t it? There are more issues of course, but again I hope the above gives you a good idea.

My work in the coming months is to try to clarify these issues with a set of users and to identify ways to address them (to solve or soften them). I can see this will involve three areas of work: one, improving the collection and visualisation of data; two, educating users and publicising our services in better ways; and three, listening to what our contributors say about their aggregated data. Yes, software development is not only about coding but also about finding out what people need and how they will react to what we do.

You can read:
Ethics of data mining and aggregation
Data aggregation: Actually a threat?
van Wel, L. and Royakkers, L. (2004), “Ethical issues in web data mining,” Ethics and Information Technology, 6, pp. 129–140.
Nissenbaum, H. (1997), “Toward an approach to privacy in public: the challenges of information technology,” Ethics and Behavior, 7(3), pp. 207–219.
Nissenbaum, H. (1998), “Protecting Privacy in an Information Age: The Problem of Privacy in Public,” Law and Philosophy, 17, pp. 559-596.

Also
Exploring a ‘Deep Web’ That Google Can’t Grasp

On Twitter social and not so social experiences

Monday, April 19, 2010 Posted by Cecilia Loureiro-Koechlin 0 comments
I’ve been on Twitter for over a year now and I have to say my opinion of it has changed a bit (this is what I thought before: http://clk0.blogspot.com/2009/08/this-is-what-i-think-about-twitter.html). I joined when a friend of mine, Dr T, told me it was fun and that he found it extremely useful. I have to say that I find it useful too but perhaps not at the same level. Dr T’s experience has been quite different from mine.

Here I want to compare our two completely different Twitter experiences. Dr T’s has been extremely social, active and multi-dimensional whereas mine has been rather individual and uni-dimensional. Why has this happened? I guess that is because of our different original aims and motivations, and our behaviour.

On Twitter, as in any other SNS, you can create your own network of contacts, and that network will define a great part of your future interactions. Depending on the time you devote to it, you can build up a following list of people whose tweets you find interesting. Perhaps people who you think you would like to meet in real life (and I am not talking about celebrities)! Dr T was keen on meeting new people and being part of something on Twitter. A group, a community? I was just curious and wanted access to information (news, trivia, etc.) that I couldn’t (or didn’t have the time to) get by other means.

Dr T talks to people a lot. I just read tweets, broadcast a little and seldom address anyone. Talking means using the @ symbol to address one or more people, replying when they address you and following threads of conversation. Conversations are extremely important for building online relationships. In the online world, conversation and socialisation basically mean the same thing. Conversations define the social in Social Networking.

Dr T tweets from his bed, his kitchen, his office, the gym, pubs, etc. I tweet from the office. Dr T uses Twitter in conjunction with other tools, e.g. FourSquare, Tumblr, Facebook, etc. I cannot be bothered. He’s attended tweetups! He was also part of a public Twitter art display. I found that amusing. He has tweeted 5 to 6 times more than me. He spends time looking after his following list and adding more people. He is much more conscientious about the people he follows and the people who follow him. I do not have a strategy for following people. I don’t mind noise and I have never dedicated more than 2 minutes to checking my following and followers lists. I follow people with different interests. Actually I do not have a topic per se but just follow random interesting people. I find people when they are referred to in tweets and sometimes when they talk to me. Many of the people I follow do not tweet more than once a week. Maybe that is why I do not get much noise! Hmm, no... Some of them tweet 24/7 but I am not watching 24/7.

The above just shows how different our online behaviour has been, but the consequences of those behaviours have been even more dissimilar. Dr T has been able to build real friendships over Twitter. He has met some of these people and thinks they are cool. I, on the other hand, haven’t been able to move beyond my computer screen. Not that I haven't tried. I tried mobile tweeting but got frustrated when the client's provider started to charge. I know. I could've looked for another client, but to be honest, I couldn't be bothered.

Update: I got Twitter on my mobile again. It took me a bit of time and a new mobile :)