On data aggregation: Benefits and Issues

Wednesday, April 21, 2010 Posted by Cecilia Loureiro-Koechlin
My work at the moment has to do with digital repositories, registries and data aggregation. I work in a university. I am a project analyst and have the privilege of witnessing state-of-the-art technology development and, most importantly, users’ reactions. Developers around me use semantic web technologies to create systems that harvest and update data about research from a variety of sources, store them and record their provenance, and preserve, give access to and display these data. The result is a registry that mirrors these data and can be put to a variety of uses.

Data about research is data that describes research activities and researchers. Most of these data come from sources that are already publicly available: departmental and project websites. These sources are in the hundreds (lots of URLs to remember!). They are dispersed and disconnected. The point of collecting all these data is to have them in one place and to build connections between data objects which originally were not connected. Data objects are researchers, projects, grants, etc. For example, we can find researcher A’s biography in website “one,” a list of his publications in website “two,” his name in three project websites, his name in grants in a research council website and a list of research interests in a group website. We can put everything together and investigate whether all these data belong to the same person. If so, we can present a much more complete picture of this researcher.
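To make the researcher A example a bit more concrete, here is a minimal sketch in Python. It is not how our registry is built (that uses semantic web technologies); the `Record` class, the `normalise` and `aggregate` functions and the sample values are all hypothetical and only illustrate the idea of grouping harvested records while keeping track of where each one came from.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One harvested piece of data (hypothetical structure)."""
    source: str  # where the data was found, kept as provenance
    name: str    # researcher name as written on that source
    field: str   # kind of information, e.g. "biography"
    value: str   # the harvested content (placeholder here)


def normalise(name: str) -> str:
    """Crude matching key: lower-case, drop full stops, collapse spaces."""
    return " ".join(name.lower().replace(".", " ").split())


def aggregate(records):
    """Group records that MIGHT describe the same person.

    Matching on a normalised name only proposes candidates. Two different
    people can share a name, so each group still needs to be checked
    before it is shown as one profile.
    """
    profiles = defaultdict(list)
    for record in records:
        profiles[normalise(record.name)].append(record)
    return dict(profiles)


# Made-up sample data mirroring the example in the text.
records = [
    Record("website one", "A. Researcher", "biography", "..."),
    Record("website two", "a researcher", "publications", "..."),
    Record("group website", "A Researcher", "research interests", "..."),
    Record("project website", "B. Someone", "project", "..."),
]

profiles = aggregate(records)
for key, items in profiles.items():
    sources = sorted({r.source for r in items})
    print(key, "->", sources)
```

Because every record keeps its `source`, a candidate profile can always be traced back to the websites it was assembled from, which matters when someone asks where a piece of data came from or wants it corrected.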

The benefits of data aggregation are obvious, at least to us. We can create improved pictures of researchers and their research activities at individual, departmental, university and field levels. Having all these connections can facilitate the discovery of research opportunities or trends. We can identify connections between researchers who do not know each other, for example if they have similar research interests. We can build connections between research groups or identify research islands. Instead of having to navigate through a huge number of websites (via Google), users can access this information in one place. It can also help the inexperienced (e.g. students) to find information.

To give you an idea of how this happens: users can see these data via a registry explorer (a search engine), or use them via APIs to create websites, widgets, etc. All of these have been developed in the office. The list of benefits is much longer than this, but I think the above gives you a good idea.

With all these fantastic benefits one could think no one could resist data aggregation. Everyone would prefer to access aggregated data rather than the individual, disconnected sources. Everyone would like to be aggregated so they can have a nice online profile. Well, that is not entirely true. While some people (geeks!) love the idea, some people think data aggregation raises many issues and brings with it many risks. Risks which they think are not worth taking.

--> This bit is a more general discussion
Having read some general literature on this topic I can summarise the main issues here. Data aggregation:
  • Threatens individuals’ privacy: one aspect of privacy is controlling information about oneself. I decide what and where to disclose (or not disclose) my information. Do we have the right to take data from sources which are not ours, store them, aggregate them and display them, even though these data are already public? Even if we publish data in a public place, we have the right to their privacy. It doesn’t mean we want everyone to read it.
  • Allows systems of surveillance: people may choose to disclose some of their private data in bits in different places, but by aggregating their data we are not only exposing those bits but creating a more comprehensive picture of people's activities and interests. Aggregated data can play a big role in big-brother monitoring of people. That is invasion of privacy, isn’t it?
  • Can lead to security problems: since data aggregation makes it easier to identify people – people can be identified through bits of anonymous data put together – it can help identity theft and other kinds of crimes.
  • Can mislead people: aggregated data is not always comprehensible or true. How reliable are data aggregators and their sources? How do we know if the data presented is correct and belongs to the same person? One can get their profile mixed with someone else’s and that can lead to serious misunderstandings.
  • Does not always follow the original intentions of the creators. Can we use data as we wish, for uses which are different from those originally intended by their owners? Would this be ethical? How can we reinforce principles like the use limitation principle and the purpose specification principle? (van Wel and Royakkers, 2004)
  • Can violate contextual integrity, in other words can de-contextualise data, changing its original meaning: the process of collecting and aggregating data involves moving information from its original (appropriate) context to different ones which are not necessarily appropriate. Some people will find this morally offensive (Nissenbaum, 1998).
The above are general issues and apply mostly to online data aggregators, which are spreading rapidly over the web (e.g., http://www.nodalbits.com/bits/spokeo-latest-personal-data-aggregator-exposing-data-privacy-fears/). These aggregators are hungry machines: they pick up everything they can (with or without permission) and offer their data (?) to a variety of businesses and users.

Nissenbaum (1997) warns us about two misleading (but common) assumptions:
  • Erroneous assumption 1: There is a realm of public information about persons to which no privacy norms apply.
  • Erroneous assumption 2: An aggregation of information does not violate privacy if its parts, taken individually, do not.
--> end of general discussion

These issues can be partly related to the work we are doing with information about research.

While we aggregate data in a much smaller, limited and controlled universe, we are facing some challenges as well. We use research data, not personal data, from official, public websites in the university, and we always ask our contributors for consent. If someone does not want to be in the registry we do not take their data. Simple.

Although we are not dealing with data of an intimate, personal nature, we are exposing the work of researchers. Whereas some researchers would like the publicity, others would consider this information private - to themselves or a small circle of colleagues - at least at the early stages of their work.

In some way we are creating a system of surveillance where others can monitor performance. Again, not everyone likes to be watched.

There are other things people have raised, things like:
  • How complete and accurate a picture can we build of their department or university if some people choose not to contribute and if we do not have control of the sources? How useful can an incomplete registry be?
  • How many errors or gaps can be identified or made more evident once data are aggregated? Can they be corrected?
  • Sharing: Do we need to share what we are doing? We do not want everyone to see what we are doing. (Research-data/Research activity privacy?)
  • Duplication: Is it going to replace our official websites? Why do you duplicate them?
  • Coverage: I am only interested in my field of research and I know where I can find relevant information. Aggregating research data is not useful.
Interesting, isn’t it? There are more issues of course, but again I hope the above gives you a good idea.

My work in the coming months is to try to clarify these issues with a set of users and to identify ways to address them (solve or soften them). I can see this will involve three areas of work: one, improving the collection and visualisation of data; two, educating users and publicising our services in better ways; and three, listening to what our contributors say about their aggregated data. Yes, software development is not only about coding but also about finding out what people need and how they will react to what we do.

You can read:
Ethics of data mining and aggregation
Data aggregation: Actually a threat?
van Wel, L. and Royakkers, L. (2004), “Ethical issues in web data mining,” Ethics and Information Technology, 6, pp. 129–140.
Nissenbaum, H. (1997), “Toward an approach to privacy in public: the challenges of information technology,” Ethics and Behavior, 7(3), pp. 207–219.
Nissenbaum, H. (1998), “Protecting Privacy in an Information Age: The Problem of Privacy in Public,” Law and Philosophy, 17, pp. 559–596.

Also
Exploring a ‘Deep Web’ That Google Can’t Grasp
  1. Excellent article. I work with research data systems and work collaboratively with other institutions to aggregate research data (researcher, grant, publication and participant data). Aside from basic privacy issues like gaining access to the data in the first place, I haven't thought much about the more complex issues with aggregation that you describe so well. Recording the context of aggregation and controlling the contexts in which aggregation takes place seems like a crucial issue. I would love to talk with you more about these topics. I'm going to read a few more of your articles and comment hopefully :)
    Thanks!
    Dusan
