OLKi


Open Language and Knowledge for Citizens (OLKi)

Vision

OLKi is first and foremost a vision of collaborative, interdisciplinary research around Artificial Intelligence that emerged from discussions among researchers in Nancy and Metz (France) at the end of 2017. This vision has now taken the form of an IMPACT project at Lorraine Université d'Excellence (LUE).

In less than 144 characters, this vision concerns the development and extraction of open knowledge from natural language data, both for scientists and citizens and with open, transparent, ethical and privacy-protecting approaches.

Components

The key components of our project are thus:

  • Advanced Artificial Intelligence research, including but not limited to the most recent deep learning methods, which will make it possible to analyze and generate language data in terms of lexical, syntactic, semantic, pragmatic, dialogic and discursive structures, and thus to extract useful knowledge and resources for scientists. Transparency and explainability of algorithms (even deep learning ones) will be an important aspect of this research.
  • Analysis of media and their usage by citizens, in order to link this specialized knowledge to resources that are useful for citizens, but also to feed AI research with these analyses and help AI researchers better understand the impact of their results on citizens, notably with regard to privacy and ethics.
  • Epistemology, which enables a critical view of, and reflection on, the scientific process that is adopted, its results and its impact.
  • Linguistics, because we consider language data only.

OLKi is thus focused on strongly interdisciplinary research, involving hard sciences (mathematics and computer science) as well as human and social sciences. The researchers at the origin of this reflection come from:

  • Archives Henri Poincaré: epistemology
  • ATILF: linguistics
  • CREM: media and their usage
  • IECL: mathematics
  • LORIA: computer science

Implementation

All results of research carried out within OLKi will be open-sourced and freely disseminated, as we value transparency, sharing and open research. These results will often take the form of resources, and the dissemination channels currently available for scientific resources do not sufficiently comply with our values: they often either rely on data silos or take control of the resources away from their original contributors.

One of the main technological outcomes of OLKi is thus a new platform specialized in the dissemination of scientific resources, which fulfills all of our requirements thanks to a novel federation paradigm and standard, ActivityPub. This platform is described next.


What is federation?

A federation protocol allows users to host their own online server/service while being automatically connected to all other servers/services that comply with this protocol. Standard email is backed by a federated protocol (SMTP), which means that anyone can host their own email server and be connected to all other email servers in the world.
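
The idea can be illustrated with a toy model. The sketch below is not SMTP or ActivityPub: the `Server` and `Network` classes, the `lab.example` and `home.example` domains and the user names are all hypothetical, and the `Network` lookup table merely stands in for DNS. It only shows the core property of federation: each server hosts its own users, yet a message addressed to `user@domain` reaches any server that has joined the network.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Server:
    """A toy server that hosts mailboxes for its own users only."""

    domain: str
    inboxes: dict[str, list[str]] = field(default_factory=dict)

    def add_user(self, name: str) -> None:
        self.inboxes.setdefault(name, [])

    def receive(self, user: str, message: str) -> bool:
        """Accept a message for a local user; reject unknown users."""
        if user not in self.inboxes:
            return False
        self.inboxes[user].append(message)
        return True


class Network:
    """Hypothetical stand-in for DNS: maps a domain to the server hosted there."""

    def __init__(self) -> None:
        self.servers: dict[str, Server] = {}

    def join(self, server: Server) -> None:
        self.servers[server.domain] = server

    def deliver(self, address: str, message: str) -> bool:
        """Route a message to the server named after the last '@'."""
        user, _, domain = address.rpartition("@")
        server = self.servers.get(domain)
        return server is not None and server.receive(user, message)


network = Network()
lab = Server("lab.example")
home = Server("home.example")
for srv in (lab, home):
    network.join(srv)
lab.add_user("alice")
home.add_user("bob")

# alice's server and bob's server are run by different people,
# yet the message still reaches bob's inbox.
delivered = network.deliver("bob@home.example", "Hello from alice@lab.example")
print(delivered, home.inboxes["bob"])
```

No server in this model has a privileged role: adding a third server only requires calling `join`, which is the property the paragraph above attributes to email.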

Federated vs. centralized (from Bruce McVarish)

Federated networks have many advantages compared to centralized systems:

  • Shared costs amongst server owners
  • The network can scale at constant cost for every participant
  • The system is resilient to single points of failure
  • Information is shared by design and is not captured in silos
  • etc.

Despite all these advantages, most other communication services, such as instant messaging, do not adopt this strategy, mainly because the initial service provider has no more control over the federated network and the shared data than any other node in the network.

A federation for scientific resources

Although initially focused on person-to-person communication (see mastodon), federation protocols may also be used to share content, such as images (see pixelfed), audio files (see funkwhale) and videos (see peertube). All these services (and others, see the fediverse) actually implement the same ActivityPub protocol, which was standardized by the W3C in January 2018 (see W3C). This means that users of one service may interact seamlessly with users of the other services.

The proposed OLKi platform will similarly implement the ActivityPub protocol to instantiate a federated network focused on:

  • Sharing scientific, natural language-derived resources (datasets, ontologies, tools...);
  • Communication between researchers working on these resources and topics (such communication currently takes place mainly on Twitter, reddit...);
  • Diffusion of user-oriented resources (videos, tutorials, MOOCs...) derived from these scientific resources to citizens;
  • Interaction between researchers and citizens.
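
As a rough illustration of what sharing a resource over ActivityPub could look like, the sketch below builds an ActivityStreams 2.0 `Create` activity in Python. The `@context` URI, the public addressing URI and the `Create` and `Document` types come from the W3C vocabulary; everything else is an assumption made for the example: the `announce_resource` helper, the `olki.example` URLs, and the choice of `Document` to represent a dataset. The source does not say which vocabulary the OLKi platform will actually use.

```python
import json

AS_CONTEXT = "https://www.w3.org/ns/activitystreams"
# Special collection meaning "addressed to everyone".
AS_PUBLIC = "https://www.w3.org/ns/activitystreams#Public"


def announce_resource(actor: str, name: str, url: str, summary: str) -> dict:
    """Wrap a resource description in a public `Create` activity.

    Hypothetical helper: representing a dataset as a `Document`
    is an assumption, not the OLKi platform's documented behavior.
    """
    return {
        "@context": AS_CONTEXT,
        "type": "Create",
        "actor": actor,
        "to": [AS_PUBLIC],
        "object": {
            "type": "Document",
            "name": name,
            "summary": summary,
            "url": url,
            "attributedTo": actor,
        },
    }


activity = announce_resource(
    actor="https://olki.example/users/alice",  # hypothetical actor URL
    name="Annotated dialogue corpus",
    url="https://olki.example/resources/dialogue-corpus",  # hypothetical URL
    summary="A small corpus annotated with dialogue acts.",
)
print(json.dumps(activity, indent=2))
```

Because other fediverse services read the same vocabulary, an activity of this shape is the kind of message that could, in principle, be delivered to followers on a different service.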

The last two features will be made possible by the ActivityPub protocol, which will enable interaction between the OLKi platforms and citizen-oriented social networks such as mastodon and peertube. In order to implement this network, we will study and take inspiration from the projects of Framasoft, which develops Peertube and as such has extensive knowledge of implementing the sharing of large files.

Why would this federated platform be more ethical than others?

The topology of the platform does not by itself guarantee ethical usage of the data, of course, but it helps to achieve this goal. First, it is important to clarify that we do believe that language data acquisition and processing are valuable for the progress of research as well as for improving the quality of life of citizens. But just like any technology, they may be used either wisely or in harmful ways.

The key assets of the OLKi platform with regard to ethics and privacy are:

  • The terms of service of the OLKi instance, as well as those of all instances that we connect to, must include a section on privacy protection and ethical usage of the data; we will ban all instances that do not conform to these values so that they cannot be connected to our OLKi sub-network.
  • All data will be public and the whole network will be transparent to every participant. This policy is not obviously an asset with regard to privacy (although see the next item), but it is with regard to equity: all actors of the network are equal concerning access to information, which we believe is a solid foundation on which to build ethical AI research based on mutual trust.
  • Privacy leaks become threats when they can be linked to a real, physical identity. When only a virtual identity is concerned, privacy leaks are less dramatic, as it may be enough to just delete the affected virtual identity and create a new one. Online anonymity is an often overlooked shield against such threats, and it should be the norm rather than the exception. Removing names, addresses, IPs, etc. from a corpus is not enough, as a person's signature may be built from their context and their traces, but the real threat is, again, when this signature can be linked to a given real identity. So privacy protection starts by enabling and generalizing anonymous virtual identities. Unlike most other networks, ActivityPub federation does not impose any link between virtual and real identities; it thus fulfills a fundamental requirement for generalized online privacy protection that is violated by most other platforms. From the data processing point of view, only a minor part of the information is lost, as it is always possible to train models on anonymous user behaviors and data.
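
The first item in the list above, connecting only to instances whose terms of service cover privacy and ethics, can be pictured as a simple filter. The sketch below is purely illustrative: the source does not say how terms of service would be reviewed, so the `Instance` record, the section titles in `REQUIRED_SECTIONS` and the example domains are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical section titles that an instance's terms of service must contain.
REQUIRED_SECTIONS = frozenset({"privacy protection", "ethical data usage"})


@dataclass(frozen=True)
class Instance:
    domain: str
    tos_sections: frozenset  # section titles declared by the instance (assumed)


def may_federate(instance: Instance) -> bool:
    """True when the instance's terms cover every required section."""
    return REQUIRED_SECTIONS <= instance.tos_sections


def split_instances(instances):
    """Return (allowed, banned) domain lists, preserving input order."""
    allowed = [i.domain for i in instances if may_federate(i)]
    banned = [i.domain for i in instances if not may_federate(i)]
    return allowed, banned


candidates = [
    Instance(
        "olki.example",
        frozenset({"privacy protection", "ethical data usage", "moderation"}),
    ),
    Instance("tracker.example", frozenset({"moderation"})),
]
allowed, banned = split_instances(candidates)
print("allowed:", allowed)
print("banned:", banned)
```

In practice such a decision would probably involve human review of each instance's terms; the subset test here only makes the rule itself explicit.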