What is federation?
A federation protocol allows users to host their own online server or service while remaining automatically connected to every other server or service that complies with the protocol. Standard email is built on a federated protocol (SMTP), which means that anyone can host their own email server and exchange mail with every other email server in the world.
(from Bruce McVarish)
Federated networks have many advantages compared to centralized systems:
- Shared costs amongst server owners
- The network can scale at constant cost for every participant
- The system is resilient to single points of failure
- Information is shared by design and is not captured in silos
Despite all these advantages, most other communication services, such as instant messaging, have not adopted this strategy, mainly because federation leaves the initial service provider with no more control over the network and the shared data than any other node.
A federation for scientific resources
Although initially focused on person-to-person communication (see Mastodon), federation protocols may also be used to share content, such as images (see Pixelfed), audio files (see Funkwhale) and videos (see PeerTube). All these services (and others, see the fediverse) implement the same ActivityPub protocol, which was standardized by the W3C in January 2018 (see W3C). This means that users of one service may interact seamlessly with users of the other services.
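As an illustration of how a user on one service can be located from another, many fediverse services resolve a handle of the form @user@domain through WebFinger (RFC 7033) before exchanging ActivityPub messages. WebFinger is a companion convention and not part of ActivityPub itself, so its use here is an assumption about the participating servers. The function name and the example handle below are hypothetical. A minimal sketch:

```python
from urllib.parse import urlencode


def webfinger_url(handle: str) -> str:
    """Build the WebFinger lookup URL for a handle such as '@alice@mastodon.example'.

    Assumes the home server exposes the /.well-known/webfinger endpoint,
    as many fediverse services do.
    """
    user, sep, domain = handle.lstrip("@").partition("@")
    if not (user and sep and domain):
        raise ValueError(f"expected 'user@domain', got {handle!r}")
    # The account is passed as an 'acct:' URI in the 'resource' query parameter.
    query = urlencode({"resource": f"acct:{user}@{domain}"})
    return f"https://{domain}/.well-known/webfinger?{query}"


print(webfinger_url("@alice@mastodon.example"))
```

The domain part of the handle is all a remote server needs to find the user's home server, which is the same idea that lets any email server reach any other.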
The proposed OLKi platform will similarly implement the ActivityPub protocol to instantiate a federated network focused on:
- Sharing scientific, natural language-derived resources (datasets, ontologies, tools...);
- Communication between researchers working on these resources and topics (such communication currently takes place mainly on Twitter, Reddit...);
- Diffusion of user-oriented resources (videos, tutorials, MOOCs...) derived from these scientific resources to citizens;
- Interaction between researchers and citizens.
The last two features will be made possible by the ActivityPub protocol, which will enable interaction between the OLKi platforms and citizen-oriented social networks such as Mastodon and PeerTube.
To implement this network, we will seek to work with Framasoft, who develop PeerTube and therefore have extensive knowledge of how to implement the sharing of large files.
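To make the idea of sharing a resource over ActivityPub more concrete, the sketch below builds a "Create" activity that announces a dataset publicly, using the standard ActivityStreams 2.0 vocabulary. It is not the OLKi implementation: the function name, the olki.example domain, the actor and the file URL are all hypothetical, and the choice of the "Document" object type for a dataset is an assumption.

```python
import json

# Standard ActivityStreams 2.0 context and public-addressing collection.
AS_CONTEXT = "https://www.w3.org/ns/activitystreams"
AS_PUBLIC = "https://www.w3.org/ns/activitystreams#Public"


def make_create_activity(actor_url: str, name: str, file_url: str, media_type: str) -> dict:
    """Wrap a shared resource in a public 'Create' activity (illustrative only)."""
    resource = {
        "type": "Document",  # assumption: a dataset is modelled as a Document
        "name": name,
        "url": file_url,
        "mediaType": media_type,
        "attributedTo": actor_url,
        "to": [AS_PUBLIC],
    }
    return {
        "@context": AS_CONTEXT,
        "type": "Create",
        "actor": actor_url,
        "to": [AS_PUBLIC],
        "object": resource,
    }


# Hypothetical actor and resource on a hypothetical OLKi instance.
activity = make_create_activity(
    actor_url="https://olki.example/users/researcher1",
    name="Example annotated corpus",
    file_url="https://olki.example/files/corpus.zip",
    media_type="application/zip",
)
print(json.dumps(activity, indent=2))
```

Because the activity uses the shared vocabulary, any other ActivityPub server that receives it can at least display who published what, even if it knows nothing about scientific datasets.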
Why would this federated platform be more ethical than others?
The topology of the platform does not by itself guarantee an ethical usage of the data, but it helps to achieve this goal. First, it is important to clarify that we do believe language data acquisition and processing is valuable for the progress of research as well as for improving the quality of life of citizens, and sometimes even life itself. But just like any technology, it may be used wisely and responsibly, or it may be harmful.
The key assets of the OLKi platform with regard to ethics and privacy are:
- The terms of service of the OLKi instance, as well as those of all instances that we connect to, must include a section on privacy protection and ethical usage of the data; we will ban all instances that do not conform to these values, so that they cannot connect to our OLKi sub-network.
- All data will be public and the whole network will be transparent to every participant. This policy is not obviously an asset with regard to privacy (although see the next item), but it is with regard to equity: all actors of the network are equal concerning access to information, which we believe is a solid foundation on which to build ethical AI research based on mutual trust.
- Privacy leaks become threats when they can be linked to a real, physical identity. When only a virtual identity is concerned, privacy leaks are less dramatic, as it may be enough to delete the affected virtual identity and create a new one. Online anonymity is an often overlooked shield against such threats, and it should be the norm instead of the exception. Removing names, addresses, IPs, etc. from a corpus is not enough, as a person's signature may be built from their context and their traces; but the real threat arises, again, when this signature can be linked to a given real identity. So privacy protection starts by enabling and generalizing anonymous virtual identities. Unlike most other networks, ActivityPub federation does not impose any link between virtual and real identities; it thus fulfills a fundamental requirement for generalized online privacy protection that most other platforms violate. From the data processing point of view, only a minor part of the information is lost, as it is always possible to train models on anonymous user behaviors and data.
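The first key asset above, refusing to federate with instances whose terms of service lack privacy and ethics sections, can be sketched as a simple allowlist check on incoming activities. This is a hypothetical illustration, not the OLKi implementation: the InstancePolicy record, its fields, the function names and the example domains are all assumptions, and how the terms of service would actually be reviewed is left open.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class InstancePolicy:
    """Hypothetical summary of a remote instance's terms of service."""
    domain: str
    has_privacy_section: bool
    has_ethics_section: bool


def is_acceptable(policy: InstancePolicy) -> bool:
    # Both sections are required, as stated in the first key asset.
    return policy.has_privacy_section and policy.has_ethics_section


def build_allowlist(policies: list[InstancePolicy]) -> set[str]:
    """Keep only the domains whose terms of service conform."""
    return {p.domain for p in policies if is_acceptable(p)}


def accept_activity(activity: dict, allowlist: set[str]) -> bool:
    """Accept an incoming activity only if its actor lives on an allowed instance."""
    host = urlparse(activity.get("actor", "")).hostname
    return host is not None and host in allowlist


allowlist = build_allowlist([
    InstancePolicy("good.example", True, True),
    InstancePolicy("bad.example", True, False),
])
print(accept_activity({"actor": "https://good.example/users/a"}, allowlist))
print(accept_activity({"actor": "https://bad.example/users/b"}, allowlist))
```

An allowlist is used here rather than a blocklist because the text requires every connected instance to conform; unknown instances are therefore rejected by default.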