Or how to move from a metadata language to a culture of collective intelligence…
THE METADATA ISSUE
Metadata are the data that organize the data. Data are like books in a library and metadata are like the library card index and catalog: their function is to identify the books in order to store and find them better. Metadata are less about describing things exhaustively (it is not about making maps at the same scale as the territory…) than about providing reference points from which users can find what they are looking for, with the help of algorithms. All information systems and software applications organize information through metadata.
We can distinguish between…
- material metadata, such as a file’s format, creation date, author, license, etc.
- semantic metadata, which deal with the content of a document or a set of data (what it is about) as well as its practical dimension (what the data are used for, by whom, in what circumstances, etc.).
The main focus here is on semantic metadata. A semantic metadata system can be as simple as a vocabulary. At a higher level of complexity it can be a hierarchical classification or taxonomy. At the most complex level, it is an « ontology », i.e. the modelling of a domain of knowledge or practice, which may contain several taxonomies with transverse relationships, including causal relationships and automatic reasoning capabilities.
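To make the three levels of complexity concrete, here is a minimal sketch in plain Python. The terms, the `influences` relation and the helper names `ancestors` and `related` are illustrative assumptions, not part of any standard model or of IEML.

```python
# Level 1: a vocabulary is a flat set of agreed terms.
vocabulary = {"health", "education", "housing"}

# Level 2: a taxonomy maps each term to its broader parent (None at the root).
taxonomy = {
    "public service": None,
    "health": "public service",
    "hospital": "health",
    "education": "public service",
    "housing": "public service",
}

def ancestors(term, parents):
    """Return the chain of broader terms from `term` up to the root."""
    chain = []
    parent = parents.get(term)
    while parent is not None:
        chain.append(parent)
        parent = parents.get(parent)
    return chain

# Level 3: an ontology fragment adds typed, transverse relations
# (here a hypothetical causal relation) on top of the taxonomy.
relations = [
    ("education", "influences", "health"),
    ("housing", "influences", "health"),
]

def related(subject, predicate, triples):
    """Return the sorted objects linked to `subject` by `predicate`."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

print(ancestors("hospital", taxonomy))
print(related("education", "influences", relations))
```

The vocabulary only names things, the taxonomy lets a program generalize from « hospital » to « health », and the relations let it follow links that cut across the hierarchy.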
Semantic metadata are an essential part of artificial intelligence devices:
- they are used as skeletons for knowledge graphs – or knowledge bases – implemented by big tech companies (Google, Facebook, Amazon, Microsoft, Apple…) and, increasingly, by large and medium-sized companies,
- they are used – under the name of « labels » – to categorize the training datasets for deep learning models.
Because they structure contemporary knowledge, whose medium is digital, metadata systems carry considerable scientific, cultural and political stakes.
One of the goals of my company INTLEKT Metadata Inc. is to establish IEML (Information Economy MetaLanguage) as a standard for the expression of semantic metadata systems. What is the contemporary landscape in this area?
THE SEMANTIC METADATA LANDSCAPE TODAY
The system of standard formats and « languages » proposed by the World Wide Web Consortium – W3C – (XML, RDF, OWL, SPARQL) to achieve the « Semantic Web » has been around since the late 20th century. It has not really caught on, least of all in companies in general and big tech in particular, which use less cumbersome and less complex formats, such as « property graphs ». Moreover, manual or semi-manual categorization of data is often replaced by statistical approaches to automated indexing (NLP, deep learning…), which bypass the need to design metadata systems. The W3C system of standards deals with the *file formats and programs* that handle semantic metadata but *not with the semantics itself*, i.e. the categories, concepts, properties, events, relations, etc. that are always expressed in natural languages, with all the ambiguities, multiplicities and incompatibilities this implies.
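The difference between the two families of formats can be sketched with plain Python data structures, without any RDF or graph database library. The identifiers (`doc:42`, `person:7`), the `ex:` names and the helper `triples_to_properties` are hypothetical; only `rdf:type` is a real RDF term.

```python
# RDF style: every statement is a (subject, predicate, object) triple.
triples = [
    ("doc:42", "rdf:type", "ex:Report"),
    ("doc:42", "ex:topic", "ex:PublicHealth"),
    ("doc:42", "ex:author", "person:7"),
    ("person:7", "ex:name", "A. Author"),
]

# Property graph style: nodes and edges carry their own key/value properties.
nodes = {
    "doc:42": {"label": "Report", "topic": "PublicHealth"},
    "person:7": {"label": "Person", "name": "A. Author"},
}
edges = [
    ("person:7", "AUTHORED", "doc:42", {"role": "lead"}),
]

def triples_to_properties(statements, subject):
    """Gather the predicate/object pairs of one subject into a dict."""
    return {p: o for s, p, o in statements if s == subject}

print(triples_to_properties(triples, "doc:42"))
```

In both encodings the structure is machine-readable, but nothing tells a program what `ex:topic` or `PublicHealth` means: that remains a natural-language label, which is the gap described above.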
On top of this system of standard formats, there are standard models to deal with the actual semantic content of concepts and their relationships. For example schema.org for web sites, CIDOC-CRM for the cultural domain, etc. There are standard models for many domains, from finance to medicine. The problem is, there are often several competing models for a domain and the models themselves are hypercomplex, to the point that even the specialists of a model master only a small part of it. Again, these models are expressed in natural languages, with the problems that this implies… and most often in English only.
Specific metadata systems
Taxonomies, ontologies and other metadata systems implemented in real applications for organizing data sets are mostly partial uses of standard models and standard formats. Users submit – with varying degrees of success – to these layers of standards in the hope that their data and applications will become the happy subjects of a realm of semantic interoperability. But their hopes are disappointed. The ideal of the decentralized intelligent Web of the late 1990s has given way to search engine optimization (SEO) more or less aligned with Google’s (secret!) knowledge graph. We have to admit, almost a quarter of a century after its launch, that the W3C’s Semantic Web has not kept its promises.
To achieve semantic interoperability, i.e. fluid communication between knowledge bases, information system managers submit to rigid models and formats. But because of the multitude of formats, models and their disparate applications, not to mention language differences, they do not achieve the expected gain. Moreover, producing a good metadata system is expensive, because it requires a multidisciplinary team including: a project manager, one or more specialists in the domain of use, a specialist in formal modelling of the taxonomy or ontology type (cognitive engineering) who is able to find their way through the labyrinth of standard models, and finally a computer engineer specializing in semantic metadata formats. Some people combine several of these skills, but they are rare. Finally, a recent survey shows that W3C « linked data » tools (including RDF, OWL and SPARQL) are too complex for web developers and end-users.
HOW CAN IEML SOLVE PROBLEMS IN THE WORLD OF SEMANTIC METADATA?
IEML in a nutshell
IEML – patented by INTLEKT Metadata – is neither a taxonomy, nor a universal ontology, nor a model, nor a format: it is a *language* or *meta-ontology* composed of (1) a few thousand semantic primitives organized in paradigms and (2) a fully regular grammar.
Unique features of the IEML language
IEML is « agnostic » with respect to formats, natural languages and hierarchical relationships between concepts. IEML makes it possible to build and share any concept, concept hierarchy or relationship between concepts. Therefore IEML does not flatten expressive possibilities. However, IEML does provide semantic interoperability, i.e. the ability to merge, exchange, recombine, connect and translate, almost automatically, metadata systems and the knowledge bases organized by these metadata. IEML thus reconciles maximum originality, complexity or cognitive simplicity on the one hand and interoperability or communication on the other, contrary to the contemporary situation where interoperability is « paid for » with a restriction of expressive possibilities.
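As a rough illustration of why a shared, language-independent layer makes merging almost automatic, consider two catalogs whose labels are in different natural languages. The concept keys `C1`, `C2`, `C3` are hypothetical placeholders and are not IEML expressions; the function `merge_by_concept` is likewise an assumption made for this sketch.

```python
# Two independently built catalogs: natural-language label -> concept key.
# The keys are placeholders standing in for a shared semantic layer.
catalog_en = {"public health": "C1", "smart city": "C2"}
catalog_fr = {"santé publique": "C1", "apprentissage collaboratif": "C3"}

def merge_by_concept(*catalogs):
    """Group labels from all catalogs under the concept key they share."""
    merged = {}
    for catalog in catalogs:
        for label, concept in catalog.items():
            merged.setdefault(concept, set()).add(label)
    return merged

merged = merge_by_concept(catalog_en, catalog_fr)
for concept in sorted(merged):
    print(concept, sorted(merged[concept]))
```

Comparing the labels as strings would never match « public health » with « santé publique »; the match only becomes mechanical once both catalogs point to the same concept, whatever their language or hierarchy.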
Unique features of the IEML editor
Another advantage: unlike the main contemporary metadata editing tools (Smartlogic Semaphore, PoolParty, Synaptica, TopBraid Composer), the IEML editor designed by INTLEKT will be intuitive (visual interface based on tables and graphs) and collaborative. It is not designed for specialists in RDF and OWL (the standard formats), like the editors mentioned above, but for application domain experts. A method accompanying the tool will help specialists to formalize their domains in IEML. The software will automatically import and export the metadata in the standard formats chosen by the user. Thus, the IEML editor will reduce the complexity and cost of creating semantic metadata systems.
Market for metadata management and editing tools
It is easy to see that, as the amount of data produced continues to grow, along with the urge to extract usable knowledge from it, there is an increasing need to create and maintain good metadata systems. The market for semantic metadata system editing and management tools is now worth $2 billion and could reach (by a very conservative estimate) $16 billion by 2026. This projection aggregates:
- data from the semantic industry itself (companies that create metadata systems for their customers),
- semantic annotation tools for training datasets for machine learning used in particular by data scientists,
- management of their internal metadata systems by big tech companies.
INTLEKT GOALS FOR THE NEXT 5-10 YEARS
We want IEML to become an open-source standard for semantic metadata around 2025. The IEML standard should be supported, maintained and developed by a non-profit foundation. This foundation will moderate a community dedicated to the collaborative editing of IEML metadata systems and provide a public knowledge base of IEML-categorized data. The foundation will create a socio-technical ecosystem conducive to the growth of collective intelligence.
The private company
INTLEKT will continue to maintain the collaborative editing tool and to design custom semantic knowledge bases for solvent clients. We will also implement a blockchain-based marketplace – or exchange system – for IEML-indexed private data. The IEML-indexed knowledge bases will be interoperable on the parallel planes of data analysis, automatic reasoning, and neural model training.
However, before reaching this point, INTLEKT must demonstrate the effectiveness of IEML through several real-world use cases.
INTLEKT’S MARKET IN THE 2-5 YEAR HORIZON
Interviews with numerous potential customers have enabled us to define our market for the coming years. Let’s define the relevant areas by elimination and successive approximations.
IEML is not relevant for modeling purely mathematical, physical or biological objects. The exact sciences already have formal languages and recognized classifications. On the other hand, IEML is relevant for objects from the humanities and social sciences or for interactions between objects from the exact sciences and objects from the humanities, such as technology, health, the environment or urban phenomena.
For the time being, we will not exhaust ourselves in translating all existing metadata models into IEML: they are very numerous, sometimes contradictory, and rarely used in full. Many users of these models are content to select a small, useful subpart of them and will not invest their time and money in a new technology without necessity. For example, many SEO (Search Engine Optimization) companies extract a useful subset from schema.org’s classes (sponsored by Google) and Wikidata’s entities (because they are trusted by Google) and have no need for additional semantic technologies. Other examples: the gallery, museum, library or archive sectors have to submit to rigid professional standards with limited possibilities of innovation. In short, sectors that are content to use an existing standard model are not part of our short-term market. We will not fight losing battles. In the long term, however, we envision a collaborative platform where the voluntary translation of current standard models into IEML can take place.
Let’s also eliminate the e-commerce market for the moment. This sector does use category systems to identify broad domains (real estate, cars, appliances, toys, books, etc.), but the multitude of goods and services within these broad categories is captured by automatic natural language processing or machine learning systems, rather than by refined metadata systems. We do not believe that IEML will be adopted in the near term in online commerce.
This leaves non-standard domains – which do not have ready-made models – or multi-standard domains – which must build hybrid models or crosswalks – and for which statistical approaches are useful… but not sufficient. Think for example of collaborative learning, public health, smart cities, software documentation, analysis of complex corpora from several disciplines, etc.
Modelling and visualization of complex systems
Within the non-standard domains, we have identified the following needs that are not met by the semantic technologies in use today:
- The modelling of complex human systems, where several heterogeneous « logics » meet, i.e. groups obeying various types of rules. This includes data produced by processes of deliberation, argumentation, negotiation and techno-social interaction.
- The modelling of causal systems, including circular and intertwined causalities.
- The modelling of dynamic systems in which the objects or actants are transformed over time. These dynamics can be of various types: evolution, ontogeny, successive hybridizations, etc.
- The interactive 2D or 3D visualization and exploration of semantic structures in huge corpora, preferably in a memorable form, i.e. easy to remember.
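The need to handle circular causalities can be illustrated with a small sketch: a causal model is stored as a directed graph and a depth-first search reports one feedback loop if any exists. The example causes and the function name `find_cycle` are assumptions chosen for illustration, not a description of how IEML or the INTLEKT tools work.

```python
# A toy causal model: each key is a cause, each value lists its direct effects.
causes = {
    "poverty": ["poor health"],
    "poor health": ["low income"],
    "low income": ["poverty"],
    "pollution": ["poor health"],
}

def find_cycle(graph):
    """Return one causal loop as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for effect in graph.get(node, []):
            state = color.get(effect, WHITE)
            if state == GREY:
                # The effect is already on the current path: a loop is closed.
                return path[path.index(effect):] + [effect]
            if state == WHITE:
                found = visit(effect)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None

print(find_cycle(causes))
```

A strictly hierarchical classification cannot represent such a loop at all, which is one reason these domains call for richer relational models than taxonomies.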
In the coming years, INTLEKT intends to model complex dynamic systems involving human participation in a causal manner and to provide access to a memorable sensory-motor exploration of these systems.
IEML being a language, everything that can be defined, described and explained in natural language can be modelled in a formal way in IEML, thus providing a qualitative framework for quantitative measurements and calculations. It will be possible to perform automatic reasoning from rules, prediction and decision support, but the main contribution of IEML will be an increased capacity for analysis, synthesis, mutual understanding and coordination in action of user communities. Semantic interoperability should be the by-product of a cognitive gain, not a costly obligation.
THE NEXT SIX MONTHS
The IEML language already exists. Its development has been funded with one million dollars in an academic setting. We also have a prototype of the editor. We now need to move to a professional version of the editor in order to meet the market needs identified in the previous section. For this we are seeking a private « seed » investment of about US$250,000, which will be used mainly for the development of a collaborative editing platform with the appropriate interface. Investors are welcome.