Description: |
In this position statement, we would like to contribute to the discussion about how to assess the quality and coverage of a model. In this context, we verbalize the need for interpretable linguistic features and for profiling textual variation. These needs are triggered by the necessity of gaining insight into the intricate patterns of human communication. Arguably, the functional and linguistic interpretation of these communication patterns helps keep human needs in the loop, thus demoting the myth of a powerful but dehumanized Artificial Intelligence. The desideratum to open up the “black boxes” of AI-based machines has become compelling. Recent research has focused on how to make sense of and popularize deep learning models, and has explored how to “probe” these models to understand how they learn. The science of BERTology is actively and diligently digging into BERT’s complex clockwork. However, much remains to be unearthed: “BERTology has clearly come a long way, but it is fair to say we still have more questions than answers about how BERT works”. It is therefore not surprising that add-on tools are being created to inspect pre-trained language models with the aim of casting some light on the “interpretation of pre-trained models in the context of downstream tasks and domain-specific data”.

Here we do not propose any new tool; rather, we try to formulate and exemplify the problem by taking the case of text simplification/text complexity. When we compare a standard text with an easy-to-read text (e.g. lättsvenska or Simple English), we wonder: where does text complexity lie? Can we pin it down? According to Simple English Wikipedia, “(s)imple English is similar to English, but it only uses basic words. We suggest that articles should use only the 1,000 most common and basic words in English. They should also use only simple grammar and shorter sentences.” This characterization of a simplified text does not provide much linguistic insight: what is meant by simple grammar? Linguistic insights are also missing from state-of-the-art NLP models for text simplification, since these models are essentially monolingual neural machine translation systems that take a standard text and “translate” it into a simplified type of (sub)language. We do not gain any linguistic understanding of what is being simplified and why; we just get the task done (which is, of course, good).

We know for sure that standard and easy-to-read texts differ in a number of ways, and we are able to use BERT to create classifiers that discriminate between the two varieties. But how are linguistic features re-shuffled to generate a simplified text from a standard one? With traditional statistical approaches, such as Biber’s MDA (based on factor analysis), we get an idea of how linguistic features co-occur and interact in different text types, and why. Since pre-trained language models are more powerful than traditional statistical models such as factor analysis, we would like to see more research on “disclosing the layers”, so that we can understand how different co-occurrences of linguistic features contribute to the makeup of specific varieties of texts, such as simplified vs. standard texts. Would it be possible to update the iconic example
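To make the “monolingual translation” view of simplification concrete, the sketch below shows the standard Hugging Face seq2seq generation pattern. It is an illustration only: the checkpoint name my-org/t5-text-simplification is a hypothetical placeholder for any encoder–decoder model fine-tuned on standard-to-simplified sentence pairs, not a real published model.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Hypothetical placeholder checkpoint: any seq2seq model fine-tuned on
    # standard -> simplified sentence pairs would slot in here.
    checkpoint = "my-org/t5-text-simplification"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    standard = "The committee deemed the proposal inadmissible on procedural grounds."
    inputs = tokenizer(standard, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The point of the sketch is precisely the one made above: nothing in this pipeline exposes which linguistic features were simplified, or why.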
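The claim that BERT can discriminate the two varieties can likewise be illustrated with an off-the-shelf sequence-classification setup. This is a minimal sketch, assuming a labelled corpus of standard and easy-to-read texts; the classification head below is randomly initialized and would need fine-tuning on such a corpus before its predictions mean anything.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # bert-base-cased is used purely for illustration; for lättsvenska a
    # Swedish checkpoint (e.g. KB/bert-base-swedish-cased) would be natural.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=2)  # 0 = standard, 1 = easy-to-read

    texts = ["Notwithstanding the aforementioned caveats, the motion carried.",
             "We talked about some problems before. The plan still passed."]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits  # untrained head: fine-tune before use
    print(logits.argmax(dim=-1))

Such a classifier gets the discrimination task done, but its decision surface offers no direct account of how the linguistic features were re-shuffled.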
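By contrast, the interpretability of Biber-style MDA comes from the fact that factor analysis over counts of hand-picked linguistic features yields inspectable loadings showing which features co-occur. Below is a minimal sketch with scikit-learn, under stated assumptions: toy random counts stand in for a real feature table, and varimax rotation is used because scikit-learn does not offer the promax rotation of Biber’s original work.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    # Toy data: 200 texts x 6 linguistic features (e.g. passives,
    # nominalizations, subordinate clauses, pronouns, type/token ratio, ...)
    X = rng.poisson(lam=3.0, size=(200, 6)).astype(float)
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize counts, as in MDA

    fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
    scores = fa.fit_transform(X)   # per-text scores on each dimension
    loadings = fa.components_      # which features load on which factor
    print(loadings.round(2))

Each row of the loadings matrix is a candidate “dimension” of variation that can be read off feature by feature; it is this kind of layer-by-layer legibility that we would like to see recovered from pre-trained language models.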