The SATEK project has been funded by the DFG since 2024 and brings together researchers at Goethe University Frankfurt, the Leibniz Institute for the German Language in Mannheim and the Saxon Academy of Sciences in Leipzig. The project addresses a gap in current research: the thematic indexing of very large text corpora. Through close integration of computer science and corpus linguistics, specialised classifications are developed for highly heterogeneous text types and widely varying document sizes.
As a testbed, the project focuses on DeReKo, the German Reference Corpus provided by the project partner, the Leibniz Institute for the German Language. Because the DeReKo texts cover such a broad range of content, no suitable ontologies exist for thematic indexing. In addition, there is a general lack of training and test data explicitly tagged with thematic metadata, which significantly limits the use of supervised machine learning methods. Using DeReKo as an example, the project therefore aims to implement and evaluate, for the first time, a thematic classification for big corpus data that is efficient, robust, open source, dynamic (no static and thus rapidly outdated category inventory) and fully reusable.
The main objectives are:
- Ensuring openness of content: In addition to standard classification schemes such as DDC/UDC, open content classifications are included. These comprise the Wikipedia category systems as well as the Wikidata classification system, which is not tied to any single language. Hierarchical classifiers are made trainable for dynamic application scenarios.
- Natural language pre-processing: To mitigate the conflict between processing quality and efficiency, the project investigates how alternative text pre-processing routines and frameworks affect output quality and processing time.
- Reference corpus indexing: DeReKo is indexed thematically at both the individual text level and the text segment level using the above-mentioned classification systems.
- Semantic search: An interface for differentiated semantic searches at the text (segment) level is being implemented. All of the planned classification systems can be used or combined for this purpose.
The Saxon Academy of Sciences (SAW) contributes new document indexing methods based on topic modelling and active learning, and develops a web service for automatic thematic indexing that includes a “user-in-the-loop” feedback loop.
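To illustrate the general idea behind active learning with a user in the loop, the following minimal sketch ranks documents by how uncertain a classifier is about their topic and selects the most uncertain ones for human review. It is not the project's implementation: the function names, the document IDs, the topic labels and the probabilities are all hypothetical, and margin-based uncertainty sampling is only one common selection strategy.

```python
from typing import Dict, List


def margin(probs: Dict[str, float]) -> float:
    """Return the gap between the two most probable topic labels.

    A small margin means the classifier can hardly decide between
    its two best candidates, i.e. the prediction is uncertain.
    """
    top = sorted(probs.values(), reverse=True)
    if len(top) < 2:
        return top[0] if top else 0.0
    return top[0] - top[1]


def select_for_review(predictions: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Pick the k documents with the smallest margin for human feedback.

    Ties are broken by document ID so that the result is deterministic.
    """
    ranked = sorted(predictions, key=lambda doc_id: (margin(predictions[doc_id]), doc_id))
    return ranked[:k]


# Hypothetical classifier output: topic probabilities per document.
predictions = {
    "doc-a": {"Politics": 0.90, "Sports": 0.05, "Science": 0.05},
    "doc-b": {"Politics": 0.40, "Sports": 0.35, "Science": 0.25},
    "doc-c": {"Politics": 0.50, "Sports": 0.10, "Science": 0.40},
}

# The two least certain documents are queued for a human annotator.
review_queue = select_for_review(predictions, k=2)
print(review_queue)
```

In such a setup, the labels confirmed or corrected by the user would be added to the training data and the classifier retrained, so that each feedback round concentrates human effort on the cases where it is most informative.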
Funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG), project number 531750631
Contact details
Sächsische Akademie der Wissenschaften zu Leipzig
Karl-Tauchnitz-Str. 1
04107 Leipzig
Tel.: +49 341 697642-33
heyer@saw-leipzig.de
Original project title
(Semi-)Automatisierte thematische Textklassifikation als Basis für korpuslinguistische Mehrwertdienste
