Analysis of news articles with Natural Language Processing
In recent years, big data and machine learning have opened up considerable potential for analyzing large volumes of data in the media sector. The increasing digitization of media content, for example the availability of newspaper articles in online archives, has contributed to this trend. This Bachelor's thesis, written at the Bern University of Applied Sciences in 2023, aims to exploit this potential. All foreign news pages of the Bern daily newspaper "Der Bund" from 2006 to 2022, around 5000 issues in total, were analyzed with Natural Language Processing to identify their main topics. The aim was to create one topic map per publication year, with the thematic areas highlighted in color.

Technology stack
For the implementation, the text of the foreign news page PDFs, which had been obtained in an earlier project, was extracted paragraph by paragraph using Apache Tika. The data was then stored in a MongoDB database.
BERTopic, a pipeline based on Bidirectional Encoder Representations from Transformers (BERT), was used for topic extraction. BERTopic comprises the following steps:
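The paragraph-wise preparation can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes that the text extracted by Apache Tika separates paragraphs with blank lines, and the record fields (`issue_date`, `paragraph_no`, `text`) are a hypothetical schema chosen for this example. The extraction and database calls are left out so that the snippet stays self-contained.

```python
import re


def split_paragraphs(text):
    """Split extracted plain text into paragraphs at blank lines."""
    blocks = re.split(r"\n\s*\n", text)
    # Collapse line breaks inside a paragraph and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def to_records(text, issue_date):
    """Build one dictionary per paragraph (hypothetical schema)."""
    return [
        {"issue_date": issue_date, "paragraph_no": number, "text": paragraph}
        for number, paragraph in enumerate(split_paragraphs(text), start=1)
    ]


# Stand-in for the text of one extracted page.
sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph."
records = to_records(sample, "2006-01-03")
```

Each dictionary in `records` could then be written to a document store as one paragraph document.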
- Convert data into numerical values (embedding)
- Reduce dimensionality
- Cluster data
- Tokenize the text of each cluster
- Weight the terms to differentiate clusters from each other
- Optimize results if necessary
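The step that differentiates the clusters can be illustrated with a class-based TF-IDF weighting, the scheme BERTopic documents under the name c-TF-IDF. The following is a simplified sketch using only the standard library, not BERTopic's own implementation. The toy clusters and the helper names `c_tf_idf` and `top_terms` are assumptions made for this example.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Weight terms per cluster.

    clusters maps a cluster id to a list of tokenized documents.
    Weight = normalized term frequency in the cluster
             * log(1 + average tokens per cluster / overall term frequency).
    """
    # All documents of a cluster are merged into one bag of words.
    tf = {c: Counter(tok for doc in docs for tok in doc)
          for c, docs in clusters.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_tokens = sum(total.values()) / len(tf)
    weights = {}
    for c, counts in tf.items():
        size = sum(counts.values())
        weights[c] = {t: (f / size) * math.log(1 + avg_tokens / total[t])
                      for t, f in counts.items()}
    return weights


def top_terms(weights, n=3):
    """Return the n highest-weighted terms per cluster (ties broken by name)."""
    return {c: [t for t, _ in sorted(w.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
            for c, w in weights.items()}


# Toy clusters (invented for illustration).
clusters = {
    "a": [["eu", "summit", "brussels"], ["eu", "parliament", "vote"]],
    "b": [["election", "vote", "president"], ["president", "senate", "vote"]],
}
weights = c_tf_idf(clusters)
labels = top_terms(weights, n=1)
```

Terms that occur in many clusters, such as `vote` here, are weighted down, so each cluster is described by the terms that set it apart from the others.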
The resulting topics were stored in the Neo4j graph database. Using the graph visualization tool Gephi, the identified topics were combined into a topic map, which revealed notable changes in the topic landscape between 2006 and 2022.

Topic groups
The Louvain method evaluates the density of connections in a network and thereby identifies subgroups (communities). These thematic groups were then distinguished by color. In this way, 12 thematic groups were defined:
- European Union (light blue)
- Latin America and Vatican (turquoise)
- USA (light green)
- Russia (olive green)
- Eastern Europe and the Balkans (salmon)
- Politics of neighboring countries (purple)
- Asia (pink)
- Middle East (orange)
- Undefined topics (yellow, olive green and light pink)
- External topic Sport (lavender)
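The Louvain method searches for a grouping of the nodes that maximizes a quality score called modularity, which compares the share of edges inside each group with the share expected by chance. The sketch below only computes this score for a given grouping. It is not the Louvain search itself and not the Gephi implementation. It assumes an undirected, unweighted graph without self-loops, and the node names are invented.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> group label.
    Q = sum over groups of (internal edges / m) - (degree sum / 2m) ** 2
    """
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same group
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)
    for node, d in degree.items():
        degree_sum[community[node]] += d
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Two triangles joined by a single edge (toy topic network).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the toy network into its two triangles yields a higher score than treating it as a single group, which is the kind of grouping the Louvain method converges toward.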

Top 100 topics
The 100 most frequent topics are those with the most edges across all 17 years.
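Ranking topics by their number of edges can be sketched as a simple count over all yearly edge lists. How the edges were actually counted in the project is not stated. This example assumes that every edge occurrence in every year contributes to the count of both of its topics, and the data and the function name `top_topics_by_edges` are invented for illustration.

```python
from collections import Counter


def top_topics_by_edges(edges_by_year, n=100):
    """Return the n topics with the most edges, summed over all years."""
    edge_count = Counter()
    for edges in edges_by_year.values():
        for u, v in edges:
            edge_count[u] += 1
            edge_count[v] += 1
    return edge_count.most_common(n)


# Toy data: one edge list per publication year.
edges_by_year = {
    2006: [("EU", "Russia"), ("EU", "USA")],
    2007: [("EU", "Russia"), ("Russia", "China"), ("EU", "China")],
}
top_two = top_topics_by_edges(edges_by_year, n=2)
```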

Sub-topics
Sub-topics within individual thematic groups can also be identified using the Louvain method. This is illustrated here with the topic 'Asia', which demonstrates the capability of the Louvain method.

Five meaningful subgroups were identified:
- China/Hong Kong (brown)
- Tibet (orange)
- India (green)
- Korea (turquoise)
- Japan (purple)

All topic maps individually
















