Biodiversity In Online News Media From 2015 To 2025, Dataset

Verified Source
European Commission Joint Research Centre (JRC)


Dataset Description

This dataset comprises nearly 1 million news articles referring to biodiversity, published between 1 January 2015 and 30 June 2025. It served as the primary input for the report “Biodiversity in mainstream media and unverified sources from 2015 to 2025.” The data are organised by month.
Due to data sensitivity considerations, we are unable to provide a separate list of unverified online sources. For this reason, all sources, both mainstream and unverified, are included together within one dataset. The identification of unverified sources is based on assessments conducted by the European External Action Service (EEAS), as outlined in the “3rd EEAS Report on Foreign Information Manipulation and Interference Threats”, as well as on evaluations by independent external experts working in the field of disinformation. Some of these sources have been reviewed by reputable fact-checking organisations such as butac, konspiratori, factcheck and mediascan, or appear in reports such as VIGINUM’s Portal Kombat report, which analyses a structured and coordinated pro-Russian propaganda network.
For each article, the following fields are provided: Link, Publication Date, Title, Cluster Title, Cluster Keyphrases, Sentiment, Framing Dimensions, and Persuasion Techniques. By tracking coverage over time and applying multilingual clustering combined with large language model (LLM)–based cluster titles and keyphrases, analysts identified key topics and events related to biodiversity and examined how they are represented in online news media. The clusters are derived from a multilingual clustering pipeline using the LaBSE sentence embedding model, PyNNDescent for approximate neighbourhood graphs, and LeidenAlg for community detection. Each cluster represents a story or narrative prominent in a given month. For each cluster, 50 random article excerpts (first 350 characters) were sampled, and GPT-4o was used to generate a cluster title and cluster keyphrases.
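The clustering pipeline described above (multilingual embeddings, an approximate neighbour graph, then community detection) can be sketched roughly as follows. This is a simplified, self-contained illustration, not the JRC implementation: random vectors stand in for LaBSE article embeddings, a brute-force cosine k-nearest-neighbour graph stands in for PyNNDescent’s approximate graph, and connected components stand in for LeidenAlg’s community detection.

```python
import numpy as np

def knn_graph(embeddings: np.ndarray, k: int = 3) -> list:
    """Build a symmetrised k-nearest-neighbour graph from cosine similarity.

    Brute-force stand-in for the approximate graph PyNNDescent would build.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    neighbours = [set() for _ in range(len(embeddings))]
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-k:]:  # k most similar articles
            neighbours[i].add(int(j))
            neighbours[int(j)].add(i)   # make the graph undirected
    return neighbours

def components(neighbours: list) -> list:
    """Label graph components — a crude stand-in for Leiden communities."""
    labels = [-1] * len(neighbours)
    current = 0
    for start in range(len(neighbours)):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] == -1:
                labels[node] = current
                stack.extend(n for n in neighbours[node] if labels[n] == -1)
        current += 1
    return labels

# Two well-separated blobs standing in for embeddings of two news stories.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(loc=+5.0, size=(20, 8)),
                 rng.normal(loc=-5.0, size=(20, 8))])
labels = components(knn_graph(emb, k=3))
```

Because cross-blob cosine similarity is strongly negative, the two groups of articles end up in disjoint clusters, mirroring how distinct monthly stories separate in the real pipeline.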
To calculate the sentiment value, we use the XLM-RLnews-8 model, a state-of-the-art model specifically designed for document-level sentiment analysis across multiple languages. Based on XLM-RoBERTa-Large, this model has been fine-tuned for sentiment analysis using the Unified Multilingual Sentiment Analysis Benchmark (UMSAB) dataset. The sentiment classes are computed on the English translation of the headlines. More details on the model development are available in: Di Nuovo, E., Cartier, E., Bertrand De Longueville, ‘Meet XLM-RLnews-8: Not Just Another Sentiment Analysis Model’. In Natural Language Processing and Information Systems, 28th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Turin, Italy, June 25–27, 2024, Proceedings (pp. 1). Springer Science and Business Media Deutschland GmbH, 2024.
The “framing dimensions” and “persuasion techniques” fields contain specific frames and rhetorical strategies identified within each article; an article may contain multiple labels of each type. These labels were produced using in-house machine-learning classifiers. Framing refers to the perspective under which an issue or a piece of news is presented. We consider 14 frames: (1) Economic, (2) Capacity and resources, (3) Morality, (4) Fairness and equality, (5) Legality, constitutionality and jurisprudence, (6) Policy prescription and evaluation, (7) Crime and punishment, (8) Security and defence, (9) Health and safety, (10) Quality of life, (11) Cultural identity, (12) Public opinion, (13) Political, (14) External regulation and reputation. Persuasion techniques refer to writing strategies that aim to influence the reader. In this report we consider the following sub-selection: (1) Appeal to Authority, (2) Appeal to Fear-Prejudice, (3) Appeal to Hypocrisy, (4) Appeal to Time, (5) Appeal to Values, (6) Causal Oversimplification, (7) Consequential Oversimplification, (8) Conversation Killer, (9) Doubt, (10) Exaggeration-Minimisation, (11) False Dilemma-No Choice, (12) Flag Waving, (13) Guilt by Association, (14) Loaded Language, (15) Name Calling-Labelling, (16) Questioning the Reputation, (17) Repetition, (18) Slogan. For more information see the JRC Technical Report: Piskorski, J., Stefanovitch, N., Bausier, V. A., Faggiani, N., Linge, J., Kharazi, S., Nakov, P. (2023). News categorization, framing and persuasion techniques: Annotation guidelines. European Commission, Ispra, JRC132862.
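Since framing dimensions and persuasion techniques are multi-label fields attached to dated articles, a typical analysis is to count label frequencies per month. The sketch below assumes a hypothetical in-memory record layout mirroring the documented field names; the dataset’s actual serialisation (file format, multi-label delimiters) is not specified here and would need to be checked against the released files.

```python
from collections import Counter

# Hypothetical records using the documented field names; the real
# on-disk format and label delimiters are an assumption.
articles = [
    {"Publication Date": "2023-05-12", "Sentiment": "negative",
     "Framing Dimensions": ["Economic", "Health and safety"],
     "Persuasion Techniques": ["Loaded Language", "Doubt"]},
    {"Publication Date": "2023-05-20", "Sentiment": "neutral",
     "Framing Dimensions": ["Policy prescription and evaluation"],
     "Persuasion Techniques": ["Appeal to Authority", "Loaded Language"]},
]

# Tally how often each persuasion technique appears in each month.
per_month = {}
for article in articles:
    month = article["Publication Date"][:7]  # "YYYY-MM"
    counts = per_month.setdefault(month, Counter())
    counts.update(article["Persuasion Techniques"])
```

The same pattern applies unchanged to the Framing Dimensions field, or to Sentiment if each class is wrapped in a one-element list.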
Publisher name: Joint Research Centre
Publisher URL: https://commission.europa.eu/about/departments-and-executive-agencies/joint-research-centre
Last updated: 2026-04-02T11:26:53Z

