Biodiversity In Online News Media From 2015 To 2025, Dataset

Verified Source
European Commission Joint Research Centre (JRC)


Dataset Description

This dataset comprises nearly 1 million news articles referring to biodiversity, published between 1 January 2015 and 30 June 2025. It served as the primary input for the report “Biodiversity in mainstream media and unverified sources from 2015 to 2025.” The data are organised by month.
Due to data sensitivity considerations, we are unable to provide a separate list of unverified online sources. For this reason, all sources, both mainstream and unverified, are included together within one dataset. The identification of unverified sources is based on assessments conducted by the European External Action Service (EEAS), as outlined in the “3rd EEAS Report on Foreign Information Manipulation and Interference Threats”, as well as on evaluations by independent external experts working in the field of disinformation. Some of these sources have been reviewed by reputable fact-checking organisations such as butac, konspiratori, factcheck and mediascan, or appear in reports such as VIGINUM’s Portal Kombat report, which analyses a structured and coordinated pro-Russian propaganda network.
For each article, the following fields are provided: Link, Publication Date, Title, Cluster Title, Cluster Keyphrases, Sentiment, Framing Dimensions, and Persuasion Techniques. By tracking coverage over time and applying multilingual clustering combined with large language model (LLM)–based cluster titles and keyphrases, analysts identified key topics and events related to biodiversity and examined how they are represented in online news media. The clusters are derived from a multilingual clustering pipeline using the LaBSE sentence embedding model, PyNNDescent for approximate neighbourhood graphs, and LeidenAlg for community detection. Each cluster represents a story or narrative prominent in a given month. For each cluster, 50 random article excerpts (first 350 characters) were sampled, and GPT-4o was used to generate a cluster title and cluster keyphrases.
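The clustering pipeline described above (multilingual embeddings, an approximate neighbour graph, then community detection) can be sketched roughly as follows. This is a simplified, self-contained illustration, not the JRC implementation: random vectors stand in for LaBSE article embeddings, a brute-force cosine k-nearest-neighbour graph stands in for PyNNDescent’s approximate graph, and connected components stand in for LeidenAlg’s community detection.

```python
import numpy as np

def knn_graph(embeddings: np.ndarray, k: int = 3) -> list:
    """Build a symmetrised k-nearest-neighbour graph from cosine similarity.

    Brute-force stand-in for the approximate graph PyNNDescent would build.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    neighbours = [set() for _ in range(len(embeddings))]
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-k:]:  # k most similar articles
            neighbours[i].add(int(j))
            neighbours[int(j)].add(i)   # make the graph undirected
    return neighbours

def components(neighbours: list) -> list:
    """Label graph components — a crude stand-in for Leiden communities."""
    labels = [-1] * len(neighbours)
    current = 0
    for start in range(len(neighbours)):
        if labels[start] != -1:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if labels[node] == -1:
                labels[node] = current
                stack.extend(n for n in neighbours[node] if labels[n] == -1)
        current += 1
    return labels

# Two well-separated blobs standing in for embeddings of two news stories.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(loc=+5.0, size=(20, 8)),
                 rng.normal(loc=-5.0, size=(20, 8))])
labels = components(knn_graph(emb, k=3))
```

Because cross-blob cosine similarity is strongly negative, the two groups of articles end up in disjoint clusters, mirroring how distinct monthly stories separate in the real pipeline.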
To calculate the sentiment value, we use the XLM-RLnews-8 model, a state-of-the-art model specifically designed for document-level sentiment analysis across multiple languages. Based on XLM-RoBERTa-Large, this model has been fine-tuned for sentiment analysis using the Unified Multilingual Sentiment Analysis Benchmark (UMSAB) dataset. The sentiment classes are computed on the English translation of the headlines. More details on the model development are available in: Di Nuovo, E., Cartier, E., Bertrand De Longueville, ‘Meet XLM-RLnews-8: Not Just Another Sentiment Analysis Model’. In Natural Language Processing and Information Systems, 28th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Turin, Italy, June 25–27, 2024, Proceedings (pp. 1). Springer Science and Business Media Deutschland GmbH, 2024.
The “framing dimensions” and “persuasion techniques” fields contain specific frames and rhetorical strategies identified within each article; an article may contain multiple labels of each type. These labels were produced using in-house machine-learning classifiers. Framing refers to the perspective under which an issue or a piece of news is presented. We consider 14 frames: (1) Economic, (2) Capacity and resources, (3) Morality, (4) Fairness and equality, (5) Legality, constitutionality and jurisprudence, (6) Policy prescription and evaluation, (7) Crime and punishment, (8) Security and defence, (9) Health and safety, (10) Quality of life, (11) Cultural identity, (12) Public opinion, (13) Political, (14) External regulation and reputation. Persuasion techniques refer to writing strategies that aim to influence the reader. In this report we consider the following sub-selection: (1) Appeal to Authority, (2) Appeal to Fear-Prejudice, (3) Appeal to Hypocrisy, (4) Appeal to Time, (5) Appeal to Values, (6) Causal Oversimplification, (7) Consequential Oversimplification, (8) Conversation Killer, (9) Doubt, (10) Exaggeration-Minimisation, (11) False Dilemma-No Choice, (12) Flag Waving, (13) Guilt by Association, (14) Loaded Language, (15) Name Calling-Labelling, (16) Questioning the Reputation, (17) Repetition, (18) Slogan. For more information see the JRC Technical Report: Piskorski, J., Stefanovitch, N., Bausier, V. A., Faggiani, N., Linge, J., Kharazi, S., Nakov, P. (2023). News categorization, framing and persuasion techniques: Annotation guidelines. European Commission, Ispra, JRC132862.
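Since framing dimensions and persuasion techniques are multi-label fields attached to dated articles, a typical analysis is to count label frequencies per month. The sketch below assumes a hypothetical in-memory record layout mirroring the documented field names; the dataset’s actual serialisation (file format, multi-label delimiters) is not specified here and would need to be checked against the released files.

```python
from collections import Counter

# Hypothetical records using the documented field names; the real
# on-disk format and label delimiters are an assumption.
articles = [
    {"Publication Date": "2023-05-12", "Sentiment": "negative",
     "Framing Dimensions": ["Economic", "Health and safety"],
     "Persuasion Techniques": ["Loaded Language", "Doubt"]},
    {"Publication Date": "2023-05-20", "Sentiment": "neutral",
     "Framing Dimensions": ["Policy prescription and evaluation"],
     "Persuasion Techniques": ["Appeal to Authority", "Loaded Language"]},
]

# Tally how often each persuasion technique appears in each month.
per_month = {}
for article in articles:
    month = article["Publication Date"][:7]  # "YYYY-MM"
    counts = per_month.setdefault(month, Counter())
    counts.update(article["Persuasion Techniques"])
```

The same pattern applies unchanged to the Framing Dimensions field, or to Sentiment if each class is wrapped in a one-element list.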
Publisher name: Joint Research Centre
Publisher URL: https://commission.europa.eu/about/departments-and-executive-agencies/joint-research-centre
Last updated: 2026-04-02T11:26:53Z

