This repository contains the code and data supporting the working paper "Semantic Clustering of Italian Political News on Facebook: Comparing text-embedding-3-large and UmBERTo Embeddings using HDBSCAN and K-means".
This study compares the performance of OpenAI's text-embedding-3-large model against the BERT-based UmBERTo model for clustering Italian political news content. We utilize two distinct datasets of political news stories circulated on Facebook before the 2018 and 2022 Italian elections.
- /: R and Python scripts for data processing, embedding generation, clustering, and analysis
- rawdata/: Titles and descriptions of 35,795 links circulated on Facebook prior to the 2018 and 2022 Italian elections. Sample of link pairs coded for thematic coherence by human experts, in JSONL
- output/: Empty output folders
- output/: Empty data folders
- LICENSE: License information for the project
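Since the raw data is described as JSONL (one JSON object per line), a small loader may help when exploring it outside the provided scripts. The following is a minimal sketch, not the repository's own code: the helper names `read_jsonl` and `link_text` are hypothetical, and the `title` and `description` keys are assumptions that should be checked against the actual files in rawdata/.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the offending line so malformed records are easy to find
                raise ValueError(f"{path}, line {line_number}: {err}") from err


def link_text(record, title_key="title", description_key="description"):
    """Join a link's title and description into one string.

    The key names are assumptions; adjust them to the real field names.
    Missing or null fields are skipped.
    """
    parts = [record.get(title_key) or "", record.get(description_key) or ""]
    return " ".join(part.strip() for part in parts if part.strip())
```

A typical use would be `texts = [link_text(r) for r in read_jsonl(some_file)]`, where `some_file` is the path to one of the JSONL files; the resulting strings are the kind of input an embedding model would receive.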
For questions or feedback, please open an issue in this repository or contact Fabio Giglietto.