The Jupyter notebook contained in this repository is meant to demonstrate how URLs can be classified using the homepage2vec library for Python.
There are different ways to run the Jupyter notebook (i.e., the .ipynb file) contained in this repo. To test the URL classification with homepage2vec, you can, e.g., clone or fork this repo and use GitHub Codespaces to run the notebook. Alternatively, you can use Google Colab and upload and run the notebook there (see this StackOverflow post for instructions on how to do that or simply click this link and sign in with your Google account). Note: The notebook currently does not work with Binder (possibly due to restrictions on the ports used for accessing the content of the websites to be classified).
The folder urls in this repo contains two .txt files with example URLs to classify.
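As a rough illustration of how such a .txt file could be fed to the classifier, here is a minimal sketch. It assumes the files hold one URL per line, and the helper names read_urls and classify_urls are hypothetical and not taken from the notebook. The fetch_website and predict calls follow the usage shown in the homepage2vec README at the time of writing, so check them against the version you have installed.

```python
from pathlib import Path


def read_urls(path):
    """Read URLs from a text file (assumed format: one URL per line).

    Blank lines and lines starting with '#' are skipped, and duplicates
    are dropped while keeping the original order.
    """
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if not url or url.startswith("#"):
            continue
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def classify_urls(urls, classifier):
    """Classify each URL with an object offering fetch_website() and predict().

    Returns a dict mapping each URL to its category scores, or to None if
    fetching or predicting raised an exception for that URL.
    """
    results = {}
    for url in urls:
        try:
            website = classifier.fetch_website(url)
            # predict() is assumed to return (scores, embedding)
            scores, _embedding = classifier.predict(website)
            results[url] = scores
        except Exception as exc:  # keep going if a single site fails
            print(f"Could not classify {url}: {exc}")
            results[url] = None
    return results


# Hypothetical usage (requires homepage2vec and network access):
#   from homepage2vec.model import WebsiteClassifier
#   urls = read_urls("urls/<one of the .txt files>")
#   results = classify_urls(urls, WebsiteClassifier())
```

Passing the classifier in as an argument keeps the loop independent of how the model is created, which also makes it easy to try the helpers with a stand-in object before running the actual (slower) classification.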
IMPORTANT: Depending on your subscription/plan for services like GitHub Codespaces or Google Colab, these options might not be the best choice for classifying a large number of URLs as the classification process can take quite some (computing) time.
If you want to use the functions/code provided here to classify a large number of URLs for your research, you might want to copy/clone the notebook and run it (or the code it contains) on your local machine or your own server. The easiest way of using and editing Jupyter notebooks on your machine is probably Anaconda. Note: If you do not use git and GitHub, you can get a .zip file containing everything in this repo by clicking on the green "Code" button on the repo website and then choosing "Download ZIP".
If you use Homepage2Vec for your research, make sure to cite the associated conference paper:
Lugeon, S., Piccardi, T., & West, R. (2022). Language-Agnostic Website Embedding and Classification. arXiv preprint arXiv:2201.03677.
The homepage2vec library is based on the dataset from curlie.org.
Note: If you work with web tracking data and (can) also use R, the notebook in this repo pairs nicely with the webtrackR package (which is still a work in progress at the moment).