Homepage2Vec
Language-Agnostic Website Embedding and Classification based on Curlie labels https://arxiv.org/pdf/2201.03677.pdf
Homepage2Vec is a pre-trained model that supports the classification and embedding of websites starting from their homepage.
Figure: Left: Two-dimensional t-SNE projection of the embeddings of 5K random samples from the test set; colors represent the 14 classes. Right: t-SNE projection of some popular websites, showing that the embedding vectors effectively capture website topics.
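A projection like the one in the figure can be reproduced with a standard t-SNE implementation. The sketch below uses scikit-learn and matplotlib and assumes you have already collected embedding vectors and their class labels; the file names and variables are illustrative placeholders, not artifacts shipped with the library.

# Minimal sketch: project Homepage2Vec embeddings to 2D with t-SNE.
# `embeddings.npy` and `labels.npy` are hypothetical files you would build
# yourself by running the model over a set of websites.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.load("embeddings.npy")                     # shape (N, d)
labels = np.load("labels.npy", allow_pickle=True)          # N class names

points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)

for cls in np.unique(labels):
    mask = labels == cls
    plt.scatter(points[mask, 0], points[mask, 1], s=5, label=cls)
plt.legend(markerscale=3, fontsize=6)
plt.show()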
Curated Curlie Dataset
We release the full training dataset obtained from Curlie. The dataset includes websites that were online in April 2021 and whose URL was recognized as a homepage; it contains the original labels, the labels aligned to English, and the fetched HTML pages.
Get it here: https://doi.org/10.6084/m9.figshare.16621669
Getting started with the library
Installation:
Step 1: install the library with pip.
pip install homepage2vec
Usage:
import logging
from homepage2vec.model import WebsiteClassifier

# Optional: log what the library is doing (model download, page fetching, ...)
logging.getLogger().setLevel(logging.DEBUG)

# Load the pre-trained Homepage2Vec model
model = WebsiteClassifier()

# Fetch the homepage, then compute class probabilities and the embedding
website = model.fetch_website('epfl.ch')
scores, embeddings = model.predict(website)

print("Classes probabilities:", scores)
print("Embedding:", embeddings)
Result:
Classes probabilities: {'Arts': 0.3674524128437042, 'Business': 0.0720655769109726,
'Computers': 0.03488553315401077, 'Games': 7.529282356699696e-06,
'Health': 0.02021787129342556, 'Home': 0.0005890956381335855,
'Kids_and_Teens': 0.3113572597503662, 'News': 0.0079914266243577,
'Recreation': 0.00835705827921629, 'Reference': 0.931416392326355,
'Science': 0.959597110748291, 'Shopping': 0.0010162043618038297,
'Society': 0.23374591767787933, 'Sports': 0.00014659571752417833}
Embedding: [-4.596550941467285, 1.0690114498138428, 2.1633379459381104,
0.1665923148393631, -4.605356216430664, -2.894961357116699, 0.5615459084510803,
1.6420538425445557, -1.918184757232666, 1.227172613143921, 0.4358430504798889,
...]
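The scores are returned as a plain dictionary mapping each of the 14 classes to a probability, so extracting the top predictions is ordinary Python. The threshold below is an illustrative choice, not one recommended by the library:

# Sort classes by predicted probability and keep the most likely ones.
top_classes = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(top_classes[:3])                                # Science, Reference, Arts for epfl.ch
likely = [c for c, p in scores.items() if p > 0.5]    # 0.5 is an arbitrary example threshold
print("Likely classes:", likely)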
The library automatically downloads the pre-trained models (Homepage2Vec and XLM-R) on first use.
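Because the embedding is returned as a plain list of floats, it can be reused directly for downstream tasks such as comparing websites. A minimal sketch with NumPy, using the same API calls as the example above (ethz.ch is just an illustrative second site):

import numpy as np
from homepage2vec.model import WebsiteClassifier

model = WebsiteClassifier()

# Embed two homepages and compare them with cosine similarity.
_, emb_a = model.predict(model.fetch_website('epfl.ch'))
_, emb_b = model.predict(model.fetch_website('ethz.ch'))

a, b = np.array(emb_a), np.array(emb_b)
similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print("Cosine similarity:", similarity)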
Using visual features
If you wish to include visual features in the prediction, Homepage2Vec needs to take a screenshot of the website. This requires a working installation of Selenium and the Chrome browser. Please note that, as reported in the reference paper, the performance improvement is limited.
Install the Selenium Chrome web driver (ChromeDriver) and add its folder to the system $PATH variable. You also need a local copy of the Chrome browser (see Getting started).
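To verify that the Selenium and Chrome setup works before enabling visual features, you can take a screenshot directly with Selenium. This snippet is independent of homepage2vec and only checks your local driver installation:

# Quick sanity check for the Selenium + Chrome setup (not part of homepage2vec).
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")            # run without opening a browser window
driver = webdriver.Chrome(options=options)    # requires ChromeDriver on $PATH
driver.get("https://epfl.ch")
driver.save_screenshot("homepage.png")
driver.quit()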
Getting involved
We invite contributions to Homepage2Vec! Please open a pull request if you have any suggestions.
Original publication
Language-Agnostic Website Embedding and Classification
Sylvain Lugeon, Tiziano Piccardi, Robert West
Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset with more than 1M websites in 92 languages with relative labels collected from Curlie, the largest multilingual crowdsourced Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model for classifying and embedding websites based on their homepage in a language-agnostic way. Homepage2Vec, thanks to its feature set (textual content, metadata tags, and visual attributes) and recent progress in natural language representation, is language-independent by design and generates embedding-based representations. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages. Feature analysis shows that a small subset of efficiently computable features suffices to achieve high performance even with limited computational resources.
https://arxiv.org/pdf/2201.03677.pdf
Dataset License
Creative Commons Attribution 3.0 Unported License - Curlie
Learn more about how to contribute: https://curlie.org/docs/en/about.html