Dear author:
I ran into an issue with how to run this baseline. NeuMF replaces the inner product with an MLP, which means it needs far more resources during full-ranking evaluation. When I try to run it on the Yelp dataset with the same data-processing setup as NCL, it requires a very large amount of memory (over 100 GB), yet NCL itself was run on a 1080 Ti with about 11 GB. So I would like to know how to run NeuMF with the all-ranking (full-sort) evaluation strategy.
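For context, here is my own rough arithmetic (just my reading of the traceback, not anything from the RecBole docs): since NeuMF does not implement full_sort_predict, the trainer falls back to calling predict on every user-item pair in the eval batch, so the concatenated MLP input alone for one full-sort batch is roughly:

# my own estimate of the allocation that fails below (sketch, not RecBole code)
pairs = 4_096_000        # eval_batch_size = user-item pairs per full-sort batch
mlp_in = 128             # user_mlp_e (64) + item_mlp_e (64) concatenated
bytes_fp32 = 4
print(pairs * mlp_in * bytes_fp32 / 2**30)   # ~1.95 GiB, matching the failed allocation

and that is before the MF branch, the embeddings, and the hidden layers are counted.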
Error log:
command line args [--dataset yelp --model NeuMF --dropout_prob =0.0] will not be used in RecBole
19 Dec 23:29 INFO
General Hyper Parameters:
gpu_id = 1
use_gpu = True
seed = 2023
state = INFO
reproducibility = True
data_path = dataset/yelp
show_progress = False
save_dataset = False
save_dataloaders = False
benchmark_filename = None
Training Hyper Parameters:
checkpoint_dir = saved
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4
Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'group_by': 'user', 'order': 'RO', 'mode': 'full'}
metrics = ['Recall', 'NDCG', 'Precision', 'Hit']
topk = [5, 10, 15, 20, 25, 40, 50, 60, 100, 150, 200]
valid_metric = Recall@20
valid_metric_bigger = True
eval_batch_size = 4096000
metric_decimal_place = 4
Dataset Hyper Parameters:
field_separator =
seq_separator =
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
LABEL_FIELD = label
threshold = None
NEG_PREFIX = neg_
load_col = {'inter': ['user_id', 'item_id', 'rating']}
unload_col = None
unused_col = None
additional_feat_suffix = None
rm_dup_inter = None
val_interval = {'rating': '[3,inf)'}
filter_inter_by_user_or_item = True
user_inter_num_interval = [15,inf)
item_inter_num_interval = [15,inf)
alias_of_user_id = None
alias_of_item_id = None
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = None
normalize_field = None
normalize_all = None
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id
Other Hyper Parameters:
neg_sampling = {'uniform': 1}
repeatable = False
mf_embedding_size = 64
mlp_embedding_size = 64
mlp_hidden_size = [32, 16, 8]
dropout_prob = 0.1
mf_train = True
mlp_train = True
use_pretrain = False
mf_pretrain_path = None
mlp_pretrain_path = None
MODEL_TYPE = ModelType.GENERAL
eval_setting = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': 'full'}
embedding_size = 64
reg_weight = 0.0001
warm_up_step = -1
MODEL_INPUT_TYPE = InputType.POINTWISE
eval_type = EvaluatorType.RANKING
device = cuda
train_neg_sample_args = {'strategy': 'by', 'by': 1, 'distribution': 'uniform'}
eval_neg_sample_args = {'strategy': 'full', 'distribution': 'uniform'}
19 Dec 23:29 INFO yelp
The number of users: 45478
Average actions of users: 39.09151878971788
The number of items: 30709
Average actions of items: 57.89256871173635
The number of inters: 1777765
The sparsity of the dataset: 99.87270617988263%
Remain Fields: ['user_id', 'item_id', 'rating']
19 Dec 23:30 INFO [Training]: train_batch_size = [2048] negative sampling: [{'uniform': 1}]
19 Dec 23:30 INFO [Evaluation]: eval_batch_size = [4096000] eval_args: [{'split': {'RS': [0.8, 0.1, 0.1]}, 'group_by': 'user', 'order': 'RO', 'mode': 'full'}]
19 Dec 23:30 INFO NeuMF(
(user_mf_embedding): Embedding(45478, 64)
(item_mf_embedding): Embedding(30709, 64)
(user_mlp_embedding): Embedding(45478, 64)
(item_mlp_embedding): Embedding(30709, 64)
(mlp_layers): MLPLayers(
(mlp_layers): Sequential(
(0): Dropout(p=0.1, inplace=False)
(1): Linear(in_features=128, out_features=32, bias=True)
(2): ReLU()
(3): Dropout(p=0.1, inplace=False)
(4): Linear(in_features=32, out_features=16, bias=True)
(5): ReLU()
(6): Dropout(p=0.1, inplace=False)
(7): Linear(in_features=16, out_features=8, bias=True)
(8): ReLU()
)
)
(predict_layer): Linear(in_features=72, out_features=1, bias=True)
(sigmoid): Sigmoid()
(loss): BCELoss()
)
Trainable parameters: 9756801
19 Dec 23:31 INFO epoch 0 training [time: 60.39s, train loss: 783.6968]
Traceback (most recent call last):
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/recbole/trainer/trainer.py", line 376, in _full_sort_batch_eval
scores = self.model.full_sort_predict(interaction.to(self.device))
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/recbole/model/abstract_recommender.py", line 66, in full_sort_predict
raise NotImplementedError
NotImplementedError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 35, in
run_recbole(model=args.model, dataset=args.dataset, config_file_list=args.config_file_list)
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/recbole/quick_start/quick_start.py", line 60, in run_recbole
train_data, valid_data, saved=saved, show_progress=config['show_progress']
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/recbole/trainer/trainer.py", line 334, in fit
valid_score, valid_result = self._valid_epoch(valid_data, show_progress=show_progress)
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/recbole/trainer/trainer.py", line 196, in _valid_epoch
valid_result = self.evaluate(valid_data, load_best_model=False, show_progress=show_progress)
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/recbole/trainer/trainer.py", line 459, in evaluate
interaction, scores, positive_u, positive_i = eval_func(batched_data)
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/recbole/trainer/trainer.py", line 383, in _full_sort_batch_eval
scores = self.model.predict(new_inter)
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/recbole/model/general_recommender/neumf.py", line 133, in predict
return self.forward(user, item)
File "/home/xxx/anaconda3/envs/recbole/lib/python3.7/site-packages/recbole/model/general_recommender/neumf.py", line 111, in forward
mlp_output = self.mlp_layers(torch.cat((user_mlp_e, item_mlp_e), -1)) # [batch_size, layers[-1]]
RuntimeError: CUDA out of memory. Tried to allocate 1.95 GiB (GPU 0; 10.92 GiB total capacity; 5.09 GiB already allocated; 885.00 MiB free; 5.11 GiB reserved in total by PyTorch)
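One workaround I am considering (just a sketch on my side, not your code, and I have not verified it end to end) is to either shrink eval_batch_size in the yaml config, or give NeuMF a chunked full_sort_predict so full ranking never materializes all user-item pairs at once:

# Sketch: subclass RecBole's NeuMF and score items in chunks during full ranking.
# ITEM_CHUNK is my own knob; you would also have to plug this class into the run
# script yourself (e.g. construct it directly with the lower-level RecBole API),
# since run_recbole only accepts a model name string.
import torch
from recbole.model.general_recommender import NeuMF

class ChunkedNeuMF(NeuMF):
    ITEM_CHUNK = 4096  # items scored per step; tune to fit GPU memory

    def full_sort_predict(self, interaction):
        user = interaction[self.USER_ID]              # [n_users_in_batch]
        scores = []
        for start in range(0, self.n_items, self.ITEM_CHUNK):
            items = torch.arange(start, min(start + self.ITEM_CHUNK, self.n_items),
                                 device=user.device)
            # repeat each user over the current chunk of items
            u = user.unsqueeze(1).expand(-1, items.numel()).reshape(-1)
            i = items.unsqueeze(0).expand(user.numel(), -1).reshape(-1)
            scores.append(self.forward(u, i).view(user.numel(), -1))
        # the trainer reshapes this back to [n_users, n_items]
        return torch.cat(scores, dim=1).view(-1)

Is something like this the intended way to evaluate NeuMF under the all-ranking protocol, or did you use a different setting (e.g. a smaller eval_batch_size) in your experiments?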