Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

Overview

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge

Correlation Explanation (CorEx) is a topic model that yields rich topics that are maximally informative about a set of documents. The advantage of using CorEx versus other topic models is that it can be easily run as an unsupervised, semi-supervised, or hierarchical topic model depending on a user's needs. For semi-supervision, CorEx allows a user to integrate their domain knowledge via "anchor words." This integration is flexible and allows the user to guide the topic model in the direction of those words. This allows for creative strategies that promote topic representation, separability, and aspects. More generally, this implementation of CorEx is good for clustering any sparse binary data.

If you use this code, please cite the following:

Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. "Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge." Transactions of the Association for Computational Linguistics (TACL), 2017.

Getting Started

Install

Python code for the CorEx topic model can be installed via pip:

pip install corextopic

Example Notebook

Full details on how to retrieve and interpret output from the CorEx topic model are given in the example notebook. Below we describe how to get CorEx running as an unsupervised, semi-supervised, or hierarchical topic model.

Running the CorEx Topic Model

Given a doc-word matrix, the CorEx topic model is easy to run. The code follows the scikit-learn fit/transform conventions.

import numpy as np
import scipy.sparse as ss
from corextopic import corextopic as ct

# Define a matrix where rows are samples (docs) and columns are features (words)
X = np.array([[0,0,0,1,1],
              [1,1,1,0,0],
              [1,1,1,1,1]], dtype=int)
# Sparse matrices are also supported
X = ss.csr_matrix(X)
# Word labels for each column can be provided to the model
words = ['dog', 'cat', 'fish', 'apple', 'orange']
# Document labels for each row can be provided
docs = ['fruit doc', 'animal doc', 'mixed doc']

# Train the CorEx topic model
topic_model = ct.Corex(n_hidden=2)  # Define the number of latent (hidden) topics to use.
topic_model.fit(X, words=words, docs=docs)

Once the model is trained, we can get topics using the get_topics() function.

topics = topic_model.get_topics()
for topic_n,topic in enumerate(topics):
    # w: word, mi: mutual information, s: sign
    topic = [(w,mi,s) if s > 0 else ('~'+w,mi,s) for w,mi,s in topic]
    # Unpack the info about the topic
    words,mis,signs = zip(*topic)    
    # Print topic
    topic_str = str(topic_n+1)+': '+', '.join(words)
    print(topic_str)

Similarly, the most probable documents for each topic can be accessed through the get_top_docs() function.

top_docs = topic_model.get_top_docs()
for topic_n, topic_docs in enumerate(top_docs):
    docs,probs = zip(*topic_docs)
    topic_str = str(topic_n+1)+': '+', '.join(docs)
    print(topic_str)

Summary files and visualizations can be output from vis_topic.py.

from corextopic import vis_topic as vt
vt.vis_rep(topic_model, column_label=words, prefix='topic-model-example')

Choosing the Number of Topics

Each topic explains a certain portion of the total correlation (TC). We can access the topic TCs through the tcs attribute, as well as the overall TC (the sum of the topic TCs) through the tc attribute. To determine how many topics we should use, we can look at the distribution of tcs. If adding additional topics contributes little to the overall TC, then the topics already explain a large portion of the information in the documents, and we likely do not need more topics in our topic model. So, as a general rule of thumb, continue adding topics until the overall TC plateaus.
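As a sketch of this rule of thumb, suppose we have the following per-topic TCs (the numbers below are made up for illustration; in practice they come from topic_model.tcs):

```python
import numpy as np

# Hypothetical per-topic TCs, as would be stored in topic_model.tcs
# after fitting; these values are invented for illustration.
tcs = np.array([1.20, 0.85, 0.40, 0.10, 0.04, 0.02])

total_tc = tcs.sum()        # corresponds to topic_model.tc
fractions = tcs / total_tc  # share of the overall TC per topic

# One simple plateau heuristic: keep only topics that each explain
# at least 5% of the overall TC.
n_topics = int(np.sum(fractions >= 0.05))
print(n_topics)
```

Here the later topics contribute a small fraction of the overall TC, so the heuristic suggests a smaller model; the 5% cutoff is arbitrary and should be adapted to your data.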

We can also restart the CorEx topic model from several different initializations. This allows CorEx to explore different parts of the topic space and potentially find more informative topics. If we want to follow a strictly quantitative approach to choosing which of the multiple topic model runs we should use, then we can choose the topic model that has the highest TC (the one that explains the most information about the documents).
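The selection step can be sketched as follows; the FittedModel stand-in and its TC values are hypothetical, and in practice each entry would be a Corex model fit from a different seed:

```python
from collections import namedtuple

# Stand-in for fitted CorEx models; in practice these would come from
# something like ct.Corex(n_hidden=50, seed=s).fit(X) for several seeds s,
# after which each model exposes its overall TC via the tc attribute.
FittedModel = namedtuple("FittedModel", ["seed", "tc"])
runs = [FittedModel(seed=s, tc=tc)
        for s, tc in enumerate([2.45, 2.61, 2.38, 2.59])]

# Keep the run that explains the most information about the documents
best_model = max(runs, key=lambda m: m.tc)
print(best_model.seed, best_model.tc)
```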

Semi-Supervised Topic Modeling

Using Anchor Words

Anchored CorEx allows a user to integrate their domain knowledge through "anchor words." Anchoring encourages (but does not force) CorEx to search for topics that are related to the anchor words. This helps us find topics of interest, enforce separability of topics, and find aspects around topics.

If words is initialized, then it is easy to use anchor words:

topic_model.fit(X, words=words, anchors=[['dog','cat'], 'apple'], anchor_strength=2)

This anchors "dog" and "cat" to the first topic, and "apple" to the second topic. anchor_strength controls how much weight is placed on an anchor word relative to all the other words. For example, if anchor_strength=2, then CorEx will place twice as much weight on the anchor word when searching for relevant topics. anchor_strength should always be set above 1. Beyond that, the choice of anchor_strength depends on the size of the vocabulary and the task at hand. We encourage users to experiment with anchor_strength to find what is useful for their own purposes.

If words is not initialized, we can anchor by specifying the column indices of the document-term matrix that we wish to anchor on. For example,

topic_model.fit(X, anchors=[[0, 2], 1], anchor_strength=2)

anchors the words of columns 0 and 2 to the first topic, and word 1 to the second topic.

Anchoring Strategies

There are a number of strategies we can use with anchored CorEx. Below we provide just a handful of examples.

  1. Anchoring a single set of words to a single topic. This can help promote a topic that did not naturally emerge when running an unsupervised instance of the CorEx topic model. For example, we might anchor words like "snow," "cold," and "avalanche" to a topic if we suspect there should be a snow avalanche topic within a set of disaster relief articles.
topic_model.fit(X, words=words, anchors=[['snow', 'cold', 'avalanche']], anchor_strength=4)
  2. Anchoring single sets of words to multiple topics. This can help find different aspects of a topic that may be discussed in several different contexts. For example, we might anchor "protest" to three topics and "riot" to three other topics to understand different framings that arise from tweets about political protests.
topic_model.fit(X, words=words, anchors=['protest', 'protest', 'protest', 'riot', 'riot', 'riot'], anchor_strength=2)
  3. Anchoring different sets of words to multiple topics. This can help enforce topic separability if there appear to be "chimera" topics that are not well-separated. For example, we might anchor "mountain," "Bernese," and "dog" to one topic and "mountain," "rocky," and "colorado" to another topic to help separate topics that merge discussion of Bernese Mountain Dogs and the Rocky Mountains.
topic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado']], anchor_strength=2)

The example notebook details other examples of using anchored CorEx. We encourage domain experts to experiment with other anchoring strategies that suit their needs.

Note: when running unsupervised CorEx, the topics are returned and sorted according to how much total correlation they each explain. When running anchored CorEx, the topics are not sorted by total correlation, and the first n topics will correspond to the n anchored topics in the order given by the model input.

Hierarchical Topic Modeling

Building a Hierarchical Topic Model

For the CorEx topic model, topics are latent factors that can be expressed or not in each document. We can use the matrices of these topic expressions as input for another layer of the CorEx topic model, yielding a hierarchical topic model.

# Train the first layer
topic_model = ct.Corex(n_hidden=100)
topic_model.fit(X)

# Train successive layers
tm_layer2 = ct.Corex(n_hidden=10)
tm_layer2.fit(topic_model.labels)

tm_layer3 = ct.Corex(n_hidden=1)
tm_layer3.fit(tm_layer2.labels)

Visualizations of the hierarchical topic model can be accessed through vis_topic.py.

vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=words, max_edges=300, prefix='topic-model-example')

Technical notes

Binarization of Documents

For speed reasons, this version of the CorEx topic model works only on binary data and produces binary latent factors. Despite this limitation, our work demonstrates that CorEx produces coherent topics that are as good as or better than those produced by LDA for short to medium length documents. However, you may wish to consider additional preprocessing when working with longer documents. We have several strategies for handling text data.

  1. Naive binarization. This will be good for documents of similar length and especially short- to medium-length documents.

  2. Average binary bag of words. We split documents into chunks, compute the binary bag of words for each chunk, and then average. This implicitly weights all documents equally.

  3. All binary bag of words. Split documents into chunks and consider each chunk as its own binary bag-of-words document. This changes the number of documents, so it may take some work to match the IDs back, if desired. Implicitly, this will weight longer documents more heavily. Generally this seems like the most theoretically justified method. Ideally, you could aggregate the latent factors over sub-documents to get 'counts' of latent factors at the higher layers.

  4. Fractional counts. This converts counts into a fraction of the background rate, with 1 as the max. Short documents tend to stay binary, and words in long documents are weighted according to their frequency with respect to the background rate in the corpus. This seems to work OK in tests. It requires no preprocessing of count data and it uses the full range of possible inputs. However, this approach is not very rigorous or well tested.
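Strategies 2 and 3 can be sketched with a small helper function; chunked_binary_bow is hypothetical and not part of the corextopic package:

```python
import numpy as np

def chunked_binary_bow(doc_tokens, vocab, chunk_size=100, average=True):
    """Split a token list into chunks and build a binary bag of words
    per chunk. Averaging the rows gives strategy 2; returning one row
    per chunk gives strategy 3. Hypothetical helper, not part of the
    corextopic package."""
    index = {w: i for i, w in enumerate(vocab)}
    chunks = [doc_tokens[i:i + chunk_size]
              for i in range(0, len(doc_tokens), chunk_size)]
    rows = np.zeros((len(chunks), len(vocab)))
    for r, chunk in enumerate(chunks):
        for w in chunk:
            if w in index:
                rows[r, index[w]] = 1.0
    return rows.mean(axis=0) if average else rows

vocab = ['dog', 'cat', 'fish']
doc = ['dog', 'cat', 'dog', 'fish', 'fish', 'fish']
avg_bow = chunked_binary_bow(doc, vocab, chunk_size=3)                   # strategy 2
per_chunk = chunked_binary_bow(doc, vocab, chunk_size=3, average=False)  # strategy 3
```

For strategy 2 the averaged row would be passed to the model as a single document; for strategy 3 each chunk row becomes its own document.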

For strategies 1 and 2, you can use the functions in vis_topic to process the data, or do the same preprocessing yourself. Through the Python API, naive binarization is specified with count='binarize' and fractional counts with count='fraction'. While fractional counts may work in theory, their usage in the CorEx topic model has not been adequately tested.

Single Membership of Words in Topics

Also for speed reasons, the CorEx topic model enforces single membership of words in topics. If a user anchors a word to multiple topics, the single membership will be overridden.

References

If you use this code, please cite the following:

Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. "Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge." Transactions of the Association for Computational Linguistics (TACL), 2017.

See the following papers if you are interested in how CorEx works more generally, beyond sparse binary data.

Discovering Structure in High-Dimensional Data Through Correlation Explanation, Ver Steeg and Galstyan, NIPS 2014.

Maximally Informative Hierarchical Representions of High-Dimensional Data, Ver Steeg and Galstyan, AISTATS 2015.

Issues
  • How can we test the model on new data?

    Hello, thank you for this tutorial. I want to build an anchored model for sentence classification (I have 5 classes), so I trained an anchored model with 5 topics. How can I test the model on new sentences? There is a "predict" attribute, but I get an error.

    opened by Suhaib441 11
  • error in vis_hierarchy

    When I run vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=words, max_edges=200, prefix='topic-model-example')

    I get

    NameError                                 Traceback (most recent call last)
    <ipython-input-99-9a4b427527bd> in <module>
          1 vt.vis_hierarchy([topic_model, tm_layer2, tm_layer3],
    ----> 2                  column_label=words, max_edges=200, prefix='topic-model-example')
    
    ~/env/py36/lib/python3.6/site-packages/corextopic/vis_topic.py in vis_hierarchy(corexes, column_label, max_edges, prefix, n_anchors)
         58         inds = np.where(alpha[j] >= 1.)[0]
         59         inds = inds[np.argsort(-alpha[j, inds] * mis[j, inds])]
    ---> 60         group_number = u"red_" + unicode(j) if j < n_anchors else unicode(j)
         61         label = group_number + u':' + u' '.join([annotate(column_label[ind], corexes[0].sign[j,ind]) for ind in inds[:6]])
         62         label = textwrap.fill(label, width=25)
    
    NameError: name 'unicode' is not defined
    
    opened by vtrokhymenko 10
  • get_topics() returns words that appear to be out of order in terms of MI

    Hi (and thank you for this really cool and interesting work)

    I've got a situation where topic words appear to be out of order, except for the first topic.

    For example, with 2 anchored topics, the first topic words returned are listed with MI sorted in decreasing order, as expected.

    However, for the second topic, MI decreases, but then increases again.

    ...looking at get_topics() it's not clear how this could happen -- the code looks right, and I'm not aware of any strange issues with np.argsort().

    Any ideas what I should check next? Is this expected behavior in certain instances?

    opened by jay-reynolds 9
  • 'Anchor word not in word column labels provided to CorEx:

    How can I skip anchor words that are not in the vocabulary and still produce results for the anchor words that are available?

    opened by srujana-tak 6
  • [Question] Anchoring multiple times

    In the example from the readme file, there are 3 different anchoring strategies. I'm interested in 2 of them, Anchoring single sets of words to multiple topics and Anchoring different sets of words to multiple topics. I'm wondering if I should combine two of the strategies together (or more) to get a better result. For example, using the example from the ReadMe file:

    Anchor the specific list of words for every individual document

    topic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado']], anchor_strength=2)

    Anchor general words throughout all of the documents

    topic_model.fit(X, words=words, anchors=['protest', 'protest', 'protest', 'riot', 'riot', 'riot'], anchor_strength=2)

    Will fitting the model with two different anchor words lists improve the result in general (or change anything at all), or will it decrease the quality of the result?

    Also, does repeating the words in the anchor_words list change how the model views the words (increase their strength)? In the second code block, the words 'protest' and 'riot' are repeated three times.

    opened by pat266 5
  • Coherence Scores

    Hi,

    Thank you for the great package.

    I noticed in your paper that you measure the coherence scores of corex outputs (https://www.aclweb.org/anthology/Q17-1037.pdf)

    However, in the class I do not see a method to output the coherence values. Could you point me in the right direction?

    Thanks in advance!

    Adam

    opened by adamdavidconn 4
  • How to do word cloud or frequency distribution on each topic?

    First of all, thanks for the wonderful work. It works perfectly; I got my topics with the right anchor words. Everything is working fine, however I want to see the word cloud or frequency distribution of each topic. How can I do that? Thanks in advance.

    opened by JeevaGanesan 4
  • Not getting enough topics

    I tried running corex_topic with a training matrix of size approx 100,000x10,000. I ran Corex with settings n_hidden=1000, max_iter=1000 but only about 200 of them were non-empty. This could be a symptom of my data, of course (and perhaps there ARE only 200 topics), but are there other parameters that could be tuned to generate way more? Thanks.

    opened by cgreenberg 4
  • Incremental Modeling

    Hi guys, thank you so much for developing and sharing the CorEx model. I've been working on an NLP project and have found the anchored model super helpful. I'm wondering whether it is possible to do batch processing or incremental modeling with CorEx? For example, if I already built a model but have a new batch of documents coming in with new vocabulary, is it possible to update the original model with the new data?

    Thank you!

    opened by ruoyu-qian 4
  • model_ct.predict_proba() explanation

    Hi Greg,

    I was trying corextopic for supervised topic modeling (more precisely, classification) and was using model.predict_proba(<clean_vectorize_data>). This gives me output similar to (array([[0.999, 0.0022]]), array([[0.198, -0.205]])). Could you please explain what these values are? That would be a great help.

    Thanks in advance.

    opened by sachindevsharma 4
  • Metrics for Model Selection

    Hi,

    I'm testing some semi-supervised models, each with 20 topics created through lists of roughly 15 anchor words per topic. The documents within the corpus I'm working with have a large variance in word length (150 - 20,000+). I've broken the documents into smaller batches to help control for document length, and am looking to find the batch size and anchor strength which creates the best model.

    I know that total correlation is the measure which CorEx maximizes when constructing the topic model, but in my experimenting with anchor strength I've found that TC always increases linearly with anchor strength, even when it's set into the thousands. So far I've been evaluating my models by comparing the anchor words of each topic to the words returned from .get_topics(), and I was wondering if there is a more quantitative way of selecting one model over another? I've looked into using other packages to measure the semantic similarity between the anchor words and the different words retrieved by .get_topics(), but wanted to reach out to see if there are any other metrics out there to measure model performance.

    Additionally, besides batch size and anchor strength, are there any other parameters I should be aware of when fitting a model? Any help would be greatly appreciated.

    opened by mchabala 1
  • Priority is always given to the first anchor from anchor words

    I have a dataset that consists of 10 thousand documents. It definitely contains documents for 16 topics. With anchor words, I want to classify a dataset into 16 topics. For each topic, I set anchor words (some anchors have more words, some less, but on average about 50 words per topic). For each topic anchor words are set in a separate list, then I check for the presence of anchor words in the texts and add them to the general list of lists anchors.

    But at the output, one topic always dominates (90-95%) in my documents, and this is the topic whose words are set first in the anchor words (I checked this by changing the order of the anchor words).

    For example, I have a desserts and alcoholic drinks theme. If I put the anchor words of the theme desserts first in the list of anchor words, then this theme will prevail in the output. If I first put the anchor words of the topic of alcoholic beverages, then the topic of alcoholic beverages will prevail.

    By "dominate" I mean that 90% or more of the documents are labeled with the topic whose anchor words come first. The other topics among the 16 also appear in the output, but much less often, and they are also often wrong.

    Can you please tell me why this is happening and what am I doing possibly wrong?

    Thank you in advance for your help and answer!

    opened by ElizaLo 1
  • Hierarchical topic model visualization

    The graphviz functions for visualizing the topic model are finicky and it's costly to have to update them with respect to both networkx and graphviz.

    Two proposed options:

    1. We add a function that makes it easy to get the hierarchy edge list, so that others can more easily visualize the hierarchy themselves.
    2. We go a step further and rework the visualization code so that it just uses networkx.

    enhancement 
    opened by ryanjgallagher 1
  • Close file descriptors after saving and loading models

    Fixes #42

    Additionally I removed unused imports, sorted them, and applied auto-linting with Flake8.

    opened by zafercavdar 0
  • [Warning] ResourceWarning: unclosed file

    Definition

    After saving and loading with pickle, file descriptors are not closed.

    How to reproduce

    ...
    from corextopic import corextopic
    
    corex_model = corextopic.Corex(n_hidden=10, verbose=True, max_iter=200)
    corex_model.fit(corpus, words=words)
    
    path = "path/to/corex_model.pkl"
    corex_model.save(path, ensure_compatibility=False)
    
    > ResourceWarning: unclosed file <_io.BufferedWriter name='path/to/corex_model.pkl'>
    
    
    loaded_model = corextopic.load(path)
    > ResourceWarning: unclosed file <_io.BufferedReader name='path/to/corex_model.pkl'>
    
    opened by zafercavdar 0
  • GPU Implementation

    This is an enhancement. Given that CorEx utilises a semi-supervised approach it would be advantageous to have a GPU Implementation as the reduced wait for feedback would allow for more rapid development of topics. https://developer.nvidia.com/how-to-cuda-python

    opened by RyanCodrai 1
  • More flexibility in setting anchor strength (fixes #16)

    This attempts to allow setting anchor strengths on a per-topic basis, and within each topic on a per-word basis. I don't have a real theoretical understanding of this, but #16 suggests that it is possible and I think this method seems intuitive. Let me know any thoughts! Thanks

    opened by GuyAglionby 1
  • Index error when using anchors

    When using the fit method with anchors I get an index error from this line:

    https://github.com/gregversteeg/corex_topic/blob/83991482523a8be2b2a8b9c864b273de96c2389a/corextopic/corextopic.py#L185

    The error is understandable, because if X is a 2d array then X[:,i] is a 1d slice, and therefore X[:,i].mean(axis=1) is undefined because there is no dimension 1.

    I've installed version corextopic==1.0.5 from pypi.

    I can reproduce this for any arguments passed to anchors

    opened by owlas 6
  • Can't visualize using vis_hierarchy()

    I tried to run the example code and vis_hierarchy(). I downloaded force.html according to issue 19, but when I open force.html the page is completely blank. Inspecting the element, the console says: Uncaught TypeError: Cannot read property 'push' of undefined at t (d3.v2.min.js?2.9.3:3) at e (d3.v2.min.js?2.9.3:3) at Object.n.start (d3.v2.min.js?2.9.3:3) at force.html:34 at d3.v2.min.js?2.9.3:2 at r (d3.v2.min.js?2.9.3:2) at XMLHttpRequest.r.onreadystatechange (d3.v2.min.js?2.9.3:2)

    It seems the src used in force.html has some problem. How can I solve this? Thanks a lot.

    opened by AlexanderZhujiageng 9
  • Allow anchoring parameter to be set more flexibly

    Currently, the anchoring parameter must be set to be the same across all words that are anchored. Since the theory allows it, we should allow a user to pass a list of anchor strengths (if they want), where the list can consist of integers (anchor all words in this topic with the same parameter), lists (anchor the words in this particular topic with these strengths), or both (a mix of setting the strength to be the same for all words in some topics, and setting the parameter for each word in other topics). A user should still be allowed to just pass an integer to the anchoring parameter if they do not want to specify each topic.

    ex.

    anchors = [['dog', 'cat'], 'apple']
    anchor_strengths = [[2, 3], 4]
    topic_model.fit(X, words=words, anchors=anchors, anchor_strength=anchor_strengths)
    

    This would anchor "dog" to Topic 1 with anchor_strength=2, "cat" to Topic 1 with anchor_strength=3, and "apple" to Topic 2 with anchor_strength=4.

    Opening this as an issue because I keep forgetting to get around to it.

    enhancement 
    opened by ryanjgallagher 0
Owner

Greg Ver Steeg, research professor at USC