TDmatch
TDmatch is a Python library developed to perform matching tasks in three categories:
- Text to Data, which matches tuples of a table to text documents
- Text to Structured Text, which matches hierarchical taxonomy concepts to text documents
- Text to Text, which matches two corpora of text documents
The notebooks folder contains notebooks for running different scenarios.
First, the model creates a graph from the document corpora. Next, it trains a word embedding model on random walks generated by traversing the graph. Finally, the trained model is used to match metadata between the two corpora.
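To make the graph-creation step concrete, here is a minimal sketch using only the Python standard library. It is not the TDmatch implementation: the tokenizer, the node naming scheme (`A::`, `B::`, `tok::`), and the `build_graph` helper are all assumptions made for illustration. It links one metadata node per document to the tokens that document contains, so documents from the two corpora become connected through shared tokens.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (tok.strip(".,;:!?()\"'").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def build_graph(corpus_a, corpus_b):
    """Build an undirected graph linking document nodes to token nodes.

    corpus_a, corpus_b: dicts mapping a document id to its text.
    Returns an adjacency dict: node -> set of neighbouring nodes.
    """
    graph = defaultdict(set)
    for prefix, corpus in (("A", corpus_a), ("B", corpus_b)):
        for doc_id, text in corpus.items():
            doc_node = f"{prefix}::{doc_id}"  # hypothetical naming scheme
            for token in tokenize(text):
                token_node = f"tok::{token}"  # token nodes are shared by both corpora
                graph[doc_node].add(token_node)
                graph[token_node].add(doc_node)
    return graph


# Toy example: one text claim and one table row flattened to text.
claims = {"c1": "The film was released in 1994."}
rows = {"r1": "Pulp Fiction 1994 Quentin Tarantino"}
graph = build_graph(claims, rows)
```

In this toy example the claim and the table row end up two hops apart through the shared token node `tok::1994`, which is the kind of connection the random walks can later pick up.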
We used 5 datasets to test the different tasks:
- Two fact-checking datasets, Politifact and Snopes, which we use for Text to Text matching. These datasets are presented in That-is-a-Known-Lie. We also used the STS dataset from GLUE as a Text to Text matching dataset.
- Two datasets for Text to Data matching: IMDB, which is created from the IMDB top 1000 movies of all time, and CoronaCheck, which is presented in Scrutinizer.
How to run
Use the notebook for the required task to generate the results for the required dataset.
All the notebooks have a similar structure:
- Creating the graph
- (optional) Expanding the graph with external sources
- (optional) Compressing the graph with MSP
- Generating random walks on the graph and training a word embedding model on the random walks
- Matching metadata nodes with the model and printing the results
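The random-walk and matching steps above can be sketched as follows. This is a simplified illustration, not the code in the notebooks: `random_walks`, `cosine`, and `match` are hypothetical helpers, the walks are uniform (the notebooks may use a different strategy), and the embedding vectors are hard-coded toy values. In practice the vectors would come from a word embedding model trained on the walks, with each walk treated as a sentence.

```python
import math
import random


def random_walks(graph, walks_per_node=5, walk_length=10, seed=0):
    """Generate uniform random walks starting from every node of the graph."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(graph):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = sorted(graph[walk[-1]])
                if not neighbours:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match(vectors, sources, targets, top_k=1):
    """For each source node, return the top_k most similar target nodes."""
    result = {}
    for src in sources:
        ranked = sorted(
            targets,
            key=lambda tgt: cosine(vectors[src], vectors[tgt]),
            reverse=True,
        )
        result[src] = ranked[:top_k]
    return result


# Toy graph: a claim and a table row connected through a shared token.
graph = {
    "A::c1": {"tok::1994"},
    "tok::1994": {"A::c1", "B::r1"},
    "B::r1": {"tok::1994"},
}
walks = random_walks(graph, walks_per_node=2, walk_length=4)

# Toy vectors standing in for embeddings learned from the walks.
vectors = {"A::c1": [1.0, 0.0], "B::r1": [0.9, 0.1], "B::r2": [0.0, 1.0]}
matches = match(vectors, ["A::c1"], ["B::r1", "B::r2"])
```

Each walk is a list of node names, so the list of walks can be passed to a word embedding trainer in place of tokenized sentences.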
Using SSuM compression
- First install the library by following its installation instructions
- Use the code in the SSuM block to generate the input
- Generate the compressed graph:
./run.sh input_path compression_ratio reconstruction_error
Expanding with ConceptNet
- After installing conceptnet_lite, download ConceptNet DB from this link
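As a rough illustration of what expansion does to the graph, the sketch below adds related-term nodes to tokens that are already present. It deliberately avoids calling conceptnet_lite: the `related_terms` mapping is a hypothetical stand-in for the results of a ConceptNet lookup, and `expand_graph` and the `tok::` naming scheme are assumptions made for this example rather than part of TDmatch.

```python
def expand_graph(graph, related_terms):
    """Return a copy of the graph with related-term nodes attached.

    graph: adjacency dict, node -> set of neighbouring nodes.
    related_terms: dict mapping a token to a list of related tokens
        (a stand-in for what a ConceptNet query would return).
    """
    expanded = {node: set(neighbours) for node, neighbours in graph.items()}
    for term, relatives in related_terms.items():
        node = f"tok::{term}"
        if node not in expanded:
            continue  # only expand tokens that already occur in the graph
        for relative in relatives:
            relative_node = f"tok::{relative}"
            expanded[node].add(relative_node)
            expanded.setdefault(relative_node, set()).add(node)
    return expanded


graph = {"A::c1": {"tok::film"}, "tok::film": {"A::c1"}}
related_terms = {"film": ["movie"], "car": ["vehicle"]}  # toy lookup results
expanded = expand_graph(graph, related_terms)
```

Tokens absent from the original graph (here `car`) are skipped, so expansion only adds bridges between terms the corpora already use and never grows the graph from unrelated concepts.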