A knowledge base construction engine for richly formatted data

HazyResearch

Last update: Dec 5, 2022

Overview

Fonduer is a Python package and framework for building knowledge base construction (KBC) applications from richly formatted data.

Note that Fonduer is still actively under development, so feedback and contributions are welcome. Submit bugs in the Issues section or feel free to submit your contributions as a pull request.

Getting Started

Check out our Getting Started Guide to get up and running with Fonduer.

Learning how to use Fonduer

The Fonduer tutorials cover the Fonduer workflow, showing how to extract relations from hardware datasheets and scientific literature.

Reference

Fonduer: Knowledge Base Construction from Richly Formatted Data (blog):

@inproceedings{wu2018fonduer,
  title={Fonduer: Knowledge Base Construction from Richly Formatted Data},
  author={Wu, Sen and Hsiao, Luke and Cheng, Xiao and Hancock, Braden and Rekatsinas, Theodoros and Levis, Philip and R{\'e}, Christopher},
  booktitle={Proceedings of the 2018 International Conference on Management of Data},
  pages={1301--1316},
  year={2018},
  organization={ACM}
}

Acknowledgements

Fonduer leverages the work of Emmental and Snorkel.

Comments

Using candidates for prediction (Fonduer Prediction Pipeline)
Scenario:

For my use case I have a set of financial documents.

The entire document set is divided into train, dev, and test splits. The documents are parsed, and the mentions and candidates are extracted with some rules.

The featurized training candidates are used to train a Fonduer learning model, and the model is used to predict on the test candidates, following the normal Fonduer pipeline as demonstrated in the hardware tutorial.

Problems & Questions

Is the Fonduer prediction pipeline production-ready? How can we fine-tune it to achieve better accuracy? Should the main focus be on the quality of the extracted mentions?

In my initial analysis, following the hardware tutorial, I could not obtain good results.

Can we separate the training and test pipeline?

In the current scenario, when I feed in a new document for prediction, the entire corpus has to be parsed again to extract the mentions and candidates and to store the feature keys.

Please correct me if that is not the case, and help me with a snippet that showcases the separation.
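
One general way to decouple prediction from training is to freeze the set of feature keys at training time and project the features of any new candidate onto that fixed set. The sketch below illustrates only this idea in plain Python; it does not use Fonduer's API, and fit_feature_keys, vectorize, and the feature names are hypothetical. The 0.8.3 release notes on this page also mention a custom MLflow model for packaging a Fonduer model, which may be the more direct route.

```python
from typing import Dict, Iterable, List


def fit_feature_keys(train_features: Iterable[Dict[str, float]]) -> List[str]:
    """Collect the feature keys seen during training, in a stable order."""
    keys = set()
    for feats in train_features:
        keys.update(feats)
    return sorted(keys)


def vectorize(feats: Dict[str, float], keys: List[str]) -> List[float]:
    """Project one candidate's features onto the frozen key list.

    Keys unseen during training are dropped; missing keys become 0.0.
    """
    return [feats.get(key, 0.0) for key in keys]


# Training time: derive the key list once and persist it with the model.
train = [{"word:profit": 1.0, "row:3": 1.0}, {"word:loss": 1.0}]
keys = fit_feature_keys(train)

# Prediction time: only the new document's candidates are featurized,
# and their features are aligned with the keys fixed during training.
new_candidate = {"word:loss": 1.0, "word:unseen": 1.0}
vector = vectorize(new_candidate, keys)
```

With this split, a new document only needs to be parsed and featurized on its own; the training corpus is not touched again.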
opened by atulgupta9 16
Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537
Description of the problems or issues

Is your pull request related to a problem? Please describe. See #534. This request redoes #537, which first required fixing #538 (fixed by #539).

Does your pull request fix any issue. See #534

Description of the proposed changes

For multi-line Japanese strings such as 'AAAA\nBBBB', spacy[ja] sometimes generates tokens that span the line break, e.g. ['AAA', 'AB', 'B', 'BB']. This proposal defines the bbox of 'AB' as a multi-line word (i.e., left is the minimum left of ['A', 'B'], top is the top of 'A', right is the maximum right of ['A', 'B'], and bottom is the bottom of 'B').
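
The bounding-box rule described above can be sketched as follows. This is a minimal illustration, not the code from the pull request; the Box type, the function name, and the coordinates are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def multiline_word_bbox(first: Box, last: Box) -> Box:
    """Bounding box of a token whose characters sit on two lines.

    `first` is the box of the part on the upper line and `last` is the
    box of the part on the lower line.
    """
    return Box(
        left=min(first.left, last.left),
        top=first.top,
        right=max(first.right, last.right),
        bottom=last.bottom,
    )


# Hypothetical coordinates: 'A' ends the first line, 'B' starts the second.
box_a = Box(left=400, top=100, right=440, bottom=130)
box_b = Box(left=50, top=140, right=90, bottom=170)
merged = multiline_word_bbox(box_a, box_b)
```

The merged box covers both line fragments, so it is wider and taller than either part on its own.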

Test plan

This is caused by Japanese morphological analysis. So, I have added Japanese test data to 'tests/data/hocr_simple/japan.hocr' and test code to 'tests/parser/test_parser.py::test_parse_hocr'.

Checklist

[x] I have updated the documentation accordingly.

[x] I have added tests to cover my changes.

[x] All new and existing tests passed.

[x] I have updated the CHANGELOG.rst accordingly.
opened by YasushiMiyata 15
parser.apply does not return for a long time even though the progress bar indicates it finishes parsing
Description of the bug

This is not a bug but a performance issue. It is not noticeable when parsing a small number of documents, but with many documents parser.apply does not return even though the progress bar indicated long ago (1 hour or more) that parsing had finished.

To Reproduce

Steps to reproduce the behavior:

Parse many documents (my case: ~2500)

Expected behavior

parser.apply returns when the progress bar indicates it finished parsing all the documents.


Environment (please complete the following information)

OS: Debian Buster

PostgreSQL Version: 12.1

Poppler Utils Version: N/A

Fonduer Version: 0.8.3+dev (01e0d9319b523aff7aa7f5c583a9f330b0705ecc)

bug
opened by HiromuHota 14
Execute preprocessing and parsing in parallel
Description of the problems or issues

Is your pull request related to a problem? Please describe.

Currently, the preprocessor and parser are executed in strictly sequential order, i.e., preprocess N docs (and load them into a queue), then parse N docs. This has two drawbacks:

the progress bar shows nothing during preprocessing.

the machine RAM has to be large enough to hold N preprocessed docs at a time.

They become more serious when N is large and/or each HTML file is large.

Does your pull request fix any issue.

Fix #435

Description of the proposed changes

This PR

places a cap on the in_queue so that only a certain number of documents are loaded to in_queue.

executes the preprocessor and parser in parallel (i.e., the main process does preprocessing and child process(es) do parsing in parallel).
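
The two changes above amount to a bounded producer-consumer queue. The sketch below shows that pattern with threads standing in for the child processes so that it stays self-contained; it is not the code from this PR, and preprocess, parse_worker, run, and max_queued are hypothetical names.

```python
import queue
import threading

SENTINEL = None  # marks the end of the input stream


def preprocess(names):
    """Stand-in for a document preprocessor: yields documents lazily."""
    for name in names:
        yield {"name": name, "text": f"<html>{name}</html>"}


def parse_worker(in_queue, results, lock):
    """Stand-in for a parser worker: consumes documents until the sentinel."""
    while True:
        doc = in_queue.get()
        if doc is SENTINEL:
            break
        parsed = (doc["name"], len(doc["text"]))
        with lock:
            results.append(parsed)


def run(names, n_workers=2, max_queued=4):
    # The cap on the queue bounds how many preprocessed docs sit in memory.
    in_queue = queue.Queue(maxsize=max_queued)
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=parse_worker, args=(in_queue, results, lock))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    # The main thread preprocesses while the workers parse in parallel.
    for doc in preprocess(names):
        in_queue.put(doc)  # blocks while the queue is full
    for _ in workers:
        in_queue.put(SENTINEL)
    for worker in workers:
        worker.join()
    return sorted(results)
```

Because put blocks once max_queued documents are waiting, memory use no longer grows with N, and parsing (and therefore any progress reporting tied to it) can start as soon as the first document is queued.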

Test plan

For the 1st issue: I manually checked that the progress bar starts showing progress right after starting parser.apply.

Checklist

[x] I have updated the documentation accordingly.

[ ] I have added tests to cover my changes.

[x] All new and existing tests passed.

[x] I have updated the CHANGELOG.rst accordingly.

enhancement
opened by HiromuHota 13

[Errno 32] Broken pipe for Parser in parallel execution on OSX

Hi,

In fonduer-tutorials, after running cell:

corpus_parser = OmniParser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
%time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)

whenever PARALLEL is smaller than max_docs, I get:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
    send_bytes(obj)
  File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 398, in _send_bytes
    self._send(buf)
  File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Otherwise (with PARALLEL greater than or equal to max_docs), the result is empty tables in PostgreSQL. When parallelization is turned off, it works.

Best regards

bug

opened by mladvladimir 13

Feat/multary candidates
Description of the problems or issues

The feature extraction only supports unary and binary candidates

Does your pull request fix any issue. Closes #455

Description of the proposed changes

Add new functions that support multary relations between spans in feature extraction.

Test plan

Use a candidate with more than two mentions, and run the feature extraction step.

Checklist

[x] I have updated the documentation accordingly.

[x] I have added tests to cover my changes.

[x] All new and existing tests passed.

[x] I have updated the CHANGELOG.rst accordingly.

Note:

For multary candidates to work with the textual features, we need a new version of treedlib based on this PR: treedlib#46. So if you can contact them, please do.

Also, it would help if someone could jump in to improve the test coverage; I can't get the coverage for tabular_features up.
enhancement
opened by wajdikhattel 12
Add HOCRDocPreprocessor and HocrVisualParser
Description of the problems or issues

Is your pull request related to a problem? Please describe.

This is the second patch that follows #518 .

Does your pull request fix any issue.

N/A.

Description of the proposed changes

Add HOCRDocPreprocessor and HocrVisualParser

Test plan

I added a few real hOCR example files.

Checklist

[x] I have updated the documentation accordingly.

[x] I have added tests to cover my changes.

[x] All new and existing tests passed.

[x] I have updated the CHANGELOG.rst accordingly.

enhancement
opened by HiromuHota 9
Duplicate key error while adding two mentions which are same
Suppose I have two mentions (for example, zip code and tax code) whose matchers both return true for the same span in a document (both check for a 5-digit regex match). In that case, I think Fonduer throws the error below. Please help me resolve this.

sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "context_stable_id_key" DETAIL: Key (stable_id)=(1443208965_10_subset::span_mention:23313:23321) already exists. [SQL: INSERT INTO context (type, stable_id) VALUES (%(type)s, %(stable_id)s) RETURNING context.id]
opened by saikalyan9981 9
unable to read images in the pdf file

Hi

I am passing HTML to Fonduer, and it reports that it is unable to read an image from a figure. I took a PDF, converted it to HTML via pdftotree, and passed the HTML to Fonduer. Is this an issue with pdftotree not being able to render images? I want to know the mechanism for linking or embedding images in the HTML so that Fonduer can read them.

Please help or advise, as I am stuck on this issue.

opened by ashleo25 8
Non-deterministic behavior in featurization

Describe the bug

When working with a large (~7k docs) corpus of hardware datasheets and extracting multiple relations, we expect the features for each candidate to be deterministic across runs, even more so with parallelism=1 set in the Featurizer. However, we find that there can be a small number (e.g., < 5) of differences between feature tables, resulting in slightly different sparse matrices and thus slightly different results.

To Reproduce

Running on the HACK transistor dataset will reproduce the error. However, it takes a long time, and we haven't yet been able to produce a minimal example that reproduces it. Attached are two feature table dumps from two different runs with parallelism=1. Note that there is only a single difference, on line 65454.

feature_table.tar.gz

Note that it isn't always one difference, and the difference is not deterministic. The attached difference is just an example.

Expected behavior

We would expect these feature tables to be identical between runs.

Error Logs/Screenshots

For convenience, here is the differing line in screenshot form

Additional context

If the issue is in the UDF implementation, this might affect the Labeler in addition to the Featurizer, since they share a lot of the UDF code.
bug

opened by lukehsiao 8
Type hints
Is your feature request related to a problem? Please describe.

I'm always frustrated when I have to look at the source code to check the types of arguments and return values.

Describe the solution you'd like

Type hints (PEP 484) are written in the source code, like

def greeting(name: str) -> str:
    return 'Hello ' + name

(Eventually) enforce type checking during pre-commit

For example by flake8-mypy
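
As a small illustration of what such annotations look like and how tools can read them, here is a sketch using only the standard library. The first_token function is a hypothetical example and is not part of Fonduer.

```python
from typing import List, Optional, get_type_hints


def greeting(name: str) -> str:
    return "Hello " + name


def first_token(tokens: List[str]) -> Optional[str]:
    """Return the first token, or None for an empty list."""
    return tokens[0] if tokens else None


# Annotations are available at runtime, so editors, documentation tools,
# and type checkers do not need to read the function body.
hints = get_type_hints(greeting)
```

A static checker such as mypy would flag a call like greeting(42) without running the code, which is the kind of check the pre-commit proposal refers to.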

Describe alternatives you've considered

Depending on the editor (PyCharm, etc.), type/rtype documentation like the example below gives you type hinting. However, I'm not sure this is equivalent to type hints (PEP 484).

def greeting(name):
    """
    greeting
    :param name: description
    :type name: type description
    :return: description
    :rtype: type description
    """
    return 'Hello ' + name
enhancement help wanted
opened by HiromuHota 8
CandidateExtractor doesn't scale for larger relations
Hello, thanks for providing this framework. My group has run into a bit of a snag:

For context, we've successfully completed candidate extraction & labeling for binary relations, with reasonable runtimes. With parallelism = 6, candidate extraction takes ~2 minutes per document.

We've since moved on to a 3-ary relation that is very similar to the binary relation. This 3-ary relation shares some mentions with the binary relation, and uses a very similar candidate extractor. We have done performance testing for the 3-ary throttler function and found it to have a very similar runtime to the binary throttler. Candidate extraction now takes 4 hours per document. This immense slowdown is due to the fact that computational complexity scales exponentially for each entity added to a relation.

Here are some numbers from our use case:

Mention A: 800 mentions found

Mention B: 140 mentions found

Mention C: 150 mentions found

If our relation only includes (A,B), we have a total of 800*140 = 112,000 temporary candidates to evaluate with our throttler. Should we add mention C to form the relation (A,B,C), our total now grows to 800*140*150 = 16.8 million temporary candidates. We're unable to narrow our mention matchers further without excluding true positives.

This slowdown makes the Fonduer framework effectively unusable for any large-scale use case that requires relations with more than 2 entities. Can you provide guidance to address this issue?
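
The growth described in this issue, and one generic way to limit it, can be sketched as follows. The idea is to apply a pairwise filter before extending to the third mention, so that the full product is never enumerated. This is an illustration under the assumption that the throttling rule can be decomposed in this way; it is not a feature of Fonduer's CandidateExtractor, and all names and the page-based rule are hypothetical.

```python
from itertools import product
from math import prod


def naive_count(sizes):
    """Number of temporary candidates in the full Cartesian product."""
    return prod(sizes)


def staged_candidates(a_mentions, b_mentions, c_mentions, keep_ab, keep_abc):
    """Filter (A, B) pairs first, then extend only the survivors with C."""
    pairs = [(a, b) for a, b in product(a_mentions, b_mentions) if keep_ab(a, b)]
    return [
        (a, b, c)
        for (a, b) in pairs
        for c in c_mentions
        if keep_abc(a, b, c)
    ]


# The counts quoted in the issue.
binary_total = naive_count([800, 140])        # 112,000
ternary_total = naive_count([800, 140, 150])  # 16,800,000

# Toy mentions as (name, page); the rule keeps mentions on the same page.
a_mentions = [("a1", 1), ("a2", 2)]
b_mentions = [("b1", 1), ("b2", 3)]
c_mentions = [("c1", 1), ("c2", 2)]
triples = staged_candidates(
    a_mentions,
    b_mentions,
    c_mentions,
    keep_ab=lambda a, b: a[1] == b[1],
    keep_abc=lambda a, b, c: c[1] == a[1],
)
```

If the pairwise filter is selective, the number of triples evaluated is the number of surviving pairs times the number of C mentions, rather than the product of all three counts.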
opened by robbieculkin 1
Tables aren't redefined for re-runs of UDF apply
Description of the bug

As part of iterative development in a Jupyter environment, apply may be re-run several times. The developer might need to update candidates or create a new labeling function, for example. When this happens, the corresponding Postgres table is cleared but not dropped. This means that the definition of the table cannot change to accommodate the updated parameters for apply.

To Reproduce

Steps to reproduce the behavior:

Run the max_storage_temp_tutorial notebook in fonduer-tutorials, up to and including the Labeling Functions section.

Add a new LF, doesn't need to do anything in particular (could return ABSTAIN every time). Add this to the stg_temp_lfs list.

Re-run the remainder of cells in the section.

Upon calling LFAnalysis, the following exception is thrown:

ValueError: Number of LFs (7) and number of LF matrix columns (6) are different

Expected behavior

Underlying tables for a re-run of a UDF apply method should not only be cleared, but dropped.
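
The difference between clearing and dropping a table can be shown with a small stand-in. The sketch below uses SQLite from the standard library instead of PostgreSQL, and the label table with one column per labeling function is a hypothetical schema chosen for illustration; it is not Fonduer's actual storage layout.

```python
import sqlite3


def lf_columns(conn):
    """Return the labeling-function columns of the hypothetical label table."""
    rows = conn.execute("PRAGMA table_info(label)").fetchall()
    return [row[1] for row in rows if row[1] != "candidate_id"]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE label (candidate_id INTEGER, lf_a INTEGER, lf_b INTEGER)")
conn.execute("INSERT INTO label VALUES (1, 0, -1)")

# Clearing removes the rows but keeps the old column definition.
conn.execute("DELETE FROM label")
cols_after_clear = lf_columns(conn)

# Dropping and recreating lets the definition follow the new set of LFs.
conn.execute("DROP TABLE label")
conn.execute(
    "CREATE TABLE label "
    "(candidate_id INTEGER, lf_a INTEGER, lf_b INTEGER, lf_c INTEGER)"
)
cols_after_drop = lf_columns(conn)
```

In this toy setup, a third labeling function has nowhere to go after a clear, which mirrors the mismatch between the number of LFs and the number of matrix columns reported above.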

Error Logs/Screenshots

Full stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-62-e005feee6300> in <module>
      5 sorted_lfs = sorted(lfs, key=lambda lf: lf.name)
      6
----> 7 LFAnalysis(L=L_train[0], lfs=sorted_lfs).lf_summary(Y=L_gold_train[0].reshape(-1))

~/.venv/lib/python3.7/site-packages/snorkel/labeling/analysis.py in __init__(self, L, lfs)
     44         if len(lfs) != self._L_sparse.shape[1]:
     45             raise ValueError(
---> 46                 f"Number of LFs ({len(lfs)}) and number of "
     47                 f"LF matrix columns ({self._L_sparse.shape[1]}) are different"
     48             )

ValueError: Number of LFs (7) and number of LF matrix columns (6) are different

Environment (please complete the following information)

OS: Ubuntu 18.04

PostgreSQL Version: 12.1

Poppler Utils Version: 0.71.0-5

Fonduer Version: 0.8.3

Additional context

https://github.com/HazyResearch/fonduer/issues/263#issuecomment-527588765 advises restarting Python, but this does not appear to solve the problem.
opened by robbieculkin 5
Parser is not splitting multiple lines sentences properly
Description of the bug

I'm trying to train a model that can build a knowledge base from the OPC UA Companion specifications as part of my thesis. I have the dataset as PDFs and used a third-party program to convert them into HTML, trying my best to preserve the data structure information (I get the same result even if I parse the PDFs alone).

Then I followed the hardware_fonduer_model tutorial to extract the candidates accordingly.

The problem is that the parser splits the sentences incorrectly: it treats the end of a line as the end of a sentence. I tried to debug using SimpleParser.split_sentences(text), and it turned out that Python needs a backslash to split a statement across multiple lines.

So I thought I could use the replacements=['[\n]', ' '] parameter so that the split would work better, but I get ValueError: too many values to unpack (expected 2). What is the default configuration for the sentence segmentation?
How can I get multiple sentences as a mention? (I tried MentionNgrams up to n_max=100 and still get just one.)

I would really appreciate hearing back from you.

Many thanks in advance.

Example: Text to be parsed

Boolean indicating if a profile /signature should be generated by this move command request.If the optional VariableSignatureRequestStatus is not provided on the Object, this parameter is ignored by the Server.

Expected behavior

sentence 1 : Boolean indicating if a profile /signature should be generated by this move command request.
sentence 2 : If the optional VariableSignatureRequestStatus is not provided on the Object, this parameter is ignored by the Server.

Actual behavior

sentence 1 : Boolean indicating if a profile /signature should be generated by this move command
sentence 2 : request.
sentence 3 : request.If the optional VariableSignatureRequestStatus is not provided on the Object, this
sentence 4 : parameter is ignored by the Server.

Environment

OS: Ubuntu 20.04.1 LTS

PostgreSQL Version: 12.0

Poppler Utils Version: 0.2.1

Fonduer Version: 0.8.2

MDISCompanionSpecification.pdf
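
The ValueError quoted in this issue is what Python raises when a flat list of strings is iterated as if it were a list of (pattern, replacement) pairs. The sketch below reproduces that behavior in isolation; it assumes the parser unpacks replacements in this way, which is consistent with the reported error but is not confirmed here, and apply_replacements is a hypothetical helper, not Fonduer code.

```python
import re


def apply_replacements(text, replacements):
    """Apply each (pattern, replacement) pair to the text in order."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


text = "generated by this move command\nrequest."

# A flat list: the first element '[\n]' is a 3-character string, so
# unpacking it into two names fails.
try:
    apply_replacements(text, ["[\n]", " "])
    error = None
except ValueError as exc:
    error = str(exc)

# A list containing one pair: the newline is replaced by a space.
fixed = apply_replacements(text, [("[\n]", " ")])
```

Under this assumption, passing replacements=[('[\n]', ' ')] (a list containing one tuple) would be the form to try.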
opened by eng-khaled1 3
Suggestion required: Getting error while applying Featurizer
@SenWu @HiromuHota, can you please tell me whether my analysis is right?

I am getting this error:

  File "abcd./anaconda3/lib/python3.7/site-packages/fonduer/utils/data_model_utils/structural.py", line 55, in _get_node
    return doc_etree.xpath(sentence.xpath)[0]
IndexError: list index out of range

I am following the hardware tutorial on some email HTML messages and get a mention count of nearly 4000.

Also, the following steps returned output:

train_cands = candidate_extractor.get_candidates(split=0)
dev_cands = candidate_extractor.get_candidates(split=1)
test_cands = candidate_extractor.get_candidates(split=2)

But on applying the featurizer:

featurizer.apply(split=0, train=True, parallelism=PARALLEL)

I get the error mentioned at the top.

I looked on Stack Overflow, but the cause suggested there, an HTML syntax issue, does not seem to apply, since the HTML renders fine in a browser. So can you share your thoughts on the following:

Can it be because no candidates are being generated? Or

something else?

Thanks.
opened by AshutoshUpadhya 3
How can i extract a paragraph and all associated sentences in document

Basically, I need paragraphs with their associated sentences. @lukehsiao @SenWu @vincentschen @ZZWENG @stephenbach

Appreciate your help
needs-info

opened by ashleo25 1
Featurizer.get_keys() does not honor candidate classes in context
Description of the bug

Unlike other methods (e.g., Featurizer.drop_keys() and Featurizer.upsert_keys()), Featurizer.get_keys() does not honor candidate classes in context but returns all feature keys no matter which candidate class they are associated with. This is confusing.

See https://github.com/HazyResearch/fonduer/issues/511#issuecomment-696618392 for how this actually confused a user.

To Reproduce

This is a design error.

Expected behavior

These methods should behave similarly. Either

None of these honor candidate classes, or

All of these honor them.

Error Logs/Screenshots

N/A

Environment (please complete the following information)

Fonduer Version: 0.8.3

opened by HiromuHota 0

Releases(v0.9.0)

v0.9.0(Jun 23, 2021)
0.9.0 - 2021-06-22

This is a long-awaited release with some performance improvements and some breaking changes. See the changelog for details.

Added

@HiromuHota: Support spaCy v2.3. (#506)

@HiromuHota: Add HOCRDocPreprocessor and HocrVisualLinker to support hOCR as input file. (#476) (#519)

@YasushiMiyata: Add multiline Japanese strings support to fonduer.parser.visual_parser.hocr_visual_parser. (#534) (#542)

@YasushiMiyata: Add commit process immediately after add to fonduer.parser.Parser. (#494) (#544)

Changed

@HiromuHota: Renamed VisualLinker to PdfVisualParser, which assumes the following: (#518)

pdf_path should be a directory path, where PDF files exist, and cannot be a file path.

The PDF file should have the same basename (os.path.basename) as the document. E.g., the PDF file should be either "123.pdf" or "123.PDF" for "123.html".
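
The two assumptions above can be illustrated with a small lookup helper. This is a sketch of the stated naming rule only, not PdfVisualParser's implementation; find_pdf and the example paths are hypothetical.

```python
import os
import tempfile
from typing import Optional


def find_pdf(pdf_dir: str, document_path: str) -> Optional[str]:
    """Look in pdf_dir for a PDF with the same basename as the document."""
    stem = os.path.splitext(os.path.basename(document_path))[0]
    for ext in (".pdf", ".PDF"):
        candidate = os.path.join(pdf_dir, stem + ext)
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as pdf_dir:
    # Create an empty placeholder PDF named after the document "123.html".
    open(os.path.join(pdf_dir, "123.pdf"), "wb").close()
    found = find_pdf(pdf_dir, "/data/html/123.html")
    missing = find_pdf(pdf_dir, "/data/html/456.html")
```

Here pdf_dir is a directory, never a single file, and the match depends only on the basename of the document, which is why "123.html" pairs with "123.pdf" regardless of where the HTML file lives.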

@HiromuHota: Changed Parser's signature as follows: (#518)

Renamed vizlink to visual_parser.

Removed pdf_path. Now this is required only by PdfVisualParser.

Removed visual. Provide visual_parser if visual information is to be parsed.

@YasushiMiyata: Changed UDFRunner's and UDF's data commit process as follows: (#545)

Removed add process on single-thread in _apply in UDFRunner.

Added UDFRunner._add of y on multi-threads to Parser, Labeler and Featurizer.

Removed y of document parsed result from out_queue in UDF.

Fixed

@YasushiMiyata: Fix test code test_postgres.py::test_cand_gen_cascading_delete. (#538) (#539)

@HiromuHota: Process the tail text only after child elements. (#333) (#520)

v0.8.3(Sep 11, 2020)
0.8.3 - 2020-09-11

This is a big release with a lot of changes. These changes are summarized here. Check the Changelog for more details.

Added

@YasushiMiyata: Add get_max_row_num to fonduer.utils.data_model_utils.tabular. (#469) (#480)

@HiromuHota: Add get_bbox() to Sentence and SpanMention. (#429)

@HiromuHota: Add a custom MLflow model that allows you to package a Fonduer model. See here for how to use it. (#259) (#407)

@HiromuHota: Support spaCy v2.2. (#384) (#432)

@wajdikhattel: Add multinary candidates. (#455) (#456)

@HiromuHota: Add nullables to candidate_subclass() to allow NULL mention in a candidate. (#496) (#497)

@HiromuHota: Copy textual functions in data_model_utils.tabular to data_model_utils.textual. (#503) (#505)

Changed

@YasushiMiyata: Enable RegexMatchSpan to match words concatenated with a separator via the sep="(separator)" option. (#270) (#492)

@HiromuHota: Enabled "Type hints (PEP 484) support for the Sphinx autodoc extension." (#421)

@HiromuHota: Switched the Cython wrapper for Mecab from mecab-python3 to fugashi. Since the Japanese tokenizer remains the same, there should be no impact on users. (#384) (#432)

@HiromuHota: Log a stack trace on parsing error for better debug experience. (#478) (#479)

@HiromuHota: get_cell_ngrams and get_neighbor_cell_ngrams yield nothing when the mention is not tabular. (#471) (#504)

Deprecated

@HiromuHota: Deprecated bbox_from_span and bbox_from_sentence. (#429)

@HiromuHota: Deprecated visualizer.get_box in favor of span.get_bbox(). (#445) (#446)

@HiromuHota: Deprecate textual functions in data_model_utils.tabular. (#503) (#505)

Fixed

@senwu: Fix pdf_path cannot be without a trailing slash. (#442) (#459)

@kaikun213: Fix bug in table range difference calculations. (#420)

@HiromuHota: mention_extractor.apply with clear=True now works even if it's not the first run. (#424)

@HiromuHota: Fix get_horz_ngrams and get_vert_ngrams so that they work even when the input mention is not tabular. (#425) (#426)

@HiromuHota: Fix the order of args to Bbox. (#443) (#444)

@HiromuHota: Fix the non-deterministic behavior in VisualLinker. (#412) (#458)

@HiromuHota: Fix an issue that the progress bar shows no progress on preprocessing by executing preprocessing and parsing in parallel. (#439)

@HiromuHota: Adapt to mlflow>=1.9.0. (#461) (#463)

@HiromuHota: Correct the entity type for NumberMatcher from "NUMBER" to "CARDINAL". (#473) (#477)

@HiromuHota: Fix _get_axis_ngrams not to return None when the input is not tabular. (#481)

@HiromuHota: Fix Visualizer.display_candidates not to draw rectangles on wrong pages. (#488)

@HiromuHota: Persist doc only when no error happens during parsing. (#489) (#490)

v0.8.2(Apr 29, 2020)
0.8.2 - 2020-04-28

A summary of the changes in this release is below. Check the Changelog for more details.

Deprecated

@HiromuHota: Use of undecorated labeling functions is deprecated and will not be supported as of v0.9.0. Please decorate them with snorkel.labeling.labeling_function.

Fixed

@HiromuHota: Labeling functions can now be decorated with snorkel.labeling.labeling_function. (#400) (#401)

v0.8.1(Apr 13, 2020)
0.8.1 - 2020-04-13

A summary of the changes in this release is below. Check the Changelog for more details.

Fonduer has a new mode argument to support switching between different learning modes (e.g., STL or MTL).

Example usage:

# Create task for each relation.
tasks = create_task(
    task_names=TASK_NAMES,
    n_arities=N_ARITIES,
    n_features=N_FEATURES,
    n_classes=N_CLASSES,
    emb_layer=EMB_LAYER,
    model="LogisticRegression",
    mode=MODE,
)

Added

@senwu: Add mode argument in create_task to support STL and MTL.


v0.8.0(Apr 8, 2020)

0.8.0 - 2020-04-07

A summary of the changes in this release is below. Check the Changelog for more details.

Rather than maintaining a separate learning engine, we switch to Emmental, a deep learning framework for multi-task learning. Switching to a more general learning framework allows Fonduer to support more applications and multi-task learning.

Example usage:

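# With Emmental, you need to do the following steps to perform learning:
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
# 2. Wrap candidates into an EmmentalDataLoader for training.
# 3. Training and inference (prediction).

# The import paths below follow the Fonduer tutorials and are an assumption
# here; adjust them to the installed Fonduer and Emmental versions.
# ATTRIBUTE, train_cands, test_cands, F_train, F_test and train_marginals
# come from the earlier pipeline steps.
import emmental
import fonduer
import numpy as np
from emmental.data import EmmentalDataLoader
from emmental.learner import EmmentalLearner
from emmental.model import EmmentalModel
from emmental.modules.embedding_module import EmbeddingModule
from fonduer.learning.dataset import FonduerDataset
from fonduer.learning.task import create_task
from fonduer.learning.utils import collect_word_counter

# Collect the word counter from candidates, which is used in the LSTM model.
word_counter = collect_word_counter(train_cands)

# Initialize Emmental. To customize Emmental, see:
# https://emmental.readthedocs.io/en/latest/user/config.html
emmental.init(fonduer.Meta.log_path)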

#######################################################################
# 1. Create a task for each relation and an EmmentalModel to learn those tasks.
#######################################################################

# Generate special tokens which are used for LSTM model to locate mentions.
# In LSTM model, we pad sentence with special tokens to help LSTM to learn
# those mentions. Example:
# Original sentence: Then Barack married Michelle.
# ->  Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
arity = 2
special_tokens = []
for i in range(arity):
    special_tokens += [f"~~[[{i}", f"{i}]]~~"]

# Generate word embedding module for LSTM.
emb_layer = EmbeddingModule(
    word_counter=word_counter, word_dim=300, specials=special_tokens
)

# Create task for each relation.
tasks = create_task(
    ATTRIBUTE,
    2,
    F_train[0].shape[1],
    2,
    emb_layer,
    mode="mtl",
    model="LogisticRegression",
)

# Create Emmental model to learn the tasks.
model = EmmentalModel(name=f"{ATTRIBUTE}_task")

# Add tasks into model
for task in tasks:
    model.add_task(task)

#######################################################################
# 2. Wrap candidates into EmmentalDataLoader for training.
#######################################################################

# Here we only use the samples that have labels, i.e., we filter out the
# samples that don't have significant marginals.
diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
train_idxs = np.where(diffs > 1e-6)[0]

# Create a dataloader with weakly supervised samples to learn the model.
train_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE,
        train_cands[0],
        F_train[0],
        emb_layer.word2id,
        train_marginals,
        train_idxs,
    ),
    split="train",
    batch_size=100,
    shuffle=True,
)


# Create the test dataloader for prediction.
test_dataloader = EmmentalDataLoader(
    task_to_label_dict={ATTRIBUTE: "labels"},
    dataset=FonduerDataset(
        ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
    ),
    split="test",
    batch_size=100,
    shuffle=False,
)


#######################################################################
# 3. Training and inference (prediction).
#######################################################################

# Learn those tasks.
emmental_learner = EmmentalLearner()
emmental_learner.learn(model, [train_dataloader])

# Predict based on the learned model.
test_preds = model.predict(test_dataloader, return_preds=True)

Changed

@senwu: Switch to Emmental as the default learning engine.
@HiromuHota: Change ABSTAIN to -1 to be compatible with Snorkel of 0.9.X. Accordingly, user-defined labels should now be 0-indexed (used to be 1-indexed). (#310) (#320)
@HiromuHota: Use executemany_mode="batch" instead of deprecated use_batch_mode=True. (#358)
@HiromuHota: Use tqdm.notebook.tqdm instead of deprecated tqdm.tqdm_notebook. (#360)
@HiromuHota: To support ImageMagick7, expand the version range of Wand. (#373)
@HiromuHota: Comply with PEP 561 for type-checking codes that use Fonduer.
@HiromuHota: Make UDF.apply of all child classes unaware of the database backend, meaning PostgreSQL is not required if UDF.apply is directly used instead of UDFRunner.apply. (#316) (#368)

Fixed

@senwu: Fix mention extraction to return mention classes instead of data model classes.


Owner

HazyResearch

We are a CS research group led by Prof. Chris Ré.

GitHub https://fonduer.readthedocs.io/


11 Mar 20, 2022
