A knowledge base construction engine for richly formatted data

Overview

Fonduer

Fonduer is a Python package and framework for building knowledge base construction (KBC) applications from richly formatted data.

Note that Fonduer is still actively under development, so feedback and contributions are welcome. Submit bugs in the Issues section or feel free to submit your contributions as a pull request.

Getting Started

Check out our Getting Started Guide to get up and running with Fonduer.

Learning how to use Fonduer

The Fonduer tutorials cover the Fonduer workflow, showing how to extract relations from hardware datasheets and scientific literature.

Reference

Fonduer: Knowledge Base Construction from Richly Formatted Data (blog):

@inproceedings{wu2018fonduer,
  title={Fonduer: Knowledge Base Construction from Richly Formatted Data},
  author={Wu, Sen and Hsiao, Luke and Cheng, Xiao and Hancock, Braden and Rekatsinas, Theodoros and Levis, Philip and R{\'e}, Christopher},
  booktitle={Proceedings of the 2018 International Conference on Management of Data},
  pages={1301--1316},
  year={2018},
  organization={ACM}
}

Acknowledgements

Fonduer leverages the work of Emmental and Snorkel.

Comments
  • Using candidates for prediction (Fonduer Prediction Pipeline)

    Scenario:

    For my use case I have a set of financial documents.

    The entire document set is divided into train, dev, and test splits. The documents are parsed, and the mentions and candidates are extracted with some rules.

    The featurized training candidates are used to train a Fonduer learning model, and the model is used to predict on the test candidates, following the normal Fonduer pipeline demonstrated in the hardware tutorial.

    Problems & Questions

    1. Is the Fonduer prediction pipeline production ready? How can we fine-tune it to achieve better accuracy? Should the main focus be on the quality of the extracted mentions?

    In my initial analysis, following the hardware tutorial, I could not obtain good results.

    2. Can we separate the training and test pipelines?

    In the current scenario, when I feed a new document for prediction, the entire corpus has to be parsed again to extract the mentions and candidates and to store the feature keys.

    Please correct me if that is not the case, and help me with a snippet that shows the separation.

    opened by atulgupta9 16
  • Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537

    Description of the problems or issues

    Is your pull request related to a problem? Please describe. See #534. This request redoes #537, which first required #538 to be fixed (fixed by #539).

    Does your pull request fix any issue. See #534

    Description of the proposed changes

    For multi-line Japanese strings such as 'AAAA\nBBBB', spacy[ja] sometimes generates the tokens ['AAA', 'AB', 'B', 'BB']. This proposal defines the bbox of 'AB' as that of a multi-line word (i.e., left is the min left of ['A', 'B'], top is the top of 'A', right is the max right of ['A', 'B'], and bottom is the bottom of 'B').
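
The bounding-box rule described above can be sketched as follows. This is a minimal illustration rather than the code in the pull request; `Bbox` and `merge_multiline_bbox` are hypothetical names, and the coordinates are made up.

```python
from typing import List, NamedTuple


class Bbox(NamedTuple):
    """Hypothetical box type: pixel coordinates with the origin at top-left."""

    left: int
    top: int
    right: int
    bottom: int


def merge_multiline_bbox(char_boxes: List[Bbox]) -> Bbox:
    """Merge per-character boxes, given in reading order, into one word box."""
    if not char_boxes:
        raise ValueError("char_boxes must not be empty")
    return Bbox(
        left=min(b.left for b in char_boxes),    # leftmost edge of any character
        top=char_boxes[0].top,                   # top of the first character
        right=max(b.right for b in char_boxes),  # rightmost edge of any character
        bottom=char_boxes[-1].bottom,            # bottom of the last character
    )


# 'A' ends the first line and 'B' starts the second line.
a_box = Bbox(left=300, top=10, right=320, bottom=30)
b_box = Bbox(left=100, top=40, right=120, bottom=60)
ab_box = merge_multiline_bbox([a_box, b_box])
```

The merged box spans both lines, so it is wider and taller than either character box on its own.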

    Test plan

    This is caused by Japanese morphological analysis, so I have added Japanese test data to 'tests/data/hocr_simple/japan.hocr' and test code to 'tests/parser/test_parser.py::test_parse_hocr'.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.
    opened by YasushiMiyata 15
  • parser.apply does not return for a long time even though the progress bar indicates it finishes parsing

    Description of the bug

    This is not a bug but a performance issue. It is not noticeable when parsing a small number of documents, but with many documents parser.apply does not return even though the progress bar indicated that parsing finished a long time ago (1 hour or more).

    To Reproduce

    Steps to reproduce the behavior:

    1. Parse many documents (my case: ~2500)

    Expected behavior

    parser.apply returns when the progress bar indicates it finished parsing all the documents.

    Environment (please complete the following information)

    • OS: Debian Buster
    • PostgreSQL Version: 12.1
    • Poppler Utils Version: N/A
    • Fonduer Version: 0.8.3+dev (01e0d9319b523aff7aa7f5c583a9f330b0705ecc)

    bug 
    opened by HiromuHota 14
  • Execute preprocessing and parsing in parallel

    Description of the problems or issues

    Is your pull request related to a problem? Please describe.

    Currently, the preprocessor and the parser are executed strictly sequentially, i.e., preprocess N docs (and load them into a queue), then parse N docs. This has two drawbacks:

    1. the progress bar shows nothing during preprocessing.
    2. the machine RAM has to be large enough to hold N preprocessed docs at a time.

    They become more serious when N is large and/or each HTML file is large.

    Does your pull request fix any issue.

    Fix #435

    Description of the proposed changes

    This PR

    • places a cap on in_queue so that only a certain number of documents are loaded into it at a time.
    • executes the preprocessor and the parser in parallel (i.e., the main process does preprocessing while child process(es) do parsing).
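
The two changes above amount to a bounded producer/consumer pipeline. The sketch below illustrates the idea with threads and the standard `queue` module; it is not Fonduer's implementation (which uses child processes), and `preprocess`, `parse`, and `run` are hypothetical stand-ins.

```python
import queue
import threading
from typing import Iterable, List

_SENTINEL = object()  # marks the end of the document stream


def preprocess(paths: Iterable[str], in_queue: queue.Queue) -> None:
    """Producer: put() blocks while the queue is full, which caps memory use."""
    for path in paths:
        in_queue.put(f"<html>{path}</html>")  # stand-in for real preprocessing
    in_queue.put(_SENTINEL)


def parse(in_queue: queue.Queue, results: List[str]) -> None:
    """Consumer: parses documents as soon as they become available."""
    while True:
        doc = in_queue.get()
        if doc is _SENTINEL:
            break
        results.append(doc.upper())  # stand-in for real parsing


def run(paths: List[str], max_queued: int = 2) -> List[str]:
    in_queue: queue.Queue = queue.Queue(maxsize=max_queued)  # the cap
    results: List[str] = []
    worker = threading.Thread(target=parse, args=(in_queue, results))
    worker.start()
    preprocess(paths, in_queue)  # the main thread preprocesses meanwhile
    worker.join()
    return results


parsed = run([f"doc{i}" for i in range(5)])
```

Because the consumer starts before preprocessing finishes, progress can be reported from the first document onward, and at most `max_queued` preprocessed documents are held in memory at once.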

    Test plan

    For the 1st issue: I manually checked that the progress bar starts showing progress right after parser.apply is started.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [ ] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.
    enhancement 
    opened by HiromuHota 13
  • [Errno 32] Broken pipe for Parser in parallel execution on OSX

    Hi,

    In fonduer-tutorials, after running cell:

    corpus_parser = OmniParser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
    %time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)
    

    whenever PARALLEL is smaller than max_docs, I get:

    Traceback (most recent call last):
      File "/anaconda3/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
        send_bytes(obj)
      File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 398, in _send_bytes
        self._send(buf)
      File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 368, in _send
        n = write(self._handle, buf)
    BrokenPipeError: [Errno 32] Broken pipe
    

    Otherwise (with PARALLEL greater than or equal to max_docs), the result is empty tables in PostgreSQL. With parallelization turned off, it works.

    Best regards

    bug 
    opened by mladvladimir 13
  • Feat/multary candidates

    Description of the problems or issues

    The feature extraction only supports unary and binary candidates.

    Does your pull request fix any issue. Closes #455

    Description of the proposed changes

    Add new functions that support multary relations between spans for the feature extraction.

    Test plan

    Use a candidate with more than two mentions, and try the feature extraction part.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.

    Note:

    For multary candidates to work with the textual features, we need a new version of treedlib based on this PR: treedlib#46. So if you can contact them, please do.

    Also, it would help if someone could jump in to improve the coverage; I can't get the tabular_features coverage up.

    enhancement 
    opened by wajdikhattel 12
  • Add HOCRDocPreprocessor and HocrVisualParser

    Description of the problems or issues

    Is your pull request related to a problem? Please describe.

    This is the second patch that follows #518 .

    Does your pull request fix any issue.

    N/A.

    Description of the proposed changes

    Add HOCRDocPreprocessor and HocrVisualParser

    Test plan

    I added a few real hOCR example files.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.
    enhancement 
    opened by HiromuHota 9
  • Duplicate key error while adding two mentions which are same

    Suppose I have two mention types (for example, zip code and tax code) whose matchers both return true for the same span in a document (both check a 5-digit regex match). In that case, I think Fonduer throws the error below. Please help me resolve this.

    
    sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "context_stable_id_key"
    DETAIL:  Key (stable_id)=(1443208965_10_subset::span_mention:23313:23321) already exists.
    
    [SQL: INSERT INTO context (type, stable_id) VALUES (%(type)s, %(stable_id)s) RETURNING context.id]
    
    opened by saikalyan9981 9
  • unable to read images in the pdf file

    Hi,

    I am passing HTML to Fonduer, and it reports that it is unable to read images from figures. I converted a PDF to HTML with pdftotree and passed that HTML to Fonduer. Is this an issue with pdftotree not being able to render images? I want to know the mechanism for linking or embedding images in the HTML so that Fonduer can read them.

    Please help or advise, as I am stuck on this issue.

    opened by ashleo25 8
  • Non-deterministic behavior in featurization

    Describe the bug

    When working with a large (~7k docs) corpus of hardware datasheets and extracting multiple relations, we expect the features for each candidate to be deterministic between runs, even more so with parallelism=1 set in the Featurizer. However, we find that there can be a small number (e.g., < 5) of differences between feature tables, resulting in slightly different sparse matrices and thus slightly different results.

    To Reproduce

    Running on the HACK transistor dataset will reproduce the error. However, it takes a long time, and we haven't yet found a minimal example that reproduces it. Attached are two feature table dumps from two different runs with parallelism=1. Note that there is only a single difference, on line 65454.

    feature_table.tar.gz

    Note that it isn't always one difference, and the difference is not deterministic. The attached difference is just an example.

    Expected behavior

    We would expect these feature tables to be identical between runs.

    Error Logs/Screenshots

    For convenience, here is the differing line in screenshot form: [image]

    Additional context

    If the issue is in the UDF implementation, this might affect the Labeler in addition to the Featurizer, since they share a lot of the UDF code.

    bug 
    opened by lukehsiao 8
  • Type hints

    Is your feature request related to a problem? Please describe.

    I'm always frustrated when I have to look at the source code to check the types of arguments and return values.

    Describe the solution you'd like

    1. Type hints (PEP 484) are written in the source code, like
    def greeting(name: str) -> str:
        return 'Hello ' + name
    
    2. (Eventually) enforce type checking during pre-commit, for example with flake8-mypy.

    Describe alternatives you've considered

    Depending on the editor (PyCharm, etc.), type/rtype documentation like the example below gives you type hinting. However, I'm not sure this is equivalent to type hints (PEP 484).

    def greeting(name):
        """
        greeting
    
        :param name: description
        :type name: type description
        :return: description
        :rtype: type description
        """
        return 'Hello ' + name
    

    enhancement help wanted 
    opened by HiromuHota 8
  • CandidateExtractor doesn't scale for larger relations

    Hello, thanks for providing this framework. My group has run into a bit of a snag:

    For context, we've successfully completed candidate extraction & labeling for binary relations, with reasonable runtimes. With parallelism = 6, candidate extraction takes ~2 minutes per document.

    We've since moved on to a 3-ary relation that is very similar to the binary relation. This 3-ary relation shares some mentions with the binary relation and uses a very similar candidate extractor. We have done performance testing for the 3-ary throttler function and found it to have a very similar runtime to the binary throttler. Candidate extraction now takes 4 hours per document. This immense slowdown occurs because the number of temporary candidates is the product of the mention counts, so it is multiplied again by each entity added to a relation.

    Here are some numbers from our use case:

    • Mention A: 800 mentions found
    • Mention B: 140 mentions found
    • Mention C: 150 mentions found

    If our relation only includes (A,B), we have a total of 800*140 = 112,000 temporary candidates to evaluate with our throttler. Should we add mention C to form the relation (A,B,C), our total now grows to 800*140*150 = 16.8 million temporary candidates. We're unable to narrow our mention matchers further without excluding true positives.
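
The counts above follow from multiplying the number of mentions per entity type, since every combination becomes a temporary candidate before throttling. A small sketch of that arithmetic (the function name is illustrative and not part of Fonduer):

```python
from math import prod
from typing import Sequence


def temp_candidate_count(mention_counts: Sequence[int]) -> int:
    """Size of the Cartesian product of the mention sets, before throttling."""
    return prod(mention_counts)


binary_count = temp_candidate_count([800, 140])        # relation (A, B)
ternary_count = temp_candidate_count([800, 140, 150])  # relation (A, B, C)
growth_factor = ternary_count // binary_count          # equals the count of C
```

Adding an entity type multiplies the work by that type's mention count, so even a cheap throttler is called 150 times more often in the 3-ary case.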

    This slowdown makes the Fonduer framework effectively unusable for any large-scale use case that requires relations with more than 2 entities. Can you provide guidance to address this issue?

    opened by robbieculkin 1
  • Tables aren't redefined for re-runs of UDF apply

    Description of the bug

    As part of iterative development in a Jupyter environment, apply may be re-run several times. The developer might need to update candidates or create a new labeling function, for example. When this happens, the corresponding Postgres table is cleared but not dropped. This means that the definition of the table cannot change to accommodate the updated parameters for apply.

    To Reproduce

    Steps to reproduce the behavior:

    1. Run the max_storage_temp_tutorial notebook in fonduer-tutorials, up to and including the Labeling Functions section.
    2. Add a new LF, doesn't need to do anything in particular (could return ABSTAIN every time). Add this to the stg_temp_lfs list.
    3. Re-run the remainder of cells in the section.

    Upon calling LFAnalysis, the following exception is thrown:

    ValueError: Number of LFs (7) and number of LF matrix columns (6) are different
    

    Expected behavior

    Underlying tables for a re-run of a UDF apply method should not only be cleared, but dropped.

    Error Logs/Screenshots

    Full stack trace:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-62-e005feee6300> in <module>
          5 sorted_lfs = sorted(lfs, key=lambda lf: lf.name)
          6 
    ----> 7 LFAnalysis(L=L_train[0], lfs=sorted_lfs).lf_summary(Y=L_gold_train[0].reshape(-1))
    
    ~/.venv/lib/python3.7/site-packages/snorkel/labeling/analysis.py in __init__(self, L, lfs)
         44             if len(lfs) != self._L_sparse.shape[1]:
         45                 raise ValueError(
    ---> 46                     f"Number of LFs ({len(lfs)}) and number of "
         47                     f"LF matrix columns ({self._L_sparse.shape[1]}) are different"
         48                 )
    
    ValueError: Number of LFs (7) and number of LF matrix columns (6) are different
    

    Environment (please complete the following information)

    • OS: Ubuntu 18.04
    • PostgreSQL Version: 12.1
    • Poppler Utils Version: 0.71.0-5
    • Fonduer Version: 0.8.3

    Additional context

    https://github.com/HazyResearch/fonduer/issues/263#issuecomment-527588765 advises restarting Python, but this does not appear to solve the problem.

    opened by robbieculkin 5
  • Parser is not splitting multiple lines sentences properly

    Description of the bug

    I'm trying to train a model that can build a knowledge base from the OPC UA Companions specification as part of my thesis. I have the dataset as PDFs and used a third-party program to convert them into HTML, trying my best to preserve the data structure information (I get the same result even if I parse the PDFs alone).

    Then I followed the hardware_fonduer_model tutorial to extract the candidates accordingly.

    The problem is that the parser splits the sentences wrongly: it treats the end of a line as the end of a sentence. I tried to debug using a SimpleParser.split_sentences(text) command, and it turned out that Python needs a backslash to split a statement into multiple lines.

    So I thought I could use the replacements=['[\n]', ' '] parameter so that the split would work better, but I'm getting ValueError: too many values to unpack (expected 2). What is the default configuration for the sentence segmentation?
    How can I get multiple sentences as one mention? (I tried MentionNgram up to n_max=100 and still get just one.)
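
The ValueError quoted above is consistent with the replacements argument being iterated as (pattern, replacement) pairs. The sketch below reproduces the error without Fonduer, under that assumption; `compile_replacements` and `apply_replacements` are hypothetical helpers, not Fonduer functions.

```python
import re
from typing import List, Pattern, Tuple


def compile_replacements(replacements) -> List[Tuple[Pattern[str], str]]:
    """Compile (pattern, replacement) pairs; each item must unpack into two."""
    compiled = []
    for pattern, replacement in replacements:
        compiled.append((re.compile(pattern, flags=re.UNICODE), replacement))
    return compiled


def apply_replacements(text: str, compiled) -> str:
    for pattern, replacement in compiled:
        text = pattern.sub(replacement, text)
    return text


# A list containing one pair works: newlines become spaces.
pairs = compile_replacements([("[\n]", " ")])
cleaned = apply_replacements("by this move command\nrequest.", pairs)

# A flat list of two strings fails: the 3-character string "[\n]" is itself
# unpacked into (pattern, replacement), which raises the ValueError.
try:
    compile_replacements(["[\n]", " "])
    error = None
except ValueError as exc:
    error = str(exc)
```

If the assumption holds, passing a list of tuples such as replacements=[('[\n]', ' ')] would avoid the unpacking error.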

    I would really appreciate hearing back from you.

    Many thanks in advance.

    Example: Text to be parsed

    Boolean indicating if a profile /signature should be generated by this move command request.If the optional VariableSignatureRequestStatus is not provided on the Object, this parameter is ignored by the Server.

    Expected behavior

    sentence 1 : Boolean indicating if a profile /signature should be generated by this move command request.
    sentence 2 : If the optional VariableSignatureRequestStatus is not provided on the Object, this parameter is ignored by the Server.

    Actual behavior

    sentence 1 : Boolean indicating if a profile /signature should be generated by this move command
    sentence 2 : request.
    sentence 3 : request.If the optional VariableSignatureRequestStatus is not provided on the Object, this
    sentence 4 : parameter is ignored by the Server.

    Environment

    opened by eng-khaled1 3
  • Suggestion required: Getting error while applying Featurizer

    @SenWu @HiromuHota, could you please tell me whether my analysis is right?

    I am getting this error:

    File "abcd./anaconda3/lib/python3.7/site-packages/fonduer/utils/data_model_utils/structural.py", line 55, in _get_node
        return doc_etree.xpath(sentence.xpath)[0]
    IndexError: list index out of range

    I am following the hardware tutorial on some email HTML messages and getting a mention count near 4000.

    Also, the following steps returned outputs:

    train_cands = candidate_extractor.get_candidates(split=0)
    dev_cands = candidate_extractor.get_candidates(split=1)
    test_cands = candidate_extractor.get_candidates(split=2)

    But on applying the featurizer:

    featurizer.apply(split=0, train=True, parallelism=PARALLEL)

    I get the error mentioned at the top.

    I looked on Stack Overflow, where the suggested cause is an HTML syntax issue, but that does not seem to apply here because the HTML renders fine in a browser. So can you share your thoughts on the following:

    1. can it be because no candidates being generated? or
    2. something else

    Thanks.

    opened by AshutoshUpadhya 3
  • How can i extract a paragraph and all associated sentences in document

    How can I extract a paragraph and all of its associated sentences from a document?
    Basically, I need paragraphs together with their associated sentences. @lukehsiao @SenWu @vincentschen @ZZWENG @stephenbach

    Appreciate your help

    needs-info 
    opened by ashleo25 1
  • Featurizer.get_keys() does not honor candidate classes in context

    Description of the bug

    Unlike other methods (e.g., Featurizer.drop_keys() and Featurizer.upsert_keys()), Featurizer.get_keys() does not honor candidate classes in context but returns all feature keys no matter which candidate class they are associated with. This is confusing.

    See https://github.com/HazyResearch/fonduer/issues/511#issuecomment-696618392 for how this actually confused a user.

    To Reproduce

    This is a design error.

    Expected behavior

    These methods should behave similarly. Either

    • None of these honor candidate classes, or
    • All of these honor them.

    Error Logs/Screenshots

    N/A

    Environment (please complete the following information)

    • Fonduer Version: 0.8.3

    opened by HiromuHota 0
Releases(v0.9.0)
  • v0.9.0(Jun 23, 2021)

    0.9.0 - 2021-06-22

    This is a long-awaited release with some performance improvements and some breaking changes. See the changelog for details.

    Added

    Changed

    • @HiromuHota: Renamed VisualLinker to PdfVisualParser, which assumes the following: (#518)

      • pdf_path should be a directory path, where PDF files exist, and cannot be a file path.
      • The PDF file should have the same basename (os.path.basename) as the document. E.g., the PDF file should be either "123.pdf" or "123.PDF" for "123.html".
    • @HiromuHota: Changed Parser's signature as follows: (#518)

      • Renamed vizlink to visual_parser.
      • Removed pdf_path. Now this is required only by PdfVisualParser.
      • Removed visual. Provide visual_parser if visual information is to be parsed.
    • @YasushiMiyata: Changed UDFRunner's and UDF's data commit process as follows: (#545)

      • Removed add process on single-thread in _apply in UDFRunner.
      • Added UDFRunner._add of y on multi-threads to Parser, Labeler and Featurizer.
      • Removed y of document parsed result from out_queue in UDF.
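
The PdfVisualParser assumptions listed above (pdf_path is a directory, and the PDF shares the document's basename with a ".pdf" or ".PDF" extension) can be illustrated with a small lookup. This is a sketch of the naming convention only, not Fonduer's code; `find_pdf` is a hypothetical helper.

```python
import os
import tempfile
from typing import Optional


def find_pdf(pdf_dir: str, document_name: str) -> Optional[str]:
    """Return the PDF path matching the document's basename, or None."""
    if not os.path.isdir(pdf_dir):
        raise ValueError(f"pdf_path must be a directory: {pdf_dir}")
    stem = os.path.splitext(os.path.basename(document_name))[0]
    for ext in (".pdf", ".PDF"):  # the two spellings named in the changelog
        candidate = os.path.join(pdf_dir, stem + ext)
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "123.PDF"), "w").close()  # empty placeholder file
    found = find_pdf(tmp, "123.html")
    missing = find_pdf(tmp, "456.html")

found_name = os.path.basename(found) if found else None
```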

    Fixed

    Source code(tar.gz)
    Source code(zip)
    fonduer-0.9.0-py3-none-any.whl(146.07 KB)
    fonduer-0.9.0.tar.gz(102.10 KB)
  • v0.8.3(Sep 11, 2020)

    0.8.3 - 2020-09-11

    This is a big release with a lot of changes. These changes are summarized here. Check the Changelog for more details.

    Added

    Changed

    • @YasushiMiyata: Enabled RegexMatchSpan to match concatenated words via the sep="(separator)" option. (#270) (#492)
    • @HiromuHota: Enabled "Type hints (PEP 484) support for the Sphinx autodoc extension." (#421)
    • @HiromuHota: Switched the Cython wrapper for Mecab from mecab-python3 to fugashi. Since the Japanese tokenizer remains the same, there should be no impact on users. (#384) (#432)
    • @HiromuHota: Log a stack trace on parsing error for better debug experience. (#478) (#479)
    • @HiromuHota: get_cell_ngrams and get_neighbor_cell_ngrams yield nothing when the mention is not tabular. (#471) (#504)

    Deprecated

    Fixed

    • @senwu: Fix pdf_path cannot be without a trailing slash. (#442) (#459)
    • @kaikun213: Fix bug in table range difference calculations. (#420)
    • @HiromuHota: mention_extractor.apply with clear=True now works even if it's not the first run. (#424)
    • @HiromuHota: Fix get_horz_ngrams and get_vert_ngrams so that they work even when the input mention is not tabular. (#425) (#426)
    • @HiromuHota: Fix the order of args to Bbox. (#443) (#444)
    • @HiromuHota: Fix the non-deterministic behavior in VisualLinker. (#412) (#458)
    • @HiromuHota: Fix an issue that the progress bar shows no progress on preprocessing by executing preprocessing and parsing in parallel. (#439)
    • @HiromuHota: Adapt to mlflow>=1.9.0. (#461) (#463)
    • @HiromuHota: Correct the entity type for NumberMatcher from "NUMBER" to "CARDINAL". (#473) (#477)
    • @HiromuHota: Fix _get_axis_ngrams not to return None when the input is not tabular. (#481)
    • @HiromuHota: Fix Visualizer.display_candidates not to draw rectangles on wrong pages. (#488)
    • @HiromuHota: Persist doc only when no error happens during parsing. (#489) (#490)
    Source code(tar.gz)
    Source code(zip)
    fonduer-0.8.3-py3-none-any.whl(136.97 KB)
    fonduer-0.8.3.tar.gz(99.00 KB)
  • v0.8.2(Apr 29, 2020)

    0.8.2 - 2020-04-28

    A summary of the changes in this release is below. Check the Changelog for more details.

    Deprecated

    • @HiromuHota: Use of undecorated labeling functions is deprecated and will not be supported as of v0.9.0. Please decorate them with snorkel.labeling.labeling_function.

    Fixed

    • @HiromuHota: Labeling functions can now be decorated with snorkel.labeling.labeling_function. (#400) (#401)
    Source code(tar.gz)
    Source code(zip)
    fonduer-0.8.2-py3-none-any.whl(126.83 KB)
    fonduer-0.8.2.tar.gz(88.07 KB)
  • v0.8.1(Apr 13, 2020)

    0.8.1 - 2020-04-13

    A summary of the changes in this release is below. Check the Changelog for more details.

    Fonduer has a new mode argument to support switching between different learning modes (e.g., STL or MTL).

    Example usage:
    # Create task for each relation.
    tasks = create_task(
        task_names = TASK_NAMES,
        n_arities = N_ARITIES,
        n_features = N_FEATURES,
        n_classes = N_CLASSES,
        emb_layer = EMB_LAYER,
        model="LogisticRegression",
        mode = MODE,
    )
    

    Added

    • @senwu: Add mode argument in create_task to support STL and MTL.
    Source code(tar.gz)
    Source code(zip)
    fonduer-0.8.1-py3-none-any.whl(128.52 KB)
    fonduer-0.8.1.tar.gz(87.80 KB)
  • v0.8.0(Apr 8, 2020)

    0.8.0 - 2020-04-07

    A summary of the changes in this release is below. Check the Changelog for more details.

    Rather than maintaining a separate learning engine, we switch to Emmental, a deep learning framework for multi-task learning. Switching to a more general learning framework allows Fonduer to support more applications and multi-task learning.

    Example usage:
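    # With Emmental, learning takes the following steps:
    # 1. Create a task for each relation and an EmmentalModel to learn those tasks.
    # 2. Wrap candidates into an EmmentalDataLoader for training.
    # 3. Train and run inference (prediction).
    
    import emmental
    import fonduer
    import numpy as np
    from emmental.data import EmmentalDataLoader
    from emmental.learner import EmmentalLearner
    from emmental.model import EmmentalModel
    from emmental.modules.embedding_module import EmbeddingModule
    from fonduer.learning.dataset import FonduerDataset
    from fonduer.learning.task import create_task
    from fonduer.learning.utils import collect_word_counter
    
    # Collect a word counter from the candidates; it is used by the LSTM model.
    word_counter = collect_word_counter(train_cands)
    
    # Initialize Emmental. To customize Emmental, see:
    # https://emmental.readthedocs.io/en/latest/user/config.html
    emmental.init(fonduer.Meta.log_path)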
    
    #######################################################################
    # 1. Create a task for each relation and an EmmentalModel to learn those tasks.
    #######################################################################
    
    # Generate special tokens which are used for LSTM model to locate mentions.
    # In LSTM model, we pad sentence with special tokens to help LSTM to learn
    # those mentions. Example:
    # Original sentence: Then Barack married Michelle.
    # ->  Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
    arity = 2
    special_tokens = []
    for i in range(arity):
        special_tokens += [f"~~[[{i}", f"{i}]]~~"]
    
    # Generate word embedding module for LSTM.
    emb_layer = EmbeddingModule(
        word_counter=word_counter, word_dim=300, specials=special_tokens
    )
    
    # Create task for each relation.
    tasks = create_task(
        ATTRIBUTE,
        2,
        F_train[0].shape[1],
        2,
        emb_layer,
        mode="mtl",
        model="LogisticRegression",
    )
    
    # Create Emmental model to learn the tasks.
    model = EmmentalModel(name=f"{ATTRIBUTE}_task")
    
    # Add tasks into model
    for task in tasks:
        model.add_task(task)
    
    #######################################################################
    # 2. Wrap candidates into EmmentalDataLoader for training.
    #######################################################################
    
    # Here we only use the samples that have labels, i.e., we filter out the
    # samples that don't have significant marginals.
    diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
    train_idxs = np.where(diffs > 1e-6)[0]
    
    # Create a dataloader with weakly supervised samples to learn the model.
    train_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE,
            train_cands[0],
            F_train[0],
            emb_layer.word2id,
            train_marginals,
            train_idxs,
        ),
        split="train",
        batch_size=100,
        shuffle=True,
    )
    
    
    # Create the test dataloader to do prediction.
    test_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
        ),
        split="test",
        batch_size=100,
        shuffle=False,
    )
    
    
    #######################################################################
    # 3. Training and inference (prediction).
    #######################################################################
    
    # Learn those tasks.
    emmental_learner = EmmentalLearner()
    emmental_learner.learn(model, [train_dataloader])
    
    # Predict based on the learned model.
    test_preds = model.predict(test_dataloader, return_preds=True)
    

    Changed

    • @senwu: Switch to Emmental as the default learning engine.
    • @HiromuHota: Change ABSTAIN to -1 to be compatible with Snorkel of 0.9.X. Accordingly, user-defined labels should now be 0-indexed (used to be 1-indexed). (#310) (#320)
    • @HiromuHota: Use executemany_mode="batch" instead of deprecated use_batch_mode=True. (#358)
    • @HiromuHota: Use tqdm.notebook.tqdm instead of deprecated tqdm.tqdm_notebook. (#360)
    • @HiromuHota: To support ImageMagick7, expand the version range of Wand. (#373)
    • @HiromuHota: Comply with PEP 561 for type-checking codes that use Fonduer.
    • @HiromuHota: Make UDF.apply of all child classes unaware of the database backend, meaning PostgreSQL is not required if UDF.apply is directly used instead of UDFRunner.apply. (#316) (#368)
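
To illustrate the label change listed above, here is a minimal sketch of migrating label values. It assumes the legacy scheme used ABSTAIN = 0 with classes numbered from 1; that is an assumption, since the changelog states only that user-defined labels used to be 1-indexed. The constant and function names are illustrative.

```python
# New scheme, compatible with Snorkel 0.9.X: ABSTAIN is -1, classes start at 0.
ABSTAIN = -1
FALSE = 0
TRUE = 1


def convert_legacy_label(old_label: int) -> int:
    """Map an assumed legacy label (0 = abstain, 1.. = classes) to the new scheme."""
    if old_label < 0:
        raise ValueError(f"not a legacy label: {old_label}")
    return old_label - 1  # shifts abstain to -1 and class k to k - 1


legacy_votes = [0, 1, 2]  # assumed legacy values: abstain, FALSE, TRUE
converted = [convert_legacy_label(v) for v in legacy_votes]
```

Under that assumption, a single shift by one covers both the abstain value and the class labels.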

    Fixed

    • @senwu: Fix mention extraction to return mention classes instead of data model classes.
Owner
HazyResearch
We are a CS research group led by Prof. Chris Ré.