
HC4: HLTCOE CLIR Common-Crawl Collection

This repository contains the scripts for downloading and validating the documents in HC4. Document ids, topics, and qrel files are in resources/hc4/

Required packages for the scripts are recorded in requirements.txt.

Topics and Qrels

Topics are stored in jsonl format and located in resources/hc4. The language(s) a topic is annotated for is recorded in the languages_with_qrels field. We provide the English topic title and description for all topics, and human translations for the languages in which a topic has qrels. We also provide machine translations of the title and description into all three languages for all topics. Narratives (field narratives) are all in English and have one entry for each language that has qrels. Each topic also has an English report (field report) that is designed to record the prior knowledge the searcher has.
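
For illustration, here is a minimal Python sketch for iterating over a topic file. The file name appears in this repository; any field layout beyond topic_id, languages_with_qrels, narratives, and report is an assumption and may differ from the released files.

import json

# Hedged sketch: print basic information for each topic.
# Exact field names beyond those documented above are assumptions.
with open("resources/hc4/train.topics.v1-0.jsonl") as fin:
    for line in fin:
        topic = json.loads(line)
        print(topic["topic_id"], topic["languages_with_qrels"])
        print(topic["report"])  # English record of the searcher's prior knowledge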

Qrels are stored in the classic TREC format and are located in resources/hc4/{lang}.
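
Since each qrels line follows the four-column TREC layout (topic id, iteration, document id, relevance), a minimal Python sketch for loading a qrels file into a nested dict; the concrete file name below is an assumption:

from collections import defaultdict

# Hedged sketch: load TREC-style qrels into {topic_id: {doc_id: relevance}}.
qrels = defaultdict(dict)
with open("resources/hc4/zho/dev.qrels.v1-0.txt") as fin:  # file name is an assumption
    for line in fin:
        topic_id, _, doc_id, rel = line.split()
        qrels[topic_id][doc_id] = int(rel)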

Download Documents

To download the documents from Common Crawl, please use the following command. If you plan to use HC4 with ir_datasets, please specify ~/.ir_datasets/hc4 as the storage, or make a soft link from it to the directory where you wish to store the documents. The document ids and hashes are stored in resources/hc4/{lang}/ids*.jsonl.gz. Russian document ids are separated into 8 files.

python download_documents.py --storage ./data/ \
                             --zho ./resources/hc4/zho/ids.jsonl.gz \
                             --fas ./resources/hc4/fas/ids.jsonl.gz \
                             --rus ./resources/hc4/rus/ids.*.jsonl.gz \
                             --jobs 4 \
                             --check_hash 

If you wish to download the documents for only one language, just specify the id file for that language. We encourage using the flag --check_hash to verify that the downloaded documents match the documents we intend to include in the collection. A full description of the arguments is available when executing with the --help flag.
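
If you plan to read the collection through ir_datasets, one way to satisfy the ~/.ir_datasets/hc4 expectation mentioned above is a soft link to your storage directory. A minimal Python sketch, assuming the documents were downloaded with --storage ./data/:

from pathlib import Path

# Point ~/.ir_datasets/hc4 at the storage directory used with --storage.
storage = Path("./data").resolve()
link = Path.home() / ".ir_datasets" / "hc4"
link.parent.mkdir(parents=True, exist_ok=True)
if not link.exists():
    link.symlink_to(storage, target_is_directory=True)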

Validate

After the documents are downloaded, please run validate_hc4_documents.py to verify that all documents were downloaded for each language.

python validate_hc4_documents.py --hc4_file ./data/zho/hc4_docs.jsonl \
                                 --id_file ./resources/hc4/zho/ids.jsonl.gz \
                                 --qrels ./resources/hc4/zho/*.qrels.v1-0.txt
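
For a quick manual check in addition to the script, here is a hedged sketch that compares the downloaded document ids against the id file; the "id" field name in both files is an assumption, and hash checking is left to the script:

import gzip
import json

# Hedged sketch: confirm every expected document id was downloaded.
with gzip.open("resources/hc4/zho/ids.jsonl.gz", "rt") as fin:
    expected = {json.loads(line)["id"] for line in fin}
with open("data/zho/hc4_docs.jsonl") as fin:
    downloaded = {json.loads(line)["id"] for line in fin}
print(f"missing: {len(expected - downloaded)} documents")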

Reference

If you use this collection, please cite our dataset paper with the following BibTeX entry.

@inproceedings{hc4,
	author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
	title = {{HC4}: A New Suite of Test Collections for Ad Hoc {CLIR}},
	booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
	year = {2022}
}
Comments
  • Train Topics Missing

    Hi @eugene-yang, @dlawrie

    I noticed that some of the train human translation topics are empty.

    e.g. Farsi topic_id 1027 is blank, and all the Russian human translations are missing too. I am looking at this file: https://github.com/hltcoe/HC4/blob/main/resources/hc4/train.topics.v1-0.jsonl

    Kindly assist, Thanks!

    opened by ToluClassics 2
  • Bump lxml from 4.6.3 to 4.6.5

    Bumps lxml from 4.6.3 to 4.6.5.

    Changelog

    Sourced from lxml's changelog.

    4.6.5 (2021-12-12)

    Bugs fixed

    • A vulnerability (GHSL-2021-1038) in the HTML cleaner allowed sneaking script content through SVG images (CVE-2021-43818).

    • A vulnerability (GHSL-2021-1037) in the HTML cleaner allowed sneaking script content through CSS imports and other crafted constructs (CVE-2021-43818).

    4.6.4 (2021-11-01)

    Features added

    • GH#317: A new property system_url was added to DTD entities. Patch by Thirdegree.

    • GH#314: The STATIC_* variables in setup.py can now be passed via env vars. Patch by Isaac Jurado.

    Commits
    • a9611ba Fix a test in Py2.
    • a3eacbc Prepare release of 4.6.5.
    • b7ea687 Update changelog.
    • 69a7473 Cleaner: cover some more cases where scripts could sneak through in specially...
    • 54d2985 Fix condition in test decorator.
    • 4b220b5 Use the non-deprecated TextTestResult instead of _TextTestResult (GH-333)
    • d85c6de Exclude a test when using the macOS system libraries because it fails with li...
    • cd4bec9 Add macOS-M1 as wheel build platform.
    • fd0d471 Install automake and libtool in macOS build to be able to install the latest ...
    • f233023 Cleaner: Remove SVG image data URLs since they can embed script content.
    • Additional commits viewable in compare view

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    wontfix dependencies 
    opened by dependabot[bot] 2
  • Question about languages_with_qrels

    Hi,

    Thanks for building this collection. I have a question about the languages_with_qrels in topics.

    Suppose a query has "languages_with_qrels": ["rus", "zho"]. Does this mean there is no relevant document in Farsi, or that the query is only run against the Russian and Chinese collections?

    opened by zhiqihuang 1
  • Typo in error message and many hash mismatches

    After downloading the data as described, I ran:

    $ python fix_document_order.py --hc4_file ./data/zho/hc4_docs.jsonl \
                                 --id_file ./resources/hc4/zho/ids*.jsonl.gz \
                                 --check_hash
    ...
    Traceback (most recent call last):
      File "/mnt/ssd/expts/joelb/HC4/fix_document_order.py", line 71, in <module>
        main(args)
      File "/mnt/ssd/expts/joelb/HC4/fix_document_order.py", line 40, in main
        assert len(ordered_ids) == len(docs_pos), \
    AssertionError: Downloaded 646268 unique documents but id file(s) have 646268 unique ids.
    

    There's a typo in the error message. Here is a diff to fix:

    $ git diff
    diff --git a/fix_document_order.py b/fix_document_order.py
    index caa76de..f1d5e0f 100644
    --- a/fix_document_order.py
    +++ b/fix_document_order.py
    @@ -38,7 +38,7 @@ def main(args):
                 pbar.update()
    
         assert len(ordered_ids) == len(docs_pos), \
    -           f"Downloaded {len(docs_pos)} unique documents but id file(s) have {len(docs_pos)} unique ids."
    +           f"Downloaded {len(docs_pos)} unique documents but id file(s) have {len(ordered_ids)} unique ids."
    
         output_file = args.hc4_file.with_name(f"{args.hc4_file.name}.sorted")
    
    @@ -68,4 +68,4 @@ if __name__ == '__main__':
         if len(args.id_file) > 1:
             args.id_file = sorted(args.id_file, key=lambda x: int(x.name.split(".")[1]))
    
    -    main(args)
    \ No newline at end of file
    +    main(args)
    

    Running --resume:

    $ python download_documents.py --storage ./data/ \
                                 --zho ./resources/hc4/zho/ids.jsonl.gz \
                                 --fas ./resources/hc4/fas/ids.jsonl.gz \
                                 --rus ./resources/hc4/rus/ids.*.jsonl.gz \
                                 --jobs 4 --resume
    ...
    Looking for 478 documents in 1 cc_files
    ...
    Found all needed docs in crawl-data/CC-NEWS/2019/03/CC-NEWS-20190305130425-00520.warc.gz, early stopping
    done-cc-file:crawl-data/CC-NEWS/2019/03/CC-NEWS-20190305130425-00520.warc.gz
    

    Then re-running fix_document_order.py showed that all docs were downloaded; however, I still had many hash errors. For example, for rus:

    $ python fix_document_order.py --hc4_file ./data/rus/hc4_docs.jsonl \
      --id_file ./resources/hc4/rus/ids*.jsonl.gz \
      --check_hash
    ...
    Doc 81f3aa7d-ab14-4dea-be41-6b3474249953 hash mismatch -- should be d4ca468d21616841a2144d0dad123eb4 but got 8a1dc724e4164c6532ed36a7946bd981
    Doc 86b099ba-1511-4326-828e-e0d5e1c0f90a hash mismatch -- should be 1a50b2a9ba514dfc68810fc4632ab97a but got 6914a2420218286b33955723d126d8e9
    Reading downloaded file: 100%|██████████████████████████████████████████████████████████▉| 4719506/4721064 [07:08<00:00, 11117.84it/s]Doc 589b3401-1e65-40f0-a905-fd6b6fc1e04a hash mismatch -- should be 771d69c411674f627e0be95f7b0ce98d but got a81ca88d6e61cf64f594957001045f84
    Doc b99a9558-d946-4732-b09a-e9d1600cdafa hash mismatch -- should be 0309f37937d498a4bfaeeb3366934c07 but got da68945322ddb025ed1c0430a8aba8cb
    Doc e55d8cae-58af-43af-bf04-19f3628f4273 hash mismatch -- should be 27cbce31eeb04088f5cf529f56d498b9 but got 9828b6d5d53a28dc0cf7876509920bff
    Doc 4bbe64ca-11b7-479f-8e02-83861d470e53 hash mismatch -- should be ab6f0ff18ba464da2f2ac78f3a7a69e4 but got dd0380def9f1d217ae8601ba47aec389
    Doc 382f3ed2-6155-4415-adc3-993a287f129b hash mismatch -- should be 64b44761217e3d8a924c769baeaf8b3d but got 26bf5ed5298ab99c5072772fff4d506f
    Reading downloaded file: 100%|██████████████████████████████████████████████████████████▉| 4720621/4721064 [07:08<00:00, 10955.39it/s]Doc 5d195cd8-e402-4714-8bf7-0c95807f96ff hash mismatch -- should be 662f2f0ca832a52bb2630c1749c74276 but got a0fb8bb09d0b134fef9ccb143ba6a7a0
    Doc 3b6ec592-48c4-4d95-a692-62fd7b9f1529 hash mismatch -- should be 42bcb22637bf3d61669a62781c79a496 but got 373c61c948ef7088bfd25a6234c3a557
    Doc a155ba8a-d5bb-4d5d-b532-4accf699a3b4 hash mismatch -- should be 5aa723febfa6be837456cbc0637c24db but got d493913737cd51554e61f7e34c353261
    Reading downloaded file: 100%|███████████████████████████████████████████████████████████| 4721064/4721064 [07:08<00:00, 11006.19it/s]
    Writing sorted docuements: 100%|█████████████████████████████████████████████████████████| 4721064/4721064 [02:19<00:00, 33754.77it/s]
    Backing up the original file...
    Done
    
    opened by joelb-git 2
  • add instructions to install wheel into virtual environment

    Somewhere in the dependency tree is Pillow, which requires compiling C code. A lot of users will have trouble with that unless they first install wheel into their virtual environment and get the precompiled binary.

    opened by cash 2
Owner
JHU Human Language Technology Center of Excellence