
HC4: HLTCOE CLIR Common-Crawl Collection

This repository contains the scripts for downloading and validating the documents in HC4. Document ids, topics, and qrel files are in resources/hc4/

Required packages for the scripts are recorded in requirements.txt.

Topics and Qrels

Topics are stored in jsonl format and located in resources/hc4. The language(s) a topic is annotated for is recorded in the languages_with_qrels field. We provide the English topic title and description for all topics, and human translations for the languages in which a topic has qrels. We also provide machine translations of the title and description into all three languages for all topics. Narratives (field narratives) are all in English and have one entry for each language that has qrels. Each topic also has an English report (field report) that is designed to record the prior knowledge the searcher has.
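
For illustration, here is a minimal Python sketch for iterating over a topic file. The file name appears in this repository; any field layout beyond topic_id, languages_with_qrels, narratives, and report is an assumption and may differ from the released files.

import json

# Hedged sketch: print basic information for each topic.
# Exact field names beyond those documented above are assumptions.
with open("resources/hc4/train.topics.v1-0.jsonl") as fin:
    for line in fin:
        topic = json.loads(line)
        print(topic["topic_id"], topic["languages_with_qrels"])
        print(topic["report"])  # English record of the searcher's prior knowledge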

Qrels are stored in the classic TREC format and are located in resources/hc4/{lang}.
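
Since each qrels line follows the four-column TREC layout (topic id, iteration, document id, relevance), a minimal Python sketch for loading a qrels file into a nested dict; the concrete file name below is an assumption:

from collections import defaultdict

# Hedged sketch: load TREC-style qrels into {topic_id: {doc_id: relevance}}.
qrels = defaultdict(dict)
with open("resources/hc4/zho/dev.qrels.v1-0.txt") as fin:  # file name is an assumption
    for line in fin:
        topic_id, _, doc_id, rel = line.split()
        qrels[topic_id][doc_id] = int(rel)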

Download Documents

To download the documents from Common Crawl, please use the following command. If you plan to use HC4 with ir_datasets, please specify ~/.ir_datasets/hc4 as the storage, or make a soft link from it to the directory where you wish to store the documents. The document ids and hashes are stored in resources/hc4/{lang}/ids*.jsonl.gz. Russian document ids are separated into 8 files.

python download_documents.py --storage ./data/ \
                             --zho ./resources/hc4/zho/ids.jsonl.gz \
                             --fas ./resources/hc4/fas/ids.jsonl.gz \
                             --rus ./resources/hc4/rus/ids.*.jsonl.gz \
                             --jobs 4 \
                             --check_hash 

If you wish to download the documents for only one language, just specify the id file for that language. We encourage using the flag --check_hash to verify that the downloaded documents match the documents we intend to include in the collection. A full description of the arguments is available when executing with the --help flag.
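
If you plan to read the collection through ir_datasets, one way to satisfy the ~/.ir_datasets/hc4 expectation mentioned above is a soft link to your storage directory. A minimal Python sketch, assuming the documents were downloaded with --storage ./data/:

from pathlib import Path

# Point ~/.ir_datasets/hc4 at the storage directory used with --storage.
storage = Path("./data").resolve()
link = Path.home() / ".ir_datasets" / "hc4"
link.parent.mkdir(parents=True, exist_ok=True)
if not link.exists():
    link.symlink_to(storage, target_is_directory=True)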

Validate

After the documents are downloaded, please run validate_hc4_documents.py to verify that all documents were downloaded for each language.

python validate_hc4_documents.py --hc4_file ./data/zho/hc4_docs.jsonl \
                                 --id_file ./resources/hc4/zho/ids.jsonl.gz \
                                 --qrels ./resources/hc4/zho/*.qrels.v1-0.txt
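
For a quick manual check in addition to the script, here is a hedged sketch that compares the downloaded document ids against the id file; the "id" field name in both files is an assumption, and hash checking is left to the script:

import gzip
import json

# Hedged sketch: confirm every expected document id was downloaded.
with gzip.open("resources/hc4/zho/ids.jsonl.gz", "rt") as fin:
    expected = {json.loads(line)["id"] for line in fin}
with open("data/zho/hc4_docs.jsonl") as fin:
    downloaded = {json.loads(line)["id"] for line in fin}
print(f"missing: {len(expected - downloaded)} documents")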

Reference

If you use this collection, please cite our dataset paper with the following BibTeX entry.

@inproceedings{hc4,
	author = {Dawn Lawrie and James Mayfield and Douglas W. Oard and Eugene Yang},
	title = {{HC4}: A New Suite of Test Collections for Ad Hoc {CLIR}},
	booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
	year = {2022}
}
Comments
  • Train Topics Missing

    Hi @eugene-yang, @dlawrie

    I noticed that some of the train human translation topics are empty.

    e.g. Farsi topic_id 1027 is blank, and all the Russian human translations are missing too. I am looking at this file: https://github.com/hltcoe/HC4/blob/main/resources/hc4/train.topics.v1-0.jsonl

    Kindly assist, Thanks!

    opened by ToluClassics 2
  • Bump lxml from 4.6.3 to 4.6.5

    Bumps lxml from 4.6.3 to 4.6.5.

    Changelog

    Sourced from lxml's changelog.

    4.6.5 (2021-12-12)

    Bugs fixed

    • A vulnerability (GHSL-2021-1038) in the HTML cleaner allowed sneaking script content through SVG images (CVE-2021-43818).

    • A vulnerability (GHSL-2021-1037) in the HTML cleaner allowed sneaking script content through CSS imports and other crafted constructs (CVE-2021-43818).

    4.6.4 (2021-11-01)

    Features added

    • GH#317: A new property system_url was added to DTD entities. Patch by Thirdegree.

    • GH#314: The STATIC_* variables in setup.py can now be passed via env vars. Patch by Isaac Jurado.

    Commits
    • a9611ba Fix a test in Py2.
    • a3eacbc Prepare release of 4.6.5.
    • b7ea687 Update changelog.
    • 69a7473 Cleaner: cover some more cases where scripts could sneak through in specially...
    • 54d2985 Fix condition in test decorator.
    • 4b220b5 Use the non-deprecated TextTestResult instead of _TextTestResult (GH-333)
    • d85c6de Exclude a test when using the macOS system libraries because it fails with li...
    • cd4bec9 Add macOS-M1 as wheel build platform.
    • fd0d471 Install automake and libtool in macOS build to be able to install the latest ...
    • f233023 Cleaner: Remove SVG image data URLs since they can embed script content.
    • Additional commits viewable in compare view

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    wontfix dependencies 
    opened by dependabot[bot] 2
  • Question about languages_with_qrels

    Hi,

    Thanks for building this collection. I have a question about the languages_with_qrels in topics.

    Suppose a query has "languages_with_qrels": ["rus", "zho"]. Does this mean there is no relevant document in Farsi, or that the query is only run against the Russian and Chinese collections?

    opened by zhiqihuang 1
  • Typo in error message and many hash mismatches

    After downloading the data as described, I ran:

    $ python fix_document_order.py --hc4_file ./data/zho/hc4_docs.jsonl \
                                 --id_file ./resources/hc4/zho/ids*.jsonl.gz \
                                 --check_hash
    ...
    Traceback (most recent call last):
      File "/mnt/ssd/expts/joelb/HC4/fix_document_order.py", line 71, in <module>
        main(args)
      File "/mnt/ssd/expts/joelb/HC4/fix_document_order.py", line 40, in main
        assert len(ordered_ids) == len(docs_pos), \
    AssertionError: Downloaded 646268 unique documents but id file(s) have 646268 unique ids.
    

    There's a typo in the error message. Here is a diff to fix:

    $ git diff
    diff --git a/fix_document_order.py b/fix_document_order.py
    index caa76de..f1d5e0f 100644
    --- a/fix_document_order.py
    +++ b/fix_document_order.py
    @@ -38,7 +38,7 @@ def main(args):
                 pbar.update()
    
         assert len(ordered_ids) == len(docs_pos), \
    -           f"Downloaded {len(docs_pos)} unique documents but id file(s) have {len(docs_pos)} unique ids."
    +           f"Downloaded {len(docs_pos)} unique documents but id file(s) have {len(ordered_ids)} unique ids."
    
         output_file = args.hc4_file.with_name(f"{args.hc4_file.name}.sorted")
    
    @@ -68,4 +68,4 @@ if __name__ == '__main__':
         if len(args.id_file) > 1:
             args.id_file = sorted(args.id_file, key=lambda x: int(x.name.split(".")[1]))
    
    -    main(args)
    \ No newline at end of file
    +    main(args)
    

    Running --resume:

    $ python download_documents.py --storage ./data/ \
                                 --zho ./resources/hc4/zho/ids.jsonl.gz \
                                 --fas ./resources/hc4/fas/ids.jsonl.gz \
                                 --rus ./resources/hc4/rus/ids.*.jsonl.gz \
                                 --jobs 4 --resume
    ...
    Looking for 478 documents in 1 cc_files
    ...
    Found all needed docs in crawl-data/CC-NEWS/2019/03/CC-NEWS-20190305130425-00520.warc.gz, early stopping
    done-cc-file:crawl-data/CC-NEWS/2019/03/CC-NEWS-20190305130425-00520.warc.gz
    

    Then re-running fix_document_order.py showed that all docs were downloaded; however, I still had many hash errors. For example, for rus:

    $ python fix_document_order.py --hc4_file ./data/rus/hc4_docs.jsonl \
      --id_file ./resources/hc4/rus/ids*.jsonl.gz \
      --check_hash
    ...
    Doc 81f3aa7d-ab14-4dea-be41-6b3474249953 hash mismatch -- should be d4ca468d21616841a2144d0dad123eb4 but got 8a1dc724e4164c6532ed36a7946bd981
    Doc 86b099ba-1511-4326-828e-e0d5e1c0f90a hash mismatch -- should be 1a50b2a9ba514dfc68810fc4632ab97a but got 6914a2420218286b33955723d126d8e9
    Reading downloaded file: 100%|██████████████████████████████████████████████████████████▉| 4719506/4721064 [07:08<00:00, 11117.84it/s]Doc 589b3401-1e65-40f0-a905-fd6b6fc1e04a hash mismatch -- should be 771d69c411674f627e0be95f7b0ce98d but got a81ca88d6e61cf64f594957001045f84
    Doc b99a9558-d946-4732-b09a-e9d1600cdafa hash mismatch -- should be 0309f37937d498a4bfaeeb3366934c07 but got da68945322ddb025ed1c0430a8aba8cb
    Doc e55d8cae-58af-43af-bf04-19f3628f4273 hash mismatch -- should be 27cbce31eeb04088f5cf529f56d498b9 but got 9828b6d5d53a28dc0cf7876509920bff
    Doc 4bbe64ca-11b7-479f-8e02-83861d470e53 hash mismatch -- should be ab6f0ff18ba464da2f2ac78f3a7a69e4 but got dd0380def9f1d217ae8601ba47aec389
    Doc 382f3ed2-6155-4415-adc3-993a287f129b hash mismatch -- should be 64b44761217e3d8a924c769baeaf8b3d but got 26bf5ed5298ab99c5072772fff4d506f
    Reading downloaded file: 100%|██████████████████████████████████████████████████████████▉| 4720621/4721064 [07:08<00:00, 10955.39it/s]Doc 5d195cd8-e402-4714-8bf7-0c95807f96ff hash mismatch -- should be 662f2f0ca832a52bb2630c1749c74276 but got a0fb8bb09d0b134fef9ccb143ba6a7a0
    Doc 3b6ec592-48c4-4d95-a692-62fd7b9f1529 hash mismatch -- should be 42bcb22637bf3d61669a62781c79a496 but got 373c61c948ef7088bfd25a6234c3a557
    Doc a155ba8a-d5bb-4d5d-b532-4accf699a3b4 hash mismatch -- should be 5aa723febfa6be837456cbc0637c24db but got d493913737cd51554e61f7e34c353261
    Reading downloaded file: 100%|███████████████████████████████████████████████████████████| 4721064/4721064 [07:08<00:00, 11006.19it/s]
    Writing sorted docuements: 100%|█████████████████████████████████████████████████████████| 4721064/4721064 [02:19<00:00, 33754.77it/s]
    Backing up the original file...
    Done
    
    opened by joelb-git 2
  • add instructions to install wheel into virtual environment

    Somewhere in the dependency tree is Pillow, which requires compiling C code. A lot of users will have trouble with that unless they first install wheel into their virtual environment and get the precompiled binary.

    opened by cash 2
Owner
JHU Human Language Technology Center of Excellence