Tools to download and clean up Common Crawl data

Overview

cc_net

Tools to download and clean Common Crawl as introduced in our paper CCNet.

If you found these resources useful, please consider citing:

@inproceedings{wenzek2020ccnet,
  title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4003--4012},
  year={2020}
}

Installation

We have only tried this on Linux, but installation should also be possible on macOS.

  1. Create or symlink a data folder pointing to where you want to download the corpus.

  2. Run make install. This will download some resources and install required packages.

  3. If you have a C++17 compiler, you can also run pip install .[getpy]; it provides a more memory-efficient hash set.

  4. Install the following tools manually if make install fails:

Training Language Models

The Makefile is used to train SentencePiece models and LMs on Wikipedia data; a sketch of how the resulting models can be used to score text follows the list below.

  • make help shows help
  • make lang=de lm trains a SentencePiece model and an LM on German Wikipedia
  • make all_lm trains the same models as in the paper
  • make lang=de dl_lm downloads the LM trained for the paper
  • make dl_all_lms downloads all of them
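Once trained or downloaded, the models can be loaded directly with the sentencepiece and kenlm Python packages. Below is a minimal sketch of scoring a sentence, assuming the default lm_dir layout (data/lm_sp/<lang>.sp.model and data/lm_sp/<lang>.arpa.bin); the file names, the example sentence and the simplified perplexity normalization are illustrative, not the exact cc_net.perplexity implementation.

import kenlm
import sentencepiece as spm

lang = "de"
# Assumed default locations produced by the Makefile; adjust to your setup.
sp = spm.SentencePieceProcessor()
sp.load(f"data/lm_sp/{lang}.sp.model")
lm = kenlm.Model(f"data/lm_sp/{lang}.arpa.bin")

text = "Ein Beispielsatz aus der deutschen Wikipedia."
pieces = sp.encode_as_pieces(text)       # SentencePiece tokenization
log10_prob = lm.score(" ".join(pieces))  # KenLM returns a log10 probability
perplexity = 10.0 ** (-log10_prob / len(pieces))
print(f"{lang} perplexity: {perplexity:.1f}")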

Pipeline overview

The full mining pipeline is divided into 3 steps:

  • hashes downloads one Common Crawl snapshot and computes hashes for each paragraph
  • mine removes duplicates, detects the language, runs the LM and splits by lang/perplexity buckets
  • regroup regroups the files created by mine into chunks of 4 GB

Each step needs the previous step to be over before starting. You can launch the full pipeline using python -m cc_net.

  • python -m cc_net --help shows help
  • python -m cc_net --dump 2019-13 processes a specific snapshot
  • python -m cc_net -l my -l gu restricts to specific languages
  • python -m cc_net --lm_dir my_lms/ uses custom LMs
  • python -m cc_net --lang_threshold 0.3 sets a specific field in mine.Config
  • python -m cc_net --config test runs on a tiny subset of a snapshot
  • python -m cc_net --config config/my_config.json uses configuration from the given config file (see the example below)
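The config file is a JSON object whose keys override fields of mine.Config. Here is a minimal sketch of a config/my_config.json restricted to German; the field names are taken from the config examples quoted in the issues below, and the values are purely illustrative:

{
  "config_name": "de_only",
  "dump": "2019-13",
  "output_dir": "data",
  "lang_whitelist": ["de"],
  "lm_languages": ["de"],
  "mine_num_processes": 8,
  "target_size": "4G"
}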

Reproducing our work

Given the CPU time required to run the full pipeline on such a large corpus, we share a mapping from URL to the information we computed. You can reconstruct the corpus used in the paper by running:

python -m cc_net --conf reproduce --dump 2019-09

Extract XLM-R data

The XLM-RoBERTa model of the paper Unsupervised Cross-lingual Representation Learning at Scale was trained on data extracted by an internal version of cc_net.

Since the format is a little bit different, please use the following commands instead:

python cc_net/tools/dl_cc_100.py --help
python cc_net/tools/dl_cc_100.py --outdir data_cc100 --process 8

If you use this version of the data please also consider citing:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

Adapting to your infrastructure

Given the computation cost of running the full pipeline, we distributed the computation on a Slurm cluster using submitit. submitit will default to spawning processes on your machine if no Slurm cluster is found. You should tweak --task_parallelism to something adapted to your machine. Defaults are 512 for mining and 20 for reproducing.

To run the tasks in-process use --execution debug.
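For a quick sanity check on a single machine, combining this with the tiny test config mentioned above should work, e.g.:

python -m cc_net --config test --execution debug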

Output format

Generated files are compressed JSON files. There is one JSON object per line.

List of fields:

  • url: webpage URL (part of CC)
  • date_download: date of download (part of CC)
  • digest: sha1 digest of the webpage (part of CC)
  • length: number of chars after deduplication
  • nlines: number of lines after deduplication
  • source_domain: web domain of the webpage
  • title: page title (part of CC)
  • raw_content: webpage content after deduplication
  • original_nlines: number of lines before deduplication
  • original_length: number of chars before deduplication
  • language: language detected by FastText LID
  • language_score: language score
  • perplexity: perplexity of a LM trained on Wikipedia

Sample JSON object:

{
  "url": "http://www.pikespeakhospice.org/members/1420",
  "date_download": "2019-02-15T18:40:25Z",
  "digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
  "length": 752,
  "nlines": 5,
  "source_domain": "www.pikespeakhospice.org",
  "title": "LeeRoy Aragon",
  "raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
  "original_nlines": 7,
  "original_length": 754,
  "language": "en",
  "language_score": 0.99,
  "perplexity": 255.11,
}

You can peek at those files using the UNIX tools zcat and jq, e.g.: zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .

jq can do some complicated filtering. jsonql.py provides a Python API with multiprocessing support for more complicated operations such as LM scoring of documents.
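If you prefer plain Python over jq, the output can also be read with nothing but the standard library. Here is a minimal sketch (the shard path is the same example as above, and the field names follow the output format listed earlier) that keeps only confidently-identified English documents with low perplexity:

import gzip
import json

# Example shard; adjust the path to your own output directory.
path = "data/mined/2019-09/en_head_0000.json.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)  # one JSON object per line
        if doc["language"] == "en" and doc["perplexity"] < 300.0:
            print(doc["url"], doc["perplexity"])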

License

By contributing to cc_net, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.

Comments
  • ChunkedEncodingError & ConnectionResetError

    ChunkedEncodingError & ConnectionResetError

    Here's the log with command nohup python -m cc_net mine --dump 2019-13 > 2019-13.log 2>2019-13.err &:

    2019-11-12 00:26 INFO 22835:HashesCollector - Processed 519_187 documents in 1e+01h ( 14.4 doc/s).
    2019-11-12 00:26 INFO 22835:HashesCollector - Found 25_229k unique hashes over 90_967 lines. Using 3.6GB of RAM.
    2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Kept 43_340 documents over 45_437 (95.4%).
    2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Parsed 13 / 35 files. Estimated remaining time: 9.2h
    2019-11-12 00:27 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz
    /data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz (1 out of 3)
      f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
    2019-11-12 01:16 INFO 22835:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz [200]
    2019-11-12 01:16 INFO 22835:HashesCollector - Processed 562_527 documents in 1.1e+01h ( 14.4 doc/s).
    2019-11-12 01:16 INFO 22835:HashesCollector - Found 26_687k unique hashes over 98_562 lines. Using 3.7GB of RAM.
    2019-11-12 01:16 INFO 22835:HashesCollector - Found 26_687k unique hashes over 98_562 lines. Using 3.7GB of RAM.
    2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Kept 43_268 documents over 45_427 (95.2%).
    2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Parsed 14 / 35 files. Estimated remaining time: 17.7h
    2019-11-12 01:17 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz
    /data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (1 out of 3)
      f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
    /data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (2 out of 3)
      f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
    2019-11-12 02:11 INFO 22835:HashesCollector - Processed 605_794 documents in 1.2e+01h ( 14.3 doc/s).
    2019-11-12 02:11 INFO 22835:HashesCollector - Found 0k unique hashes over 106_217 lines. Using 3.7GB of RAM.
    Traceback (most recent call last):
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
        yield
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 507, in read
        data = self._fp.read(amt) if not fp_closed else b""
      File "/usr/lib/python3.7/http/client.py", line 457, in read
        n = self.readinto(b)
      File "/usr/lib/python3.7/http/client.py", line 501, in readinto
        n = self.fp.readinto(b)
      File "/usr/lib/python3.7/socket.py", line 589, in readinto
        return self._sock.recv_into(b)
      File "/usr/lib/python3.7/ssl.py", line 1071, in recv_into
        return self.read(nbytes, buffer)
      File "/usr/lib/python3.7/ssl.py", line 929, in read
        return self._sslobj.read(len, buffer)
    ConnectionResetError: [Errno 104] Connection reset by peer
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 750, in generate
        for chunk in self.raw.stream(chunk_size, decode_content=True):
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 564, in stream
        data = self.read(amt=amt, decode_content=decode_content)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 529, in read
        raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
      File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
        self.gen.throw(type, value, traceback)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
        raise ProtocolError("Connection broken: %r" % e, e)
    urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 31, in <module>
        main()
      File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 27, in main
        command(**parsed_args)
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 512, in main
        regroup(conf)
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 364, in regroup
        mine(conf)
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 257, in mine
        hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 206, in hashes
        ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
      File "/data/myusername/projects/cc_net/cc_net/execution.py", line 128, in debug_executor
        message = function(*x)
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 218, in _hashes_shard
        file=conf.get_cc_shard(shard),
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 448, in run_pipes
        for res in results:
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 295, in map
        for x in source:
      File "/data/myusername/projects/cc_net/cc_net/process_wet_file.py", line 198, in __iter__
        with jsonql.open_remote_file(self.segment_url(segment)) as f:
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1151, in open_remote_file
        content = io.BytesIO(request_get_content(url))
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1136, in request_get_content
        raise e
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1129, in request_get_content
        r = requests.get(url)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 75, in get
        return request('get', url, params=params, **kwargs)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 60, in request
        return session.request(method=method, url=url, **kwargs)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
        resp = self.send(prep, **send_kwargs)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 686, in send
        r.content
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 828, in content
        self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 753, in generate
        raise ChunkedEncodingError(e)
    requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
    

    Is this just due to a poor network connection between me and the Amazon server (I'm in China)? If so, is it recommended to run the code from an AWS server located in the US? If I don't have a C++17 compiler, how much memory do I need? Thanks a lot.

    opened by soloice 13
  • Cannot download the precomputed files

    Hi

    I am trying to reproduce the results from your paper. However, after downloading the Common Crawl data from AWS, access to the precomputed files seems to fail.

    Did you change the location of the precomputed files?

    The error messages are like below:

    /.local/lib/python3.7/site-packages/cc_net/jsonql.py:1141: 
    UserWarning: Swallowed error HTTPSConnectionPool(host='dl.fbaipublicfiles.com', port=443): Max retries exceeded with url: 
    /cc_net/2019-09/en_head_0017.json.gz (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff4d87913d0>: 
    Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')) while downloading https://dl.fbaipublicfiles.com/cc_net/2019-09/en_head_0017.json.gz (2 out of 3)
    
    opened by yinfeiy-g 7
  • support of Hausa

    Thanks for your contribution to the community. I am wondering whether cc_net contains the Hausa language (ISO id: ha/hau)? In the XLM-R paper, Table 6 mentions that Hausa was included in CCNet. However, I didn't find the language code for Hausa in the dumped files or in the fastText LID documentation.

    opened by donglixp 4
  • ModuleNotFoundError: No module named 'typing_extensions'

    After running make install, I was getting

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/data1/alferre/cc_net/cc_net/__main__.py", line 11, in <module>
        import cc_net.mine
      File "/data1/alferre/cc_net/cc_net/mine.py", line 29, in <module>
        from cc_net import dedup, execution, jsonql, perplexity, process_wet_file
      File "/data1/alferre/cc_net/cc_net/dedup.py", line 25, in <module>
        from cc_net import jsonql
      File "/data1/alferre/cc_net/cc_net/jsonql.py", line 50, in <module>
        from typing_extensions import Literal, Protocol
    

    Running pip install typing_extensions fixed it. So this package is probably missing from the setup.py.

    opened by alexandremuzio 4
  • Fix typo in README (dl_all_lm -> dl_all_lms)

    I found a typo in the README file.

    It seems that make dl_all_lm should be make dl_all_lms in README.md, as it is defined in the Makefile.

    I tested it and it works well. 😃

    CLA Signed 
    opened by chloamme 3
  • Decrease RAM usage, investigate miss documents

    @gwenzek I wrote a post with tips on how to recreate this in GCP. Basically, using S3 or a Google Cloud bucket and mounting it as a disk will save you a lot of storage fees.

    Originally posted by @theblackcat102 in https://github.com/facebookresearch/cc_net/issues/2#issuecomment-599174158

    opened by gwenzek 3
  • Early exit when desired number of documents is reached?

    Apologies if this is mentioned somewhere or is otherwise obvious, but:

    Is there a way to early-exit when a desired number of documents have been collected? Say I only wanted 1 million documents, can I somehow exit the call to python cc_net mine once I have hit that number?

    Thanks a lot in advance.

    opened by JohnGiorgi 3
  • EOFError: Compressed file ended before the end-of-stream marker was reached

    Hi there, I was trying to run the code with the MPExecutor but got the following error:

    2020-07-23 20:44 INFO 156:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz [200]
    2020-07-23 20:44 INFO 156:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512054.0/wet/CC-MAIN-20171211014442-20171211034442-00400.warc.wet.gz
    2020-07-23 20:48 INFO 171:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512208.1/wet/CC-MAIN-20171211052406-20171211072406-00300.warc.wet.gz [200]
    2020-07-23 20:48 INFO 171:HashesCollector - Processed 2_915 documents in 0.078h ( 10.4 doc/s).
    2020-07-23 20:48 INFO 171:HashesCollector - Found 0k unique hashes over 522 lines. Using 0.1GB of RAM.
    multiprocessing.pool.RemoteTraceback: 
    
    Traceback (most recent call last):
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/multiprocessing/pool.py", line 121, in worker
        result = (True, func(*args, **kwds))
      File "/home/cc_net-master/cc_net/execution.py", line 145, in global_fn
        return f(*args[1:])
      File "/home/cc_net-master/cc_net/mine.py", line 233, in _hashes_shard
        file=conf.get_cc_shard(shard),
      File "/home/cc_net-master/cc_net/jsonql.py", line 449, in run_pipes
        for res in results:
      File "/home/cc_net-master/cc_net/jsonql.py", line 296, in map
        for x in source:
      File "/home/cc_net-master/cc_net/process_wet_file.py", line 199, in __iter__
        for doc in parse_warc_file(iter(f), self.min_len):
      File "/home/cc_net-master/cc_net/process_wet_file.py", line 117, in parse_warc_file
        for doc in group_by_docs(lines):
      File "/home/cc_net-master/cc_net/process_wet_file.py", line 89, in group_by_docs
        for warc in warc_lines:
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/gzip.py", line 300, in read1
        return self._buffer.read1(size)
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/_compression.py", line 68, in readinto
        data = self.read(len(byte_view))
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/gzip.py", line 493, in read
        raise EOFError("Compressed file ended before the "
    EOFError: Compressed file ended before the end-of-stream marker was reached
    
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/cc_net-master/cc_net/__main__.py", line 24, in <module>
        main()
      File "/home/cc_net-master/cc_net/__main__.py", line 20, in main
        func_argparse.parse_and_call(parser)
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
        return command(**parsed_args)
      File "/home/cc_net-master/cc_net/mine.py", line 524, in main
        regroup(conf)
      File "/home/cc_net-master/cc_net/mine.py", line 379, in regroup
        mine(conf)
      File "/home/cc_net-master/cc_net/mine.py", line 272, in mine
        hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
      File "/home/cc_net-master/cc_net/mine.py", line 221, in hashes
        ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
      File "/home/cc_net-master/cc_net/execution.py", line 174, in __call__
        global_fn, zip(itertools.repeat(f_name), *args)
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/multiprocessing/pool.py", line 748, in next
        raise value
    EOFError: Compressed file ended before the end-of-stream marker was reached
    

    and the config I changed in mine.py is like this:

    config_name: str = "base"
        dump: str = "2017-51"
        output_dir: Path = Path("data")  
        execution: str = "mp"
        num_shards: int = 800
        num_segments_per_shard: int = -1
        min_len: int = 300
        hash_in_mem: int = 25
        lang_whitelist: Sequence[str] = ["zh"]
        lang_blacklist: Sequence[str] = []
        lang_threshold: float = 0.5
        lm_dir: Path = Path("data/lm_sp")
        cutoff: Path = CUTOFF_CSV
        lm_languages: Optional[Sequence[str]] = ["zh"]
        mine_num_processes: int = 10
        target_size: str = "2G"
        cleanup_after_regroup: bool = True
        task_parallelism: int = 500
        pipeline: Sequence[str] = []
        experiments: Sequence[str] = []
    

    I searched for this error and the answers all say it is caused by an incomplete downloaded file, but I saw the code annotation in jsonql.py's open_remote_file: "Download the files at the given url to memory and opens it as a file". How can I delete these incomplete downloaded files from memory? Or is there any other solution to fix this error?

    By the way, the environment I was running the code in is a Docker container with Ubuntu 20.04.

    opened by zl827154659 2
  • Dedup all paragraphs if they appear more than once?

    eg. if "it is an issue about cc_net" is a paragraph and it appeared three times, as the NativeHashSet saves the value of this key is 1, the 3 paragraphs will be dropped. Why not save one copy?

    opened by xingenju 2
  • Error: Mining phase failure

    Hello everyone, I'm having problems running the mining phase. I'm using a computer with 60 GB of RAM and 16 CPU cores. When running the mining phase I get the error below.

    Traceback (most recent call last):
      File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/__main__.py", line 18, in <module>
        main()
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/__main__.py", line 14, in main
        func_argparse.parse_and_call(cc_net.mine.get_main_parser())
      File "/home/raphael_assis4347/.local/lib/python3.8/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
        return command(**parsed_args)
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/mine.py", line 631, in main
        all_files = mine(conf)
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/mine.py", line 341, in mine
        ex(_mine_shard, repeat(conf), hashes_files, *_transpose(missing_outputs))
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/execution.py", line 200, in custom_map_array
        raise Exception(message)
    Exception: 9 / 9 jobs failed while running _mine_shard
    Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2019-09', output_dir=PosixPath('data'), mined_dir='mined_data', execution='auto', num_shards=9, num_segments_per_shard=750, metadata=None, min_len=300, hash_in_mem=9, lang_whitelist=['pt'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=['head', 'middle'], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/home/raphael_assis4347/raphael/cc_net/cc_net/data/cutoff.csv'), lm_languages=['pt'], mine_num_processes=9, target_size='4G', cleanup_after_regroup=True, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=PosixPath('data/wet_cache'))
    Submitting 9 jobs for _mine_shard, with task_parallelism=16
    Waiting on 9 running jobs. Job ids: 69378,69397,69400,69419...
    Failed job 69516 (1 / 9): Job 69516 (task: 0) with path /home/raphael_assis4347/raphael/cc_net/data/logs/69516_0_result.pkl
    has not produced any output (state: FINISHED)
    Error stream produced:
    ----------------------------------------
    2022-09-14 16:43 INFO 69535:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x7fd7313c8d00>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x7fd7313c8df0>, <cc_net.perplexity.MultiSentencePiece object at 0x7fd7313c8d30>, <cc_net.perplexity.DocLM object at 0x7fd7313c8e50>, <cc_net.perplexity.PerplexityBucket object at 0x7fd7313c8d60>, <cc_net.perplexity.DropKeys object at 0x7fd7313c8fa0>]
    
    Waiting on 8 running jobs. Job ids: 69378,69397,69400,69419...
    Failed job 69400 (2 / 9): Job 69400 (task: 0) with path /home/raphael_assis4347/raphael/cc_net/data/logs/69400_0_result.pkl
    has not produced any output (state: FINISHED)
    Error stream produced:
    ----------------------------------------
    2022-09-14 16:43 INFO 69418:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x7fcee923a970>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x7fcee923aa60>, <cc_net.perplexity.MultiSentencePiece object at 0x7fcee923a9a0>, <cc_net.perplexity.DocLM object at 0x7fcee923aac0>, <cc_net.perplexity.PerplexityBucket object at 0x7fcee923a9d0>, <cc_net.perplexity.DropKeys object at 0x7fcee923ac10>]
    

    I couldn't identify the problem by looking at the logs. The process's .log.err file only contains the vector of pipeline objects. Does anyone have any idea what it could be?

    This is my configuration file:

    {
        "dump": "2019-09",
        "hash_in_mem": 9,
        "num_shards": 9,
        "mine_num_processes": 9,
        "num_segments_per_shard": 750,
        "lang_whitelist": ["pt"],
        "lm_languages": ["pt"],
        "pipeline": [
            "dedup",
            "lid",
            "keep_lang",
            "sp",
            "lm",
            "pp_bucket",
            "drop",
            "split_by_lang"
        ],
        "execution": "auto",
        "output_dir": "data",
        "mined_dir": "mined_data",
        "target_size": "4G",
        "keep_bucket": ["head", "middle"],
        "cache_dir": "data/wet_cache"
    }
    
    opened by AssisRaphael 1
  • I want to copy the output data of CC_net directly, what should I do?

    run “python -m cc_net --config reproduce --dump 2019-09”

    With the 403 error on metadata='https://dl.fbaipublicfiles.com/cc_net/1.0.0', can I still copy your output?

    I don't have enough CPU and GPU resources to complete the mining process. I want to copy the output data of CC_net directly, what should I do?

    opened by mome1024 1
  • 403 forbidden while downloading

    Hi there, I encountered a 403 error while trying to download cc_net data using this pipeline. I'm wondering if this is because of network settings on my side or if something else is wrong? Thanks in advance.

    /ldap_home/raven.ren/cc_net/cc_net/flat_hash_set.py:115: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try pip install cc_net[getpy]
      warnings.warn(
    2022-08-23 19:25 INFO 6898:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz
    2022-08-23 19:25 INFO 6898:HashesCollector - Processed 0 documents in 0.00034h ( 0.0 doc/s).
    2022-08-23 19:25 INFO 6898:HashesCollector - Found 0k unique hashes over 0k lines. Using 0.1GB of RAM.
    submitit ERROR (2022-08-23 19:25:23,974) - Submitted job triggered an exception
    2022-08-23 19:25 ERROR 6898:submitit - Submitted job triggered an exception
    Traceback (most recent call last):
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
        submitit_main()
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 72, in submitit_main
        process_job(args.folder)
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in process_job
        raise error
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
        result = delayed.result()
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
        self._result = self.function(*self.args, **self.kwargs)
      File "/ldap_home/raven.ren/cc_net/cc_net/mine.py", line 273, in _hashes_shard
        jsonql.run_pipes(
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 455, in run_pipes
        write_jsons(data, output)
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 496, in write_jsons
        for res in source:
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 284, in map
        for x in source:
      File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 195, in __iter__
        n = len(self.segments)
      File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 243, in segments
        segments = cc_segments(self.dump, self.cache_dir)
      File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 38, in cc_segments
        f = jsonql.open_remote_file(wet_paths, cache=wet_paths_cache)
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1124, in open_remote_file
        raw_bytes = request_get_content(url)
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1101, in request_get_content
        raise e
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1095, in request_get_content
        r.raise_for_status()
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/requests/models.py", line 960, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz

    opened by Raven-Ren 2
  • Batch job submission failed: Invalid job array specification

    Hi, when I run "python -m cc_net", this error happened:

    Submitting _hashes_shard in a job array (1600 jobs)
    sbatch: error: Batch job submission failed: Invalid job array specification
    subprocess.CalledProcessError: Command '['sbatch', '/data/gsw/test/cc_net/data/logs/submission_file_479eba35e148432da4432891c1191887.sh']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/data/gsw/test/cc_net/cc_net/__main__.py", line 18, in <module>
        main()
      File "/data/gsw/test/cc_net/cc_net/__main__.py", line 14, in main
        func_argparse.parse_and_call(cc_net.mine.get_main_parser())
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
        return command(**parsed_args)
      File "/data/gsw/test/cc_net/cc_net/mine.py", line 632, in main
        all_files = mine(conf)
      File "/data/gsw/test/cc_net/cc_net/mine.py", line 335, in mine
        hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
      File "/data/gsw/test/cc_net/cc_net/mine.py", line 263, in hashes
        ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
      File "/data/gsw/test/cc_net/cc_net/execution.py", line 89, in map_array_and_wait
        jobs = ex.map_array(function, *args)
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/core/core.py", line 701, in map_array
        return self._internal_process_submissions(submissions)
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
        return self._executor._internal_process_submissions(delayed_submissions)
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/slurm/slurm.py", line 332, in _internal_process_submissions
        first_job: core.Job[tp.Any] = array_ex._submit_command(self._submitit_command_str)
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/core/core.py", line 864, in _submit_command
        output = utils.CommandFunction(command_list, verbose=False)()  # explicit errors
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/core/utils.py", line 350, in __call__
        raise FailedJobError(stderr) from subprocess_error
    submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid job array specification

    opened by swgu98 2
  • Variance of hash files sizes in newer crawls

    Hello, I noticed that the hash files I've produced from the January 2021 dump (and several other months in 2020) are much smaller (~100x) than the hashes from the dumps of April and May 2019, even though the original WET files were the same size.

    In both cases there are 2 shards per hash file and all the other parameters are the same.

    Trying to understand why, thanks :)

    opened by var926 1
  • "Reproducing our work" does not specify set of languages and snapshots

    README.md provides python -m cc_net --config reproduce --dump 2019-09 as an example to reproduce the cc_net corpus, which relies on

    https://github.com/facebookresearch/cc_net/blob/242e10d1d694031c82817f895e56e27a02618803/cc_net/mine.py#L172-L191

    The combination of dump 2019-09 and the French language provides only a small corpus. As the metadata files are only accessible via https://dl.fbaipublicfiles.com/cc_net/1.0.0, it is impossible to list the underlying S3 bucket to obtain a complete list of available languages and dumps. Thus it would be helpful if you could provide the complete list in your README.

    opened by leezu 2
  • cc_net/tools/dl_cc_100.py fails to extract complete dataset

    python3.7 cc_net/tools/dl_cc_100.py --outdir data/cc100 --processes 96 provides only 99 GB (277 GB uncompressed) of data across 10 languages:

    780M    /mnt/data/cc100/bn_IN
    2.0G    /mnt/data/cc100/hi_IN
    25G     /mnt/data/cc100/id_ID
    12G     /mnt/data/cc100/ko_KR
    89M     /mnt/data/cc100/my_MM
    25G     /mnt/data/cc100/sv_SE
    270M    /mnt/data/cc100/sw_KE
    6.7G    /mnt/data/cc100/th_TH
    475M    /mnt/data/cc100/tl_XX
    21G     /mnt/data/cc100/vi_VN
    

    The script should provide all 100 languages listed in https://arxiv.org/pdf/1911.02116.pdf Figure 1:

    opened by leezu 6