Tools to download and clean up Common Crawl data

Overview

cc_net

Tools to download and clean Common Crawl as introduced in our paper CCNet.

If you found these resources useful, please consider citing:

@inproceedings{wenzek2020ccnet,
  title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzm{\'a}n, Francisco and Joulin, Armand and Grave, {\'E}douard},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4003--4012},
  year={2020}
}

Installation

We have only tried this on Linux, but installation should also be possible on macOS.

  1. Create or symlink a data folder pointing to where you want to download the corpus.

  2. Run make install. This will download some resources and install required packages.

  3. If you have a C++17 compiler, you can also run pip install .[getpy]; it provides a more memory-efficient hash set.

  4. Install the following tools manually if make install fails:

Training Language Models

The Makefile is used to train SentencePiece models and LMs on Wikipedia data; a sketch of how the resulting models can be used to score text follows the list below.

  • make help shows help
  • make lang=de lm trains a SentencePiece model and an LM on German Wikipedia
  • make all_lm trains the same models as in the paper
  • make lang=de dl_lm downloads the LM trained for the paper
  • make dl_all_lms downloads all of them
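Once trained or downloaded, the models can be loaded directly with the sentencepiece and kenlm Python packages. Below is a minimal sketch of scoring a sentence, assuming the default lm_dir layout (data/lm_sp/<lang>.sp.model and data/lm_sp/<lang>.arpa.bin); the file names, the example sentence and the simplified perplexity normalization are illustrative, not the exact cc_net.perplexity implementation.

import kenlm
import sentencepiece as spm

lang = "de"
# Assumed default locations produced by the Makefile; adjust to your setup.
sp = spm.SentencePieceProcessor()
sp.load(f"data/lm_sp/{lang}.sp.model")
lm = kenlm.Model(f"data/lm_sp/{lang}.arpa.bin")

text = "Ein Beispielsatz aus der deutschen Wikipedia."
pieces = sp.encode_as_pieces(text)       # SentencePiece tokenization
log10_prob = lm.score(" ".join(pieces))  # KenLM returns a log10 probability
perplexity = 10.0 ** (-log10_prob / len(pieces))
print(f"{lang} perplexity: {perplexity:.1f}")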

Pipeline overview

The full mining pipeline is divided into 3 steps:

  • hashes downloads one Common Crawl snapshot and computes hashes for each paragraph
  • mine removes duplicates, detects the language, runs the LM and splits by lang/perplexity buckets
  • regroup regroups the files created by mine into chunks of 4 GB

Each step needs the previous step to be over before starting. You can launch the full pipeline using python -m cc_net.

  • python -m cc_net --help shows help
  • python -m cc_net --dump 2019-13 processes a specific snapshot
  • python -m cc_net -l my -l gu restricts to specific languages
  • python -m cc_net --lm_dir my_lms/ uses custom LMs
  • python -m cc_net --lang_threshold 0.3 sets a specific field in mine.Config
  • python -m cc_net --config test runs on a tiny subset of a snapshot
  • python -m cc_net --config config/my_config.json uses configuration from the given config file (see the example below)
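The config file is a JSON object whose keys override fields of mine.Config. Here is a minimal sketch of a config/my_config.json restricted to German; the field names are taken from the config examples quoted in the issues below, and the values are purely illustrative:

{
  "config_name": "de_only",
  "dump": "2019-13",
  "output_dir": "data",
  "lang_whitelist": ["de"],
  "lm_languages": ["de"],
  "mine_num_processes": 8,
  "target_size": "4G"
}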

Reproducing our work

Given the CPU time required to run the full pipeline on such a large corpus, we share a mapping from URL to the information we computed. You can reconstruct the corpus used in the paper by running:

python -m cc_net --conf reproduce --dump 2019-09

Extract XLM-R data

The XLM-RoBERTa model of the paper Unsupervised Cross-lingual Representation Learning at Scale was trained on data extracted by an internal version of cc_net.

Since the format is a little bit different, please use the following commands instead:

python cc_net/tools/dl_cc_100.py --help
python cc_net/tools/dl_cc_100.py --outdir data_cc100 --process 8

If you use this version of the data please also consider citing:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

Adapting to your infrastructure

Given the computation cost of running the full pipeline, we distributed the computation on a Slurm cluster using submitit. submitit will default to spawning processes on your machine if no Slurm cluster is found. You should tweak --task_parallelism to something adapted to your machine. Defaults are 512 for mining and 20 for reproducing.

To run the tasks in-process use --execution debug.
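For a quick sanity check on a single machine, combining this with the tiny test config mentioned above should work, e.g.:

python -m cc_net --config test --execution debug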

Output format

Generated files are compressed JSON files. There is one JSON object per line.

List of fields:

  • url: webpage URL (part of CC)
  • date_download: date of download (part of CC)
  • digest: sha1 digest of the webpage (part of CC)
  • length: number of chars after deduplication
  • nlines: number of lines after deduplication
  • source_domain: web domain of the webpage
  • title: page title (part of CC)
  • raw_content: webpage content after deduplication
  • original_nlines: number of lines before deduplication
  • original_length: number of chars before deduplication
  • language: language detected by FastText LID
  • language_score: language score
  • perplexity: perplexity of a LM trained on Wikipedia

Sample JSON object:

{
  "url": "http://www.pikespeakhospice.org/members/1420",
  "date_download": "2019-02-15T18:40:25Z",
  "digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
  "length": 752,
  "nlines": 5,
  "source_domain": "www.pikespeakhospice.org",
  "title": "LeeRoy Aragon",
  "raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
  "original_nlines": 7,
  "original_length": 754,
  "language": "en",
  "language_score": 0.99,
  "perplexity": 255.11,
}

You can peek at those files using the UNIX tools zcat and jq, e.g.: zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .

jq can do some complicated filtering. jsonql.py provides a Python API with multiprocessing support for more complicated operations such as LM scoring of documents.
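If you prefer plain Python over jq, the output can also be read with nothing but the standard library. Here is a minimal sketch (the shard path is the same example as above, and the field names follow the output format listed earlier) that keeps only confidently-identified English documents with low perplexity:

import gzip
import json

# Example shard; adjust the path to your own output directory.
path = "data/mined/2019-09/en_head_0000.json.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)  # one JSON object per line
        if doc["language"] == "en" and doc["perplexity"] < 300.0:
            print(doc["url"], doc["perplexity"])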

License

By contributing to cc_net, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.

Comments
  • ChunkedEncodingError & ConnectionResetError

    ChunkedEncodingError & ConnectionResetError

    Here's the log with command nohup python -m cc_net mine --dump 2019-13 > 2019-13.log 2>2019-13.err &:

    2019-11-12 00:26 INFO 22835:HashesCollector - Processed 519_187 documents in 1e+01h ( 14.4 doc/s).
    2019-11-12 00:26 INFO 22835:HashesCollector - Found 25_229k unique hashes over 90_967 lines. Using 3.6GB of RAM.
    2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Kept 43_340 documents over 45_437 (95.4%).
    2019-11-12 00:27 INFO 22835:cc_net.process_wet_file - Parsed 13 / 35 files. Estimated remaining time: 9.2h
    2019-11-12 00:27 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz
    /data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz (1 out of 3)
      f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
    2019-11-12 01:16 INFO 22835:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153624-00039.warc.wet.gz [200]
    2019-11-12 01:16 INFO 22835:HashesCollector - Processed 562_527 documents in 1.1e+01h ( 14.4 doc/s).
    2019-11-12 01:16 INFO 22835:HashesCollector - Found 26_687k unique hashes over 98_562 lines. Using 3.7GB of RAM.
    2019-11-12 01:16 INFO 22835:HashesCollector - Found 26_687k unique hashes over 98_562 lines. Using 3.7GB of RAM.
    2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Kept 43_268 documents over 45_427 (95.2%).
    2019-11-12 01:17 INFO 22835:cc_net.process_wet_file - Parsed 14 / 35 files. Estimated remaining time: 17.7h
    2019-11-12 01:17 INFO 22835:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz
    /data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (1 out of 3)
      f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
    /data/myusername/projects/cc_net/cc_net/jsonql.py:1138: UserWarning: Swallowed error ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer')) while downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-13/segments/1552912201329.40/wet/CC-MAIN-20190318132220-20190318153625-00008.warc.wet.gz (2 out of 3)
      f"Swallowed error {e} while downloading {url} ({i} out of {n_retry})"
    2019-11-12 02:11 INFO 22835:HashesCollector - Processed 605_794 documents in 1.2e+01h ( 14.3 doc/s).
    2019-11-12 02:11 INFO 22835:HashesCollector - Found 0k unique hashes over 106_217 lines. Using 3.7GB of RAM.
    Traceback (most recent call last):
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 425, in _error_catcher
        yield
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 507, in read
        data = self._fp.read(amt) if not fp_closed else b""
      File "/usr/lib/python3.7/http/client.py", line 457, in read
        n = self.readinto(b)
      File "/usr/lib/python3.7/http/client.py", line 501, in readinto
        n = self.fp.readinto(b)
      File "/usr/lib/python3.7/socket.py", line 589, in readinto
        return self._sock.recv_into(b)
      File "/usr/lib/python3.7/ssl.py", line 1071, in recv_into
        return self.read(nbytes, buffer)
      File "/usr/lib/python3.7/ssl.py", line 929, in read
        return self._sslobj.read(len, buffer)
    ConnectionResetError: [Errno 104] Connection reset by peer
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 750, in generate
        for chunk in self.raw.stream(chunk_size, decode_content=True):
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 564, in stream
        data = self.read(amt=amt, decode_content=decode_content)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 529, in read
        raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
      File "/usr/lib/python3.7/contextlib.py", line 130, in __exit__
        self.gen.throw(type, value, traceback)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/urllib3/response.py", line 443, in _error_catcher
        raise ProtocolError("Connection broken: %r" % e, e)
    urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 31, in <module>
        main()
      File "/data/myusername/projects/cc_net/cc_net/__main__.py", line 27, in main
        command(**parsed_args)
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 512, in main
        regroup(conf)
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 364, in regroup
        mine(conf)
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 257, in mine
        hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 206, in hashes
        ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
      File "/data/myusername/projects/cc_net/cc_net/execution.py", line 128, in debug_executor
        message = function(*x)
      File "/data/myusername/projects/cc_net/cc_net/mine.py", line 218, in _hashes_shard
        file=conf.get_cc_shard(shard),
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 448, in run_pipes
        for res in results:
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 295, in map
        for x in source:
      File "/data/myusername/projects/cc_net/cc_net/process_wet_file.py", line 198, in __iter__
        with jsonql.open_remote_file(self.segment_url(segment)) as f:
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1151, in open_remote_file
        content = io.BytesIO(request_get_content(url))
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1136, in request_get_content
        raise e
      File "/data/myusername/projects/cc_net/cc_net/jsonql.py", line 1129, in request_get_content
        r = requests.get(url)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 75, in get
        return request('get', url, params=params, **kwargs)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/api.py", line 60, in request
        return session.request(method=method, url=url, **kwargs)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
        resp = self.send(prep, **send_kwargs)
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/sessions.py", line 686, in send
        r.content
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 828, in content
        self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
      File "/home/myusername/envs/ccnet/lib/python3.7/site-packages/requests/models.py", line 753, in generate
        raise ChunkedEncodingError(e)
    requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
    

    Is this just due to a poor network connection between me and the Amazon server (I'm in China)? If so, is it recommended to run the code from an AWS server located in the US? If I don't have a C++17 compiler, how much memory do I need? Thanks a lot.

    opened by soloice 13
  • Cannot download the precomputed files

    Hi

    I am trying to reproduce the results from your paper. However, after downloading the Common Crawl data from AWS, access to the precomputed files seems to fail.

    Did you change the location of the precomputed files?

    The error messages are like below:

    /.local/lib/python3.7/site-packages/cc_net/jsonql.py:1141: 
    UserWarning: Swallowed error HTTPSConnectionPool(host='dl.fbaipublicfiles.com', port=443): Max retries exceeded with url: 
    /cc_net/2019-09/en_head_0017.json.gz (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7ff4d87913d0>: 
    Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')) while downloading https://dl.fbaipublicfiles.com/cc_net/2019-09/en_head_0017.json.gz (2 out of 3)
    
    opened by yinfeiy-g 7
  • support of Hausa

    Thanks for your contribution to the community. I am wondering whether cc_net contains the Hausa language (ISO id: ha/hau)? In the XLM-R paper, Table 6 mentions that Hausa was included in CCNet. However, I didn't find the language code for Hausa in the dumped files or in the fastText LID documentation.

    opened by donglixp 4
  • ModuleNotFoundError: No module named 'typing_extensions'

    After running make install, I was getting

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/data1/alferre/cc_net/cc_net/__main__.py", line 11, in <module>
        import cc_net.mine
      File "/data1/alferre/cc_net/cc_net/mine.py", line 29, in <module>
        from cc_net import dedup, execution, jsonql, perplexity, process_wet_file
      File "/data1/alferre/cc_net/cc_net/dedup.py", line 25, in <module>
        from cc_net import jsonql
      File "/data1/alferre/cc_net/cc_net/jsonql.py", line 50, in <module>
        from typing_extensions import Literal, Protocol
    

    Running pip install typing_extensions fixed it. So this package is probably missing from the setup.py.

    opened by alexandremuzio 4
  • Fix typo in README (dl_all_lm -> dl_all_lms)

    I found a typo in the README file.

    It seems that make dl_all_lm should be make dl_all_lms in README.md, as it is defined in the Makefile.

    I tested it and it works well. 😃

    CLA Signed 
    opened by chloamme 3
  • Decrease RAM usage, investigate miss documents

    @gwenzek I wrote a post with tips on how to recreate this in GCP. Basically, using S3 or a Google Cloud bucket and mounting it as a disk will save you a lot of storage fees.

    Originally posted by @theblackcat102 in https://github.com/facebookresearch/cc_net/issues/2#issuecomment-599174158

    opened by gwenzek 3
  • Early exit when desired number of documents is reached?

    Apologies if this is mentioned somewhere or is otherwise obvious, but:

    Is there a way to early-exit when a desired number of documents have been collected? Say I only wanted 1 million documents, can I somehow exit the call to python cc_net mine once I have hit that number?

    Thanks a lot in advance.

    opened by JohnGiorgi 3
  • EOFError: Compressed file ended before the end-of-stream marker was reached

    Hi there, I was trying to run the code with the MPExecutor but got the following error:

    2020-07-23 20:44 INFO 156:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz [200]
    2020-07-23 20:44 INFO 156:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512054.0/wet/CC-MAIN-20171211014442-20171211034442-00400.warc.wet.gz
    2020-07-23 20:48 INFO 171:root - Downloaded https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/segments/1512948512208.1/wet/CC-MAIN-20171211052406-20171211072406-00300.warc.wet.gz [200]
    2020-07-23 20:48 INFO 171:HashesCollector - Processed 2_915 documents in 0.078h ( 10.4 doc/s).
    2020-07-23 20:48 INFO 171:HashesCollector - Found 0k unique hashes over 522 lines. Using 0.1GB of RAM.
    multiprocessing.pool.RemoteTraceback: 
    
    Traceback (most recent call last):
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/multiprocessing/pool.py", line 121, in worker
        result = (True, func(*args, **kwds))
      File "/home/cc_net-master/cc_net/execution.py", line 145, in global_fn
        return f(*args[1:])
      File "/home/cc_net-master/cc_net/mine.py", line 233, in _hashes_shard
        file=conf.get_cc_shard(shard),
      File "/home/cc_net-master/cc_net/jsonql.py", line 449, in run_pipes
        for res in results:
      File "/home/cc_net-master/cc_net/jsonql.py", line 296, in map
        for x in source:
      File "/home/cc_net-master/cc_net/process_wet_file.py", line 199, in __iter__
        for doc in parse_warc_file(iter(f), self.min_len):
      File "/home/cc_net-master/cc_net/process_wet_file.py", line 117, in parse_warc_file
        for doc in group_by_docs(lines):
      File "/home/cc_net-master/cc_net/process_wet_file.py", line 89, in group_by_docs
        for warc in warc_lines:
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/gzip.py", line 300, in read1
        return self._buffer.read1(size)
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/_compression.py", line 68, in readinto
        data = self.read(len(byte_view))
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/gzip.py", line 493, in read
        raise EOFError("Compressed file ended before the "
    EOFError: Compressed file ended before the end-of-stream marker was reached
    
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/cc_net-master/cc_net/__main__.py", line 24, in <module>
        main()
      File "/home/cc_net-master/cc_net/__main__.py", line 20, in main
        func_argparse.parse_and_call(parser)
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
        return command(**parsed_args)
      File "/home/cc_net-master/cc_net/mine.py", line 524, in main
        regroup(conf)
      File "/home/cc_net-master/cc_net/mine.py", line 379, in regroup
        mine(conf)
      File "/home/cc_net-master/cc_net/mine.py", line 272, in mine
        hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
      File "/home/cc_net-master/cc_net/mine.py", line 221, in hashes
        ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
      File "/home/cc_net-master/cc_net/execution.py", line 174, in __call__
        global_fn, zip(itertools.repeat(f_name), *args)
      File "/home/app/anaconda3/envs/ccnet/lib/python3.7/multiprocessing/pool.py", line 748, in next
        raise value
    EOFError: Compressed file ended before the end-of-stream marker was reached
    

    and the config I changed in mine.py is like this:

    config_name: str = "base"
        dump: str = "2017-51"
        output_dir: Path = Path("data")  
        execution: str = "mp"
        num_shards: int = 800
        num_segments_per_shard: int = -1
        min_len: int = 300
        hash_in_mem: int = 25
        lang_whitelist: Sequence[str] = ["zh"]
        lang_blacklist: Sequence[str] = []
        lang_threshold: float = 0.5
        lm_dir: Path = Path("data/lm_sp")
        cutoff: Path = CUTOFF_CSV
        lm_languages: Optional[Sequence[str]] = ["zh"]
        mine_num_processes: int = 10
        target_size: str = "2G"
        cleanup_after_regroup: bool = True
        task_parallelism: int = 500
        pipeline: Sequence[str] = []
        experiments: Sequence[str] = []
    

    I searched for this error and the answers all say it is caused by an incomplete downloaded file, but I saw the code annotation in jsonql.py's open_remote_file: "Download the files at the given url to memory and opens it as a file". How can I delete these incomplete downloaded files from memory? Or is there any other solution to fix this error?

    By the way, the environment I was running the code in is a Docker container with Ubuntu 20.04.

    opened by zl827154659 2
  • Dedup all paragraphs if they appear more than once?

    eg. if "it is an issue about cc_net" is a paragraph and it appeared three times, as the NativeHashSet saves the value of this key is 1, the 3 paragraphs will be dropped. Why not save one copy?

    opened by xingenju 2
  • Error: Mining phase failure

    Hello everyone, I'm having problems running the mining phase. I'm using a computer with 60 GB of RAM and 16 CPU cores. When running the mining phase I get the error below.

    Traceback (most recent call last):
      File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/__main__.py", line 18, in <module>
        main()
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/__main__.py", line 14, in main
        func_argparse.parse_and_call(cc_net.mine.get_main_parser())
      File "/home/raphael_assis4347/.local/lib/python3.8/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
        return command(**parsed_args)
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/mine.py", line 631, in main
        all_files = mine(conf)
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/mine.py", line 341, in mine
        ex(_mine_shard, repeat(conf), hashes_files, *_transpose(missing_outputs))
      File "/home/raphael_assis4347/raphael/cc_net/cc_net/execution.py", line 200, in custom_map_array
        raise Exception(message)
    Exception: 9 / 9 jobs failed while running _mine_shard
    Will run cc_net.mine.main with the following config: Config(config_name='base', dump='2019-09', output_dir=PosixPath('data'), mined_dir='mined_data', execution='auto', num_shards=9, num_segments_per_shard=750, metadata=None, min_len=300, hash_in_mem=9, lang_whitelist=['pt'], lang_blacklist=[], lang_threshold=0.5, keep_bucket=['head', 'middle'], lm_dir=PosixPath('data/lm_sp'), cutoff=PosixPath('/home/raphael_assis4347/raphael/cc_net/cc_net/data/cutoff.csv'), lm_languages=['pt'], mine_num_processes=9, target_size='4G', cleanup_after_regroup=True, task_parallelism=-1, pipeline=['dedup', 'lid', 'keep_lang', 'sp', 'lm', 'pp_bucket', 'drop', 'split_by_lang'], experiments=[], cache_dir=PosixPath('data/wet_cache'))
    Submitting 9 jobs for _mine_shard, with task_parallelism=16
    Waiting on 9 running jobs. Job ids: 69378,69397,69400,69419...
    Failed job 69516 (1 / 9): Job 69516 (task: 0) with path /home/raphael_assis4347/raphael/cc_net/data/logs/69516_0_result.pkl
    has not produced any output (state: FINISHED)
    Error stream produced:
    ----------------------------------------
    2022-09-14 16:43 INFO 69535:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x7fd7313c8d00>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x7fd7313c8df0>, <cc_net.perplexity.MultiSentencePiece object at 0x7fd7313c8d30>, <cc_net.perplexity.DocLM object at 0x7fd7313c8e50>, <cc_net.perplexity.PerplexityBucket object at 0x7fd7313c8d60>, <cc_net.perplexity.DropKeys object at 0x7fd7313c8fa0>]
    
    Waiting on 8 running jobs. Job ids: 69378,69397,69400,69419...
    Failed job 69400 (2 / 9): Job 69400 (task: 0) with path /home/raphael_assis4347/raphael/cc_net/data/logs/69400_0_result.pkl
    has not produced any output (state: FINISHED)
    Error stream produced:
    ----------------------------------------
    2022-09-14 16:43 INFO 69418:cc_net.jsonql - preparing [<cc_net.dedup.DuplicatesRemover object at 0x7fcee923a970>, Classifier(bin/lid.bin), <cc_net.jsonql.where object at 0x7fcee923aa60>, <cc_net.perplexity.MultiSentencePiece object at 0x7fcee923a9a0>, <cc_net.perplexity.DocLM object at 0x7fcee923aac0>, <cc_net.perplexity.PerplexityBucket object at 0x7fcee923a9d0>, <cc_net.perplexity.DropKeys object at 0x7fcee923ac10>]
    

    I couldn't identify the problem by looking at the logs. The process's .log.err file only contains the vector of pipeline objects. Does anyone have any idea what it could be?

    This is my configuration file:

    {
        "dump": "2019-09",
        "hash_in_mem": 9,
        "num_shards": 9,
        "mine_num_processes": 9,
        "num_segments_per_shard": 750,
        "lang_whitelist": ["pt"],
        "lm_languages": ["pt"],
        "pipeline": [
            "dedup",
            "lid",
            "keep_lang",
            "sp",
            "lm",
            "pp_bucket",
            "drop",
            "split_by_lang"
        ],
        "execution": "auto",
        "output_dir": "data",
        "mined_dir": "mined_data",
        "target_size": "4G",
        "keep_bucket": ["head", "middle"],
        "cache_dir": "data/wet_cache"
    }
    
    opened by AssisRaphael 1
  • I want to copy the output data of CC_net directly, what should I do?

    run “python -m cc_net --config reproduce --dump 2019-09”

    With the 403 error on metadata='https://dl.fbaipublicfiles.com/cc_net/1.0.0', can I still copy your output?

    I don't have enough CPU and GPU resources to complete the mining process. I want to copy the output data of CC_net directly, what should I do?

    opened by mome1024 1
  • 403 forbidden while downloading

    Hi there, I encountered a 403 error while trying to download cc_net data using this pipeline. I'm wondering if this is because of network settings on my side or if something else is wrong? Thanks in advance.

    /ldap_home/raven.ren/cc_net/cc_net/flat_hash_set.py:115: UserWarning: Module 'getpy' not found. Deduplication will take more RAM. Try pip install cc_net[getpy]
      warnings.warn(
    2022-08-23 19:25 INFO 6898:root - Starting download of https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz
    2022-08-23 19:25 INFO 6898:HashesCollector - Processed 0 documents in 0.00034h ( 0.0 doc/s).
    2022-08-23 19:25 INFO 6898:HashesCollector - Found 0k unique hashes over 0k lines. Using 0.1GB of RAM.
    submitit ERROR (2022-08-23 19:25:23,974) - Submitted job triggered an exception
    2022-08-23 19:25 ERROR 6898:submitit - Submitted job triggered an exception
    Traceback (most recent call last):
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/_submit.py", line 11, in <module>
        submitit_main()
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 72, in submitit_main
        process_job(args.folder)
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 65, in process_job
        raise error
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/submission.py", line 54, in process_job
        result = delayed.result()
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/submitit/core/utils.py", line 133, in result
        self._result = self.function(*self.args, **self.kwargs)
      File "/ldap_home/raven.ren/cc_net/cc_net/mine.py", line 273, in _hashes_shard
        jsonql.run_pipes(
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 455, in run_pipes
        write_jsons(data, output)
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 496, in write_jsons
        for res in source:
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 284, in map
        for x in source:
      File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 195, in __iter__
        n = len(self.segments)
      File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 243, in segments
        segments = cc_segments(self.dump, self.cache_dir)
      File "/ldap_home/raven.ren/cc_net/cc_net/process_wet_file.py", line 38, in cc_segments
        f = jsonql.open_remote_file(wet_paths, cache=wet_paths_cache)
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1124, in open_remote_file
        raw_bytes = request_get_content(url)
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1101, in request_get_content
        raise e
      File "/ldap_home/raven.ren/cc_net/cc_net/jsonql.py", line 1095, in request_get_content
        r.raise_for_status()
      File "/ldap_home/raven.ren/.conda/envs/py38/lib/python3.8/site-packages/requests/models.py", line 960, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-51/wet.paths.gz

    opened by Raven-Ren 2
  • Batch job submission failed: Invalid job array specification

    Hi, when I run "python -m cc_net", this error happened:

    Submitting _hashes_shard in a job array (1600 jobs)
    sbatch: error: Batch job submission failed: Invalid job array specification
    subprocess.CalledProcessError: Command '['sbatch', '/data/gsw/test/cc_net/data/logs/submission_file_479eba35e148432da4432891c1191887.sh']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/data/gsw/test/cc_net/cc_net/__main__.py", line 18, in <module>
        main()
      File "/data/gsw/test/cc_net/cc_net/__main__.py", line 14, in main
        func_argparse.parse_and_call(cc_net.mine.get_main_parser())
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/func_argparse/__init__.py", line 72, in parse_and_call
        return command(**parsed_args)
      File "/data/gsw/test/cc_net/cc_net/mine.py", line 632, in main
        all_files = mine(conf)
      File "/data/gsw/test/cc_net/cc_net/mine.py", line 335, in mine
        hashes_groups = list(jsonql.grouper(hashes(conf), conf.hash_in_mem))
      File "/data/gsw/test/cc_net/cc_net/mine.py", line 263, in hashes
        ex(_hashes_shard, repeat(conf), *_transpose(missing_outputs))
      File "/data/gsw/test/cc_net/cc_net/execution.py", line 89, in map_array_and_wait
        jobs = ex.map_array(function, *args)
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/core/core.py", line 701, in map_array
        return self._internal_process_submissions(submissions)
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/auto/auto.py", line 218, in _internal_process_submissions
        return self._executor._internal_process_submissions(delayed_submissions)
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/slurm/slurm.py", line 332, in _internal_process_submissions
        first_job: core.Job[tp.Any] = array_ex._submit_command(self._submitit_command_str)
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/core/core.py", line 864, in _submit_command
        output = utils.CommandFunction(command_list, verbose=False)()  # explicit errors
      File "/home/gsw/anaconda3/envs/test_p/lib/python3.9/site-packages/submitit/core/utils.py", line 350, in __call__
        raise FailedJobError(stderr) from subprocess_error
    submitit.core.utils.FailedJobError: sbatch: error: Batch job submission failed: Invalid job array specification

    opened by swgu98 2
  • Variance of hash files sizes in newer crawls

    Hello, I noticed that the hash files I've produced from the January 2021 dump (and several other months in 2020) are much smaller (~100x) than the hashes from the dumps of April and May 2019, even though the original WET files were the same size.

    In both cases there are 2 shards per hash file and all the other parameters are the same.

    Trying to understand why, thanks :)

    opened by var926 1
  • "Reproducing our work" does not specify set of languages and snapshots

    README.md provides python -m cc_net --config reproduce --dump 2019-09 as an example to reproduce the cc_net corpus, which relies on

    https://github.com/facebookresearch/cc_net/blob/242e10d1d694031c82817f895e56e27a02618803/cc_net/mine.py#L172-L191

    The combination of dump 2019-09 and the French language provides only a small corpus. As the metadata files are only accessible via https://dl.fbaipublicfiles.com/cc_net/1.0.0, it is impossible to list the underlying S3 bucket to obtain a complete list of available languages and dumps. Thus it would be helpful if you could provide the complete list in your README.

    opened by leezu 2
  • cc_net/tools/dl_cc_100.py fails to extract complete dataset

    python3.7 cc_net/tools/dl_cc_100.py --outdir data/cc100 --processes 96 provides only 99 GB (277 GB uncompressed) of data across 10 languages:

    780M    /mnt/data/cc100/bn_IN
    2.0G    /mnt/data/cc100/hi_IN
    25G     /mnt/data/cc100/id_ID
    12G     /mnt/data/cc100/ko_KR
    89M     /mnt/data/cc100/my_MM
    25G     /mnt/data/cc100/sv_SE
    270M    /mnt/data/cc100/sw_KE
    6.7G    /mnt/data/cc100/th_TH
    475M    /mnt/data/cc100/tl_XX
    21G     /mnt/data/cc100/vi_VN
    

    The script should provide all 100 languages listed in https://arxiv.org/pdf/1911.02116.pdf Figure 1:

    opened by leezu 6