Deduplicating Training Data Makes Language Models Better

Overview

This repository contains code to deduplicate language model datasets as described in the paper "Deduplicating Training Data Makes Language Models Better" by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch and Nicholas Carlini. This repository contains both the ExactSubstr deduplication implementation (written in Rust) and the scripts we used in the paper to perform deduplication and inspect the results (written in Python). In an upcoming update, we will add files to reproduce the NearDup-deduplicated versions of the C4, RealNews, LM1B, and Wiki-40B-en datasets.

This is not an officially supported Google product.

Why deduplicate?

When datasets are created by scraping raw text from the Internet, the same sequences often end up repeated many times (e.g., we find a single 50-word sequence that is repeated in the C4 dataset 60,000 times). Training models on deduplicated datasets is faster (because they see fewer total examples) and experimentally results in models with similar or better perplexity than models trained on data that hasn't been deduplicated. Moreover, language models are less likely to exhibit memorization when their training data has been well-deduplicated.

Citing this work

If you use this repository or our deduplicated datasets you can cite

@article{lee2021deduplicating,
      title={Deduplicating Training Data Makes Language Models Better}, 
      author={Katherine Lee and Daphne Ippolito and Andrew Nystrom and Chiyuan Zhang and Douglas Eck and Chris Callison-Burch and Nicholas Carlini},
      journal={arXiv preprint arXiv:2107.06499},
      year={2021},
}

Exact Deduplication Code

We provide an implementation of the exact deduplication technique used in the paper. This is very much research code: it is a (very slightly cleaned up) version of exactly what we do in the paper. It assumes that you want to deduplicate something the size of C4 (~300GB) running on a machine with 96 cores and >600GB of RAM. If you only want to use this for reasonably-sized datasets, you should change the number of parallel threads from 96 to something smaller. If your machine is big enough, there should be no upper bound on the size of the dataset it can handle (well, 2^64-1 bytes is the limit, but I think we can all agree that's essentially unlimited).

We build a suffix array (based on Andrew Gallant's suffix array implementation) in src/table.rs. It has some minor changes from the original version that make it so we can't just import this library as a crate. First, we need 64-bit integers. The original implementation says that u32 works for "reasonably sized documents (~4GB)" but we're working with unreasonably sized documents. So we need u64. Second, we don't want UTF8 strings. Everything is a [u8] byte array, because we might be working over token sequences which aren't valid UTF8. The main complication in the rest of src/main.rs is the fact that we want things to run in parallel, and we probably can't fit the entire suffix array into memory. And so all of our algorithms are designed around these constraints.
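
As a refresher on the underlying data structure, here is a toy Python sketch (nothing like the optimized Rust implementation): a suffix array over a byte string is simply the list of all suffix start positions, sorted by the suffixes they point to.

    # Toy suffix array over a byte string; the Rust code stores 64-bit indices on disk instead.
    data = b"abracadabra"
    suffix_array = sorted(range(len(data)), key=lambda i: data[i:])
    for i in suffix_array:
        print(i, data[i:])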

If you just want to run the rust deduplicator, then you will only need to install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

If you additionally want to generate datasets to run the rust script on (and you probably do) then you will need python dependencies:

pip3 install numpy scipy tensorflow tensorflow_datasets transformers sentencepiece

Basic Usage

If you just want to reproduce the results of this paper, or deduplicate any dataset that's already in the TensorFlow Datasets (TFDS) format, then you can just run the following commands:

cargo build

to compile the rust code, and then run

python3 scripts/load_dataset.py --data_dir $LOAD_DIR --save_dir $SAVE_DIR --name $DATASET --split $SPLIT [--tokenize]

For example, to get the LM1B test set you could run python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name lm1b --split test. This will take just a few seconds to run on the test set, or about an hour if run with the train split instead.

If the dataset is really big, you might want to add the --tokenize flag. This will shrink the dataset by roughly a factor of two by tokenizing it with the GPT-2 tokenizer.
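
As a rough illustration of where the savings come from (this is not the exact byte encoding that scripts/load_dataset.py uses): GPT-2 token IDs fit in two bytes, while each token covers roughly four characters of English text on average.

    # Hedged sketch of the --tokenize size savings; the repo's script handles the real encoding.
    from transformers import GPT2TokenizerFast
    import numpy as np

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    text = "The quick brown fox jumps over the lazy dog. " * 200
    ids = tok.encode(text)
    packed = np.array(ids, dtype=np.uint16).tobytes()  # GPT-2's vocab (50,257) fits in uint16
    print(len(text.encode("utf-8")), "bytes of UTF-8 ->", len(packed), "bytes of tokens")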

And then to construct the suffix array run

python3 scripts/make_suffix_array.py [path/to/dataset]

For example, if you run python3 scripts/make_suffix_array.py data/lm1b.test, this will create a file data/lm1b.test.table.bin containing the suffix array. Again, this should be fast: a few minutes on 96 cores, or about two hours on the LM1B train set when run single-threaded.

(If you get an error that you have too many open files, that's because this script opens lots of files. You should run ulimit -Sn 1000000 to "fix" the error.)
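
If you would rather raise the limit from inside a Python wrapper before launching the scripts (a hedged, Unix-only alternative to the shell ulimit above), the standard-library resource module can do it:

    # Raise the soft open-file limit as far as the hard limit allows (capped at 1,000,000 here).
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = 1_000_000 if hard == resource.RLIM_INFINITY else min(1_000_000, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))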

Querying a suffix array to find duplicated examples

Start by loading and building a suffix array for a dataset as described above.

Once you have the suffix array, you can query the dataset to find all occurrences of a particular string. To do this, run

python3 scripts/count_occurances.py --suffix [path/to/suffix_array] [--query query_string] [--query_file /path/to/query]

On the LM1B test set, running python3 scripts/count_occurances.py --suffix data/lm1b.test --query " on Tuesday" should return 1288. If you tokenized the dataset, then you should pass --tokenize to count_occurances.py as well, to get the same result (plus or minus tokenization differences).

If you want to confirm that this outputted number is correct (assuming you haven't tokenized), you can run cat data/lm1b.test | grep -ao " on Tuesday" | wc -l and get the same result.
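
The same sanity check can be done from Python; assuming the untokenized file saved earlier, the following should print the same count (bytes.count, like grep -o, counts non-overlapping matches):

    # Count raw, non-overlapping occurrences of the query string in the saved dataset file.
    print(open("data/lm1b.test", "rb").read().count(b" on Tuesday"))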

Advanced Usage

The above scripts work by calling into the core Rust suffix array deduplicator. If you want to do each step yourself, the following options are available:

Single threaded suffix array construction

To build a suffix array for any particular file, you can run

cargo run save [file_path]

This will create a file called [file_path].table.bin which contains the suffix array for the file provided. This algorithm is linear time, but (a) only runs on a single core, and (b) has memory requirement O(big * len(file)) which is prohibitive for large files.

Parallel suffix array construction

To build a suffix array for an extremely large file (e.g., one that is about as large as the available RAM), it is better to run the script

python scripts/make_suffix_array.py [file_path]

This script will build the suffix array in parallel by splitting the single file into chunks, generating suffix arrays for each chunk, and then merging the suffix arrays together to form the full suffix array. Note that in general this algorithm is quadratic, but when the maximum substring length is short relative to the total file length (as it is, when generating suffix arrays for N independent training examples) it will never reach this worst case behavior.

The two steps are described below.

Building a piece of a suffix array from a piece of a file

The first step generates a suffix array from a piece of the file. This is implemented by running

cargo run save_part [file_path] [byte_start] [byte_end]

This builds a suffix array for the byte sequence between [byte_start] and [byte_end] of the given file. Multiple of these jobs can be run in parallel to build the suffix array for a large file quickly, as sketched below.
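
A minimal sketch of the chunking idea (the real splitting and retry logic lives in scripts/make_suffix_array.py; the chunk count and the small overlap between neighbouring chunks below are arbitrary):

    # Split the file into overlapping byte ranges and launch save_part jobs in parallel
    # (assumes the binary has already been built with `cargo build`).
    import os
    import subprocess

    path = "data/lm1b.test"
    total = os.path.getsize(path)
    jobs, overlap = 4, 100_000
    step = total // jobs + 1

    procs = []
    for start in range(0, total, step):
        end = min(start + step + overlap, total)
        procs.append(subprocess.Popen(
            ["./target/debug/dedup_dataset", "save_part", path, str(start), str(end)]))
    for p in procs:
        p.wait()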

Merging suffix array pieces to create a single suffix array

Given several independent suffix array pieces, merging them is now just a matter of calling

cargo run merge_parallel [path_to_partial_suffix_trees,...] [tmp_output_directory]

to generate a collection of ordered suffix array pieces in the output directory. The final step just requires concatenating these together:

cat [tmp_output_directory]/* > [file_path].table.bin

Finding Duplicates

Given a suffix array file, as generated in the previous section, it can now be queried for interesting statistics. The simplest operation, counting occurrences of particular substrings, takes O(log(N)) time and O(query_length) memory (as shown above with scripts/count_occurances.py). To do this you can run:

cargo run count_occurances /path/to/dataset /path/to/query_file

(Indeed, the python script is just a wrapper that makes calling this nicer, with the option for tokenization.) This is useful mainly as a command-line interface to interact with the dataset and find interesting properties. To run more sophisticated analysis, use the tools described below.
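
To make the O(log(N)) counting claim above concrete, here is a toy Python sketch of the idea (not the Rust code): binary-search the sorted suffix array for the first and last suffixes that start with the query, and return the size of that range.

    # Toy in-memory example; the real code binary-searches the on-disk table instead.
    data = b"aaabbbaaabbb"
    sa = sorted(range(len(data)), key=lambda i: data[i:])  # toy suffix array construction

    def count_occurrences(query: bytes) -> int:
        lo, hi = 0, len(sa)
        while lo < hi:  # lower bound: first suffix whose prefix is >= query
            mid = (lo + hi) // 2
            if data[sa[mid]:sa[mid] + len(query)] < query:
                lo = mid + 1
            else:
                hi = mid
        first = lo
        lo, hi = first, len(sa)
        while lo < hi:  # upper bound: first suffix whose prefix is > query
            mid = (lo + hi) // 2
            if data[sa[mid]:sa[mid] + len(query)] <= query:
                lo = mid + 1
            else:
                hi = mid
        return lo - first

    print(count_occurrences(b"ab"))  # prints 2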

Finding duplicates between two documents

Given a document A and another document B, we can find all duplicates between the two by (1) constructing suffix arrays for both, and then (2) linearly walking the suffix arrays in order to find all duplicates of a given length.

Once the suffix array for the dataset has been constructed, this algorithm therefore requires time O(len(dataset) + len(query)) and space O(len(dataset)). It is better to run this algorithm when the number of queries into the dataset is greater than O(len(dataset)/log(len(query))). However, note that the prior code requires disk seeks, while this implementation is a linear scan through the suffix array table, so in practice there is at least a factor-of-10 speedup here. As a rough order of magnitude, for a dataset of ~100GB, it is faster to run similar_parallel when querying with more than a few megabytes of text. Otherwise it is probably faster to run count_occurances.

Notice that this command also requires that the entire dataset fits in memory. For many datasets this is not a problem, but the C4 dataset is 350 GB and the Pile dataset is 750 GB (both even after tokenization). The machine must therefore have a lot of RAM for this to work.

cargo run similar_parallel [dataset1] [dataset2]

This creates lots of output files containing the positions of all examples in dataset2 that are also in dataset1. (The code could also do the inverse at the same time, if you want to modify it slightly.) However, it spits this out in a not-very-useful form: a list of positions x_i such that dataset2[x_i:x_i+100] is also in dataset1, and these matches probably overlap.
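
For intuition, here is a toy Python sketch of the linear walk (nothing like the optimized, parallel Rust code): merge the suffixes of both documents in lexicographic order and report positions where a suffix of B shares a sufficiently long prefix with a neighbouring suffix of A. The documents and the length threshold below are made up, and the overlapping hits it prints mirror the overlapping x_i described above.

    # Toy cross-document duplicate finder via a sorted merge of suffixes.
    doc_a = b"the cat sat on the mat. this sentence is duplicated verbatim."
    doc_b = b"another document entirely. this sentence is duplicated verbatim. the end."
    threshold = 30

    suffixes = sorted(
        [(doc_a[i:], "A", i) for i in range(len(doc_a))] +
        [(doc_b[i:], "B", i) for i in range(len(doc_b))]
    )

    def lcp(x: bytes, y: bytes) -> int:
        n = 0
        while n < min(len(x), len(y)) and x[n] == y[n]:
            n += 1
        return n

    for (s1, d1, i1), (s2, d2, i2) in zip(suffixes, suffixes[1:]):
        if d1 != d2 and lcp(s1, s2) >= threshold:
            b_pos = i1 if d1 == "B" else i2
            print("match of length >=", threshold, "starting at doc_b offset", b_pos)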

The second step is then to run

cargo run collect_similar [dataset2]

This converts the result into ranges, so that instead we have matches of the form dataset2[x_i:y_i].
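
Conceptually, this second step is an interval merge; a minimal Python sketch of the idea (the positions below are hypothetical, and the real output format of the Rust code may differ):

    # Turn sorted match positions x_i (each covering bytes [x_i, x_i+100)) into merged ranges.
    match_len = 100
    positions = [10, 50, 60, 500, 550]  # hypothetical output of the previous step

    ranges = []
    for x in sorted(positions):
        if ranges and x <= ranges[-1][1]:  # overlaps or touches the previous range
            ranges[-1][1] = max(ranges[-1][1], x + match_len)
        else:
            ranges.append([x, x + match_len])
    print(ranges)  # [[10, 160], [500, 650]]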

Finding duplicates within one document

To find duplicates that are contained within one document (for example, to actually deduplicate a dataset as we do in the paper) run the command

cargo run selfsimilar_parallel [dataset]

This will find all repeated substrings contained in the dataset above a given length threshold. Again, run collect_similar to find the indices of the repeated examples.
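
Once you have byte ranges for the duplicated spans (however you parse them out of the collect_similar output), actually deduplicating the file is a matter of copying everything outside those ranges. A hedged sketch, with hypothetical ranges and the file path from the earlier example:

    # Write a copy of the dataset with the given (start, end) byte ranges removed.
    data = open("data/lm1b.test", "rb").read()
    ranges = [(1000, 1100), (5000, 5200)]  # hypothetical, sorted, non-overlapping byte ranges

    kept, prev = [], 0
    for start, end in ranges:
        kept.append(data[prev:start])
        prev = end
    kept.append(data[prev:])
    open("data/lm1b.test.deduped", "wb").write(b"".join(kept))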

Approx Deduplication Results

Coming soon.

Comments
  • Can the tool run on plain text files?

    Hello, I'm trying to deduplicate several plain text files. If I run python scripts/make_suffix_array.py myfile.en, it correctly generates the myfile.en.table.bin file.

    However, if I run cargo selfsimilar_parallel myfile.en, it shows no duplicates.

    myfile.en contains the same string 10 times, so I am wondering whether I have to use the TFDS format or not.

    opened by m-resta 20
  • Accessing the duplicates and their counts

    Hey, thanks for releasing the code!

    I'm a bit confused about how to use the dups_ and sizes_ files. I would like to get a mapping between all duplicate strings and their corresponding number of appearances in the data. From my understanding, this is what the entries in those files represent, but I don't understand how to read them. Any explanation would be helpful! (and a code snippet / reference would be even better!)

    Thanks

    opened by yanaiela 12
  • Error on self deduplication

    I am planning to reproduce the self deduplication result for lm1b. I have already produced the result mentioned in the readme here.

    However, when running selfsimilar_parallel, it shows Final answer 0 and when running collect_similar it throws an error of thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/main.rs:1244:26. Am I missing something here?

    Log:

    $ python3 scripts/count_occurances.py --suffix dataset_save/lm1b.test --query " on Tuesday"
    b' on Tuesday'
    Number of times present: 1288
    
    
    $ cargo run selfsimilar_parallel dataset_save/lm1b.test
    warning: function is never used: `get_example_index`
       --> src/main.rs:447:4
        |
    447 | fn get_example_index(table:&[u64], position:u64) -> usize{
        |    ^^^^^^^^^^^^^^^^^
        |
        = note: `#[warn(dead_code)]` on by default
    
    warning: unused `Result` that must be used
       --> src/main.rs:367:2
        |
    367 |     tablestream.file.read_exact(&mut tablestream.cache);
        |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |
        = note: `#[warn(unused_must_use)]` on by default
        = note: this `Result` may be an `Err` variant, which should be handled
    
    warning: unused `Result` that must be used
       --> src/main.rs:379:2
        |
    379 |     file.read_exact(&mut cache);
        |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |
        = note: this `Result` may be an `Err` variant, which should be handled
    
    warning: 3 warnings emitted
    
        Finished dev [optimized + debuginfo] target(s) in 0.02s
         Running `target/debug/dedup_dataset selfsimilar_parallel dataset_save/lm1b.test`
    Start load!
    Loading ratio is 8
    0 / 453700
    Final answer 0
    
    
    $ cargo run collect_similar dataset_save/lm1b.test
    warning: function is never used: `get_example_index`
       --> src/main.rs:447:4
        |
    447 | fn get_example_index(table:&[u64], position:u64) -> usize{
        |    ^^^^^^^^^^^^^^^^^
        |
        = note: `#[warn(dead_code)]` on by default
    
    warning: unused `Result` that must be used
       --> src/main.rs:367:2
        |
    367 |     tablestream.file.read_exact(&mut tablestream.cache);
        |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |
        = note: `#[warn(unused_must_use)]` on by default
        = note: this `Result` may be an `Err` variant, which should be handled
    
    warning: unused `Result` that must be used
       --> src/main.rs:379:2
        |
    379 |     file.read_exact(&mut cache);
        |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        |
        = note: this `Result` may be an `Err` variant, which should be handled
    
    warning: 3 warnings emitted
    
        Finished dev [optimized + debuginfo] target(s) in 0.02s
         Running `target/debug/dedup_dataset collect_similar dataset_save/lm1b.test`
    Sorting.
    Sorted.
    thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/main.rs:1244:26
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    
    opened by zijwang 10
  • Error with table size not being divisible by text size

    Hi, I'm getting an error because size of the suffix table is not divisible by length of the text. https://github.com/google-research/deduplicate-text-datasets/blob/8a172b0d815862b1131da203be24d430e121a725/src/main.rs#L479

    I'm running bash scripts/deduplicate_single_file.sh test.txt test_dedup.txt 20 1, where test.txt just contains a few paragraphs from a random Wikipedia article plus some duplicate text that I manually added. I'm doing this mainly for debugging purposes (I would like to later make some edits to keep the first occurrence of duplicate samples and throw away the rest). If I run the command on my actual dataset, which is roughly ~70GB, I don't encounter this issue. So I'm wondering what the issue is? Does the code not work with datasets that are too small?

    Thanks!

    Update: I just found out that running the command on the actual 70GB dataset also raised the same error.

    opened by jinyongyoo 7
  • How to dedup between two datasets?

    A practical situation is that given two datasets A and B, we want to remove the data in A that has huge overlap with B. Is there a command that I could use to achieve this functionality? The readme only covers finding duplicates within a single document or between a pair of documents.

    opened by mralexis1 7
  • Why not use Simhash?

    Google has shown that Simhash is practically useful for identifying near-duplicates in web documents belonging to a multi-billion-page repository (Detecting Near-Duplicates for Web Crawling). In your paper, you chose MinHash for approximate matching. Why not use Simhash in this scenario?

    opened by Ethan-yt 3
  • question about deduplication cluster size

    As shown in the attached screenshots, the cluster starting at 0x02954cb9 has a size of 3, but when I count it using bytes.count(), it shows 2.

    I tried different datasets and observed the same phenomenon. Did I make a mistake about the size meaning?

    opened by everks 2
  • Unexpected behavior with ending symbols

    Hi again,

    I found that count-occurrences has unexpected behavior if you want to count the last symbols in a sequence. Here are the examples:

    • sequence "aaabbb", query "b": expected 3, but the output is Number of times present: 2
    • another one: sequence "aaabbb", query "bb": expected 2, but the actual output is Number of times present: 0

    Can you fix this? Thanks!

    opened by mitya52 2
  • "failed to fill whole buffer" errors

    Hi,

    I have tried to run the code on a simple string, and count-occurrences fails with a "failed to fill whole buffer" error.

    Here are steps to reproduce:

    1. run ./target/debug/dedup_dataset make --data-file dup.txt, where the data file dup.txt contains the simple string "aaabbb"
    2. then run ./target/debug/dedup_dataset count-occurrences --data-file dup.txt --query-file query.txt, where query.txt contains
    • "bb" (expected: Number of times present: 2; actual: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:275:31)
    • "ab" (expected: Number of times present: 1; actual: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:297:31)
    • "b" (expected: Number of times present: 2; actual: Number of times present: 1)

    Maybe I'm doing something wrong? Thanks.

    opened by mitya52 2
  • Should newline char be removed

    Hi, so I noticed that this read here adds a \n char to the end of the query. This then causes an issue with the count if it's not actually an end-of-line. Should a .strip() be added here?

    arr = open(args.query_file,"rb").read().strip()

    Thanks.

    opened by cperiz 1
  • Fix multiprocessing bug in Windows/Mac OS X

    The multiprocessing pool is started with the default method for launching child processes, which is OS-specific. The default on Unix is "fork", and the resulting process inherits all resources from the parent process. Conversely, the default on Mac OS X/Windows is "spawn", which results in the child process inheriting only a minimal set of resources.

    I changed the code to explicitly use "fork", and I was able to run through the README on a Mac M1. I presume this fix would also help Windows users, though I haven't tested myself.

    While I was at it, I fixed a small typo in the README.

    Thanks for sharing the code and having a really nice README!

    opened by alistairewj 1
  • Off-by-1 error in `collect`?

    Hi, thanks for the great repo!

    I'm using the tool to deduplicate a dataset, and I'm trying to investigate what happens in subsequent steps. I noticed that after running collect, some of the duplicate strings seem to start with control characters, e.g. after running code similar to this:

    >>> data=open("data/my_data.test","rb").read()
    >>> data[left:right]
    

    where left and right are one of the pairs returned by collect, I get something like this:

    b'\x00\x00Enter the username or e-mail you used in your profile. A password reset link will be sent to you by email.'
    

    I'm cleaning the control characters up in my main text so it looks like parts of the separator codes are being leaked. Interestingly, this doesn't happen consistently, but it does happen more on the more frequent strings. Also, matched documents from my original dataset don't contain control characters.

    Any chance there's some sort of an off-by-1 error in collect? Not a huge deal but I'd like to understand what's happening here

    opened by ola13 0
  • how to deduplicate huggingface datasets

    Hey there, excellent work on this repo and the paper.

    I wanted to know how I could use this to deduplicate my custom Hugging Face dataset, which I have developed and cleaned myself.

    It has been saved with custom_dataset.save_to_disk("dataset_path")

    and can be loaded with custom_dataset = datasets.load_from_disk("dataset_path")

    opened by StephennFernandes 6
  • Fix to issue #17 limits cmd_merge to be single-threaded

    Hi,

    it looks like the fix for issue #17, which puts some limits on the number of threads in cmd_merge, is a bit too aggressive, resulting in only using a single thread even for big workloads:

    https://github.com/google-research/deduplicate-text-datasets/blob/ad86c7f65ac626581fe3a4277106309bc6b50c23/src/main.rs#L1020-L1023

    texts.len() is equal to nn (the number of input parts); I think you want something like

        let num_threads = std::cmp::min(num_threads, std::cmp::max((texts_len.iter().sum::<usize>() as i64 - 1024)/10, 1));
    

    instead.

    opened by kleinj 2
  • RAM crash when use collect method

    First of all, thanks for releasing the code.

    I have a dataset (mC4) of about 110 GB.

    My machine has a 96-core CPU and 350 GB of RAM.

    I've successfully created a 524 GB suffix array from that dataset.

    I also managed to run the deduplicator (self-similar method with a threshold of 100) with no memory issues, creating about ~140 GB of cache files (20B examples).

    But when I run the collect method, my RAM blows up after a few minutes.

    I traced through the code and found that my RAM crashes when this code/step runs: https://github.com/google-research/deduplicate-text-datasets/blob/ad86c7f65ac626581fe3a4277106309bc6b50c23/src/main.rs#L1188

    Is this expected? Do you have a workaround for the issue?

    AFAIK, the collect method just merges all the duplicate sequences found in the dataset and only returns a text file with pairs of bytes, CMIIW.

    I'm thinking it could maybe write out the text file as soon as each cache file finishes being processed/read, instead of waiting for all of them to finish (this is just an idea; I don't know if it's possible... I'm not an expert in Rust).

    Thank you

    opened by acul3 1
  • Error when running the code

    Hi,

    I'm trying to deduplicate my plain text file, but I'm running into some errors. I first run

    python scripts/make_suffix_array.py c4-train.00000-of-01024.txt
    

    The output is

    ./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 0 --end-byte 114700294
    ./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 114600294 --end-byte 229300588
    ./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 229200588 --end-byte 343900882
    ./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 343800882 --end-byte 458401177
    Waiting for jobs to finish
    Checking all wrote correctly
    FACT 4.0
    FACT 4.0
    FACT 4.0
    FACT 4.0
    Rerunning 0 jobs because they failed.
    Merging suffix trees
    ./target/debug/dedup_dataset merge --output-file tmp/out.table.bin --suffix-path c4-train.00000-of-01024.txt.part.0-114700294 --suffix-path c4-train.00000-of-01024.txt.part.114600294-229300588 --suffix-path c4-train.00000-of-01024.txt.part.229200588-343900882 --suffix-path c4-train.00000-of-01024.txt.part.343800882-458401177 --num-threads 256
    thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
    thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    [... the same "Too many open files" panic repeated and interleaved across many worker threads ...]
    thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any { .. }', /home/yiming/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.3.2/src/scoped.rs:34:43
    Now merging individual tables
    Cleaning up
    

    Yet, it successfully creates the suffix array files:

    c4-train.00000-of-01024.txt.part.0-114700294
    c4-train.00000-of-01024.txt.part.0-114700294.table.bin
    c4-train.00000-of-01024.txt.part.114600294-229300588
    c4-train.00000-of-01024.txt.part.114600294-229300588.table.bin
    c4-train.00000-of-01024.txt.part.229200588-343900882
    c4-train.00000-of-01024.txt.part.229200588-343900882.table.bin  
    c4-train.00000-of-01024.txt.part.343800882-458401177           
    c4-train.00000-of-01024.txt.part.343800882-458401177.table.bin  
    c4-train.00000-of-01024.txt.table.bin
    

    Then, I run

    cargo run self-similar --data-file c4-train.00000-of-01024.txt --length-threshold 15 --cache-dir cache --num-threads 128
    

    It gives me below error:

        Finished dev [optimized + debuginfo] target(s) in 5.69s
         Running `target/debug/dedup_dataset self-similar --data-file c4-train.00000-of-01024.txt --length-threshold 15 --cache-dir cache --num-threads 128`
    Start load!
    thread 'main' panicked at 'assertion failed: metadata.len() % (text.len() as u64) == 0', src/main.rs:479:5
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    

    May I ask how to fix this? Thank you!

    Yiming

    opened by MatthewCYM 14