Codebase for testing whether hidden states of neural networks encode discrete structures.

John Hewitt

Last update: Dec 17, 2022

Overview

structural-probes

Based on the paper A Structural Probe for Finding Syntax in Word Representations.

See the blog post on structural probes for a brief introduction.

Installing & Getting Started

Clone the repository.

 git clone https://github.com/john-hewitt/structural-probes/
 cd structural-probes

[Optional] Construct a virtual environment for this project. Only python3 is supported.
```
 conda create --name sp-env
 conda activate sp-env
```
Install the required packages. This mainly means pytorch, scipy, numpy, seaborn, etc. PyTorch is not installed via requirements.txt, so look at pytorch.org for the installation that suits you and install it yourself. Everything in the repository uses a GPU if one is available and otherwise falls back to the CPU, so use the PyTorch install of your choice.
```
 conda install --file requirements.txt
 pip install pytorch-pretrained-bert
```
Download some pre-packaged data from the English Universal Dependencies (EWT) dataset and pretrained probes to get your feet wet.
```
 bash ./download_example.sh
```
This will make the directory example/data, and in it will be 9 files, 3 for each of train, dev, and test.
- en_ewt-ud-{train,dev,test}.conllu: the parsed language data
- en_ewt-ud-{train,dev,test}.txt: whitespace-tokenized, sentence-per-line language data.
- en_ewt-ud-{train,dev,test}.elmo-layers.hdf5: the ELMo hidden states for each sentence of the language data, constructed by running elmo on the .txt files.
Test a pre-trained structural probe on BERT-large with our demo script!
```
 printf "The chef that went to the stores was out of food" | python structural-probes/run_demo.py example/demo-bert.yaml
```
The script will make a new directory under example/results/ and store some neat visualizations there. It will use pre-trained probe parameters stored at example/data, downloaded with download_example.sh. Try out some other sentences too!
Run an experiment using an example experiment configuration, and take a look at the resultant reporting!
```
 python structural-probes/run_experiment.py example/config/prd_en_ewt-ud-sample.yaml
```
The path to a new directory containing the results of the experiment will be in the first few lines of the logging output of the script. Once you go there, you can see dev-pred*.png: some distance matrices printed by the script, as well as files containing the quantitative reporting results, like dev.uuas, the unlabeled undirected attachment score. These will all be very low, since the probe was trained on very little data!

Run a pretrained structural probe on `BERT-large` quickly on the command line.

It's easy to get predictions on a sentence (or file of sentences) using our demo script and the pre-trained structural probes we release. We use pytorch-pretrained-bert to get BERT subword embeddings for each sentence; it should be installed during setup of the repository.

Make sure you've run download_example.sh and installed all dependencies. The script downloads two probe parameter files to example/data/: one is a distance probe on the 16th hidden layer of BERT-large, and the other is a depth probe on the same layer. The configuration file example/demo-bert.yaml has the right paths already plugged in; just pipe text into the demo script, as follows:

 printf "The chef that went to the stores was out of food" | python structural-probes/run_demo.py example/demo-bert.yaml

If you want to run multiple sentences at once, you can either do so via printf:

 printf "The chef that went to the stores was out of food\nThe chef that went to the stores and talked to the parents was out of food" | python structural-probes/run_demo.py example/demo-bert.yaml

Or piping/redirecting a file to stdin:

 cat my_set.txt | python structural-probes/run_demo.py example/demo-bert.yaml

The script will print out a directory to which it has written visualizations of both parse depths and parse distances as predicted by a distance probe and a depth probe. You'll also see demo.tikz, which is a bit of LaTeX for the tikz-dependency package. With tikz-dependency in the same directory as your LaTeX file, you can plop this bit of LaTeX in a figure environment and see the minimum spanning tree it constructs. It'd look a bit like this:

\documentclass{article}
\usepackage{tikz-dependency}
\usepackage{tikz}

\pgfkeys{%
/depgraph/reserved/edge style/.style = {% 
white, -, >=stealth, % arrow properties                                                                            
black, solid, line cap=round, % line properties
rounded corners=2, % make corners round
},%
}
\begin{document}
\begin{figure}
  \centering
  \small
  \begin{dependency}[hide label, edge unit distance=.5ex]
    \begin{deptext}[column sep=0.05cm]
      The\& chef\& who\& ran\& to\& the\& stores\& is\& out\& of\& food \\
    \end{deptext}                                                                                                                                                                                                                           
    \depedge[edge style={red}, edge below]{8}{9}{.}
    \depedge[edge style={red}, edge below]{5}{7}{.}
    \depedge[edge style={red}, edge below]{4}{5}{.}
    \depedge[edge style={red}, edge below]{1}{2}{.}
    \depedge[edge style={red}, edge below]{6}{7}{.}
    \depedge[edge style={red}, edge below]{9}{10}{.}
    \depedge[edge style={red}, edge below]{10}{11}{.}
    \depedge[edge style={red}, edge below]{3}{4}{.}
    \depedge[edge style={red}, edge below]{2}{4}{.}
    \depedge[edge style={red}, edge below]{2}{8}{.}
  \end{dependency}
\end{figure}
\end{document}

Compiling this produces a PDF showing the sentence with the predicted minimum spanning tree drawn in red beneath it.

Note that your text should be whitespace-tokenized! If you want to evaluate on a test set with gold parses, or if you want to train your own structural probes, read on.

The experiment config file

Experiments run with this repository are specified via yaml files that completely describe the experiment (except the random seed.) In this section, we go over each top-level key of the experiment config.

Dataset:

observation_fieldnames: the fields (columns) of the conll-formatted corpus files to be used. Must be in the same order as the columns of the corpus. Each field will be accessible as an attribute of each Observation object (e.g., observation.sentence contains the sequence of tokens comprising the sentence.)
corpus: The location of the train, dev, and test conll-formatted corpora files. Each of train_path, dev_path, test_path will be taken as relative to the root field.
embeddings: The location of the train, dev, and test pre-computed embedding files (ignored if not applicable). Each of train_path, dev_path, test_path will be taken as relative to the root field. The type field is ignored.
batch_size: The number of observations to put into each batch for training the probe. 20 or so should be great.

dataset:
  observation_fieldnames:
     - index
     - sentence
     - lemma_sentence
     - upos_sentence
     - xpos_sentence
     - morph
     - head_indices
     - governance_relations
     - secondary_relations
     - extra_info
     - embeddings
  corpus:
    root: example/data/en_ewt-ud-sample/
    train_path: en_ewt-ud-train.conllu
    dev_path: en_ewt-ud-dev.conllu
    test_path: en_ewt-ud-test.conllu
  embeddings:
    type: token #{token,subword}
    root: example/data/en_ewt-ud-sample/ 
    train_path: en_ewt-ud-train.elmo-layers.hdf5
    dev_path: en_ewt-ud-dev.elmo-layers.hdf5
    test_path: en_ewt-ud-test.elmo-layers.hdf5
  batch_size: 40
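
To make the link between observation_fieldnames and the corpus columns concrete, here is a minimal sketch of grouping tab-separated CoNLL lines into one record per sentence. It is not the repository's reader: the function name read_observations is hypothetical, multiword-token lines are not handled, and the sketch assumes the final embeddings field is attached separately, since the ten-column file has no such column.

```python
from collections import namedtuple

# Field names as in the config above, minus 'embeddings' (assumed to be
# attached later; it has no column in the corpus file).
FIELDNAMES = [
    "index", "sentence", "lemma_sentence", "upos_sentence", "xpos_sentence",
    "morph", "head_indices", "governance_relations", "secondary_relations",
    "extra_info",
]


def read_observations(lines, fieldnames=FIELDNAMES):
    """Group CoNLL token lines into one namedtuple per sentence."""
    Observation = namedtuple("Observation", fieldnames)
    observations, rows = [], []

    def flush():
        if rows:
            # Transpose token rows into one tuple per field (column).
            observations.append(Observation(*zip(*rows)))
            rows.clear()

    for line in lines:
        line = line.strip()
        if line.startswith("#"):      # comment / metadata line
            continue
        if not line:                  # a blank line ends a sentence
            flush()
            continue
        cells = line.split("\t")
        if len(cells) != len(fieldnames):
            raise ValueError(
                f"expected {len(fieldnames)} columns, got {len(cells)}")
        rows.append(cells)
    flush()                           # file may not end with a blank line
    return observations


sample = [
    "# text = The chef cooked",
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_",
    "2\tchef\tchef\tNOUN\tNN\t_\t3\tnsubj\t_\t_",
    "3\tcooked\tcook\tVERB\tVBD\t_\t0\troot\t_\t_",
    "",
]
obs = read_observations(sample)[0]
print(obs.sentence)      # ('The', 'chef', 'cooked')
print(obs.head_indices)  # ('2', '3', '0')
```

If the order of observation_fieldnames does not match the column order, the attributes silently hold the wrong columns, which is why the order requirement above matters.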

Model

hidden_dim: The dimensionality of the representations to be probed. The probe parameters constructed will be of shape (hidden_dim, maximum_rank)
embedding_dim: ignored
model_type: One of ELMo-disk, BERT-disk, ELMo-decay, ELMo-random-projection as of now. Used to help determine which Dataset class should be constructed, as well as which model will construct the representations for the probe. The Decay0 and Proj0 baselines in the paper are from ELMo-decay and ELMo-random-projection, respectively. In the future, will be used to specify other PyTorch models.
use_disk: Set to True to assume that pre-computed embeddings should be stored with each Observation; Set to False to use the words in some downstream model (this is not supported yet...)
model_layer: The index of the hidden layer to be used by the probe. For example, ELMo models can use layers 0,1,2; BERT-base models have layers 0 through 11; BERT-large 0 through 23.
tokenizer: If a model will be used to construct representations on the fly (as opposed to using embeddings saved to disk) then a tokenizer will be needed. The type string will specify the kind of tokenizer used. The vocab_path is the absolute path to a vocabulary file to be used by the tokenizer.

model:
  hidden_dim: 1024 # ELMo hidden dim
  #embedding_dim: 1024 # ELMo word embedding dim
  model_type: ELMo-disk # BERT-disk, ELMo-disk,
  tokenizer:
    type: word
    vocab_path: example/vocab.vocab
  use_disk: True
  model_layer: 2 # zero-indexed; BERT-base: {0,...,11}; ELMo: {0,1,2}
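
Since model_layer is a zero-based index into the stored layers, a quick range check before training can save a confusing h5py error. The sketch below assumes the stored array for one sentence is laid out layer-first, (layer_count, seq_len, hidden_dim), and uses nested lists in place of a real hdf5 dataset; select_layer is a hypothetical helper, not a function in this repository.

```python
def select_layer(feature_stack, model_layer):
    """Return one layer from a (layer_count, seq_len, hidden_dim) stack."""
    layer_count = len(feature_stack)
    if not 0 <= model_layer < layer_count:
        raise ValueError(
            f"model_layer={model_layer} but the stored embeddings have "
            f"{layer_count} layers (valid: 0..{layer_count - 1})")
    return feature_stack[model_layer]


# Toy stack: 3 layers (as in ELMo), 2 tokens, 4 dimensions per token.
toy_stack = [[[float(layer)] * 4 for _ in range(2)] for layer in range(3)]

print(select_layer(toy_stack, 2)[0])  # [2.0, 2.0, 2.0, 2.0]

try:
    select_layer(toy_stack, 15)       # e.g. a BERT-large layer index on ELMo data
except ValueError as err:
    print(err)
```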

Probe, probe-training

task_signature: Specifies the function signature of the task. Currently, can be either word, for parse depth (or perhaps labeling) tasks; or word_pair for parse distance tasks.
task_name: A unique name for each task supported by the repository. Right now, this includes parse-depth and parse-distance.
maximum_rank: Specifies the dimensionality of the space to be projected into, if psd_parameters=True. The projection matrix is of shape (hidden_dim, maximum_rank). The rank of the subspace is upper-bounded by this value. If psd_parameters=False, then this is ignored.
psd_parameters: though not reported in the paper, the parse_distance and parse_depth tasks can be accomplished with a non-PSD matrix inside the quadratic form. All experiments for the paper were run with psd_parameters=True, but setting psd_parameters=False will simply construct a square parameter matrix. See the docstring of probe.TwoWordNonPSDProbe and probe.OneWordNonPSDProbe for more info.
diagonal: Ignored.
params_path: The path, relative to args['reporting']['root'], to which to save the probe parameters.
epochs: The maximum number of epochs to which to train the probe. (Regardless, early stopping is performed on the development loss.)
loss: A string to specify the loss class. Right now, only L1 is available. The class within loss.py will be specified by a combination of this and the task name, since for example distances and depths have different special requirements for their loss functions.
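
As a rough illustration of what maximum_rank and psd_parameters=True mean, the sketch below applies a (hidden_dim, maximum_rank) projection and takes squared L2 norms, following the formulation in the paper. It uses plain Python lists so it runs without PyTorch; the function names are hypothetical and this is not the code in probe.py.

```python
def project(vector, proj):
    """Map a hidden_dim vector through a (hidden_dim, maximum_rank) matrix."""
    rank = len(proj[0])
    return [sum(vector[d] * proj[d][r] for d in range(len(vector)))
            for r in range(rank)]


def squared_norm(vector):
    return sum(x * x for x in vector)


def predicted_depths(hidden_states, proj):
    """Depth of each word: squared norm of its projected vector."""
    return [squared_norm(project(h, proj)) for h in hidden_states]


def predicted_distances(hidden_states, proj):
    """Distance of each word pair: squared norm of the projected difference."""
    transformed = [project(h, proj) for h in hidden_states]
    n = len(transformed)
    return [[squared_norm([a - b for a, b in
                           zip(transformed[i], transformed[j])])
             for j in range(n)]
            for i in range(n)]


# hidden_dim = 3, maximum_rank = 2: this projection keeps the first two
# dimensions and discards the third.
proj = [[1, 0],
        [0, 1],
        [0, 0]]
hidden = [[0, 0, 5], [1, 0, 5], [1, 2, 5]]

print(predicted_depths(hidden, proj))     # [0, 1, 5]
print(predicted_distances(hidden, proj))  # [[0, 1, 5], [1, 0, 4], [5, 4, 0]]
```

Because the quadratic form is built as the projection times its own transpose, it is positive semi-definite by construction, so predicted squared distances and depths can never be negative. A square, unconstrained parameter matrix (psd_parameters=False) gives no such guarantee.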

probe:
  task_signature: word_pair # word, word_pair
  task_name: parse-distance
  maximum_rank: 32
  psd_parameters: True
  diagonal: False
  params_path: predictor.params
probe_training:
  epochs: 30
  loss: L1
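
The L1 objective can be sketched as follows, based on the description in the paper: absolute differences between predicted and gold values are summed per sentence and normalized by the squared sentence length for distances, or by the sentence length for depths. The function names are hypothetical, and the classes in loss.py may differ in detail (for example in how padding is masked or how batches are averaged).

```python
def l1_distance_loss(predicted, gold):
    """L1 over all word pairs of one sentence, divided by length squared.

    `predicted` holds the probe's squared distances, `gold` the tree
    distances; both are n x n nested lists.
    """
    n = len(gold)
    total = sum(abs(predicted[i][j] - gold[i][j])
                for i in range(n) for j in range(n))
    return total / (n * n)


def l1_depth_loss(predicted, gold):
    """L1 over all words of one sentence, divided by sentence length."""
    n = len(gold)
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / n


gold_dist = [[0, 1, 2],
             [1, 0, 1],
             [2, 1, 0]]
pred_dist = [[0, 1, 1],
             [1, 0, 1],
             [1, 1, 0]]

print(l1_distance_loss(pred_dist, gold_dist))  # 2 / 9
print(l1_depth_loss([0, 1, 3], [0, 1, 2]))     # 1 / 3
```

Normalizing per sentence keeps long sentences, which have quadratically more word pairs, from dominating the distance loss.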

Reporting

root: The path to the directory in which a new subdirectory should be constructed for the results of this experiment.
observation_paths: The paths, relative to root, to which to write the observations formatted for quick reporting later on.
prediction_paths: The paths, relative to root, to which to write the predictions of the model.
reporting_methods: A list of strings specifying the methods to use to report and visualize results from the experiment. For parse-distance, the valid methods are spearmanr, uuas, write_predictions, and image_examples. When reporting uuas, some tikz-dependency examples are written to disk as well. For parse-depth, the valid methods are spearmanr, root_acc, write_predictions, and image_examples. Note that image_examples will be ignored for the test set.

reporting:
  root: example/results
  observation_paths:
    train_path: train.observations
    dev_path: dev.observations
    test_path: test.observations
  prediction_paths:
    train_path: train.predictions
    dev_path: dev.predictions
    test_path: test.predictions
  reporting_methods:
    - spearmanr
      #- image_examples
    - uuas
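
For intuition about the spearmanr method, the sketch below derives gold tree distances from a sentence's head indices and computes a Spearman rank correlation between one word's gold and predicted distances. It is a plain-Python illustration with hypothetical function names; how the repository groups and averages these correlations is described in the reporting docstrings, not here.

```python
from collections import deque


def tree_distances(head_indices):
    """Path lengths between all word pairs; heads are 1-based, 0 is root."""
    n = len(head_indices)
    neighbors = [[] for _ in range(n)]
    for child, head in enumerate(head_indices):
        if head != 0:
            neighbors[child].append(head - 1)
            neighbors[head - 1].append(child)
    distances = []
    for start in range(n):            # breadth-first search from every word
        dist = [None] * n
        dist[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if dist[nxt] is None:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        distances.append(dist)
    return distances


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the ranks (undefined for constant input)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


gold = tree_distances([2, 3, 0])           # "The chef cooked"
print(gold)                                # [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(spearman([0.1, 0.9, 2.5], gold[0]))  # 1.0
```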

Reporting + visualization

It can be time-consuming to make nice visualizations and make sense of the results from a probing experiment, so this repository does a bit of work for you. This section goes over each of the reporting methods available (under args['reporting']['reporting_methods'] in the experiment config), and examples of results.

spearmanr: This reporting method calculates the Spearman correlation between predicted (distances/depths) and true (distances/depths) as defined by gold parse trees. See the paper or reporting.py docstrings for specifics. With this option enabled, you'll see dev.spearmanr, a TSV with an average Spearman correlation for each sentence length represented in the dev set, as well as dev.spearmanr-5-50-mean, which averages the sentence-average values for all sentence lengths between 5 and 50 (inclusive.)
image_examples: This reporting method prints out true and predicted distance matrices as pngs for the first 20 examples in the split. These will be labeled dev-pred0.png, dev-gold0.png, etc.

uuas: This reporting method (only used by parse-distance tasks) will print the unlabeled undirected attachment score to dev.uuas, and write the first 20 development examples' minimum spanning trees (for both gold and predicted distance matrices) in a tikz-dependency LaTeX code format, to dev.tikz. Each sentence can be copy-pasted into a LaTeX doc for visualization.

root_acc: This reporting method (only used by parse-depth tasks) will print to dev.root_acc the percentage of sentences where the least-deep word in the gold tree (the root) is also the least-deep according to the predicted depths.
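
The uuas and root_acc ideas can be sketched in a few lines: build a minimum spanning tree over the predicted distance matrix, count how many gold edges it recovers, and check whether the shallowest predicted word is the gold root. This is an illustrative plain-Python version with hypothetical function names. It scores every token and applies no punctuation filtering, which the repository's own evaluation may do.

```python
def minimum_spanning_tree(distances):
    """Prim's algorithm; returns undirected edges as sorted (i, j) tuples."""
    n = len(distances)
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda edge: distances[edge[0]][edge[1]])
        edges.add((min(i, j), max(i, j)))
        in_tree.add(j)
    return edges


def uuas(predicted_distances, gold_edges):
    """Fraction of gold edges found in the predicted spanning tree."""
    predicted_edges = minimum_spanning_tree(predicted_distances)
    gold = {(min(i, j), max(i, j)) for i, j in gold_edges}
    return len(predicted_edges & gold) / len(gold)


def root_is_correct(predicted_depths, gold_depths):
    """True if the least-deep predicted word is the gold root."""
    predicted_root = min(range(len(predicted_depths)),
                         key=predicted_depths.__getitem__)
    gold_root = min(range(len(gold_depths)), key=gold_depths.__getitem__)
    return predicted_root == gold_root


pred = [[0.0, 1.0, 2.0, 4.0],
        [1.0, 0.0, 1.0, 3.0],
        [2.0, 1.0, 0.0, 3.5],
        [4.0, 3.0, 3.5, 0.0]]
gold_edges = [(0, 1), (1, 2), (2, 3)]

print(sorted(minimum_spanning_tree(pred)))  # [(0, 1), (1, 2), (1, 3)]
print(uuas(pred, gold_edges))               # 2 of 3 gold edges recovered
print(root_is_correct([1.3, 0.2, 2.1, 1.9], [1, 0, 1, 2]))  # True
```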

Replicating PTB Results for the NAACL'19 Paper

As usual with the PTB, a bit of work has to be done in prepping data (and you have to have the unadulterated PTB data already, not the mangled language modeling benchmark version.)

To replicate our results on the PTB, you'll have to prep some data files. The prep scripts will need to be modified to use paths on your system, but the process is as follows:

Have Stanford CoreNLP installed / on your java classpath, and have allennlp installed.
Convert the PTB constituency trees to Stanford Dependencies in conllx format, using the script scripts/convert_splits_to_depparse.sh. This will write a single conllx file for each of train/dev/test. (This uses CoreNLP.)
Convert the conllx files to sentence-per-line whitespace-tokenized files, using scripts/convert_conll_to_raw.py.
Use scripts/convert_raw_to_bert.py and scripts/convert_raw_to_elmo.sh to take the sentence-per-line whitespace-tokenized files and write BERT and ELMo vectors to disk in hdf5 format.
Replace the data paths (and choose a results path) in the yaml configs in example/config/naacl19/*/* with the paths that point to your conllx and .hdf5 files as constructed in the above steps. These 118 experiment files specify the configuration of all the experiments that end up in the paper.
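
The conllx-to-raw step above can be sketched as follows. This is a minimal illustration of the idea, not the contents of scripts/convert_conll_to_raw.py; the function name conll_to_raw and the assumption that the word form sits in the second tab-separated column are mine.

```python
def conll_to_raw(conll_lines, word_column=1):
    """Turn CoNLL token lines into one whitespace-tokenized string per sentence."""
    sentences, words = [], []
    for line in conll_lines:
        line = line.rstrip("\n")
        if not line.strip():              # blank line: sentence boundary
            if words:
                sentences.append(" ".join(words))
                words = []
            continue
        words.append(line.split("\t")[word_column])
    if words:                             # last sentence without trailing blank
        sentences.append(" ".join(words))
    return sentences


conll = [
    "1\tThe\tthe\tDT\tDT\t_\t2\tdet\t_\t_\n",
    "2\tchef\tchef\tNN\tNN\t_\t3\tnsubj\t_\t_\n",
    "3\tcooked\tcook\tVBD\tVBD\t_\t0\troot\t_\t_\n",
    "\n",
    "1\tHello\thello\tUH\tUH\t_\t0\troot\t_\t_\n",
]
for sentence in conll_to_raw(conll):
    print(sentence)
# The chef cooked
# Hello
```

Keeping exactly one sentence per line, in corpus order, is what lets the embedding files be keyed by sentence index later on.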

Experiments on new datasets or models

In the future I hope to streamline support for plugging in arbitrary PyTorch models in model.py, but because of subword models, tokenization, batching etc. this is beyond my current scope.

Right now, the official way to run experiments on new datasets and representation learners is:

Have a conllx file for the train, dev, and test splits of your dataset.
Write contextual word representations to disk for each of the train, dev, and test splits in hdf5 format, where the index of the sentence in the conllx file is the key to the hdf5 dataset object. That is, your dataset file should look a bit like {'0': <np.ndarray(size=(1,SEQLEN1,FEATURE_COUNT))>, '1':<np.ndarray(size=(1,SEQLEN2,FEATURE_COUNT))>...}, etc. Note here that SEQLEN for each sentence must be the number of tokens in the sentence as specified by the conllx file.
Edit a config file from example/config to match the paths to your data, as well as the hidden dimension and labels for the columns in the conllx file. Look at the experiment config section of this README for more information therein. One potential gotcha is that you must have an xpos_sentence field in your conllx (as labeled by your yaml config) since this will be used at evaluation time.
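
Before launching an experiment on your own data, it can help to check that the embedding file lines up with the corpus. The sketch below takes a mapping from hdf5 key to array shape and the token count of each sentence, and reports mismatches. The function name check_embedding_shapes is hypothetical, and the check applies to token-level embeddings as described above (subword-level files are aligned by the repository at load time and will not match token counts).

```python
def check_embedding_shapes(shapes, sentence_lengths, feature_count):
    """Return a list of problems; an empty list means everything lines up.

    shapes:           {key: shape tuple}. With h5py this could be built as
                      {key: f[key].shape for key in f} for an open h5py.File f.
    sentence_lengths: token counts, in the order sentences appear in the corpus.
    feature_count:    expected size of the last axis (hidden_dim in the config).
    """
    problems = []
    for index, length in enumerate(sentence_lengths):
        key = str(index)                  # keys are sentence indices as strings
        if key not in shapes:
            problems.append(f"sentence {index}: no dataset under key {key!r}")
            continue
        shape = tuple(shapes[key])
        if len(shape) != 3 or shape[1] != length or shape[2] != feature_count:
            problems.append(
                f"sentence {index}: shape {shape}, expected "
                f"(layers, {length}, {feature_count})")
    return problems


shapes = {"0": (1, 3, 1024), "1": (1, 5, 1024)}
print(check_embedding_shapes(shapes, [3, 5], 1024))  # []
for problem in check_embedding_shapes(shapes, [3, 4, 7], 1024):
    print(problem)
```

A length mismatch caught here is usually easier to diagnose than a tensor-size error raised later inside the loss.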

Citation

If you use this repository, please cite:

  @InProceedings{hewitt2019structural,
    author =      "Hewitt, John and Manning, Christopher D.",
    title =       "A Structural Probe for Finding Syntax in Word Representations",
    booktitle =   "North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    year =        "2019",
    publisher =   "Association for Computational Linguistics",
    location =    "Minneapolis, USA",
  }

Comments

The range of BERT-disk layers

Hi, I have been trying to make a structural probe for BERT-large's 16th layer. However, I am facing a problem that the valid range of layers is (0-2). I have no idea why the range is not correct.

P.S Could you please check my configs?

run_experiment.py:238: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  yaml_args= yaml.load(open(cli_args.experiment_config))
Constructing new results directory at /Users/nurikkozahmet/Desktop/structural-probes/example/results/BERT-disk-parse-distance-2020-4-2-23-53-17-192645/
Loading BERT Pretrained Embeddings from /Users/nurikkozahmet/Desktop/structural-probes/example/data/en_ewt-ud-sample/en_ewt-ud-train.elmo-layers.hdf5; using layer 15
The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
Using BERT-large-cased tokenizer to align embeddings with PTB tokens
[aligning embeddings]:   0%|                                                                                           | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_experiment.py", line 242, in <module>
    execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
  File "run_experiment.py", line 170, in execute_experiment
    expt_dataset = dataset_class(args, task)
  File "/Users/nurikkozahmet/Desktop/structural-probes/structural-probes/data.py", line 34, in __init__
    self.train_obs, self.dev_obs, self.test_obs = self.read_from_disk()
  File "/Users/nurikkozahmet/Desktop/structural-probes/structural-probes/data.py", line 65, in read_from_disk
    train_observations = self.optionally_add_embeddings(train_observations, train_embeddings_path)
  File "/Users/nurikkozahmet/Desktop/structural-probes/structural-probes/data.py", line 407, in optionally_add_embeddings
    embeddings = self.generate_subword_embeddings_from_hdf5(observations, pretrained_embeddings_path, layer_index)
  File "/Users/nurikkozahmet/Desktop/structural-probes/structural-probes/data.py", line 393, in generate_subword_embeddings_from_hdf5
    single_layer_features = feature_stack[elmo_layer]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/anaconda3/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 553, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/anaconda3/lib/python3.7/site-packages/h5py/_hl/selections.py", line 94, in select
    sel[args]
  File "/anaconda3/lib/python3.7/site-packages/h5py/_hl/selections.py", line 261, in __getitem__
    start, count, step, scalar = _handle_simple(self.shape,args)
  File "/anaconda3/lib/python3.7/site-packages/h5py/_hl/selections.py", line 466, in _handle_simple
    x,y,z = _translate_int(int(arg), length)
  File "/anaconda3/lib/python3.7/site-packages/h5py/_hl/selections.py", line 486, in _translate_int
    raise ValueError("Index (%s) out of range (0-%s)" % (exp, length-1))
ValueError: Index (15) out of range (0-2)

configs.txt

opened by nuradilK 5

Error during new experiment

I tried to duplicate the experiment using a different training data, but run_experiment.py reports the following error:

Constructing new results directory at /content/drive/My Drive/structural-probes/example/results_it/BERT-disk-parse-depth-2020-2-10-9-22-7-778090/
Loading BERT Pretrained Embeddings from /content/drive/My Drive/structural-probes/scripts/rawbert_12layers_train.hdf5; using layer 12
The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
100% 213450/213450 [00:00<00:00, 1089299.05B/s]
Using BERT-base-cased tokenizer to align embeddings with PTB tokens
[aligning embeddings]:   0% 0/13121 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_experiment.py", line 242, in <module>
    execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
  File "run_experiment.py", line 170, in execute_experiment
    expt_dataset = dataset_class(args, task)
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 34, in __init__
    self.train_obs, self.dev_obs, self.test_obs = self.read_from_disk()
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 65, in read_from_disk
    train_observations = self.optionally_add_embeddings(train_observations, train_embeddings_path)
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 407, in optionally_add_embeddings
    embeddings = self.generate_subword_embeddings_from_hdf5(observations, pretrained_embeddings_path, layer_index)
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 393, in generate_subword_embeddings_from_hdf5
    single_layer_features = feature_stack[elmo_layer]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 476, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/selections.py", line 94, in select
    sel[args]
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/selections.py", line 261, in __getitem__
    start, count, step, scalar = _handle_simple(self.shape,args)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/selections.py", line 451, in _handle_simple
    x,y,z = _translate_int(int(arg), length)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/selections.py", line 471, in _translate_int
    raise ValueError("Index (%s) out of range (0-%s)" % (exp, length-1))
ValueError: Index (12) out of range (0-11)

Is it an error related to the training data ?

opened by nlpirate 4

TypeError: can't convert CUDA tensor to numpy

When running with CUDA and generating tables, the labels are trying to be implicitly converted to numpy using scipy. I can see that there is already a .detach().cpu().numpy() in regimen.py, but for some reason it appears that the labels are being converted back to CUDA. Calling .cpu() before handing off the labels to scipy fixes it for me.

Full trace below.

Traceback (most recent call last):
  File "structural-probes/run_experiment.py", line 242, in <module>
    execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
  File "structural-probes/run_experiment.py", line 182, in execute_experiment
    run_report_results(args, expt_probe, expt_dataset, expt_model, expt_loss, expt_reporter, expt_regimen)
  File "structural-probes/run_experiment.py", line 143, in run_report_results
    reporter(dev_predictions, dev_dataloader, 'dev')
  File "/home/hyper/Documents/repos/structural-probes/structural-probes/reporter.py", line 50, in __call__
    , dataloader, split_name)
  File "/home/hyper/Documents/repos/structural-probes/structural-probes/reporter.py", line 143, in report_image_examples
    ax = sns.heatmap(label)
  File "/home/hyper/Documents/anaconda3/envs/probe/lib/python3.7/site-packages/seaborn/matrix.py", line 517, in heatmap
    yticklabels, mask)
  File "/home/hyper/Documents/anaconda3/envs/probe/lib/python3.7/site-packages/seaborn/matrix.py", line 109, in __init__
    plot_data = np.asarray(data)
  File "/home/hyper/Documents/anaconda3/envs/probe/lib/python3.7/site-packages/numpy/core/numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/home/hyper/Documents/anaconda3/envs/probe/lib/python3.7/site-packages/torch/tensor.py", line 450, in __array__
    return self.numpy()
TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

opened by Hyperparticle 4

Error during parse-distance new experiment

Hi John, I run a parse-depth experiment on my own data and everything worked fine.

When I tried to run a parse-distance experiment using the same hdf5 files previously used, the runtime reports the following error:

Traceback (most recent call last):
  File "/content/drive/My Drive/structural-probes/structural-probes/run_experiment.py", line 242, in <module>
    execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
  File "/content/drive/My Drive/structural-probes/structural-probes/run_experiment.py", line 179, in execute_experiment
    run_train_probe(args, expt_probe, expt_dataset, expt_model, expt_loss, expt_reporter, expt_regimen)
  File "/content/drive/My Drive/structural-probes/structural-probes/run_experiment.py", line 128, in run_train_probe
    dataset.get_train_dataloader(), dataset.get_dev_dataloader())
  File "/content/drive/My Drive/structural-probes/structural-probes/regimen.py", line 65, in train_until_convergence
    batch_loss, count = loss(predictions, label_batch, length_batch)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/structural-probes/structural-probes/loss.py", line 31, in forward
    predictions_masked = predictions * labels_1s
RuntimeError: The size of tensor a (20) must match the size of tensor b (42) at non-singleton dimension 1

I cannot figure out how or where this misalignment happens... Do you have any clue?

Thanks a lot

opened by siberio76 2

Deprecation warnings in new h5py, etc.

Seems like the new version of h5py (and other things?) want a few data transformation things to be done differently; current ops used are deprecated. Doesn't seem like anything is affected yet; worth fixing soon.

opened by john-hewitt 1

add fixes for torch-1.0.0, new h5py

deprecation warning/error fixes for new dependency versions

Looks like in the torch-1.0.x, the labels tensor is in CUDA mem by the time it hits reporter.py, leading to problems when scipy forces a conversion of the tensor to numpy. Brought to light in #4; this PR should fix that issue.

Further, h5py's preferred syntax for accessing data values no longer uses .value; using it was emitting a warning for each observation.

Finally, a sum call that should have been a torch.sum call was requiring a further wrapping in torch.tensor, which is a mess and was spitting errors; this has been fixed.

opened by john-hewitt 0

segment type ids should be zeros instead of ones (minor update suggestion)

Just for the record, in case someone finds it useful or plans to extend it.

In line 48 of https://github.com/john-hewitt/structural-probes/blob/4c2e265d6bd071e6ab380fd9806e4c6a128b5e97/scripts/convert_raw_to_bert.py#L48, segment type ids should be zeros, not ones as implemented there (sentence A = 0, sentence B = 1). I believe this will not make much difference anyway. Moreover, in the new huggingface library's API this parameter can be omitted and the library creates it automatically, as seen in the code below.

In case someone finds it useful, I also updated the code to a version compatible with the updated library (transformers)

The relevant lines that changed are these (some lines ignored for clarity):

import numpy as np
import torch
from transformers import BertTokenizerFast, BertModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizerFast.from_pretrained(...)
model = BertModel.from_pretrained(....)
LAYER_COUNT = 12+1 # 24+1 for bert-large
FEATURE_COUNT = 768 # 1024 for bert-large
model.eval()

# tokenize text, preserving PTB tokenized words
indexed_tokens = tokenizer._batch_encode_plus(line.split(), add_special_tokens=False, return_token_type_ids=False, return_attention_mask=False)
indexed_tokens = [item for sublist in indexed_tokens['input_ids'] for item in sublist]
indexed_tokens = tokenizer.build_inputs_with_special_tokens(indexed_tokens) # Add [CLS] and [SEP]

# Build batch and run the model
tokens_tensor = torch.tensor([indexed_tokens])
with torch.no_grad():
    encoded_layers = model(input_ids=tokens_tensor, output_hidden_states=True)['hidden_states']

# Notice that index and fout come from the loop in the original code, omitted here for clarity
dset = fout.create_dataset(str(index), (LAYER_COUNT, len(indexed_tokens), FEATURE_COUNT))
dset[:,:,:] = np.vstack([np.array(x) for x in encoded_layers])

opened by caspillaga 0

AssertionError appears when trying to align embeddings

I am trying to run an experiment on a new dataset. I followed your instructions, but every time I run this code, an 'AssertionError' appears. I don't know whether it is an issue or not; nevertheless, could you check my yaml file? I am working in Google Colab. bert_exam (my yaml file) is located here: https://github.com/Awkwafina/files/blob/master/bert_exam

opened by Awkwafina 6

WordPiece tokenizer vs Full Tokenizer

I noticed that in data.py for BERT you use wordpiece tokenizer instead of full tokenizer. I tried switching to full tokenizer, but strangely it gave me much worse results. I suspect there could be a bug somewhere when you're aligning the embeddings, because I then calculated the average myself when I'm calculating the embeddings, and the drop in performance got fixed (i.e. when I create the embedding file, it is word-level already, so I don't have to rely on your code to calculate the average).

By the way, I think it would be nice if you could mention in the README.md file that the ELMO config files allow you to use word-level (instead of subword-level) embeddings. This is good because if you pre-calculate these (save word-level embeddings instead of subword-level embeddings) when you prepare your BERT embedding file and just use the ELMO config files, you don't have to waste time aligning embeddings every time you start a new experiment.

Also, I think a better workflow may be generating one embedding file for each BERT layer instead of generating one big file with all the layers and have many config files each specifying a different layer index. This is because I tried the original workflow on a bigger dataset (Czech UD treebank) and ran into out of memory issues. The original workflow also makes preparing the data for training really slow.

opened by chiehminwei 2

Enable easy swapping of PyTorch models

Right now, to test a new representation learner, one must:

Use the representation learner to write hidden state vectors for each token (or subword) to disk. (Better idea for subword models: decide how to combine subword representations; write resultant token embeddings to disk)

Run structural probe code on the hidden states as saved to disk.

This is "nice" in that the hidden states don't need to be recomputed at each pass (BERT is big and slow; I actually run most experiments on CPUs because probe training is so fast and CPUs are so plentiful).

However, it's "not nice" that one can't swap representation model parameters on the fly, and especially that the huge vectors take up a lot of disk space (115GB for BERT-large on the PTB WSJ train set, 40k sentences).

We'd like to enable easy swapping of new models by defining a new class in model.py. We'll need to read in the tokenizer (and perhaps the subword tokenizer) so that we pass the model words as identified by its own vocabulary and can map from subword representations back to token representations. There's also the inefficiency of aligning subword representations to token representations at every batch.
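The mapping from subword representations back to tokens could be recorded while tokenizing, along the lines of the sketch below. This is only an illustration under stated assumptions: `align_words_to_subwords` and `toy_tokenize` are hypothetical names, the toy tokenizer merely splits words into fixed-width chunks as a stand-in for a real subword tokenizer, and special tokens such as `[CLS]` and `[SEP]` are not handled.

```python
def align_words_to_subwords(words, subword_tokenize):
    """Return the flat subword list and one half-open (start, end) span per word."""
    subwords, spans = [], []
    for word in words:
        pieces = subword_tokenize(word)
        if not pieces:
            raise ValueError(f"tokenizer produced no pieces for {word!r}")
        start = len(subwords)
        subwords.extend(pieces)
        spans.append((start, len(subwords)))
    return subwords, spans

def toy_tokenize(word, max_len=4):
    """Hypothetical stand-in for a real subword tokenizer: fixed-width chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]

words = ["The", "chef", "overcooked", "it"]
subwords, spans = align_words_to_subwords(words, toy_tokenize)
print(subwords)  # ['The', 'chef', 'over', 'cook', 'ed', 'it']
print(spans)     # [(0, 1), (1, 2), (2, 5), (5, 6)]
```

Computing the spans once per sentence and caching them alongside the data would avoid redoing the alignment at every batch, which is the inefficiency mentioned above.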
enhancement
opened by john-hewitt 0

Owner

John Hewitt

I'm a PhD student working on: NLP, structure, graphs, bash scripts, RNNs, multilinguality, and teaching others to do the same.


Learning hidden low dimensional dynamics using a Generalized Onsager Principle and neural networks

OnsagerNet Learning hidden low dimensional dynamics using a Generalized Onsager Principle and neural networks This is the original pyTorch implemenati

3 Aug 24, 2022

Code for the paper "Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks"

ON-LSTM This repository contains the code used for word-level language model and unsupervised parsing experiments in Ordered Neurons: Integrating Tree

572 Nov 21, 2022

Drslmarkov - Distributionally Robust Structure Learning for Discrete Pairwise Markov Networks

Distributionally Robust Structure Learning for Discrete Pairwise Markov Networks

1 Nov 24, 2022

Generative Flow Networks for Discrete Probabilistic Modeling

Energy-based GFlowNets Code for Generative Flow Networks for Discrete Probabilistic Modeling by Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra Vo

51 Dec 20, 2022

Interpretation of T cell states using reference single-cell atlases

Interpretation of T cell states using reference single-cell atlases ProjecTILs is a computational method to project scRNA-seq data into reference sing

139 Jan 3, 2023

Face Mask Detection is a project to determine whether someone is wearing mask or not, using deep neural network.

face-mask-detection Face Mask Detection is a project to determine whether someone is wearing mask or not, using deep neural network. It contains 3 scr

13 Jan 18, 2022

Vector AI — A platform for building vector based applications. Encode, query and analyse data using vectors.

Vector AI is a framework designed to make the process of building production grade vector based applications as quickly and easily as possible. Create

267 Dec 23, 2022

Encode and decode text application

Text Encoder and Decoder Encode and decode text in many ways using this application! Encode in: ASCII85 Base85 Base64 Base32 Base16 Url MD5 Hash SHA-1

1 Feb 12, 2022

Simple codebase for flexible neural net training

neural-modular Simple codebase for flexible neural net training. Allows for seamless exchange of models, dataset, and optimizers. Uses hydra for confi

7 Apr 5, 2022

PyTorch package for the discrete VAE used for DALL·E.

Overview [Blog] [Paper] [Model Card] [Usage] This is the official PyTorch package for the discrete VAE used for DALL·E. Installation Before running th

9.5k Jan 5, 2023

DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation

DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation This project hosts the code for implementing the DCT-MASK algorithms

57 Nov 27, 2022

Official codes for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

ResDAVEnet-VQ Official PyTorch implementation of Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech What is in this repo? M

21 Aug 23, 2022

This is 2nd term discrete maths project done by UCU students that uses backtracking to solve various problems.

Backtracking Project Sponsors This is a project made by UCU students: Olha Liuba - crossword solver implementation Hanna Yershova - sudoku solver impl

4 Oct 17, 2021

An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations Implementation of the method described in the Speech Resynthesis from Di

253 Jan 6, 2023

Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations Implementation of the method described in the Speech Resynthesis from Di

4 Mar 11, 2022
