Running Experiments
After installing our library, experiments are run with run_exp TASK_NAME PARAMS. The TASK_NAME sets the data and model to use, while PARAMS adjusts model and training settings as well as logging.
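For example, a quick smoke-test run might look like the following; the flag values here are illustrative only, and the commands actually used for the paper experiments are listed under "Commands Used" below:
# illustrative values only -- not the settings used in the paper experiments
run_exp "lm_base" --n_epochs=1 --bs=2 --max_seq_len=1024 --do_wandb_logging=False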
Experiment Task Names
When running the experiment script, each experiment type has its own "task", which defines the data and model to be used as well as the training schedule. Most model settings are hard-coded in configs to mirror the configurations used in the Reformer paper. All model and training hyperparameters used in training can be found in Experiments/Configs. Depending on the task, a limited number of configs can be changed, such as seq_len or n_layers.
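For example, assuming the task exposes a given setting, it can be overridden directly on the command line while everything else falls back to the task's config (a sketch; the value 3 is illustrative):
# illustrative: override only the layer count for the n_layers task
run_exp "n_layers" --n_layers=3 --do_wandb_logging=False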
synt: Language Modelling with Synthetic Data
lm_XX: The lm_base argument will train a baseline TransformerLM, lm_rev a ReversibleLM, and lm_shared_qk a baseline Transformer with shared query-key attention. All training is on the enwik8 dataset.
n_hashes: Trains an LSH-LM on the enwik8 data (used for the hashing-rounds experiment in section 4.6)
n_layers: Trains an LSH-LM on the enwik8 data (used for the deep-model experiment in section 4.8)
wmt_XX: The wmt_base argument will train a classic Transformer on the WMT-14 dataset, while wmt_rev will train a ReversibleTransformer on the same dataset
Commands Used
Language Model Experiments with enwik8
Section 4.2 - LSH attention analysis on synthetic task
run_exp "synt" --n_epochs=750 --bs=64 --save_model=True --seed=123 --do_wandb_logging=False --n_hashes=4
Section 4.3 & 4.4 - Effect of sharing QK & Effect of reversible layers
The language modelling experiments outlined in sections 4.3 and 4.4 of the paper were run with the following command. The lm_base argument was passed to train a baseline TransformerLM, lm_rev to train a ReversibleLM, and lm_shared_qk for a baseline Transformer with shared query-key attention.
run_exp "lm_base" --n_epochs=10 --bs=1 --max_seq_len=4096 --grad_accum=8 --save_model=True --clip=1.0 --seed=42 --do_wandb_logging=False
Section 4.5 - Reversible Transformer on translation task experiment
ReversibleTransformer on WMT-14:
run_exp "wmt_rev" --lr=1e-4 --n_epochs=2 --bs=64 --n_layers=6 --max_seq_len=256 --do_wandb_logging=False --save_model=True --clip=1.0 --seed=8230 --precision=2
Run the above with task "wmt_base" to train a baseline Transformer.
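Spelled out for convenience, the baseline command is identical apart from the task name:
run_exp "wmt_base" --lr=1e-4 --n_epochs=2 --bs=64 --n_layers=6 --max_seq_len=256 --do_wandb_logging=False --save_model=True --clip=1.0 --seed=8230 --precision=2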
Section 4.6 - Effect of number of hashing rounds on the performance
run_exp "n_hashes" --n_hashes=2 --n_epochs=10 --bs=8 --max_seq_len=4096 --do_wandb_logging=True --wandb_group='n_hashes' --wandb_notes='performance as function of n_hashes (2)' --wandb_tags='lm exp lsh nhashes' --grad_accum=8 --clip=1.0 --seed=2
Section 4.7 - LSH attention evaluation speed
The LSH-LM evaluation speed experiment used the same functions as the script but was carried out in the "LSH evaluation speed" notebook here.
Section 4.8 - Deep Reformer models
run_exp "n_layers" --n_layers=6 --n_epochs=8 --bs=2 --max_seq_len=16384 --do_wandb_logging=True --wandb_group='n_layers' --wandb_notes='performance as function of n_layers (6)' --wandb_tags='lm exp lsh nlayers' --grad_accum=8 --clip=1.0 --seed=48 --save_model=True