As an applications development company, we have been asking ourselves what real problems we are actually solving, boiling every product down to the core problem it addresses. It took a lot of first-principles thinking, but we mapped out a timeline of increasing model capabilities. One of those capabilities is context length.
Developers around the world have been working around context length limits programmatically for the past nine months. But all of us saw it coming, because this is an obvious need users have from applications, or, taken further, the ability to ask the model an arbitrarily long question.
Here are some key points about the LONGNET paper:
It introduces a new attention mechanism called "dilated attention" that allows Transformers to scale to very long sequence lengths of over 1 billion tokens.
Dilated attention reduces the quadratic computation complexity of standard Transformer attention to linear complexity. It does this by sparsifying attention using exponentially increasing dilation rates (see the sketch after these points).
With dilated attention, the paper shows LONGNET can model sequences of up to 1 billion tokens with almost constant runtime, overcoming the limitations of previous methods.
LONGNET achieves strong performance on language modeling benchmarks, outperforming sparse and dense Transformer baselines on both long and short sequences.
The linear complexity allows LONGNET to be parallelized across GPUs/nodes for distributed training of extremely long sequences. This enables scaling to 1 billion tokens.
Experiments show LONGNET follows similar scaling laws to standard Transformers when increasing model size or amount of training data.
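To make the mechanism a bit more concrete, here is a rough single-head NumPy sketch of one way dilated attention can be realised. This is my own simplification, not the authors' code: each branch splits the sequence into segments, keeps every r-th token within a segment, runs ordinary attention over the kept tokens, and scatters the results back. The paper additionally shifts the selected positions across attention heads and combines branches with a weighted sum, whereas this sketch uses a plain average.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention_branch(q, k, v, segment_len, dilation):
    """One (segment length, dilation) branch of dilated attention.

    The sequence is split into segments of length `segment_len`; within
    each segment only every `dilation`-th token is kept, and those tokens
    attend to each other with standard scaled dot-product attention.
    Positions skipped by this branch stay zero here; the paper covers
    them by shifting the selection offset across attention heads.
    """
    n, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, n, segment_len):
        idx = np.arange(start, min(start + segment_len, n))[::dilation]
        qs, ks, vs = q[idx], k[idx], v[idx]
        scores = softmax(qs @ ks.T / np.sqrt(d))
        out[idx] = scores @ vs
    return out

# Toy usage: mix branches with geometrically growing segment lengths and
# dilation rates. The plain average below is a simplification of the
# paper's weighted combination of branch outputs.
n, d = 64, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
branches = [(8, 1), (16, 2), (32, 4)]  # (segment_len, dilation) pairs
out = sum(dilated_attention_branch(q, k, v, w, r) for w, r in branches) / len(branches)
print(out.shape)  # (64, 16)
```

Because each branch only attends within small sparsified segments, the cost grows linearly with sequence length instead of quadratically, which is what makes the billion-token scaling and the distributed training across GPUs/nodes feasible.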
The dilated attention mechanism also allows LONGNET to take variable-length context as input during training and inference. So in theory, it can handle inputs of any length, as long as the model capacity is large enough.
Let's do some calculations:
1 billion tokens is a huge amount of text. For reference, War and Peace is roughly 560,000 words or 3 million characters.
On an A4 page with standard 12 pt font and double line spacing, there are typically around 500 words per page.
So 1 billion tokens would be around 2 million A4 pages!
More precisely:
1 billion tokens ÷ 500 words per A4 page = 2,000,000 A4 pages (treating one token as roughly one word)
If we instead estimate 5 characters per token on average, then:
1 billion tokens × 5 characters per token = 5 billion characters
An A4 page in 12 pt double spaced font has around 3,000 characters.
So 5 billion characters / 3,000 chars per page = 1,666,667 A4 pages
So in summary, 1 billion tokens works out to somewhere between 1.7 and 2 million A4 pages if printed out in a typical document format. It's an astonishingly large amount of text, equivalent to roughly 1,700 to 1,800 copies of War and Peace! Processing and modelling sequences of this length was unheard of before techniques like the dilated attention in LONGNET.
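A quick script to reproduce the arithmetic above; the words-per-page, characters-per-page, and characters-per-token figures are the same rough assumptions used in the text:

```python
# Back-of-envelope check of the page estimates above.
TOKENS = 1_000_000_000
WORDS_PER_PAGE = 500        # A4, 12 pt, double spaced (rough assumption)
CHARS_PER_PAGE = 3_000      # same page format (rough assumption)
CHARS_PER_TOKEN = 5         # crude average

pages_by_words = TOKENS / WORDS_PER_PAGE                    # treats 1 token ~ 1 word
pages_by_chars = TOKENS * CHARS_PER_TOKEN / CHARS_PER_PAGE

print(f"{pages_by_words:,.0f} pages (word-based)")          # 2,000,000
print(f"{pages_by_chars:,.0f} pages (character-based)")     # 1,666,667

# War and Peace comparison (~560,000 words)
print(f"{TOKENS / 560_000:,.0f} copies of War and Peace")   # ~1,786
```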
Let's see what's next to be disrupted.
Link to paper - https://arxiv.org/pdf/2307.02486.pdf