About This Document
- sl:arxiv_author :
- sl:arxiv_firstAuthor : Lili Yu
- sl:arxiv_num : 2305.07185
- sl:arxiv_published : 2023-05-12T00:55:41Z
- sl:arxiv_summary : Autoregressive transformers are spectacular models for short sequences but
scale poorly to long sequences such as high-resolution images, podcasts, code,
or books. We propose Megabyte, a multi-scale decoder architecture that enables
end-to-end differentiable modeling of sequences of over one million bytes.
Megabyte segments sequences into patches and uses a local submodel within
patches and a global model between patches. This enables sub-quadratic
self-attention, much larger feedforward layers for the same compute, and
improved parallelism during decoding -- unlocking better performance at reduced
cost for both training and generation. Extensive experiments show that Megabyte
allows byte-level models to perform competitively with subword models on long
context language modeling, achieve state-of-the-art density estimation on
ImageNet, and model audio from raw files. Together, these results establish the
viability of tokenization-free autoregressive sequence modeling at scale.@en
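The abstract's core mechanism, segmenting the byte sequence into patches, running a global model across patches and a local submodel within each patch, can be illustrated with a minimal sketch. This is not the paper's implementation: the patch size, the stand-in "models" (a running sum for global context, a trivial previous-byte guess locally), and all function names are hypothetical placeholders that only show the data flow.

```python
# Illustrative sketch of Megabyte-style patch decoding (NOT the paper's code):
# a global model produces one causal context per patch, and a local model
# predicts bytes within each patch given that context. Model internals here
# are trivial stand-ins chosen only to make the control flow concrete.

PATCH_SIZE = 4  # hypothetical patch length P


def segment(byte_seq, patch_size=PATCH_SIZE):
    """Split a byte sequence into fixed-size patches, zero-padding the tail."""
    pad = (-len(byte_seq)) % patch_size
    padded = byte_seq + bytes(pad)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def global_model(patches):
    """Stand-in global model: one context value per patch, computed causally
    from the *previous* patches only (here, a running byte sum)."""
    contexts, running = [], 0
    for patch in patches:
        contexts.append(running)  # context sees only earlier patches
        running += sum(patch)
    return contexts


def local_model(context, patch):
    """Stand-in local model: predicts each byte from the patch context plus
    bytes already seen inside the patch (here, a copy-previous-byte guess)."""
    preds, prev = [], context % 256
    for b in patch:
        preds.append(prev)
        prev = b
    return preds


seq = b"megabyte"
patches = segment(seq)                  # [b"mega", b"byte"]
contexts = global_model(patches)        # one causal context per patch
predictions = [local_model(c, p) for c, p in zip(contexts, patches)]
```

Because the expensive global model runs once per patch rather than once per byte, attention cost over patches is sub-quadratic in the byte length, which is the efficiency argument the abstract makes.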
- sl:arxiv_title : MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers@en
- sl:arxiv_updated : 2023-05-19T21:09:11Z
- sl:bookmarkOf : https://arxiv.org/abs/2305.07185
- sl:creationDate : 2023-07-01
- sl:creationTime : 2023-07-01T09:10:39Z