DistilBERT is a distilled version of BERT that retains about 97% of BERT's language-understanding performance while using roughly 40% fewer parameters and running about 60% faster at inference. Unlike BERT, it does not have token-type embeddings.
DistilBERT uses a technique called knowledge distillation, in which a smaller "student" network is trained to reproduce the behaviour of Google's larger BERT "teacher" network. The idea is to create compact models that can then be fine-tuned for specific tasks, and to:
- Leverage the inductive biases learned by larger models during pretraining
- Produce models with faster inference speed
- Reduce model size through knowledge distillation
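The core of knowledge distillation can be sketched with the standard soft-target loss: the student is trained to match the teacher's temperature-softened output distribution via a KL-divergence term scaled by T². Below is a minimal NumPy sketch of that loss; the logit values are hypothetical and stand in for the outputs of a real teacher/student pair.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields softer distributions.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL divergence between the temperature-softened teacher and student
    distributions, scaled by T**2 so gradients stay comparable across T."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student's soft predictions
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (T ** 2) * kl.mean()

# Hypothetical logits for one example over three classes.
teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[3.5, 1.2, 0.4]])
print(distillation_loss(teacher, student))
```

In DistilBERT's actual training, this distillation loss over the masked-language-model outputs is combined with the usual masked-language-model loss and a cosine-embedding loss on the hidden states.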