DistilBERT is a distilled version of BERT that retains about 97% of BERT's language-understanding performance while using roughly 40% fewer parameters and running about 60% faster. Unlike BERT, it has no token-type embeddings.

DistilBERT uses a technique called knowledge distillation, in which a smaller "student" network is trained to reproduce the behaviour of a larger "teacher" network (here, Google's BERT). The result is a compact general-purpose model that can then be fine-tuned for specific downstream tasks.
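To make the idea concrete, here is a minimal sketch of a distillation loss in plain Python: the teacher's logits are softened with a temperature, and the student is penalised by the cross-entropy between its distribution and the teacher's soft targets. The function names, temperature value, and example logits are illustrative assumptions, not the actual DistilBERT training code (which also combines this loss with language-modeling and cosine-embedding terms).

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: a higher temperature softens the
    # distribution, exposing the teacher's "dark knowledge" about
    # relative probabilities of non-top classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and the
    # student's: minimising it pushes the student to mimic the teacher.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Illustrative logits (hypothetical values).
teacher = [3.0, 1.0, 0.2]
student_close = [2.8, 1.1, 0.3]   # roughly agrees with the teacher
student_far = [0.1, 0.2, 3.0]     # disagrees with the teacher

print(distillation_loss(teacher, student_close)
      < distillation_loss(teacher, student_far))  # the closer student scores a lower loss
```

In the actual paper this soft-target loss is one term of the training objective; the sketch only shows why matching softened teacher outputs transfers more information than matching hard labels alone.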

Project Background

  • Project: DistilBERT
  • Authors: Victor Sanh, Julien Chaumond, and Thomas Wolf
  • Initial Release: 2019
  • Type: Transfer Learning
  • GitHub: /distilbert.rst (53.3k stars, 9 contributors)
  • Twitter: None

Key Features

  • Leverages the inductive biases learned by larger models during pretraining
  • Offers faster inference than the larger teacher model
  • Uses knowledge distillation to reduce model size