Show HN: Model2Vec: make sentence transformers 500x faster on CPU, 15x smaller

github.com

9 points by stephantul 3 days ago

Hi HN!

We (Thomas and Stéphan, hello!) recently released Model2Vec, a Python library for distilling any sentence transformer into a small set of static embeddings. This makes inference with such a model up to 500x faster and reduces model size by a factor of 15 (7.5M parameters, or 15/30 MB on disk depending on whether you use float16 or float32). This lets you embed 50-100k documents per second on the CPU of a MacBook.
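
To give a sense of the API, encoding with a distilled model looks roughly like this (a minimal sketch based on the README; the model name is one of ours on the Hugging Face Hub, and the sentences are just placeholders):

    from model2vec import StaticModel

    # Load a distilled Model2Vec model: static embeddings, no transformer forward pass
    model = StaticModel.from_pretrained("minishlab/M2V_base_output")

    # Encoding is a token-embedding lookup plus mean pooling, which is why it runs
    # at tens of thousands of documents per second on a CPU
    embeddings = model.encode([
        "It's dangerous to go alone!",
        "Static embeddings are fast.",
    ])
    print(embeddings.shape)  # (2, embedding dimension)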

This reduction of course comes at a cost: distilled models are worse than their parent models. Even so, they are a lot better than large sets of conventional static embeddings, such as GloVe or word2vec-based models, which are many times larger. In addition, the performance gap between a Model2Vec model and a sentence transformer ends up being smaller than you might expect; see https://github.com/MinishLab/model2vec/tree/main?tab=readme-... for results. Fitting a Model2Vec model does not require any training data, just a sentence transformer and, optionally, a frequency-sorted vocabulary, which makes it easy to drop into whatever workflow you have lying around.
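
Distillation itself is a one-liner, roughly like the sketch below (based on the current README; the parent model is just an example, and the vocabulary argument is optional):

    from model2vec.distill import distill

    # Distill a sentence transformer into static embeddings; no training data needed
    m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256)

    # Optionally pass your own frequency-sorted vocabulary (most frequent token first):
    # m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=my_vocabulary)

    m2v_model.save_pretrained("m2v_model")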

We wrote this library because we had each gotten a bit frustrated with the lack of options when you need extremely fast CPU inference that still works well. If MiniLM isn't fast enough and you don't have access to a GPU, you're often stuck with BPEmb, which is not flexible, or with training your own GloVe/word2vec models, which requires lots of data. Model2Vec solves all of these problems and works better than specialized static embeddings trained on huge corpora. We spent a lot of time thinking about how the library could be easy to use and integrate into common workflows.

Besides tight integration with Hugging Face, we also have an upcoming Sentence Transformers integration, releasing this week. This will let you distill and use Model2Vec models directly in any library that supports Sentence Transformers, which means out-of-the-box support for LlamaIndex, scikit-learn (via embetter), LangChain, and many other frameworks.

Please let us know what you think; we're very interested in your feedback. We're already using Model2Vec in our own projects and ultimately built it because we needed it ourselves, but we'd be happy to hear about interesting use cases or questions.

Have a nice day!

tuanmount2 3 days ago

I saw another repo use weights from Llama 3 to construct an embedding model. I think the use case will be to use this small model for search and a bigger model to re-rank later. So the question is how this approach compares to BM25.

  • stephantul 3 days ago

    Hey, you are thinking of WordLlama. We explicitly compare to WordLlama in the results. Feel free to check it out.

    Your point about BM25 is very interesting. We can run some comparisons! Thanks
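
    For anyone who wants to play with this in the meantime, a rough comparison could look something like the sketch below (using rank_bm25 for the BM25 side; the corpus, query, and model name are placeholders):

        import numpy as np
        from model2vec import StaticModel
        from rank_bm25 import BM25Okapi

        corpus = [
            "static embeddings are fast",
            "bm25 ranks documents by lexical overlap",
            "a reranker refines a candidate list",
        ]
        query = "fast embeddings"

        # BM25: score documents by lexical overlap with the query
        bm25 = BM25Okapi([doc.split() for doc in corpus])
        bm25_scores = bm25.get_scores(query.split())

        # Model2Vec: score documents by cosine similarity of static embeddings
        model = StaticModel.from_pretrained("minishlab/M2V_base_output")
        doc_emb = model.encode(corpus)
        query_emb = model.encode([query])
        doc_emb = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
        query_emb = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
        dense_scores = (doc_emb @ query_emb.T).ravel()

        print(np.argsort(-bm25_scores)[:3])   # BM25 ranking
        print(np.argsort(-dense_scores)[:3])  # Model2Vec ranking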