Search for Rust crates with Meili
Demo of the Meili instant search engine that exposes packages from crates.io.
Today, I will guide you through the depths of crates.io and show how I built an alternative search bar using our instant search engine: Meilisearch.
The Meili instant, relevant, and typo-tolerant search engine
Crates.io is the official website that stores Rust community crates (packages), and it's the place where the cargo package manager uploads, updates, and downloads them.
Sean Griffin is part of the crates.io team and maintains its current search engine along with the rest of the website. Kornel Lesinski built lib.rs, an alternative to crates.io, and uses Tantivy to power its search bar. To be honest, I prefer its color design, which is why I borrowed it for our search demo.
I decided to run our instant search engine and test its relevancy over time against these existing solutions. Our search engine uses completely different algorithms; it is based on prefix search and is typo-tolerant.
Meili is typo-tolerant and supports a whole lot of other features
So, I asked myself: Why not use our new instant search engine and make it useful to my beloved community? It would give us lots of feedback and probably some pull requests during the process.
At Meili, we manage an internal Kubernetes cluster, which is useful for hosting demos for clients. The Meilisearch server for this demo currently runs on a pod in this cluster.
To make Meilisearch exhibit those crates, we needed to find all the currently available packages on crates.io. Fortunately, this index is available on GitHub in the form of several subfolders with the names and versions of the packages, containing something like 32 000 files. A commit is made each time a crate is updated to a new version or a version is yanked.
So, I used the crates.io-index repository to initialize our newly created Meili search engine, but I needed more data first: the description, keywords, and categories of each of those crates. Again, the Rust crates.io team was here for us. I talked to Pietro Albini, and he pointed me to the servers that deliver package contents without rate limiting.
Now that we could retrieve useful data, I created an async crawler that downloads each package, extracts it, reads its Cargo.toml, and uploads the essential data to Meilisearch.
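To give an idea of what those subfolders look like, here is a minimal sketch that computes where a crate's file would live in the index. The function name `index_path` is hypothetical, and the layout rule (special folders for 1, 2, and 3 character names, then two levels of two-character prefixes, all lowercase) is my understanding of the registry index format rather than something taken from the demo's source.

```rust
/// Returns the relative path of a crate's file inside the index repository,
/// or `None` for names that cannot be valid crate names (empty or non-ASCII).
///
/// Assumed layout (hypothetical helper, not part of the demo):
///   1 char   -> 1/<name>
///   2 chars  -> 2/<name>
///   3 chars  -> 3/<first char>/<name>
///   4+ chars -> <chars 1-2>/<chars 3-4>/<name>
fn index_path(name: &str) -> Option<String> {
    // Crate names are ASCII, so slicing by byte offsets below is safe.
    if name.is_empty() || !name.is_ascii() {
        return None;
    }
    // Paths in the index use the lowercase form of the name.
    let name = name.to_ascii_lowercase();
    let path = match name.len() {
        1 => format!("1/{}", name),
        2 => format!("2/{}", name),
        3 => format!("3/{}/{}", &name[..1], name),
        _ => format!("{}/{}/{}", &name[..2], &name[2..4], name),
    };
    Some(path)
}
```

Under that assumption, `index_path("serde")` yields `se/rd/serde`, so walking the repository is enough to list every crate name without calling the crates.io API.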
The Meili dashboard interface showing raw documents
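The "retrieve the essential data" step can be pictured with a deliberately naive sketch. It only reads single-line `key = "value"` entries from the `[package]` table, ignores escapes, comments, and multi-line strings, and uses a hypothetical helper name `package_field`; a real crawler would rely on a proper TOML parser, and this is not the code the demo uses.

```rust
/// Returns the value of a single-line `key = "value"` entry found in the
/// `[package]` table of a Cargo.toml, or `None` if it is absent.
/// Naive on purpose: no escapes, no trailing comments, no multi-line strings.
fn package_field(manifest: &str, key: &str) -> Option<String> {
    let mut in_package = false;
    for line in manifest.lines() {
        let line = line.trim();
        // A table header switches the section we are currently reading.
        if line.starts_with('[') {
            in_package = line == "[package]";
            continue;
        }
        if !in_package {
            continue;
        }
        // Split at the first `=` only, so values may contain `=` themselves.
        if let Some((k, v)) = line.split_once('=') {
            let v = v.trim();
            let quoted = v.len() >= 2 && v.starts_with('"') && v.ends_with('"');
            if k.trim() == key && quoted {
                return Some(v[1..v.len() - 1].to_string());
            }
        }
    }
    None
}
```

Calling `package_field(manifest, "description")` on each extracted Cargo.toml would give the text that ends up in the `description` attribute of the document sent to the search engine.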
Meilisearch now understands that data and gives us instant, relevant, and typo-tolerant responses. But what about new crates? We want to be notified about new crates and be able to send those to Meilisearch.
Docs.rs is the official website that builds and stores the documentation of all the Rust packages hosted by crates.io. It diffs the crates.io index every minute to learn about new crate updates. Luckily, it provides an Atom feed of those updates.
This is where Heroku enters the game. Heroku offers something like 1000 free hours per month of computing power on their servers and provides schedulers. We can consume those credits to ask docs.rs about newly updated crates every 10 minutes: we fetch the Atom feed, download the updated packages as before, and finally update our search engine, live!
Fetching new crates takes us 4 seconds
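As an illustration of the feed-reading step, here is a small sketch that pulls the `<title>` of every `<entry>` out of an Atom document using plain string scanning. The helper names `entry_titles` and `between` are hypothetical, I assume entries are written as bare `<entry>` and `<title>` tags without attributes, and a production job would use a real XML parser instead.

```rust
/// Returns the slice of `text` found between the first `open` and the
/// next `close` after it, if both are present.
fn between<'a>(text: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = text.find(open)? + open.len();
    let len = text[start..].find(close)?;
    Some(&text[start..start + len])
}

/// Collects the title of each `<entry>` in an Atom feed.
/// The feed-level `<title>` is skipped because only entries are inspected.
fn entry_titles(feed: &str) -> Vec<String> {
    let mut titles = Vec::new();
    let mut rest = feed;
    while let Some(start) = rest.find("<entry>") {
        let after = &rest[start + "<entry>".len()..];
        // Stop on a truncated feed rather than reading past the entry.
        let end = match after.find("</entry>") {
            Some(end) => end,
            None => break,
        };
        if let Some(title) = between(&after[..end], "<title>", "</title>") {
            titles.push(title.trim().to_string());
        }
        rest = &after[end + "</entry>".len()..];
    }
    titles
}
```

Each title returned this way identifies one updated crate, which the scheduled job can then download and index like the crawler did during the initial import.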
Search results were satisfying but not quite what we expected; something was wrong. When we typed "serde", for example, the first result was relevant, but the next ones were not. The cause was that Meilisearch had no data to rank crates with, apart from the words matching the query.
Meilisearch cannot break ties between equal crates using only the matching words
Download counts dramatically improved the search results. This data is available through crates.io: a full database export is done each day, and it contains the number of downloads of each crate. I decided to use those counts as the last ranking criterion to help Meilisearch break ties between crates it considers equal.
Meilisearch shows better results thanks to the number of downloads
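The idea of a last ranking criterion can be sketched as a two-level sort. This is a simplification with hypothetical names (`Hit`, `matching_words`, `rank`) and only two criteria; it is not how Meilisearch implements its ranking, it only shows why a tie-breaker changes the order of otherwise equal results.

```rust
/// A simplified search hit (hypothetical shape, for illustration only).
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    name: String,
    /// How many query words the crate matched.
    matching_words: u32,
    /// Total downloads taken from the daily database export.
    downloads: u64,
}

/// Orders hits by matching words first (more is better), and only uses
/// the download count to break ties between hits with the same score.
fn rank(hits: &mut [Hit]) {
    hits.sort_by(|a, b| {
        b.matching_words
            .cmp(&a.matching_words)
            .then_with(|| b.downloads.cmp(&a.downloads))
    });
}
```

A crate that matches more words still wins regardless of popularity; downloads only decide between crates the earlier criterion could not separate.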
I deployed a Heroku scheduler that runs each day to update the download counts of all the crates; it takes something like 30 seconds to download the tarball, extract it, read the CSV, and upload the download counts of the 32 000 crates to Meilisearch. So we are far from using up the 1000 free hours per month.
I think this search demo is pretty good, but I also thought about adding synonyms and stop-words, as Meilisearch supports those. For example, it would be pleasant to write "db" and see results associated with "database". Stop-words would help ignore useless words like "the" or "that", which can sometimes pollute search results, but we need to be careful, as crate names could be composed of stop-words.
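A quick back-of-the-envelope check makes that margin concrete. The sketch below assumes a 30-day month, that only the run time of each job counts (no startup overhead), and reuses the figures quoted in this article: 4 seconds per feed fetch every 10 minutes, and 30 seconds per daily download update. The function names are hypothetical.

```rust
const SECONDS_PER_HOUR: f64 = 3600.0;

/// Hours consumed per month by a job that runs `runs_per_day` times a day
/// for `run_seconds` each time, over `days` days.
fn monthly_hours(run_seconds: f64, runs_per_day: f64, days: f64) -> f64 {
    run_seconds * runs_per_day * days / SECONDS_PER_HOUR
}

/// Total for both scheduled jobs, assuming a 30-day month.
fn total_monthly_hours() -> f64 {
    // Feed job: every 10 minutes means 6 * 24 = 144 runs a day, 4 s each.
    let feed = monthly_hours(4.0, 144.0, 30.0);
    // Download counts job: one 30 s run a day.
    let downloads = monthly_hours(30.0, 1.0, 30.0);
    feed + downloads
}
```

Under those assumptions the feed job accounts for about 4.8 hours and the daily job for about 0.25 hours, so roughly 5 hours out of the 1000 available.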
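To make the trade-off concrete, here is a client-side sketch of how a query could be rewritten with synonyms and stop-words. It is only an illustration: `expand_query` and its parameters are hypothetical, the "keep a stop-word when it is also a crate name" rule is one possible safeguard I am assuming, and none of this reflects how Meilisearch handles these features internally.

```rust
use std::collections::{HashMap, HashSet};

/// Rewrites a query into search terms:
/// - words are lowercased,
/// - stop-words are dropped unless they are also a known crate name,
/// - a word with a synonym is followed by that synonym.
fn expand_query(
    query: &str,
    synonyms: &HashMap<&str, &str>,
    stop_words: &HashSet<&str>,
    crate_names: &HashSet<&str>,
) -> Vec<String> {
    query
        .split_whitespace()
        .map(|word| word.to_lowercase())
        // Keep a stop-word only when it could be the crate the user wants.
        .filter(|w| !stop_words.contains(w.as_str()) || crate_names.contains(w.as_str()))
        .flat_map(|w| {
            let synonym = synonyms.get(w.as_str()).map(|s| (*s).to_string());
            let mut terms = vec![w];
            terms.extend(synonym);
            terms
        })
        .collect()
}
```

With `db` mapped to `database` and `the` listed as a stop-word, the query "the db driver" becomes the terms `db`, `database`, `driver`, while a query made of a stop-word that is also a crate name is left alone.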
Meilisearch also supports basic filtering, and it would be great in the future to be able to search for crates in a category or with a specific keyword.
All these improvements can be made by you; the demo is open source. The core engine source code is also available on GitHub, and that's the whole point of this article! Take a look and talk about it: the more people involved, the more features!