25 Feb 2025

What is latent semantic indexing (LSI) and how does it work?

Learn how LSI works under the hood, see a practical Python implementation, and discover why this foundational technique remains relevant in today's AI-driven search landscape.

Ilia Markov
Senior Growth Marketing Manager

Latent semantic indexing (LSI) is a mathematical approach to document understanding and retrieval. It is commonly used in search engines, e-commerce and website search, and other applications that require search capabilities.

This article provides a comprehensive overview of LSI. In particular, it answers the following questions:

  • What is LSI, and how does it work?
  • What are some benefits of LSI, and where is it used?
  • How do you implement LSI in Python? (we provide a step-by-step explanation)
  • What are some modern alternatives to LSI?

So, let’s get right into it.

What is latent semantic indexing (LSI)?

LSI is an information retrieval method used in natural language processing (NLP) to uncover latent (hidden) relationships between words and concepts within a body of text.

Unlike traditional keyword-based methods, LSI is a form of semantic search: it analyzes the relationships between terms across documents to extract hidden concepts and groups documents according to those concepts.

LSI uses singular value decomposition (SVD) to simplify complex, high-dimensional data by breaking it into a small number of hidden concepts. This helps identify patterns in the relationships between words and documents. LSI addresses the challenge of synonymy by projecting words with similar meanings close to each other in a lower-dimensional space.

For example, the related terms ‘doctor’ and ‘physician’ will be placed close together in the LSI space, reflecting the same concept. When a user searches for a document, the query is projected into the same space, and the documents closest to it are returned.

LSI is one of the foundational techniques for document understanding and retrieval. Owing to its simplicity and computationally less expensive nature, it is still widely used.

Now that you know what LSI is, let’s see how it works.

How does latent semantic indexing work?

LSI employs SVD, a mathematical technique that decomposes a term-document matrix into smaller matrices, capturing the underlying relationship between terms and concepts in the documents.

The following figure demonstrates the workflow for LSI.

[Figure: How latent semantic indexing works (workflow of the five steps below)]

Let’s discuss the above steps with examples:

Step 1: Import the dataset

The first step is to create a set of documents to which you want to apply LSI.

Let’s assume you have the following four documents:

Doc 1: Cats and dogs are wonderful pets.
Doc 2: Dogs are loyal pets.
Doc 3: Pets bring joy and happiness.
Doc 4: Happiness and joy bring meaning to life.

Step 2: Preprocess documents

Text documents may contain stop words (such as ‘and’ or ‘are’) that do not contribute to the meaning or concepts in a document. In preprocessing, you can remove stop words, convert the text to lowercase, and strip punctuation and other noise.

After preprocessing, our documents may look like this:

Doc 1: cats dogs wonderful pets
Doc 2: dogs loyal pets
Doc 3: pets bring joy happiness
Doc 4: happiness joy bring meaning life
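As a rough illustration of this step, the sketch below reproduces the preprocessed documents with the standard library only. The three-word stop word set is a hypothetical, hand-picked list that happens to cover this toy corpus; it is an assumption for illustration, not a general-purpose list.

```python
import string

# Hypothetical, hand-picked stop word list that only covers this toy corpus;
# a real project would use a fuller list (for example, NLTK's).
STOP_WORDS = {"and", "are", "to"}

def preprocess(doc):
    """Lowercase the text, strip punctuation, and drop stop words."""
    cleaned = doc.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in cleaned.split() if word not in STOP_WORDS)

documents = [
    "Cats and dogs are wonderful pets.",
    "Dogs are loyal pets.",
    "Pets bring joy and happiness.",
    "Happiness and joy bring meaning to life.",
]

processed_docs = [preprocess(doc) for doc in documents]
for i, doc in enumerate(processed_docs, start=1):
    print(f"Doc {i}: {doc}")
```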

Step 3: Create term-document matrix

Before you create a term-document matrix, you need to create a set of unique words for all the documents. This set is often referred to as the vocabulary. The vocabulary for the documents in our sample dataset looks like this:

['bring', 'cats', 'dogs', 'happiness', 'joy', 'life', 'loyal', 'meaning', 'pets', 'wonderful']

The next step is to create a term-document matrix of shape N x M, where N is the number of documents and M is the vocabulary size. Each row corresponds to one document and holds the frequency of each vocabulary word in that document. (With documents as rows, this layout is strictly a document-term matrix; we keep the more common name here.) This matrix captures patterns of word co-occurrence across documents, which is crucial for identifying latent concepts.

The term-document matrix for our dataset looks like this:

[Image: term-document matrix for the four sample documents]
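To make the construction concrete, here is a minimal sketch that builds the vocabulary and the count matrix by hand from the preprocessed documents above. It uses plain Python lists; the variable names are illustrative.

```python
processed_docs = [
    "cats dogs wonderful pets",
    "dogs loyal pets",
    "pets bring joy happiness",
    "happiness joy bring meaning life",
]

# Vocabulary: every unique word across all documents, sorted alphabetically
vocabulary = sorted({word for doc in processed_docs for word in doc.split()})

# One row per document, one column per vocabulary word (raw counts)
matrix = [[doc.split().count(word) for word in vocabulary] for doc in processed_docs]

print(vocabulary)
for i, row in enumerate(matrix, start=1):
    print(f"Doc {i}: {row}")
```

Each row has a 1 wherever the document contains the corresponding vocabulary word, so Doc 1 has ones in the ‘cats’, ‘dogs’, ‘pets’, and ‘wonderful’ columns.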

Step 4: Singular value decomposition

The SVD algorithm decomposes a matrix into smaller matrices. In LSI, SVD decomposes a term-document matrix A into three matrices: A = UΣVᵀ

  • Matrix U: Relates documents to latent concepts. It is also known as the document-concept similarity matrix. This matrix shows how much a document relates to a particular concept.
  • Matrix Σ: A diagonal matrix of singular values representing the strength of each concept.
  • Matrix Vᵀ: Relates terms to latent concepts and is often called the term-concept similarity matrix. It shows how much a term relates to a particular concept.
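The decomposition can be checked numerically. The sketch below uses NumPy's general-purpose `np.linalg.svd` (an assumption for illustration; the Python section later uses scikit-learn instead) on the count matrix of the four sample documents, then keeps only the two strongest concepts.

```python
import numpy as np

# Count matrix of the sample corpus: rows are Doc 1-4,
# columns follow the alphabetically sorted vocabulary
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, S.shape, Vt.shape)  # (4, 4) (4,) (4, 10)

# The full decomposition reproduces A
print("Exact reconstruction:", np.allclose(U @ np.diag(S) @ Vt, A))

# Keep only the k strongest concepts (truncated SVD)
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print("Singular values kept:", np.round(S[:k], 3))
print("Rank-2 approximation error:", round(float(np.linalg.norm(A - A_k)), 3))
```

The singular values come back sorted from largest to smallest, which is why keeping the first k columns of U and the first k rows of Vᵀ keeps the strongest concepts.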

We will not delve into the mathematical details of SVD here. However, the three matrices obtained for two concepts in our dataset look like this.

[Image: the U, Σ, and Vᵀ matrices for two concepts]

Step 5: Analyze LSI matrices

It is important to note that the concept names are not automatically generated in LSI. Instead, you must look at the documents or terms that are grouped together and infer the concepts yourself.

For example, you can see that Doc 1 and Doc 2 belong to concept 2 since they have higher values for the second column in the document-concept similarity matrix. Similarly, Doc 3 and Doc 4 belong to concept 1.

Docs 1 and 2 mention animals and pets. Docs 3 and 4 are more about happiness and joy. Therefore, we can name the two concepts ‘pet animals’ and ‘life and happiness’. This grouping is what allows LSI to retrieve documents with relevant content even when the exact query terms do not appear in them.
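A small experiment illustrates the point about matching without exact terms. In the sample corpus, ‘cats’ and ‘loyal’ never appear in the same document, so their raw count vectors have zero similarity. In the two-concept space, however, they end up close together because both co-occur with ‘dogs’ and ‘pets’. The sketch below is one way to check this with NumPy; using the rows of V as term vectors is an assumption of this example.

```python
import numpy as np

vocabulary = ["bring", "cats", "dogs", "happiness", "joy",
              "life", "loyal", "meaning", "pets", "wonderful"]
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cats, loyal, meaning = (vocabulary.index(w) for w in ("cats", "loyal", "meaning"))

# Raw term vectors are the columns of A (one entry per document)
raw_cats_loyal = cosine(A[:, cats], A[:, loyal])

# Latent term vectors: rows of V restricted to the two strongest concepts
_, _, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = Vt[:2, :].T  # shape (10, 2)
lsi_cats_loyal = cosine(term_vectors[cats], term_vectors[loyal])
lsi_cats_meaning = cosine(term_vectors[cats], term_vectors[meaning])

print(f"raw  cats vs loyal:   {raw_cats_loyal:.3f}")
print(f"LSI  cats vs loyal:   {lsi_cats_loyal:.3f}")
print(f"LSI  cats vs meaning: {lsi_cats_meaning:.3f}")
```

Cosine similarity is unaffected by the arbitrary sign of each singular vector, so the comparison is stable across SVD implementations.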

You will see the complete Python application of the above example in a later section; for now, let’s look at some uses and benefits of LSI.

Where is latent semantic indexing used?

Latent semantic indexing is used in various NLP domains, including text summarization, automatic document categorization, online customer support, and spam filtering.

The following are some use cases where LSI comes in handy:

  • Search engines: LSI analyzes user queries and documents semantically to improve search engine performance. This helps search engines understand a user’s search intent and retrieve more relevant web pages and related searches.
  • Automated document classification: LSI search algorithms efficiently classify documents into predefined categories. They are commonly used for unsupervised sentiment classification, email classification, and other purposes.
  • Online customer support: As with search engines, LSI can match customer queries with relevant solutions in customer management systems.
  • Spam filtering: LSI detects and filters spam emails based on semantic content.
  • Information visualization: Document clusters generated via LSI can be plotted in low-dimensional space to view the relationship between the documents.

Now that we have seen some uses of LSI, let’s discuss its advantages.

What are the benefits of latent semantic indexing?

The key benefits of LSI are the following:

  • Concept-based clustering: LSI groups related documents together, making organizing and analyzing large datasets easier.
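  • Handling synonyms and polysemy: LSI handles synonyms well. For instance, the words ‘car’ and ‘automobile’ will have similar semantic representations. Polysemy (one word with several meanings) is handled only partially, because each term gets a single representation.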
  • Scalability: With sparse matrices and truncated SVD, LSI can be applied to sizable document collections, which is why it has been used in applications such as e-discovery and enterprise search systems. The decomposition does need to be recomputed or updated as the collection changes.
  • Robustness against noisy text: Because documents are compared by concept rather than by exact term, LSI is less sensitive to occasional spelling errors in documents, which can improve user experience in search and retrieval systems.
  • Versatility across domains: Applied in search engines, education, finance, and more.

The following section shows the main difference between LSI and latent semantic analysis.

What is the difference between LSI and LSA?

LSI and Latent Semantic Analysis (LSA) are often used interchangeably. Both techniques use SVD at their heart. However, there are slight differences between their applications and focuses.

LSI was initially developed as an information retrieval and search technique that addresses challenges such as synonymy and the semantic understanding of documents. Its primary application is retrieving documents that are semantically similar to a user’s search query.

On the other hand, latent semantic analysis goes beyond information retrieval to focus on other NLP tasks such as speech recognition, document clustering and classification, and cognitive modeling.

Let’s see how to implement LSI in Python.

Implementing Latent Semantic Indexing in Python

This section will show a hands-on example of implementing an LSI algorithm in Python.

Installing and importing required libraries

We will use the scikit-learn library and the NLTK toolkit to implement LSI in Python, along with pandas and Matplotlib for displaying results. The following notebook command installs these libraries.

!pip install -qU scikit-learn nltk pandas matplotlib

The script below imports the required modules and classes to run the Python scripts mentioned in this article.


import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import matplotlib.pyplot as plt
import pandas as pd

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

Let’s implement LSI step by step. These are the same steps described in the earlier section on how LSI works, now implemented in Python.

Step 1: Import documents

The first step is to collect the documents on which you want to implement LSI. These can be your personal, business, or client documents.

This section will use a small dataset of four dummy documents containing one sentence each.


# Example documents
documents = [
 "Cats and dogs are wonderful pets.",
 "Dogs are loyal pets.",
 "Pets bring joy and happiness.",
 "Happiness and joy bring meaning to life."
]

Step 2: Preprocess documents

In the preprocessing step, we will remove stopwords and punctuation from the documents, as shown in the following script:

# Preprocessing: tokenization, lowercasing, and stop word removal
stop_words = set(stopwords.words('english'))

def preprocess(doc):
    # Lowercase and tokenize, then keep alphanumeric tokens that are not stop words
    words = word_tokenize(doc.lower())
    return ' '.join(word for word in words if word.isalnum() and word not in stop_words)

processed_docs = [preprocess(doc) for doc in documents]
processed_docs

Output:

[Image: output listing the four preprocessed documents]

Step 3: Create term-document matrix

You can use the fit_transform() method of the CountVectorizer class from the scikit-learn library to create a term-document matrix. You can retrieve the document vocabulary using the get_feature_names_out() method.

vectorizer = CountVectorizer()
term_document_matrix = vectorizer.fit_transform(processed_docs)

feature_names = vectorizer.get_feature_names_out()
print(feature_names)
term_document_array = term_document_matrix.toarray()

df_term_document = pd.DataFrame(term_document_array, columns=feature_names, 
 index=[f"Doc {i+1}" for i in range(len(term_document_array))])

print(df_term_document)

Output:

[Image: output showing the vocabulary and the term-document matrix]

The above output shows the documents’ vocabulary (list of unique words) and the term-document matrix, demonstrating word frequencies for each document.

Step 4: Apply singular value decomposition

You can use the TruncatedSVD class from the scikit-learn library to implement SVD. You must pass the number of concepts you want to extract from the documents through the n_components parameter.

In the script below, we extract two concepts. The output shows the singular values (concept strengths), the document-concept similarity matrix, and the term-concept similarity matrix. The singular values show that concept 1 is slightly more dominant in the documents.

svd = TruncatedSVD(n_components=2, random_state=42)
lsi_matrix = svd.fit_transform(term_document_matrix)

# Display results
print("\nSingular Values (Concept Strength):\n", svd.singular_values_)
print("\nDocument-Concept Similarity Matrix:\n", lsi_matrix)
print("\nTerm-Concept Similarity Matrix:\n", svd.components_.T)

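[Image: output showing the singular values, the document-concept similarity matrix, and the term-concept similarity matrix]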
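Since the number of concepts has to be chosen up front, it can help to look at how much of the matrix each concept accounts for. The sketch below shows one common heuristic, assumed here for illustration rather than prescribed by this walkthrough: the cumulative share of squared singular values, computed with NumPy on the same count matrix.

```python
import numpy as np

# Count matrix of the sample corpus (rows: Doc 1-4, columns: sorted vocabulary)
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
], dtype=float)

# Singular values only, sorted from strongest to weakest concept
S = np.linalg.svd(A, compute_uv=False)

# Cumulative share of the squared singular values retained by the first k concepts
energy = np.cumsum(S**2) / np.sum(S**2)
for k, share in enumerate(energy, start=1):
    print(f"{k} concept(s): {share:.1%}")
```

For this toy corpus, two concepts already retain most of the total, which supports the choice of n_components=2.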

We will analyze the document-concept and term-concept similarity matrices in the next step.

Step 5: Analyze LSI matrices

Let’s draw a 2-D scatter plot that places each document on the two concept axes.

# Extract values for Concept 1 (x-axis) and Concept 2 (y-axis)
x = lsi_matrix[:, 0] # Values for Concept 1
y = lsi_matrix[:, 1] # Values for Concept 2

# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, color='blue', label='Documents')

# Annotate each document
for i, (x_val, y_val) in enumerate(zip(x, y)):
 plt.text(x_val + 0.02, y_val, f'Doc {i+1}', fontsize=9)

# Add gridlines, labels, and title
plt.axhline(0, color='gray', linestyle='--', linewidth=0.5)
plt.axvline(0, color='gray', linestyle='--', linewidth=0.5)
plt.title("Document-Concept Similarity")
plt.xlabel("Concept 1")
plt.ylabel("Concept 2")
plt.grid()
plt.legend()
plt.show()

Output:

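[Image: scatter plot titled "Document-Concept Similarity", with Concept 1 on the x-axis and Concept 2 on the y-axis]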

The output shows that documents 1 and 2 belong mainly to concept 2, while documents 3 and 4 belong to concept 1.
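The same assignment can be read off numerically instead of from the plot. The sketch below is a NumPy-only approximation of the document-concept matrix (U scaled by the singular values, which should match lsi_matrix up to the sign of each column). Because the sign of a singular vector is arbitrary, it compares absolute values when picking the dominant concept; that choice is an assumption of this example.

```python
import numpy as np

# Count matrix of the sample corpus (rows: Doc 1-4, columns: sorted vocabulary)
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
], dtype=float)

U, S, _ = np.linalg.svd(A, full_matrices=False)

# Document coordinates on the two strongest concepts
doc_concepts = U[:, :2] * S[:2]

# Dominant concept per document: the axis with the largest absolute coordinate
dominant = np.argmax(np.abs(doc_concepts), axis=1) + 1
for i, concept in enumerate(dominant, start=1):
    print(f"Doc {i} -> concept {concept}")
```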

To get an idea of the information in each concept, you can plot the terms for each concept.


terms = vectorizer.get_feature_names_out()
concept1_weights = svd.components_[0]
concept2_weights = svd.components_[1]

fig, ax = plt.subplots(1, 2, figsize=(12, 6))

ax[0].barh(terms, concept1_weights, color='orange')
ax[0].set_title("Term Weights for Concept 1")
ax[0].set_xlabel("Weight")

ax[1].barh(terms, concept2_weights, color='green')
ax[1].set_title("Term Weights for Concept 2")
ax[1].set_xlabel("Weight")

plt.tight_layout()
plt.show()

Output:

[Image: two bar charts titled "Term Weights for Concept 1" and "Term Weights for Concept 2"]

The above output shows that related keywords such as ‘pets,’ ‘joy,’ ‘happiness,’ ‘bring,’ etc., belong mainly to concept 1, which is about life and emotions.

On the other hand, the terms ‘pets,’ ‘wonderful,’ ‘cats,’ ‘dogs,’ etc., belong mainly to concept 2. We can infer that concept 2 is about pets and animals.

Now you know why documents 1 and 2 belong to concept 2 and documents 3 and 4 belong to concept 1.
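If you prefer a textual summary to a chart, the strongest terms per concept can be listed directly. The sketch below again uses NumPy's SVD as a stand-in for the fitted scikit-learn model, and ranks terms by absolute weight because the sign of each concept axis is arbitrary; the top_terms helper is a hypothetical name introduced for this example.

```python
import numpy as np

vocabulary = ["bring", "cats", "dogs", "happiness", "joy",
              "life", "loyal", "meaning", "pets", "wonderful"]
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
], dtype=float)

_, _, Vt = np.linalg.svd(A, full_matrices=False)

def top_terms(concept_weights, n=3):
    """Return the n terms with the largest absolute weight for one concept."""
    order = np.argsort(-np.abs(concept_weights))[:n]
    return [vocabulary[i] for i in order]

for c in range(2):
    print(f"Concept {c + 1}:", top_terms(Vt[c]))
```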

And that’s it. You have developed your first LSI model using your custom documents.

In the next step, you will learn to use LSI to retrieve relevant search results against a user query.

Search and retrieval

You need to preprocess the query the same way you preprocessed your documents for LSI.

user_query = "Joyful pets bring happiness to life." # Example query
preprocessed_query = preprocess(user_query) # Preprocess query
print("Preprocessed Query:", preprocessed_query)

Output:

[Image: output showing the preprocessed query]

Next, use the fitted SVD model (the same one that produced the document-concept and term-concept similarity matrices) to embed the query in the LSI space.

You can then find the similarity between the query and the documents in the LSI space using the cosine similarity or any other vector similarity function.


query_vector = vectorizer.transform([preprocessed_query])  # Transform query to a term-count vector
query_lsi = svd.transform(query_vector)  # Map query to LSI latent space
print("\nQuery in LSI Space (Concepts):\n", query_lsi)

# Use cosine similarity between the query and document vectors
similarities = cosine_similarity(query_lsi, lsi_matrix)
print("\nSimilarity Scores:\n", similarities)

Output:

[Image: output showing the query in LSI space and the similarity scores]
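To see what the projection does under the hood, the sketch below repeats the query step with NumPy only. It assumes that mapping a query into the concept space amounts to multiplying its term-count vector by the transposed concept matrix, which is how a truncated SVD projection is usually described. Words outside the vocabulary, such as ‘joyful’, are simply ignored.

```python
import numpy as np

vocabulary = ["bring", "cats", "dogs", "happiness", "joy",
              "life", "loyal", "meaning", "pets", "wonderful"]
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
], dtype=float)

k = 2
U, S, Vt = np.linalg.svd(A, full_matrices=False)
doc_vectors = U[:, :k] * S[:k]  # documents in concept space

# Term-count vector of the preprocessed query; unknown words contribute nothing
query = "joyful pets bring happiness life"
q = np.array([query.split().count(word) for word in vocabulary], dtype=float)

# Project the query onto the same concept axes as the documents
q_lsi = q @ Vt[:k].T

# Cosine similarity between the query and every document
scores = doc_vectors @ q_lsi / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_lsi))
ranking = np.argsort(-scores)

for idx in ranking:
    print(f"Doc {idx + 1}: {scores[idx]:.3f}")
```

In this sketch, Doc 3 comes out on top, followed by Doc 4.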

Finally, you can retrieve and rank the documents based on their relevance to the input query. In this case, cosine similarity will be the ranking factor for the retrieved documents.

# Rank documents by similarity
doc_indices = np.argsort(similarities[0])[::-1]  # Sort by descending similarity
print("\nRanked Document Indices (Most Relevant First):", doc_indices)

# Output relevant documents
print("\nTop Relevant Documents:")
for idx in doc_indices:
    print(f"Doc {idx + 1}: {documents[idx]} (Similarity: {similarities[0][idx]:.3f})")

Output:

[Image: output showing the ranked documents with their similarity scores]

The output shows the search rankings for the documents against the input query. Document 3 is most relevant to the search terms in the query, which makes sense as it discusses both pets and happiness.

Now, let’s see if LSI still matters.

Is latent semantic indexing still relevant?

LSI is easy to implement and computationally inexpensive. That's why it's still used for simple document understanding and retrieval solutions where a deep understanding of the relationships between words and concepts is unnecessary.

However, newer methods have been developed to enable a more advanced understanding of documents. These include vector search, word embeddings, and transformer approaches based on machine learning and deep learning techniques. These methods generally outperform LSI on retrieval benchmarks.

Meilisearch is an advanced AI search engine that leverages cutting-edge vector search approaches to integrate state-of-the-art search engine capabilities into your applications. It implements semantic search techniques based on word embeddings and vector search that allow a deeper understanding of relationships and concepts within a document, improving the relevance and robustness of retrieved documents.

Meilisearch's AI search engine can seamlessly integrate into e-commerce, websites, app searches, and any other application that involves searching for items or documents.

The bottom line

LSI is a foundational technique for search and retrieval applications. It is simple to implement and computationally less expensive than advanced deep learning–based techniques. Nevertheless, it has trouble with scalability, real-time relevance, and multilingual understanding.

But with the advent of vector search and advanced word embeddings, tools like Meilisearch are redefining what’s possible in document understanding and retrieval. Meilisearch offers state-of-the-art features for document search:

  • Blazing-fast performance: Delivers search results in under 50 milliseconds for a smooth user experience.
  • Search-as-you-type: Provides real-time results with instant feedback as users type.
  • Typo tolerance: Ensures relevant results even with typos or misspellings in queries.
  • Comprehensive language support: Optimized for multiple languages, including those written in the Latin alphabet as well as Chinese, Japanese, and Hebrew.
  • Faceted search and filtering: Enables intuitive navigation through categories and filters.
  • Custom ranking and relevancy: Allows tailored ranking and relevancy rules for precise search results.
  • AI-ready integration: Works seamlessly with AI models for hybrid semantic and full-text search capabilities.

Do you plan to integrate an advanced AI search engine into your application? Sign up with Meilisearch to power your application with a high-performance search solution.
