What is latent semantic indexing (LSI) and how does it work?
Learn how LSI works under the hood, see a practical Python implementation, and discover why this foundational technique remains relevant in today's AI-driven search landscape.

Latent semantic indexing (LSI) is a mathematical approach to document understanding and retrieval. It is commonly used in search engines, e-commerce, website search, and other applications that require search capabilities.
This article provides a comprehensive overview of LSI. In particular, it answers the following questions:
- What is LSI, and how does it work?
- What are some benefits of LSI, and where is it used?
- How do you implement LSI in Python? (we provide a step-by-step explanation)
- What are some modern alternatives to LSI?
So, let’s get right into it.
What is latent semantic indexing (LSI)?
LSI is an information retrieval method used in natural language processing (NLP) to uncover latent (hidden) relationships between words and concepts within a body of text.
Unlike the traditional keyword-based methods, LSI is a type of semantic search that analyzes the semantic relationships between terms in a document to extract hidden concepts and groups documents according to those concepts.
LSI uses singular value decomposition (SVD) to simplify complex, high-dimensional data by breaking it down into a small number of hidden concepts. This helps identify patterns in the relationships between words and documents. LSI addresses the challenges of synonymy and polysemy by projecting words with similar meanings onto similar concept dimensions.
For example, the related terms ‘doctor’ and ‘physician’ will be placed close together in the LSI concept space, reflecting the same underlying concept. When a user searches for a document, the query is projected into the same concept space, and the most relevant documents are returned.
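To make this idea concrete, here is a minimal, self-contained sketch (separate from the worked example later in this article): it builds a tiny toy corpus, reduces it to two latent concepts with truncated SVD, and checks that terms sharing context, such as ‘doctor’ and ‘physician’, end up with similar concept vectors. The corpus, the two-component choice, and the helper function are illustrative assumptions, not part of the original example.

```python
# Minimal sketch: synonyms that share context land close together in LSI concept space.
# The toy corpus and the number of components are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the doctor treated the patient at the hospital",
    "the physician treated the patient at the clinic",
    "the chef cooked dinner at the restaurant",
    "the cook prepared a meal in the kitchen",
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)                # document-term matrix

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

terms = list(vectorizer.get_feature_names_out())
term_vectors = svd.components_.T                  # one row per term, in concept space

def term_vec(word):
    # Hypothetical helper: look up a term's vector in concept space
    return term_vectors[terms.index(word)].reshape(1, -1)

# Terms that appear in similar contexts get similar concept vectors
print(cosine_similarity(term_vec('doctor'), term_vec('physician')))  # high similarity
print(cosine_similarity(term_vec('doctor'), term_vec('chef')))       # low similarity
```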
LSI is one of the foundational techniques for document understanding and retrieval. Owing to its simplicity and low computational cost, it is still widely used.
Now that you know what LSI is, let’s see how it works.
How does latent semantic indexing work?
LSI employs SVD, a mathematical technique that decomposes a term-document matrix into smaller matrices, capturing the underlying relationship between terms and concepts in the documents.
The following figure demonstrates the LSI workflow: importing the dataset, preprocessing the documents, building a term-document matrix, applying SVD, and analyzing the resulting matrices.
Let’s discuss these steps with examples:
Step 1: Import the dataset
The first step is to create a set of documents to which you want to apply LSI.
Let’s assume you have the following four documents:
| Document | Text |
| --- | --- |
| Doc 1 | Cats and dogs are wonderful pets. |
| Doc 2 | Dogs are loyal pets. |
| Doc 3 | Pets bring joy and happiness. |
| Doc 4 | Happiness and joy bring meaning to life. |
Step 2: Preprocess documents
Text documents may contain stopwords that do not contribute to the meaning or concepts in a document. In preprocessing, you can remove stopwords, convert the text to lowercase, and strip punctuation and other uninformative tokens.
After preprocessing, our documents may look like this:
| Document | Preprocessed text |
| --- | --- |
| Doc 1 | cats dogs wonderful pets |
| Doc 2 | dogs loyal pets |
| Doc 3 | pets bring joy happiness |
| Doc 4 | happiness joy bring meaning life |
Step 3: Create term-document matrix
Before you create a term-document matrix, you need to create a set of unique words for all the documents. This set is often referred to as the vocabulary. The vocabulary for the documents in our sample dataset looks like this:
```
['bring', 'cats', 'dogs', 'happiness', 'joy', 'life', 'loyal', 'meaning', 'pets', 'wonderful']
```
The next step is to create a term-document matrix of the shape N x M, where N is the number of documents and M is the vocabulary size. Each row in the matrix corresponds to the frequency of word occurrence in a document. This matrix captures patterns of word co-occurrence across documents, which is crucial for identifying latent concepts.
The term-document matrix for our dataset looks like this:

| | bring | cats | dogs | happiness | joy | life | loyal | meaning | pets | wonderful |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Doc 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Doc 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| Doc 3 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| Doc 4 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
Step 4: Singular value decomposition
The SVD algorithm decomposes a matrix into smaller matrices. In LSI, SVD decomposes the term-document matrix A into three matrices: A = UΣVᵀ
- Matrix U: Relates documents to latent concepts. It is also known as the document-concept similarity matrix. This matrix shows how much a document relates to a particular concept.
- Matrix Σ: A diagonal matrix of singular values representing the strength of each concept.
- Matrix Vᵀ: Relates terms to latent concepts and is often called the term-concept similarity matrix. It shows how much a term relates to a particular concept.
We will not delve into the mathematical details of SVD here. However, the three matrices retrieved for the two concepts in our dataset look like this:
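For readers who want to see the decomposition itself, below is a minimal sketch that applies `numpy.linalg.svd` to the term-document matrix from Step 3 (columns in alphabetical vocabulary order) and truncates it to the two strongest concepts. The signs and exact values may differ slightly from other SVD implementations; only the relative structure matters.

```python
import numpy as np

# Term-document matrix from Step 3 (rows = Doc 1-4, columns = bring, cats, dogs,
# happiness, joy, life, loyal, meaning, pets, wonderful)
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],  # Doc 1
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 0],  # Doc 2
    [1, 0, 0, 1, 1, 0, 0, 0, 1, 0],  # Doc 3
    [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],  # Doc 4
])

# Full SVD: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the two strongest concepts (rank-2 truncation)
k = 2
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

print("Document-concept matrix (U_k):\n", U_k)
print("Concept strengths (S_k):", S_k)
print("Term-concept matrix (Vt_k transposed):\n", Vt_k.T)
```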
Step 5: Analyze LSI matrices
It is important to note that the concept names are not automatically generated in LSI. Instead, you must look at the documents or terms grouped and infer the concepts.
For example, you can see that Doc 1 and Doc 2 belong to concept 2 since they have higher values for the second column in the document-concept similarity matrix. Similarly, Doc 3 and Doc 4 belong to concept 1.
Docs 1 and 2 mention animals and pets. Docs 3 and 4 are more about happiness and joy. Therefore, we can name the two concepts ‘pet animals’ and ‘life and happiness’. This allows LSI to retrieve the most relevant documents, even when the exact query terms are not matched.
You will see the complete Python application of the above example in a later section; for now, let’s look at some uses and benefits of LSI.
Where is latent semantic indexing used?
Latent semantic indexing is used in various NLP domains, including text summarization, automatic document categorization, online customer support, and spam filtering.
The following are some use cases where LSI comes in handy:
- Search engines: LSI analyzes user queries and documents semantically to improve search engine performance. This helps search engines understand a user’s search intent and retrieve more relevant web pages and related searches.
- Automated document classification: LSI search algorithms efficiently classify documents into predefined categories. They are commonly used for unsupervised sentiment classification, email classification, and other purposes.
- Online customer support: As with search engines, LSI can match customer queries with relevant solutions in customer management systems.
- Spam filtering: LSI detects and filters spam emails based on semantic content.
- Information visualization: Document clusters generated via LSI can be plotted in low-dimensional space to view the relationship between the documents.
Now that we have seen some uses of LSI, let’s discuss its advantages.
What are the benefits of latent semantic indexing?
The key benefits of LSI are the following:
- Concept-based clustering: LSI groups related documents together, making organizing and analyzing large datasets easier.
- Handling synonyms and polysemy: LSI can effectively handle synonyms. For instance, the words ‘car’ and ‘automobile’ will have similar semantic representations.
- Scalability: LSI boils down to standard linear-algebra operations that can exploit optimized numerical libraries, allowing it to handle sizable collections in applications such as e-discovery and enterprise search systems.
- Robustness against typos: LSI’s reliance on semantic meaning makes it less sensitive to spelling errors, improving user experience in search and retrieval systems.
- Versatility across domains: LSI is applied in search engines, education, finance, and more.
The following section shows the main difference between LSI and latent semantic analysis.
What is the difference between LSI and LSA?
LSI and Latent Semantic Analysis (LSA) are often used interchangeably. Both techniques use SVD at their heart. However, there are slight differences between their applications and focuses.
LSI was initially developed as an information retrieval and search technique that addresses challenges like semantic understanding and synonymy in documents. Its primary application is retrieving documents that are semantically similar to a user’s search query.
On the other hand, latent semantic analysis goes beyond information retrieval to focus on other NLP tasks such as speech recognition, document clustering and classification, and cognitive modeling.
Let’s see how to implement LSI in Python.
Implementing Latent Semantic Indexing in Python
This section will show a hands-on example of implementing an LSI algorithm in Python.
Installing and importing required libraries
We will use the scikit-learn library and the NLTK toolkit (along with pandas for displaying results) to implement LSI in Python. The following script installs these libraries:
```python
!pip install -qU scikit-learn nltk pandas
```
The script below imports the required modules and classes to run the Python scripts mentioned in this article.
```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import matplotlib.pyplot as plt
import pandas as pd

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
```
Let’s implement LSI step by step. These are the same steps you saw in the previous section explaining how LSI works, now implemented in Python.
Step 1: Import documents
The first step is to collect the documents on which you want to implement LSI. These can be your personal, business, or client documents.
This section will use a small dataset of four dummy documents containing one sentence each.
```python
# Example documents
documents = [
    "Cats and dogs are wonderful pets.",
    "Dogs are loyal pets.",
    "Pets bring joy and happiness.",
    "Happiness and joy bring meaning to life."
]
```
Step 2: Preprocess documents
In the preprocessing step, we will remove stopwords and punctuation from the documents, as shown in the following script:
```python
# Preprocessing: tokenization and stopword removal
stop_words = set(stopwords.words('english'))

def preprocess(doc):
    words = word_tokenize(doc.lower())
    return ' '.join([word for word in words if word.isalnum() and word not in stop_words])

processed_docs = [preprocess(doc) for doc in documents]
processed_docs
```
Output:
Step 3: Create term-document matrix
You can use the `fit_transform()` method of the `CountVectorizer` class from the scikit-learn library to create a term-document matrix. You can retrieve the document vocabulary using the `get_feature_names_out()` method.
```python
vectorizer = CountVectorizer()
term_document_matrix = vectorizer.fit_transform(processed_docs)
feature_names = vectorizer.get_feature_names_out()
print(feature_names)

term_document_array = term_document_matrix.toarray()
df_term_document = pd.DataFrame(
    term_document_array,
    columns=feature_names,
    index=[f"Doc {i+1}" for i in range(len(term_document_array))]
)
print(df_term_document)
```
Output:
The above output shows the documents’ vocabulary (list of unique words) and the term-document matrix, demonstrating word frequencies for each document.
Step 4: Apply singular value decomposition
You can use the `TruncatedSVD` class from the scikit-learn library to implement SVD. You must pass the number of concepts (components) you want to extract from the documents.
In the script below, we extract two concepts. The output shows the concept strength matrix, the document-concept similarity matrix, and the term-concept similarity matrix. The concept strength matrix shows that concept 1 is slightly more dominant in the documents.
```python
svd = TruncatedSVD(n_components=2, random_state=42)
lsi_matrix = svd.fit_transform(term_document_matrix)

# Display results
print("Singular Values (Concept Strength):\n", svd.singular_values_)
print("Document-Concept Similarity Matrix:\n", lsi_matrix)
print("Term-Concept Similarity Matrix:\n", svd.components_.T)
```
We will analyze the document-concept and term-concept similarity matrices in the next step.
Step 5: Analyze LSI matrices
Let’s plot a 2-D plot that shows documents on the concept axis.
```python
# Extract values for Concept 1 (x-axis) and Concept 2 (y-axis)
x = lsi_matrix[:, 0]  # Values for Concept 1
y = lsi_matrix[:, 1]  # Values for Concept 2

# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(x, y, color='blue', label='Documents')

# Annotate each document
for i, (x_val, y_val) in enumerate(zip(x, y)):
    plt.text(x_val + 0.02, y_val, f'Doc {i+1}', fontsize=9)

# Add gridlines, labels, and title
plt.axhline(0, color='gray', linestyle='--', linewidth=0.5)
plt.axvline(0, color='gray', linestyle='--', linewidth=0.5)
plt.title("Document-Concept Similarity")
plt.xlabel("Concept 1")
plt.ylabel("Concept 2")
plt.grid()
plt.legend()
plt.show()
```
Output:
The output shows that documents 1 and 2 belong mainly to concept 2, while documents 3 and 4 belong to concept 1.
To get an idea of the information in each concept, you can plot the terms for each concept.
```python
terms = vectorizer.get_feature_names_out()
concept1_weights = svd.components_[0]
concept2_weights = svd.components_[1]

fig, ax = plt.subplots(1, 2, figsize=(12, 6))
ax[0].barh(terms, concept1_weights, color='orange')
ax[0].set_title("Term Weights for Concept 1")
ax[0].set_xlabel("Weight")
ax[1].barh(terms, concept2_weights, color='green')
ax[1].set_title("Term Weights for Concept 2")
ax[1].set_xlabel("Weight")

plt.tight_layout()
plt.show()
```
Output:
The above output shows that related keywords such as ‘pets,’ ‘joy,’ ‘happiness,’ ‘bring,’ etc., belong mainly to concept 1, which is about life and emotions.
On the other hand, the terms ‘pets,’ ‘wonderful,’ ‘cats,’ ‘dogs,’ etc., belong mainly to concept 2. We can infer that concept 2 is about pets and animals.
Now you know why documents 1 and 2 belong to concept 2 and documents 3 and 4 belong to concept 1.
And that’s it. You have developed your first LSI model using your custom documents.
In the next step, you will learn to use LSI to retrieve relevant search results against a user query.
Search and retrieval
You need to preprocess the query the same way you preprocessed your documents for LSI.
```python
user_query = "Joyful pets bring happiness to life."  # Example query
preprocessed_query = preprocess(user_query)  # Preprocess query
print("Preprocessed Query:", preprocessed_query)
```
Output:
Next, use the same SVD model that produced the document-concept and term-concept similarity matrices to project the query into the LSI space.
You can then find the similarity between the query and the documents in the LSI space using the cosine similarity or any other vector similarity function.
```python
query_vector = vectorizer.transform([preprocessed_query])  # Transform query to term-document matrix
query_lsi = svd.transform(query_vector)  # Map query to LSI latent space
print("Query in LSI Space (Concepts):\n", query_lsi)

# Use cosine similarity between the query and document vectors
similarities = cosine_similarity(query_lsi, lsi_matrix)
print("Similarity Scores:\n", similarities)
```
Output:
Finally, you can retrieve and rank the documents based on their relevance to the input query. In this case, cosine similarity will be the ranking factor for the retrieved documents.
```python
# Rank documents by similarity
doc_indices = np.argsort(similarities[0])[::-1]  # Sort by descending similarity
print("Ranked Document Indices (Most Relevant First):", doc_indices)

# Output relevant documents
print("Top Relevant Documents:")
for idx in doc_indices:
    print(f"Doc {idx + 1}: {documents[idx]} (Similarity: {similarities[0][idx]:.3f})")
```
Output:
The output shows the search rankings for the documents against the input query. Document 3 is most relevant to the search terms in the query, which makes sense as it discusses both pets and happiness.
Now, let’s see if LSI still matters.
Is latent semantic indexing still relevant?
LSI is easy to implement and isn’t computationally expensive. That's why it's still used for simple document understanding and retrieval solutions where a deep understanding of the relationships between words and concepts is unnecessary.
However, newer methods have been developed to enable a more advanced understanding of documents. These include vector search, word embeddings, and transformer approaches based on machine learning and deep learning techniques. These methods outperform LSI on almost all benchmarks.
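As a rough illustration of the embedding-based alternative, the sketch below ranks the same four example documents against the earlier query using dense sentence embeddings instead of LSI. It assumes the `sentence-transformers` package and the `all-MiniLM-L6-v2` model, neither of which is used elsewhere in this article; both are illustrative choices.

```python
# Hedged sketch of embedding-based semantic search (an alternative to LSI).
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Cats and dogs are wonderful pets.",
    "Dogs are loyal pets.",
    "Pets bring joy and happiness.",
    "Happiness and joy bring meaning to life."
]
query = "Joyful pets bring happiness to life."

model = SentenceTransformer('all-MiniLM-L6-v2')  # illustrative model choice
doc_embeddings = model.encode(documents)          # one dense vector per document
query_embedding = model.encode([query])

# Rank documents by cosine similarity, exactly as in the LSI example above
scores = cosine_similarity(query_embedding, doc_embeddings)[0]
for idx in scores.argsort()[::-1]:
    print(f"Doc {idx + 1}: {documents[idx]} (Similarity: {scores[idx]:.3f})")
```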
Meilisearch is an advanced AI search engine that leverages cutting-edge vector search approaches to integrate state-of-the-art search engine capabilities into your applications. It implements semantic search techniques based on word embeddings and vector search that allow a deeper understanding of relationships and concepts within a document, improving the relevance and robustness of retrieved documents.
Meilisearch's AI search engine can seamlessly integrate into e-commerce, websites, app searches, and any other application that involves searching for items or documents.
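For comparison with the hand-rolled LSI pipeline above, here is a hedged sketch of indexing and searching the same example documents with Meilisearch. It assumes a Meilisearch instance running locally and the official `meilisearch` Python client; the URL, API key, and index name are placeholders.

```python
# Sketch: indexing and searching the example documents with Meilisearch.
# Assumes a local Meilisearch instance and `pip install meilisearch`;
# the URL, key, and index name below are placeholders.
import meilisearch

client = meilisearch.Client('http://127.0.0.1:7700', 'aSampleMasterKey')
index = client.index('articles')

# Indexing is asynchronous; in a real script, wait for the indexing task
# to finish before querying.
index.add_documents([
    {"id": 1, "text": "Cats and dogs are wonderful pets."},
    {"id": 2, "text": "Dogs are loyal pets."},
    {"id": 3, "text": "Pets bring joy and happiness."},
    {"id": 4, "text": "Happiness and joy bring meaning to life."},
])

# 'happines' contains a typo; typo tolerance should still match 'happiness'.
results = index.search('happines pets')
for hit in results['hits']:
    print(hit['text'])
```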
The bottom line
LSI is a foundational technique for search and retrieval applications. It is simple to implement and computationally less expensive than advanced deep learning–based techniques. Nevertheless, it has trouble with scalability, real-time relevance, and multilingual understanding.
But with the advent of vector search and advanced word embeddings, tools like Meilisearch are redefining what’s possible in document understanding and retrieval. Meilisearch offers state-of-the-art features for document search:
- Blazing-fast performance: Delivers search results in under 50 milliseconds for a smooth user experience.
- Search-as-you-type: Provides real-time results with instant feedback as users type.
- Typo tolerance: Ensures relevant results even with typos or misspellings in queries.
- Comprehensive language support: Optimized for multiple languages and scripts, including Latin-based languages, Chinese, Japanese, and Hebrew.
- Faceted search and filtering: Enables intuitive navigation through categories and filters.
- Custom ranking and relevancy: Allows tailored ranking and relevancy rules for precise search results.
- AI-ready integration: Works seamlessly with AI models for hybrid semantic and full-text search capabilities.