Topic modeling is a technique in natural language processing (NLP) that aims to discover latent topics within a collection of documents. These topics are hidden patterns of co-occurring words that represent underlying themes in the text. One popular algorithm for topic modeling is Latent Dirichlet Allocation (LDA).
Imagine you have a collection of news articles. Using LDA, you can extract topics that a reader might label “politics,” “sports,” and “technology.” LDA itself does not name the topics: each topic is a probability distribution over words, and the words with the highest probabilities in a topic are the most representative of it.
For example, in the topic of “technology,” words like “computer,” “software,” and “internet” might have higher probabilities, while in the topic of “sports,” words like “football,” “basketball,” and “team” might be more prominent.
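To make the idea of a topic as a word distribution concrete, here is a minimal sketch in plain Python. The probabilities are made-up illustrative values, not the output of a trained model, and the names `topics` and `top_words` are hypothetical.

```python
# Each topic maps words to probabilities (illustrative values, not model output).
topics = {
    "technology": {"computer": 0.40, "software": 0.35, "internet": 0.20, "team": 0.05},
    "sports": {"football": 0.40, "basketball": 0.30, "team": 0.25, "computer": 0.05},
}


def top_words(topic, n=3):
    """Return the n most probable words of a topic, highest first."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


for name, distribution in topics.items():
    print(name, top_words(distribution))
```

Note that the same word (here “team”) can appear in more than one topic with different probabilities.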
Through topic modeling, you can gain insights into the main themes present in a large corpus of text, which can be useful for tasks such as document categorization, summarization, and information retrieval.
Let’s create a practical example using sample documents about different topics. We’ll then apply topic modeling to discover the underlying topics in these documents. For simplicity, we’ll use a small set of documents, but in real applications, you would use a larger corpus.
Sample Documents:
- Document 1 (Sports): “The football team won the championship last night. It was an exciting match with a lot of goals.”
- Document 2 (Technology): “The new smartphone features a high-resolution screen and a powerful processor. It is expected to be a hit in the market.”
- Document 3 (Politics): “The government announced new policies aimed at improving healthcare and education. There has been mixed reactions from the public.”
- Document 4 (Travel): “Traveling to exotic destinations is a great way to relax and unwind. Experience new cultures and cuisines.”
Now, let’s apply topic modeling using LDA to discover the underlying topics in these documents. We’ll use Python with the gensim library for this task.
from gensim import corpora, models
import pprint
# Sample documents
documents = [
    "The football team won the championship last night. It was an exciting match with a lot of goals.",
    "The new smartphone features a high-resolution screen and a powerful processor. It is expected to be a hit in the market.",
    "The government announced new policies aimed at improving healthcare and education. There has been mixed reactions from the public.",
    "Traveling to exotic destinations is a great way to relax and unwind. Experience new cultures and cuisines."
]
# Tokenize documents
tokenized_documents = [doc.lower().split() for doc in documents]
# Create a dictionary from the tokenized documents
dictionary = corpora.Dictionary(tokenized_documents)
# Create a document-term matrix: each document becomes a bag-of-words list of (word_id, count) pairs
corpus = [dictionary.doc2bow(doc) for doc in tokenized_documents]
# Apply LDA model
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)
# Print topics and their top words
pprint.pprint(lda_model.print_topics())
In this example, we use LDA to discover 2 topics in the sample documents. Run the code above in Google Colab (install gensim first with pip install gensim if it is not already available). The output shows the top words associated with each topic, which helps us interpret the themes the model discovered. Because the corpus is tiny and training is randomized, the topics may differ between runs and may not separate cleanly.
Let’s break down the code with explanations for each part:
- Import necessary libraries:
from gensim import corpora, models
import pprint
- gensim: A Python library for topic modeling, document indexing, and similarity retrieval.
- corpora: Module within gensim for building document corpora (collections of text documents).
- models: Module within gensim for various models, including LDA.
- pprint: Module for pretty-printing Python data structures.
- Sample Documents:
documents = [
    "The football team won the championship last night. It was an exciting match with a lot of goals.",
    "The new smartphone features a high-resolution screen and a powerful processor. It is expected to be a hit in the market.",
    "The government announced new policies aimed at improving healthcare and education. There has been mixed reactions from the public.",
    "Traveling to exotic destinations is a great way to relax and unwind. Experience new cultures and cuisines."
]
- These are sample text documents representing different topics (sports, technology, politics, and travel).
- Tokenize Documents:
tokenized_documents = [doc.lower().split() for doc in documents]
- Tokenization is the process of breaking down text into smaller units, such as words or sentences.
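- Here, we convert each document to lowercase and split it on whitespace into a list of words. Punctuation stays attached to the words (for example, “night.” and “goals.”), and very common words such as “the” are kept, so real pipelines usually clean the tokens further.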
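As a rough illustration of what a slightly more careful tokenizer could look like, here is a sketch using only the standard library. The `tokenize` function and the tiny `STOPWORDS` set are hypothetical and hand-picked for this example; they are not part of the original code, and a real project would likely use a fuller stopword list from a library.

```python
import re

# A tiny hand-picked stopword list, for illustration only (an assumption).
STOPWORDS = {"the", "a", "an", "it", "was", "with", "of", "is", "to", "in", "and", "be"}


def tokenize(text):
    """Lowercase the text, keep runs of letters only, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOPWORDS]


doc = "The football team won the championship last night. It was an exciting match with a lot of goals."

# Plain whitespace splitting keeps the trailing period on the last word.
print(doc.lower().split()[-1])

# The regex-based tokenizer strips punctuation and removes the stopwords.
print(tokenize(doc))
```

With cleaner tokens, the top words of each topic are less likely to be dominated by words like “the” and “a”.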
- Create Dictionary:
dictionary = corpora.Dictionary(tokenized_documents)
- corpora.Dictionary creates a mapping between words and their integer ids.
- This step is necessary for converting text documents into a numerical format that can be used by the LDA model.
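Conceptually, the word-to-id mapping can be sketched in a few lines of plain Python. The `build_vocabulary` helper below is hypothetical and only illustrates the idea; gensim may assign ids in a different order. In gensim, the mapping is available as `dictionary.token2id`.

```python
def build_vocabulary(tokenized_documents):
    """Assign each distinct token the next unused integer id."""
    token_to_id = {}
    for doc in tokenized_documents:
        for token in doc:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


tokenized = [["football", "team", "won"], ["team", "match"]]
vocab = build_vocabulary(tokenized)
print(vocab)
```

The word “team” appears in both documents but receives a single id, because the mapping is shared across the whole corpus.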
- Create Document-Term Matrix:
corpus = [dictionary.doc2bow(doc) for doc in tokenized_documents]
- Convert each document into a bag-of-words representation, where each word is represented by its id and its frequency in the document.
- This step creates the input format required for training the LDA model.
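The bag-of-words conversion can be sketched the same way. The `to_bow` helper and the small `token_to_id` mapping below are hypothetical, written only to show the shape of the result: a list of (word id, count) pairs. Like `doc2bow` with its default settings, this sketch skips words that are not in the mapping.

```python
from collections import Counter


def to_bow(tokens, token_to_id):
    """Return sorted (token_id, count) pairs for the tokens found in the vocabulary."""
    counts = Counter(token_to_id[token] for token in tokens if token in token_to_id)
    return sorted(counts.items())


# A small example mapping (assumed for illustration).
token_to_id = {"football": 0, "team": 1, "won": 2, "match": 3}

# "team" occurs twice; "goal" is not in the vocabulary and is skipped.
bow = to_bow(["team", "won", "team", "goal"], token_to_id)
print(bow)
```

Word order is lost in this representation; only the ids and their counts remain.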
- Apply LDA Model:
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)
- models.LdaModel is used to train the LDA model on the corpus.
- num_topics=2 specifies the number of topics to discover (in this case, 2).
- id2word=dictionary specifies the mapping between word ids and words.
- passes=15 specifies the number of passes through the corpus during training.
- Print Topics:
pprint.pprint(lda_model.print_topics())
- Print the discovered topics along with their top words.
- This step helps in interpreting the themes present in the documents.
In summary, the code processes a collection of text documents, tokenizes them, converts them into a numerical format, trains an LDA model to discover topics, and prints the discovered topics with their top words.
For more information, check out https://www.youtube.com/watch?v=9mNV4AwA9QI