Python Topic Modeling | Toptal®

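Federico Albanese

Verified Expert in Engineering

Federico is a developer and data scientist who has worked at Facebook, where he made machine learning model predictions. He is a Python expert and university lecturer. His PhD research pertains to graph machine learning.

Expertise

Python, Data Science, NLP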


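Computers and the processors that power them are built to work with numbers. In contrast, the everyday language of emails and social media posts has a loose structure that doesn't lend itself to computation.

That's where natural language processing (NLP) comes in. NLP is a branch of computer science that overlaps with linguistics by applying computational techniques (namely artificial intelligence) to analyze natural language and speech. Topic modeling focuses on understanding which topics a given text is about. It lets developers implement helpful features such as detecting breaking news on social media, recommending personalized messages, detecting fake users, and characterizing information flow.

How can developers get computation-focused computers to understand human communications at those levels of sophistication?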

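A Bag of Words

To answer that question, we need to be able to describe a text mathematically. We'll start our topic-modeling Python tutorial with the simplest method: bag of words.

This method represents a text as a set of words. For example, the sentence "This is an example" can be described as a set of words using the frequency with which those words appear: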

{"an": 1, "example": 1, "is": 1, "this": 1}
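A minimal sketch of this representation, assuming a simple lowercase-and-split tokenizer (the helper name `bag_of_words` is illustrative and not part of any library):

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(text.lower().split())


bag = bag_of_words("This is an example")
print(dict(sorted(bag.items())))
# {'an': 1, 'example': 1, 'is': 1, 'this': 1}
```

A `Counter` is convenient here because it behaves like a dictionary and returns 0 for words that never appear.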

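Note how this method ignores word order. Take these examples:

“I like Star Wars but I don’t like Harry Potter.”
“I like Harry Potter but I don’t like Star Wars.”

These sentiments are represented by the same words, but they have opposite meanings. For the purposes of analyzing the topics of the texts, however, these differences do not matter. In both cases, we are talking about tastes for Harry Potter and Star Wars, regardless of what those tastes are. As such, word order is immaterial.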
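As a quick check of this point, the sketch below builds a bag of words for two sentences that express opposite tastes for Harry Potter and Star Wars. The whitespace tokenizer is an assumption made for illustration:

```python
from collections import Counter

first = "I like Star Wars but I don't like Harry Potter"
second = "I like Harry Potter but I don't like Star Wars"

bag_first = Counter(first.lower().split())
bag_second = Counter(second.lower().split())

# Same words with the same counts: the two bags are indistinguishable.
print(bag_first == bag_second)  # True
```

The sentiment is lost, but both bags still point at the same subject matter, which is all topic modeling needs.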

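When we have multiple texts and seek to understand the differences among them, we need a mathematical representation of the entire corpus that considers each text separately. We can use a matrix, in which each column represents a word or term and each row represents a text. One possible representation of the corpus records in each cell the frequency with which a given word (column) is used in a certain text (row).

In our example, the corpus is composed of two sentences (our matrix rows):

“I like Harry Potter”
“I like Star Wars”

We list the words in this corpus in the order in which we encounter them: I, like, Harry, Potter, Star, Wars. These correspond to our matrix columns.

The values in the matrix represent the number of times a given word is used in each phrase: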

[[1,1,1,1,0,0],
[1,1,0,0,1,1]]

Figure: Text transformed into a matrix representation. The two sentences “I like Harry Potter” and “I like Star Wars” become the bags of words {I: 1, like: 1, Harry: 1, Potter: 1, Star: 0, Wars: 0} and {I: 1, like: 1, Harry: 0, Potter: 0, Star: 1, Wars: 1}, which are then arranged as the matrix rows “1 1 1 1 0 0” and “1 1 0 0 1 1”.
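The construction of this matrix can be sketched in plain Python. This is only an illustration that assumes whitespace tokenization and keeps the vocabulary in order of first appearance. Library vectorizers typically sort the vocabulary instead.

```python
corpus = ["I like Harry Potter", "I like Star Wars"]

# Vocabulary in order of first appearance across the corpus.
vocabulary = []
for text in corpus:
    for word in text.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One row per text, one column per vocabulary word.
matrix = [[text.split().count(word) for word in vocabulary] for text in corpus]

print(vocabulary)  # ['I', 'like', 'Harry', 'Potter', 'Star', 'Wars']
print(matrix)      # [[1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 1, 1]]
```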

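Note that the size of the matrix is determined by multiplying the number of texts by the number of different words that appear in at least one text. The latter is usually unnecessarily large and can be reduced. For example, a matrix might contain two columns for conjugated verbs, such as “play” and “played,” even though their meanings are similar.

But columns that describe new concepts could be missing. For example, “classical” and “music” each have individual meanings, but when combined as “classical music” they have another meaning.

Due to these issues, it is necessary to preprocess text in order to obtain good results.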

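Preprocessing and Topic Clustering Models

For best results, it’s necessary to use multiple preprocessing techniques. Here are some of the most frequently used:

Lowercase letters. Make all words lowercase. The meaning of a word does not change regardless of its position in the sentence.
n-grams. Consider all groups of n words in a row as new terms, called n-grams. This way, cases such as “white house” will be taken into account and added to the vocabulary list.
Stemming. Identify prefixes and suffixes of words to isolate them from their root. This way, words like “play,” “played,” or “player” are represented by the word “play.” Stemming can reduce the number of words in the vocabulary list while preserving their meaning, but it slows preprocessing considerably because it must be applied to each word in the corpus.
Stop words. Do not take into account groups of words lacking in meaning or utility. These include articles and prepositions but may also include words that are not useful for our specific case study, such as certain common verbs.
Term frequency–inverse document frequency (tf–idf). Use the tf–idf coefficient instead of noting the frequency of each word in each cell of the matrix. It consists of two numbers, multiplied:
- tf: the frequency of a given term or word in a text, and
- idf: the logarithm of the total number of documents divided by the number of documents that contain that given term.
tf–idf measures how characteristic a word is of a given text relative to the rest of the corpus. To be able to subdivide words into groups, it is important to understand not only which words appear in each text, but also which words appear frequently in one text but not at all in others.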
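The tf–idf definition above can be worked through by hand. The sketch below is a minimal illustration that uses the raw count as tf and the natural logarithm for idf. The toy corpus and the helper name `tf_idf` are assumptions made for the example.

```python
import math

# Toy corpus, already tokenized (an assumption for illustration).
docs = [
    ["biden", "virus", "plan"],
    ["biden", "virus", "measures"],
    ["nadal", "open"],
]


def tf_idf(term, doc, docs):
    """Raw count of the term in doc times log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


# "biden" appears in two of three texts, so it gets a low weight.
print(round(tf_idf("biden", docs[0], docs), 3))  # 0.405
# "plan" appears in only one text, so it gets a higher weight.
print(round(tf_idf("plan", docs[0], docs), 3))   # 1.099
```

A word that appears in every text gets idf = log(1) = 0, which is how tf–idf discounts terms that do not help tell texts apart.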

The following figure shows some simple examples of these preprocessing techniques, in which the original text of the corpus is modified to generate a relevant and manageable list of words.

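Figure: Examples of text preprocessing techniques. The “lowercase letters” technique transforms the sentence “The White House.” into the word list “the”, “white”, “house”. The “n-grams” technique transforms it into a longer list: “the”, “white”, “house”, “the white”, “white house”. The “stemming” technique transforms the sentence “The football player played a good game.” into this list: “the”, “football”, “play”, “a”, “good”, “game”. The “stop words” technique transforms it into a shorter list: “football”, “play”, “good”, “game”.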
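The techniques in the figure can be imitated in a few lines of plain Python. This is a rough sketch only. The stop-word list is hand-picked, and `toy_stem` is a deliberately crude suffix stripper written for this example rather than a real stemming algorithm such as the ones NLTK provides.

```python
import re

STOP_WORDS = {"the", "a"}            # tiny hand-picked list, for illustration only
SUFFIXES = ("ed", "er", "ing", "s")  # toy rules, far cruder than a real stemmer


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into a single term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def toy_stem(word):
    """Strip one known suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = tokenize("The White House.")
print(tokens + ngrams(tokens, 2))
# ['the', 'white', 'house', 'the white', 'white house']

stems = [toy_stem(w) for w in tokenize("The football player played a good game.")]
print(stems)
# ['the', 'football', 'play', 'play', 'a', 'good', 'game']
print([w for w in stems if w not in STOP_WORDS])
# ['football', 'play', 'play', 'good', 'game']
```

Unlike the figure, this sketch keeps repeated stems ("play" appears twice), because the repetition is what the frequency matrix later counts.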

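Now we’ll demonstrate how to apply some of these techniques in Python. Once we have our corpus represented mathematically, we need to identify the topics being discussed by applying unsupervised machine learning algorithms. In this case, “unsupervised” means that the algorithm doesn’t have any predefined topic labels, like “science fiction,” to apply to its output.

To cluster our corpus, we can choose from several algorithms, including non-negative matrix factorization (NMF), sparse principal components analysis (sparse PCA), and latent Dirichlet allocation (LDA). We’ll focus on LDA because it is widely used by the scientific community due to its good results in social media, medical science, political science, and software engineering.

LDA is a model for unsupervised topic decomposition: It groups texts based on the words they contain and the probability of a word belonging to a certain topic. The LDA algorithm outputs the topic word distribution. With this information, we can define the main topics based on the words that are most likely associated with them. Once we have identified the main topics and their associated words, we can know which topic or topics apply to each text.

Consider the following corpus composed of five short sentences (all taken from New York Times headlines):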

corpus = ["Rafael Nadal Joins Roger Federer in Missing U.S. Open",
          "Rafael Nadal Is Out of the Australian Open",
          "Biden Announces Virus Measures",
          "Biden's Virus Plans Meet Reality",
          "Where Biden's Virus Plan Stands"]

The algorithm should clearly identify one topic related to politics and coronavirus, and a second one related to Nadal and tennis.

Applying the Strategy in Python

In order to detect the topics, we must import the necessary libraries. Python has some useful libraries for NLP and machine learning, including NLTK and Scikit-learn (sklearn).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation as LDA
from nltk.corpus import stopwords

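Using CountVectorizer(), we generate the matrix that denotes the frequency of the words of each text. Note that CountVectorizer allows for preprocessing if you include parameters such as stop_words to filter out stop words, ngram_range to include n-grams, or lowercase=True to convert all characters to lowercase.

# The NLTK stop-word list must be available: nltk.download('stopwords')
count_vect = CountVectorizer(stop_words=stopwords.words('english'), lowercase=True)
x_counts = count_vect.fit_transform(corpus)
x_counts.todense()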

matrix([[0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]], dtype=int64)

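To define the vocabulary of our corpus, we can simply use the method .get_feature_names_out(), which replaced .get_feature_names() in recent scikit-learn releases:

count_vect.get_feature_names_out().tolist()

['announces', 'australian', 'biden', 'federer', 'joins', 'measures', 'meet', 'missing', 'nadal', 'open', 'plan', 'plans', 'rafael', 'reality', 'roger', 'stands', 'virus']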

Then, we perform the tf–idf calculations with the sklearn function:

tfidf_transformer = TfidfTransformer()
x_tfidf = tfidf_transformer.fit_transform(x_counts)
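With its default settings, TfidfTransformer does not compute the plain log(N / df) described earlier. It uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and then scales each row to unit length. The sketch below reproduces that calculation with NumPy on a made-up count matrix and compares it with the library's output. It assumes the default parameters (smooth_idf=True, norm="l2").

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 texts (rows) x 3 terms (columns).
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf used by TfidfTransformer's default settings.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = counts * idf
# Default norm="l2": scale every row to unit Euclidean length.
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, from_sklearn))  # True
```

Because of the smoothing, a term that appears in every text keeps a weight of 1 instead of dropping to 0.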

In order to perform the LDA decomposition, we have to define the number of topics. In this simple case, we know there are two topics or “dimensions.” But in general cases, this is a hyperparameter that needs some tuning, which could be done using algorithms like random search or grid search:

dimension = 2
lda = LDA(n_components=dimension)
lda_array = lda.fit_transform(x_tfidf)
lda_array

array([[0.8516198 , 0.1483802 ],
       [0.82359501, 0.17640499],
       [0.18072751, 0.81927249],
       [0.1695452 , 0.8304548 ],
       [0.18072805, 0.81927195]])

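LDA is a probabilistic method. Here we can see the probability of each of the five headlines belonging to each of the two topics. The first two texts have a higher probability of belonging to the first topic and the next three to the second topic, as expected.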
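To turn these probabilities into one topic label per headline, one option is to take the most probable topic in each row. The sketch below copies the array printed above into NumPy rather than refitting the model:

```python
import numpy as np

# Document-topic probabilities copied from the output above.
lda_array = np.array([[0.8516198, 0.1483802],
                      [0.82359501, 0.17640499],
                      [0.18072751, 0.81927249],
                      [0.1695452, 0.8304548],
                      [0.18072805, 0.81927195]])

# Each row is a probability distribution over the two topics.
print(np.allclose(lda_array.sum(axis=1), 1.0))  # True

# Index of the most probable topic for each headline.
print(lda_array.argmax(axis=1).tolist())  # [0, 0, 1, 1, 1]
```

LDA starts from a random initialization, so the topic indices may swap from one run to the next. Passing a fixed random_state to LatentDirichletAllocation makes a run repeatable.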

Finally, if we want to understand what these two topics are about, we can see the most important words in each topic:

components = [lda.components_[i] for i in range(len(lda.components_))]
features = count_vect.get_feature_names_out().tolist()
important_words = [sorted(features, key = lambda x: components[j][features.index(x)], reverse = True)[:3] for j in range(len(components))]
important_words
[['open', 'nadal', 'rafael'],
['virus', 'biden', 'measures']]

As expected, LDA correctly assigned words related to tennis tournaments and Nadal to the first topic and words related to Biden and virus to the second topic.
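The number of topics was fixed at two here because the corpus is tiny and its themes are known in advance. When it has to be tuned, a grid search over n_components is one option. The sketch below is only an illustration: the candidate values, the two-fold split, and the use of raw counts without stop-word removal are all arbitrary choices, and five headlines are far too few for the result to be meaningful.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

corpus = ["Rafael Nadal Joins Roger Federer in Missing U.S. Open",
          "Rafael Nadal Is Out of the Australian Open",
          "Biden Announces Virus Measures",
          "Biden's Virus Plans Meet Reality",
          "Where Biden's Virus Plan Stands"]

x_counts = CountVectorizer(lowercase=True).fit_transform(corpus)

# Candidate topic counts to try; the grid is an arbitrary illustration.
param_grid = {"n_components": [2, 3, 4]}

# With no explicit scorer, GridSearchCV uses LDA's own score method,
# an approximate log-likelihood, evaluated on the held-out fold.
search = GridSearchCV(LatentDirichletAllocation(random_state=0),
                      param_grid, cv=2)
search.fit(x_counts)

print(search.best_params_)
```

On a realistic corpus, the same pattern applies with a wider grid. Held-out likelihood does not always agree with human judgment, so it is worth reading the top words of each topic as well.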

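Large-scale Analyses and Real-world Use Cases

A large-scale analysis of topic modeling can be seen in this paper; I studied the main news topics during the 2016 US presidential election and observed the topics that some mass media, like the New York Times and Fox News, included in their coverage, such as corruption and immigration. In this paper, I also analyzed the correlations and causations between mass media content and the election results.

Topic modeling is also widely used outside of academia to discover hidden topical patterns present in big collections of texts. For example, it can be used in recommendation systems or to determine what customers/users are talking about in surveys, in feedback forms, or on social media.

The Toptal Engineering Blog extends its gratitude to Juan Manuel Ortiz de Zarate for reviewing the code samples presented in this article.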

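Recommended Reading on Topic Modeling

Improved Topic Modeling in Twitter
Albanese, Federico and Esteban Feuerstein. “Improved Topic Modeling in Twitter Through Community Pooling.” (December 20, 2021): arXiv:2201.00690 [cs.IR]

Analyzing Twitter for Public Health
Paul, Michael and Mark Dredze. “You Are What You Tweet: Analyzing Twitter for Public Health.” August 3, 2021.

Classifying Political Orientation on Twitter
Cohen, Raviv and Derek Ruths. “Classifying Political Orientation on Twitter: It’s Not Easy!” August 3, 2021.

Using Relational Topic Models to Capture Coupling
Gethers, Malcom and Denys Poshyvanyk. “Using Relational Topic Models to Capture Coupling Among Classes in Object-oriented Software Systems.” October 25, 2010.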

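Understanding the Basics

What is topic modeling in Python?
Topic modeling uses statistical and machine learning models to automatically detect topics in text documents.
What is topic modeling used for?
Topic modeling is used for different tasks, such as detecting trends and news on social media, detecting fake users, personalizing message recommendations, and characterizing information flow.
Is topic modeling supervised or unsupervised?
There are multiple supervised and unsupervised topic modeling techniques. Some use a labeled document data set to classify articles. Others analyze the frequency with which words appear to infer the latent topics in a corpus.
Is topic modeling the same as text classification?
No, they are not the same. Text classification is a supervised learning task that categorizes texts into predefined groups. In contrast, topic modeling does not necessarily need a labeled data set.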

Federico Albanese

Buenos Aires, Argentina

Member since January 9, 2019

Authors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.
