How to select one specific text from a DataFrame and compute its TF-IDF

muran · posted 2019-10-18 23:24:08

Last edited by muran on 2019-10-18 23:39

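I have a CSV file with three columns: date, news headline, and return. From the first 200 rows I want to pick one specific day and compute the TF-IDF of a particular word in that day's news headlines.
So far I have used loc to find the date and the word, but I don't know how to compute the TF-IDF from there.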
import numpy as np
import pandas as pd

df = pd.read_csv("数据.csv")
documents=df.head(200)
document_words = [doc.split() for doc in documents]

vocab = sorted(set(sum(document_words, [])))

vocab_dict = {k: i + 1 for i, k in enumerate(vocab)}
print('All words:\n', vocab, "\n")
print('Position of all the words:\n', vocab_dict, "\n")

X_tf = np.zeros((len(documents), len(vocab)), dtype=int)

for i, doc in enumerate(document_words):
    for word in doc:
        X_tf[i, vocab_dict[word] - 1] += 1
X_tf=df.loc[df["date"]=="2008-01-07"],[df["headlines"]=="apple"]  # not sure whether this actually finds the row
print(X_tf)  # this raises an error: "tuple indices must be integers or slices, not tuple", and I don't know how to fix it

idf = np.log(X_tf.shape[0]/X_tf.astype(bool).sum(axis=0))
print(idf)

X_tfidf = X_tf * idf
print(X_tfidf)

The final result should be a single number, not a matrix.
Any help would be appreciated, thanks!

sheeboard · posted 2019-10-22 18:24:51

Is this about counting word frequencies? Please upload a test file.
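
Without a test file, here is a minimal sketch of one way to get a single number, run on a toy DataFrame. Several things in it are assumptions rather than facts from the post: the columns are named "date" and "headlines" (as in the loc call above), words are separated by whitespace, all headlines sharing a date are treated as one document, tf is the raw count, and idf = log(N / df), matching the formula in the posted code. The helper name tfidf_for_word is hypothetical.

```python
import numpy as np
import pandas as pd


def tfidf_for_word(df, date, word, n_rows=200):
    """Return the TF-IDF of `word` in the headlines of `date` as one float.

    Assumptions: columns "date" and "headlines", whitespace tokenisation,
    one document per date, tf = raw count, idf = log(N / document frequency).
    """
    subset = df.head(n_rows)
    # Build one token list per date by joining that date's headlines.
    docs = {
        d: " ".join(group.dropna().astype(str)).lower().split()
        for d, group in subset.groupby("date")["headlines"]
    }
    if date not in docs:
        raise KeyError(f"no headlines for {date!r} in the first {n_rows} rows")
    word = word.lower()
    tf = docs[date].count(word)  # raw term frequency on the chosen date
    doc_freq = sum(word in tokens for tokens in docs.values())
    if doc_freq == 0:
        return 0.0  # the word never occurs, so there is nothing to weight
    return tf * float(np.log(len(docs) / doc_freq))


# Toy data standing in for the CSV (made up for illustration only).
toy = pd.DataFrame({
    "date": ["2008-01-07", "2008-01-07", "2008-01-08", "2008-01-09"],
    "headlines": [
        "Apple unveils new laptop",
        "Apple shares rise",
        "Oil prices fall",
        "Markets close higher",
    ],
    "return": [0.01, 0.01, -0.02, 0.00],
})

score = tfidf_for_word(toy, "2008-01-07", "apple")
print(score)  # 2 * ln(3), about 2.197: tf = 2, N = 3 dates, df = 1
```

As for the error: in df.loc[...],[...] the comma sits outside the square brackets, so Python builds a tuple of (DataFrame, list) instead of filtering on both conditions. That overwrites the X_tf array with a tuple, and a tuple cannot be indexed as X_tf[i, j]. The comparison df["headlines"]=="apple" also only matches headlines that are exactly "apple", not headlines that contain the word, which is why the sketch counts tokens instead.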
