
25 Python Text Processing Examples, Worth Bookmarking!

  • AI科技大本营
  • February 8, 2022, 09:00

Author | 周萝卜

Source | 萝卜大杂烩

Processing text is one of the most common tasks in Python. This article collects 25 examples covering text extraction and NLP:

  • Extract PDF content

  • Extract Word content

  • Extract web page content

  • Read JSON data

  • Read CSV data

  • Remove punctuation from a string

  • Remove stop words with NLTK

  • Correct spelling with TextBlob

  • Word tokenization with NLTK and TextBlob

  • Stem the words of a sentence or phrase with NLTK

  • Lemmatize a sentence or phrase with NLTK

  • Find the frequency of each word in a text file with NLTK

  • Create a word cloud from a corpus

  • NLTK lexical dispersion plot

  • Convert text to numbers with CountVectorizer

  • Create a document-term matrix with TF-IDF

  • Generate N-grams for a given sentence

  • sklearn CountVectorizer with a bigram vocabulary

  • Extract noun phrases with TextBlob

  • How to compute a word-word co-occurrence matrix

  • Sentiment analysis with TextBlob

  • Language translation with Goslate

  • Language detection and translation with TextBlob

  • Get definitions and synonyms with TextBlob

  • Get a list of antonyms with TextBlob

1. Extract PDF Content

# pip install PyPDF2
import PyPDF2

# Creating a pdf file object.
pdf = open("test.pdf", "rb")

# Creating a pdf reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Checking the total number of pages in the pdf file.
print("Total number of Pages:", pdf_reader.numPages)

# Creating a page object.
page = pdf_reader.getPage(200)

# Extracting text from a specific page number.
print(page.extractText())

# Closing the file object.
pdf.close()
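
The PdfFileReader/getPage()/extractText() calls above belong to PyPDF2 1.x/2.x; they were removed when the project was renamed pypdf. A minimal sketch of the same steps on the newer API, assuming pypdf is installed:

# pip install pypdf
from pypdf import PdfReader

reader = PdfReader("test.pdf")
print("Total number of Pages:", len(reader.pages))

# Pages are a 0-indexed list; extract text from one of them.
print(reader.pages[200].extract_text())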

2. Extract Word Content

# pip install python-docx

import docx


def main():
    try:
        doc = docx.Document('test.docx')  # Creating a word reader object.
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)

        print(data)

    except IOError:
        print('There was an error opening the file!')
        return


if __name__ == '__main__':
    main()

3. Extract Web Page Content

# pip install bs4

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()

# Parsing
soup = BeautifulSoup(webpage, 'html.parser')

# Formatting the parsed html file
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract meta tag value
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag value
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag value
for x in soup.find_all('p'):
    print(x.text)
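
urllib works fine here, but the third-party requests library is the more common idiom for fetching pages today. A minimal equivalent sketch, assuming requests is installed:

# pip install requests
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
                    headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()  # Fail early on HTTP errors.

soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.title.string)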

4. Read JSON Data

import requests
import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract specific node content.
print(res['quiz']['sport'])

# Dump data as string
data = json.dumps(res)
print(data)

5. Read CSV Data

import csv

with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip the first (header) row
    for row in reader:
        print(row)
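
When the CSV holds tabular data rather than a stream of rows, pandas is the usual shortcut; a minimal sketch, assuming the first row of test.csv is a header:

# pip install pandas
import pandas as pd

df = pd.read_csv('test.csv')  # The header row is consumed automatically.
print(df.head())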

6. Remove Punctuation from a String

import re
import string

data = ("Stuning even for the non-gamer: This sound track was beautiful! "
        "It paints the senery in your mind so well I would recomend "
        "it even to people who hate vid. game music! I have played the game Chrono "
        "Cross but out of all of the games I have ever played it has the best music! "
        "It backs away from crude keyboarding and takes a fresher step with grate "
        "guitars and soulful orchestras. "
        "It would impress anyone who cares to listen!")


# Method 1: Regex
# Remove the special characters from the string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)


# Method 2: translate()
# Make a translator object that maps every punctuation character to None.
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)

7. Remove Stop Words with NLTK

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # One-time download of the stop word lists.

data = ['Stuning even for the non-gamer: This sound track was beautiful! '
        'It paints the senery in your mind so well I would recomend '
        'it even to people who hate vid. game music! I have played the game Chrono '
        'Cross but out of all of the games I have ever played it has the best music! '
        'It backs away from crude keyboarding and takes a fresher step with grate '
        'guitars and soulful orchestras. '
        'It would impress anyone who cares to listen!']

# Remove stop words
stopwords = set(stopwords.words('english'))

output = []
for sentence in data:
    temp_list = []
    for word in sentence.split():
        if word.lower() not in stopwords:
            temp_list.append(word)
    output.append(' '.join(temp_list))


print(output)

8. Correct Spelling with TextBlob

from textblob import TextBlob

# The input is deliberately misspelled.
data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."

output = TextBlob(data).correct()
print(output)

9. Word Tokenization with NLTK and TextBlob

import nltk
from textblob import TextBlob

nltk.download('punkt')  # Tokenizer models used by both libraries.

data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."

nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words

print(nltk_output)
print(textblob_output)
Output:
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']

10. Stem the Words of a Sentence or Phrase with NLTK

from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))
Output:
where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
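
Porter is the oldest stemmer NLTK ships; the Snowball ("Porter2") stemmer is a common drop-in alternative that resolves some of Porter's odder outputs. A quick comparison sketch:

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Print both stems side by side for a few words.
for word in ['generously', 'fairly', 'dancing']:
    print(word, '->', porter.stem(word), '/', snowball.stem(word))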

11. Lemmatize a Sentence or Phrase with NLTK

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # Lexical database used by the lemmatizer.

wnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
        'Her car was in full view.',
        'A number of cars carried out of state license plates.']

output = []
for sentence in text:
    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))

for item in output:
    print(item)

print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))

print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))
Output:
She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
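
Note in the sentence output above that verbs like "gripped" and "passed" come through unchanged: lemmatize() assumes a noun unless given a part of speech, as the 'v' and 'a' arguments show. A sketch of feeding it tags from nltk.pos_tag automatically (the tag mapping below is a simplification covering the common cases):

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('averaged_perceptron_tagger')


def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to WordNet POS constants.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN


wnl = WordNetLemmatizer()
tokens = nltk.word_tokenize('She gripped the armrest as he passed two cars at a time.')
print(' '.join(wnl.lemmatize(w, wordnet_pos(t)) for w, t in nltk.pos_tag(tokens)))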

12. Find the Frequency of Each Word in a Text File with NLTK

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist

nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

data_analysis = nltk.FreqDist(filter_words)

data_analysis.plot(25, cumulative=False)
Output:
[nltk_data] Downloading package webtext to
[nltk_data] C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...
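
FreqDist is a subclass of collections.Counter, so the usual Counter helpers work on it directly; for example:

# 10 highest-frequency (word, count) pairs, and the count for one word.
print(data_analysis.most_common(10))
print(data_analysis['data'])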

13. Create a Word Cloud from a Corpus

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data
data_analysis = nltk.FreqDist(wt_words)

filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plotting the word cloud
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

14. NLTK Lexical Dispersion Plot

import nltk
from nltk.corpus import webtext
import matplotlib.pyplot as plt

words = ['data', 'science', 'dataset']

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data

# Collect (offset, word-index) pairs for each occurrence of a target word.
points = [(x, y) for x in range(len(wt_words))
          for y in range(len(words)) if wt_words[x] == words[y]]

if points:
    x, y = zip(*points)
else:
    x = y = ()

plt.plot(x, y, "rx", scalex=.1)
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()
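
NLTK can draw the same plot in one call through its Text wrapper, which saves collecting the points by hand; a minimal sketch:

import nltk
from nltk.corpus import webtext
from nltk.text import Text

nltk.download('webtext')
Text(webtext.words('testing.txt')).dispersion_plot(['data', 'science', 'dataset'])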

15. Convert Text to Numbers with CountVectorizer

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.2

# Change column headers
df2.columns = df1.columns
print(df2)
Output:
Go Java Python
and 2 2 2
application 0 1 0
are 1 0 1
bytecode 0 1 0
can 0 1 0
code 0 1 0
comes 1 0 1
compiled 0 1 0
derived 0 1 0
develops 0 1 0
for 0 2 0
from 0 1 0
functional 1 0 1
imperative 1 0 1
...

16. Create a Document-Term Matrix with TF-IDF

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.2

# Change column headers
df2.columns = df1.columns
print(df2)
Output:
Go Java Python
and 0.323751 0.137553 0.323751
application 0.000000 0.116449 0.000000
are 0.208444 0.000000 0.208444
bytecode 0.000000 0.116449 0.000000
can 0.000000 0.116449 0.000000
code 0.000000 0.116449 0.000000
comes 0.208444 0.000000 0.208444
compiled 0.000000 0.116449 0.000000
derived 0.000000 0.116449 0.000000
develops 0.000000 0.116449 0.000000
for 0.000000 0.232898 0.000000
...

17. Generate N-grams for a Given Sentence

NLTK

import nltk
from nltk.util import ngrams

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

TextBlob

from textblob import TextBlob

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = TextBlob(data).ngrams(num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))
Output:
1-gram: ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram: ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram: ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram: ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']
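
Neither library is strictly required here: the same windows fall out of zip applied to shifted slices of the token list. A dependency-free sketch:

def ngrams(tokens, n):
    # zip over n shifted views of the list yields every length-n window.
    return [' '.join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]

tokens = 'A class is a blueprint for the object'.split()
print(ngrams(tokens, 2))
# ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']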

18. sklearn CountVectorizer with a Bigram Vocabulary

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

# Initialize with a bigram-only vocabulary
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.2

# Change column headers
df2.columns = df1.columns
print(df2)
Output:
Assembly Machine
also either 0 1
and or 0 1
are also 0 1
are readable 1 0
are still 1 0
assembly language 5 0
because each 1 0
but difficult 0 1
by computers 0 1
by people 0 1
can execute 0 1
...

19. Extract Noun Phrases with TextBlob

from textblob import TextBlob

# Extract noun phrases (needs TextBlob's corpora: python -m textblob.download_corpora)
blob = TextBlob("Canada is a country in the northern part of North America.")

for nouns in blob.noun_phrases:
    print(nouns)
Output:
canada
northern part
america

20. How to Compute a Word-Word Co-occurrence Matrix

import numpy as np
import nltk
from nltk import bigrams
import itertools
import pandas as pd


def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams, taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    # Return the matrix and the index
    return co_occurrence_matrix, vocab_index


text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python', 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

# Create one flat list out of the list of lists
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)


data_matrix = pd.DataFrame(matrix, index=vocab_index,
                           columns=vocab_index)
print(data_matrix)
Output:
best use What Where ... in is Python used
best 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
use 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0
What 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Where 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Pythonused 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
Why 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
companies 0.0 1.0 0.0 1.0 ... 1.0 0.0 0.0 0.0
in 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0
is 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
Python 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
used 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0

[11 rows x 11 columns]
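
As the comments in the function note, this matrix counts ordered adjacent pairs (rows are the current word, columns the previous one), which is why it is not symmetric. If order-free co-occurrence counts are wanted, the matrix can be symmetrized; a small sketch reusing matrix and vocab_index from above:

# Count (a, b) and (b, a) together by adding the transpose.
sym_matrix = matrix + matrix.T
print(pd.DataFrame(sym_matrix, index=vocab_index, columns=vocab_index))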

21. Sentiment Analysis with TextBlob

from textblob import TextBlob


def sentiment(polarity):
    if polarity < 0:
        print("Negative")
    elif polarity > 0:
        print("Positive")
    else:
        print("Neutral")


blob = TextBlob("The movie was excellent!")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was not bad.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was ridiculous.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)
Output:
Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative
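
TextBlob's polarity score comes from a pattern-based lexicon; NLTK's bundled VADER analyzer is a popular alternative tuned for short, informal text. A minimal sketch, assuming the vader_lexicon resource downloads cleanly:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

for text in ["The movie was excellent!", "The movie was not bad.", "The movie was ridiculous."]:
    # 'compound' is a normalized score in [-1, 1].
    print(text, sia.polarity_scores(text)['compound'])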

22. Language Translation with Goslate

import goslate

text = "Comment vas-tu?"

gs = goslate.Goslate()

translatedText = gs.translate(text, 'en')
print(translatedText)

translatedText = gs.translate(text, 'zh')
print(translatedText)

translatedText = gs.translate(text, 'de')
print(translatedText)
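
Goslate works by querying Google Translate's free web endpoint, so it tends to break whenever Google changes that interface; calls are best wrapped defensively rather than assumed to succeed. A sketch:

import goslate

gs = goslate.Goslate()

try:
    print(gs.translate("Comment vas-tu?", 'en'))
except Exception as exc:  # The unofficial endpoint gives no stable error contract.
    print("Translation failed:", exc)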

23. Language Detection and Translation with TextBlob

from textblob import TextBlob

# Note: detect_language() and translate() call the Google Translate API
# and are deprecated in newer TextBlob releases.
blob = TextBlob("Comment vas-tu?")

print(blob.detect_language())

print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))
Output:
fr
Como estas tu?
How are you?
你好吗?

24. Get Definitions and Synonyms with TextBlob

from textblob import Word

text_word = Word('safe')

print(text_word.definitions)

synonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())

print(synonyms)
Output:
['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}

25. Get a List of Antonyms with TextBlob

from textblob import Word

text_word = Word('safe')

antonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        if lemma.antonyms():
            antonyms.add(lemma.antonyms()[0].name())

print(antonyms)
Output:
{'dangerous', 'out'}
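
Word.synsets is a thin wrapper over NLTK's WordNet corpus, so the same synonym and antonym lookups work without TextBlob; a minimal sketch using nltk.corpus.wordnet directly:

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

synonyms, antonyms = set(), set()
for synset in wordnet.synsets('safe'):
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
        for ant in lemma.antonyms():
            antonyms.add(ant.name())

print(synonyms)
print(antonyms)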
