Author | 周萝卜
Source | 萝卜大杂烩
Extract PDF content
Extract Word content
Extract web page content
Read JSON data
Read CSV data
Remove punctuation from a string
Remove stop words with NLTK
Correct spelling with TextBlob
Word tokenization with NLTK and TextBlob
Stemming the words of a sentence or phrase with NLTK
Lemmatizing sentences or phrases with NLTK
Finding the frequency of each word in a text file with NLTK
Creating a word cloud from a corpus
NLTK lexical dispersion plot
Converting text to numbers with CountVectorizer
Creating a document-term matrix with TF-IDF
Generating N-grams for a given sentence
Specifying a bigram vocabulary with sklearn's CountVectorizer
Extracting noun phrases with TextBlob
How to compute a word-word co-occurrence matrix
Sentiment analysis with TextBlob
Language translation with Goslate
Language detection and translation with TextBlob
Getting definitions and synonyms with TextBlob
Getting a list of antonyms with TextBlob
# pip install PyPDF2
import PyPDF2
from PyPDF2 import PdfFileReader

# Creating a pdf file object.
pdf = open("test.pdf", "rb")

# Creating a pdf reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Checking the total number of pages in the pdf file.
print("Total number of Pages:", pdf_reader.numPages)

# Creating a page object.
page = pdf_reader.getPage(200)

# Extract data from a specific page number.
print(page.extractText())

# Closing the object.
pdf.close()
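The PdfFileReader / getPage / extractText calls above come from the older PyPDF2 1.x API. On the newer pypdf package (the successor to PyPDF2) the same steps use renamed methods; a minimal sketch, assuming the same test.pdf file:

from pypdf import PdfReader

reader = PdfReader("test.pdf")
# len(reader.pages) replaces numPages
print("Total number of Pages:", len(reader.pages))
# Pages are indexed like a list; extract_text() replaces extractText()
print(reader.pages[0].extract_text())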
# pip install python-docx
import docx

def main():
    try:
        doc = docx.Document('test.docx')  # Creating a word reader object.
        data = ""
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()
# pip install bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parsing
soup = BeautifulSoup(webpage, 'html.parser')

# Formatting the parsed html file
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract meta tag value
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag value
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag value
for x in soup.find_all('p'):
    print(x.text)
import requests
import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract specific node content.
print(res['quiz']['sport'])

# Dump data as string
data = json.dumps(res)
print(data)
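The json module handles local files the same way; a small sketch, assuming a hypothetical local copy named example.json with the same structure:

import json

with open('example.json', 'r', encoding='utf-8') as f:  # hypothetical local file
    res = json.load(f)
print(res['quiz']['sport'])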
import csv

with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip first row
    for row in reader:
        print(row)
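If pandas is already part of your toolchain, the same file can be read in one call; a minimal sketch, assuming test.csv has a header row:

import pandas as pd

df = pd.read_csv('test.csv')  # the header row becomes the column names
print(df.head())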
import re
import string

data = "Stuning even for the non-gamer: This sound track was beautiful! \
It paints the senery in your mind so well I would recomend \
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate \
guitars and soulful orchestras. \
It would impress anyone who cares to listen!"

# Method 1: Regex
# Remove the special characters from the read string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: translate()
# Make a translator object
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)
from nltk.corpus import stopwords

# nltk.download('stopwords')  # run once if the stopword list is not installed
data = ['Stuning even for the non-gamer: This sound track was beautiful! \
It paints the senery in your mind so well I would recomend \
it even to people who hate vid. game music! I have played the game Chrono \
Cross but out of all of the games I have ever played it has the best music! \
It backs away from crude keyboarding and takes a fresher step with grate \
guitars and soulful orchestras. \
It would impress anyone who cares to listen!']

# Remove stop words
stopwords = set(stopwords.words('english'))
output = []
for sentence in data:
    temp_list = []
    for word in sentence.split():
        if word.lower() not in stopwords:
            temp_list.append(word)
    output.append(' '.join(temp_list))

print(output)
from textblob import TextBlob

# The misspellings below are intentional; correct() should fix them.
data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."
output = TextBlob(data).correct()
print(output)
import nltk
from textblob import TextBlob

data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."

nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words
print(nltk_output)
print(textblob_output)
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']
from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))
where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
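NLTK also ships the Snowball ("Porter2") stemmer, which often produces slightly different stems than the original Porter algorithm; a minimal sketch for comparison:

from nltk.stem import SnowballStemmer

sb = SnowballStemmer('english')
print(sb.stem('dancing'), sb.stem('dancer'), sb.stem('generously'))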
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
        'Her car was in full view.',
        'A number of cars carried out of state license plates.']

output = []
for sentence in text:
    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))

for item in output:
    print(item)

print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))
print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))
She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist

nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words that are longer than 3 characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

data_analysis = nltk.FreqDist(filter_words)
data_analysis.plot(25, cumulative=False)
[nltk_data] Downloading package webtext to
[nltk_data] C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...
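FreqDist also has a most_common() method, which avoids the manual sorting above when only the top terms are needed; a short sketch, reusing the data_analysis distribution built in the previous snippet:

# Print the 10 most frequent of the filtered words
print(data_analysis.most_common(10))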
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data
data_analysis = nltk.FreqDist(wt_words)

filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plotting the word cloud
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
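The WordCloud object can also be written straight to an image file instead of only being displayed; a one-line sketch reusing the wcloud object above, with a hypothetical output filename:

wcloud.to_file("wordcloud.png")  # hypothetical output path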
import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

words = ['data', 'science', 'dataset']

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data

points = [(x, y) for x in range(len(wt_words))
          for y in range(len(words)) if wt_words[x] == words[y]]

if points:
    x, y = zip(*points)
else:
    x = y = ()

plt.plot(x, y, "rx", scalex=.1)
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()
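NLTK can draw this chart directly: wrapping the tokens in nltk.Text exposes a dispersion_plot() method that produces essentially the same figure; a minimal sketch with the same word list:

from nltk.text import Text

Text(wt_words).dispersion_plot(['data', 'science', 'dataset'])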
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change column headers
df2.columns = df1.columns
print(df2)
Go Java Python
and 2 2 2
application 0 1 0
are 1 0 1
bytecode 0 1 0
can 0 1 0
code 0 1 0
comes 1 0 1
compiled 0 1 0
derived 0 1 0
develops 0 1 0
for 0 2 0
from 0 1 0
functional 1 0 1
imperative 1 0 1
...
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data3]})

# Initialize
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change column headers
df2.columns = df1.columns
print(df2)
Go Java Python
and 0.323751 0.137553 0.323751
application 0.000000 0.116449 0.000000
are 0.208444 0.000000 0.208444
bytecode 0.000000 0.116449 0.000000
can 0.000000 0.116449 0.000000
code 0.000000 0.116449 0.000000
comes 0.208444 0.000000 0.208444
compiled 0.000000 0.116449 0.000000
derived 0.000000 0.116449 0.000000
develops 0.000000 0.116449 0.000000
for 0.000000 0.232898 0.000000
...
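Both vectorizer snippets above call get_feature_names(), which was removed in recent scikit-learn releases; on a current install the DataFrame index line needs the renamed method. A minimal sketch of the replacement, assuming the same vectorizer and doc_vec objects:

# On newer scikit-learn, use get_feature_names_out() instead of get_feature_names()
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names_out())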
import nltk
from nltk.util import ngrams

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'
print("1-gram:", extract_ngrams(data, 1))
print("2-gram:", extract_ngrams(data, 2))
print("3-gram:", extract_ngrams(data, 3))
print("4-gram:", extract_ngrams(data, 4))
from textblob import TextBlob

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = TextBlob(data).ngrams(num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'
print("1-gram:", extract_ngrams(data, 1))
print("2-gram:", extract_ngrams(data, 2))
print("3-gram:", extract_ngrams(data, 3))
print("4-gram:", extract_ngrams(data, 4))
1-gram: ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram: ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram: ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram: ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

# Initialize with a bigram-only vocabulary
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change column headers
df2.columns = df1.columns
print(df2)
Assembly Machine
also either 0 1
and or 0 1
are also 0 1
are readable 1 0
are still 1 0
assembly language 5 0
because each 1 0
but difficult 0 1
by computers 0 1
by people 0 1
can execute 0 1
...
from textblob import TextBlob

# Extract noun phrases
blob = TextBlob("Canada is a country in the northern part of North America.")
for nouns in blob.noun_phrases:
    print(nouns)
canada
northern part
america
import numpy as np
import nltk
from nltk import bigrams
import itertools
import pandas as pd


def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    # Return the matrix and the index
    return co_occurrence_matrix, vocab_index


text_data = [['Where', 'Python', 'is', 'used'],
             ['What', 'is', 'Python', 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

# Create one list using many lists
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)

data_matrix = pd.DataFrame(matrix, index=vocab_index,
                           columns=vocab_index)
print(data_matrix)
best use What Where ... in is Python used
best 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
use 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0
What 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Where 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Pythonused 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
Why 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
companies 0.0 1.0 0.0 1.0 ... 1.0 0.0 0.0 0.0
in 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0
is 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
Python 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
used 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
[11 rows x 11 columns]
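As built above, the matrix is directional (rows index the current word, columns the previous word). If an order-insensitive co-occurrence count is preferred, the result can simply be symmetrized; a one-line sketch on the returned matrix:

symmetric_matrix = matrix + matrix.T  # counts (w1, w2) and (w2, w1) together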
from textblob import TextBlob

def sentiment(polarity):
    if polarity < 0:
        print("Negative")
    elif polarity > 0:
        print("Positive")
    else:
        print("Neutral")

blob = TextBlob("The movie was excellent!")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was not bad.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was ridiculous.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)
Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative
import goslate

text = "Comment vas-tu?"

gs = goslate.Goslate()
translatedText = gs.translate(text, 'en')
print(translatedText)

translatedText = gs.translate(text, 'zh')
print(translatedText)

translatedText = gs.translate(text, 'de')
print(translatedText)
from textblob import TextBlob

blob = TextBlob("Comment vas-tu?")
print(blob.detect_language())

print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))
fr
Como estas tu?
How are you?
你好吗?
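Note that detect_language() and translate() rely on an unofficial Google Translate endpoint and have been deprecated in recent TextBlob releases, so the snippet above may only run on older versions; on a current install a dedicated translation library or the official Google Cloud Translation API is the usual substitute.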
from textblob import TextBlob
from textblob import Word

text_word = Word('safe')
print(text_word.definitions)

synonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())

print(synonyms)
['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}
from textblob import TextBlob
from textblob import Word

text_word = Word('safe')

antonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        if lemma.antonyms():
            antonyms.add(lemma.antonyms()[0].name())

print(antonyms)
{'dangerous', 'out'}