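Word analysis and n-grams in a variety of practical applications

Model natural language features at the word level and generate frequency plots.

How do you make AI applications more than just very sophisticated filing systems? You enable them to act on the patterns they identify in input data, and to generate output based on models built from those patterns.

In Part 1, I used natural language as an example to show you how to compute frequency matrices of n-grams, meaning sequences of letters and other text characters. In this tutorial, I take you deeper into the world of n-grams and n-gram statistics. Python is well known for its data science and statistics tools. The Natural Language Toolkit (NLTK) library used in the previous tutorial provides some useful tools for working with matplotlib, a library for graphical visualization of data. To give you a brief look at the possibilities, the following listing generates a plot of the 50 most common n-grams of letters and spaces from a body of text. In the original listing, lines added or modified for graphical output are highlighted.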

plot_ngrams.py

import sys

from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist

# Set up a tokenizer that captures only lowercase letters and spaces
# This requires that input has been preprocessed to lowercase all letters
tokenizer = RegexpTokenizer("[a-z ]")


def count_ngrams(input_fp, frequencies, order, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each n-gram (sequence of characters) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each n-gram to its counted frequency
        order -- length of each n-gram, in characters
        buffer_size -- incremental quantity of text to be read at a time,
            in characters (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    # Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    # Loop over the file while there is text to read
    while text:
        # This step is needed to collapse runs of space characters into one
        text = ' '.join(text.split())
        spans = tokenizer.span_tokenize(text)
        tokens = (text[begin:end] for (begin, end) in spans)
        for ngram in ngrams(tokens, order):
            # Increment the count for the n-gram. Automatically handles any
            # n-gram not seen before. The join expression turns the separate
            # single-character strings into one string
            frequencies[''.join(ngram)] += 1
        # Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()


if __name__ == '__main__':
    # Initialize the frequency distribution, a subclass of collections.Counter
    frequencies = FreqDist()
    # The order of the n-grams is the first command line argument
    ngram_order = int(sys.argv[1])
    # Pull the input data from the console
    count_ngrams(sys.stdin, frequencies, ngram_order)
    # Generate pop-up window with frequency plot of 50 most common n-grams
    frequencies.plot(50)

You need to install matplotlib to run this code.

pip install matplotlib

You can then run the listing, assuming you are on a system with a GUI, such as a Windows, Mac, or Linux desktop.

python plot_ngrams.py 3 < bigbraineddata1.txt

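This generates a pop-up window with a chart like the following.

Browse the NLTK documentation for more examples of integration with data tools, and browse the matplotlib documentation to learn more about this powerful, versatile plotting toolkit. You can also check out the IBM Developer tutorial Introduction to Data-Science Tools in Bluemix Part 4, which is part of a series but also stands on its own.

So far I have limited the use of n-grams to sequences of letters. As you will see in the next tutorial, many interesting results come from such n-grams, but many applications focus only on n-grams of whole words. This does not require much adjustment to the methods I have already described. The following listing is similar to plot_ngrams.py, but it is adapted to process words rather than single characters. In the original listing, lines related to this change are highlighted.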

plot_word_ngrams.py

import sys

from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist

# Set up a tokenizer that only captures words
# Requires that input has been preprocessed to lowercase all letters
tokenizer = RegexpTokenizer("[a-z]+")


def count_ngrams(input_fp, frequencies, order, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each n-gram (sequence of words) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each n-gram to its counted frequency
        order -- length of each n-gram, in words
        buffer_size -- incremental quantity of text to be read at a time,
            in characters (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    # Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    # Loop over the file while there is text to read
    while text:
        spans = tokenizer.span_tokenize(text)
        tokens = (text[begin:end] for (begin, end) in spans)
        for ngram in ngrams(tokens, order):
            # Increment the count for the n-gram (a tuple of words).
            # Automatically handles any n-gram not seen before.
            frequencies[ngram] += 1
        # Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()


if __name__ == '__main__':
    # Initialize the frequency distribution, a subclass of collections.Counter
    frequencies = FreqDist()
    # The order of the n-grams is the first command line argument
    ngram_order = int(sys.argv[1])
    # Pull the input data from the console
    count_ngrams(sys.stdin, frequencies, ngram_order)
    # Generate pop-up window with frequency plot of 50 most common n-grams
    plot_width = 50
    title = 'Plot of the {} most common {}-grams'.format(plot_width, ngram_order)
    frequencies.plot(plot_width, title=title)

Let's start with bigrams.

python plot_word_ngrams.py 2 < bigbraineddata1.txt

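The resulting chart looks like the following.

Unfortunately, the item labels on the chart's horizontal axis are cut off. There are several ways to adjust this, but frankly, NLTK makes them more cumbersome. For this tutorial, we'll use the interactive features of the plot window.

Notice the row of icons at the bottom of the chart above. The second icon from the right is for plot settings. Click it to display a series of sliders, one of which adjusts the bottom spacing. Slide it to the right to reveal more of the horizontal axis labels. You might also need to resize the overall window to avoid squeezing the main image too much. After such adjustments, you see the following.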

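The parentheses, quotation marks, and commas that delimit the n-grams are an artifact of how they are stored for analysis, and you can ignore them. The most common word bigram is "of the", with 27 occurrences. At the other extreme, 11 bigrams occur at most 3 times. Most of the more common bigrams are combinations of unremarkable common words, but "machine learning" is a notable entry in third place. This highlights the problem that I'm working with a small, specialized corpus.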
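The tuple artifact is easy to reproduce without a plot window. The following is a minimal sketch, using only the standard library's collections.Counter in place of NLTK's FreqDist, that counts word bigrams and prints them with space-joined labels. The helper names word_ngrams and top_ngram_table and the sample sentence are hypothetical, introduced here for illustration only.

```python
import re
from collections import Counter


def word_ngrams(text, order):
    """Return a list of word n-grams (as tuples) from text.

    Hypothetical helper: words are runs of a-z after lowercasing,
    mirroring the "[a-z]+" tokenizer pattern used in the listing.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [tuple(words[i:i + order]) for i in range(len(words) - order + 1)]


def top_ngram_table(counts, limit):
    """Return (label, count) pairs with each tuple joined into a plain string."""
    return [(" ".join(ngram), count) for ngram, count in counts.most_common(limit)]


# Made-up sample text, not taken from the tutorial's corpus
sample = "the cat sat on the mat and the cat slept"
counts = Counter(word_ngrams(sample, 2))

# The raw key is a tuple; its string form carries the parentheses,
# quotation marks, and commas seen in the plot labels
print(str(("the", "cat")))

# Joined labels are easier to read in a listing
for label, count in top_ngram_table(counts, 3):
    print(f"{label}: {count}")
```

Joining the tuples before display is one way to get cleaner labels; the counting itself is unchanged.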

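Switching to word n-grams really highlights some of the statistical problems that come with a small, hand-picked corpus, so it's time to expand to something more comprehensive. According to Wikipedia, the American National Corpus (ANC) is "a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus."

The ANC also contains useful annotations (for example, part of speech), but we don't need them for now. There is also a subset, the Open American National Corpus (OANC), which contains 15 million words of contemporary American English that is open and unrestricted for use. I downloaded OANC 1.0.1, a large zip archive containing XML files with annotated content as well as plain text files. To work with it, I first extracted only the text files from the zip.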

unzip /users/uche/downloads/oanc-1.0.1-utf8.zip "*.txt" -x "*/license.txt" "*/readme.txt"

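This produces a directory structure named oanc containing the text files. I used the following Linux command (which also works on Mac) to concatenate them into one large 97 MB text file.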

find oanc -name "*.txt" -exec cat {} \; > oanc.txt
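If the find command is not available, for example on Windows, the same concatenation can be sketched in Python. This is a minimal sketch rather than part of the original tutorial: the function name concatenate_text_files is hypothetical, and it assumes the output file lives outside the directory being scanned so that it does not pick itself up.

```python
from pathlib import Path


def concatenate_text_files(root, output_path):
    """Write the bytes of every .txt file under root into output_path.

    Assumes output_path is outside root. Returns the number of files copied.
    """
    count = 0
    with open(output_path, "wb") as out:
        # Sort the paths so the result is the same on every run
        for path in sorted(Path(root).rglob("*.txt")):
            out.write(path.read_bytes())
            count += 1
    return count


if __name__ == "__main__":
    # Directory and file names follow the commands in the text
    root = Path("oanc")
    if root.is_dir():
        print(concatenate_text_files(root, "oanc.txt"), "files concatenated")
```

Reading and writing bytes avoids any guess about the encoding of the individual files, which matches what cat does.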

Now we can look at frequency plots of word n-grams that are more statistically useful.

python plot_word_ngrams.py 2 < oanc.txt

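Here is the chart after visual adjustment.

The winner is still "of the", with about 100,000 occurrences, and the other entries near the top are quite reasonable. You might notice some oddities in the top 10 results, particularly "is s" and "don t". These come from our simplistic approach to tokenizing words. Any non-letter character is treated as a word separator, including the apostrophe. This means that contractions are split into parts, and nothing recombines those parts into the original word.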

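We could fix this with a smarter tokenizer, perhaps one that specifically keeps apostrophes as part of a word. This can be tricky for several reasons, such as the apostrophe being used as a single quotation mark in some ASCII text conventions. Dealing with such issues is beyond the scope of this tutorial.

Let's look at trigrams from the OANC.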
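To make the trade-off concrete, here is a minimal sketch of one possible apostrophe-aware pattern, shown with the standard re module so it runs without NLTK. The pattern is an assumption of mine, not the tutorial's method. It keeps an apostrophe only when letters appear on both sides of it, which sidesteps the single-quote case but still drops trailing possessive apostrophes and ignores typographic (curly) apostrophes.

```python
import re

# Pattern from the listing: any non-letter, including an apostrophe, splits words
SIMPLE_WORD = re.compile(r"[a-z]+")

# Assumed alternative: letters, optionally followed by groups of
# an apostrophe plus more letters
APOSTROPHE_WORD = re.compile(r"[a-z]+(?:'[a-z]+)*")

# Made-up sample mixing contractions with single-quoted text
text = "I don't think it's 'quoted' text".lower()

simple_tokens = SIMPLE_WORD.findall(text)
apostrophe_tokens = APOSTROPHE_WORD.findall(text)

print(simple_tokens)
# ['i', 'don', 't', 'think', 'it', 's', 'quoted', 'text']
print(apostrophe_tokens)
# ['i', "don't", 'think', "it's", 'quoted', 'text']
```

Because the pattern uses only a non-capturing group, it should also be usable as the pattern argument to NLTK's RegexpTokenizer, though I have not tested it against the OANC.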

python plot_word_ngrams.py 3 < oanc.txt

Here is the chart after visual adjustment.

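The contraction artifact shows up right at the top, with more than 12,000 instances of "i don t". Unsurprisingly for a corpus of American English, the trigram naming the United States ranks prominently, ahead of the less frequent entries that follow it.

For completeness, here are unigrams from the OANC, which amount to a simple word frequency count.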

python plot_word_ngrams.py 1 < oanc.txt

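Here is the chart after visual adjustment.

In the GitHub repository for this tutorial, I have included oanc.txt for your own use.

Of course, with a richer corpus now available, letter n-grams are interesting to look at too. Let's start with letter trigrams.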
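The equivalence between unigrams and a plain word count can be checked in a few lines. This is a minimal sketch using collections.Counter rather than NLTK's FreqDist, with a made-up sample sentence; it assumes the same "[a-z]+" notion of a word as the listing.

```python
import re
from collections import Counter

# Made-up sample text, not taken from the OANC
sample = "The quick fox and the lazy dog and the cat"
words = re.findall(r"[a-z]+", sample.lower())

# A plain word frequency count
word_counts = Counter(words)

# Unigrams are n-grams of order 1, stored as 1-tuples as in the word listing
unigram_counts = Counter((word,) for word in words)

# The two counts agree once each 1-tuple is unwrapped
unwrapped = Counter({ngram[0]: count for ngram, count in unigram_counts.items()})

print(word_counts.most_common(2))
print(unwrapped == word_counts)
```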

python plot_ngrams.py 3 < oanc.txt

The results here are somewhat surprising. Let's try n-grams of order 5.

python plot_ngrams.py 5 < oanc.txt

Common words dominate, as do patterns such as the plural "s" ending adjoining short common words.

python plot_ngrams.py 7 < oanc.txt

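This chart takes a long time to generate, and it certainly uses a fair amount of memory. It is still worth a look, though: it surfaces terms associated with reasoned expression, along with typical combinations of short words in various spacing arrangements. At last, "because" and "think" make an appearance.

N-grams grew out of thinking about how machines might generate familiar language, but these models have many uses. For example, you can try to classify languages by comparing n-gram statistics. N-grams featuring the sequence "sz" occur more frequently in Czech than in English, and n-grams featuring the sequences "gb" and "kp" occur with high frequency in my native language, Igbo. With such statistics, you can even analyze dialect differences, such as British versus American English spelling. You can also use n-gram statistics to identify particular authors, though that task is much harder.

Many researchers, including teams at Google, work with n-gram statistics drawn from text in a variety of languages. The Google Books Ngram Viewer draws on the vast material in Google Books. It has a sophisticated query engine that lets you see how n-gram statistics for words have evolved over the years, based on the publication dates of the source books.

Looking at the relationships among words provides a different perspective from focusing on letters, but both approaches are useful and apply in different scenarios.

In this tutorial, we used a well-organized, clean corpus, which is a very important step. The ANC is one of many options, and these include corpora for many languages other than English. Wikipedia has a useful list. Now that you know how to compile text statistics at the letter and word levels, you can start learning how to use them to generate language. We'll cover that in the final tutorial in this series.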
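As a rough illustration of the language classification idea, the following sketch compares letter bigram counts using cosine similarity. Everything in it is an assumption made for illustration: the function names are hypothetical, cosine similarity is just one possible comparison measure, and the reference strings are tiny toy samples (the Igbo one is a handful of words with diacritics removed), far too small for real use.

```python
import math
import re
from collections import Counter


def letter_ngram_profile(text, order=2):
    """Count letter n-grams inside each word of text (lowercased)."""
    profile = Counter()
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        for i in range(len(word) - order + 1):
            profile[word[i:i + order]] += 1
    return profile


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_language(text, reference_profiles, order=2):
    """Return the name of the reference profile most similar to text."""
    profile = letter_ngram_profile(text, order)
    return max(
        reference_profiles,
        key=lambda name: cosine_similarity(profile, reference_profiles[name]),
    )


# Toy reference samples, for illustration only
references = {
    "english": letter_ngram_profile(
        "the quick brown fox jumps over the lazy dog and "
        "the children think about the weather"
    ),
    "igbo": letter_ngram_profile("igbo akpa egbe ugbo okpa kpakpando agba"),
}

print(closest_language("mgbe akpa", references))
print(closest_language("the weather over there", references))
```

With realistic reference corpora, the same comparison could be run over higher-order n-grams, and the similarity scores themselves would say how close the call was.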
