Natural Language Processing: Extending NER to Label New Entities with spaCy

Labelling sequences of words, short and concise.

Figure 1: Colour-coded recognised entities

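This article assumes that the reader has some notion of extracting entities from text and wants to learn more about the state-of-the-art techniques for recognising new custom entities and how to use them. However, if you are new to the NER problem, please read about it here.

That said, the purpose of this article is to describe how to use spaCy's pretrained natural language processing (NLP) core models to learn to recognise new entities. The existing core NLP models from spaCy are trained to recognise various entities, as shown in Figure 2.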

Figure 2: Existing entities recognised by spaCy core models (source)

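Nevertheless, a user may want to define their own entities to meet the needs of a particular problem. In that case the pre-existing entities are insufficient, and the NLP model has to be trained to do the job. Thanks to spaCy's documentation and pretrained models, this is not very difficult.

If you would rather not read further and just want to learn how to use it, please use this Jupyter notebook; it is self-contained. In any case, I would recommend going through it as well.

Like any supervised learning algorithm, which needs inputs and outputs to learn from, here the input is text and the output is encoded according to the BILUO scheme, as shown in Figure 3. Different schemes exist, but Ratinov and Roth showed that the minimal Begin, In, Out (IOB) scheme is more difficult to learn than the BILUO scheme, which explicitly marks boundary tokens. spaCy provides an example of IOB encoding, and I found it to be consistent with that argument. Hence, from here on, any mention of an annotation scheme refers to BILUO.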

Figure 3: BILUO scheme

A short example of BILUO-encoded entities is shown in the following figure.
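To make the tagging concrete, here is a minimal sketch in plain Python (it does not use spaCy). The helper `spans_to_biluo`, the toy sentence and the token index spans are assumptions made for illustration; only the BILUO tag meanings and the DATED label come from this article.

```python
def spans_to_biluo(n_tokens, spans):
    """Return one BILUO tag per token.

    spans: list of (first_token, last_token, label), both indices inclusive.
    """
    tags = ["O"] * n_tokens  # Out: the default for non-entity tokens
    for first, last, label in spans:
        if first == last:
            tags[first] = "U-" + label  # Unit: single-token entity
        else:
            tags[first] = "B-" + label  # Begin: first token of the entity
            for i in range(first + 1, last):
                tags[i] = "I-" + label  # In: inner tokens
            tags[last] = "L-" + label  # Last: final token of the entity
    return tags


# Toy sentence, made up for this sketch
tokens = ["Sebastian", "Thrun", "worked", "at", "Google", "earlier", "this", "week"]
spans = [(0, 1, "PERSON"), (4, 4, "ORG"), (5, 7, "DATED")]

for token, tag in zip(tokens, spans_to_biluo(len(tokens), spans)):
    print(f"{token}\t{tag}")
```

A two-token entity gets only B and L tags, a single-token entity gets U, and every token outside an entity stays O.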

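There are three possible ways to encode your data with the BILUO scheme. One way is to create a spaCy doc from the text string and save the tokens extracted from the doc in a text file, one per line. Each token is then labelled according to the BILUO scheme. One can also create one's own tokens and label them, but that may degrade performance; more on this later. Here is how to label your data for later use in NER training.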

import spacy
import numpy as np
import pandas as pd

nlp = spacy.load('en')

text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")

doc = nlp(text)
words = []
labels = []
for token in doc:
    words.append(token.text)
    # Most tokens are non-entities (Out); replace 'O' later with the
    # actual entity tag according to the scheme.
    labels.append('O')

df = pd.DataFrame({'word': words, 'label': labels})
# 'biluo' in the extension indicates the type of encoding; it is fine to keep csv
df.to_csv('ner-token-per-line.biluo', index=False)

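The snippet above makes annotation easier, but its output cannot be fed directly into the spaCy model for learning. However, another module provided by spaCy, GoldParse, parses the saved format into one that the model accepts. Use the snippet below to read the data from the saved file and parse it into a form accepted by the model.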

dpath = 'ner-token-per-line.biluo'
df = pd.read_csv(dpath, sep=',')
words = df.word.values
ents = df.label.values
text = ' '.join(words)

from spacy.gold import GoldParse

doc = nlp.make_doc(text)
g = GoldParse(doc, entities=ents)
x = [doc]
y = [g]

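Another way is to use offset indices for each entity label, i.e. the start and end character indices of the whole entity (its beginning, inside and last parts combined) are provided together with the label, for example: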

text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")

# g: the entity annotations for `text`, given as start and end offsets with a label
g = 
x = [text]
y = [g]
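To make the offset format concrete, here is a minimal sketch in plain Python (no spaCy needed). It assumes the annotation is a dictionary of the form `{'entities': [(start, end, label), ...]}`, which is the simple training format used with spaCy v2; the helper `find_offsets` is a hypothetical name, and the three mentions are taken from the first sentence of the example text and the labels from the model output shown later.

```python
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously.")


def find_offsets(text, mentions):
    """Build an offset annotation from (surface string, label) pairs.

    Hypothetical helper: only the first occurrence of each string is annotated.
    """
    entities = []
    for surface, label in mentions:
        start = text.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in text")
        # end offset is exclusive, so text[start:end] == surface
        entities.append((start, start + len(surface), label))
    return {"entities": entities}


annotation = find_offsets(
    text, [("Sebastian Thrun", "PERSON"), ("Google", "ORG"), ("2007", "DATE")]
)
print(annotation)
# {'entities': [(5, 20, 'PERSON'), (61, 67, 'ORG'), (71, 75, 'DATE')]}
```

Searching for the surface string avoids counting characters by hand, but it only works when the first occurrence is the one you mean; a repeated mention such as the second "Thrun" in the full text would need its offset given explicitly.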

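The third way is similar to the first, except that here we fix our own tokens and label them, instead of generating the tokens with the NLP model and then labelling them. This also works, but in my experiments I found that it degrades performance. Nevertheless, here is how you can do it: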

from spacy.tokens import Doc

# words: for example, whitespace-split tokens
spaces = [True] * len(words)
spaces[-1] = False  # no trailing space after the last token
doc = Doc(nlp.vocab, words=words, spaces=spaces)  # custom doc
g = GoldParse(doc, entities=ents)
x = [doc]
y = [g]
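The `spaces` list tells spaCy whether each token is followed by a space. Here is a small sketch in plain Python of what that means for the text the doc represents; `rebuild_text` is a hypothetical helper, not part of spaCy, and it only mimics the way each token's text and trailing whitespace are concatenated.

```python
def rebuild_text(words, spaces):
    """Join tokens, adding a single space after each token flagged True."""
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


words = ["earlier", "this", "week", "."]

# Same flags as in the snippet above: a space after every token but the last
uniform = [True] * len(words)
uniform[-1] = False
print(repr(rebuild_text(words, uniform)))  # 'earlier this week .'

# Flags that record that the full stop is attached to the previous word
exact = [True, True, False, False]
print(repr(rebuild_text(words, exact)))    # 'earlier this week.'
```

With uniform flags the rebuilt text can differ from the original string wherever punctuation was attached to a word, which is worth keeping in mind when the labels were made against the original text.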

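Once the data has been preprocessed and is ready for training, we further need to add the labels of the new entities to the model pipeline. The core spaCy models have three pipelines: tagger, parser and ner. We also need to disable the tagger and parser pipelines, since we will train only the ner pipeline, although all the other pipelines could be trained at the same time. Click here to learn more.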

add_ents = ['DATED']  # the new entity

# Pipelines in the core pretrained model are tagger, parser and ner.
# Create a new ner pipe if a blank model is to be trained using
# `spacy.blank('en')`, else get the existing one.
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")  # "architecture": "ensemble" (simple_cnn, ensemble, bow)
    nlp.add_pipe(ner)  # a newly created pipe must be added to the pipeline to be trained
else:
    ner = nlp.get_pipe("ner")

prev_ents = ner.move_names  # all the existing entities recognised by the model
print('[Existing Entities] = ', ner.move_names)

for ent in add_ents:
    ner.add_label(ent)

new_ents = ner.move_names
# print('[All Entities] = ', ner.move_names)
print('[New Entities] = ', list(set(new_ents) - set(prev_ents)))

## Training
model = None  # since we are training a fresh model, not a saved model
n_iter = 20

# every pipe except ner is disabled during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):  # only train ner
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    for i in range(n_iter):
        losses = {}
        nlp.update(x, y, sgd=optimizer, drop=0.0, losses=losses)
        # nlp.entity.update(d, g)
        print("losses", losses)

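Here, the dropout for training is set to 0.0 in order to deliberately overfit the model and show that it can learn to recognise all the new entities.

Results from the trained model: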

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

# Output
# Sebastian Thrun PERSON
# Google ORG
# 2007 DATE
# American NORP
# Thrun PERSON
# Recode ORG
# earlier this week. DATED

spaCy also provides a way to render colour-coded entities (as shown in Figure 1), which can be viewed in a web browser or in a notebook using the snippet below:

from spacy import displacy

# Use render() from a notebook; otherwise use displacy.serve(doc, style="ent")
displacy.render(doc, style="ent")

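The procedure given here for training a new entity may look a little too easy, and it does come with a caveat. During training, the newly trained model may forget how to recognise the old entities, so it is highly recommended to mix in some text containing the previously trained entities, unless the old entities are not important to the problem. Second, learning more specific entities may work better than learning generalised ones.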
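One simple way to do that mixing is sketched below in plain Python. Everything in it is an assumption made for illustration: the helper `mix_examples`, the `old_per_new` ratio, the toy sentences, and the use of `(text, {'entities': [...]})` pairs as the training format. It is not a recommendation for a particular ratio.

```python
import random


def mix_examples(new_examples, old_examples, old_per_new=1.0, seed=0):
    """Shuffle new-entity examples together with a sample of old-entity examples.

    old_per_new is a hypothetical knob: how many old examples to add per new one.
    """
    rng = random.Random(seed)
    n_old = min(len(old_examples), int(len(new_examples) * old_per_new))
    mixed = list(new_examples) + rng.sample(list(old_examples), n_old)
    rng.shuffle(mixed)
    return mixed


# Toy data: one example with the new DATED entity ...
new_examples = [
    ("He spoke earlier this week", {"entities": [(9, 26, "DATED")]}),
]
# ... and examples that contain only entities the model already knows
old_examples = [
    ("Google was founded in 1998.", {"entities": [(0, 6, "ORG"), (22, 26, "DATE")]}),
    ("Sebastian Thrun joined Google.", {"entities": [(0, 15, "PERSON"), (23, 29, "ORG")]}),
]

train_data = mix_examples(new_examples, old_examples, old_per_new=2.0)
for text, annotation in train_data:
    print(text, annotation)
```

The mixed list can then be split into texts and annotations and passed to the training loop in place of a data set that contains the new entity alone.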

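We have seen that it is not difficult to get started with learning new entities, but one does need to experiment with different annotation techniques and choose the one that works best for the problem at hand.

This article further extends the example provided by spaCy here.

The entire code can be accessed in this Jupyter notebook. The README also explains how to install the spaCy library and how to debug errors that come up during installation and while loading the pretrained models.

Read this article by Akbik et al. It should help in understanding the algorithm behind sequence labelling, i.e. multi-word entities.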
