How do I register custom components in a spaCy config.cfg file?

Message by Anonymous »
As the title says:
I believe I've followed the documentation as described, and I've looked all over the web for a helpful answer, but so far I haven't found much. Any help is greatly appreciated! Thanks!
I'm running the command:
python -m spacy debug config config.cfg --code 'matcher.py' --code 'sentence.py'

and
python -m spacy train 'config.cfg' --output 'config\' --code 'sentence.py' --code 'matcher.py'

Both produce the same error:
ValueError: [E002] Can't find factory for 'sentence_splitter' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, entity_ruler, tagger, morphologizer, ner, beam_ner, senter, sentencizer, spancat, spancat_singlelabel, span_finder, future_entity_ruler, span_ruler, textcat, textcat_multilabel, matcher, en.lemmatizer
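As background for the E002 message: spaCy resolves each `factory = "..."` entry in the config by looking the name up in a registry that `@Language.component` / `@Language.factory` populate at import time, so the registering module must actually be imported. Here is a minimal pure-Python sketch of that decorator-registry pattern (all names are illustrative, not spaCy's real internals):

```python
# Illustrative sketch of a decorator-based component registry, loosely
# mirroring how @Language.component makes a name resolvable from a config.
# Names here are hypothetical; this is not spaCy's actual implementation.

factories = {}  # name -> factory callable

def component(name):
    """Register a stateless component function under `name`."""
    def decorator(func):
        factories[name] = func
        return func
    return decorator

@component("sentence_splitter")
def sentence_splitter(doc):
    return doc

def create_pipe(name):
    # This lookup is what fails with E002 when the module containing the
    # decorator was never imported (e.g. the --code files weren't loaded).
    if name not in factories:
        raise ValueError(f"[E002] Can't find factory for '{name}'")
    return factories[name]
```

The key point is that the registry is only populated as a side effect of importing the module that contains the decorated function, which is exactly what the `--code` flag is meant to trigger.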
Here is my config file:
[paths]
train = "output_data.spacy"
dev = "output_data.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","tagger","sentence_splitter", "parser", "senter","attribute_ruler","matcher","lemmatizer","spacytextblob"]
disabled = ["senter", "tagger", "attribute_ruler", "spacytextblob"]
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}

[components]

[components.sentence_splitter]
factory = "sentence_splitter"

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers": "spacy.attribute_ruler_scorer.v1"}
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false
scorer = {"@scorers": "spacy.lemmatizer_scorer.v1"}

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers": "spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers": "spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "tok2vec"

[components.senter]
factory = "senter"
overwrite = false
scorer = {"@scorers": "spacy.senter_scorer.v1"}

[components.senter.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.senter.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.senter.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 16
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [1000,500,500,500,50]
include_static_vectors = true

[components.senter.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 16
depth = 2
window_size = 1
maxout_pieces = 2

[components.spacytextblob]
factory = "spacytextblob"
blob_only = false
custom_blob = null

[components.tagger]
factory = "tagger"
label_smoothing = 0.0
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "tok2vec"

[components.matcher]
factory = "matcher"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY","IS_SPACE"]
rows = [5000,1000,2500,2500,50,50]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 100000
eval_frequency = 1000
frozen_components = []
before_to_disk = null
annotating_components = []
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = 0.16
dep_uas = 0.0
dep_las = 0.16
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.02
lemma_acc = 0.5
ents_f = 0.16
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
speed = 0.0

[pretraining]

[initialize]
vocab_data = null
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
before_init = null
after_init = null

[initialize.components]

[initialize.components.ner]

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
require = false

[initialize.components.parser]

[initialize.components.parser.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/parser.json"
require = false

[initialize.components.tagger]

[initialize.components.tagger.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/tagger.json"
require = false

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[initialize.tokenizer]

The sentence.py file:
import re

import spacy
from spacy.language import Language


@Language.component("sentence_splitter")  # stateless
def sentence_splitter(doc):
    start = 0
    i = 0
    # print("Processing custom_sentence_splitter_improved")
    # delimiter_pattern = re.compile(r"(\r?\n)+|(\n)+")  # This is the magic regex
    delimiter_pattern = re.compile(r"(\r?\n\s*)+|(\n\s*)+")
    while i < len(doc):
        if delimiter_pattern.fullmatch(doc[i].text):
            # print(f"Found delimiter '{doc[i].text}' at position {i}")
            for token in doc[start:i]:
                token.sent_start = False
            doc[i].sent_start = True
            start = i + 1

            # Skip consecutive occurrences of '\r' and '\n'
            while i + 1 < len(doc) and delimiter_pattern.fullmatch(doc[i + 1].text):
                doc[i + 1].sent_start = False
                i += 1
        else:
            doc[i].sent_start = False
        i += 1

    for token in doc[start:]:
        token.sent_start = False

    return doc


# Used to add the custom component to the pipeline
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("sentence_splitter", name="sentence_splitter", after="ner")
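As an aside, the delimiter regex in sentence.py can be exercised in isolation with plain `re`, no spaCy needed (standalone sketch of the same pattern):

```python
import re

# Same pattern as in sentence.py: one or more newline runs, each
# optionally followed by whitespace.
delimiter_pattern = re.compile(r"(\r?\n\s*)+|(\n\s*)+")

# fullmatch means the entire token text must be a delimiter run.
assert delimiter_pattern.fullmatch("\n")
assert delimiter_pattern.fullmatch("\r\n\r\n")
assert delimiter_pattern.fullmatch("\n   \n")
assert not delimiter_pattern.fullmatch("hello")
assert not delimiter_pattern.fullmatch("hello\n")
```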

The matcher.py file:
import re

import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token


@Language.factory("matcher")  # stateful
def create_template_matcher(nlp, name):
    return TemplateMatcher(nlp.vocab)


class TemplateMatcher:
    def __init__(self, vocab):
        # Define multiple patterns
        patterns1 = [blar blar blar]
        patterns2 = [blar blar blar]
        patterns3 = [blar blar blar]
        patterns4 = [blar blar blar]

        # Register a new token extension to flag matched patterns
        Token.set_extension("templates", default=False, force=True)
        self.matcher = Matcher(vocab)
        self.matcher.add("patterns1", patterns1)
        self.matcher.add("patterns2", patterns2)
        self.matcher.add("patterns3", patterns3)
        self.matcher.add("patterns4", patterns4)

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            for token in doc[start:end]:
                token._.templates = True
        return doc


# Used to add the custom component to the pipeline
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("matcher", name="matcher", after="parser")


More details here: https://stackoverflow.com/questions/767 ... g-cfg-file