How do I register custom components in a spaCy config.cfg file?

Message by Anonymous »
As the title says:
I believe I've followed the documentation as described, and I've looked all over the web for a helpful answer, but so far I haven't found much. Any help is greatly appreciated! Thanks!
I'm running the command:
python -m spacy debug config config.cfg --code 'matcher.py' --code 'sentence.py'

and
python -m spacy train 'config.cfg' --output 'config\' --code 'sentence.py' --code 'matcher.py'

Both produce the same error:
ValueError: [E002] Can't find factory for 'sentence_splitter' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, entity_ruler, tagger, morphologizer, ner, beam_ner, senter, sentencizer, spancat, spancat_singlelabel, span_finder, future_entity_ruler, span_ruler, textcat, textcat_multilabel, matcher, en.lemmatizer
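As background for the E002 message: spaCy resolves each `factory = "..."` entry in the config by looking the name up in a registry that `@Language.component` / `@Language.factory` populate at import time, so the registering module must actually be imported. Here is a minimal pure-Python sketch of that decorator-registry pattern (all names are illustrative, not spaCy's real internals):

```python
# Illustrative sketch of a decorator-based component registry, loosely
# mirroring how @Language.component makes a name resolvable from a config.
# Names here are hypothetical; this is not spaCy's actual implementation.

factories = {}  # name -> factory callable

def component(name):
    """Register a stateless component function under `name`."""
    def decorator(func):
        factories[name] = func
        return func
    return decorator

@component("sentence_splitter")
def sentence_splitter(doc):
    return doc

def create_pipe(name):
    # This lookup is what fails with E002 when the module containing the
    # decorator was never imported (e.g. the --code files weren't loaded).
    if name not in factories:
        raise ValueError(f"[E002] Can't find factory for '{name}'")
    return factories[name]
```

The key point is that the registry is only populated as a side effect of importing the module that contains the decorated function, which is exactly what the `--code` flag is meant to trigger.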
Here is my config file:
[paths]
train = "output_data.spacy"
dev = "output_data.spacy"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","ner","tagger","sentence_splitter", "parser", "senter","attribute_ruler","matcher","lemmatizer","spacytextblob"]
disabled = ["senter", "tagger", "attribute_ruler", "spacytextblob"]
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"}

[components]

[components.sentence_splitter]
factory = "sentence_splitter"

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers": "spacy.attribute_ruler_scorer.v1"}
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false
scorer = {"@scorers": "spacy.lemmatizer_scorer.v1"}

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers": "spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.ner.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.ner.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[components.parser]
factory = "parser"
learn_tokens = false
min_action_freq = 30
moves = null
scorer = {"@scorers": "spacy.parser_scorer.v1"}
update_with_oracle_cut_size = 100

[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.parser.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "tok2vec"

[components.senter]
factory = "senter"
overwrite = false
scorer = {"@scorers": "spacy.senter_scorer.v1"}

[components.senter.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.senter.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.senter.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 16
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY"]
rows = [1000,500,500,500,50]
include_static_vectors = true

[components.senter.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 16
depth = 2
window_size = 1
maxout_pieces = 2

[components.spacytextblob]
factory = "spacytextblob"
blob_only = false
custom_blob = null

[components.tagger]
factory = "tagger"
label_smoothing = 0.0
neg_prefix = "!"
overwrite = false
scorer = {"@scorers":"spacy.tagger_scorer.v1"}

[components.tagger.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false

[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode:width}
upstream = "tok2vec"

[components.matcher]
factory = "matcher"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode:width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","SPACY","IS_SPACE"]
rows = [5000,1000,2500,2500,50,50]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
max_length = 0
limit = 0
augmenter = null

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system:seed}
gpu_allocator = ${system:gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 5000
max_epochs = 0
max_steps = 100000
eval_frequency = 1000
frozen_components = []
before_to_disk = null
annotating_components = []
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = true
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
tag_acc = 0.16
dep_uas = 0.0
dep_las = 0.16
dep_las_per_type = null
sents_p = null
sents_r = null
sents_f = 0.02
lemma_acc = 0.5
ents_f = 0.16
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
speed = 0.0

[pretraining]

[initialize]
vocab_data = null
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
before_init = null
after_init = null

[initialize.components]

[initialize.components.ner]

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
require = false

[initialize.components.parser]

[initialize.components.parser.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/parser.json"
require = false

[initialize.components.tagger]

[initialize.components.tagger.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/tagger.json"
require = false

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[initialize.tokenizer]

The sentence.py file:
import re

import spacy
from spacy.language import Language


@Language.component("sentence_splitter")  # stateless
def sentence_splitter(doc):
    start = 0
    i = 0
    # print("Processing custom_sentence_splitter_improved")
    # delimiter_pattern = re.compile(r"(\r?\n)+|(\n)+")  # This is the magic regex
    delimiter_pattern = re.compile(r"(\r?\n\s*)+|(\n\s*)+")
    while i < len(doc):
        if delimiter_pattern.fullmatch(doc[i].text):
            # print(f"Found delimiter '{doc[i].text}' at position {i}")
            for token in doc[start:i]:
                token.sent_start = False
            doc[i].sent_start = True
            start = i + 1

            # Skip consecutive occurrences of '\r' and '\n'
            while i + 1 < len(doc) and delimiter_pattern.fullmatch(doc[i + 1].text):
                doc[i + 1].sent_start = False
                i += 1
        else:
            doc[i].sent_start = False
        i += 1

    for token in doc[start:]:
        token.sent_start = False

    return doc


# Used to add the custom component to the pipeline
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("sentence_splitter", name="sentence_splitter", after="ner")
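As an aside, the delimiter regex in sentence.py can be exercised in isolation with plain `re`, no spaCy needed (standalone sketch of the same pattern):

```python
import re

# Same pattern as in sentence.py: one or more newline runs, each
# optionally followed by whitespace.
delimiter_pattern = re.compile(r"(\r?\n\s*)+|(\n\s*)+")

# fullmatch means the entire token text must be a delimiter run.
assert delimiter_pattern.fullmatch("\n")
assert delimiter_pattern.fullmatch("\r\n\r\n")
assert delimiter_pattern.fullmatch("\n   \n")
assert not delimiter_pattern.fullmatch("hello")
assert not delimiter_pattern.fullmatch("hello\n")
```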

The matcher.py file:
import re

import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token


@Language.factory("matcher")  # stateful
def create_template_matcher(nlp, name):
    return TemplateMatcher(nlp.vocab)


class TemplateMatcher:
    def __init__(self, vocab):
        # Define multiple patterns
        patterns1 = [blar blar blar]
        patterns2 = [blar blar blar]
        patterns3 = [blar blar blar]
        patterns4 = [blar blar blar]

        # Register a new token extension to flag matched patterns
        Token.set_extension("templates", default=False, force=True)
        self.matcher = Matcher(vocab)
        self.matcher.add("patterns1", patterns1)
        self.matcher.add("patterns2", patterns2)
        self.matcher.add("patterns3", patterns3)
        self.matcher.add("patterns4", patterns4)

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            for token in doc[start:end]:
                token._.templates = True
        return doc


# Used to add the custom component to the pipeline
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("matcher", name="matcher", after="parser")


More details here: https://stackoverflow.com/questions/767 ... g-cfg-file