GitHub - jacobfrey36/HanLP: Chinese word segmentation, part-of-speech tagging, named entity recognition...

Part-of-Speech Tagging

The part-of-speech tagger takes words as input and outputs a part-of-speech tag for each word.

    tagger = hanlp.load(hanlp.pretrained.pos.PTB_POS_RNN_FASTTEXT_EN)
    tagger([['I', 'banked', '2', 'dollars', 'in', 'a', 'bank', '.'],
            ['Is', 'this', 'the', 'future', 'of', 'chamber', 'music', '?']])
    [['PRP', 'VBD', 'CD', 'NNS', 'IN', 'DT', 'NN', '.'],
     ['VBZ', 'DT', 'DT', 'NN', 'IN', 'NN', 'NN', '.']]

Part-of-speech tagging is multilingual as well; the language depends on which model you load (note the ZH at the end of the identifier below).

    tagger = hanlp.load(hanlp.pretrained.pos.CTB5_POS_RNN_FASTTEXT_ZH)
    tagger(['我', '的', '希望', '是', '希望', '和平'])
    ['PN', 'DEG', 'NN', 'VC', 'VV', 'NN']

Notice that the two occurrences of 希望 in the sentence receive different tags: the first is a noun and the second is a verb. For the tag set, see Chapter 7 of《自然语言处理入门》(Introduction to Natural Language Processing) or wait for the official documentation. This tagger uses fastText[^fasttext] as its embedding layer, so it is immune to OOV (out-of-vocabulary) words.

Named Entity Recognition

The named entity recognition module takes a list of words as input and outputs the boundaries and categories of the named entities.

    recognizer = hanlp.load(hanlp.pretrained.ner.CONLL03_NER_BERT_BASE_UNCASED_EN)
    recognizer(["President", "Obama", "is", "speaking", "at", "the", "White", "House"])
    [('Obama', 'PER', 1, 2), ('White House', 'LOC', 6, 8)]

Chinese named entity recognition uses a character-level model, so don't forget to convert the string into a list of characters with list. The output format is (entity, type, begin, end).

    recognizer = hanlp.load(hanlp.pretrained.ner.MSRA_NER_BERT_BASE_ZH)
    recognizer([list('上海华安工业(集团)公司董事长谭旭光和秘书张晚霞来到美国纽约现代艺术博物馆参观。'),
                list('萨哈夫说,伊拉克将同联合国销毁伊拉克大规模杀伤性武器特别委员会继续保持合作。')])
    [[('上海华安工业(集团)公司', 'NT', 0, 12), ('谭旭光', 'NR', 15, 18), ('张晚霞', 'NR', 21, 24), ('美国', 'NS', 26, 28), ('纽约现代艺术博物馆', 'NS', 28, 37)],
     [('萨哈夫', 'NR', 0, 3), ('伊拉克', 'NS', 5, 8), ('联合国销毁伊拉克大规模杀伤性武器特别委员会', 'NT', 10, 31)]]
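The begin and end fields are offsets into the input character list (end exclusive, as the examples above show), so the surface string of each entity can be recovered with plain slicing. A minimal sketch reusing the model and the first sentence from the example above:

    import hanlp

    recognizer = hanlp.load(hanlp.pretrained.ner.MSRA_NER_BERT_BASE_ZH)
    chars = list('上海华安工业(集团)公司董事长谭旭光和秘书张晚霞来到美国纽约现代艺术博物馆参观。')
    for entity, tag, begin, end in recognizer(chars):
        # begin/end index into the character list; end is exclusive.
        assert ''.join(chars[begin:end]) == entity
        print(tag, entity)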

MSRA_NER_BERT_BASE_ZH here is the most accurate model, based on BERT[^bert]; you can inspect its evaluation metrics:

    $ cat ~/.hanlp/ner/ner_bert_base_msra_20200104_185735/test.log
    20-01-04 18:55:02 INFO Evaluation results for test.tsv - loss: 1.4949 - f1: 0.9522 - speed: 113.37 sample/sec
    processed 177342 tokens with 5268 phrases; found: 5316 phrases; correct: 5039.
    accuracy:  99.37%; precision:  94.79%; recall:  95.65%; FB1:  95.22
                  NR: precision:  96.39%; recall:  97.83%; FB1:  97.10  1357
                  NS: precision:  96.70%; recall:  95.79%; FB1:  96.24  2610
                  NT: precision:  89.47%; recall:  93.13%; FB1:  91.27  1349

Dependency Parsing

Syntactic parsing is a core NLP task; in the eyes of many hard-core scholars and interviewers, someone who does not understand parsing can hardly call themselves an NLP researcher or engineer. With HanLP, however, parsing takes only two lines of code.

    syntactic_parser = hanlp.load(hanlp.pretrained.dep.PTB_BIAFFINE_DEP_EN)
    print(syntactic_parser([('Is', 'VBZ'), ('this', 'DT'), ('the', 'DT'), ('future', 'NN'),
                            ('of', 'IN'), ('chamber', 'NN'), ('music', 'NN'), ('?', '.')]))
    1 Is      _ VBZ _ _ 4 cop   _ _
    2 this    _ DT  _ _ 4 nsubj _ _
    3 the     _ DT  _ _ 4 det   _ _
    4 future  _ NN  _ _ 0 root  _ _
    5 of      _ IN  _ _ 4 prep  _ _
    6 chamber _ NN  _ _ 7 nn    _ _
    7 music   _ NN  _ _ 5 pobj  _ _
    8 ?       _ .   _ _ 4 punct _ _

The parser takes a list of words and their part-of-speech tags and outputs a parse tree in CoNLL-X format[^conllx], which can be manipulated through the CoNLLSentence class. A Chinese example:

    syntactic_parser = hanlp.load(hanlp.pretrained.dep.CTB7_BIAFFINE_DEP_ZH)
    print(syntactic_parser([('蜡烛', 'NN'), ('两', 'CD'), ('头', 'NN'), ('烧', 'VV')]))
    1 蜡烛 _ NN _ _ 4 nsubj  _ _
    2 两   _ CD _ _ 3 nummod _ _
    3 头   _ NN _ _ 4 dep    _ _
    4 烧   _ VV _ _ 0 root   _ _
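The returned CoNLLSentence can be traversed like a list of word entries. The sketch below walks the Chinese tree above and prints each word with its head; the attribute names (id, form, head, deprel), mirroring the CoNLL-X columns, are my assumption rather than something documented here:

    import hanlp

    syntactic_parser = hanlp.load(hanlp.pretrained.dep.CTB7_BIAFFINE_DEP_ZH)
    tree = syntactic_parser([('蜡烛', 'NN'), ('两', 'CD'), ('头', 'NN'), ('烧', 'VV')])

    # Assumed CoNLL-X style attributes on each entry: id, form, head, deprel.
    for word in tree:
        head = tree[word.head - 1].form if word.head > 0 else 'ROOT'
        print(word.id, word.form, '--' + word.deprel + '->', head)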

For the dependency labels, see Chapter 11 of《自然语言处理入门》or wait for the official documentation.

Semantic Dependency Parsing

The result of semantic parsing is a directed acyclic graph called a semantic dependency graph. The nodes of the graph are words, the edges are semantic dependency arcs, and the edge labels are semantic relations.

    semantic_parser = hanlp.load(hanlp.pretrained.sdp.SEMEVAL15_PAS_BIAFFINE_EN)
    print(semantic_parser([('Is', 'VBZ'), ('this', 'DT'), ('the', 'DT'), ('future', 'NN'),
                           ('of', 'IN'), ('chamber', 'NN'), ('music', 'NN'), ('?', '.')]))
    1 Is      _ VBZ _ _ 0 ROOT      _ _
    2 this    _ DT  _ _ 1 verb_ARG1 _ _
    3 the     _ DT  _ _ 0 ROOT      _ _
    4 future  _ NN  _ _ 1 verb_ARG2 _ _
    4 future  _ NN  _ _ 3 det_ARG1  _ _
    4 future  _ NN  _ _ 5 prep_ARG1 _ _
    5 of      _ IN  _ _ 0 ROOT      _ _
    6 chamber _ NN  _ _ 0 ROOT      _ _
    7 music   _ NN  _ _ 5 prep_ARG2 _ _
    7 music   _ NN  _ _ 6 noun_ARG1 _ _
    8 ?       _ .   _ _ 0 ROOT      _ _

HanLP implements the state-of-the-art biaffine[^biaffine] model and supports semantic dependency parsing for any language:

    semantic_parser = hanlp.load(hanlp.pretrained.sdp.SEMEVAL16_NEWS_BIAFFINE_ZH)
    print(semantic_parser([('蜡烛', 'NN'), ('两', 'CD'), ('头', 'NN'), ('烧', 'VV')]))
    1 蜡烛 _ NN _ _ 3 Poss _ _
    1 蜡烛 _ NN _ _ 4 Pat  _ _
    2 两   _ CD _ _ 3 Quan _ _
    3 头   _ NN _ _ 4 Loc  _ _
    4 烧   _ VV _ _ 0 Root _ _

The output is still in CoNLLSentence format, except that this time it is a graph: any node may have zero or more heads. For example, 蜡烛 has two heads (IDs 3 and 4). For the semantic relations, see《中文语义依存分析语料库》(the Chinese Semantic Dependency Corpus) or wait for the official documentation.
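Because the graph is printed with one CoNLL line per arc, a token with several heads appears on several lines. A small sketch that groups the arcs by token, under the same assumed attribute names (id, form, head, deprel) as above:

    from collections import defaultdict

    import hanlp

    semantic_parser = hanlp.load(hanlp.pretrained.sdp.SEMEVAL16_NEWS_BIAFFINE_ZH)
    graph = semantic_parser([('蜡烛', 'NN'), ('两', 'CD'), ('头', 'NN'), ('烧', 'VV')])

    heads = defaultdict(list)          # (token id, form) -> [(head id, relation), ...]
    for word in graph:                 # one entry per arc
        heads[(word.id, word.form)].append((word.head, word.deprel))

    for (idx, form), arcs in sorted(heads.items()):
        print(idx, form, arcs)         # e.g. 蜡烛 -> [(3, 'Poss'), (4, 'Pat')]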

Pipeline

Syntactic and semantic parsing depend on part-of-speech tagging, which in turn depends on tokenization, so it would be convenient to have a computation-graph-like mechanism that chains these modules together automatically. HanLP's pipeline lets you flexibly assemble multiple components (statistical models or rule-based systems):

    pipeline = hanlp.pipeline() \
        .append(hanlp.utils.rules.split_sentence, output_key='sentences') \
        .append(tokenizer, output_key='tokens') \
        .append(tagger, output_key='part_of_speech_tags') \
        .append(syntactic_parser, input_key=('tokens', 'part_of_speech_tags'), output_key='syntactic_dependencies') \
        .append(semantic_parser, input_key=('tokens', 'part_of_speech_tags'), output_key='semantic_dependencies')

Note that the first stage of the pipeline is an ordinary Python function, split_sentence, which splits the text into sentences. input_key and output_key specify how the stages are wired together; you can print the pipeline to inspect its structure:

    pipeline
    [None->LambdaComponent->sentences, sentences->NgramConvTokenizer->tokens, tokens->RNNPartOfSpeechTagger->part_of_speech_tags, ('tokens', 'part_of_speech_tags')->BiaffineDependencyParser->syntactic_dependencies, ('tokens', 'part_of_speech_tags')->BiaffineSemanticDependencyParser->semantic_dependencies]

This time, just like the most common scenario in day-to-day work, we feed an entire document text in at once:

    print(pipeline(text))
    {
      "sentences": [
        "Jobs and Wozniak co-founded Apple in 1976 to sell Wozniak's Apple I personal computer.",
        "Together the duo gained fame and wealth a year later with the Apple II."
      ],
      "tokens": [
        ["Jobs", "and", "Wozniak", "co-founded", "Apple", "in", "1976", "to", "sell", "Wozniak", "'s", "", "Apple", "I", "personal", "computer", "."],
        ["Together", "the", "duo", "gained", "fame", "and", "wealth", "a", "year", "later", "with", "the", "Apple", "II", "."]
      ],
      "part_of_speech_tags": [
        ["NNS", "CC", "NNP", "VBD", "NNP", "IN", "CD", "TO", "VB", "NNP", "POS", "``", "NNP", "PRP", "JJ", "NN", "."],
        ["IN", "DT", "NN", "VBD", "NN", "CC", "NN", "DT", "NN", "RB", "IN", "DT", "NNP", "NNP", "."]
      ],
      "syntactic_dependencies": [
        [[4, "nsubj"], [1, "cc"], [1, "conj"], [0, "root"], [4, "dobj"], [4, "prep"], [6, "pobj"], [9, "aux"], [4, "xcomp"], [16, "poss"], [10, "possessive"], [16, "punct"], [16, "nn"], [16, "nn"], [16, "amod"], [9, "dobj"], [4, "punct"]],
        [[4, "advmod"], [3, "det"], [4, "nsubj"], [0, "root"], [4, "dobj"], [5, "cc"], [5, "conj"], [9, "det"], [10, "npadvmod"], [4, "advmod"], [4, "prep"], [14, "det"], [14, "nn"], [11, "pobj"], [4, "punct"]]
      ],
      "semantic_dependencies": [
        [[[2], ["coord_ARG1"]], [[4, 9], ["verb_ARG1", "verb_ARG1"]], [[2], ["coord_ARG2"]], [[6, 8], ["prep_ARG1", "comp_MOD"]], [[4], ["verb_ARG2"]], [[0], ["ROOT"]], [[6], ["prep_ARG2"]], [[0], ["ROOT"]], [[8], ["comp_ARG1"]], [[11], ["poss_ARG2"]], [[0], ["ROOT"]], [[0], ["ROOT"]], [[0], ["ROOT"]], [[0], ["ROOT"]], [[0], ["ROOT"]], [[9, 11, 12, 14, 15], ["verb_ARG3", "poss_ARG1", "punct_ARG1", "noun_ARG1", "adj_ARG1"]], [[0], ["ROOT"]]],
        [[[0], ["ROOT"]], [[0], ["ROOT"]], [[1, 2, 4], ["adj_ARG1", "det_ARG1", "verb_ARG1"]], [[1, 10], ["adj_ARG1", "adj_ARG1"]], [[6], ["coord_ARG1"]], [[4], ["verb_ARG2"]], [[6], ["coord_ARG2"]], [[0], ["ROOT"]], [[8], ["det_ARG1"]], [[9], ["noun_ARG1"]], [[0], ["ROOT"]], [[0], ["ROOT"]], [[0], ["ROOT"]], [[11, 12, 13], ["prep_ARG2", "det_ARG1", "noun_ARG1"]], [[0], ["ROOT"]]]
      ]
    }

Chinese is processed exactly the same way as English; in fact, HanLP 2.0 treats all human languages as one unified symbolic system:

    print(pipeline(text))
    {
      "sentences": [
        "HanLP是一系列模型与算法组成的自然语言处理工具包,目标是普及自然语言处理在生产环境中的应用。",
        "HanLP具备功能完善、性能高效、架构清晰、语料时新、可自定义的特点。",
        "内部算法经过工业界和学术界考验,配套书籍《自然语言处理入门》已经出版。"
      ],
      "tokens": [
        ["HanLP", "是", "一", "系列", "模型", "与", "算法", "组成", "的", "自然", "语言", "处理", "工具包", ",", "目标", "是", "普及", "自然", "语言", "处理", "在", "生产", "环境", "中", "的", "应用", "。"],
        ["HanLP", "具备", "功能", "完善", "、", "性能", "高效", "、", "架构", "清晰", "、", "语料", "时", "新", "、", "可", "自", "定义", "的", "特点", "。"],
        ["内部", "算法", "经过", "工业界", "和", "学术界", "考验", ",", "配套", "书籍", "《", "自然", "语言", "处理", "入门", "》", "已经", "出版", "。"]
      ],
      "part_of_speech_tags": [
        ["NR", "VC", "CD", "M", "NN", "CC", "NN", "VV", "DEC", "NN", "NN", "VV", "NN", "PU", "NN", "VC", "VV", "NN", "NN", "VV", "P", "NN", "NN", "LC", "DEG", "NN", "PU"],
        ["NR", "VV", "NN", "VA", "PU", "NN", "VA", "PU", "NN", "VA", "PU", "NN", "LC", "VA", "PU", "VV", "P", "VV", "DEC", "NN", "PU"],
        ["NN", "NN", "P", "NN", "CC", "NN", "NN", "PU", "VV", "NN", "PU", "NN", "NN", "NN", "NN", "PU", "AD", "VV", "PU"]
      ],
      "syntactic_dependencies": [
        [[2, "top"], [0, "root"], [4, "nummod"], [11, "clf"], [7, "conj"], [7, "cc"], [8, "nsubj"], [11, "rcmod"], [8, "cpm"], [11, "nn"], [12, "nsubj"], [2, "ccomp"], [12, "dobj"], [2, "punct"], [16, "top"], [2, "conj"], [16, "ccomp"], [19, "nn"], [20, "nsubj"], [17, "conj"], [26, "assmod"], [23, "nn"], [24, "lobj"], [21, "plmod"], [21, "assm"], [20, "dobj"], [2, "punct"]],
        [[2, "nsubj"], [0, "root"], [4, "nsubj"], [20, "rcmod"], [4, "punct"], [7, "nsubj"], [4, "conj"], [4, "punct"], [10, "nsubj"], [4, "conj"], [4, "punct"], [13, "lobj"], [14, "loc"], [4, "conj"], [4, "punct"], [18, "mmod"], [18, "advmod"], [4, "conj"], [4, "cpm"], [2, "dobj"], [2, "punct"]],
        [[2, "nn"], [18, "nsubj"], [18, "prep"], [6, "conj"], [6, "cc"], [7, "nn"], [3, "pobj"], [18, "punct"], [10, "rcmod"], [15, "nn"], [15, "punct"], [15, "nn"], [15, "nn"], [15, "nn"], [18, "nsubj"], [15, "punct"], [18, "advmod"], [0, "root"], [18, "punct"]]
      ],
      "semantic_dependencies": [
        [[[2], ["Exp"]], [[0], ["Aft"]], [[4], ["Quan"]], [[0], ["Aft"]], [[8], ["Poss"]], [[7], ["mConj"]], [[8], ["Datv"]], [[11], ["rProd"]], [[8], ["mAux"]], [[11], ["Desc"]], [[12], ["Datv"]], [[2], ["dClas"]], [[2, 12], ["Clas", "Cont"]], [[2, 12], ["mPunc", "mPunc"]], [[16], ["Exp"]], [[17], ["mMod"]], [[2], ["eSucc"]], [[19], ["Desc"]], [[20], ["Pat"]], [[26], ["rProd"]], [[23], ["mPrep"]], [[23], ["Desc"]], [[20], ["Loc"]], [[23], ["mRang"]], [[0], ["Aft"]], [[16], ["Clas"]], [[16], ["mPunc"]]],
        [[[2], ["Poss"]], [[0], ["Aft"]], [[4], ["Exp"]], [[0], ["Aft"]], [[4], ["mPunc"]], [[0], ["Aft"]], [[4], ["eCoo"]], [[4, 7], ["mPunc", "mPunc"]], [[0], ["Aft"]], [[0], ["Aft"]], [[7, 10], ["mPunc", "mPunc"]], [[0], ["Aft"]], [[12], ["mTime"]], [[0], ["Aft"]], [[14], ["mPunc"]], [[0], ["Aft"]], [[0], ["Aft"]], [[20], ["Desc"]], [[18], ["mAux"]], [[0], ["Aft"]], [[0], ["Aft"]]],
        [[[2], ["Desc"]], [[7, 9, 18], ["Exp", "Agt", "Exp"]], [[4], ["mPrep"]], [[0], ["Aft"]], [[6], ["mPrep"]], [[7], ["Datv"]], [[0], ["Aft"]], [[7], ["mPunc"]], [[7], ["eCoo"]], [[0], ["Aft"]], [[0], ["Aft"]], [[13], ["Desc"]], [[0], ["Aft"]], [[0], ["Aft"]], [[0], ["Aft"]], [[0], ["Aft"]], [[18], ["mTime"]], [[0], ["Aft"]], [[18], ["mPunc"]]]
      ]
    }

The output is a JSON-serializable dict, a format most users should find familiar.
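Since the result is a plain dict keyed by the output_key of each stage, downstream code can consume it directly; a trivial sketch:

    doc = pipeline(text)
    for tokens, tags in zip(doc['tokens'], doc['part_of_speech_tags']):
        print(list(zip(tokens, tags)))   # one (word, tag) list per sentence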

Feel free to use your imagination and creativity to add more pre- and post-processing stages to the pipeline (dictionaries, regular expressions, and so on). Remember: any ordinary Python function can serve as a stage. Use pipeline.save('zh.json') to serialize the pipeline and deploy it to a production server.
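As a concrete illustration of a post-processing stage, the sketch below appends an ordinary Python function that counts tokens per sentence and then serializes the pipeline. The count_tokens helper and its output_key are my own invention; append and pipeline.save('zh.json') follow the usage shown above.

    def count_tokens(tokens):
        # 'tokens' is whatever the tokenizer stage stored under its output_key:
        # one token list per sentence.
        return [len(sent) for sent in tokens]

    pipeline.append(count_tokens, input_key='tokens', output_key='token_counts')
    pipeline.save('zh.json')  # serialize the whole pipeline for deployment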

Training Your Own Models

Writing a deep learning model is not hard at all; what is hard is reproducing high accuracy. The following code shows how to train a Chinese word segmentation model that reaches 97% F1 on the MSR corpus.

    # NgramConvTokenizer, the SIGHAN2005_MSR_* corpora and CONVSEG_W2V_NEWS_TENSITE_CHAR
    # are provided by HanLP; see tests/train for the full script with imports.
    tokenizer = NgramConvTokenizer()
    save_dir = 'data/model/cws/convseg-msr-nocrf-noembed'
    tokenizer.fit(SIGHAN2005_MSR_TRAIN,
                  SIGHAN2005_MSR_VALID,
                  save_dir,
                  word_embed={'class_name': 'HanLP>Word2VecEmbedding',
                              'config': {
                                  'trainable': True,
                                  'filepath': CONVSEG_W2V_NEWS_TENSITE_CHAR,
                                  'expand_vocab': False,
                                  'lowercase': False}},
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                     epsilon=1e-8,
                                                     clipnorm=5),
                  epochs=100,
                  window_size=0,
                  metrics='f1',
                  weight_norm=True)
    tokenizer.evaluate(SIGHAN2005_MSR_TEST, save_dir=save_dir)

The training and evaluation logs are shown below.

    Train for 783 steps, validate for 87 steps
    Epoch 1/100
    783/783 [==============================] - 177s 226ms/step - loss: 15.6354 - f1: 0.8506 - val_loss: 9.9109 - val_f1: 0.9081
    Epoch 2/100
    236/783 [========>.....................] - ETA: 1:41 - loss: 9.0359 - f1: 0.9126
    19-12-28 20:55:59 INFO Trained 100 epochs in 3 h 55 m 42 s, each epoch takes 2 m 21 s
    19-12-28 20:56:06 INFO Evaluation results for msr_test_gold.utf8 - loss: 3.6579 - f1: 0.9715 - speed: 1173.80 sample/sec
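Once training finishes, the model saved under save_dir can presumably be loaded back for inference, in the same way the classifier example below calls classifier.load(save_dir); a hedged sketch under that assumption:

    # A sketch, assuming NgramConvTokenizer exposes load() like the classifier below;
    # the input sentence is only an illustration.
    tokenizer = NgramConvTokenizer()
    tokenizer.load(save_dir)
    print(tokenizer('商品和服务'))   # expect a list of segmented words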

Similarly, you can train a sentiment analysis model to classify the polarity of hotel reviews.

    save_dir = 'data/model/classification/chnsenticorp_bert_base'
    classifier = TransformerClassifier(TransformerTextTransform(y_column=0))
    classifier.fit(CHNSENTICORP_ERNIE_TRAIN,
                   CHNSENTICORP_ERNIE_VALID,
                   save_dir,
                   transformer='chinese_L-12_H-768_A-12')
    classifier.load(save_dir)
    print(classifier('前台客房服务态度非常好!早餐很丰富,房价很干净。再接再厉!'))
    classifier.evaluate(CHNSENTICORP_ERNIE_TEST, save_dir=save_dir)

Since corpora are usually domain-specific and BERT models are large, HanLP does not plan to release many pretrained text classification models.

For more training scripts, see tests/train. More usage examples can be found in tests/demo. Documentation and a RESTful API are under development.

Citation

If you use HanLP in your research, please cite it as follows:

    @software{hanlp2,
      author = {Han He},
      title = {{HanLP: Han Language Processing}},
      year = {2020},
      url = {https://github.com/hankcs/HanLP},
    }

License

HanLP is licensed under the Apache License 2.0 and is free for commercial use. Please include a link to HanLP and the license in your product documentation. HanLP is protected by copyright law; infringement will be pursued.

自然语义(青岛)科技有限公司

Since v1.7, HanLP has operated independently, with 自然语义(青岛)科技有限公司 as the project owner, leading the development of subsequent versions and holding their copyright.

大快搜索

HanLP v1.3 through v1.65 was developed under the leadership of 大快搜索, remains fully open source, and 大快搜索 holds the related copyright.

上海林原公司

In its early days, HanLP received strong support from 上海林原公司, which holds the copyright of v1.28 and earlier versions; those versions were also published on the company's website.

References

[^fasttext]: A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of Tricks for Efficient Text Classification," arXiv, cs.CL, Jul. 2016.

[^bert]: J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv, cs.CL, Oct. 2018.

[^biaffine]: T. Dozat and C. D. Manning, "Deep Biaffine Attention for Neural Dependency Parsing," ICLR, 2017.

[^conllx]: S. Buchholz and E. Marsi, "CoNLL-X Shared Task on Multilingual Dependency Parsing," in Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), 2006, pp. 149-164.
