NER's present and future: 03. Future development direction and goals
2024-07-04


This post has been updated to match the latest trends as of 2023, so please refer to the article below.

NER's Present and Future Ver. 2: Korean NER Data Set Summary


This is the third installment in our 'NER's present and future' series, covering the topic 'Future development direction and goals'. It continues from the first installment, 'From Concepts to Various Approaches', and the second, 'Model Structure and Data Set Status', so we recommend reading those first.

  • 'NER's present and future: 01. From concepts to diverse approaches'
  • 'NER's present and future: 02. Model structure and data set status'

    Development direction of the NER model

    In practice, the most effective approach is to further train an existing model to obtain better results.

     

    The LETR team chose the ner_ontonotes_bert_mult model* from the DeepPavlov library for the following reasons:

    1. It supports the largest number of languages (104).
    2. It covers the most diverse set of classes (18 NE types).
    3. Its data processing speed is acceptable.
    4. Its recall is noticeably high.
    5. It is easy to use, so practitioners can adapt to it quickly.

    The model's embedding is 700 MB in size, the model itself is 1.4 GB, and it recorded an F1 score of 88.8 on the OntoNotes data set*. (DeepPavlov also provides separate NER models specialized for Russian and Vietnamese.)

     

    For the same reasons, we were able to proactively develop our NER model by further training DeepPavlov's ner_ontonotes_bert_mult model (hereafter, the base model).


    The necessity of a Korean NER data set

    Appropriate data is essential for model training, but Korean NER data sets are still scarce. In particular, there is no Korean NER data set with the 18 NE types of the OntoNotes scheme used by the base model, which the LETR team requires. We therefore first propose a plan for constructing a Korean NER data set and, going further, propose a new direction for Korean NER models.

     

    How to secure original data

    1. Data already in hand

    - TED* corpus

    - Collection of Korean-English contracts: 100,000 sentences (English)

    - AI HUB*: Korean-English parallel corpus of 160 sentences

    2. Data obtainable in the future

    - AI HUB: 10,000 sentences of Korean conversation; 270,000 sentences of emotional conversation

    - 3 million sentences built through the data construction support project* for artificial intelligence learning


    Procedure for organizing the data

     

    To make data organization more efficient, we first run NER with the existing model and then have workers inspect the output. To do this, however, the data must be restructured into a form suitable for inspection, and the inspected data must then be converted back into a form suitable for the model. Specifically, the data is organized in the following order.

     

    1. Run NER with the existing model

    2. Data purification (excluding sentences that contain no NE)

    The existing model is less precise for Korean, so it may report NEs in sentences that have none. We therefore use the following two methods:

    (1) (For multilingual data) Cross-check by running NER on the corresponding language pair

    (2) (Optional) Check via crowdsourcing (label each sentence as containing an NE or not)

    3. Data processing

    The data is processed into a form suitable for crowdsourcing.

    4. First inspection by crowdsourced workers

    5. Second inspection by the manager

    6. Convert the inspected sentences into a form that can be fed to the model
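Step 2 above, excluding sentences without any NE, can be sketched as follows. The BIO-format data structure here is an illustrative assumption, not the LETR team's actual pipeline:

```python
def drop_sentences_without_ne(sentences):
    """Keep only sentences whose predicted tags contain at least one NE.

    `sentences` is a list of (tokens, tags) pairs in BIO format;
    a sentence tagged entirely "O" contains no named entity.
    """
    return [
        (tokens, tags)
        for tokens, tags in sentences
        if any(tag != "O" for tag in tags)
    ]

data = [
    (["Young-hee", "lives", "in", "Seoul"], ["B-PERSON", "O", "O", "B-GPE"]),
    (["It", "is", "raining"], ["O", "O", "O"]),
]
kept = drop_sentences_without_ne(data)  # only the first sentence survives
```

In practice this filter would run on the base model's predictions, so the cross-check and crowdsourced review described above are still needed to catch its mistakes.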


    The specific form of the data

     

    1. Tagging scheme and NE types of the Korean NER data set to be constructed

    The tagging scheme of the Korean NER data set also follows the OntoNotes rules. Using the BIO tagging scheme, NEs are classified into the 18 categories shown in the table below.


    2. The form of data fed to the model

    The data fed to the model takes the following form.

    As shown above, the data consists entirely of text. Tags and tokens are separated by whitespace, and sentences are separated by empty lines.

    The data set is divided into train, test, and validation splits at a ratio of 8:1:1.
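An 8:1:1 split like the one described could be sketched as follows; the function name and fixed seed are illustrative assumptions, not the team's actual code:

```python
import random

def split_8_1_1(sentences, seed=0):
    """Shuffle sentences and split them into train/test/valid at an 8:1:1 ratio."""
    sents = list(sentences)
    random.Random(seed).shuffle(sents)
    n = len(sents)
    n_test = n // 10
    n_valid = n // 10
    n_train = n - n_test - n_valid  # remainder goes to train
    train = sents[:n_train]
    test = sents[n_train:n_train + n_test]
    valid = sents[n_train + n_test:]
    return train, test, valid

train, test, valid = split_8_1_1(range(100))
# len(train), len(test), len(valid) == 80, 10, 10
```

Shuffling before splitting matters here: sentences drawn from the same document tend to be adjacent, and a sequential split would leak document-level vocabulary between splits.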

    ‍

    3. The type of data used during inspection

    Information about the entity type is placed in angle brackets (< >) before and after the entity name.

    (example)

    Hello? My name is <PERSON>Young-hee</PERSON>. My birthday is <DATE>October 26th</DATE>. I live in <GPE>Seoul</GPE>. I am a <NORP>Korean</NORP> who speaks <LANGUAGE>Korean</LANGUAGE>.
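Converting such inspected, inline-tagged sentences back into the whitespace-separated BIO format (step 6 of the procedure above) could look like this; the regex and the naive whitespace tokenization are simplifying assumptions for illustration:

```python
import re

# Matches an inline annotation such as <PERSON>Young-hee</PERSON>.
TAG_RE = re.compile(r"<(?P<type>[A-Z_]+)>(?P<text>.*?)</(?P=type)>")

def inline_to_bio(sentence):
    """Convert inline-annotated text to a list of (token, BIO-tag) pairs."""
    pairs = []
    pos = 0
    for m in TAG_RE.finditer(sentence):
        # Tokens before the annotation lie outside any entity.
        for tok in sentence[pos:m.start()].split():
            pairs.append((tok, "O"))
        # First token of the entity gets B-, the rest get I-.
        for i, tok in enumerate(m.group("text").split()):
            pairs.append((tok, ("B-" if i == 0 else "I-") + m.group("type")))
        pos = m.end()
    for tok in sentence[pos:].split():
        pairs.append((tok, "O"))
    return pairs

pairs = inline_to_bio("My birthday is <DATE>October 26th</DATE>.")
# [('My', 'O'), ('birthday', 'O'), ('is', 'O'),
#  ('October', 'B-DATE'), ('26th', 'I-DATE'), ('.', 'O')]
```

A production version would need to match the corpus tokenizer exactly, since the BIO tags must align token for token with the text fed to the model.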


    Calculating the target number of data sets

    When 41,969 sentences were extracted from various fields such as media, culture, science, anthropology, philosophy, and economics, named entities were recognized in 2,453 of them, a ratio of about 5.8%. (Keep in mind, however, that this is the ratio for written sentences; the ratio for colloquial language may differ.)

     

    In other words, if we simply assume that about 5% of the sentences in the entire corpus contain named entities, we can estimate that about 250,000 of approximately 5 million sentences will contain them. We therefore aim for a data set of 250,000 sentences containing named entities.
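The estimate above works out as follows:

```python
# Observed share of sentences containing a named entity.
ratio = 2453 / 41969            # about 5.8%

# Rounding down to a conservative 5% and applying it to a 5-million-sentence corpus:
target = int(5_000_000 * 0.05)  # 250,000 sentences with named entities
```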


     

    In closing

    As stated earlier, NER plays a very important role in information retrieval, so it is actively researched in natural language processing. In particular, because the names of people, organizations, and regions can be detected automatically, translation quality improves as translation errors are prevented, and user satisfaction can also be greatly increased through translations customized to each field.

     

    Nevertheless, NER data sets specific to Korean are still scarce. To overcome this shortage of data, the LETR team is building a Korean-centered data set and training a higher-performance Korean NER model on it to enable more accurate and natural translation.


    Of course, machine translation at the level of a professional translator will not be possible right away. But as we continue to advance the technology, we believe we will soon create the better world we dream of, where everyone can communicate without language barriers.


    * ner_ontonotes_bert_mult model, DeepPavlov library: https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/configs/ner/ner_ontonotes_bert_mult.json
    * OntoNotes data set: https://catalog.ldc.upenn.edu/LDC2013T19
    * TED (https://www.ted.com): https://ko.wikipedia.org/wiki/TED
    * AI HUB (https://aihub.or.kr): https://ko.wikipedia.org/wiki/AI_Hub
    * Data construction support project for artificial intelligence learning: a core project of the Digital New Deal 'Data Dam', organized by the Ministry of Science and ICT and the Korea Intelligent Information Society Promotion Agency; Twig Farm was selected as an executing agency for the 'Building Data for AI Learning' project


    NER's present and future

  • NER's present and future: 01. From concepts to diverse approaches
  • NER's present and future: 02. Model structure and data set status
  • NER's present and future: 03. Future development direction and goals

