NER's Present and Future Ver. 2: Korean NER Data Set Summary
2024-07-04

‍

This article updates the “NER's Present and Future” series, published on this blog in 2021, to reflect the latest trends.

‍

NER's present and future: 01. From concepts to diverse approaches

NER's present and future: 02. Model structure and data set status

NER's present and future: 03. Future development direction and goals

‍

‍

Introduction

‍

As interest in corpora has grown in recent years, the number of Korean Named Entity Recognition (NER) datasets has increased. The biggest difference from before is the presence of tag sets (the categories used for annotation). Now that the tag set standard of the Telecommunications Technology Association (hereinafter TTA) has become common, most Korean NER data is created according to the 15 major categories or 150 subcategories of the TTA tag set.

‍

By the way, if you remember an earlier article in this series, NER's present and future: 01. From concepts to diverse approaches, you may have one question.

“What? Animals, plants, materials, terms... aren't these obviously domain-specific NE (named entities of a specific field)?”

Yes, that's right. However, because the LETR team researches machine translation, the last article did not focus on domain-specific NE. Someone who works with medical text data might well have pounded their chest in frustration, insisting that domain-specific NE is more important. If I may offer an excuse, each person needs different data depending on the job and field they work in.

‍

So let me explain a little more about why the LETR team had no choice but to prefer generic NE (general-purpose named entities). To be honest, the bottom line is that generic NE is harder to handle in machine translation. Of course, it is just as painful when a domain-specific NE is mistranslated, but terms from specialized fields more often turn out to be Out-of-Vocabulary (OOV, terms not in the vocabulary), so it is faster to apply a dictionary* to them in the first place.

(If I may slip in a little promotion here, the LETR team is already developing and servicing a translator that uses a translation dictionary* to overcome these limitations.)

* Translation Dictionary (TD): A kind of 'jargon dictionary', a customized database built from previously translated documents. Referring to it when translating new documents improves the consistency and accuracy of the translation, and with them its overall quality.

‍

But generic NE can't be handled that way. One simple example makes this clear. Suppose a person is named “Yuri,” which is also the Korean word for “glass.” What happens if you put “Yuri” into every sentence where the word for “glass” appears? On the other hand, in the case of “hydrogenated polyisobutene,” you can confidently put “Hydrogenated Polyisobutene” into every sentence where this term appears.

In other words, NER is bound to play a growing role in handling terms that are risky to translate by batch dictionary lookup, such as person names and organization names.
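The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LETR team's implementation: the `Token` type, the two dictionaries, the `apply_dictionary` function, and the label `PS` for a person are all hypothetical names chosen for the example. The idea is that a domain term is replaced wherever it appears, while an ambiguous entry is replaced only when NER has tagged the token as a person.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    ner: str  # assumed labels: "PS" = person, "O" = not an entity


# Hypothetical dictionary entries, for illustration only.
# The key below is a stand-in for the source-language domain term.
DOMAIN_TERMS = {"hydrogenated polyisobutene": "Hydrogenated Polyisobutene"}
# "유리" is both a given name (Yuri) and the common noun for glass.
NAME_TERMS = {"유리": "Yuri"}


def apply_dictionary(tokens):
    """Replace dictionary terms, guarding ambiguous names with the NER label."""
    out = []
    for tok in tokens:
        if tok.text in DOMAIN_TERMS:
            # Domain terms are unambiguous: safe to replace in batch.
            out.append(DOMAIN_TERMS[tok.text])
        elif tok.text in NAME_TERMS and tok.ner == "PS":
            # Replace only when NER says this occurrence is a person.
            out.append(NAME_TERMS[tok.text])
        else:
            # Leave everything else for the translation model itself.
            out.append(tok.text)
    return out


tokens = [
    Token("유리", "PS"),  # the person
    Token("유리", "O"),   # the material
    Token("hydrogenated polyisobutene", "O"),
]
result = apply_dictionary(tokens)
```

Without the `tok.ner == "PS"` check, both occurrences of “유리” would become “Yuri,” which is exactly the batch-replacement failure described above.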

‍

‍

Korean NER dataset

‍

Now let's get to the 'Korean NER Dataset', which is the main point of this article. First, a quick note: you can set aside the Korean NER datasets I introduced in the previous post. For example, the Naver NER dataset contains automatically generated sentences, so the Korean sentences themselves contain many errors. (On the related GitHub page, I found and confirmed comments raising this issue and stakeholders' responses to them.)

‍

As mentioned at the beginning of this article, much of the data is annotated with tag sets that follow the TTA categories. Of course, some people define tag sets to suit their own needs. In any case, most publicly available datasets use one of three types of tag sets.

‍

First, a four-category tag set consisting of organization name, person name, product name, and work name.

‍

Second, a 15-category tag set that follows the TTA major classification criteria.

‍

Third, a 150-category tag set that follows the TTA subclassification criteria.
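To make the relationship between the second and third tag sets concrete, here is a minimal sketch of collapsing a fine-grained label into its major category. It rests on assumptions that should be checked against the dataset you actually use: the two-letter codes in `TTA_MAJOR` are the codes commonly seen for the 15 major categories, and the rule that a subcategory label such as `PS_NAME` or `LCP_COUNTRY` starts with its major category code is a convention seen in commonly distributed labels, not something stated in this article. The `to_major` function is a hypothetical helper.

```python
# Assumed codes for the 15 major categories; verify against your dataset.
TTA_MAJOR = {
    "PS": "person",
    "FD": "study field",
    "TR": "theory",
    "AF": "artifact",
    "OG": "organization",
    "LC": "location",
    "CV": "civilization",
    "DT": "date",
    "TI": "time",
    "QT": "quantity",
    "EV": "event",
    "AM": "animal",
    "PT": "plant",
    "MT": "material",
    "TM": "term",
}


def to_major(fine_tag: str) -> str:
    """Map a fine-grained label to its major category code.

    Assumes the first two letters of the label are the major code.
    """
    prefix = fine_tag[:2]
    if prefix not in TTA_MAJOR:
        raise ValueError(f"unknown major category for tag: {fine_tag!r}")
    return prefix


major = to_major("LCP_COUNTRY")
```

A mapping like this lets a model trained on the 15-category scheme be evaluated on data annotated with the 150-category scheme, as long as the prefix convention holds for that data.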

‍

Because of the sheer volume, please refer to the following National Institute of Korean Language research report for definitions, detailed explanations, and examples of each subcategory.

2021 named entity analysis and entity linking corpus study analysis

‍

‍

In closing

‍

While researching for this update, we discovered one unusual thing: Twigfarm, a company that deals with machine translation, has not only published a parallel corpus but also released NER data.

Yes, that's right. Twigfarm is the same company the LETR team is part of.

‍

Finally, I'll finish by summarizing the parallel corpus data in a chart.

[1] https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71263

[2] https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71263

[3] https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71265

[4] https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71265

[5] https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71265

[6] https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=71266

[7] https://corpus.korean.go.kr/

[8] to [13] Same as [7]

‍
