NER's Present and Future Ver. 2: Korean NER Data Set Summary

2024-07-04

This article updates the “NER's Present and Future” series, published on this blog in 2021, to reflect the latest trends.

NER's present and future: 01. From concepts to diverse approaches

NER's present and future: 02. Model structure and data set status

NER's present and future: 03. Future development direction and goals

Getting started

As interest in corpora has grown in recent years, the number of Korean Named Entity Recognition (NER) datasets has increased. The biggest difference from before is the presence of tag sets (the set of categories that the annotation covers). Now that the tag set standard of the Telecommunications Technology Association (hereinafter TTA) has become common, most Korean NER data is created according to either the 15 major categories or the 150 subcategories of the TTA tag set.

By the way, if you remember the earlier article in this series, NER's present and future: 01. From concepts to diverse approaches, you probably have one question.

“What? Animals, plants, materials, terms... aren't these obviously domain-specific NEs (named entities from a specialized field)?”

Yes, that's right. However, since the LETR team researches machine translation, the previous article did not focus on domain-specific NEs. Someone who works with medical text data might well have pounded their chest in frustration, insisting that domain-specific NEs matter more. In our defense, I think it comes down to each person needing different data depending on their job and field.

So let me explain a little more about why the LETR team had no choice but to prefer generic NEs (general-purpose named entities). To be honest, the bottom line is that generic NEs are harder to handle in machine translation. Of course, it is also painful when a domain-specific NE is mistranslated. But specialized terms more often turn out to be out-of-vocabulary (OOV), that is, missing from the vocabulary, so it is faster to apply a dictionary* to them from the start.

(If I may do a little promotion here, the LETR team already develops and operates a translator that uses a translation dictionary* to overcome these limitations.)

* Translation Dictionary (TD): a kind of 'terminology dictionary', a customized database built from previously translated documents. Referring to it when translating new documents improves the consistency and accuracy of the translation, which can greatly improve translation quality.

But generic NEs can't be handled that way. One simple example makes this clear. In Korean, 유리 is both the personal name “Yuri” and the common noun “glass.” What happens if a person is named “Yuri” and you put “Yuri” into every sentence where “glass” appears? By contrast, in the case of “hydrogenated polyisobutene,” you can confidently put “Hydrogenated Polyisobutene” into every sentence where the term appears.

In other words, NER's role is bound to grow for terms such as person names and organization names, which are risky to handle by applying a dictionary in bulk.
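To make the risk concrete, here is a minimal Python sketch of the idea, not the LETR team's actual implementation. The glossary entries, the tag names `PERSON` and `ORG`, and the hand-written tags standing in for NER output are all assumptions made up for illustration.

```python
# Toy glossary: the entries are illustrative, not from any real translation dictionary.
GLOSSARY = {
    "유리": "glass",
    "수소화폴리이소부텐": "Hydrogenated Polyisobutene",
}

# Hypothetical tag names; a real NER model or dataset may use different labels.
PROTECTED_TAGS = {"PERSON", "ORG"}


def apply_glossary(tokens, glossary):
    """Replace every token that has a glossary entry, regardless of context."""
    return [glossary.get(token, token) for token in tokens]


def apply_glossary_with_ner(tokens, tags, glossary, protected=PROTECTED_TAGS):
    """Replace glossary terms, but leave tokens tagged as protected entities untouched."""
    return [
        token if tag in protected else glossary.get(token, token)
        for token, tag in zip(tokens, tags)
    ]


# "Yuri wiped the glass": the first 유리 is a person, the second is the material.
tokens = ["유리", "는", "유리", "를", "닦았다"]
tags = ["PERSON", "O", "O", "O", "O"]  # hand-written tags standing in for NER output

naive = apply_glossary(tokens, GLOSSARY)
guarded = apply_glossary_with_ner(tokens, tags, GLOSSARY)
print(naive)
print(guarded)
```

The blind version turns the person's name into “glass” as well, while the NER-guarded version keeps the name and still translates the material. The chemical term is safe either way, because it has only one reading.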

Korean NER dataset

Now let's get down to the 'Korean NER Dataset', which is the main point of this article. First, a word of caution: you can set aside the Korean NER datasets I introduced in the previous post. For example, the Naver NER dataset contains automatically generated sentences, so the Korean sentences themselves contain many errors. (On the related GitHub page, I found comments raising this issue, along with responses from the people involved.)

As mentioned at the beginning of this article, much of the data is organized into tag sets that follow the TTA categories. Of course, some creators define tag sets to suit their own needs. In any case, most of the datasets currently available to the public use one of three types of tag sets.

First, a four-category tag set consisting of organization name, person name, product name, and work name.

Second, a 15-category tag set that follows the TTA major classification criteria.

Third, a 150-tag set that follows the TTA subclassification criteria.
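The second and third tag sets are two granularities of the same scheme, so fine-grained labels can be collapsed into major categories when comparing datasets. The Python sketch below assumes a naming pattern seen in publicly released TTA-style corpora, where the first two letters of a fine-grained label give its major category (for example `LCP_COUNTRY` under `LC`). That pattern and the sample annotations are assumptions for illustration, so check them against the tag set documentation of the dataset you actually use.

```python
from collections import Counter


def to_major_category(fine_tag: str) -> str:
    """Collapse a fine-grained label to its major category.

    Assumption: the major category code is the first two letters of the label.
    The non-entity label "O" is passed through unchanged.
    """
    if fine_tag == "O":
        return "O"
    return fine_tag[:2]


# Hypothetical annotations as (surface form, fine-grained label) pairs.
annotations = [
    ("세종대왕", "PS_NAME"),
    ("대한민국", "LCP_COUNTRY"),
    ("삼성전자", "OGG_ECONOMY"),
    ("2024년", "DT_YEAR"),
]

coarse = [(text, to_major_category(tag)) for text, tag in annotations]
major_counts = Counter(tag for _, tag in coarse)
print(coarse)
print(major_counts)
```

Going the other way, from major categories to subcategories, is not possible without re-annotation, which is one reason to check which granularity a dataset was labeled at before combining it with others.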

Because of their volume, the definitions, detailed explanations, and examples for each subcategory are not reproduced here. Please refer to the following National Institute of Korean Language research report instead.

2021 Named Entity Analysis and Entity Linking Corpus Research Analysis

Wrapping up

While researching for this update, we discovered one unusual thing. Twigfarm, a company that works on machine translation, has released NER data in addition to publishing a parallel corpus.

Yes, that's right. Twigfarm is the company the LETR team belongs to.

Finally, I'll finish by summarizing the parallel corpus data in a chart.