A Look at Publicly Released Korean Pre-trained Language Models (1)
2024-07-04

Recently, research on deep learning-based natural language processing using large-scale data has been booming. Everyone is jumping in, in industry and academia alike. Big tech companies such as Google and Meta, as well as open collaboration projects such as BigScience, are producing remarkable results.

Behind these achievements is the Transformer*, pre-trained on extensive corpus data. Since its introduction, many variants have appeared and performance has improved rapidly. And because most of these language models are trained on large amounts of corpus data through unsupervised learning**, data acquisition has become critically important.

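To make that unsupervised objective concrete, here is a minimal sketch of masked language modeling, the pre-training task behind BERT-style models. It assumes the Hugging Face transformers library, and the multilingual BERT checkpoint is simply a convenient stand-in that covers Korean, not one of the Korean models discussed in this series:

```python
# Masked language modeling: the model must recover a token hidden
# behind [MASK] using only raw, unlabeled text. This is the
# unsupervised objective BERT-style encoders are pre-trained on.
from transformers import pipeline

# bert-base-multilingual-cased is a stand-in checkpoint that was
# pre-trained on Wikipedia in 100+ languages, Korean included.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

# "Seoul is the [MASK] of Korea." -- a well-trained model should
# rank a token meaning "capital" highly.
for prediction in fill_mask("서울은 한국의 [MASK]이다."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

No human labels are involved: the training signal comes entirely from hiding pieces of raw text and asking the model to reconstruct them, which is exactly why the sheer volume of corpus data matters so much.
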
However, for those of us who were born in this country and live our lives in Korean, there is something disappointing about this rapidly advancing language model research. Broadly speaking, research on Korean language models has faced difficulties for the following two reasons.

First, the linguistic characteristics of Korean are very different from those of English. Just as Japanese is generally easier than English for Korean speakers to learn, an AI trained mostly on English is bound to find Spanish much easier to process than Korean. We've already covered this in a previous post, so check out the article below for more details.

- Why is artificial intelligence making Korean more difficult?

Second, and crucially, the amount of training data is directly tied to model performance. Low-resource languages such as Korean are therefore bound to see comparatively limited performance gains. We've also looked at this in past posts on large language models and multilingual models, so please check those out as well.

- Can the open source language model BLOOM become the flower of AI democratization?

- AI that became a linguistic genius: the multilingual (Polyglot) model (1)

- AI that became a linguistic genius: the multilingual (Polyglot) model (2)

However, as the level of Korean natural language processing research rises, Korean-centered models are being studied and published in growing numbers. Leading domestic institutions and companies such as the Electronics and Telecommunications Research Institute (ETRI), Naver, and Kakao are releasing new models one after another. Models such as KorBERT, HyperCLOVA, KoGPT, and EXAONE have appeared in succession, and research continues at this very moment.

Taking this opportunity, we'd like to share a summary of the Korean language models that have been made public so far. Broadly speaking, we have collected them into three groups: encoder models (the BERT*** series), decoder models (the GPT**** series), and encoder-decoder models (the seq2seq***** series).
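
To make the three-way grouping concrete, here is a minimal sketch of loading one representative checkpoint per family. It assumes the Hugging Face transformers library; the checkpoint IDs correspond to model repositories in the references at the end of this post (the exact hub IDs are our assumption, and availability may change):

```python
# One representative Korean checkpoint per model family.
# IDs follow the repositories in the references; verify current
# availability on the Hugging Face hub before use.
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder model (BERT series): maps text to contextual representations.
encoder = AutoModel.from_pretrained("skt/kobert-base-v1")

# Decoder model (GPT series): autoregressive text generation.
decoder = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2")

# Encoder-decoder model (seq2seq series): maps input text to output text.
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("gogamza/kobart-base-v2")
```

In practice, each family pairs with a different kind of task: encoders with understanding tasks such as classification and tagging, decoders with free-form generation, and encoder-decoder models with sequence-to-sequence tasks such as translation and summarization.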

We'll introduce the results step by step in the posts that follow, so stay tuned.

* https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)

** https://en.wikipedia.org/wiki/Unsupervised_learning

*** https://en.wikipedia.org/wiki/BERT_(language_model)

**** https://en.wikipedia.org/wiki/OpenAI#GPT

***** https://en.wikipedia.org/wiki/Seq2seq

References

[1] https://arxiv.org/abs/2112.03014

[2] https://aiopen.etri.re.kr/service_dataset.php

[3] https://github.com/SKTBrain/KoBERT

[4] https://github.com/monologg/HanBert-Transformers

[5] https://github.com/SKT-AI/KoGPT2

[6] https://huggingface.co/gogamza/kobart-base-v2

[7] https://arxiv.org/abs/2101.11363

[8] https://koreascience.kr/article/CFKO202130060717834.pdf

[9] https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5

[10] https://arxiv.org/abs/2105.09680

[11] https://arxiv.org/abs/2109.04650

[12] https://huggingface.co/kakaobrain/kogpt

[13] https://s-space.snu.ac.kr/handle/10371/175838