[AI and copyright laws] Is Your AI Model's Data Usage Compliant with Copyright Laws?

2023/02/14 | Written By: Sungmin Park
 

Generative AI, such as ChatGPT and Midjourney, has become a hot topic, and interest in AI copyright is growing alongside it. Does the data we use when creating AI models comply with copyright law?

This post introduces the copyright laws you need to know to legally create AI-as-a-service models outside of an educational environment. In particular, we will look at cases and questions that commonly come up when creating data for AI models based on NLP (Natural Language Processing) technology.



Why Do We Need to Know Copyright Law?

  • Good AI Models Come From Good Data

When AI models are developed in educational environments such as schools, teachers or curriculum administrators usually prepare data and assignments that are free of copyright issues. However, when building a model on your own to solve a particular problem, you must find or create appropriate data yourself. Carelessly pulling data from the web for model training can violate copyright law without your realizing it. Knowledge of copyright law must therefore come before producing the data needed for AI model development.

Similarly, academics are paying attention to copyright and licensing. Because this blog addresses whether the contents of a thesis violate intellectual property rights and how data is collected, it is also relevant to those in the academic world, who need to properly understand and apply copyright.

Copyright, attracting attention from the academic world (Source: International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021 )

  • The Necessity of Paying Attention: Positively Amending the Law With AI and Creators in Mind

A second reason to be concerned about copyright is that many copyright laws have not yet taken into account the development of AI models. This may seem paradoxical, but it is something that requires attention for the positive development of AI. Mass learning is essential to creating superior AI, but there are no clear standards for copyright infringement when using data for AI training.

Article 1 (Purpose) of the Copyright Act states that “the purpose of this Act is to contribute to the improvement and development of culture and related industries by protecting the rights of authors and rights adjacent to them and promoting the fair use of copyrighted works.” As you can see, the law does not yet consider the “AI industry.” When it was enacted, AI did not receive as much attention as it does today, and its performance was far from what it is now.

Since 2020, a proposed amendment to the Copyright Act has included a new copyright exemption to address the field of AI. While this is being pursued to reflect current trends and advancements, continuous attention is required so that the law is revised with both AI and creators in mind.

What is Copyright Law?

So what exactly is copyright? 

  • Copyright: The right given to creators over works that express human thoughts and emotions. If “creativity” is involved, the right arises automatically, without a separate registration procedure.
    (Example: The copyright of a painting naturally belongs to the artist.)

How does the law describe Copyrighted works?

  • Work: The result of expressing a person's thoughts and emotions

    • Novels, poems, theses, lectures, speeches, screenplays, and other literary works

    • Musical works

    • Drama, dance, pantomime, and other theatrical works

    • Paintings, calligraphy, sculptures, prints, crafts, works of applied art, and other works of art

    • Models and design books for buildings and construction; other architectural works

    • Photographic works (including those produced in similar fashions)

    • Video works

    • Maps, charts, blueprints, schematics, models, and other graphic works

    • Computer program works

There are many types of Copyrighted works, such as the ones listed above, and those who work with AI will have heard much about text and image Copyrighted works specifically. Elements necessary for AI model development, such as literature, music, video, and photography, are protected as Copyrighted works.

However, there are works that are not protected by Copyright law.

  • Works not protected by Copyright law

    • Constitution, laws, treaties, orders, ordinances and rules

    • State or local government notices, announcements, instructions, and other similar matters

    • Judgment, decision, order, adjudication, administrative adjudication, and other resolutions and decisions of the court, etc.

    • Compilation or translation of the contents specified in subparagraphs 1 to 3 prepared by the state or local government

    • Reporting that is merely a statement of fact

These mainly apply to works written by the state and local governments, and include current affairs reports that are difficult to categorize as ‘creative.’

Based on the contents thus far, let's review some concerns about copyright that may arise in real life.

[Case 1]
Q. I am attempting to create and distribute a model that provides case precedent search services. Is that appropriate?

A. Yes. Since court precedents are listed among the works not protected by the Copyright Act, basing commercial services on them or using them for research purposes does not violate the Copyright Act.

[Case 2]
Q. I was so impressed with Upstage blog content that I left a comment. Do I own the copyright to this comment?

A. It depends on the content of the comment. ‘It was so good!’ is a sentence that can be written universally by anyone, and therefore is not protected by copyright. However, a sentence where “creativity” is recognized becomes copyrighted.

In the case of Hemingway’s six-word novel, creativity is recognized, therefore granting copyright to Hemingway.

Copyright occurs naturally for works where creativity is recognized. Let’s use this knowledge to take a closer look at properly using data for AI model training.

How to Use Data Legally

1. Consultation with the author

This means negotiating directly with the copyright holder to agree on how the work will be used. Usually the holder’s homepage lists contact details, such as an email address, through which you can discuss the copyrighted work. According to the contracts specified by the Korea Copyright Commission, the two main options are obtaining permission to use the copyrighted work and acquiring the author's property rights.

Let's look at each of these options.

(1) Exclusive / Non-exclusive License for Copyright

  • Exclusive License: The author grants the right to use their data “exclusively” to the contracted party and cannot license it to anyone else

  • Non-exclusive License: The author remains free to enter into additional data utilization agreements with other parties

(2) Transfer of All / Part of the Author's Property Rights

All or part of the author's property rights, which arise automatically with the work, can be transferred. The rights can be acquired in whole or in part, and it is also possible to take them over for a limited period of time only.

Is there any other way than forming a contract? There is an efficient method for both users and authors: a “license.”

2. License

The second way to legally utilize data is through a ‘license,’ or the terms of use stated by an author. A license is a stipulation that allows individuals to utilize a work if certain conditions are met, even without a formal request for permission.

A variety of organizations issue licenses, but the most famous among them is ‘CCL,’ the Creative Commons License, issued by a non-profit organization called Creative Commons. ‘Gonggongnuri’ provides a similar function, produced by the Ministry of Culture, Sports and Tourism in Korea.

  • MEANING OF CCL

    • BY: Attribution

    • ND: NoDerivatives

    • NC: Non-Commercial

    • SA: ShareAlike

CCL used internationally (Source: Creative Commons homepage)
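As a minimal sketch of how these codes combine, the snippet below splits a license code into the conditions listed above. The helper name `describe_cc_license` is hypothetical and is not part of any official Creative Commons tool.

```python
# Condition codes as listed above; the helper name is hypothetical.
CC_CONDITIONS = {
    "BY": "Attribution",
    "ND": "NoDerivatives",
    "NC": "Non-Commercial",
    "SA": "ShareAlike",
}


def describe_cc_license(code: str) -> list[str]:
    """Split a code such as 'CC-BY-NC-SA' into its condition names."""
    prefix, *parts = code.upper().split("-")
    if prefix != "CC":
        raise ValueError(f"not a CC license code: {code!r}")
    unknown = [p for p in parts if p not in CC_CONDITIONS]
    if unknown:
        raise ValueError(f"unknown condition(s): {unknown}")
    return [CC_CONDITIONS[p] for p in parts]


print(describe_cc_license("CC-BY-NC-SA"))
# ['Attribution', 'Non-Commercial', 'ShareAlike']
```

Reading a license code this way makes it easier to see at a glance which conditions a dataset carries before you use it.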

A typical example of CC-BY-NC-SA is ‘Namuwiki.’ Namuwiki data can be used for AI model development under the following conditions.

[Case 3]
Q. Is it possible to create an MRC (Machine Reading Comprehension, a technology in which an AI model reads a passage and finds the answer to a question about it) dataset from Namuwiki data, then distribute it through my personal GitHub?

A. Yes, as long as the use is non-profit, for example for educational purposes. However, when distributing it, you must state the source of the original data and attach the same CC-BY-NC-SA license that the original data carries.
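One way to carry the license and source along with a derived dataset is a small notice file in the repository. The sketch below is only an illustration: the function name, the field names, and the dataset name `my-mrc-dataset` are all assumptions, not a required format.

```python
import json


def build_dataset_notice(dataset_name: str, source_name: str, license_code: str) -> dict:
    """Collect the attribution details to ship alongside a derived dataset."""
    return {
        "dataset": dataset_name,
        "source": source_name,    # where the original data came from
        "license": license_code,  # same license as the original data
        "commercial_use": "NC" not in license_code.upper().split("-"),
    }


notice = build_dataset_notice("my-mrc-dataset", "Namuwiki", "CC-BY-NC-SA")
# Save this next to the data files so the license travels with the dataset.
print(json.dumps(notice, ensure_ascii=False, indent=2))
```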


Let's take a look at another example: CC-BY-ND. This license combines BY (Attribution) and ND (NoDerivatives). A dataset called ‘KorQuAD,’ well known to those who work with Korean NLP, is distributed under this license.

[Case 4]
Q. After creating a new MRC dataset by changing only KorQuAD's questions, can I distribute it on my personal GitHub?

A. No. Because the ND condition prohibits modification, it is not appropriate to alter KorQuAD's passages, questions, or answer pairs and then release the result.
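The reasoning in Cases 3 and 4 can be sketched as a rough pre-release check. The function `can_distribute` is hypothetical and covers only the ND and NC conditions. The BY and SA obligations (crediting the source, keeping the same license) still apply whenever it returns `True`.

```python
def can_distribute(license_code: str, *, modified: bool, commercial: bool) -> bool:
    """Rough pre-release check against the CC conditions in a license code."""
    conditions = set(license_code.upper().split("-")[1:])
    if modified and "ND" in conditions:
        return False  # NoDerivatives: altered versions may not be shared
    if commercial and "NC" in conditions:
        return False  # Non-Commercial: no commercial distribution
    return True


# Case 4: a modified dataset under CC-BY-ND
print(can_distribute("CC-BY-ND", modified=True, commercial=False))     # False
# Case 3: a non-commercial derived dataset under CC-BY-NC-SA
print(can_distribute("CC-BY-NC-SA", modified=True, commercial=False))  # True
```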


What other Copyright-related cases are most often encountered while developing AI models?


Copyright Cases Often Encountered While Working With AI

Use of News Data

News data is commonly used in developing AI models. However, the copyright of a news article belongs to the media company that published it.

Currently, the Korea Press Foundation manages the copyrights of most media companies on consignment. Therefore, to use a news article legally, if the media company providing the article has entrusted its copyright to the Korea Press Foundation, you must contact the foundation directly. Alternatively, you can ask the media company about the scope of use and the terms of the contract. Note that major media outlets often manage their copyrights themselves without entrusting them to the Korea Press Foundation. In very rare cases, media outlets (e.g., Wikitree) apply a CCL, so it is important to verify the copyright terms of each source according to your purpose of use.

Sometimes, the Korea Data Exchange (KDX) publishes news data for free. But to what extent can this data be used?

[Case 5]

Q. Can I use the data I purchased for 0 won?
A. In this case, it depends on the terms and conditions set by the data seller.

Source: KDX Korea Data Exchange

KDX data can be used only within the common scope of use defined in Articles A, B, and C. If the seller has set additional conditions, uses outside the common scope may not be possible. Please verify carefully.

Titles of News Articles

Surprisingly, the title of a news article is not protected by copyright law, because a title is not recognized as having value as a copyrighted work. This is stated in the booklet “Newspapers and Copyright” issued by the Korea Copyright Commission.

Source: Newspapers and Copyright, Korea Copyright Commission, 2009

If you want to build a model that predicts which category a news article belongs to just by looking at the headline, you can use this data legitimately.
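As a minimal sketch of preparing such data, the snippet below keeps only the headline and category of each record and drops the article body. The field names `title`, `category`, and `body` and the sample records are made-up placeholders, not a real news data schema.

```python
def headline_training_pairs(articles: list[dict]) -> list[tuple[str, str]]:
    """Keep only (headline, category) pairs and drop the article body."""
    return [(a["title"], a["category"]) for a in articles]


# Made-up placeholder records for illustration only.
articles = [
    {"title": "Central bank holds rates steady", "category": "economy", "body": "..."},
    {"title": "National team wins opening match", "category": "sports", "body": "..."},
]

pairs = headline_training_pairs(articles)
print(pairs)
```

Dropping the body before the data is stored keeps the training set limited to the headlines.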

Fair Use

In the following cases, copyrighted works can be used without obtaining prior permission from the copyright holder. Educational purposes are among them, so copyrighted works can be used for education within the scope the law sets out.

  • Education, etc.

  • Duplication in court proceedings, etc.

  • Use of political speeches, etc.

  • Use for school educational purposes, etc.

  • Use for current affairs reporting

  • Use of published works

  • Non-profit performance/broadcasting

  • Reproduction for private use

  • Reproduction in libraries, etc.

  • Replication as test questions

  • Reproduction for the blind, etc.

  • Temporary recording and broadcast recordings

  • Exhibition or reproduction of works of art, photography or architecture

  • Use by translation, etc.

  • Reproduction of topical articles and editorials

  • Reverse program code analysis

  • Reproduction of programs for preservation by legitimate users

Source: Korea Copyright Commission

The Gray Area of Copyright Law: AI

While copyright law appears to regulate many areas, the law relating to AI still has a long way to go. Can the data generated by ChatGPT, a recent trending topic, be recognized as a copyrighted work? If so, it is still undecided which license should be attached to that data. The extent to which it can be used, as well as what happens to the copyright when AI models produce new results based on news articles, must also be considered.

Until clear standards are established, it is necessary to check the copyright and licensing of any work and carefully examine its scope of use. In this post, we mainly explored the CCL. However, there are many other types of licenses, so be sure to verify the terms before utilizing any data.

We hope that everyone working with AI now understands how to create data within legal boundaries. We also hope that AI modeling will develop positively and, in doing so, prompt a closer look at the limits of current copyright law.
