
Using illegally obtained training data… a shortcut to ruining an AI company [Real! AI Pro]

In the era of the great AI transformation, it is not easy to tell what you 'need to know' from what is merely 'good to know' amid the flood of issues and keywords. Unsubstantiated stories abound. [Real! AI Pro] clears up these concerns with topics and insights hand-picked by industry experts.

[Digital Daily, reporter Lee Geon-han] In today's artificial intelligence (AI) industry, it is hard to find anyone who disagrees that data is 'gold' and a key factor in determining the performance of AI models. Valuations of, and expectations for, companies that hold high-quality source data or handle data well are steadily rising. Symbolically, Scale AI, a US company specializing in AI data, was valued at a whopping 19 trillion won last May.

[ⓒ DALL·E – AI-generated image]

However, as interest in data grows, so does the friction over where AI training data comes from. A representative example is the lawsuit the New York Times (NYT) filed against OpenAI in December 2023 over the unauthorized collection and use of its news articles for training. More broadly, the debate over whether data published online qualifies as 'fair use' for AI training appears to be intensifying.

Against this backdrop, there is growing awareness in the AI industry that companies need to guard against unexpected data-related risks by tightening their procedures for collecting and using data.

Because this area has not yet been standardized or legislated, however, concerns are growing as well. What do companies need to consider in order to secure data for AI safely and turn it into a stable business? We hear from Park Chan-jun, a research professor at Korea University and a Korean expert in AI data processing and analysis.

[ⓒ Digital Daily]


What are the criteria for classifying ‘good data’?

Hello, this is Park Chan-jun. The data trend in the AI industry these days is a shift from 'quantitative expansion' to 'qualitative improvement'. Data has always been considered important, but previously the accumulation of massive amounts of data was seen as the main driver of AI performance gains. We have now reached a point where data quality correlates more directly with model performance. As a result, 'high-quality data' has become a key resource that every company and researcher wants to secure, and it can be described as 'data that can improve model performance without raising ethical issues.'

Does training data qualify as fair use? There is no answer yet

However, it is very difficult to obtain data that is free of sourcing and legal issues. Data is fundamentally created by a wide variety of copyright holders, so in practice it is all but impossible to obtain permission from every original author. For this reason, some argue that using data for AI training should be considered fair use, but opinions among researchers and companies differ widely.

Personally, I see fair use as an issue that entangles complex ethical and legal questions, so the ideal is for data providers and the developers who use the data to find a balance that advances AI technology while protecting each other's rights. The scope of fair use needs to be defined in law, but it is hard to do so in a way that does not hinder technological innovation. In the end, this is a problem that calls for a careful approach and interpretation over time.

Even if there is no right answer, there is a way… The case of the ‘1T Club’

Even so, efforts to secure safe data within the limits of the current system can yield real results through close cooperation among stakeholders. As a good example, I can point to the '1T Club (1 Trillion Token Club)' of Upstage, an AI startup where I worked until recently.

The 1T Club, launched under Upstage's leadership in August 2023, is 'an alliance of partner companies that each provide more than 100 million words of Korean data' in formats such as texts, books, articles, reports, and theses. Since its launch, a range of organizations, including 20 media companies, businesses, and academic institutions, have joined as members. Recognizing the importance of securing safe data and of establishing a fair usage environment for contributors and users alike, the 1T Club stood out for considering profit distribution plans from the very beginning.

This plan has been refined continuously since then, and actual profit distribution is now taking place. The structure, in brief, is as follows. Stakeholders in the 1T Club fall broadly into 'Data Contributors', who provide data, and 'Data Consumers', companies that use the data to improve their services and generate profits. When a data contributor submits data to the platform, it goes through automatic preprocessing and becomes 'high-quality data' that serves as the basis for profit distribution. Contributors then receive actual compensation based on the amount of data they contributed.

1T Club's profit distribution formula. T_i is the number of tokens in the data submitted by contributor i, R_api is the total revenue generated from consumers, and a is the percentage of revenue allocated to contributors, which is set with service operating costs in mind. [ⓒ 1 Trillion Token (1TT) Platform: A Novel Framework for Efficient Data Sharing and Compensation in Large Language Model (24.9)]
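The formula image itself did not survive in this copy of the article, so the following is only a minimal sketch of how a payout of this kind could be computed. It assumes a simple pro rata split, in which the contributor pool (the revenue percentage times total consumer revenue) is divided in proportion to each contributor's token count. The function name `distribute_revenue`, the contributor names, and all figures are hypothetical and are not taken from the 1TT paper.

```python
from typing import Dict


def distribute_revenue(tokens_by_contributor: Dict[str, int],
                       total_revenue: float,
                       contributor_share: float) -> Dict[str, float]:
    """Split the contributor pool pro rata by submitted tokens.

    Assumed form (not confirmed by the source):
        payout_i = contributor_share * total_revenue * T_i / sum_j(T_j)
    """
    if not 0.0 <= contributor_share <= 1.0:
        raise ValueError("contributor_share must be between 0 and 1")
    total_tokens = sum(tokens_by_contributor.values())
    if total_tokens <= 0:
        raise ValueError("at least one contributor must have submitted tokens")
    # Portion of consumer revenue set aside for contributors.
    pool = contributor_share * total_revenue
    return {name: pool * tokens / total_tokens
            for name, tokens in tokens_by_contributor.items()}


if __name__ == "__main__":
    # Hypothetical contributors and amounts, for illustration only.
    payouts = distribute_revenue(
        {"publisher_a": 600_000_000,
         "university_b": 300_000_000,
         "press_c": 100_000_000},
        total_revenue=1_000_000.0,
        contributor_share=0.5,
    )
    for name, amount in payouts.items():
        print(f"{name}: {amount:,.2f}")
```

Under this assumption, a contributor's payout rises either when its own token share grows or when consumer revenue grows, which matches the article's description of compensation based on the amount of data contributed.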


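Contributors can also check their share of the contributed data, the number of tokens they have contributed, and the next expected payout based on the current compensation trend. Compensation levels also vary with data consumers' usage patterns, and because this too can be checked in real time, it ultimately encourages contributors to provide better data. From an ecosystem perspective, this can be regarded as an ideal virtuous-cycle model: data contributors gain economic benefits, while data consumers apply high-quality data with guaranteed provenance and usage rights to their AI, raising performance and quality and securing market competitiveness.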

[ⓒ 1 Trillion Token (1TT) Platform: A Novel Framework for Efficient Data Sharing and Compensation in Large Language Model (24.9)]


Illegal data use cannot be undone

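Beyond this, the ways an AI company can secure data fall broadly into three categories. The first is building its own data. By collecting the data it wants directly and processing it as needed, a company can guarantee data quality and stability. It also gains exclusive ownership of the data, but the drawback is that this takes a great deal of money and time.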

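The second is purchasing data. This means buying refined data from an external source; there is an upfront cost, but the advantage is that data can be secured quickly and legally. On the other hand, in contrast to building your own data, it may be difficult to obtain data that fits your needs well enough.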

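The third is using public data. This means using data provided by the government or public institutions; the advantage is that reliable data can be secured at the lowest cost. The limitation is that only data confined to the specific domain of the public sector can be obtained this way.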

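Each approach has its pros and cons, but securing data through legal channels is extremely important. For example, if a company trains its AI model on data obtained through illegal channels, it may later face not only legal disputes but also a serious blow to the model's credibility. Above all, this could bear directly on the survival of the company or service. What is certain is that removing only specific data that a model has already learned, while preserving its performance, is technically very difficult. Furthermore, completely eliminating the influence that already-learned data has had on the model can be said to be impossible.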

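Given this, please remember that if a problem is discovered with the source of the data, it can lead to irreversible reputational and financial losses for the company. Even if it takes a little more time and money, it is important to use safe, legally sound data from the start.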

[ⓒ DALL·E – AI-generated image]


The future of data… What should companies and individuals prepare for?

In addition, some experts predict that high-quality training data will run out within a few years. This means that even if data sources are secured legally, a day may come when there is simply not enough data to train on.

Of course, this future cannot be confirmed yet. To prepare for it, however, companies will need to work on 'efficient use of data' rather than simply increasing the amount of data. As collecting new data becomes harder, it will be important to develop techniques that refine and process existing data into higher-quality data, and data augmentation or simulation techniques can be used to expand datasets effectively.

The ethical and legal issues around data use will also become more important, not less. AI researchers and companies must therefore put forward transparent data utilization plans in order to earn public trust. As public awareness of trustworthy AI and personal data protection grows, people choosing between comparable services will favor the AI whose data transparency is guaranteed.

Finally, I would like to tell the public who use AI that there is no need to worry too much about data misuse. In my experience as an AI researcher, most AI training data is used in a form that does not identify individuals, and various technical safeguards for data security are in place. Rather than worrying excessively, it is more important to deepen our understanding of AI and data, so that we are fully aware of our data sovereignty and can provide data in an environment where our rights can be exercised safely.


