Yao Qian: Some Thoughts on the Ecosystem Construction of Large Models
Author | Yao Qian (Director of the Technology Supervision Bureau, China Securities Regulatory Commission)
Source | China Finance, Issue 13, 2023
Entering 2023, content-generation AI applications such as ChatGPT, GPT-4, and Midjourney have triggered wave after wave of innovation; some even say that large models are now iterating on a daily basis. As a new factor of production, the benign and sustainable development of training data is crucial to the large-model and broader AI industries. As an important field for big data and AI applications, the financial industry should pay close attention to the latest developments in technologies related to large-model training. This paper first analyzes the evolution and upgrade path of large models, then discusses possible modes of interaction between large models and small and medium-sized models, and finally examines the construction of the data and model ecosystems around large models, offering ideas for the sustainable development of the large-model ecosystem.
Analysis of the upgrade and evolution path of large models
From a long-term perspective, the evolution of large models has many branches. Recently, not only has the iteration speed of large models accelerated, but the number of participants has also grown, essentially covering all major technology companies, and the diversity and complexity of the ecosystem have begun to take shape.
At present, there has been no essential change in the underlying algorithmic framework during the iterative upgrading of large models; computing power investment and abundant training data remain the keys to their rapid evolution. The latest GPT-4, however, presents some new features.
**First, the algorithm is better adapted to specific downstream tasks.** GPT-3 and GPT-3.5 are large models with 175 billion parameters. OpenAI has not announced GPT-4's parameter count, but some speculate that it reaches the trillion level. GPT-4 also shows significant improvement in reinforcement learning and in solving specific tasks; the popular term for this is "alignment". If the GPT-3 series proved that artificial intelligence can perform multiple tasks within one model, then GPT-4 has reached or even surpassed human level on many tasks, scoring roughly in the top 10% of human test takers on some of them.
**Second, it has more standardized training data governance capabilities and supports multimodality.** GPT-4 has multimodal capabilities "comparable to the human brain", which do not differ greatly from the multimodal mechanisms described in many current papers, but it combines the few-shot capability of the text model with chain-of-thought (CoT) reasoning (a minimal prompting sketch is given after these three points). The governance and supply of GPT-4's training data depend on data labeling, data management and evaluation, data automation, and data synthesis.
**Third, it relies on more powerful computing clusters to accommodate larger training data sets and more input parameters.** For example, Microsoft has devoted more than half of its cloud resources to large-model training and AI-generated content (AIGC) applications. Nvidia has even joined forces with TSMC, ASML, and Synopsys to create a new computing platform and more powerful GPUs.
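The few-shot plus chain-of-thought combination mentioned in the second point can be illustrated with a minimal prompting sketch in Python. The worked example, question, and prompt wording below are hypothetical and only show the general prompt pattern; they do not describe GPT-4's actual setup.

```python
# Minimal sketch of few-shot chain-of-thought (CoT) prompting.
# Each demonstration includes intermediate reasoning steps, not just the final answer.

FEW_SHOT_COT_EXAMPLES = [
    {
        "question": "A fund charges a 1.5% annual fee on 200,000 yuan. What is the fee?",
        "reasoning": "The fee is 1.5% of 200,000, i.e. 0.015 * 200,000 = 3,000.",
        "answer": "3,000 yuan",
    },
]

def build_cot_prompt(new_question: str) -> str:
    """Assemble a few-shot CoT prompt: demonstrations with reasoning, then the new question."""
    parts = []
    for ex in FEW_SHOT_COT_EXAMPLES:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: Let's think step by step. {ex['reasoning']} "
            f"So the answer is {ex['answer']}.\n"
        )
    parts.append(f"Q: {new_question}\nA: Let's think step by step.")
    return "\n".join(parts)

if __name__ == "__main__":
    print(build_cot_prompt("A bond pays 4% interest on 50,000 yuan per year. How much is that?"))
```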
Build an ecosystem where various models are interconnected
GPT-like large models are powerful and will become important infrastructure in many industries, including the Internet, finance, and healthcare. In the financial field, for example, after training with relevant professional data, a large model can understand financial business knowledge and propose solutions for specific scenarios, supporting financial institutions in marketing automation, customer relationship mining, intelligent risk identification, intelligent customer service, smart investment research, and more.
However, in implementing specific applications, GPT-like large models face a series of challenges. The first is how to ensure the quantity and quality of training data. Generally, the training corpus of a large model is general-purpose corpus drawn from multiple fields, whereas collecting professional corpus is usually time-consuming and labor-intensive and raises privacy issues; as a result, large models may lack sufficient domain expertise in specific application fields. The second is how to reduce operation and maintenance costs. Large models require enormous computing power and strict data governance, and ordinary institutions and application departments often find it difficult to support their operation and iterative upgrading. To this end, it is necessary to establish an ecosystem in which various models interact healthily and evolve together, ensuring that the large-model-related AI industry can be successfully implemented across application fields.
From a technical point of view, the evolution of large models has relied on reinforcement learning from human feedback (RLHF). The data labeling it requires differs from the simple, low-cost labeling work of the past: highly skilled annotators write prompts and provide high-quality answers that conform to human logic and expression for the corresponding questions and instructions. However, given the limitations of human-machine interaction, a more ideal mode is to carry out reinforcement learning through interaction between models, that is, reinforcement learning from model feedback (RLMF). Based on the interaction of various models, the data and model ecosystems of the entire large-model landscape can be unified within one framework.
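A highly simplified sketch of the RLMF idea follows: a "feedback model" scores candidate answers produced by a "policy model", and the best and worst answers are kept as preference data for further training. Both models are stubbed out as toy functions here; this is an illustration of the loop, not a description of any specific system.

```python
import random

# Toy stand-ins for the two models. In a real RLMF setup these would be a
# generative policy model and a separately trained feedback/reward model.
def policy_model_generate(prompt: str, n_candidates: int = 4) -> list[str]:
    """Hypothetical generator: returns several candidate answers for a prompt."""
    return [f"{prompt} -> candidate answer {i}" for i in range(n_candidates)]

def feedback_model_score(prompt: str, answer: str) -> float:
    """Hypothetical feedback model: assigns a scalar quality score to an answer."""
    return random.random()  # placeholder for a learned reward

def collect_preference_data(prompts: list[str]) -> list[dict]:
    """One RLMF-style data-collection pass: rank candidates by model feedback."""
    dataset = []
    for prompt in prompts:
        candidates = policy_model_generate(prompt)
        scored = sorted(candidates, key=lambda a: feedback_model_score(prompt, a), reverse=True)
        # Keep the best and worst answers as a preference pair for later training.
        dataset.append({"prompt": prompt, "chosen": scored[0], "rejected": scored[-1]})
    return dataset

if __name__ == "__main__":
    for record in collect_preference_data(["Explain margin trading risks"]):
        print(record)
```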
In the past, under a decentralized approach to model development, the multiple tasks in a single AI application scenario each required their own model, and each model had to go through the full process of algorithm development, data processing, model training, and tuning. Pre-trained large models enhance the versatility and generalization of artificial intelligence: on top of a large model, fine-tuning with zero or few samples can achieve good results across a variety of tasks. The "pre-training + fine-tuning" paradigm has brought a new, standardized approach to AI research and development, enabling AI models to be produced at scale in a more unified and concise manner. Focusing on technological innovation and application implementation, the data and industrial ecosystem of large models can be divided into infrastructure (including general corpora and computing power platforms), basic large models, and large-model services (including synthetic data, model supply, and application plug-ins). In downstream applications, users can deploy their own small models, improve them through the various services of the large model, and in turn provide feedback services to the large model to help it evolve iteratively (see Figure 1).
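As an illustration of the "pre-training + fine-tuning" paradigm, the sketch below fine-tunes a pre-trained language model for a downstream classification task. It assumes PyTorch and the Hugging Face transformers library are installed; the model name, two-example dataset, labels, and hyperparameters are illustrative only.

```python
# Minimal "pre-training + fine-tuning" sketch, assuming PyTorch and the
# Hugging Face transformers library. Data and labels are toy examples.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # any pre-trained encoder; illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical downstream task: classify whether a sentence mentions credit risk.
texts = ["The borrower missed two consecutive payments.",
         "The quarterly report was published on schedule."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):  # a few fine-tuning passes over the toy batch
    outputs = model(**batch, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {loss.item():.4f}")
```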
Small models are characterized by small size (usually on the order of tens of billions of parameters) and are easy to train and maintain, making them suitable for vertical fields and for internal development and use within individual industries. In general, small models are less expensive to train but perform far worse than large models. Through interaction between large and small models, a small model can acquire some of the large model's capabilities or realize certain functions, greatly improving its performance without increasing operation and maintenance costs and meeting specific application requirements. The ways large and small models interact can be divided into three categories: data interaction, model interaction, and application interaction (see Figure 2).
Data interaction means that large and small models do not directly participate in each other's training or inference; instead, they interact indirectly through the data each generates. The training of large models usually requires a large-scale general-purpose corpus; for example, GPT-3's training corpus reaches 753GB, drawn from multiple sources such as Wikipedia. A general-purpose corpus covers many fields, so its coverage of some specific fields may be insufficient. After a large model has been trained, it can generate domain-specific synthetic corpus through instructions; after localized deployment, the small model can then be trained on this together with the field's dedicated corpus or the industry's private corpus. Because the small model's training corpus is relatively concentrated in one field, the model can systematically master that field's knowledge, making its output more professional, more detailed, and more accurate. The role of the large model in this process is to generate a large-scale, high-quality synthetic corpus so that the small model's training is more adequate and overfitting caused by the small size of the special or private corpus is prevented. Conversely, the professional corpus generated by the small model can supplement the large model's training corpus, enhancing the large model's capabilities in different professional fields so that it can continue to evolve iteratively.
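A minimal sketch of this data-interaction pattern is shown below: a large model is prompted to produce domain-specific question-answer pairs, which are written out as a training file for a small model. The `call_large_model` function is a hypothetical stand-in for whatever generation API or locally deployed model is actually used, and the seed questions are illustrative.

```python
import json

def call_large_model(prompt: str) -> str:
    """Hypothetical wrapper around a large model's generation API or local deployment."""
    # In practice this would call an API or a locally hosted model;
    # a canned answer is returned here so the sketch runs on its own.
    return ("Net interest margin is the difference between interest earned and "
            "interest paid, divided by earning assets.")

# Domain-specific seed questions; in practice these could also be generated from instructions.
seed_questions = [
    "Explain net interest margin in simple terms.",
    "What is the difference between liquidity risk and credit risk?",
]

def build_synthetic_corpus(questions, path="synthetic_finance_corpus.jsonl"):
    """Generate synthetic Q&A pairs with the large model and save them for small-model training."""
    with open(path, "w", encoding="utf-8") as f:
        for q in questions:
            answer = call_large_model(f"As a financial expert, answer concisely: {q}")
            f.write(json.dumps({"question": q, "answer": answer}, ensure_ascii=False) + "\n")
    return path

if __name__ == "__main__":
    print("Synthetic corpus written to", build_synthetic_corpus(seed_questions))
```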
To achieve data interaction between large and small models, in addition to relying on data source management organizations, it is also necessary to consider establishing data custody and trading institutions, so that the training data of large and small models can flow in a controlled and orderly manner and the corresponding rights and interests of all parties are allocated reasonably.
In addition to indirect data interaction, large and small models can also interact at the model level. By participating in each other's training process, both sides can benefit and the iteration efficiency of large models can be improved. On the one hand, large models can guide the training of small models; the commonly used method is knowledge distillation. In distillation, the trained large model serves as the teacher model and the small model to be trained as the student model. For the same batch of training data, a suitably designed loss function lets the soft labels generated by the large model and the hard labels of the training data jointly guide the training of the small model. On the other hand, the small model can also perform reverse distillation on the large model, using the small model to make sample-value judgments that help the large model converge faster: after the trained small model is further fine-tuned on the downstream data set, a sample-value judgment model is obtained.
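The sketch below shows the loss construction described above: soft labels from the teacher (large model) and hard labels from the data jointly guide the student (small model). It assumes PyTorch; the temperature and mixing weight are illustrative hyperparameters, and the logits are random toy tensors.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      hard_labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Combine soft-label (teacher) and hard-label losses for knowledge distillation.

    alpha weights the soft-label term; temperature softens both distributions.
    """
    # Soft-label term: KL divergence between softened teacher and student distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

if __name__ == "__main__":
    student = torch.randn(4, 10)          # toy student logits: batch of 4, 10 classes
    teacher = torch.randn(4, 10)          # toy teacher logits
    labels = torch.randint(0, 10, (4,))   # toy hard labels
    print(distillation_loss(student, teacher, labels))
```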
The typical way for large and small models to interact at the application level is the plug-in mode, in which an application built on one model is encapsulated as a plug-in service for other models to call. The plug-in mode has two advantages: it is convenient and efficient, since the models do not need to be retrained; and it provides good isolation, avoiding the leakage of model details and thereby better protecting the rights and interests of model trainers and users.
On the one hand, large models are basically pre-trained, so the timeliness of their output is limited. By calling small-model application plug-ins, a large-model application can improve the real-time performance of its results and make up for its lack of knowledge in specific fields. On the other hand, applications built with small models can directly obtain the powerful generation and reasoning capabilities of large models by calling the plug-ins provided by GPT-like large models. This mode of application interaction spares small models the cost of training on general knowledge, allowing them to focus on content production for specific fields at lower cost, while users experience the "chemical reaction" produced by interconnecting the various models.
The ChatGPT plugins recently released by OpenAI connect ChatGPT to third-party applications through application plug-ins. These third-party applications can be built from small models for a single domain. In this way, small models can provide a variety of extended functions within a ChatGPT-like large model, such as retrieving real-time information or knowledge-base information and carrying out "intelligent scheduling" of the real world on the user's behalf.
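As a rough illustration of the plug-in mode, the sketch below wraps a small domain model behind an HTTP endpoint that a large-model application could call; only the interface is exposed, not the model internals. It assumes FastAPI and uvicorn are installed; the endpoint path, payload, and the "small model" itself (a toy lookup table) are hypothetical.

```python
# Sketch of exposing a small domain model as a plug-in style HTTP service,
# assuming FastAPI and uvicorn are installed. The "model" is a toy lookup table.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Domain small-model plug-in (illustrative)")

class Query(BaseModel):
    question: str

# Hypothetical small model: a canned domain knowledge base standing in for
# a locally trained vertical-domain model.
DOMAIN_KNOWLEDGE = {
    "libor": "LIBOR has been replaced by alternative reference rates such as SOFR.",
}

@app.post("/plugin/query")
def query_small_model(q: Query) -> dict:
    """Answer a domain question; the caller never sees the model's internal details."""
    key = q.question.lower()
    for term, answer in DOMAIN_KNOWLEDGE.items():
        if term in key:
            return {"answer": answer, "source": "small-domain-model"}
    return {"answer": "No domain answer found.", "source": "small-domain-model"}

# Run locally with:  uvicorn plugin_service:app --port 8000
# A large-model application would then call POST /plugin/query with {"question": "..."}.
```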
Standardization and security control of large model training data and model tool chain
The performance of a large model depends on the quality of its training data, and the underlying technical specifications that the model requires also differ across deployment scenarios. Therefore, to build a sound, sustainable, and healthily interacting industrial ecosystem for large models, it is necessary to promote the standardization of large-model training data and of the underlying technologies, and to accelerate model iteration and implementation.
On the one hand, a large model's own training data set and its defined data service interfaces (APIs) will become de facto industry standards, and applications that access the large model must follow them. At present, "pre-training + fine-tuning" has become the industry's unified standard process and paradigm; on this basis, small models in various fields and industries can be further customized and optimized by combining specific application scenarios with professional data. To some extent, standards for large-model training data and data service interfaces will become one of the cores of the next generation of international standards.
On the other hand, the tool chain for the underlying technologies that process large-model training data must also be productized and standardized. With the strong support of standardized technical services, large models can deliver technical solutions such as hardware adaptation, model distillation and compression, distributed model training and acceleration, vector databases, graph databases, and model interconnection, providing capabilities in natural language processing, computer vision, cross-modality, knowledge graphs, and more. This allows more companies and developers to apply large models to their own businesses and to build vertical industry models at a low threshold, thereby promoting the widespread implementation of artificial intelligence across fields.
It is worth noting that although the development and application of large models will bring huge dividends to industrial and economic development, without proper controls they will also bring risks to national and industrial security. The first is the risk of data leakage. The training and implementation of large models must be supported by massive amounts of data, including sensitive industry or personal information; without reasonable data desensitization and data custody mechanisms, data may leak and cause losses to industries and individuals. The second is model security risk. For example, plug-ins may be implanted with harmful content and become tools for fraud and "poisoning" by criminals, endangering social and industrial security.
Related suggestions
**Take large-model training data as the starting point, with standard formulation and data governance proceeding in tandem.** Promote the standardized development of the industry by formulating model application specifications and unifying interface standards. Consideration could be given to placing models' synthetic data under custody to strengthen supervision and to ensure compliant data content, clear rights and interests, and smooth circulation. At the same time, improve laws and regulations, optimize policies and systems, and form a joint regulatory force through various means, strictly preventing malicious tampering with models and the infiltration of harmful data.
**Build a market for large-model training data as a factor of production.** Clarify the industrial chain linking training data collection and processing, synthetic data services, interconnection between large and small models, and application APIs. Accelerate the construction of the data factor market, provide market-oriented pricing for training data, and facilitate the distribution of rights, interests, and incentives.
**Build a healthy ecosystem in which large and small models develop symbiotically and promote each other.** In general, there is no generational gap at the algorithmic level between mainstream large models at home and abroad, but there are gaps in computing power and data. It is recommended to vigorously support domestic leading technology companies in developing independent and controllable domestic large models in the general field, while encouraging all vertical fields to use open-source tools to build standardized and controllable independent tool chains on top of large models, exploring both "big and strong" general models and "small and beautiful" vertical industry models. In this way, a healthy ecosystem of interactive symbiosis and iterative evolution between basic large models and professional small models can be built.
(Editor-in-charge Zhang Lin)