LLaMA, a large-scale language model announced by Meta in February 2023, is smaller than the existing GPT-3 yet delivers comparable performance even on consumer-grade GPU environments. llama.cpp, a C/C++ implementation for running LLaMA on ordinary hardware, was subsequently released. It has now been reported that programmer Justine Tunney contributed an update that reduces llama.cpp's memory usage at runtime, allowing some LLaMA models to run in less than 6GB of RAM.
LLaMA is a family of large-scale language models announced by Meta AI Research. The models range from 7 billion to 65 billion parameters, and the LLaMA 13B model reportedly matched the 175-billion-parameter GPT-3 in benchmark tests.
In addition, because LLaMA runs without problems on a single GPU, it also suggested the possibility of running interactive AI such as ChatGPT on consumer-level hardware.
Later, the developer of llama.cpp reported successfully running LLaMA on an M1-equipped MacBook Pro. According to the developer, the LLaMA 13B model runs at a processing speed of about 10 tokens per second on an M1 Mac.
On March 31st, Tunney contributed a further update to llama.cpp's C++ source code. The update significantly reduces memory usage at runtime: the LLaMA 13B model, which previously required about 30GB, reportedly ran without problems using only 5.8GB, including system memory usage. With this change, only the portions of the weights actually needed for inference are loaded into memory, so memory usage is far lower than before.
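As an illustration of the general idea, the following is a minimal C sketch of demand-paged weight loading using mmap on a POSIX system. It is not llama.cpp's actual implementation, and the file name and tensor offset are hypothetical; it only shows why mapping a weights file read-only means a process pays in physical memory only for the pages it actually touches.

```c
/* Minimal sketch: demand-paged loading of a weights file with mmap.
 * Hypothetical file name and offset; not llama.cpp's actual code. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "model-weights.bin";  /* hypothetical weights file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only. No data is copied at this point:
     * the kernel pages in only the regions that are actually accessed,
     * and identical read-only pages can be shared across processes. */
    void *weights = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  /* the mapping remains valid after closing the descriptor */

    /* Touch only the part of the file a given inference step needs;
     * only those pages count toward resident memory. */
    size_t offset = 0;  /* hypothetical tensor offset */
    const unsigned char *p = (const unsigned char *)weights + offset;
    printf("first byte of mapped weights: 0x%02x\n", p[0]);

    munmap(weights, (size_t)st.st_size);
    return 0;
}
```

Because no upfront copy of the file is made, startup is close to instantaneous, which is consistent with the faster loading and lower memory footprint described above.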
With this change, model loading is said to be up to 100 times faster than before, models more than twice as large can reportedly be loaded stably, and multiple inference processes can run simultaneously. Related information can be found here.