
AI Progress Inside the World of Performance

Updated: Dec 28, 2024

GPT-4 has been out for a couple of weeks now, and all the buzz has centered on the number of parameters compared to its predecessor. The count reportedly jumped from GPT-3's 175 billion to an unverified ~100 trillion parameters, and we're already hearing talk about GPT-5 opening up a black hole in the cosmos. What does it all mean?


The more parameters a new release has, the more excitement it generates in the tech space. But is this just like Gillette adding a seventh blade to their razor—a marketing gimmick that adds no real value?


When it comes to ChatGPT and other conversational AI tools, we've learned that the number of parameters isn't necessarily the best measure of quality and capability. OpenAI is measuring against a list of micro- and macro-level performance metrics, as well as open-sourcing the development of new metrics on GitHub.


This post aims to demystify the hype around parameters while highlighting some of the better and more nuanced ways that OpenAI is demonstrating value. If that doesn't float your boat, skip to the end for some links to new AI tools to boost productivity.

Parameters refer to the adjustable settings within the model that help determine its behavior and performance. Think of a parameter like a volume setting on your computer—you can adjust it to make the sound louder or softer.
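To make the volume-knob analogy concrete, here's a minimal sketch (the function name and numbers are illustrative, not from any real model): a model is just a function whose behavior is controlled by adjustable numbers.

```python
# A toy "model" with exactly two parameters. Adjusting them changes the
# output, the same way nudging a volume slider changes the loudness.

def toy_model(x, weight, bias):
    """Map an input to an output; weight and bias are the 2 parameters."""
    return weight * x + bias

# Same input, different parameter settings, different behavior.
print(toy_model(10, weight=0.5, bias=1.0))  # -> 6.0
print(toy_model(10, weight=0.9, bias=1.0))  # -> 10.0
```

A large language model is the same idea scaled up: instead of two adjustable numbers, there are billions of them.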


Now imagine you're an audio engineer working behind the scenes on your next Billboard chart-topper. What's in front of you is a large console equipped with 100 control sliders, each capable of tuning the sound with filters, effects, EQ, panning, and more to create an elegant and harmonious piece of music. If one slider is out of whack, the music doesn't sound right.


In this context, developing a useful language model is similar to recording a professional album, but the difference in the size of the audio mixer is enormous, and the controls are designed for mathematical and computing adjustments that could fill a textbook. An audio mixer with 100 trillion sliders? It's impossible to conceptualize.


Consider this: every photo ever posted to Instagram totals roughly 50 billion (estimates vary), and commenting on each one would consume over 1,000 human lifetimes. Multiply that entire collection by 2,000 and you reach 100 trillion. Massive!


With a number that large, any novice to this field would ask the obvious question:

How in the world can an office of engineers possibly program that many settings?


They don't. They're engineers. They're lazy (kidding). I mean, they're resourceful(!), and the answer to the 100-trillion-parameter question is: automation.


In 1962, Arthur Lee Samuel showcased a computer that could teach itself to play checkers against a formidable opponent. The program beat a college-level champion named Robert Nealey in a single match. Note that ChatGPT gets the dates and events all wrong, hence the hyperlinks. You'd think the history of machine learning would be a cakewalk for ChatGPT?!


Samuel's checkers program is one of the first instances where a computer demonstrated that it could learn and improve with each iteration of the game through trial and error—a huge breakthrough. Many consider this the birth of machine learning.

Fast forward 70 years and machine learning has become a revolutionary force in computer science. Within that field, we now have an automated training process, built on the Generative Pre-trained Transformer (GPT) architecture, doing the heavy lifting by learning most of the parameters within the model. Herein lies the beauty and the danger of AI: it's self-taught, but the learning is conducted inside an engineered virtual environment.
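Samuel's trial-and-error idea is easy to sketch in miniature. Below, a single parameter is set by automation rather than by hand: the program guesses, measures its error, and nudges the parameter, repeatedly. This is a generic gradient-descent toy, not OpenAI's actual training code, and the numbers are invented for illustration.

```python
# Fit one weight w so that w * x matches the targets, purely by
# trial and error. No engineer ever types in the final value of w.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) pairs; true w is 2

w = 0.0                # the single parameter, starting uninformed
learning_rate = 0.05   # an engineer-chosen setting (a hyperparameter)

for _ in range(200):                    # each pass is one round of trial and error
    for x, target in data:
        error = w * x - target          # how wrong is the current guess?
        w -= learning_rate * error * x  # nudge w in the direction that shrinks the error

print(round(w, 3))  # -> 2.0
```

Scale this loop up to billions of weights and you have the gist of how 100 trillion "sliders" could ever get set.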


Engineers program a higher level of governance called *hyperparameters* to direct the model's learning. These are the big dials on your mixer, responsible for human-guided training. The word "direct" is used with purpose because the learning process moves in some direction (getting smarter) with some magnitude (getting faster). It's a vector, and it's why your teacher handed you graph paper in grade 11 to draw various lines whose practical value you could never see. Well, here you go.
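The direction-and-magnitude point can be shown with actual numbers. In this hedged sketch (the gradient values are made up for illustration), one learning step is literally a vector: the gradient supplies the direction, and the learning rate, one of the engineer's "big dials," scales its magnitude.

```python
import math

gradient = [3.0, 4.0]  # direction of steepest error increase (illustrative numbers)
learning_rate = 0.1    # hyperparameter chosen by the engineer, not learned

# One update step: move opposite the gradient, scaled by the learning rate.
step = [-learning_rate * g for g in gradient]

magnitude = math.sqrt(sum(s * s for s in step))
print([round(s, 2) for s in step])  # -> [-0.3, -0.4]
print(round(magnitude, 6))          # -> 0.5
```

Turn the learning-rate dial up and the step's direction stays the same while its magnitude grows, which is exactly the kind of human-guided governance the paragraph above describes.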


Skilled engineers set the hyperparameters, and the model learns within them. The day AI breaks free from those constraints is the day the T-1000 shows up at your doorstep. Note that ChatGPT will calm your nerves by explaining how unlikely this scenario is, but many of us have seen the movie Ex Machina and know the ending (Ava...).


Thus, machine learning facilitates the adjustments required to develop the model, mapping input text to output text through algorithms. No need to rent all 500 million square feet of New York City's commercial office space to house these parameter-setting engineers—or the avocados and tofu required to feed them.


While parameters are a useful metric, OpenAI's CEO Sam Altman confirmed on the Lex Fridman podcast this week that parameters are not a great catch-all performance metric for a well-functioning model. He compared it to the gigahertz race, a battle largely fought between Intel and AMD in the late 1990s and early 2000s to produce the fastest clock speeds on their computer chips, only to face diminishing returns from heat and energy demands, signaling a shift toward other efficiencies.

Much like clock-speed redlining, more parameters, ineffectively managed, lead to bloat and poor performance. And with that, we'll place a wooden stake through this vampire called the parameter.


There's a lot more to balance, far more than I could hope to understand in a few months. On a macro level, however, accuracy, speed, interpretability, scalability, and robustness (adaptation to challenge) are common targets used to measure performance, and these five properties seem like a good place to anchor one's understanding. All the metrics provided on the OpenAI resource page fall under at least one of these categories.


So how is OpenAI tracking performance on a macro level that demonstrates progress? One overall metric measures performance against standardized human exams: GPT-3.5 completed 34 exams across disciplines spanning the hard sciences, mathematics, language, and law as a benchmark, and GPT-4 then retook them to measure improvement. All scores either improved or held equal, with the Bar Exam showing the greatest improvement overall at +80%. The full report can be found here.

Another metric tests the model's ability to pick out facts from a basket of intentional untruths. Simple truths are easy to test because we can readily verify them (the grass is green, the car is damaged), but some biases have emerged over the past four months around more complex and polarizing issues. In either case, GPT-4 is outperforming GPT-3.5 by 40% in this area according to the data.


There are metrics around multilingual performance and image recognition to name a few on the resources page for those interested in learning more.


Personal Experience:


I've been using ChatGPT and GPT-4 interchangeably over the last two weeks within the BearlyAI platform, which just released its iOS app on TestFlight this week. GPT-4 gives more detailed responses, but it's also noticeably slower to respond. It's good to be able to toggle between the two engines depending on the task. As for the text-to-image and image-recognition capabilities, I have yet to take them for a spin.


Some customers have received fee-based access to an increased token limit of 32,000, which permits a combined query and response of roughly 50 pages of text. I suspect that document-query chatbots for contracts will be at the forefront for startups and big businesses alike, and will change how we communicate on reports and resolve contract disputes in the near future.
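The 50-page figure follows from some back-of-the-envelope arithmetic. The conversion factors below are common rules of thumb for English text, not numbers from OpenAI:

```python
# Rough estimate of how much text a 32,000-token limit covers.
TOKEN_LIMIT = 32_000
WORDS_PER_TOKEN = 0.75   # rule of thumb: a token is ~3/4 of an English word
WORDS_PER_PAGE = 500     # rule of thumb for a typical single-spaced page

words = TOKEN_LIMIT * WORDS_PER_TOKEN  # combined query + response budget
pages = words / WORDS_PER_PAGE

print(int(words), round(pages))  # -> 24000 48
```

So ~24,000 words, or roughly 50 pages of combined prompt and reply, which is why a 56-page contract is suddenly a realistic thing to "chat" with.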


This one's hot off the press: a new GPT-4 API setup lets you 'chat' with a 56-page legal PDF about the famous Supreme Court case Morse v. Frederick. What's great about this chatbot is that each response cites its sources with direct quotes from the document.


The future of AI is a hallway of doors yet to be opened...


Some AI Tools to Check Out:

Free text-to-image generator

Compare Anthropic, Cohere, and GPT-3.5 responses

Query chatbot for the Lex Fridman Podcast

Pro headshots with AI

Query scientific papers

If you found this article helpful, please like and share. Also, follow us on Twitter @BluelineProAI.

Disclosures

- Article written by Mike Bogias
- Article editing and image generation by AI software
- Image prompts and GIF recordings on Twitter by Blueline Consulting
- Published by Blueline Consulting
