The Surprising Finds of Gradient Descent: Small Steps May Not Always Be Best
Category Science Saturday - August 12 2023, 02:14 UTC - 1 year ago Gradient descent is a common method for solving optimization problems that mathematicians and scientists have been perfecting for over 150 years. However, a recent study contradicted a basic assumption of the technique, showing that in some cases taking bigger leaps can actually get to the optimal solution faster.
Optimization problems can be tricky, but they make the world work better. These kinds of questions, which strive for the best way of doing something, are absolutely everywhere. Your phone’s GPS calculates the shortest route to your destination. Travel websites search for the cheapest combination of flights that matches your itinerary. And machine learning applications, which learn by analyzing patterns in data, try to present the most accurate and humanlike answers to any given question.
For simple optimization problems, finding the best solution is just a matter of arithmetic. But the real-world questions that interest mathematicians and scientists are rarely simple. In 1847, the French mathematician Augustin-Louis Cauchy was working on a suitably complicated example — astronomical calculations — when he pioneered a common method of optimization now known as gradient descent. Most machine learning programs today rely heavily on the technique, and other fields also use it to analyze data and solve engineering problems.
Mathematicians have been perfecting gradient descent for over 150 years, but last month, a study proved that a basic assumption about the technique may be wrong. "There were just several times where I was surprised, [like] my intuition is broken," said Ben Grimmer, an applied mathematician at Johns Hopkins University and the study’s sole author. His counterintuitive results showed that gradient descent can work nearly three times faster if it breaks a long-accepted rule for how to find the best answer for a given question. While the theoretical advance likely does not apply to the gnarlier problems tackled by machine learning, it has caused researchers to reconsider what they know about the technique.
"It turns out that we did not have full understanding" of the theory behind gradient descent, said Shuvomoy Das Gupta, an optimization researcher at the Massachusetts Institute of Technology. Now, he said, we’re "closer to understanding what gradient descent is doing." .
The technique itself is deceptively simple. It uses something called a cost function, which looks like a smooth, curved line meandering up and down across a graph. For any point on that line, the height represents cost in some way — how much time, energy or error the operation will incur when tuned to a specific setting. The higher the point, the farther from ideal the system is. Naturally, you want to find the lowest point on this line, where the cost is smallest.
Gradient descent algorithms feel their way to the bottom by picking a point and calculating the slope (or gradient) of the curve around it, then moving in the direction where the slope is steepest. Imagine this as feeling your way down a mountain in the dark. You may not know exactly where to move, how long you’ll have to hike or how close to sea level you will ultimately get, but if you head down the sharpest descent, you should eventually arrive at the lowest point in the area.
Unlike the metaphorical mountaineer, optimization researchers can program their gradient descent algorithms to take steps of any size. Giant leaps are tempting but also risky, as they could take longer to get to the bottom. Taking smaller steps is better, as they ensure steady, if slow, progress. However, the study suggested this could be a bit overwrought; in some cases, taking bigger leaps might actually get you to the optimal solution in less time.
Share