
1. Overview

Constrained optimization problems show up in many different fields like science, engineering, and economics.

In this tutorial, we’ll discuss an essential method of finding the maxima and minima of constrained functions, namely the method of Lagrange multipliers.

2. The Method of Lagrange Multipliers

Sometimes, we need to find the extreme values of a function subject to some constraint. For example, the function’s domain may be constrained to a specific region, like a particular quadrant of the plane or a triangular area. The constraints might be along a line or curve as well.

2.1. A Constrained Optimization Example

Let’s consider the following example, which was taken from Calculus and Analytic Geometry (p. 865), by G. B. Thomas, Jr. & R. L. Finney, 1984, Addison-Wesley Publishing Company:

 

[Figure: the free maximum and the constrained maximum of f(x, y) = 9 – x^2 – y^2]

The maximum value of the function f(x, y) = z = 9 – x^2 – y^2 is 9. We show it as the free maximum in the figure, i.e., the point (0, 0, 9). However, we may need to find the maximum of f(x, y) subject to the constraint g(x, y) = x + 3y – 12 = 0. In this case, the constrained maximum is another point on the surface of f(x, y). The perpendicular projection of this point onto the x-y plane lies on the line g(x, y) = 0.

2.2. Equations

We can solve constrained optimization problems of this kind using the method of Lagrange multipliers. If we want to find the local maximum and minimum values of f(x, y) subject to the constraint g(x, y) = 0, then the method of Lagrange multipliers states that we need to solve the following equations for x, y, and \boldsymbol{\lambda}:

\nabla f=\lambda\nabla g

g(x, y) = 0

The functions f(x, y) and g(x, y) must have continuous partial derivatives. \nabla f and \nabla g are the gradients of the functions f(x, y) and g(x, y), respectively. The scalar value, \boldsymbol{\lambda}, is the Lagrange multiplier.

The gradient of f(x, y), \nabla f, is defined as:

\nabla f=\frac{\partial f}{\partial x}\vec{i} + \frac{\partial f}{\partial y}\vec{j}

The gradient vector points in the direction in which the function increases most rapidly. Therefore, the method of Lagrange multipliers implies that the gradient of f(x, y) must be parallel to the gradient of g(x, y) at the constrained maxima and minima.
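We can check a gradient numerically with central differences. Here's a minimal Python sketch that compares the analytic gradient of the article's example function, f(x, y) = 9 – x^2 – y^2, with a finite-difference approximation; the evaluation point (1, 2) is arbitrary and purely illustrative:

```python
# Comparing the analytic gradient of f(x, y) = 9 - x^2 - y^2 with a
# central-difference approximation at an arbitrary point (1, 2).
def f(x, y):
    return 9 - x**2 - y**2

def grad_f(x, y):
    # Analytic gradient: (df/dx, df/dy) = (-2x, -2y)
    return (-2.0 * x, -2.0 * y)

def numeric_grad(func, x, y, h=1e-6):
    # Central differences approximate each partial derivative
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

gx, gy = grad_f(1.0, 2.0)
nx, ny = numeric_grad(f, 1.0, 2.0)
print(gx, gy)  # -2.0 -4.0
```

Central differences have O(h^2) error, so with a small step like h = 1e-6, the numeric values agree with the analytic gradient to several decimal places.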

We can easily extend the method of Lagrange multipliers to higher dimensions. Similarly, there might be more than one constraint. For example, we may need to solve the following equations simultaneously for a differentiable function f(x, y, z) with two constraints:

\nabla f=\lambda\nabla g_1+\mu\nabla g_2,\ g_1(x,y,z)=0,\ g_2(x,y,z)=0

Here, g1(x, y, z) = 0 and g2(x, y, z) = 0 are the two constraints. They must also be differentiable. \boldsymbol{\lambda} and \boldsymbol{\mu} are the Lagrange multipliers.

2.3. Solution of the Example

Let’s solve the constrained optimization problem we presented before using the method of Lagrange multipliers. The gradients of f(x, y) and g(x, y) are:

\nabla f=\frac{\partial f}{\partial x}\vec{i} + \frac{\partial f}{\partial y}\vec{j}=-2x\vec{i}-2y\vec{j}

\nabla g=\frac{\partial g}{\partial x}\vec{i} + \frac{\partial g}{\partial y}\vec{j}=\vec{i}+3\vec{j}

\nabla f=\lambda\nabla g is then equivalent to:

-2x\vec{i}-2y\vec{j} = \lambda(\vec{i}+3\vec{j})

Therefore, the equations we need to solve using the method of Lagrange multipliers are:

-2x = \lambda

-2y = 3\lambda

x + 3y - 12 = 0

Solving these three equations, we get:

\lambda = -12/5, x = 6/5, y=18/5

Therefore, the maximum value of f(x, y) subject to the constraint g(x, y) = 0 is -27/5, i.e., the constrained maximum is the point (6/5, 18/5, -27/5).
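Since the three equations are linear, substitution solves them directly. The following Python sketch mirrors the hand calculation, using exact fractions to avoid floating-point noise:

```python
# Solving the Lagrange system for the example by substitution:
#   -2x = lam,  -2y = 3*lam,  x + 3y - 12 = 0
# Substituting x = -lam/2 and y = -3*lam/2 into the constraint gives
#   -lam/2 - 9*lam/2 = 12, i.e., -5*lam = 12, so lam = -12/5.
from fractions import Fraction

lam = Fraction(-12, 5)
x = -lam / 2               # 6/5
y = -3 * lam / 2           # 18/5
f_max = 9 - x**2 - y**2    # value of f at the constrained maximum
print(lam, x, y, f_max)    # -12/5 6/5 18/5 -27/5
```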

3. Applications in Machine Learning

Lagrange Multipliers play a key role in many different fields of machine learning. In this section, we’ll discuss three of them briefly.

3.1. Regularization

Regularization is an essential technique in machine learning that prevents overfitting by trading a small increase in bias for a reduction in variance. It typically adds a penalty term to the original cost function:

J_{\lambda}(w) = J(w) + \lambda R(w)

Here, \lambda R(w) is the added term to the original cost function, J(w). \boldsymbol{\lambda} is a regularization parameter or tuning parameter, which is nonnegative. R(w), the regularization penalty, is a nonnegative function of the model’s parameters, w, just like the original cost function. J_{\lambda}(w) is the regularized cost function.

In Ridge regression, the regularized cost function becomes:

J_{\lambda}(w) = J(w) + \lambda \sum_{k}{w^2_k}

The w_k values are the estimated weights of the model. Ridge regression looks for estimates that fit the data well by minimizing the regularized cost function. However, while ordinary least squares regression generates only one set of weight estimates, Ridge regression produces a different set of weight estimates for each value of \lambda.

When \lambda=0, Ridge regression produces the same weights as least squares. Increasing \lambda increases the impact of the regularization penalty. The weight estimates will approach zero in this case. Therefore, selecting a good value for \lambda is critical.
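To make the effect of \lambda concrete, here's a minimal Python sketch of the single-feature, no-intercept case with made-up data. In this special case, the regularized cost \sum_i (y_i - w x_i)^2 + \lambda w^2 has the closed-form minimizer w = (\sum_i x_i y_i) / (\sum_i x_i^2 + \lambda):

```python
# One-feature ridge regression (no intercept): setting the derivative of
# sum_i (y_i - w*x_i)^2 + lam*w^2 to zero gives
#   w = (sum x_i y_i) / (sum x_i^2 + lam)
def ridge_weight(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]               # made-up data, exactly y = 2x
print(ridge_weight(xs, ys, 0.0))   # 2.0 (same as least squares)
print(ridge_weight(xs, ys, 14.0))  # 1.0 (the penalty shrinks the weight)
```

As expected, \lambda = 0 reproduces the least-squares weight, and larger values of \lambda shrink the weight toward zero.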

Ridge regression is, in fact, equivalent to the following constrained optimization problem:

Minimize\ J(w)\ subject\ to\ \sum_{k}{w^2_k} \le t

Here, t is a constant that controls the magnitude of the estimated weights. In this formulation, the regularization parameter, \boldsymbol{\lambda}, plays the role of the Lagrange multiplier.

In Lasso regression, the regularized cost function becomes:

J_{\lambda}(w) = J(w) + \lambda \sum_{k}{|w_k|}

Similarly, Lasso regression is equivalent to the following constrained optimization problem:

Minimize\ J(w)\ subject\ to\ \sum_{k}{|w_k|} \le t

Therefore, we can solve Ridge and Lasso regressions using the method of Lagrange multipliers.
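The same single-feature, no-intercept sketch for Lasso (again with made-up data) highlights the key difference between the two penalties: the \lambda|w| term can set a weight exactly to zero. In this special case, minimizing \sum_i (y_i - w x_i)^2 + \lambda|w| has a closed-form soft-thresholding solution:

```python
# One-feature lasso regression (no intercept): minimizing
#   sum_i (y_i - w*x_i)^2 + lam*|w|
# yields the soft-thresholding solution below.
def lasso_weight(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2, 0.0)    # soft threshold
    return (1.0 if sxy >= 0 else -1.0) * shrunk / sxx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]               # made-up data, exactly y = 2x
print(lasso_weight(xs, ys, 0.0))   # 2.0 (same as least squares)
print(lasso_weight(xs, ys, 28.0))  # 1.0
print(lasso_weight(xs, ys, 56.0))  # 0.0 (a large penalty zeroes the weight)
```

Unlike Ridge, which only shrinks weights toward zero, a large enough \lambda zeroes the Lasso weight entirely, which is why Lasso is often used for feature selection.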

3.2. Maximal Margin Classifiers

The goal of a Maximal Margin Classifier (MMC) is to find a decision boundary that is farthest from the training observations. For example, there are two classes of observations in the following figure, shown with crosses and circles:

[Figure: two classes of observations separated by the maximal margin hyperplane (black line), with the margin boundaries shown in red and green]

 

We label the training observations shown with crosses with “1”, and the training observations shown with circles with “-1”. Observations A and B marked with crosses lie on the red line whereas observation C marked with a circle lies on the green line. The lines are parallel.

The black line is equidistant from the observations A, B, and C. It maximizes the distance (margin) to these observations; therefore, it's called the maximal margin hyperplane. The points closest to the maximal margin hyperplane, i.e., A, B, and C, are known as support vectors.

Therefore, we need to solve the following optimization problem to find the decision boundary drawn using the black line:

Maximize\ M\ subject\ to\ \sum_{j=1}^{2} w_j^2= 1\ and

y_i(w_0+w_1x_{i1}+w_2x_{i2})\ge M\ \forall i=1,..,n

Here, M is the hyperplane’s margin, and the w_j’s are the model’s weights. w_0+w_1x_{i1}+w_2x_{i2} = 0 is the equation of the hyperplane. Finally, the y_i’s are the associated class labels, either “1” or “-1”, and n is the number of observations.

The first constraint, \boldsymbol{y_i(w_0+w_1x_{i1}+w_2x_{i2})\ge M}, ensures that each observation is on the correct side of the hyperplane. The other constraint, \sum_{j=1}^{2} w_j^2= 1, guarantees that the perpendicular distance from an observation to the hyperplane is given by y_i(w_0+w_1x_{i1}+w_2x_{i2}).
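We can see how the unit-norm constraint makes y_i(w_0 + w_1 x_{i1} + w_2 x_{i2}) a distance by computing the margin of a candidate hyperplane in Python. The data, labels, and weights below are purely illustrative:

```python
import math

# With w1^2 + w2^2 = 1, the signed distance of observation (x1, x2) with
# label y (+1 or -1) from the hyperplane w0 + w1*x1 + w2*x2 = 0 is
# y * (w0 + w1*x1 + w2*x2); the margin M is the smallest such distance.
def margin(w0, w1, w2, points, labels):
    assert math.isclose(w1**2 + w2**2, 1.0)  # the unit-norm constraint
    return min(y * (w0 + w1 * x1 + w2 * x2)
               for (x1, x2), y in zip(points, labels))

# Toy data separated by the vertical line x1 = 0, i.e., (w0, w1, w2) = (0, 1, 0):
points = [(2.0, 0.0), (3.0, 1.0), (-1.0, 0.5)]
labels = [1, 1, -1]
print(margin(0.0, 1.0, 0.0, points, labels))  # 1.0
```

An MMC would then search over all weight vectors satisfying the unit-norm constraint to maximize this margin.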

Clearly, finding the decision boundary is an optimization problem to which the method of Lagrange multipliers can be applied. Here, the decision boundary is a line in the x-y plane, but we can easily extend the approach to higher dimensions.

The generalizations of MMCs, like Support Vector Machines, also employ constrained optimization problems. Therefore, we can use the method of Lagrange multipliers to find an optimal solution for those classifiers.

3.3. Constrained Markov Decision Processes

Markov Decision Processes (MDPs) form the mathematical foundation of reinforcement learning (RL). In RL, an agent learns to make decisions by interacting with its environment. The agent chooses actions over time according to a policy to maximize the expected cumulative reward. Normally, an MDP doesn't impose constraints.

However, in real-life RL applications, there might be some constraints. The constraint function can be related to safety, energy consumption, or cost. For example, in autonomous driving, the agent’s actions must ensure a safe distance from other obstacles to satisfy safety constraints. In this case, we have the following optimization problem:

Maximize R(π) subject to C(π) ≤ ϵ

Here, R(π) is the expected cumulative reward. π denotes the policy, a function that maps a state to an action. C(π) is the constraint function, and ϵ is a threshold that the constraint mustn’t exceed.

This case is an example of a constrained MDP. In a constrained MDP, the agent’s goal is again to maximize the expected cumulative reward, but subject to some constraints on state or action-based costs. Clearly, we can solve this optimization problem using the method of Lagrange multipliers.
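One common way to attack this problem is to fold the constraint into the objective with a Lagrange multiplier, L(π, λ) = R(π) − λ(C(π) − ϵ), and maximize over policies. The following Python sketch illustrates the idea on a made-up, finite set of policies with hypothetical reward and cost values:

```python
# Lagrangian relaxation of the constrained MDP objective on a toy,
# hypothetical policy set. Each policy maps to (reward R, cost C).
policies = {"cautious": (5.0, 0.5), "balanced": (8.0, 1.0), "risky": (12.0, 3.0)}
eps = 1.0  # constraint threshold: we require C(pi) <= eps

def best_policy(lam):
    # Maximize the Lagrangian L(pi, lam) = R(pi) - lam * (C(pi) - eps)
    return max(policies,
               key=lambda p: policies[p][0] - lam * (policies[p][1] - eps))

print(best_policy(0.0))  # risky    (the constraint is ignored)
print(best_policy(5.0))  # balanced (violating the constraint is too costly)
```

In practice, λ is typically adjusted as well, e.g., by dual gradient ascent, until the maximizing policy satisfies the constraint.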

4. Conclusion

In this article, we discussed an essential method of finding the maxima and minima of constrained functions, namely the method of Lagrange multipliers.

First, we learned what a constrained optimization problem is through an example. We saw the equations we need to solve when using the method of Lagrange multipliers. Additionally, we touched on the geometric interpretation of the method.

Then, we discussed the method's applications in three fields of machine learning. The first example was regularization. The extra term added to the cost function prevents overfitting, which is generally a consequence of using an overly complex model. Tuning the added term turns regularization into a constrained optimization problem.

The second example was its application in finding the decision boundary for an MMC. The constraint of each observation being on the correct side of the hyperplane, together with the distance constraint, turns the problem into a constrained optimization problem.

Finally, we learned that the method can also be used for constrained MDPs, and therefore for RL applications with constraints. Since we want to maximize the expected cumulative reward subject to some constraints, the problem is again a constrained optimization problem.