Introduction to Web Proxies | Baeldung on Computer Science

1. Introduction

Many corporate environments use proxy servers to control their Internet traffic better. Also, some proxy servers can even cache Internet resources, reducing bandwidth needs.

In this tutorial, we’ll introduce web proxies: what they are, how they work, and some of their types.

2. What Is a Proxy Server?

Before delving into the proxy client configuration, let’s review some key concepts. A proxy server is a software solution that acts as an intermediary between clients and the servers that actually provide a service. Instead of reaching those servers directly, the clients connect to the proxy server and ask it to forward their requests to the actual servers.

The need for proxy servers goes back to the early days of the modern Internet. At that time, administrators began to realize that the free, uncontrolled flow of information might lead to abuse, so they created ways of filtering Internet traffic to and from specific routes. Those ways are what we now know as firewalls.

In general, firewalls can be implemented using packet filtering, such as iptables, or through proxy servers. The main difference between them is precisely what kind of operations they must perform.

2.1. Packet Filtering

Packet filters are intended to be fast and light on resources, and they normally operate at the kernel level. As such, there are some limits on the sort of rules they can apply.

The first packet filters only applied rules at the network and transport layers, that is, they controlled IP addresses and TCP/UDP ports by means of access control list (ACL) rules. Today, they can go even higher in the OSI model, up to the application layer, thus giving even finer control.

2.2. Proxies

Proxies, on the other hand, are implemented at the user-space level and can apply very complex rules. For instance, they can fully identify the user, the application, and the protocol operation, and they can filter content. They can even create corporate-level caches to reduce bandwidth usage. Additionally, they can do lookups to check whether some policy allows access to a target service and, if so, under what conditions.

For instance, we can allow a group of users to access game servers, but only at lunch hour and within some bandwidth limit.
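To make the lunch-hour example more concrete, here is a minimal sketch of how such a policy check could look. It is not taken from any real proxy product: the AccessRule class, its field names, the "staff" group, and the 512 kbit/s limit are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class AccessRule:
    """Hypothetical policy rule; the field names are illustrative only."""
    category: str        # kind of destination, e.g. "games"
    allowed_group: str   # user group the rule applies to
    start: time          # start of the allowed time window (inclusive)
    end: time            # end of the allowed time window (exclusive)
    max_kbps: int        # bandwidth ceiling for matching traffic


def evaluate(rule, user_groups, category, now):
    """Return the bandwidth limit in kbit/s if access is allowed, else None."""
    if category != rule.category:
        return None
    if rule.allowed_group not in user_groups:
        return None
    if not (rule.start <= now < rule.end):
        return None
    return rule.max_kbps


# Assumed rule: staff may reach game servers from 12:00 to 13:00 at 512 kbit/s.
lunch_games = AccessRule("games", "staff", time(12, 0), time(13, 0), 512)
```

Calling evaluate(lunch_games, {"staff"}, "games", time(12, 30)) yields the bandwidth limit, while the same call at 14:00, or for a user outside the group, yields None. A real proxy would express this in its own configuration language rather than in code like this.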

2.3. How Proxies Work

The figure below shows the schematic of a regular Internet access proxy server, detailing how the connections compare with a non-proxied connection:

3. Proxy Types

Like any complex software, there are many classifications for proxies. Let’s see some of the more important ones.

3.1. Data Flow Direction

Depending on the data flow, the proxy server can be:

Forward, outbound, or direct proxies are usually used to filter or control traffic generated inside the organization and directed to external servers. In this tutorial, we’ll focus on configuring the client software to use this proxy class. In this category, we will find software packages like Squid, Privoxy, Tinyproxy, or even full-fledged web servers like Apache with mod_proxy and Nginx. Also, some anonymizing tools such as Tor work by implementing proxy servers. Even ssh has its own proxy mode!
Reverse or inbound proxies are dedicated to protecting internal services from traffic that originates outside the organization. They can offload tasks from the actual servers. Common applications are static content hosting, data stream compression, encryption or decryption (for TLS or SSL), session authentication, or load balancing. In this category, we will find software such as HAProxy, Apache, Nginx, and Kubernetes Ingress. For more information on reverse proxies, please check this tutorial.

3.2. Protocols Supported

Regarding the protocol supported, proxy servers can be:

Single-protocol or application-oriented: they are designed for specific protocols or services
Multi-protocol: they can connect to multiple kinds of target systems

3.3. Deployment Type

Then, for the deployment, they can be:

Transparent: in this case, the network’s default gateway intercepts the outgoing packets and forces them through the proxy server
Auto-discoverable through Web Proxy Auto-Discovery (WPAD): the network admin creates a Proxy Auto-Configuration (PAC) file, a JavaScript-like script that tells compatible clients how to find the proxy servers. This configuration may use DHCP or DNS queries to provide the URL that hosts the PAC file
Auto-deployed by corporate-wide management tools such as Microsoft Group Policies
Manually configured: the user must provide the proxy settings

Finally, proxies can require user authentication or not. Again, the authentication methods that can be deployed are the same ones usually available to Web servers.

Most operating systems, like Windows, Linux, and macOS, have their own proxy configuration methods. However, user-level applications must either have built-in proxy support or rely on additional helper software.
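As one illustration of built-in proxy support, many command-line tools and libraries read proxy settings from the http_proxy and https_proxy environment variables. The sketch below shows Python's standard urllib picking them up. The address proxy.example.com:3128 is a placeholder, not a real server, and the snippet builds an opener without sending any request.

```python
import os
import urllib.request

# Placeholder proxy address; replace it with the organization's real proxy.
os.environ["http_proxy"] = "http://proxy.example.com:3128"
os.environ["https_proxy"] = "http://proxy.example.com:3128"

# getproxies() returns a mapping from URL scheme to proxy URL. When the
# environment variables above are set, they are the source of that mapping.
proxies = urllib.request.getproxies()

# An opener built with ProxyHandler would send its requests through the
# proxies in the mapping. No request is made here.
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))

print(proxies["http"])
```

Other applications may ignore these variables entirely and use the operating system's settings or their own configuration instead, which is why the deployment method matters.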

4. Proxy URIs

As with any Internet resource, proxy servers are described by uniform resource locators (URLs). The common proxy URL format is:

<Scheme>://[<user>[:<password>]@]<Host|IP address>[.<Domain>]:<Port>/

The scheme relates to the protocol used to access the proxy. Usual schemes are:

HTTP or HTTPS
SOCKS

Also, SOCKS schemes can have suffixes according to the proxy server version (4, 4a, or 5) and some options. For instance, the scheme socks5h specifies that the proxy server must do the DNS resolution.
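The format above can be taken apart with a standard URL parser. The following sketch uses Python's urllib.parse for that. The describe_proxy helper, the credentials, and the host name are made up for illustration, and the remote_dns flag simply encodes the socks4a/socks5h naming convention used by tools such as curl.

```python
from urllib.parse import urlsplit


def describe_proxy(proxy_url):
    """Split a proxy URL into the components of the format shown above."""
    parts = urlsplit(proxy_url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # Naming convention: these variants ask the proxy to resolve
        # host names instead of resolving them on the client.
        "remote_dns": parts.scheme in ("socks4a", "socks5h"),
    }


# Hypothetical proxy URL with made-up credentials.
info = describe_proxy("socks5h://alice:secret@proxy.example.com:1080/")
```

Here, info holds the scheme socks5h, the user alice, the host proxy.example.com, and the port 1080 as an integer, with remote_dns set to True.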

5. HTTPS Streams

Now we may wonder: if proxied connections are not end-to-end, how can SSL work through proxies?

As we saw in the HTTPS tutorial, the negotiation of an encrypted connection occurs between the client and the server. Thus, once the data stream is encrypted, it can’t be read by any in-between node, proxy included.

For that reason, the standard is that, as soon as the client reaches the proxy, it sends a CONNECT command to the proxy. Then, the proxy opens a new connection to the server and relays data between both connections. From that point on, once the SSL/TLS negotiation ends, the proxy cannot read the content of the conversation. As a consequence, many useful proxy functions, like caching, data compression, and content filtering, can’t happen.
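The CONNECT exchange itself is plain text. As a rough sketch, the functions below build the request a client would send and check the proxy's reply. The function names and the target host are assumptions made for illustration, and no network connection is opened.

```python
def build_connect_request(host, port):
    """Return the bytes a client sends to ask the proxy for a tunnel."""
    target = f"{host}:{port}"
    lines = [
        f"CONNECT {target} HTTP/1.1",
        f"Host: {target}",
        "",  # the empty line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def tunnel_established(status_line):
    """Return True if the proxy's status line carries a 2xx code."""
    parts = status_line.split(" ", 2)
    return len(parts) >= 2 and parts[1].isdigit() and 200 <= int(parts[1]) < 300


# Hypothetical target: an HTTPS site on the default port 443.
request = build_connect_request("www.example.com", 443)
```

After a 2xx reply, the client starts the TLS handshake over the same socket, and the proxy only copies bytes back and forth. A reply such as 407 Proxy Authentication Required means the tunnel was not opened.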

An alternative is to configure the proxy as a true man in the middle for those connections. In this mode, also known as SSL bumping, the proxy impersonates the destination server using ad-hoc generated signed certificates. In that case, there is no simple relaying, but in fact two SSL connections: client to the proxy and proxy to the server. Also, for this to work, the client must trust the certification authority the proxy uses to sign the certificates.

But SSL bumping, while allowing full functionality to the proxy, also enables full access to any user content. Passwords, credit card numbers, personal information, and everything else can be read (and logged) by the proxy.

Needless to say, it’s definitely not advisable to use this mode as its risks, in most cases, far outweigh the benefits.

6. Web Proxy Auto-Discovery – PAC Files and WPAD

Many organizations sought easy-to-configure and deploy proxy systems. The issue with transparent proxies is that they are harder to configure at the network level. Transparent proxies impose more overhead at the border routers or firewalls, and additional complexity on the proxy server itself.

That’s where proxy auto-discovery comes into play. It relies on two components: Proxy Auto-Configuration (PAC) files and the Web Proxy Auto-Discovery (WPAD) protocol.

6.1. Proxy Auto Configuration Files

PAC files are JavaScript-like scripts that derive the proxy to be used from a set of rules. They implement a function called FindProxyForURL. This function returns the proxy to use for a given URL or host, or indicates that the connection to the target must be direct.

PAC files can use functions to get the client’s metadata, such as IP addresses and networks. There are also functions to do DNS lookups and to get the current date and time. These can help to create precise rules for a lot of situations.

However, as PAC files are interpreted at runtime, they add a little latency to each request. Also, errors in PAC files can create hard-to-troubleshoot issues. Debugging PAC files is not so easy, so some browsers have specific debug modes to do that.
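Real PAC files are written in JavaScript and run inside the client. The sketch below only mirrors, in Python, the kind of decision a FindProxyForURL function typically makes, so the logic can be read and tested outside a browser. The internal network, the domain suffix, and the proxy address are all assumptions, and the return strings follow the PAC convention of "DIRECT" and "PROXY host:port".

```python
import ipaddress

INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")  # assumed internal range
INTERNAL_SUFFIX = ".corp.example.com"              # assumed internal domain
PROXY = "PROXY proxy.example.com:3128"             # assumed proxy address


def find_proxy_for_url(url, host, client_ip):
    """Python stand-in for a PAC file's FindProxyForURL(url, host).

    client_ip is passed in explicitly; a real PAC script would obtain it
    from a helper such as myIpAddress(). The url argument is kept only to
    mirror the PAC signature and is not used by these rules.
    """
    # Internal hosts are reached directly, without the proxy.
    if host == "localhost" or host.endswith(INTERNAL_SUFFIX):
        return "DIRECT"
    # Clients outside the internal network also connect directly.
    if ipaddress.ip_address(client_ip) not in INTERNAL_NET:
        return "DIRECT"
    # Everything else goes through the proxy, with a direct fallback.
    return PROXY + "; DIRECT"
```

With these assumed rules, a client at 10.1.2.3 asking for example.org is sent to the proxy, while the same client asking for wiki.corp.example.com connects directly.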

6.2. Web Proxy Auto-Discovery

WPAD advertises where the clients should go to find the relevant Proxy Auto-Configuration (PAC) files. The advertisements can use DHCP or DNS.
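With DHCP, the PAC URL is handed to the client directly in option 252. With DNS, the client looks for a host named wpad in its own domain and in its parent domains, and requests the file wpad.dat from it. The sketch below lists those DNS candidates for a given client name. The function name and the min_labels cut-off are assumptions, and real clients differ in how far up the hierarchy they search.

```python
def wpad_dns_candidates(client_fqdn, min_labels=2):
    """List the PAC URLs a client might try with DNS-based discovery.

    min_labels is an assumed safety cut-off: the search stops before the
    domain gets shorter than that many labels (e.g. before a bare "com").
    """
    labels = client_fqdn.strip(".").split(".")
    urls = []
    # Drop the host label first, then walk up the domain hierarchy.
    for start in range(1, len(labels) - min_labels + 1):
        domain = ".".join(labels[start:])
        urls.append(f"http://wpad.{domain}/wpad.dat")
    return urls


# Hypothetical client name inside a departmental subdomain.
candidates = wpad_dns_candidates("pc1.dept.example.com")
```

For the hypothetical client above, the candidates are the wpad.dat files on wpad.dept.example.com and then on wpad.example.com.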

For Windows desktops, it is enabled by default on the network configuration pane. Also, its settings are very simple to deploy using group policy. Linux currently only supports it using graphical interfaces.

7. Conclusion

As we saw in this tutorial, proxy servers are invaluable tools for enterprise network security, and many organizations rely on them to reduce their risks. They can provide additional security and performance benefits.

Knowing how they work and their different kinds is essential to choose the best way to deploy corporate-wide proxies. It’s also critical for troubleshooting when they don’t work as expected.
