## 1. Introduction

In this tutorial, we’ll learn about the ugly duckling theorem and its relation to machine learning.

We’ll start by discussing the problem of algorithmic bias and its impact on the development of machine learning models.

Then, we’ll concentrate on the ugly duckling theorem and study its formulation. In this manner, we’ll learn why some kind of bias is unavoidable in any classification task.

At the end of this tutorial, we’ll be able to explain why classification without some kind of prejudice is impossible.

## 2. The Problem of Algorithmic Bias

### 2.1. Systematic Errors in Classification or Selection

Some algorithms can have systematic errors in the way they produce an outcome and end up with an unwarranted result that favors certain classes over others. **This is the problem that, in the literature on computer science, we call algorithmic bias**.

Algorithmic bias can occur in a variety of contexts, including autonomous systems, the healthcare system, marketing campaigns, the delivery of digital services to consumers, and automated decision-making for e-governance.

The classical example of algorithmic bias in natural language processing consists of the association of female names with family and relationship and of male names with career and professional success, which has been learned by some models and has been noted in the literature.

### 2.2. Algorithmic Bias Isn’t the Bias Term

In the context of artificial intelligence, we frequently talk about bias in relation to machine learning models. Therefore, we might be tempted to reduce algorithmic bias to model bias.

This would be a mistake, though. **Model bias is a subset of algorithmic bias but isn’t reducible to it**. Algorithmic bias includes model bias, but also the selection bias originating from non-representative training data. It also includes our prejudice with regards to the measurements that we decide to consider as features of interest.

Finally, it also includes the bias that derives from the ontology that we choose for classification tasks. The ugly duckling theorem is particularly important in regards to the latter because **it proves that no method for classification is intrinsically preferable to any other**.

## 3. The Ugly Duckling Theorem

### 3.1. The Space of All Possible Ducks

A common task in zoology consists of the creation of taxonomies for animal species. In machine learning, we’re used to thinking of taxonomy in terms of classification problems for supervised learning. In zoology, instead, we think of taxonomy as the problem of grouping animals according to their similarity.

**The underlying idea is that if two animals present a sufficient number of similar features, then the two animals belong to the same taxonomic group**. For example, we may want to study the pair-wise similarity between an ugly duckling and two beautiful swans:

The intuitive idea that we have is that, of course, the two swans are very similar to one another, while the ugly duckling is the odd one out. **The ugly duckling theorem tells us however, that this isn’t necessarily the case**.

### 3.2. The Abstract Notion of a Duck

Ducks live in a world of abstractions **but can also be treated as abstract concepts themselves**. Let’s imagine we’re coding an abstract class called “duck”, to which we associate a finite number of boolean features:

These features can represent, for example, the color, the size, or the beak of a duck, or any other physical, behavioral, or psychological characteristics of the waterfowl.

Let’s now say that we want to group some ducks in classes, but we have no preconceived idea as to which features are more important than others. If this is the case, we can list all possible combinations of abstract features and the values potentially associated with them. **We can then pick any arbitrary feature and sort lexicographically the string of bits** that describe that feature first and all other features second.

Let’s use for simplicity a value , and list all possible combinations:

For a given duck in the space of all possible ducks, we can then ask ourselves the question: what ducks are most dissimilar to it? **We can answer this question by computing the Hamming distance between all pairs of strings**. If we start from, say, duck number 1, we can see that duck number 4 is the one with the most dissimilar digits, and specifically two.

This graph represents the distance between the abstract ducks in duck-space, according to their Hamming distance:

**Starting from any given duckling, the ugly duckling is the one that’s most dissimilar to the first**, as calculated in this manner. In the case above, the ugly duckling in relation to the first abstract duck is duck number 4.

### 3.3. The Similarity of Concrete Ducks

The restriction of comparing only individual features, however, is largely arbitrary. It might have, in fact, been more informative to select multiple features out of the available. or, indeed a boolean function that combines them all. Because we don’t have a criterion for selecting which boolean function to use, however, **the only unbiased approach is for us to select all of them**.

There are boolean functions for a feature vector of size . In this context, we can also think of as the arity of the boolean functions that we consider. If we start with the two features that we saw earlier, then we can generate a binary string of size that corresponds to the output of all boolean functions.

Let’s take an example to clarify this further. We can encode with the proposition “it’s smiling”, and with the proposition “it’s wearing a top-hat”, and measure these features in the three ducks we represented above.

This is the representation of all possible boolean functions comprising the three ducks:

We can now notice how each duck has exactly bits in common with any other duck, and bits that differ from it. In other words, we can see that the pair-wise hamming distance between the vectors associated with all possible boolean functions is always equal to . **That is to say, all ducks are equally similar or equally dissimilar to any other**.

This property is independent of the specific value that assumes. Increasing the number of features we select as meaningful increases the number of boolean functions accordingly. **The ratio between the number of similar and the dissimilar bits, however, remains equal to 1**.

Notice that this is also agnostic with regards to the specific features that we select for our measurements. The only condition we have to respect is that no two ducks have the same feature vector, **but this condition is always satisfied**. It is, in fact, always true if all ducks are discernable as distinct objects.

### 3.4. All Ducks Are Beautiful

The conclusion of the argument we studied here is that, of course, there are no ugly ducklings at all. **All ducklings are beautiful on their own, and it’s their individuality that makes them so**.

If we bring this back to the problem of classification and bias in machine learning, we can now understand why a classification without some kind of bias is impossible. We need a rule or an indication of some kind that tells us what features to favor over what others. **In the absence of that, we’re bound to consider any two observations as equally similar or dissimilar to any other**.

## 4. Conclusion

In this article, we studied the ugly duckling theorem in its relationship with algorithmic bias.