# Is activation only used for non-linearity?

Is activation only used to introduce non-linearity, or does it help with both problems? I am still confused about why we need an activation function and how it helps.

Generally, such a question would be better suited for Stats Stack Exchange or Data Science Stack Exchange, since it is purely theoretical and not directly related to programming (which is what Stack Overflow is for).

Anyway, I am assuming that by "both problems" you are referring to linearly separable and non-linearly separable problems. In fact, a non-linearity is always used, no matter which kind of problem you are trying to solve with a neural network. The reason for using non-linearities as activation functions is simply the following:

### Every layer in the network consists of a sequence of linear operations, plus the non-linearity.

Formally - and this is something you might have seen before - you can express the mathematical operation of a single layer `F` on its input `h` as:

```
F(h) = Wh + b
```

where `W` is a weight matrix and `b` a bias vector. These layers are applied purely sequentially, so for a simple multi-layer perceptron (with `n` layers and without non-linearities), we can write the calculation as follows:

```
y = F_n(F_n-1(F_n-2(...F_1(x)...)))
```

which is equivalent to

```
y = W_n W_n-1 ... W_1 x + (W_n W_n-1 ... W_2 b_1 + W_n W_n-1 ... W_3 b_2 + ... + b_n)
```

(note that each bias term gets multiplied by all the weight matrices that come after it.)

Specifically, we note that this expression consists only of matrix multiplications and additions, which we can rearrange as we like; in particular, we can aggregate the weight products into a single matrix `W_p` and the bias terms into a single vector `b_p`, rewriting the whole network as one formula:

```
y = W_p x + b_p
```

This has the same expressive power as the multi-layer perceptron above, but is inherently a single layer - with far fewer parameters than before!
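To make this concrete, here is a minimal NumPy sketch (random weights and illustrative layer sizes, chosen purely for demonstration) showing that two stacked linear layers compute exactly the same function as one collapsed layer `W_p x + b_p`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers without non-linearity, illustrative sizes: 3 -> 4 -> 2
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Applying the layers one after another...
y_stacked = W2 @ (W1 @ x + b1) + b2

# ...gives the same result as a single collapsed layer:
W_p = W2 @ W1        # combined weight matrix
b_p = W2 @ b1 + b2   # combined bias (b1 is multiplied through W2)
y_single = W_p @ x + b_p

print(np.allclose(y_stacked, y_single))  # True
```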

Introducing non-linearities to this equation turns the simple "building blocks" `F(h)` into:

```
F(h) = g(Wh + b)
```

Now, collapsing the sequence of layers into a single one is no longer possible, and the non-linearity additionally allows the network to approximate arbitrary functions.
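A tiny hand-crafted example (weights picked purely for illustration, so that the ReLU actually clips something) shows that with a non-linearity `g` in between, the collapsed linear layer from before no longer reproduces the network's output:

```python
import numpy as np

# Hand-picked weights: W2 @ W1 happens to be the zero matrix,
# but the ReLU in between makes the network non-trivial.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.zeros(2)
W2 = np.array([[1.0, 1.0]])
b2 = np.zeros(1)
x = np.array([2.0, 0.0])

g = lambda z: np.maximum(z, 0.0)  # ReLU non-linearity

y_nonlinear = W2 @ g(W1 @ x + b1) + b2        # g([2, -2]) = [2, 0]  ->  [2.]
y_collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)  # zero matrix          ->  [0.]

print(y_nonlinear, y_collapsed)  # [2.] [0.]
```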

EDIT: To address another concern of yours ("how does it help?"), I should explicitly mention that not every problem is linearly separable, and a purely linear network (i.e. one without non-linearities) can only solve the linearly separable ones. One classic simple example is the XOR operator.
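As a sketch of this point, XOR can be computed exactly by a two-layer network with ReLU activations (the weights here are hand-crafted for illustration, not learned), even though no single linear layer can represent it:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hand-crafted weights: hidden units h1 = relu(x1 + x2) and h2 = relu(x1 + x2 - 1),
# output y = h1 - 2 * h2, which equals XOR(x1, x2) on {0, 1} inputs.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])

def xor_net(x):
    return W2 @ relu(W1 @ np.asarray(x, dtype=float) + b1)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(x))  # -> 0.0, 1.0, 1.0, 0.0
```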