Is activation only used for non-linearity?

Is an activation function only used for non-linearity, or for both problems? I am still confused about why we need an activation function and how it helps.

1 answer

  • answered 2018-08-10 05:46 dennlinger

    Generally, such a question would be better suited for Stats Stack Exchange or Data Science Stack Exchange, since it is a purely theoretical question and not directly related to programming (which is what Stack Overflow is for).

    Anyway, I am assuming that you are referring to the classes of linearly separable and non-linearly separable problems when you talk about "both problems". In fact, a non-linearity is always used, no matter which kind of problem you are trying to solve with a neural network. The reason for using non-linearities as activation functions is the following:

    Every layer in the network consists of a sequence of linear operations, plus the non-linearity.

    Formally - and this is something you might have seen before - you can express the mathematical operation of a single layer F on its input h as:

    F(h) = Wh + b

    where W is a matrix of weights and b is a bias vector. Layers are applied purely sequentially, so for a simple multi-layer perceptron (with n layers and without non-linearities), we can write the calculations as follows:

    y = F_n(F_n-1(F_n-2(...(F_1(x)))))

    which is equivalent to

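    y = W_n W_n-1 ... W_1 x + W_n ... W_2 b_1 + W_n ... W_3 b_2 + ... + W_n b_n-1 + b_n

    (each bias b_k is multiplied by the weight matrices of all layers that come after it; only b_n appears unchanged).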

    Specifically, we note that these are only matrix multiplications and additions, which we can group together (matrix multiplication is associative); in particular, we can aggregate the product of all weight matrices into one uber-matrix W_p and all bias terms into one bias b_p, to rewrite it in a single formula:

    y = W_p x + b_p

    This has the same expressive power as the above multi-layer perceptron, but can inherently be modeled by a single layer (while typically having far fewer parameters than before)!
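    As a rough illustration of this collapse, here is a minimal NumPy sketch for the case n = 2. The layer sizes, the random seed, and the names two_linear_layers and single_layer are made up for this example; they are not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Two stacked linear layers F_1 and F_2 without any activation.
# The sizes (3 -> 4 -> 2) are arbitrary choices for illustration.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_linear_layers(x):
    h = W1 @ x + b1       # F_1(x)
    return W2 @ h + b2    # F_2(F_1(x))

# Collapse both layers into a single layer y = W_p x + b_p.
W_p = W2 @ W1
b_p = W2 @ b1 + b2        # the first bias is multiplied by the later weights

def single_layer(x):
    return W_p @ x + b_p

x = rng.normal(size=3)
collapsed_matches = np.allclose(two_linear_layers(x), single_layer(x))
print(collapsed_matches)

# Parameter counts: the stacked version stores more numbers for the same map.
stacked_params = W1.size + b1.size + W2.size + b2.size
single_params = W_p.size + b_p.size
print(stacked_params, single_params)
```

    With these particular sizes, the stacked version holds 26 parameters and the collapsed one holds 8, yet both compute the same function of x.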

    Introducing non-linearities to this equation turns the simple "building blocks" F(h) into:

    F(h) = g(Wh + b)

    Now, collapsing a sequence of layers into a single one is no longer possible, and the non-linearity additionally allows the network to approximate arbitrary continuous functions, given enough hidden units.
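    To see that the layers really stop behaving like one linear layer, one possible check is the following sketch. It assumes g = tanh (any common activation would do) and uses made-up layer sizes and names. Any map of the form W x + b satisfies f(x1 + x2) = f(x1) + f(x2) - f(0), so if the network violates that identity, no single W_p and b_p can reproduce it.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Same kind of two-layer network as before, sizes chosen arbitrarily.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def g(z):
    # Assumed activation for this sketch; the answer leaves g unspecified.
    return np.tanh(z)

def net(x):
    # Each layer is F(h) = g(Wh + b).
    return g(W2 @ g(W1 @ x + b1) + b2)

x1 = rng.normal(size=3)
x2 = rng.normal(size=3)

# Identity that every map of the form W x + b must satisfy.
lhs = net(x1 + x2)
rhs = net(x1) + net(x2) - net(np.zeros(3))
is_affine = np.allclose(lhs, rhs)
print(is_affine)
```

    Replacing g with the identity function makes the check pass again, which is the collapsed case from the formulas above.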

    EDIT: To address another concern of yours ("how does it help?"), I should explicitly mention that not every problem is linearly separable, and such problems cannot be solved by a purely linear network (i.e. one without non-linearities). One classic simple example is the XOR operator.
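    The XOR case can be sketched in a few lines of NumPy. This is only an illustration under some assumptions: the linear model is fitted by least squares, the non-linearity is ReLU, the output layer is left linear, and the weights W1, b1, w2 are hand-picked for the example rather than learned.

```python
import numpy as np

# The four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Best purely linear model y ~ w1*x1 + w2*x2 + b in the least-squares sense.
A = np.hstack([X, np.ones((4, 1))])            # extra column of ones for the bias
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
linear_pred = A @ coef                         # the fit cannot tell the inputs apart

def relu(z):
    return np.maximum(z, 0.0)

# Hand-picked weights (an assumption for illustration, not a trained network).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])

def xor_net(inputs):
    hidden = relu(inputs @ W1.T + b1)          # two hidden units with ReLU
    return hidden @ w2                         # linear output layer

relu_pred = xor_net(X)
print(linear_pred)
print(relu_pred)
```

    The linear fit ends up predicting roughly 0.5 for every input, while the small network with one hidden ReLU layer reproduces the XOR targets exactly.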