Machine learning questions with parts

Q1 –
You are given the following training dataset, made of pairs (feature, label), Dtrain = {([0 −2], −1), ([0 2], 1), ([1 0], −1)}. Without setting up and solving any learning or optimization problem, answer the following questions:
1. Provide a plot of the two-dimensional feature space where you have marked the three features that you have been provided.
2. On the same plot, draw the decision boundary that maximizes the margin for each of the features.
3. Provide a weight vector that defines the decision boundary that you have drawn before.
4. Provide the geometric margin for each of the three features.
5. Does the linear predictor defined by the weight vector above perform a perfect classification (meaning without making any classification error) on the features of Dtrain?
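A hand-drawn answer to the question above can be sanity-checked numerically. The sketch below assumes a decision boundary through the origin (the question asks only for a weight vector, with no bias term) and computes the geometric margin y (w · x) / ||w|| of each training pair. The names `geometric_margins`, `classifies_perfectly`, and `w_candidate` are illustrative, and the candidate vector is an arbitrary placeholder rather than the intended answer; replace it with the vector read off your own plot.

```python
import math

# Training set from the question: (feature, label) pairs.
D_TRAIN = [((0.0, -2.0), -1), ((0.0, 2.0), 1), ((1.0, 0.0), -1)]


def geometric_margins(w, data):
    """Return y * (w . x) / ||w|| for every (x, y) pair in data.

    Assumes a boundary through the origin (no bias term).
    """
    norm = math.hypot(w[0], w[1])
    return [y * (w[0] * x[0] + w[1] * x[1]) / norm for x, y in data]


def classifies_perfectly(w, data):
    """True when every point lies strictly on the correct side of the boundary."""
    return all(margin > 0 for margin in geometric_margins(w, data))


# Arbitrary placeholder candidate, NOT claimed to be the max-margin solution.
w_candidate = (-1.0, 1.0)
print(geometric_margins(w_candidate, D_TRAIN))
print(classifies_perfectly(w_candidate, D_TRAIN))
```

A candidate whose smallest geometric margin is larger than that of every other candidate is the one the question is after, so comparing `min(geometric_margins(...))` across a few candidates is a quick way to test a drawing.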
Address the following questions:
1. What is the motivation for using the hinge loss, as opposed to the 0-1 loss, in a binary classification problem?
2. Why would you use the logistic loss rather than the hinge loss in a classification problem?
3. In a regression problem, explain how you would expect your learning outcome to change when you use the absolute deviation loss, as opposed to the squared loss.
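To make the comparison in the three questions above concrete, the following sketch evaluates the 0-1, hinge, and logistic losses as functions of the margin y (w · φ(x)), and then contrasts the squared and absolute deviation losses for a constant predictor. The function names and the `targets` list are made up for illustration, and counting a margin of exactly 0 as an error is a convention assumed here.

```python
import math
import statistics


def zero_one_loss(margin):
    # Assumed convention: a margin of exactly 0 counts as a mistake.
    return 1.0 if margin <= 0 else 0.0


def hinge_loss(margin):
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    return math.log(1.0 + math.exp(-margin))


# The 0-1 loss is flat almost everywhere (zero gradient), the hinge loss is
# zero once the margin reaches 1, and the logistic loss stays positive.
for margin in (-2.0, 0.0, 0.5, 1.0, 3.0):
    print(margin, zero_one_loss(margin), hinge_loss(margin),
          round(logistic_loss(margin), 4))

# Constant predictor on illustrative targets with one outlier: the squared
# loss is minimized by the mean, the absolute deviation loss by the median.
targets = [1.0, 2.0, 3.0, 4.0, 100.0]
print(statistics.mean(targets))
print(statistics.median(targets))
```

The last two lines show how a single outlier pulls the squared-loss minimizer far more than the absolute-deviation minimizer.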
You are running a gradient descent optimization to learn the parameters of your model, and after a few iterations you notice that the values of the parameters being estimated are oscillating.
1. Explain what is happening.
2. What would you suggest doing to complete your optimization procedure?
3. Explain in what situation it makes sense to switch from gradient descent optimization to stochastic gradient descent optimization.
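One common cause of oscillating parameters is a step size that is too large. The sketch below illustrates this on a toy objective f(w) = w², which is an assumption made for illustration and not part of the question. The update is w ← w − η · 2w = (1 − 2η) w, so the iterates flip sign whenever η > 0.5 and grow in magnitude whenever η > 1. The function name `gradient_descent` and the chosen step sizes are hypothetical.

```python
def gradient_descent(step_size, w0=1.0, num_steps=6):
    """Minimize the toy objective f(w) = w**2 and return all iterates."""
    iterates = [w0]
    w = w0
    for _ in range(num_steps):
        gradient = 2.0 * w          # derivative of w**2
        w = w - step_size * gradient
        iterates.append(w)
    return iterates


# Small step: monotone convergence. Larger step: sign-flipping oscillation
# that still shrinks. Even larger step: oscillation that diverges.
for eta in (0.1, 0.9, 1.1):
    print(eta, [round(w, 3) for w in gradient_descent(eta)])
```

Reducing the step size, or decaying it over iterations, removes the sign flips in this toy setting.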
Consider the following hypothesis class F = {x ↦ w1 w2x² : [w1 w2] ∈ ℝ²}, where x ∈ ℝ is the input datum.
1. Provide a hypothesis class that is less expressive than F.
2. Provide a hypothesis class that is more expressive than F.
3. Provide a hypothesis class that is disjoint from F.
Consider the squared loss Loss(x, y, w) = (y − max{w · φ(x), 0})².
1. Draw the computational graph of the function Loss(x, y, w).
2. Number the internal nodes 1, 2, ..., and next to every node i in the graph, indicate the forward value with f_i and the backward value with g_i. Also, on the edges of the graph, indicate the corresponding derivatives, using the forward values as appropriate. Provide the expressions of the g_i's as functions of other backward and edge values, as appropriate.
3. Assume that w = [1 2], φ(x) = [−1 1], and y = 3. Compute all the forward values, effectively performing a forward pass.
4. Using the forward values computed previously, compute all the backward values, effectively performing a backward pass. In particular, also compute the quantity ∂Loss/∂w.
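A hand-computed forward and backward pass can be checked with a few lines of code. The sketch below assumes one possible node numbering (1: the dot product w · φ(x), 2: the max with 0, 3: the residual y − f_2); the question leaves the numbering to the reader, so this choice and the function names are illustrative only. At the kink of the max, where f_1 = 0, the sketch uses 0 as the derivative, which is also an assumption. A central finite-difference estimate is included as an independent check of ∂Loss/∂w.

```python
def forward_backward(w, phi, y):
    """Return (loss, forward values, backward values, gradient w.r.t. w)."""
    # Forward pass.
    f1 = sum(wi * pi for wi, pi in zip(w, phi))  # node 1: w . phi(x)
    f2 = max(f1, 0.0)                            # node 2: max{f1, 0}
    f3 = y - f2                                  # node 3: residual
    loss = f3 ** 2                               # root: squared residual

    # Backward pass: each g_i is dLoss/df_i, built from the parent's
    # backward value times the derivative on the connecting edge.
    g3 = 2.0 * f3                                # edge root -> 3: 2 * f3
    g2 = g3 * (-1.0)                             # edge 3 -> 2: -1
    g1 = g2 * (1.0 if f1 > 0 else 0.0)           # edge 2 -> 1: 1[f1 > 0]
    grad_w = [g1 * pi for pi in phi]             # edge 1 -> w: phi(x)
    return loss, (f1, f2, f3), (g1, g2, g3), grad_w


def numeric_grad(w, phi, y, eps=1e-6):
    """Central finite-difference estimate of dLoss/dw."""
    grads = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        loss_plus = forward_backward(w_plus, phi, y)[0]
        loss_minus = forward_backward(w_minus, phi, y)[0]
        grads.append((loss_plus - loss_minus) / (2.0 * eps))
    return grads


# Values given in the question.
w, phi, y = [1.0, 2.0], [-1.0, 1.0], 3.0
loss, forward_values, backward_values, grad_w = forward_backward(w, phi, y)
print(loss, forward_values, backward_values, grad_w)
print(numeric_grad(w, phi, y))
```

If the analytic gradient and the finite-difference estimate disagree by more than a small tolerance, one of the edge derivatives in the graph is likely wrong.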