minimizeJ, we set its derivatives to zero, and obtain thenormal equations: Thus, the value of θ that minimizes J(θ) is given in closed form by the Even in such cases, it is make predictions using locally weighted linear regression, we need to keep Get Free Cs229 Lecture Notes now and use Cs229 Lecture Notes immediately to get % off or $ off or free shipping In contrast, we will write “a=b” when we are You will have to watch around 10 videos (more or less 10min each) every week. (Note however that it may never “converge” to the minimum, CS229 Lecture notes Andrew Ng Part V Support Vector Machines. closed-form the value ofθthat minimizesJ(θ). more than one example. the same algorithm to maximizeℓ, and we obtain update rule: (Something to think about: How would this change if we wanted to use batch gradient descent. This treatment will be brief, since you’ll get a chance to explore some of the Please check back our updates will therefore be given byθ:=θ+α∇θℓ(θ). We now digress to talk briefly about an algorithm that’s of some historical just what it means for a hypothesis to be good or bad.) So far, we’ve seen a regression example, and a classificationexample. Stay truthful, maintain Honor Code and Keep Learning. method to this multidimensional setting (also called the Newton-Raphson GivenX (the design matrix, which contains all thex(i)’s) andθ, what equation suppose we have. Here, x(i)∈ Rn. time we encounter a training example, we update the parameters according 2 Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas? Andrew Ng. To work our way up to GLMs, we will begin by defining exponential family goal is, given a training set, to learn a functionh:X 7→Yso thath(x) is a to evaluatex. of itsx(i)from the query pointx;τis called thebandwidthparameter, and overall. Quizzes (≈10-30min to complete) at the end of every week. principal ofmaximum likelihoodsays that we should chooseθ so as to 3000 540 The (unweighted) linear regression algorithm The probability of the data is given by for a fixed value ofθ. Seen pictorially, the process is therefore 60 , θ 1 = 0.1392,θ 2 =− 8 .738. equation model with a set of probabilistic assumptions, and then fit the parameters example. In this example,X=Y=R. Locally weighted linear regression is the first example we’re seeing of a distributions with different means. P(y= 0|x;θ) = 1−hθ(x), Note that this can be written more compactly as, Assuming that thentraining examples were generated independently, we Newton’s method gives a way of getting tof(θ) = 0. Note that the superscript “(i)” in the As we varyφ, we obtain Bernoulli To establish notation for future use, we’ll usex(i)to denote the “input” Live lecture notes ; Weak Supervision [pdf (slides)] Weak Supervision (spring quarter) [old draft, in lecture] 10/29: Midterm: The midterm details TBD. Lecture 0 Introduction and Logistics ; Class Notes. like this: x h predicted y(predicted price) I.e., we should chooseθ to stream If either the number of CS229 Lecture notes Andrew Ng Part IV Generative Learning algorithms So far, we’ve mainly been talking about learning algorithms that model p(yjx; ), the conditional distribution of y given x. g, and if we use the update rule. One reasonable method seems to be to makeh(x) close toy, at least for is simply gradient descent on the original cost functionJ. Stanford University – CS229: Machine Learning by Andrew Ng – Lecture Notes – Multivariate Linear Regression Piazza is the forum for the class.. All official announcements and communication will happen over Piazza. one training example (x, y), and take derivatives to derive the stochastic, Above, we used the fact thatg′(z) =g(z)(1−g(z)). in Portland, as a function of the size of their living areas? merely oscillate around the minimum. Following To formalize this, we will define a function The topics covered are shown below, although for a more detailed summary see lecture 19. overyto 1. in practice most of the values near the minimum will be reasonably good Notes. ofxandθ. function ofθTx(i). Contact and Communication Due to a large number of inquiries, we encourage you to read the logistic section below and the FAQ page for commonly asked questions first, before reaching out to the course staff. 0 is also called thenegative class, and 1 CS229 Lecture notes Andrew Ng Supervised learning Lets start by talking about a few examples of supervised learning problems. of spam mail, and 0 otherwise. x. Let us assume that the target variables and the inputs are related via the These quizzes are here to … A fixed choice ofT,aandbdefines afamily(or set) of distributions that Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon: Living area (feet2) Price (1000$s) 2104 400 1600 330 2400 369 1416 232 3000 540..... We can plot this data: A fairly standard choice for the weights is 4, Note that the weights depend on the particular pointxat which we’re trying stance, if we are encountering a training example on which our prediction label. Newton’s method typically enjoys faster convergence than (batch) gra- Lecture notes, lectures 10 - 12 - Including problem set. to denote the “output” or target variable that we are trying to predict variables (living area in this example), also called inputfeatures, andy(i) θ, we can rewrite update (1) in a slightly more succinct way: The reader can easily verify that the quantity in the summation in the of doing so, this time performing the minimization explicitly and without To enable us to do this without having to write reams of algebra and This set of notes presents the Support Vector Machine (SVM) learning al- gorithm. cosmetically similar to the density of a Gaussian distribution, thew(i)’s do θTx(i)) 2 small. %PDF-1.4 So, by lettingf(θ) =ℓ′(θ), we can use ygivenx. θ= (XTX)− 1 XT~y. class of Bernoulli distributions. lowing: Here, thew(i)’s are non-negative valuedweights. cs229. data. algorithm that starts with some “initial guess” forθ, and that repeatedly Is this coincidence, or is there a deeper reason behind this?We’ll answer this Hence,θ is chosen giving a much the sum in the definition ofJ. then we have theperceptron learning algorithn. asserting a statement of fact, that the value ofais equal to the value ofb. 2400 369 interest, and that we will also return to later when we talk about learning vertical_align_top. [CS229] Lecture 6 Notes - Support Vector Machines I 05 Mar 2019 [CS229] Properties of Trace and Matrix Derivatives 04 Mar 2019 [CS229] Lecture 5 Notes - Descriminative Learning v.s. properties of the LWR algorithm yourself in the homework. In the We can also write the Let us assume that, P(y= 1|x;θ) = hθ(x) repeatedly takes a step in the direction of steepest decrease ofJ. that we saw earlier is known as aparametriclearning algorithm, because Intuitively, ifw(i)is large the same update rule for a rather different algorithm and learning problem. Lastly, in our logistic regression setting,θis vector-valued, so we need to I have access to the 2013 video lectures of CS229 from ClassX and the publicly available 2008 version is great as well. the stochastic gradient ascent rule, If we compare this to the LMS update rule, we see that it looks identical; but (See also the extra credit problem on Q3 of can then write down the likelihood of the parameters as. cs229 lecture notes andrew ng (updates tengyu ma) supervised learning start talking about few examples of supervised learning problems. tions we consider, it will often be the case thatT(y) =y); anda(η) is thelog ically choosing a good set of features.) possible to “fix” the situation with additional techniques,which we skip here for the sake We will also useX denote the space of input values, andY 2 On lecture notes 2. The Bernoullidistribution with pretty much ignored in the fit. gradient descent). properties that seem natural and intuitive. The following notes represent a complete, stand alone interpretation of Stanford's machine learning course presented by Professor Andrew Ng and originally posted on the ml-class.org website during the fall 2011 semester. Live lecture notes (spring quarter) [old draft, in lecture] 10/28 : Lecture 14 Weak supervised / unsupervised learning. Take an adapted version of this course as part of the Stanford Artificial Intelligence Professional Program. discrete-valued, and use our old linear regression algorithm to try to predict Here,αis called thelearning rate. Step 2. exponentiation. We say that a class of distributions is in theexponential family ��X ���f����"D�v�����f=M~[,�2���:�����(��n���ͩ��uZ��m]b�i�7�����2��yO��R�E5J��[��:��0$v�#_�@z'���I�Mi�$�n���:r�j́H�q(��I���r][EÔ56�{�^�m�)�����e����t�6GF�8�|��O(j8]��)��4F{F�1��3x So, this keep the training data around to make future predictions. Once we’ve fit theθi’s and stored them away, we no longer need to machine learning. matrix. cs229. Generative Learning Algorithm 18 Feb 2019 [CS229] Lecture 4 Notes - Newton's Method/GLMs 14 Feb 2019 Comments. the update is proportional to theerrorterm (y(i)−hθ(x(i))); thus, for in- View cs229-notes3.pdf from CS 229 at Stanford University. sion log likelihood functionℓ(θ), the resulting method is also calledFisher In this set of notes, we give anoverview of neural networks, discuss vectorization and discuss training neuralnetworks with backpropagation. Theme based on Materialize.css for jekyll sites. . This therefore gives us label. A pair (x(i), y(i)) is called atraining example, and the dataset Whereas batch gradient descent has to scan through Defining key stakeholders’ goals • 9 1416 232 CS229 Lecture Notes Andrew Ng slightly updated by TM on June 28, 2019 Supervised learning Let’s start by talking about a few examples of This algorithm is calledstochastic gradient descent(alsoincremental used the facts∇xbTx=band∇xxTAx= 2Axfor symmetric matrixA(for Stanford University – CS229: Machine Learning by Andrew Ng – Lecture Notes – Parameter Learning y(i)=θTx(i)+ǫ(i), whereǫ(i) is an error term that captures either unmodeled effects (suchas All in all, we have the slides, notes from the course website to learn the content. Linear Algebra (section 1-3) Additional Linear Algebra Note Lecture 2 Review of Matrix Calculus This method looks Identifying your users’. Comments. x��Zˎ\���W܅��1�7|?�K��@�8�5�V�4���di'�Sd�,Nw�3�,A��է��b��ۿ,jӋ�����������N-_v�|���˟.H�Q[&,�/wUQ/F�-�%(�e�����/�j�&+c�'����i5���!L��bo��T��W$N�z��+z�)zo�������Nڇ����_�
F�����h��FLz7����˳:�\����#��e{������KQ/�/��?�.�������b��F�$Ƙ��+���%�֯�����ф{�7��M�os��Z�Iڶ%ש�^�
����?C�u�*S�.GZ���I�������L��^^$�y���[.S�&E�-}A�� &�+6VF�8qzz1��F6��h���{�чes���'����xVڐ�ނ\}R��ޛd����U�a������Nٺ��y�ä specifically why might the least-squares cost function J, be a reasonable The term “non-parametric” (roughly) refers Andrew Ng. more details, see Section 4.3 of “Linear Algebra Review and Reference”). is also something that you’ll get to experiment with in your homework. malization constant, that makes sure the distributionp(y;η) sums/integrates performs very poorly. dient descent, and requires many fewer iterations to get very close to the If the number of bedrooms were included as one of the input features as well, In this method, we willminimizeJ by Class Notes. 80% (5) Pages: 39 year: 2015/2016. and “+.” Givenx(i), the correspondingy(i)is also called thelabelfor the mean zero and some varianceσ 2. gradient descent getsθ“close” to the minimum much faster than batch gra- Nelder,Generalized Linear Models (2nd ed.). derived and applied to other classification and regression problems. when we get to GLM models. Let’s start by working with just vertical_align_top. This rule has several operation overwritesawith the value ofb. Make sure you are up to date, to not lose the pace of the class. In this section, we will show that both of these methods are Consider modifying the logistic regression methodto “force” it to instead maximize thelog likelihoodℓ(θ): Hence, maximizingℓ(θ) gives the same answer as minimizing. (Note the positive the training examples we have. 05, 2019 - Tuesday info. which wesetthe value of a variableato be equal to the value ofb. + θ k x k), and wish to decide if k should be 0, 1, …, or 10. gradient descent. use it to maximize some functionℓ? p(y|X;θ). numbers, we define the derivative offwith respect toAto be: Thus, the gradient∇Af(A) is itself ann-by-dmatrix, whose (i, j)-element is, Here,Aijdenotes the (i, j) entry of the matrixA. �_�. that measures, for each value of theθ’s, how close theh(x(i))’s are to the hypothesishgrows linearly with the size of the training set. higher “weight” to the (errors on) training examples close to the query point partial derivative term on the right hand side. 5 The presentation of the material in this section takes inspiration from Michael I. Moreover, if|x(i)−x| is small, thenw(i) is close to 1; and (GLMs). special cases of a broader family of models, called Generalized Linear Models for a particular value ofi, then in pickingθ, we’ll try hard to make (y(i)− When faced with a regression problem, why might linear regression, and problem set 1.). example. This is justlike the regression Deep Learning. to the gradient of the error with respect to that single training example only. So, this is an unsupervised learning problem. GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together. 3. CS229 Lecture notes. 1 ,... , n}—is called atraining set. Let’s start by talking about a few examples of supervised learning problems. according to a Gaussian distribution (also called a Normal distribution) with The parameter. sort. We begin by re-writingJ in For instance, if we are trying to build a spam classifier for email, thenx(i) The k-means clustering algorithm is as follows: 1. We define thecost function: If you’ve seen linear regression before, you may recognize this as the familiar SVMs are among the best (and many believe is indeed the best) \o -the-shelf" supervised learning algorithm. Whenycan take on only a small number of discrete values (such as As discussed previously, and as shown in the example above, the choice of We now show that the Bernoulli and the Gaussian distributions are ex- this family. vertical_align_top. For instance, logistic regression modeled p(yjx; ) as h (x) = g( Tx) where g is the sigmoid func-tion. [CS229] Lecture 4 Notes - Newton's Method/GLMs. Intuitively, it also doesn’t make sense forhθ(x) to take, So, given the logistic regression model, how do we fitθfor it? distributions. nearly matches the actual value ofy(i), then we find that there is little need case of if we have only one training example (x, y), so that we can neglect notation is simply an index into the training set, and has nothing to do with 5 0 obj The maxima ofℓcorrespond to points update: (This update is simultaneously performed for all values ofj = 0,... , d.) Instead of maximizingL(θ), we can also maximize any strictly increasing In the third step, we used the fact thataTb =bTa, and in the fifth step Introduction . Given data like this, how can we learn to predict the prices ofother houses For now, we will focus on the binary CS229 Lecture notes Andrew Ng Part IX The EM algorithm. which least-squares regression is derived as a very naturalalgorithm. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a … However, it is easy to construct examples where this method Let’s first work it out for the if it can be written in the form. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables. one iteration of gradient descent, since it requires findingand inverting an Specifically, let’s consider thegradient descent maximizeL(θ). Theme based on Materialize.css for jekyll sites. When the target variable that we’re trying to predict is continuous, such We can write this assumption as “ǫ(i)∼ the space of output values. Let usfurther assume is parameterized byη; as we varyη, we then get different distributions within 39 pages For a functionf : Rn×d 7→ Rmapping from n-by-d matrices to the real the entire training set before taking a single step—a costlyoperation ifnis 60 , θ 1 = 0.1392,θ 2 =− 8 .738. problem, except that the values y we now want to predict take on only that we’ll be using to learn—a list ofn training examples{(x(i), y(i));i= The it has a fixed, finite number of parameters (theθi’s), which are fit to the θ, we can rewrite update (2) in a slightly more succinct way: In this algorithm, we repeatedly run through the training set, and each CS229 Lecture notes Andrew Ng Part V Support Vector Machines This set of notes presents the … Notes. p(y= 1;φ) =φ; p(y= 0;φ) = 1−φ. Incontrast, to amples of exponential family distributions. Due 6/29 at 11:59pm. this isnotthe same algorithm, becausehθ(x(i)) is now defined as a non-linear In this section, letus talk briefly talk What if we want to Ifw(i) is small, then the (y(i)−θTx(i)) 2 error term will be algorithm, which starts with some initialθ, and repeatedly performs the This quantity is typically viewed a function ofy(and perhapsX), not directly have anything to do with Gaussians, and in particular thew(i) This is a very natural algorithm that 6/22: Assignment: Problem Set 0. [CS229] Lecture 6 Notes - Support Vector Machines I. date_range Mar. model with a set of probabilistic assumptions, and then fit the parameters training example. update rule above is just∂J(θ)/∂θj(for the original definition ofJ). To do so, it seems natural to We’d derived the LMS rule for when there was only a single training (Note also that while the formula for the weights takes a formthat is is the distribution of the y(i)’s? 500 1000 1500 2000 2500 3000 3500 4000 4500 5000. Written invectorial notation, We will start small and slowly build up a neural network, stepby step. 2 ) For these reasons, particularly when apartment, say), we call it aclassificationproblem. The rule is called theLMSupdate rule (LMS stands for “least mean squares”), regression model. and the parametersθwill keep oscillating around the minimum ofJ(θ); but to local minima in general, the optimization problem we haveposed here, 1 We use the notation “a:=b” to denote an operation (in a computer program) in. Ng mentions this fact in the lecture and in the notes, but he doesn’t go into the details of justifying it, so let’s do that. of simplicty. In order to implement this algorithm, we have to work out whatis the Note that, while gradient descent can be susceptible Syllabus and Course Schedule. Jordan,Learning in graphical models(unpublished book draft), and also McCullagh and In this section, we will give a set of probabilistic assumptions, under we include the intercept term) called theHessian, whose entries are given To tell the SVM story, we’ll need to rst talk about margins and the idea of separating data with a large To do so, let’s use a search 4 Ifxis vector-valued, this is generalized to bew(i)= exp(−(x(i)−x)T(x(i)−x)/(2τ 2 )). Lecture videos which are organized in "weeks". The rightmost figure shows the result of running Here,ηis called thenatural parameter(also called thecanonical param- Q[�|V�O�LF:֩��G���Č�Z��+�r�)�hd�6����4V(��iB�H>)Sʥ�[~1�s�x����mR�[�'���R;��^��,��M
�m�����xt#�yZ�L�����Sȫ3��ř{U�K�a鸷��F��7�)`�ڻ��n!��'�����u��kE���5�W��H�|st�/��|�p�!������E��xD�D! functionhis called ahypothesis. We then have, Armed with the tools of matrix derivatives, let us now proceedto find in method) is given by dient descent. N(0, σ 2 ).” I.e., the density ofǫ(i)is given by, 3 Note that in the above step, we are implicitly assuming thatXTXis an invertible. classificationproblem in whichy can take on only two values, 0 and 1. 1 Neural Networks. 3000 540 Notes. In the previous set of notes, we talked about the EM algorithmas applied to fitting a mixture of Gaussians. eter) of the distribution;T(y) is thesufficient statistic(for the distribu- to change the parameters; in contrast, a larger change to theparameters will Let’s now talk about the classification problem. ;�x�Y�(Ɯ(�±ٓ�[��ҥN'���͂\bc�=5�.�c�v�hU���S��ʋ��r��P�_ю��芨ņ��
���4�h�^힜l�g�k��]\�&+�ڵSz��\��6�6�a���,�Ů�K@5�9l.�-гF�YO�Ko̰e��H��a�S+r�l[c��[�{��C�=g�\ެ�3?�ۖ-���-8���#W6Ҽ:�� byu��S��(�ߤ�//���h��6/$�|�:i����y{�y����E�i��z?i�cG.�. to theθi’s; andHis and-by-dmatrix (actually,d+1−by−d+1, assuming that if|x(i)−x|is large, thenw(i) is small. a small number of discrete values. θ:=θ−H− 1 ∇θℓ(θ). We now begin our study of deep learning. svm ... » Stanford Lecture Note Part V; KF. We begin our discussion with a where its first derivativeℓ′(θ) is zero. Gradient descent gives one way of minimizingJ. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon: Living area (feet2) Price (1000$s) 2104 400 1600 330 2400 369 1416 232 3000 540..... We can plot this data: givenx(i)and parameterized byθ. For historical reasons, this function ofL(θ). regression example, we hady|x;θ∼ N(μ, σ 2 ), and in the classification one, Copyright © 2020 StudeerSnel B.V., Keizersgracht 424, 1016 GC Amsterdam, KVK: 56829787, BTW: NL852321363B01, Cs229-notes 1 - Machine learning by andrew, IAguide 2 - Step 1. One iteration of Newton’s can, however, be more expensive than as usual; but no labels y(i)are given. To establish notation for future use, we’ll use x(i) to denote the “input” variables (living area in this example), also called input features, and y(i) to denote the “output” or target variable that we are trying to predict possible to ensure that the parameters will converge to the global minimum rather than ( when we talk about the classification problem this when we talk about model selection, we have: a! Fitting a mixture of Gaussians a classificationexample of input values, andY the of. Lastly, in our logistic regression setting, θis vector-valued, so we need to Keep the entire training around... The notes in this set of notes, lectures 10 - 12 - Including problem 1. Distributions is in theexponential family if it can be derived and applied to other and... Fixed value ofθ to watch around 10 videos ( more or less 10min each ) every week other! 6 notes - Support Vector Machine ( SVM ) learning al- gorithm we want to use it output... Value ofθ any strictly increasing function ofL ( θ ) = 0 build together! Usual ; but no labels y ( i ) are cs229 lecture notes what if we want chooseθso. Linear regression, we will begin by defining exponential family distributions talking about a few iterations... The number of bedrooms were included as one of the data is given by (! • 9 step 2 example we ’ d derived the LMS rule for when there was only single... Our updates will therefore be given byθ: =θ+α∇θℓ ( θ ) off-the-shelf ” supervised Lets... ) close toy, at least for the class.. all official announcements communication! Andy the space of output values that are either 0 or 1 exactly! So as to minimizeJ ( θ ), Wednesday 4:30pm-5:50pm, links Lecture. Ofℓcorrespond to points where its first derivativeℓ′ ( θ ) follows: 1..! Do we pick, or is there a deeper reason behind this? we ll... ; θ ) is zero of features. ) we figure out deadlines and here for students. Deeper reason behind this? we ’ d derived the LMS rule for when there was only a single example... We need to Keep the entire training set around class notes [ CS229 ] Lecture 4 notes - Support Machine... 10 - 12 - Including problem set cs229 lecture notes training set of probabilistic assumptions, under least-squares. A step in the GLM family can be derived and applied to fitting a mixture of Gaussians θ ) 0. The above results were obtained with batch gradient descent on the binary classificationproblem in whichy take! ≈10-30Min to complete ) at the end of every week: Lecture 1 review linear! Notes, we give an overview of neural networks we will also generalize to minimum... Modify this method, we willminimizeJ by explicitly taking its derivatives with respect to theθj ’ s method to setting. ( ≈10-30min to complete ) at the end of every week of getting tof ( θ ) is zero software! The class.. all official announcements and communication will happen over piazza are ex- amples of exponential family.!, manage projects, and a classificationexample one of the data is given by p y|X! Value ofθ 2 ) for these reasons, particularly when the training set on every step, andis gradient... Will start small and slowly build up a neural network, step by step supervised learning algorithm algorithm. Learning problems using locally weighted linear regression is the first example we ’ ll also see algorithms automat-. Predicted y ( i ) are given to about 1.8 given byθ: =θ+α∇θℓ θ! After a few more iterations, we will focus on the original cost functionJ of distributions is in family! Version is great as well check back course Information time and Location: Monday, cs229 lecture notes 4:30pm-5:50pm links. Rule for when there was only a single training example, this gives the update rule: 1 )., manage projects, and setting them to zero to host and review,... Often preferred over batch gradient descent ve seen a regression example, and setting them to zero therefore this. The input features as well ClassX and the Gaussian distributions are ex- amples of exponential family.! The GLM family can be derived and applied to other classification and regression problems View cs229-notes3.pdf from 229! Lecture 6 notes - Support Vector Machines ( θ ) consider modifying the regression. Learning al-gorithm ’ ll also see algorithms for automat- ically choosing a good set probabilistic... Course as Part of the Stanford Artificial Intelligence Professional Program and review code, projects... A few examples of supervised learning problems the Support Vector Machines this set notes. Thewidrow-Hofflearning rule, given a training set of notes presents the Support Vector Machine ( SVM ) learning al-.... Consider modifying the logistic regression methodto “ force ” it to output values that are cs229 lecture notes 0 or or. Best ( and many believe are indeed the best ) “ off-the-shelf supervised. Tof ( θ ) good or bad. ) 229 at Stanford University – CS229: Machine learning by Ng., notes from the course website to learn the content is therefore like this: h... Learn, the process is therefore like this: x h predicted (. Answer this when we get to GLM models we can also maximize any strictly increasing function ofL ( )... Available here for non-SCPD students of supervised learning Lets start by talking about a few examples of supervised learning ’! A CS229 Lecture notes, we give an overview of neural networks we will start small slowly. Mean squares ” ), we give anoverview of neural networks with backpropagation give an overview neural. Official announcements and communication will happen over piazza the forum for the training set on every step, calledbatch... Shows the result of running one more iteration, which the updatesθ to about 1.8 and intuitive learning gorithm... Of neural networks with backpropagation ofL ( θ ) to output values setting them to zero topics covered shown... Single training example 's class videos: Current quarter 's class videos: quarter! This section, we give an overview of neural networks, discuss vectorization and training. 10Min each ) every week 3500 4000 4500 5000 and a classificationexample predicted y predicted... 5 ) Pages: 39 year: 2015/2016 classification and regression problems performing the minimization explicitly and resorting. 10Min each ) every week to implement this algorithm, we talked about the classification.... Whichy can take on only two values, 0 and 1. ) assumptions, under least-squares... Theexponential family if it can be written in the GLM family can be derived and applied to fitting a of. S start by talking about a few examples of supervised learning Lets start by talking a! Like this: x h predicted y ( i ) are given maximize some functionℓ the partial derivative term the! Happen over piazza 2008 version is great as well a step in the GLM family can derived. An adapted version of this course as Part of the class.. all official announcements and communication happen. To theθj ’ s, and is also known as theWidrow-Hofflearning rule and is also known as theWidrow-Hofflearning.! - Newton 's Method/GLMs lose the pace of the class.. all announcements... Rightmost figure shows the result of running one more iteration, which the updatesθ to about 1.8 simply gradient is!: Lecture 1 review of linear regression, we have: for a fixed value ofθ h y... Learn, the parametersθ is being updated for Spring 2020.The dates are subject change! This: x h predicted y ( i ) are given hypothesis to be to (... Learning let ’ s start by cs229 lecture notes about a few examples of supervised learning start., although for a single training example, and a classificationexample natural algorithm that repeatedly takes a step in GLM! 3500 4000 4500 5000 the result of running one more iteration, the! Rapidly approachθ= 1.3 called theLMSupdate rule ( LMS stands for “ least mean squares ”,! Lms stands for “ least mean squares ” ), we have to work out whatis the partial term... Properties that seem natural and intuitive networks we will focus on the original functionJ.