Alexandre de BrébissonFounder of lyrebird.ai and PhD student at MILA
http://adbrebs.github.io
A problem with gradient descent parametrization<p>Despite being widely used, gradient descent has a flaw: it is very sensitive to the parametrization of the model.</p>
<p>To see this, consider a parameter $w$ of a model, a neural network for example. Let’s say that a gradient descent iteration updates $w$ by $\Delta_\lambda w$, where $\lambda$ denotes the learning rate. Now suppose we re-parametrize $w$ by introducing $w_1$ and $w_2$ such that $w = w_1 + w_2$. With this parametrization, a gradient descent iteration with learning rate $\lambda$ will update $w$ by $2 \times \Delta_\lambda w$, i.e. twice as much as before!</p>
<p>Similarly, we can convince ourselves by considering the reparametrization $w = C w_1$: a gradient descent iteration will now update $w$ by $C^2 \times \Delta_\lambda w$. Newton’s method, by contrast, would discard the scaling factor $C^2$.</p>
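<p>To make this concrete, here is a tiny numerical sketch using the toy loss $f(w) = w^2$ (the loss is my choice of example; only the reparametrization matters):</p>

```python
# Toy loss f(w) = w**2, so df/dw = 2w.
def grad(w):
    return 2.0 * w

lam = 0.1               # learning rate
w = 1.0

# Direct parametrization: one gradient step on w.
delta_direct = -lam * grad(w)
print(delta_direct)     # -0.2

# Reparametrization w = w1 + w2: each of w1 and w2 receives the full
# gradient df/dw, so w moves twice as far after one step.
w1, w2 = 0.4, 0.6       # any split with w1 + w2 == w
delta_split = (-lam * grad(w1 + w2)) + (-lam * grad(w1 + w2))
print(delta_split)      # -0.4, i.e. 2 * delta_direct

# Reparametrization w = C * w1: by the chain rule df/dw1 = C * df/dw,
# and the induced change in w is C * delta_w1, hence the C**2 factor.
C = 3.0
w1 = w / C
delta_scaled = C * (-lam * C * grad(C * w1))
print(delta_scaled)     # ~ -1.8 == C**2 * delta_direct (up to float error)
```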
Tue, 20 Oct 2015 00:00:00 +0000
http://adbrebs.github.io/A-problem-with-gradient-descent-parametrization/
http://adbrebs.github.io/A-problem-with-gradient-descent-parametrization/Winning the Kaggle Taxi destination prediction<p><a href="https://adnab.me">Alex Auvolat</a>, <a href="http://ejls.fr">Étienne Simon</a> and myself recently took part in a Kaggle competition organized by the conference ECML/PKDD. The task was simple: given a partial trajectory of a taxi, we were asked to predict its destination.</p>
<p>We stopped working on the competition about two weeks before the deadline, as 20 teams were ahead of us on the public leaderboard and we thought we had no chance left. In the end, we had the pleasant surprise of finding that we had obtained first place on the private leaderboard: many teams were heavily overfitting the public test set. After the deadline, we actually trained our winning model a bit longer (until convergence…) and obtained significantly better results!</p>
<p>As expected, the approaches we tried are all based on neural networks, whereas most published competitor solutions rely heavily on hand-engineering with little to no machine learning. By comparison, our approaches involve very little pre-processing, no post-processing and no ensembling.</p>
<p>The full description of our models can be found in <a href="http://arxiv.org/abs/1508.00021">our paper</a>, which we presented in September at ECML/PKDD in Porto. Kaggle has also published a <a href="http://blog.kaggle.com/2015/07/27/taxi-trajectory-winners-interview-1st-place-team-🚕/">blog post</a> with other details.</p>
<p>The following video is an example of real-time prediction of our best model as we were in the taxi going from the airport to the conference center!</p>
<div style="text-align: center;">
<iframe width="780" height="430" src="/images/prediction_airport.webm" frameborder="0"> </iframe>
</div>
Fri, 11 Sep 2015 00:00:00 +0000
http://adbrebs.github.io/Kaggle-Taxi-Destination-Prediction/
http://adbrebs.github.io/Kaggle-Taxi-Destination-Prediction/Trekking in the Lofoten islands<p>In August, I went trekking on the superb Lofoten islands with a good friend. As it was quite difficult to plan a relatively long trekking itinerary, I am writing this blog post to share ours. The itinerary is designed for at least 5 days of trekking, and I suggest extensions to reach 9 days (we actually did 8). You should consider taking extra safety days in case bad weather slows you down.</p>
<p>You should consider our itinerary if:</p>
<ul>
<li>you want to go to the wildest parts of the Lofoten,</li>
<li>you want to walk most of the time and avoid roads,</li>
<li>you plan to carry food for at least 5 days,</li>
<li>you want to see many empty beaches with cold water,</li>
<li>you want to take your time and not hurry,</li>
<li>you don’t want any repetitions in the itinerary.</li>
</ul>
<p>These reasons rapidly convinced me to focus our trip on the Moskenesøya island at the southern end of the Lofoten archipelago. The idea is to take the bus from Moskenes and come back to Moskenes on foot.</p>
<p>Also, if you want to avoid tourists as much as possible, try to go there after the high season like us, i.e. after roughly the 15th of August, when Norwegian children start school again.</p>
<p>Sounds good? Ok let’s go!</p>
<h2 id="equipment">Equipment</h2>
<h3 id="clothes">Clothes</h3>
<p>We had good weather without a single cloud for 10 days, but this was really extraordinary. Be prepared for rain, mist and cold weather in August. You should carry a full set of waterproof clothes with you. Don’t forget your swimsuit. Bring semi-high or high hiking boots, as there are fens in several places.</p>
<h3 id="food-and-water">Food and Water</h3>
<p>We packed food for around 8-9 days, but it is actually possible to re-supply in very expensive shops after 5-6 days. You can find a lot of different berries at the end of August; check <a href="http://www.countrylovers.co.uk/wfs/wfsberries.htm">this website</a> to see which ones are edible. There are also mushrooms; I ate a few chanterelles, the only species I allowed myself to eat. Be extremely careful with mushrooms: you can find pictures of poisonous and edible mushrooms on <a href="http://www.soppognyttevekster.no/media/Kurs%20fremmedspr%C3%A5klige/Soppkurs_Fremmedspr%C3%A5klige_Heftet_Engelsk_Aug2012.pdf">this website</a>.</p>
<p>Water is very abundant on the Lofoten: there are many streams and lakes, so you should not worry too much about it. We used pills to purify it, but I really don’t think it was necessary.</p>
<h3 id="navigation">Navigation</h3>
<p>I recommend using a GPS (I used my phone with extra batteries and the Gaia GPS app), as some trails can be hard to find. Alternatively, 1:50,000 topographic maps are also available.</p>
<p>The best topographic maps of Norway can be found on <a href="http://www.norgeskart.no/">this website</a>. These maps are actually available on the Gaia GPS app for Android. We also recommend the excellent website <a href="http://rando-lofoten.net/index.php/en/">rando-Lofoten</a> with many GPS tracks of hikes. It greatly helped us to design this itinerary.</p>
<p>Most of the trails at the end of August are quite well-trodden, as many hikers have already taken them during the summer. For the same reason, they can sometimes be quite marshy.</p>
<h2 id="going-to-moskenesøya">Going to Moskenesøya</h2>
<p>We flew from Paris to Oslo and then took a SAS flight to Bodø, a small city on the mainland, facing the Lofoten islands. We arrived there in the evening. You have to buy camping gas in Bodø, as there are no easy or cheap options later in our itinerary. In Bodø you can find gas at petrol stations or in one of the many outdoor stores. To reach Moskenesøya from Bodø, you have to take the ferry to Moskenes, a small town on the Moskenesøya island. The timetables of the ferry can be found <a href="http://ruteinfo.thn.no/no/velgrute.aspx">here</a>; be sure to check the dates. We took the midnight ferry and arrived around 4 a.m. at Moskenes. This ferry does not carry many passengers, only a few adventurous backpackers. The sunrise with the Lofoten in the background is unforgettable.</p>
<h2 id="five-day-trek">Five-day trek</h2>
<!--![](http://adbrebs.github.io/images/lofoten/map_5_day.jpg)-->
<p><img src="http://adbrebs.github.io/images/lofoten/map_5_day.jpg" alt="Drawing" style="display: block; width: 500px; margin-left: auto; margin-right: auto" />
<em>Map of the 5-day itinerary.</em></p>
<h3 id="first-day-moskenes---fredvang---ryten-summit">First day: Moskenes - Fredvang - Ryten summit</h3>
<p>If you arrive at 4 a.m. like us, you might still be a bit sleepy. We set up our tent on a grass field near a campground a few minutes away from the harbor and slept there for a few hours until the first bus arrived. In the morning, as we were waiting at the bus stop, we realized that the very best sleeping option in Moskenes is the waiting room of the ferry harbor! It is heated, there are toilets and during the night there is absolutely no one. We actually slept there two nights later during the trip. The waiting room is on the right when you land, just behind the bus stop.</p>
<p><img src="http://adbrebs.github.io/images/lofoten/ferry_low.JPG" alt="" />
<em>Taking the ferry at 1am was actually quite fun.</em></p>
<p>So the first part of the day is to reach Fredvang, the starting point of the trek. Take the bus from the Moskenes Ferry stop to Fredvang X (or Napp if you want to add two days to the itinerary, as we explain later). The timetable of the bus can be found <a href="http://www.177nordland.no/index.php?ac_id=331&ac_parent=280">here</a>. Be careful: timetables differ between weekdays and the weekend. For us, the first bus was at 7 a.m. on weekdays and 9 a.m. on the weekend. The bus from Moskenes to Fredvang X takes around 40 minutes, you can pay in cash inside the bus and there is a nice student discount (I paid 40 kr). The bus passes through Reine, a scenic town on a small strip of land between steep mountains and the sea.</p>
<p>We arrived around 10 a.m. at Fredvang X. From there we crossed the two bridges to go to Fredvang. We saw a dolphin from one of the bridges. In Fredvang, we recommend going (and swimming) at the beach, which is empty and has a very nice backdrop.</p>
<p><img src="http://adbrebs.github.io/images/lofoten/fredvang_low.jpg" alt="" />
<em>View from the bridge</em></p>
<p>After having lunch on the beach, we headed towards the Ryten, a 543 m peak. At some point the path forks in two: one path heads to the Ryten, the other to the Kvalvika beach. We collected some water from a small lake at the intersection of the two paths and started climbing the Ryten. We reached the summit around 5 p.m. From there, the view of the Kvalvika beach is breathtaking and we spent a few hours wandering in the area until a colourful sunset.</p>
<p><img src="http://adbrebs.github.io/images/lofoten/kvalvika_low.jpg" alt="" />
<em>View of the Kvalvika beach from the Ryten</em></p>
<p>As the summit was very windy, we decided to set up our tent a bit before the summit, where the view of the beach is the best. Thick moss provided us with an excellent mattress.</p>
<h3 id="second-day-ryten-summit---kvalvika-beach---selfjord">Second day: Ryten summit - Kvalvika beach - Selfjord</h3>
<p>We hiked down from the summit to the beach, crossed it, passed a small lake, then a second one in which I took a quick bath, and reached a road that we followed down to Selfjord. From there, we started our journey to the Horseid beach. The ground at the beginning is a bit marshy and you’d better have high boots. There we found our first cloudberries and bilberries. At some point, we reached a lake surrounded by mountains. We were actually planning to sleep there, but the mosquitoes and black flies (the only place where we were actually bothered by bugs) convinced us to climb to the first pass towards Horseid beach. Up there: no more bugs, a perfect grass spot to set up the tent and an astonishing panoramic view. This was for sure one of the best spots I have ever camped at.</p>
<p><img src="http://adbrebs.github.io/images/lofoten/pass_low.jpg" alt="" />
<em>Climbing the pass to set up our bivouac was an excellent idea.</em></p>
<h3 id="third-day-horseid-beach---kirkefjord---vindstad---bunes-beach">Third day: Horseid beach - Kirkefjord - Vindstad - Bunes beach</h3>
<p>After a fantastic night, we crossed another pass and reached a valley heading to the secluded Horseid beach. It is the least visited beach of the Lofoten because it is the most difficult to reach. As a result, it was completely wild and we had this huge beach all to ourselves. We had lunch there and then headed to Kirkefjord to catch a boat at 3 p.m. Be sure to check <a href="http://www.reinefjorden.no/rutetabell.htm">the timetable</a> in advance, as it depends on the season. Kirkefjord is actually a nice small village, only reachable by boat. We took the boat to Vindstad and walked down to the Bunes beach. It is touristy there, and it does not take long to realize it: the inhabitants of Vindstad don’t smile at you and the beach is not wild anymore. We camped on the beach with maybe 10 other tents, which is not so terrible because the beach is really huge. For nature purists, I would actually advise skipping Bunes beach. It has been spoiled by irresponsible tourists, and if, like me, you cannot stand seeing trash in nature, you will feel angry.</p>
<h3 id="fourth-day-bunes-beach---forsfjorden---hermanndalstinden">Fourth day: Bunes beach - Forsfjorden - Hermanndalstinden</h3>
<p>The first part of the day was to go from Vindstad to Forsfjorden. It should be possible to do this leg by boat, but we decided to hike there. This was actually a bad idea, as it was very difficult (the itinerary keeps getting harder). There is a kind of trail at the end of August, but it sometimes disappears and a few parts are particularly dangerous. I would advise against hiking this portion: take the boat instead. It took us 2-3 hours to reach the hydroelectric plant of Forsfjorden, whereas it would only take 5 minutes by boat. At the plant, we climbed a steep cliff to reach a lake in the mountains. We had lunch there and I went in the water. This was probably one of the coldest baths I ever had; snow was actually still melting there. Then we climbed a hill to reach a small plateau, which is usually the starting point to climb the Hermanndalstinden, the highest peak of the island (1029 m). So we decided to set up our base camp there.</p>
<p><img src="http://adbrebs.github.io/images/lofoten/mountains_low.jpg" alt="" />
<em>View from the hill on which we set up our tent</em></p>
<p>I decided to attempt the climb but finally stopped at around 750 meters, as some portions were particularly exposed and windy, some parts had even collapsed, and I was already tired after a long day of hiking.</p>
<h3 id="fifth-day-hermanndalstinden---munkebu-hut---sørvågen---å---moskenes">Fifth day: Hermanndalstinden - Munkebu hut - Sørvågen - Å - Moskenes</h3>
<p>From our hill we hiked down to another lake and then reached the Munkebu hut. From there we headed to Sørvågen. This part is actually very popular and we came across many tourists going there for a day hike. From Sørvågen, we went to Å and then came back to Moskenes. We had in mind to take the ferry to Værøy but we actually arrived 5 minutes late and missed the ferry. The next one was 48 hours later… We decided to sleep in the waiting room of the ferry, which appears to be a great sleeping option in Moskenes!</p>
<!--
![](http://adbrebs.github.io/images/lofoten/waiting_room_low.JPG)
*We actually spent two nights in the waiting room*
-->
<h2 id="extensions">Extensions</h2>
<h3 id="værøy-2-day-extension">Værøy: 2-day extension</h3>
<p>If we had caught the ferry on time, we would have spent two days on Værøy, an island 70 minutes south of Moskenesøya. It is supposed to be quite authentic, with only a few tourists. There is an abandoned town in the south of the island, called Mostad, where we would have liked to camp. The island is also famous for its colony of puffins.</p>
<h3 id="flakstadøya-2-day-extension">Flakstadøya: 2-day extension</h3>
<p>To add 2 more days to the above itinerary, you can also start your trek not in Fredvang but in Napp, in the very east of the Flakstadøya island. For that, instead of taking the bus from Moskenes to Fredvang X, continue further and stop at Napp.</p>
<p><img src="http://adbrebs.github.io/images/lofoten/map_7_day.jpg" alt="Drawing" style="display: block; width: 500px; margin-left: auto; margin-right: auto" />
<em>7-day itinerary</em></p>
<p>From Napp, you just have to follow the coast; one short portion is actually very steep and difficult but the rest is easy. At some point, you will deviate from the coast and cross a very nice part of the island with lakes. We saw a sea eagle there. Then you will reach farmland and start heading to Nusfjord. At some point you will have to walk on the road until reaching the village. This was our longest day of hiking, with nearly 10 hours without meeting any other hiker.</p>
<p><img src="http://adbrebs.github.io/images/lofoten/nusfjord_low.jpg" alt="" />
<em>The village of Nusfjord</em></p>
<p>Nusfjord is a nice little fishing village (quite touristy now) with a small store where you can buy (expensive) food. After asking the locals, we found an ideal spot to set up our tent in front of a lake, with majestic mountains in the background. The exact location can be seen on the following satellite image:</p>
<p><img src="http://adbrebs.github.io/images/lofoten/nusfjord_camp.png" alt="" />
<em>Great camping spot in Nusfjord</em></p>
<p>The next day, we walked from Nusfjord to the village of Nesland and then walked along a small road towards the North. From there you can decide to go to Fredvang, sleep there and the next day start the 5-day itinerary explained above.</p>
Thu, 03 Sep 2015 00:00:00 +0000
http://adbrebs.github.io/Trekking-in-the-Lofoten/
http://adbrebs.github.io/Trekking-in-the-Lofoten/Gradient identities<p>In machine learning, it is common to manipulate vectors instead of scalars.
This post lists a few identities, which can be helpful to quickly compute
gradients over computational graphs.
If in doubt, you should not hesitate to derive the scalar identities
first and then generalize them to vectors.</p>
<p>Let’s define the following functions:</p>
\[\begin{align*}
h&\colon \mathbb{R} \rightarrow \mathbb{R} \\
f&\colon \mathbb{R}^n \rightarrow \mathbb{R} \\
g&\colon \mathbb{R}^n \rightarrow \mathbb{R} \\
\mathbf{F}&\colon \mathbb{R}^n \rightarrow \mathbb{R}^m \\
\mathbf{G}&\colon \mathbb{R}^n \rightarrow \mathbb{R}^m
\end{align*}\]
<p>We use the following conventions for the gradients and Jacobian matrices:</p>
\[\mathbf{F}(\vec{x})=\left[\begin{array}{c}F_1(\vec{x})\\ \vdots\\ F_m(\vec{x})\end{array}\right]\]
\[\nabla f(\vec{x})=\left[\begin{array}{c}\pder{f}{x_1}(\vec{x})\\ \vdots\\ \pder{f}{x_n}(\vec{x})\end{array}\right]\]
\[\mathbf{J}_\mathbf{F}(\vec{x})=\left[\begin{array}{ccc}
\pder{F_1}{x_1}(\vec{x}) & \dots & \pder{F_1}{x_n}(\vec{x})\\
\vdots & \ddots & \vdots\\
\pder{F_m}{x_1}(\vec{x}) & \dots & \pder{F_m}{x_n}(\vec{x})\\
\end{array}\right]\]
<h3 id="addition">Addition:</h3>
\[\nabla ( f + g ) = \nabla f + \nabla g\]
<h3 id="multiplication">Multiplication:</h3>
\[\nabla (f \, g) = g \,\nabla f + f \,\nabla g\]
<h3 id="division">Division:</h3>
\[\nabla\left(\frac{f}{g}\right) = \frac{g\nabla f - f\nabla g}{g^2}\]
<h3 id="composition">Composition:</h3>
\[\nabla(h \circ f) = (h' \circ f) \nabla f\]
\[\nabla(f \circ \mathbf{F}) = \mathbf{J}_\mathbf{F}^\mathrm{T} \, (\nabla f \circ \mathbf{F})\]
<p>proof:</p>
<p>Let’s mask \(\mathbf{F}(\vec{x})\) by a new variable \(\vec{y}\). Using the multivariate chain rule, we get:</p>
\[\pder{f \circ \mathbf{F}}{x_k}(\vec{x}) = \sum_i \pder{f}{y_i}(\vec{y}) \pder{y_i}{x_k}(\vec{x})
= [\mathbf{J}_\mathbf{F}(\vec{x})_{:,k}]^\mathrm{T} \nabla f (\vec{y})
= [\mathbf{J}_\mathbf{F}(\vec{x})_{:,k}]^\mathrm{T} \nabla f (\mathbf{F}(\vec{x})),\]
<p>where the last product is a typical matrix multiplication. Therefore, we have:</p>
\[\nabla(f \circ \mathbf{F}) = [\mathbf{J}_\mathbf{F}(\vec{x})]^\mathrm{T} \nabla f (\mathbf{F}(\vec{x})).\]
<h3 id="dot-product">Dot product:</h3>
<p>Note that \(\mathbf{F} \cdot \mathbf{G} = \mathbf{F}^\mathrm{T} \mathbf{G}\), where the second product is the matrix multiplication.</p>
\[\nabla(\mathbf{F} \cdot \mathbf{G}) = \mathbf{J}^\mathrm{T}_\mathbf{F} \, \mathbf{G} + \mathbf{J}^\mathrm{T}_\mathbf{G} \, \mathbf{F}\]
<p>proof:</p>
\[\pder{(\mathbf{F} \cdot \mathbf{G})}{x_k} = \sum_i \left[ \pder{F_i}{x_k} G_i + F_i \pder{G_i}{x_k} \right]
= [\mathbf{J}_\mathbf{F}(\vec{x})_{:,k}]^\mathrm{T} \, \mathbf{G}(\vec{x}) + [\mathbf{J}_\mathbf{G}(\vec{x})_{:,k}]^\mathrm{T} \, \mathbf{F}(\vec{x}),\]
<p>which gives the identity above once the components are stacked.</p>
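<p>As a sanity check, the dot-product rule $\nabla(\mathbf{F} \cdot \mathbf{G}) = \mathbf{J}^\mathrm{T}_\mathbf{F} \mathbf{G} + \mathbf{J}^\mathrm{T}_\mathbf{G} \mathbf{F}$ can also be verified with finite differences (the test functions below are mine):</p>

```python
import numpy as np

def F(x):                       # F: R^2 -> R^2 (arbitrary test functions)
    return np.array([x[0] ** 2, x[0] * x[1]])

def G(x):
    return np.array([np.exp(x[1]), x[0] + x[1]])

def jac_F(x):                   # entry (i, j) is dF_i/dx_j
    return np.array([[2.0 * x[0], 0.0],
                     [x[1], x[0]]])

def jac_G(x):
    return np.array([[0.0, np.exp(x[1])],
                     [1.0, 1.0]])

x = np.array([0.8, -0.5])
analytic = jac_F(x).T @ G(x) + jac_G(x).T @ F(x)

# Central finite differences of the scalar function x -> F(x) . G(x).
eps = 1e-6
dot = lambda x: F(x) @ G(x)
numeric = np.array([(dot(x + eps * e) - dot(x - eps * e)) / (2 * eps)
                    for e in np.eye(2)])

print(np.max(np.abs(analytic - numeric)))     # tiny finite-difference error
```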
Sat, 11 Apr 2015 00:00:00 +0000
http://adbrebs.github.io/Gradient-identities/
http://adbrebs.github.io/Gradient-identities/How big is seven billion people?<p>There are nearly 7 billion people on Earth. How many is that? Imagine that you
want to shake hands with every single one of them, and that each handshake takes
one second. Then you would need more than 220 years to shake all these
hands… without sleeping or eating!</p>
<p>By comparison, there are about 3 billion words on the English Wikipedia. Reading
all of them would take you only around 30 years!</p>
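<p>The arithmetic behind both estimates, assuming an average reading speed of about 200 words per minute (the reading speed is my assumption, not stated above):</p>

```python
SECONDS_PER_YEAR = 3600 * 24 * 365

# One handshake per second, non-stop:
handshake_years = 7_000_000_000 / SECONDS_PER_YEAR
print(round(handshake_years))        # 222

# 3 billion words at ~200 words per minute, read non-stop:
reading_years = 3_000_000_000 / 200 / 60 / 24 / 365
print(round(reading_years))          # 29
```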
<p>This last comparison raises an interesting point. If we consider that all
human knowledge is contained in Wikipedia, this would mean that an extremely
bright mind could potentially memorize most of contemporary human knowledge
in a single lifetime. Obviously, this only holds if the person is very bright and
can understand complex reasoning (philosophical or scientific, for example)
as fast as he/she reads trivial text…</p>
Tue, 06 Jan 2015 00:00:00 +0000
http://adbrebs.github.io/How-many-are-seven-billion-people/
http://adbrebs.github.io/How-many-are-seven-billion-people/The future of University<p>University has recently started a big transformation, triggered by the emergence of MOOCs (Massive Open Online Courses), which Coursera has popularized since 2012. I believe that these MOOCs are the right way to go and that higher education should continue its transition. This post lists a few further measures in the same direction as the MOOCs and summarizes the benefits they could yield. In my opinion, we should head towards a system with:</p>
<ul>
<li>no more physical courses at University,</li>
<li>online courses (MOOCs) given by the best Professors and accessible to everyone for free,</li>
<li>yearly world exams identical for all students in order to mark candidates on the same scale and grant them universal certifications.</li>
</ul>
<p>This may sound extreme, but let’s have a look at the potential benefits of these measures. For the sake of clarity, let me call <em>system B</em> the new system resulting from the above measures and <em>system A</em> the former traditional system.</p>
<h2 id="1-get-the-best-teachers-and-courses">1. Get the best teachers and courses</h2>
<p>In current universities, some Professors have terrible teaching skills and others excellent ones. Throughout a degree, one is likely to encounter both types. It is true that good universities probably have better Professors, which results in overall better teaching than in bad universities.</p>
<p>By asking the best Professors to create online courses with videos, anyone could learn from the best course with the best Professor. As a result, students would learn faster and better.</p>
<p>Of course, a good course is a subjective notion and may depend on a student’s background or personality. A solution would be to have not just one but several “good courses” on the same topic so that students could pick the most suitable one. Courses would be rated and reviewed to help prospective students.</p>
<h2 id="2-save-time-and-improve-research">2. Save time and improve research</h2>
<p>This new education system would likely save Professors, students and companies a lot of time, resulting in an overall increase in productivity.</p>
<h3 id="to-professors">To Professors</h3>
<p>If only one or a few teachers teach (actually, only once!), what would the others do with all their new spare time? Certainly more interesting things than repeating the same material every year.</p>
<p>Teaching is extremely time consuming for a professor. No more teaching would mean more time for research and more time to dedicate to PhD students! So this new system would undoubtedly speed up research.</p>
<p>Many professors have already stopped writing on the board and instead they prepare slides that they project every year. This already saves quite a bit of time, it is somehow a first tiny step towards system B.</p>
<h3 id="to-students">To students</h3>
<p>Having the best courses saves time, as students understand faster than they would in a bad course, by definition.</p>
<p>The possibility of stopping a video and coming back a few seconds before to watch again a misunderstood point is extremely valuable and can save a lot of time, in particular if this point is necessary to understand the rest of the lecture.</p>
<p>No more students left behind…</p>
<h3 id="to-companies">To companies</h3>
<p>Having universal exams makes hiring faster because companies can base their judgement on the reliable grades obtained by candidates.</p>
<h2 id="3-free-studies-broader-access-to-education">3. Free studies, broader access to education</h2>
<p>University fees are usually very expensive and many students have to take out long-term loans. System B, on the other hand, would cost little to nothing, and thus studies would be free for anyone! This would make people from any background more equal with respect to education.</p>
<h2 id="4-be-recognized">4. Be recognized</h2>
<p>Having the same exam for all the students would make grades reliable, comparable and meaningful.</p>
<p>These exams would probably be the only ones to cost a little bit in order to finance the organization and the correction.</p>
<p>Anyone, of any age, could take these exams and get a grade.</p>
<p>Companies could impartially judge candidates and advertise jobs with specific prerequisites.</p>
<p>Note that world exams are not incompatible with the current teaching system and reforming teaching and exams could be done separately.</p>
<h2 id="5-get-better-help">5. Get better help</h2>
<p>Questions would be posted and answered online on forums with a voting system similar to the StackExchange websites.</p>
<h2 id="6-design-your-own-cursus">6. Design your own curriculum</h2>
<p>With this new system, there is no need to worry about which program to choose: you can customize your own. No courses are imposed.</p>
<p>You can change your area of studies whenever you want.</p>
<hr />
<p>Let me list a few criticisms that may arise:</p>
<p><strong>How to select the best courses?</strong><br />
Trial and error. Courses would be rated by students, and the best ones would rapidly emerge.</p>
<p><strong>No physical social interactions?</strong><br />
Students would still have time to do extra activities with other students.</p>
<p><strong>Discipline?</strong><br />
System B targets higher education students, who are serious enough to decide what they want to do with their future. This system should not be applied to middle-school or high-school students.</p>
<p><strong>How to fund research?</strong><br />
I don’t know what share of student fees currently funds research. That may be a problem.</p>
<p><strong>How to organize this system?</strong><br />
I think system B should develop incrementally and in parallel to the current system A. At some point, system A will become obsolete.
The exams should take enough time to evaluate students well (perhaps a few days). To compare results across years, grades should be normalized to have similar distributions.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Current MOOC websites are not very far from what I have described, except that physical courses still run in parallel and that the certifications are not yet well recognized. As usual, the main difficulty is probably to reform the current system, as many people benefit from it and have no intention of letting it change. It would be great if governments could meet and speed up the transition towards system B.</p>
Mon, 15 Sep 2014 00:00:00 +0000
http://adbrebs.github.io/The-future-of-university/
http://adbrebs.github.io/The-future-of-university/KL divergence minimization and maximum likelihood<p>Let’s consider two distributions $p$ and $q$. The KL divergence of $q$ from $p$ is:</p>
\[KL(p||q) = \sum_x p(x) \ln \frac{p(x)}{q(x)}\]
<p>Let’s now consider that $q$ is a parametrized distribution $q_\theta$ and $p$ is the empirical distribution $p_D$ over a dataset $D = \left\lbrace x_1, … x_n \right\rbrace$, i.e. $\forall x \in D, p_D(x) = \frac{1}{n}$ and \(p_D(x) = 0\) otherwise. The KL divergence can then be re-written as:</p>
\[KL(p_D||q_\theta) = \sum_{i=1}^n \frac{1}{n} \ln \frac{\frac{1}{n}}{q_\theta(x_i)}.\]
<p>Minimizing this KL divergence with respect to $\theta$ is then equivalent to minimizing the following quantity:</p>
\[- \sum_{i=1}^n \ln q_\theta(x_i),\]
<p>which is the negative log likelihood of the data $D$. In other words, we have:</p>
\[argmin_\theta KL(p_D||q_\theta) = \theta_{MLE}.\]
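<p>A quick numerical illustration with a Bernoulli model (a toy example of my own, not from the derivation above): minimizing the KL divergence over a grid of $\theta$ recovers the closed-form maximum-likelihood estimate, the empirical mean of the data.</p>

```python
import numpy as np

# Dataset of coin flips; p_D puts mass 1/n on each datapoint.
data = np.array([1, 1, 0, 1, 0, 1, 1, 0])
n = len(data)

def q(theta, x):                       # Bernoulli likelihood q_theta(x)
    return theta ** x * (1 - theta) ** (1 - x)

def kl(theta):                         # KL(p_D || q_theta) as written above
    return np.mean(np.log((1.0 / n) / q(theta, data)))

grid = np.linspace(0.001, 0.999, 999)
theta_kl = grid[np.argmin([kl(t) for t in grid])]
theta_mle = data.mean()                # Bernoulli MLE in closed form

print(theta_kl, theta_mle)             # both ~0.625
```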
Mon, 15 Sep 2014 00:00:00 +0000
http://adbrebs.github.io/KL-divergence-maximum-likelihood/
http://adbrebs.github.io/KL-divergence-maximum-likelihood/Dropout and model averaging<p>A few results that are important to know about dropout.</p>
<p><strong>Difference between dropout and bagging:</strong></p>
<ul>
<li>Dropout is an approximation of model averaging for deep non-linear networks,</li>
<li>Bagging usually uses the arithmetic mean, whereas dropout corresponds to the geometric mean,</li>
<li>Only a single datapoint is used to train each sub-model during dropout (unless the model is very small or trained long enough that the same dropout mask is sampled several times),</li>
<li>In dropout, all the models share the same parameters.</li>
</ul>
<p><strong>Theoretical results:</strong></p>
<ul>
<li>Linear network: dropout is equivalent to arithmetic averaging,</li>
<li>Network with one hidden layer of logistic neurons and a linear output: dropout is equivalent to geometric-averaging,</li>
<li>Deep non-linear network: dropout is an approximation of geometric averaging.</li>
</ul>
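<p>These results can be illustrated on the smallest possible case, a single logistic unit with dropout applied to its inputs (a toy setup of my own): the normalized geometric mean of the outputs of all $2^n$ sub-networks coincides exactly with the usual weight-scaling rule $\sigma(\tfrac{1}{2} \mathbf{w} \cdot \mathbf{x} + b)$.</p>

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 6                                       # small enough to enumerate all masks
w, b = rng.normal(size=n), 0.3
x = rng.normal(size=n)

masks = np.array(list(product([0, 1], repeat=n)))   # all 2^n dropout masks
z = masks @ (w * x) + b                             # pre-activation per mask
p = sigmoid(z)                                      # output of each sub-network

# Normalized geometric mean over the sub-networks.
geo_on = np.exp(np.mean(np.log(p)))
geo_off = np.exp(np.mean(np.log(1.0 - p)))
normalized_geo = geo_on / (geo_on + geo_off)

print(normalized_geo)
print(sigmoid(0.5 * (w @ x) + b))           # identical up to float error
```

<p>The identity follows from $\sigma(z)/(1-\sigma(z)) = e^z$: the ratio of the two geometric means reduces to $e^{-\bar{z}}$, where $\bar{z}$ is the average pre-activation over all masks, i.e. $\tfrac{1}{2} \mathbf{w} \cdot \mathbf{x} + b$.</p>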
<p>Proofs and detailed results can be found in
<a href="http://papers.nips.cc/paper/4878-understanding-dropout.pdf">this paper</a>.</p>
<p><strong>Experimental results:</strong></p>
<ul>
<li>Dropout performs as well as geometric averaging,</li>
<li>Arithmetic average performs as well as geometric average,</li>
<li>Dropout performs better than averaging untied networks,</li>
<li>Averaging untied networks performs better than baseline sgd.</li>
</ul>
<p>More experiments and results can be found in
<a href="http://arxiv.org/pdf/1312.6197.pdf">this paper</a>.</p>
Thu, 11 Sep 2014 00:00:00 +0000
http://adbrebs.github.io/Dropout-model-averaging/
http://adbrebs.github.io/Dropout-model-averaging/Backpropagation simply explained<p style="display:none">
\(
\newcommand\nn{\mathit{net}}
\)
</p>
<p>During my studies, I attended a few classes in which the backpropagation algorithm was explained. Unfortunately, it was never very clear: the notation and vocabulary were messy and confusing. As a result, many students ended up believing it is a complicated algorithm. In fact, it is pretty simple, which is all the more surprising when you know that it now fuels so many real-world applications. This article aims to explain backpropagation as simply as possible, with the minimum background required to understand it.</p>
<p>Let’s consider a multi-layer perceptron modelled by a function \(\nn_{\vec{\theta}}\), where \(\vec{\theta}\) denotes all the parameters of the network (weights and biases). Training this network consists in learning from a training dataset a set of parameters \(\vec{\theta}\) such that the resulting network has the desired behaviour. The training dataset is composed of pairs \(\left\lbrace (\vec{x}^{(i)}, \vec{y}^{(i)})~\vert 1 \le i \le n \right\rbrace\), where for each \(i\), \(\vec{y}^{(i)}\) is the known desired output of input \(\vec{x}^{(i)}\). The performance of the network is evaluated with an error/cost function defined as</p>
\[E: \vec{\theta} \mapsto \frac{1}{n} \sum_{i=1}^n L(\nn_{\vec{\theta}}(\vec{x}^{(i)}), \vec{y}^{(i)}),\]
<p>where \(L\) is the loss function such that \(L(\nn_{\vec{\theta}}(\vec{x}^{(i)}), \vec{y}^{(i)})\) measures the discrepancy between the desired output $\vec{y}^{(i)}$ and the actual output \(\nn_{\vec{\theta}}(\vec{x}^{(i)})\) computed by the neural network \(\nn_{\vec{\theta}}\). Common loss functions are the square error (for regression) or the negative log-likelihood (for classification). The training is then cast into the minimisation of \(E\), which is usually carried out by a variant of the <strong>gradient descent algorithm</strong>. Let’s consider the three layers displayed in the following figure.</p>
<p class="centeredImage"><img src="/images/backprop.png" style="width: 300px;" /></p>
<p>The output of neuron \(j\) of layer \(l\) is given by</p>
\[h_j^l = \varphi(z_j^l) = \varphi \left( \sum_i w_{ij}^l h_i^{l-1} + b_j^l \right),\]
<p>where \(\varphi\) is the activation function of the neurons.</p>
<p>The vanilla gradient descent algorithm specifies that the update rule of the weight \(w_{ij}^{l}\) connecting the neuron \(i\) of layer \(l-1\) and the neuron \(j\) of layer \(l\) is given by</p>
\[\Delta w_{ij}^{l} = -\alpha \pder{E}{w_{ij}^l},\]
<p>where \(\alpha\) is a scalar parameter called the learning rate. So far we have only spoken of gradient descent; it is only now that we introduce the term <strong>backpropagation</strong> of the gradient, which is simply an algorithm to compute the gradient \(\pder{E}{w_{ij}^l}\). It is based on the chain rule, in its <a href="http://en.wikipedia.org/wiki/Chain_rule">univariate</a> and <a href="https://www.math.hmc.edu/calculus/tutorials/multichainrule/">multivariate</a> versions.</p>
\[\begin{align*}
\Delta w_{ij}^l &= -\alpha \pder{E}{w_{ij}^l} \\
& = - \alpha \pder{E}{z_{j}^l} \pder{z_{j}^l}{w_{ij}^l} && \text{univariate chain rule}\\
& = - \alpha h^{l-1}_i \pder{E}{z_{j}^l} && \text{as } z_j^l = \sum_i w_{ij}^l h_i^{l-1} + b_j^l \\
& = - \alpha h^{l-1}_i \delta_{j}^l,
\end{align*}\]
<p>where \(\delta_{j}^l = \pder{E}{z_{j}^l}\) is known as the error of neuron $j$ of layer \(l\) and is computed by applying the multivariate chain rule:</p>
\[\begin{align*}
\delta_{j}^l = \pder{E}{z_{j}^l} &= \sum_{k} \pder{E}{z_{k}^{l+1}} \pder{z_{k}^{l+1}}{z_{j}^l} && \text{multivariate chain rule}\\
& = \sum_{k} \delta_k^{l+1} \pder{z_{k}^{l+1}}{h_{j}^{l}} \pder{h_{j}^{l}}{z_{j}^l} && \text{chain rule}\\
& = \sum_{k} \delta_k^{l+1} w_{jk}^{l+1} \varphi'(z_{j}^l) && \text{as }z_k^{l+1} = \sum_j w_{jk}^{l+1} h_j^{l} + b_k^{l+1} \text{ and } h_j^l = \varphi(z_j^l)\\
& = \varphi'(z_{j}^l) \sum_{k} \delta_k^{l+1} w_{jk}^{l+1}
\end{align*}\]
<p>Therefore, by computing \(\vec{\delta}^{l+1}\) at layer \(l+1\), we can compute \(\vec{\delta}^{l}\) at the previous layer $l$. Starting from the output layer, the process is repeated until the input layer is reached. This iterative procedure to compute the gradient of $E$ with respect to all the weights is known as the backpropagation algorithm. Since \(E\) is non-convex (except when there is no hidden layer), gradient descent will likely fall into a local minimum. The important question is to find a suitable local minimum, where the resulting network generalizes well to new examples.</p>
<p>Note that gradient descent is an optimisation algorithm to minimize a differentiable function, while backpropagation is the procedure to compute the gradient of the error with respect to the weights of a neural network. However, by abuse of language, it is common to refer to the whole process (gradient descent + backpropagation) simply as backpropagation.</p>
<h3 id="matrix-form">Matrix form</h3>
<p>When implementing backpropagation, it is important to write the computations in a matrix form so that efficient matrix multiplication algorithms can be used. \(\vec{h}^{l}\) and \(\vec{\delta}^l\) are row vectors, \(\vec{W}^{l}\) is a matrix. We then have:</p>
\[\Delta \vec{W}^{l} = -\alpha \left(\vec{h}^{l-1}\right)^T \vec{\delta}^l,\]
<p>with</p>
\[\vec{h}^l = \varphi \left( \vec{h}^{l-1} \vec{W}^l + \vec{b}^l \right),\]
<p>and</p>
\[\vec{\delta}^l = \varphi'(\vec{z}^l) \odot \left( \vec{\delta}^{l+1} \left(\vec{W}^{l+1}\right)^\mathrm{T} \right).\]
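<p>The matrix-form equations above can be sketched for a network with one hidden layer and checked against finite differences (the sizes, data and squared-error loss below are my choices, not prescribed by the post):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                 # one input, as a row vector
y = rng.normal(size=(1, 2))                 # desired output
W1, b1 = rng.normal(size=(4, 3)), np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 2)), np.zeros((1, 2))

phi = np.tanh                               # activation function
dphi = lambda z: 1.0 - np.tanh(z) ** 2

# Forward pass.
z1 = x @ W1 + b1
h1 = phi(z1)
out = h1 @ W2 + b2                          # linear output layer

# Backward pass, with L = 0.5 * ||out - y||^2.
delta2 = out - y                            # dE/dz at the linear output
delta1 = dphi(z1) * (delta2 @ W2.T)         # delta propagated one layer back
grad_W1 = x.T @ delta1
grad_W2 = h1.T @ delta2

# Finite-difference check on a single entry of W1.
def loss_of(W1_):
    return 0.5 * np.sum((phi(x @ W1_ + b1) @ W2 + b2 - y) ** 2)

eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
num = (loss_of(W1p) - loss_of(W1m)) / (2 * eps)
print(abs(grad_W1[0, 0] - num))             # tiny finite-difference error
```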
Fri, 21 Mar 2014 00:00:00 +0000
http://adbrebs.github.io/Backpropagation-simply-explained/
http://adbrebs.github.io/Backpropagation-simply-explained/