Chris X Edwards

I never expected to see this phrase: "Ferrari SUV". Wait... Lamborghini too? Well, why not? They'll mostly get driven in video games anyway.
2018-01-18 11:37
Lots of people smoking weed down at the beach this morning. No more than normal. Let's hope the same holds for drivers.
2018-01-17 09:31
Heard some of a speech by that idiot, Ronald Reagan. Today he sounds sane, decent, and even thoughtful and intelligent. By comparison.
2018-01-10 11:53
Who is worried about advanced AI achieving human level cognition and then idly watching TV all day? Because apparently that's what happens.
2018-01-03 13:53
I've always thought that the Game of Thrones theme music sounds like Ghost Riders In The Sky. Seeming much less like a coincidence.
2018-01-03 09:06

GPU Machine Learning And Ferrari Battle

2018-01-17 00:47

It used to be that if you were an exclusive Linux user (guilty!) gaming was pretty much not something you did. There just were, relatively, very few games for Linux. However, that list has been growing extremely quickly in recent years thanks to Valve’s SteamOS which is really a euphemism for "Linux".

With this in mind, some time ago (a couple of years?) I purchased an ok graphics card for my son’s gaming computer. Now I’m pretty thrifty about such things and I basically wanted the cheapest hardware I could get that would work and that would reasonably play normal games normally. As a builder of custom workstations for molecular physicists, I’ve had a lot of experience with Nvidia and hardware accelerated graphics. But it turns out that rendering thousands of spherical atoms in the most complex molecules is pretty trivial compared to modern games. So much so that for the workstations I build, I like to use this silent fanless card (GeForce 8400) which is less than $40 at the moment. Works fine for many applications and lasts forever. Here’s an example of the crazy pentameric symmetry found in an HPV capsid taken from my 3 monitors, reduced from 3600x1920, driven by this humble $40 card.


But for games, it doesn’t even come close to being sufficient.

How do you choose a modern graphics card? I have to confess, I have no idea. I only recently learned that Nvidia cards had a rough scheme to how their model numbers work despite seeming completely random to me.

Eventually I purchased an Nvidia GeForce GTX 760. I thought it worked fine. Recently, my son somehow had managed to acquire a new graphics card. A better graphics card. This was the Nvidia GeForce GTX 1050 Ti. Obviously it’s better because that model number is bigger, right? My son believed it was better but we really knew very little about the bewildering (intentionally?) quagmire of gaming hardware marketing.

Take for example this benchmark.


Sadly they don’t show the GTX 1050, but based on the 1060 and 1070, you’d expect this card to be way better, right?

But then check out this benchmark which does include both. It’s better but not such a slam dunk. (Ours is the Ti version, whatever that means.)


People often come to me with breathless hype for some marketing angle they’ve been pitched for computer performance and I always caution that the only way you can be sure it will have the hoped for value is if you benchmark it on your own application. You can’t blindly trust generic benchmarks which at best might coincidentally be unlike your requirements and at worst be completely gamed. Since I had these cards and I was curious to find out what the difference between GPUs really looked like, I did some tests.

Before we return to the point of the exercise, playing awesome games awesomely, let’s take a little hardcore nerd detour into another aspect of gaming graphics cards: the zygote of our AI overlords. Yes, all that scary stuff you hear about super-intelligent AI burying you in paperclips is getting real credibility because of the miracles of machine learning that have been, strangely, enabled by the parallel linear algebra awesomeness of gaming graphics hardware.

Last year I did a lot of work with machine learning and one thing that I learned was that GPUs make the whole process go a lot faster. I was curious how valuable each of these cards was in that context. I dug out an old project I had worked on for classifying German traffic signs (which is totally a thing). I first wanted to run my classifier on a CPU to get a sense of how valuable the graphics card (i.e. the GPU) was in general.

Here is the CPU based run using a 4 core (8 with hyperthreading) 2.93GHz Intel® Core™ i7 CPU 870.

Loaded - ./xedtrainset/syncombotrain.p Training Set:   69598 samples
Loaded - ./xedtrainset/synvalid.p Training Set:   4410 samples
Loaded - ./xedtrainset/syntest.p Training Set:   12630 samples

2018-01-16 19:02:54.260119: W tensorflow/core/platform/]
The TensorFlow library wasn't compiled to use SSE4.1 instructions, but
these are available on your machine and could speed up CPU
2018-01-16 19:02:54.260143: W tensorflow/core/platform/]
The TensorFlow library wasn't compiled to use SSE4.2 instructions, but
these are available on your machine and could speed up CPU

EPOCH 1 ... Validation Accuracy= 0.927
EPOCH 2 ... Validation Accuracy= 0.951
EPOCH 3 ... Validation Accuracy= 0.973
EPOCH 4 ... Validation Accuracy= 0.968
EPOCH 5 ... Validation Accuracy= 0.958
EPOCH 6 ... Validation Accuracy= 0.980
Model saved

Test Accuracy= 0.978

real    4m42.903s
user    17m31.120s
sys     2m28.476s

So just under 5 minutes to run. I could see that all the cores were churning away and the GPU wasn’t being used. You can see some (irritating) warnings from TensorFlow (the machine learning library); apparently I have foolishly failed to compile support for some of the CPU tricks that could be used. Maybe some more performance could be squeezed out of this setup but compiling TensorFlow from source code doesn’t quite make the list of things I’ll do simply to amuse myself.

Oh, and my software can indeed identify which German traffic sign it’s looking at 98% of the time which is pretty decent.

Next I installed the version of TensorFlow that uses the GPU.

conda install -n testenv tensorflow-gpu

Now I was running it on the card that Linux reports as: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1).

Loaded - ./xedtrainset/syncombotrain.p Training Set:   69598 samples
Loaded - ./xedtrainset/synvalid.p Training Set:   4410 samples
Loaded - ./xedtrainset/syntest.p Training Set:   12630 samples

2018-01-16 20:12:40.294673: I tensorflow/core/common_runtime/gpu/]
Found device 0 with properties:
name: GeForce GTX 1050 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.392
pciBusID 0000:01:00.0
Total memory: 3.94GiB
Free memory: 3.76GiB
2018-01-16 20:12:40.294699: I tensorflow/core/common_runtime/gpu/] DMA: 0
2018-01-16 20:12:40.294713: I tensorflow/core/common_runtime/gpu/] 0:   Y
2018-01-16 20:12:40.294726: I tensorflow/core/common_runtime/gpu/]
Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0)

EPOCH 1 ... Validation Accuracy= 0.920
EPOCH 2 ... Validation Accuracy= 0.937
EPOCH 3 ... Validation Accuracy= 0.975
EPOCH 4 ... Validation Accuracy= 0.983
EPOCH 5 ... Validation Accuracy= 0.971
EPOCH 6 ... Validation Accuracy= 0.983
Model saved

2018-01-16 20:13:18.767520: I tensorflow/core/common_runtime/gpu/]
Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0)
Test Accuracy= 0.984

real    1m7.441s
user    1m5.344s
sys     0m5.452s

You can see that it found and used the GPU. This took less than a quarter of the time that the CPU needed! Clearly GPUs make training neural networks go much faster. What about how it compares to the other card?

One caveat is that I didn’t feel like swapping the cards again, so I ran this on a different computer. This time on a six core AMD FX(tm)-6300. But this shouldn’t really matter much, right? The processing is in the card. That card identifies as: NVIDIA Corporation GK104 [GeForce GTX 760] (rev a1). Here’s what that looked like.

Loaded - ./xedtrainset/syncombotrain.p Training Set:   69598 samples
Loaded - ./xedtrainset/synvalid.p Training Set:   4410 samples
Loaded - ./xedtrainset/syntest.p Training Set:   12630 samples

2018-01-16 20:13:57.953655: I tensorflow/core/common_runtime/gpu/]
Found device 0 with properties:
name: GeForce GTX 760
major: 3 minor: 0 memoryClockRate (GHz) 1.0715
pciBusID 0000:01:00.0
Total memory: 1.95GiB
Free memory: 1.88GiB
2018-01-16 20:13:57.953694: I tensorflow/core/common_runtime/gpu/] DMA: 0
2018-01-16 20:13:57.953703: I tensorflow/core/common_runtime/gpu/] 0:   Y
2018-01-16 20:13:57.953715: I tensorflow/core/common_runtime/gpu/]
Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 760, pci bus id: 0000:01:00.0)

EPOCH 1 ... Validation Accuracy= 0.935
EPOCH 2 ... Validation Accuracy= 0.953
EPOCH 3 ... Validation Accuracy= 0.956
EPOCH 4 ... Validation Accuracy= 0.976
EPOCH 5 ... Validation Accuracy= 0.971
EPOCH 6 ... Validation Accuracy= 0.979
Model saved

2018-01-16 20:14:43.861117: I tensorflow/core/common_runtime/gpu/]
Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 760, pci bus id: 0000:01:00.0)
Test Accuracy= 0.977

real    1m0.685s
user    1m10.636s
sys     0m7.164s
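To put the three "real" (wall-clock) times on a common footing, here is a small sketch that converts the `time` output to seconds and works out each run's speedup relative to the CPU run. The helper name `to_seconds` and the dictionary labels are made up for illustration; only the timing strings come from the runs above. Keep in mind that the GTX 760 ran on a different host CPU, so its ratio is only a rough comparison.

```python
import re

def to_seconds(t):
    """Convert a `time`-style string such as '4m42.903s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", t)
    return int(m.group(1)) * 60 + float(m.group(2))

# Wall-clock ("real") times copied from the three runs above.
runs = {
    "CPU i7 870": "4m42.903s",
    "GTX 1050 Ti": "1m7.441s",
    "GTX 760": "1m0.685s",
}

cpu = to_seconds(runs["CPU i7 870"])
speedups = {name: cpu / to_seconds(t) for name, t in runs.items()}
for name, s in speedups.items():
    print(f"{name:12s} {to_seconds(runs[name]):7.1f} s  {s:4.2f}x vs CPU")
```

Both GPU runs come out a little over four times faster than the CPU run, and the two cards land within roughly 10% of each other.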

As you can see, this is pretty close. I certainly wouldn’t want to spend a bunch of extra money on one of these cards over another for machine learning purposes. So that was interesting but what about where it really matters? What about game performance?

This is really tricky to quantify. Some people may have different thresholds of perception about some graphical effects. Frame rate is an important consideration in many cases, but I’m going to assume that 30 frames per second is sufficient since I’m not worrying about VR (which apparently requires 90fps). My goal was to create the setup most likely to highlight any differences in quality. I created two videos, one using each card on the same computer, and then spliced the left side of one to the right side of the other.

This video is pretty cool. In theory, it is best appreciated at 1920x1080 (full screen it maybe). Locally, it looks really good but who knows what YouTube has done to it. Even the compositing in Blender could have mutated something. Even the original encoding process on my standalone HDMI pass-through capture box could have distorted things. (This standalone capture box does produce some annoying intermittent artifacts like the left of the screen at 0:15 and the right at 0:21 — this is the capture box and has nothing to do with the cards.) And of course if you’re using Linux and Firefox you probably can’t see this in high quality anyway (ahem, thanks YouTube).

So that’s video cards for you. What may look like hardware models with an obvious difference may not really have much of a difference. Or they might. In practice, you need to check them to really be sure. If you noticed any clear difference in the two video sources, let me know, because I didn’t see it. Frame rates for both were locked solidly at 30fps.

Speaking of incredibly small differences, how about those two laps around the Monaco Grand Prix circuit? I drove those separately (in heavy rain with manual shifting) and the driving is so consistent that they almost splice together. I’ve enjoyed playing F1 2015. This is the first time Linux people could play this franchise. The physics are as amazing as the graphics. What is completely lame, however, are the AI opponents (too annoying to include in my video). Wow they are stupid! Computer controlled cars… a very hard problem.

My Next Car

2018-01-16 19:51


Some random guy on the internet offered to send people who wrote to him this bumpersticker. And I did and he did and thanks Ben!

Blender The Beast

2017-12-08 13:00

The first computer program I ever saw run was a 3d graphical virtual reality simulation which was as immersive as any I’ve ever experienced. What is really astonishing is that this took place in 1979 and the program was loaded into less than 48 kilobytes of RAM from a cassette tape. Yes, a cassette tape.

That program, called FS1, was written by a genius visionary named Bruce Artwick. Very soon after my dad and I saw that demonstration, we were among the first families to have a computer in our home. Of course we loved Flight Simulator, as it is better known. But there’s some even more obscure ancient history hiding in there.

Not long after that, Artwick’s company Sublogic released a program called A23D1. You can find an ancient reference to it in the March 1980 edition of Byte magazine. It simply says, "A23D1 animation package for the Apple II ($45 on cassette, $55 for disk)." That is all I can find to remind myself that I wasn’t just dreaming it.

Although Flight Simulator was jaw droppingly spectacular, I almost think that A23D1 was even more historically premature. It was nothing less than a general purpose 3d modeling program and rendering engine. Remember, this was for the 8-bit 6502 processor with 48kB of RAM.

Of course we’re not talking about Pixar level of polish, but in 1980 seeing any 3d computer graphics was nearly a religious experience. I think it would be hard to understand the impact today. It was like looking through a knothole in the fence between our reality and the magical land of the fantastic. Remember at this time the only 3d graphics anybody had ever seen were on Luke’s targeting computer and, as dorky as those graphics look today, at the time we walked out of the theaters no less stunned than if we’d just returned from an actual visit to a galaxy far, far away.

I remember my dad getting out the graph paper and straining his way through the severe A23D1 manual until, many hexadecimal conversions later, he had created a little sailboat in a reality that had not existed before in our lives. To see a window to another universe in our house, tantalizingly under our control, was mind-blowing. These were the first rays of light in the dawn of virtual reality.

I think A23D1 overreached a bit. It was not truly time for 3d. I spent my high school years absorbed by the miraculous new 2d "paint" programs. When I landed my first gig as an engineering intern for a metrology robot company, they had a copy of AutoCAD. I don’t know exactly why because nobody used it or even knew how. I was drawn to it immediately. There was no mouse (yes, the AutoCAD of 1988 had a keyboard-only mode which was pretty commonly used) and the monitor was monochrome. I started systematically building expertise. I eventually learned how to model things in 3d and how to write software in AutoLisp (apparently a direct contemporary of EmacsLisp).

AutoCAD formed the basis of a pretty good engineering career for me. The problem was that I was pushing the limits of what AutoCAD was designed for. I constantly struggled with the fact that (1990s) AutoCAD’s 3d features were roughly bolted on to an earlier 2d product. The expense of AutoCAD was tolerable for a business but not for me personally. As AutoCAD moved away from any kind of cross-platform support, the thought of using it on a stupid OS filled me with dread. As a result of the dark curse of proprietary formats I found myself cut off from a large body of my own intellectual work.

That’s the background story that helps explain why I thought it might be best if I recreated AutoCAD myself from scratch. I was kind of hoping the free software world would handily beat me to it, but no, my reasons are still as good as ever to press on with my own incomplete geometric modeler.

But it is incomplete. And that has been a real impediment for someone like me who is so experienced with 3d modeling. A few years ago, I was making some videos and having trouble finding free software that was stable enough to do the job. I eventually was directed to Blender and I was impressed. I have done a lot of video editing now with Blender (email me for a link to my YouTube empire if you’re interested) and it has never let me down. Blender has a very quirky interface (to me) but it is not stupid nor designed for stupid people. After getting a feel for it I started to realize that this was a serious tool for serious people. I believe it is one of the greatest works of free software ever written.

My backlog of 3d modeling projects has grown so large that I decided to try to get skilled at using Blender at the end of this year. I have envisioned a lot of engineering projects that just need something more heavy duty than what my modeling system is currently ready for. I also think that my system can be quite complementary to something like Blender.

The problem with Blender for me is that it is a first class tool for artists. But for engineering geometry, I find it to be more of a challenge. My system on the other hand is by its fundamental design the opposite. One of the things that would always frustrate me with bad AutoCAD users (which is almost all of the ones I ever encountered, and if you’re an exception, you’ll know exactly what I mean) is that they often would make things look just fine. This is maddening because looking right is not the same thing as being right. Blender specializes in making things look great. Which is fine but when I start a project I usually have a long list of hard numerical constraints that make looks irrelevant. I’m not saying Blender is incapable; the fact that there’s a Python console mode suggests that all serious things are more than possible with Blender.

But I get a bit dispirited when I go looking for documentation for such things and turn up nothing. Even for relatively simple things this is all too common.


Since I’ve just had such a great experience with on-line education I thought maybe there was some such way to learn Blender thoroughly. And there is! I’ve been going through this very comprehensive course from Udemy. I’m about half way through it and it basically provides a structured way to go through most of the important functionality of Blender while getting good explanations and plenty of practice.

Here’s an example of a stylish low-poly chess set I created.


Not that exciting but a good project to get solid practice with.

With AutoCAD I remember writing all my own software to animate architectural walk-throughs and machine articulation simulations. Obviously Blender comes with all of that refined for a professional level of modern 3d animation craftsmanship. Here’s a quick little animation I did which was not so quick to create, but very educational.


Rendering this tiny thing I learned that Blender is the ultimate CPU and GPU punisher. Simultaneously! If you want to melt your overclocked gaming rig, I recommend Blender.

The reason I think it’s wise and safe to invest so heavily in Blender is that this rug will never be pulled out from under me. I can’t afford AutoCAD so that door is slammed in my face. Blender, on the other hand, is free software released under the GNU GPL. I even have access to the source code if there’s something I don’t like. No excuses.

I hope I can integrate it with the more engineering oriented geometry tools I have written. I am confident that I can use it to start design work on my own autonomous vehicles and to generate assets for vehicle simulations in game engines.

Blender is a fun program. It is heroically cross-platform. You can just download it from the project’s website. If you can’t get inspired by the awesome artwork people have created, you’re probably pretty dull. While there is a lot to it, the rewards are commensurate. If you have ever used A23D1, Blender is well within your capabilities. The same is true if you have ever run a virtual fashion empire designing and selling virtual skirts to virtual people. In fact, if that describes you, I would highly recommend you pay the $10 for this Udemy course and get to it!

Patently Ridiculous

2017-12-06 13:33

Years ago I tried to talk some sense about what I feel are overblown fears of scary AI enslaving humanity. In that, I pointed out The Economist pointing out that we’ve been here before. They mention that "government bureaucracies, markets and armies" have supernatural power over ordinary humans and must be handled with care. A new article expands on that theme nicely; the short version is entirely captured by the title, "AI Has Already Taken Over, It’s Called the Corporation".

In my aforementioned post I proposed my own idea that AI wouldn’t be much of a concern because if it was truly intelligent, it wouldn’t care about humans one bit. Sort of like we don’t go around worrying about diatoms even though they’re pretty awesome and vastly outnumber us.

If escaping from the scary menace of SkyNet AI involves, essentially, obscurity, maybe the same is true with the ominous spectre of corporations. For example, 35 U.S.C. §271(a) says pretty clearly that, "…whoever … makes, [or] uses … any patented invention… infringes the patent."

Let’s say I’m pursuing a research agenda to accelerate autonomous car technology. If I work for a big company, patents provide a guide to what must be treated as forbidden. If I avoid such entanglements and work by myself, patents can be stolen as expedient with complete disregard to the law. Probably. So I got that goin for me, which is nice.

And with all that in mind, let’s turn now to autonomous car news of the weird. This article in Wired talks about some random engineer dude with an interest in autonomous car company lawsuits. Sort of like me but, apparently, with a bit more disposable cash. If you’ll recall, I wrote about the extremely bizarre testimony in the Waymo v. Uber lawsuit here and here.

This random engineer guy, Eric Swildens, was watching the circus too and he started to get the feeling that the whole case Waymo was presenting was kind of weak for no other reason than the putative infringed upon patent was kind of stupid. Sure enough, he does some minor digging and finds out that there’s prior art and yadayada Waymo’s case is embarrassing and Uber’s defense oversight maybe more so. If any of that sounds interesting, do check out the whole article which is surreal.

But here’s the thing… In those depositions, Waymo seemed pretty pissed off at their man switching teams and taking some tech (and enough bonus pay to start a cult). I thought the technology involved was trade secret stuff. There was all this talk about what was checked out of the version control and who had what hard drive where, etc. But am I to understand that all of this was really about a specific patent which can be accessed by anybody with a web browser (made easy by Google no less)? Something doesn’t make sense.

Whatever. Thanks to the magic of the Streisand effect, I am cheerfully reading through all Waymo’s patents.

Part II - An Example Simplified Until Comprehensible

2017-11-24 09:47

In my last post about machine learning "neural" networks I tried to frame a very rough way to think about that topic. This isn’t because my physical analogy is technically exactly what is going on with machine learning but because it is close enough that it will hopefully help make things clearer when the details are studied in more depth. Well, clearer than neurophysiology!

In this post I will try to simplify and explore some of the math involved in the actual optimization (learning) strategy used in normal neural network approaches. The goal here is to do this with a minimal illustrative example. This means that I’m going to snip away almost all of the complexity of a real neural network system so that some intuition about some core ideas can be a little clearer than when they are later awash in a flood of data and complexity in a real practical system. Although this example is just a "simple" optimization problem, I think it conveys some of the important themes found in machine learning neural network techniques and is helpful for getting acclimated to its important concepts.

Recall from the last article, I proposed a thought experiment featuring a big jumble of hardware arranged in layers with a bunch of adjustment screws. In that example, there was a huge question left unanswered — how much exactly are the adjustment screws adjusted? Since the actual classifier (dog or muffin) is just a complex but essentially similar case to the log example I presented, I’ll focus on the simple log example. In that example, I imagined driving some screws into the logs to do the adjusting. Screws are really just helical wedges so let’s think about that problem visualizing wedges.


Recall that the goal is to adjust all the wedges until the actual value is where you want it. This value that the system actually produces for a given input is often marked as a Y with a hat on it. People even say "why-hat". Plain Y sans chapeau is used to designate what the target should be and thus what we are aiming for. In machine learning the plain Y is often the "label" part of a labeled training set. We want to adjust the system (weights) so that it at least hits these known targets pretty well before trying it on data we don’t have the correct answers for.

Looking over the diagram with the wedges, it’s almost simple enough now to actually do explicit geometric calculations. But I’m lazy so I need to simplify this yet more. We could imagine a big simplification by removing every other wedge and replacing it with a hinge.


Now we’re down to just 3 knobs to adjust. I’ve made them omegas because that seems like a traditional angle sort of measurement and they still look like "w" which will remind us that these angle settings are now the "weights" in the system.

This is definitely doable, but I am even lazier than that. If we simplify this system even further we get something like this.


What’s cool about this format is that although it is structurally very similar to my previous conceptual model, it seems to have taken on a different form. This problem could be a robot arm with 3 servo motors. How would you set the servo motors to put the robot’s gripper on the target? In case you feel we’ve wandered too far away from machine learning, consider that this problem is just an optimization problem and so is machine learning. This highly stripped down version allows us to study it without tons of other complex considerations required by the scale of machine learning’s typical complexity. In other words, machine learning is basically solving problems like this; it’s just usually doing thousands at a time in parallel to be properly useful. We can just focus on one strand of the network that serendipitously has a different practical application.

This particular problem format is in fact an important problem on its own. It is called inverse kinematics and is critical to many fields from robotics to molecular physics. Now that I’ve evolved my tower of logs example into a simpler inverse kinematics problem, how can we solve it using the rough ideas also used at the heart of machine learning?

First let’s consider how we would figure out the structure’s current position given certain settings. If you recall very basic trigonometry and we assume that each segment of the linkage is one unit long, the positions of the joints are very easy to calculate. The lateral position is just the sine of that joint’s angle. We can keep an account of these as we go, each joint’s position added to the previous. Here is some simple code that takes a starting position where the base is located (Y0) and angle settings for each of three joints (w1,w2,w3), and returns the lateral position at each end point (Y1,Y2, and the end, Y3).

from math import sin,cos,radians # This example involves trigonometry.

def calculate_pose(Y0,w1,w2,w3): # Base position and linkage angles.
    Y1= Y0 + sin(w1)             # First arm's end position.
    Y2= Y1 + sin(w2)             # Second arm's end position.
    Y3= Y2 + sin(w3)             # End of entire 3 bar linkage.
    return Y1,Y2,Y3              # Output lateral positions of linkage.

Pretty simple, right? This is the forward pass. We take the system and see how it is with no meddling. Seeing what you’ve got and how the system works out is the first step before messing with things to try and improve the system.
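As a quick usage sketch, here is the same forward pass called with some concrete numbers. The angles (30, -45 and 10 degrees) and the base position of 0 are arbitrary values picked for illustration. Note that `math.sin` works in radians, which is presumably why `radians` is imported above.

```python
from math import sin, radians

def calculate_pose(Y0, w1, w2, w3):
    """Same forward pass as above: lateral position of each joint."""
    Y1 = Y0 + sin(w1)
    Y2 = Y1 + sin(w2)
    Y3 = Y2 + sin(w3)
    return Y1, Y2, Y3

# Hypothetical joint settings, chosen only for illustration.
# math.sin() expects radians, so convert from degrees first.
angles = [radians(a) for a in (30, -45, 10)]
Y1, Y2, Y3 = calculate_pose(0.0, *angles)
print(f"Y1={Y1:.3f}  Y2={Y2:.3f}  Y3={Y3:.3f}")
```

With these settings the end of the arm lands just slightly to one side of the base (Y3 is about -0.03), so the pose is close to the target but not on it.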

I’m trying to show a radically simplified example here so that the core ideas used in machine learning are less likely to be lost in the bustle of all of the other things necessary for useful deep learning neural networks in practice (a large network, more complex and less visual functions, a lot of data to apply statistics to, framework conventions, etc). So don’t fixate too much on the deficiencies. In most neural network lessons, you will start with a different kind of gross simplification. I feel having two different simple perspectives is helpful.

Once we know how well the system is working, i.e. how far Y3 is from being the same as Y0, we want to adjust the system (weights) so if we try again, we can hopefully do better. The huge difference between neural network techniques and the way humans usually solve these kinds of hard problems is that humans don’t explicitly calculate algorithmic guesses for how to adjust each of the weights. For a computer to attack such problems, this is exactly what must be done.

Since we have 3 weights (the joint angles) that can be adjusted which affect the desired goal, we need to figure out optimal amounts to tweak each of these angles. One might wonder why we can’t just solve for the final answer. In some simple cases maybe that’s possible, but even in this one there are many (infinite) settings of the weights that will line up the end of the arm with the base. Perhaps with more constraints you could just solve it but in practice the complexity will make that notion prohibitive. We just want to converge effectively on something that works with a simple algorithm because in neural networks we’ll be applying it a gazillion times.
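To make the "many settings" point concrete, here is a small sketch with three quite different poses that all put the end of the arm exactly in line with the base. The poses are hand-picked for illustration (the last one relies on the identity sin 20° + sin 40° = sin 80°), and `end_position` is just a hypothetical condensed form of the forward pass.

```python
from math import sin, radians

def end_position(Y0, w1, w2, w3):
    """Lateral position of the end of the unit-length three-bar linkage."""
    return Y0 + sin(w1) + sin(w2) + sin(w3)

# Three hypothetical poses (in degrees). In each one the sines cancel,
# so the end of the arm sits directly in line with the base.
poses = [(0, 0, 0), (30, -30, 0), (20, 40, -80)]

results = [end_position(0.0, *(radians(a) for a in pose)) for pose in poses]
for pose, Y3 in zip(poses, results):
    print(pose, "on target:", abs(Y3) < 1e-9)
```

Since there is no single right answer to solve for, an iterative method only has to find one of the many settings that work.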

The main gist of how this works is we consider in turn how each weight affects the overall error. In other words, if I turn w1, how does the Y3 end position change? Or similarly but more importantly how does the distance to the target change? I’ll call that distance E for error and unlike the position, Y3, it will always be positive. We ask the same about w2 and w3. For people who can remember calculus, these values are the derivatives of E (the error) with respect to each weight. If I turn w1 quickly, does the error E change slowly or quickly? Does it go up or down? That’s what we’re looking for. Math people write this quantity as a "dE" over "dw1" like a fraction (maybe even using Greek deltas). As a programmer I’ll write it like dE_dw1.
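One way to get a feel for these derivatives without any calculus is to literally nudge each angle a tiny bit and watch the error. The sketch below does that, assuming the half-squared-distance form of E that the code further on uses. The starting angles, the nudge size `h` and the helper names are all arbitrary choices for illustration.

```python
from math import sin, radians

def error(Y0, w1, w2, w3):
    """Half the squared distance between the arm's end (Y3) and the base (Y0)."""
    Y3 = Y0 + sin(w1) + sin(w2) + sin(w3)
    return 0.5 * (Y0 - Y3) ** 2

def nudge_slope(Y0, w, i, h=1e-6):
    """Estimate dE/dw[i] by nudging weight i up and down by a tiny h."""
    up = list(w)
    up[i] += h
    down = list(w)
    down[i] -= h
    return (error(Y0, *up) - error(Y0, *down)) / (2 * h)

w = [radians(a) for a in (30, -45, 10)]  # hypothetical starting angles
slopes = [nudge_slope(0.0, w, i) for i in range(3)]
for i, s in enumerate(slopes, start=1):
    print(f"dE_dw{i} ~ {s:+.4f}")
```

For this pose all three slopes are negative, meaning that increasing any of the angles lowers the error, and the third joint has the largest effect per unit of turn.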

Machine learning often involves very elaborate networks of calculations, even if each one is about as simple as those I’ve contrived here. It is generally necessary to calculate the change in error, dE, with respect to an intermediate thing changing and then calculate how that intermediate thing changes with respect to your important weight adjustment. There can be many layers of this. This is what back propagation really is.

With all that explained let’s continue with the program and see how we can figure out how to adjust the weights to lower the error.

def update_weights(Y0,w1,w2,w3):
    Y1,Y2,Y3= calculate_pose(Y0,w1,w2,w3)

Here’s a new function and the first thing to do is figure out where we’re at with the weights as they are. You could think of this first step as the forward pass or forward propagation.

    E= .5*(Y0-Y3)**2 # Magnitude of error.
    dE_dY3= -(Y0-Y3) # Change in error as Y3 changes (just Y3 for Y0=0).

This next bit looks ugly but is really not too bad. The E is the error we want to minimize. We’re trying to make Y3 line up with the base at Y0, so their difference needs to be close to zero. The first line just calculates half the squared error (with many outputs this would become a sum of squared errors, SSE). Squaring prevents large negative errors from seeming better (smaller) than small positive errors.

Next the derivative dE_dY3 is calculated. This is the change in error E with respect to the change in Y3 (the position of the end of the linkage). Obviously this is a very simplistic thing to worry about but it is illustrative of the bulk of the work that is done in real neural networks at deeper layers. This also shows why it’s often traditional to multiply by 1/2 when calculating E (because the derivative of .5*x*x simplifies to just x).
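Written out, the derivative in that line of code is a single application of the power rule followed by the chain rule, where the inner term Y0 - Y3 has a derivative of -1 with respect to Y3. The subscripted symbols below are just the math spelling of the variables Y0 and Y3.

```latex
E = \tfrac{1}{2}\,(Y_0 - Y_3)^2
\qquad\Longrightarrow\qquad
\frac{dE}{dY_3} = \tfrac{1}{2}\cdot 2\,(Y_0 - Y_3)\cdot(-1) = -(Y_0 - Y_3)
```

With the base at Y0 = 0 this reduces to just Y3, and the factor of 1/2 is what cancels the 2 from the exponent.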

One thing I do remember from my many misspent years studying calculus is that the derivative of the sine function is, interestingly, the cosine function. This means that the rate of change in each arm’s position is related to the joint angle’s rate of change by cosine. That gives us this.

    dY3_dw1= cos(w1) # Rate of change of Y3 as w1 is adjusted.
    dY3_dw2= cos(w2) # Rate of change of Y3 as w2 is adjusted.
    dY3_dw3= cos(w3) # Rate of change of Y3 as w3 is adjusted.

But this isn’t exactly what we’re after. We need to link the adjustment of the joint with the final error and currently we have joint angle to position, and position to error. To chain these two steps together, we use a trick of calculus called the chain rule. When I learned the chain rule long ago, I was confident that it could be safely forgotten. But no! It’s actually quite useful and really at the heart of allowing neural network machine learning to be possible. If you want to brush up on your calculus, look carefully at the chain rule.

If getting your head around how exactly the chain rule works and why it is important seems hard, thankfully, just deploying it is refreshingly easy. Here it is in action.

    dE_dw1= dE_dY3 * dY3_dw1 # Chain rule.
    dE_dw2= dE_dY3 * dY3_dw2
    dE_dw3= dE_dY3 * dY3_dw3

Again, that’s a super simple example by design for educational purposes. In practice this will get ugly enough that you will definitely want a computer to keep track of things but conceptually, this is all there is to it.
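
To hint at how a computer can keep track of things, here is a minimal sketch of the same chain rule written as a loop over any number of joints. The function name gradients and the list of weights are my own inventions for illustration, and it assumes the same unit-length segments at absolute angles as the rest of this example.

```python
from math import sin, cos, radians

def gradients(Y0, weights):
    # Forward pass, assuming unit-length segments at absolute angles.
    Y_end = Y0 + sum(sin(w) for w in weights)
    dE_dY = -(Y0 - Y_end)                     # Same for every joint.
    return [dE_dY * cos(w) for w in weights]  # Chain rule, once per joint.

grads = gradients(0, [radians(22), radians(-20), radians(14)])
print(grads)  # One dE_dw value per joint.
```

Adding a fourth or fortieth joint only means a longer list. Real networks have more tangled dependencies than this, but the bookkeeping idea is the same.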

After that step, we know how the error, E, is linked to each weight. Now comes the part where we actually adjust the weights. This introduces something called the "learning rate". Imagine I’m leveling my log tower by turning screws. I may feel like a full turn of screw J will bring down the error twice as much as a full turn of screw K. That’s super helpful (and basically what we have with dE_dw1, etc) but that still leaves an important practical question: how much should I actually turn those screws? I could turn K one turn and J two turns. Or I could turn K half a turn and J one turn. Or K 6 turns and J 12. We know which screws most effectively solve our problem relatively speaking but we don’t know how much of that solution to apply. The answer to this question is specified by the "learning rate". This is often shown with a Greek letter eta (though other conventions are annoyingly common).

In neural network training, this is a hyperparameter which must be selected by the designer. You can imagine that 100 turns with K and 200 with J might overshoot your goals while 0.1 degrees of K turning and 0.2 degrees of J might not accomplish enough to be useful in a reasonable number of adjustment iterations. You just have to choose based on intuition and make revisions if it is not improving at a sensible pace.

Now the weights can be corrected using the original weights and the learning rate and the connection factor between the error and this weight. This is known as the delta rule though memorizing that fact doesn’t seem critical.

    eta= .075           # Learning rate. Chosen by trial and error.
    w1= w1 - eta*dE_dw1 # Delta rule.
    w2= w2 - eta*dE_dw2
    w3= w3 - eta*dE_dw3
    return w1,w2,w3     # New improved weights ready for another try!

And that is basically it. Now we just need to do this operation a decent number of times. Each round of disassembling, adjusting, and reassembling the metaphorical tower is called an "epoch".
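
To see why the learning rate matters, here is a rough sketch that runs the same 20 epochs with three different values of eta. The helper name train and the two extra rates (.001 and 1.5) are just picked by me for illustration, and the forward pass assumes unit-length segments at absolute angles.

```python
from math import sin, cos, radians

def train(eta, epochs=20, Y0=0):
    # Assumes unit-length segments at absolute angles, as in this example.
    weights = [radians(22), radians(-20), radians(14)]
    for _ in range(epochs):
        Y3 = Y0 + sum(sin(w) for w in weights)
        dE_dY3 = -(Y0 - Y3)
        weights = [w - eta * dE_dY3 * cos(w) for w in weights]  # Delta rule.
    return Y0 + sum(sin(w) for w in weights)  # Final end position.

results = {eta: train(eta) for eta in (.001, .075, 1.5)}
for eta, Y3 in results.items():
    print(eta, Y3)
```

The tiny rate barely moves the arm in 20 epochs, while the middle one gets close to zero. A rate as large as the last one may overshoot and bounce around rather than settle.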

Another surprisingly important technicality is choosing where the system starts from. This example is so simplified that if all the joints are set to zero, no further work is needed! But in real neural networks, the opposite is often true. By setting all the weights to zero initially, you often have a terrible time training it. It is common that performance is greatly enhanced with starting weights set randomly. Often subtle changes in this can have a huge impact on overall learning success. For example, maybe drawing them from a Gaussian distribution versus a plain uniform one. But in our little example, I’ll just pick some nice looking arbitrary starting angles.
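
For the curious, here is roughly what those two kinds of random starting weights could look like using Python's standard random module. The spread of 20 degrees and the range of plus or minus 45 degrees are arbitrary assumptions of mine, not recommendations.

```python
import random
from math import radians

random.seed(1)  # Fixed seed only so the sketch is repeatable.

# Gaussian: mostly small angles, occasionally a larger one.
gaussian_start = [random.gauss(0, radians(20)) for _ in range(3)]
# Uniform: every angle in the range is equally likely.
uniform_start = [random.uniform(radians(-45), radians(45)) for _ in range(3)]
print(gaussian_start)
print(uniform_start)
```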

Here then is the main program that actually iterates towards a solution.

Y0= 0                                          # Initial input.
w1,w2,w3= radians(22),radians(-20),radians(14) # Initial arbitrary weights.
print(calculate_pose(Y0,w1,w2,w3))             # Show initial pose.
for epoch in range(20):                        # Iterate through epochs.
    w1,w2,w3= update_weights(Y0,w1,w2,w3)      # Keep improving weights.
print(calculate_pose(Y0,w1,w2,w3))             # Show final pose.

When I run this I get the following output.

(0.374606593415912, 0.0325864500902433, 0.274508345689911)
(0.2852955778665301, -0.14244640527322044, 0.0029191436234602963)

These are the lateral displacements of the end of each arm segment. Since the overall objective was to get the end of my robot arm to line up with the base (which was zero), we were hoping that the final number would come down close to zero and it did!

I ran this with a bunch more different starting weights so we can see how and how well the algorithm finds the desired solution. These diagrams show the starting pose as a red line and the final solution pose in green. This one shows an arm with the first joint at 10 degrees, the second set to 15, and the third set to 10 (these angles are all with respect to absolute horizontal, not the previous segment).



As you can see the initial pose quickly converges on the correct pose. The learning rate will influence how jumpy the transition is. The number of epochs controls how persistent it is and how many intermediate poses are attempted before returning a best guess final answer.
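
If you want to watch that convergence as numbers instead of pictures, a sketch like this records the end position after every epoch. It assumes unit-length segments at absolute angles and uses the 10, 15, 10 degree starting pose. The history list is just a name I made up.

```python
from math import sin, cos, radians

Y0 = 0
weights = [radians(10), radians(15), radians(10)]  # The pose described above.
eta = .075
history = [Y0 + sum(sin(w) for w in weights)]      # End position per epoch.
for epoch in range(20):
    dE_dY3 = -(Y0 - history[-1])
    weights = [w - eta * dE_dY3 * cos(w) for w in weights]
    history.append(Y0 + sum(sin(w) for w in weights))
for epoch, Y3 in enumerate(history):
    print(epoch, round(Y3, 4))
```

Each line should be closer to zero than the one before it. Changing eta changes how big the jumps between lines are.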



This one shows that even when the end of the arm starts below the base (a negative displacement), this strategy still brings it back to a horizontal zero.

Here are some diverse examples showing that it can pretty reliably and sensibly find a solution.







These next two show that the algorithm isn’t perfect. By prioritizing adjustments based on the derivatives, you can see that this cosine strategy penalizes valid improvements where the angles are close to 90 degrees. When the angle is close to 90 degrees, the cosine (derivative of the position’s function, sine) is close to zero so not much gets improved at that location even though it could theoretically be doing more to help.
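
Here is a small sketch of that weakness. It starts all three joints at the same angle and compares a comfortable 45 degrees with 89.99 degrees, which is almost straight up. The helper name final_position and those two angles are my own choices, and the arm is assumed to have unit-length segments.

```python
from math import sin, cos, radians

def final_position(start_degrees, eta=.075, epochs=20, Y0=0):
    # Assumes unit-length segments, all starting at the same absolute angle.
    weights = [radians(start_degrees)] * 3
    for _ in range(epochs):
        Y3 = Y0 + sum(sin(w) for w in weights)
        dE_dY3 = -(Y0 - Y3)
        weights = [w - eta * dE_dY3 * cos(w) for w in weights]
    return Y0 + sum(sin(w) for w in weights)

easy = final_position(45)      # Plenty of cosine to work with.
stuck = final_position(89.99)  # Cosine is nearly zero here.
print(easy, stuck)
```

The arm that starts near vertical has the biggest error of all, yet it hardly moves in 20 epochs because every adjustment gets multiplied by a cosine that is almost zero.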





This one seems even worse even if it did manage to find a solution.



This next one did struggle to find a satisfactory solution in the number of epochs I allowed.



For this next one, I changed the Y0 value to be 0.7 which merely shifts the whole thing up.

(Figure: starting angles of -75, -55, 0 degrees with Y0=.7)


We could easily set up this system so that the target (where Y3 should end up) and the input (Y0) could be different and this would allow us to move a robot arm to arbitrary elevations. Traditionally the input to the system (not the weights which are the system) is called X but in a graphical geometric example, that is a bit confusing.
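
A minimal sketch of that idea might look like the following, where the error is measured to a separate target instead of back to Y0. The function name update_toward and the target elevation of .5 are hypothetical, and the forward pass assumes unit-length segments at absolute angles.

```python
from math import sin, cos, radians

def update_toward(target, Y0, weights, eta=.075):
    Y3 = Y0 + sum(sin(w) for w in weights)  # Assumed unit-length segments.
    dE_dY3 = -(target - Y3)                 # Error is now measured to target.
    return [w - eta * dE_dY3 * cos(w) for w in weights]

Y0, target = 0, .5                          # Hypothetical elevation to reach.
weights = [radians(22), radians(-20), radians(14)]
for epoch in range(20):
    weights = update_toward(target, Y0, weights)
reached = Y0 + sum(sin(w) for w in weights)
print(reached)  # Should land near the target.
```

Of course a target more than three segment lengths away from Y0 can never be reached no matter how the weights are set.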

The big leap from this simple example to proper machine learning is to systems where the input vector X (Y0 here) can be novel previously unseen circumstances and, because the weights are set (trained) so cleverly, the output reflects some useful insight. For example, you could imagine putting the input (Y0) at the number of legs a creature has and training the system with a lot of examples until the system’s weights can position the end of the arm below zero for mammals and above zero for insects. We know from general experience that the math is pretty simple there (4 or fewer legs, probably not an insect) but that is something the system can start to figure out on its own if you keep giving it known examples (number of legs and correct invertebrate status). The functions of how the joint angles are set by the weights (purely geometric and in the simplest way possible in my example) may need to be upgraded to allow more complexity and quirky outcomes but that’s exactly what you’ll find in proper neural network architectures.

Machine learning involves going through lots of examples just like this one and finding the best ways to adjust the weights so that the entire collection of these training examples produces results as close to what you want as possible. Then, and this is the entire point, you can give it a new input and its best guess about it will hopefully be pretty useful.
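
As a toy version of that, here is a sketch that trains the same three weights against several examples at once and then tries an input it has never seen. The training examples are completely made up for illustration (each wants the arm to end about .5 above its base), the arm is assumed to have unit-length segments, and averaging over the examples is just one simple way of combining them.

```python
from math import sin, cos, radians

# Made-up training examples: (input Y0, desired end position).
examples = [(0.0, 0.5), (1.0, 1.4), (2.0, 2.6), (-1.0, -0.5)]

def predict(Y0, weights):
    return Y0 + sum(sin(w) for w in weights)  # Assumed unit-length segments.

weights = [radians(22), radians(-20), radians(14)]
eta = .075
for epoch in range(50):
    # Average dE_dY3 over every example before touching the weights.
    dE_dY3 = sum(predict(Y0, weights) - want for Y0, want in examples) / len(examples)
    weights = [w - eta * dE_dY3 * cos(w) for w in weights]

guess = predict(3.0, weights)  # An input that was not in the training set.
print(guess)
```

No single example is matched perfectly, but the weights settle on the offset that suits the whole collection best, and the guess for the new input follows the same pattern.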


Chris X Edwards © 1999-2017