[I mentioned two weeks ago that I was working to dive into the practical uses of machine learning algorithms. This is the first of a series of posts where I show what I’ve been working on.]
The Pima Indians dataset is well-known among beginners to machine learning because it is a binary classification problem and has nice, clean data. The simplicity made it an attractive option. In what follows I’ll be mostly following a process outlined by Jason Brownlee on his blog.
The Pima Indian population is based near Phoenix, Arizona (USA). They have been heavily studied since 1965 on account of their high rates of diabetes. This dataset contains measurements for 768 female subjects, all aged 21 years and above. The attributes are as follows; I list them here because they weren’t explicitly stated in the version of the data that came with Weka, and I only found them after a bit of digging online:
- preg - the number of times the subject had been pregnant
- plas - the concentration of blood plasma glucose (two hours after drinking a glucose solution)
- pres - diastolic blood pressure in mmHg
- skin - triceps skin fold thickness in mm
- insu - serum insulin (two hours after drinking the glucose solution)
- mass - body mass index (weight in kg divided by the square of height in metres, i.e. weight/height**2)
- pedi - ‘diabetes pedigree function’ (a measure I didn’t quite understand in detail, but it summarises the extent to which an individual’s family history puts them at a higher-than-normal hereditary risk of diabetes)
- age - in years
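Weka reads its bundled ARFF file directly, but for anyone following along outside Weka, here is a minimal sketch of attaching these attribute names to the raw data with pandas. The two inlined rows are the first two records of the dataset; in practice you would point `read_csv` at the full 768-row file, which ships without a header row.

```python
import pandas as pd
from io import StringIO

# Attribute names as used in the Weka copy of the data; "class" is the
# diagnosis label (1 = tested positive for diabetes).
columns = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]

# First two records of the dataset, inlined here for illustration.
raw = StringIO("6,148,72,35,0,33.6,0.627,50,1\n"
               "1,85,66,29,0,26.6,0.351,31,0\n")
df = pd.read_csv(raw, header=None, names=columns)
print(df.shape)
```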
This video gives a bit of helpful context to the data and the test subjects:
https://www.youtube.com/watch?v=pN4HqWRybwk
I also came across a book by David H. DeJong called “Stealing the Gila: The Pima Agricultural Economy and Water Deprivation, 1848-1921” which describes how the diverting of water and other policies “reduced [the Pima] to cycles of poverty, their lives destroyed by greed and disrespect for the law, as well as legal decisions made for personal gain.” It looks like a really interesting read.
The Problem
The idea with this data set is to take the attributes listed above, combine them with the labelling (i.e. we know who has been diagnosed with diabetes and who hasn’t) and learn as much of the underlying pattern as we can. Can we figure out whether someone is likely to have diabetes just by taking a few of these measurements?
The promise of machine learning and other related statistical tools is that we can learn from the data that we have to make testing more useful. Perhaps we only need your height, genetic risk factor and skin thickness to make such a prediction? (Unlikely, but still, perhaps…). If we emerge from our study with a statistical model, how well does it perform? How much can we generalise from the data? What would be an acceptable error rate in the medical context? Is it 80% or is it 99.99%? The former would save millions of dollars in test costs but would throw lots of errors; the latter would be highly accurate but it might be expensive to calculate the model.
The use case here would be to identify at-risk individuals who are on the way to a diagnosis of diabetes and intervene somehow. Our motivation is clear: people don’t want to be diabetic, so how early can we catch this transition? Doing so would save governments money, expose fewer people to unnecessary tests and improve their quality of life.
I’m not a doctor, but to solve this problem manually would seem to require monitoring of blood tests (glucose and insulin levels), and perhaps looking at exercise, diet and weight. At the scale of an entire country’s population, for example, this seems like it would get expensive and/or be too much for one person to process in their head. The data isn’t too large or complex, but it still seems involved enough that you’d want to automate it to some extent.
There are some potential ethical issues around the data. Everything offered as part of the table of data is anonymised, but there are some outliers (see below) that I have to believe wouldn’t be too hard to identify. Whatever model comes from this data will likely have only limited applicability, since the data is drawn only from women. I also noticed that while the data is no longer available on the UCI Machine Learning Repository website, it still comes packaged with Weka. There was a notice on the UCI site (which I can no longer locate) stating that the permission to host the data had expired, so it is unclear to me what’s going on with the licensing there.
Data Preparation
Exploring the data using Weka’s explorer tool plus the attribute list above we can see that we have some blood test data, some non-blood body measurements and this genetic marker (presumably achieved through either blood tests or interview questions about family history). As I was working to understand the various attributes, it occurred to me that for this to be really useful, we’d want our model to work on data that wasn’t derived from blood tests; they’re expensive and they’re invasive. I didn’t get round to doing that for this round of exploration but it’d be high up on my wishlist next time I return to this data.
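I didn’t actually do this filtering, but as a sketch of where it might start: plas and insu come straight from blood tests, so a cheaper model would drop them and train on the remaining measurements (whether pedi also belongs in the expensive group depends on how it was collected). The two-row frame here is purely illustrative.

```python
import pandas as pd

# Two real records standing in for the full dataset.
df = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0]],
    columns=["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"],
)

# plas and insu require blood draws; keep only the cheap,
# non-invasive measurements.
non_blood = df.drop(columns=["plas", "insu"])
print(non_blood.columns.tolist())
```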
There are only 768 instances, so it’s still quite a small data set, especially in the context of machine learning examples. This is probably explained by the fact that it’s real medical data (so there are consent issues) plus the fact that it is several decades old and the processing power available then didn’t lend itself to processing mega-huge sets.
Thinking about what attributes might be removed to make a simpler model, I first thought that the number of pregnancies might be dispensable, but then I considered the hormonal and other changes that pregnancy brings, and I suspect it is actually quite important.
There were some outliers in the data that I identified as needing further consideration / processing before we get our model trained:
- There were some women who had been pregnant 16 or 17 times. They were on the far edge of the long tail, but I ended up leaving them in for the model rather than deleting them completely.
- There were 5 people who had 0 as their result for ‘plas’ (plasma glucose), which seems to be an error. I decided to remove these.
- There were 35 people who had 0 as their blood pressure, which seems to be an error.
- There were 227 people with 0mm skin thickness. This is possible, but I think it’s more likely that no measurement was taken, at least for a lot of them.
- There were 11 people listed with a body mass index of 0. That seems to be an error.
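Tallies like these are easy to reproduce outside Weka. Here is a sketch of counting the suspicious zeros per attribute, shown on three real records; run against the full file, it should give the counts above.

```python
import pandas as pd

# Three real records standing in for the full 768-row dataset.
df = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50],
     [1, 89, 66, 23, 94, 28.1, 0.167, 21],
     [0, 137, 40, 35, 168, 43.1, 2.288, 33]],
    columns=["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"],
)

# Zero is a legitimate value for preg, but not for these measurements.
suspect = ["plas", "pres", "skin", "insu", "mass"]
zero_counts = (df[suspect] == 0).sum()
print(zero_counts)
```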
After I’d identified these various outliers I decided to make a series of transformations to the whole set. From this I’d emerge with three broad versions of the data:
- the baseline dataset, with nothing removed or changed
- the outlier values removed and marked as missing (NaN) values
- the outliers replaced with mean averages for each particular attribute
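In pandas terms the three broad versions might look like the following sketch (I did this in Weka’s GUI, not in code; `suspect` names the attributes where a 0 really means “missing”, and the frame is three real records standing in for the full dataset).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50],
     [1, 89, 66, 23, 94, 28.1, 0.167, 21],
     [0, 137, 40, 35, 168, 43.1, 2.288, 33]],
    columns=["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"],
)
suspect = ["plas", "pres", "skin", "insu", "mass"]

baseline = df.copy()                                  # nothing removed or changed

as_nan = df.copy()                                    # zeros marked as missing
as_nan[suspect] = as_nan[suspect].replace(0, np.nan)

as_mean = as_nan.fillna(as_nan.mean())                # missing values imputed with
                                                      # each attribute's mean
```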
For each of these broad versions, moreover, I prepared three separate versions:
- all values normalised (each attribute rescaled to the range 0-1 instead of its original range, i.e. the maximum value becomes 1 and the minimum becomes 0)
- all values standardised (each attribute shifted and scaled to have a mean of 0 and a standard deviation of 1)
- all values normalised and standardised (i.e. both transformations applied)
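The two transformations themselves are simple; here is a sketch on the age values of three real records (Weka’s Normalize and Standardize filters do the equivalent per attribute, though Weka may use the population rather than the sample standard deviation).

```python
import pandas as pd

ages = pd.Series([50, 21, 33], name="age")

# Normalise: rescale to 0-1, so the minimum maps to 0 and the maximum to 1.
normalised = (ages - ages.min()) / (ages.max() - ages.min())

# Standardise: shift and scale to mean 0, standard deviation 1.
standardised = (ages - ages.mean()) / ages.std()
```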
Producing these various versions of the data was something I learned from Brownlee’s book, “Machine Learning Mastery With Weka”. It turned out to be somewhat fiddly to do in Weka. In particular, every time you want to open up a file to apply transformations the default folder it remembers is often several folders down in the folder hierarchy. By the ninth transformation (there were nine sets in total, by the end of this process) I was ready for a more functional / automated approach to these data conversions!
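A more scripted approach might look something like this: cross the three imputation variants with the three rescalings in a loop (the helper names and structure are mine, not Weka’s, writing each version out to a file is omitted, and the three-row frame stands in for the full dataset).

```python
import numpy as np
import pandas as pd

SUSPECT = ["plas", "pres", "skin", "insu", "mass"]

df = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50],
     [1, 89, 66, 23, 94, 28.1, 0.167, 21],
     [0, 137, 40, 35, 168, 43.1, 2.288, 33]],
    columns=["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"],
)

def zeros_to_nan(d):
    d = d.copy()
    d[SUSPECT] = d[SUSPECT].replace(0, np.nan)  # zeros become missing
    return d

def nan_to_mean(d):
    d = zeros_to_nan(d)
    return d.fillna(d.mean())                   # then mean-impute

def normalise(d):
    return (d - d.min()) / (d.max() - d.min())

def standardise(d):
    return (d - d.mean()) / d.std()

imputations = {"baseline": lambda d: d, "nan": zeros_to_nan, "mean": nan_to_mean}
scalings = {"norm": normalise, "std": standardise,
            "norm-std": lambda d: standardise(normalise(d))}

# One dataset per imputation x scaling combination: nine in total.
versions = {f"{i}-{s}": scale(imp(df))
            for i, imp in imputations.items()
            for s, scale in scalings.items()}
print(sorted(versions))
```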
Weka does offer some nice tools for the initial exploration of the data. Here you can see two charts that are generated in the ‘explorer’ application. First we have a series of simple bar charts visualising all the individual attributes. Then we have a plot matrix showing how all the various attributes correlate to each other (or not, as was mostly the case for this data set).