Tuesday, February 7, 2017

Libre Clinical Study and discussion

The blog has slowed to a crawl, I apologize. The reasons behind my relative silence are

  1. Max has reached the tender age of 16. That means teen issues and behaviors have become more common, impacting his control and our mood. I believe every T1D or T1D caregiver can relate to that situation, so I will leave it at that. Our latest HbA1c, a week ago, was still 5.5%, but I believe this will be one of the last times we’ll see values below 6%. I will try not to despair, as there definitely are trade-offs one must accept if a kid is to have a semi-normal adolescence.
  2. We are going through an extensive remodeling of our environment and that takes time.
  3. rant alert
    Finally, as much as I hate to write this, I have lost interest in most open-source, community-driven projects. I need to qualify that statement a bit before I get a lot of flak. As far as making data accessible everywhere and anywhere, I am still extremely grateful to the community as a whole, and especially to the core members of the Nightscout project, who made that data conveniently and cheaply available. The open-source, or semi-open-source, community is great at developing features that actual T1Ds and T1D caregivers need or want. What really, deeply annoys me, however, is how little attention is paid to the delivery of accurate results. Adding a new display device, check. Adding new minor features or screens, check. Accuracy, not so much. Assuming one wants to deliver accurate results from raw data, there is a bit more to it than jumping from one single-point calibration to another, or calculating an arbitrarily constrained slope. Occasionally, two open or semi-open source solutions are compared: they show a 50 mg/dL difference, sometimes absurdly amplified by the lever effect of a bad slope, devices are rebooted and restarted, and the community moves on. That is not to say that I would, or privately do, better in a way that is applicable to a general population; it is precisely because I am aware of the potential issues that I decided not to inflict my experiments on innocent bystanders. On top of that, in the Libre world, the “semi-open source” approach, consisting of an incomplete GitHub source dump that often omits all the computation parts, irritates me. Don’t think for a minute that those effectively closed-source solutions are hiding some miraculous sauce: they aren’t. The reason for the omission is often that they simply want to hide how they turn a very nice sensor like the Libre into something that behaves and performs like a second-generation Medtronic sensor…
    end rant alert

The study

Let’s now have a look at the study, recently published in the British Medical Journal, “An alternative sensor-based method for glucose monitoring in children and young people with diabetes”, which you can download here.

The work was sponsored by Abbott: they were involved in the planning, the funding and the provision of the devices used in the study. Apart from the possible cherry-picking of the sensors used in the study and a slight cherry-picking of the competitor studies cited, I did not spot any obvious red flag. The population studied was a set of 4-17 year old children and teens that, according to the additional data (for example a 7.6% mean HbA1c), seems to be a bit better controlled than the average population, since 75% of that normal population does not meet the 7.5% target. Such a small bias may have had some impact on the study (more on this below), but it is probably because the authors of the study deliver better-than-average care.

The conclusions of the study were, in short: a MARD vs SMBG (capillary) of 13.9% in that population (vs 11.4% in a previous adult study) and 99.4% of points in the A+B zones of the CEG. That is in line with the reported accuracy of the Dexcom G4 505 in some studies, although Dexcom likes to focus exclusively on its best study.

The general conclusion was that the device could be trusted, was well accepted and, usual scientific caveat, could be beneficial long term. Well, there is nothing groundbreaking here, we all knew that, didn’t we? The benefit of that study is to be found elsewhere: respected researchers and clinicians, an honest accounting of cutaneous adverse effects (unlike in some previous studies), a protocol that does not smell of manipulation – all of which will drive acceptance and add arguments for funding and full coverage.

Some personal comments

We do consistently get better accuracy than what the study reported on average. This is probably attributable to the fact that our “bad” weeks were 80% in range and our “good” weeks were 90% in range, while the population studied only stayed 50% in range. Incidentally, as a non-T1D, when I ran sensors we had purchased in France on myself, I stayed at an 8% MARD for 12 days before the sensor started to drift. Variability, and the more frequent and usually rapid range changes it implies, definitely affects CGM accuracy numbers.

The “acceptance” part of the study is very positive for Abbott. Again, we all know that. In fact, despite the overwhelming satisfaction expressed by the participants in the study, I believe the benefits to be understated. I always come back to our tennis experience on that issue: being able to play a full tennis tournament on a single daily SMBG check (as opposed to 10 to 15 checks per match) was just amazing. This was due both to the general accuracy of the device and to its delay, which was, in our carefully documented experience, 9 minutes shorter than the Dexcom G4’s. For us, the Libre wasn’t merely a well-accepted replacement, it changed our experience of T1D for the better.

On the delay side, the authors of the paper note “no delay”. This is really where I want to nitpick a bit. There definitely is a delay (quite visible in raw data at stable temperature). It is simply partially compensated and partially obfuscated by the behavior of Abbott’s algorithm.

It is extremely visible in chart B of the paper:

As you can see, the sensor is – on average; note this is MRD, not MARD – essentially perfect in stable or near-stable conditions. The most significant relative differences occur in dynamic conditions, and in the same direction.

In other words, when you are falling quickly, the Libre trails the fall and reads higher (probably missing some hypos), almost as a non-delay-compensated CGM would do. When you are rising quickly, the Libre leads and overshoots the rise (overestimating some hypers).
This is a behavior we noticed immediately (see here, here and here for some of our 2014 reports) and have consistently observed since.
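The MRD vs MARD distinction matters here: signed relative errors of opposite directions cancel in MRD but not in MARD, so a sensor can look essentially perfect "on average" while being systematically late on falls and early on rises. A minimal illustration on made-up paired readings (not data from the paper):

```python
def mrd(sensor, reference):
    """Mean (signed) relative difference: errors in opposite directions cancel."""
    return sum((s - r) / r for s, r in zip(sensor, reference)) / len(sensor)

def mard(sensor, reference):
    """Mean absolute relative difference: the usual CGM accuracy metric."""
    return sum(abs(s - r) / r for s, r in zip(sensor, reference)) / len(sensor)

# Hypothetical paired readings (mg/dL): the sensor reads 10% high on the rise
# and 10% low on the fall -- the signed errors cancel exactly.
reference = [100, 150, 200, 150, 100]
sensor    = [100, 165, 200, 135, 100]

print(f"MRD  = {mrd(sensor, reference):+.1%}")   # -> +0.0%
print(f"MARD = {mard(sensor, reference):.1%}")   # -> 4.0%
```

This is exactly why the paper's chart B can show a near-zero mean relative difference in stable conditions while hiding direction-dependent errors in dynamic ones.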

I believe, just as I believed in 2014, that this is mostly the result of Abbott’s delay-compensation algorithm. It is not necessarily a failure of the algorithm (although looking at the raw data, it appears it could be improved) but possibly a conscious decision by Abbott, either based on a technical issue such as a possibly lower signal-to-noise ratio in low ranges, or based on physiological issues they have identified in the BG-to-IG dynamics on falls.

I am of course quite happy, and a bit proud, to have identified the issue in 2014, while remaining aware that our test population was n=2.

One last point on the delay issue: the authors noted that the granularity of their time measurement was 5 minutes. Timing issues are really critical as far as delay computations are concerned, which is why, when we tested SMBGs vs the Libre, we always used immediate spot checks (because that is what matters to the patient) and I had to programmatically resynchronize the clocks at each check (both the Libre and our BG meter had drifting internal clocks). I used the same constant resynchronization technique for the Libre vs Dexcom comparison in order to maximize accuracy. Ballpark figures give a 15-minute delay on the Dexcom G4; with a 9-minute advantage for the Libre, you end up with a six-minute average delay for the Libre vs SMBG (confirmed by our Libre vs SMBG tests in slow rises and slow drops), which would be hard to demonstrate with a 5-minute granularity, especially if the comparison is not against spot checks but against values inferred from the 15-minute averages.
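For what it is worth, once the clocks are synchronized, a delay estimate of this kind boils down to shifting one trace against the other and keeping the lag that minimizes the disagreement. A sketch of that idea on a synthetic signal (the sine-wave "glucose" trace and all names are mine, not our actual tooling):

```python
import math

def mard(a, b):
    """Mean absolute relative difference between two aligned traces."""
    return sum(abs(x - y) / y for x, y in zip(a, b)) / len(a)

def estimate_delay(sensor, reference, max_shift=20):
    """Return the shift (in samples; here 1 sample = 1 minute) that best
    aligns the sensor trace with the reference trace."""
    best_shift, best_err = 0, float("inf")
    for shift in range(max_shift + 1):
        # A sensor delayed by `shift` min: compare sensor[t] with reference[t - shift]
        err = mard(sensor[shift:], reference[:len(reference) - shift])
        if err < best_err:
            best_shift, best_err = shift, err
    return best_shift

minutes = range(300)
reference = [120 + 40 * math.sin(t / 30) for t in minutes]
sensor = [120 + 40 * math.sin((t - 9) / 30) for t in minutes]  # 9 min late

print(estimate_delay(sensor, reference))  # -> 9
```

With real data the minimum is much flatter, which is exactly why a 5-minute measurement granularity makes a six-minute delay hard to demonstrate.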

Last comment: in absolute terms, you should keep in mind that the MARD given in that paper is most probably Libre CGM vs Libre BGM (or another Abbott BGM) and might be a bit biased, as the same fundamental decisions have obviously driven the design of both devices. I do like that bias myself, as the use of different BG meters would have muddied the algorithmic issue even further and would probably have required a set of Bland-Altman plots to debias/detrend the data.

Apologies if I sound obsessed by speed issues, but as far as we are concerned, that was and probably remains (until a full CGM is available one way or another) the defining advantage of the Libre versus the Dexcom G4 or Dexcom G4 505.


  1. Hi Pierre!

    First of all, great blog, I appreciate all the work you put into it. Given that I am a T1D who has used several Libre-related open or semi-open projects, and being an IT person by profession and an engineer by trade, I felt the desire to comment on your observations and maybe also agree a little with your rant... :)

    Anyway, first on the Libre behaviour described by yourself and also observed by myself and other folks anecdotally: the Libre reader's prediction algorithm seems generally good during stable BG within an intermediate range (e.g. 60 to 160, roughly), even though it seems to overstate high values. But why have continuous metering at all if the BG is stable anyway? So, for me, a CGM/FGM's behaviour under variable BG is actually more important and somewhat of a litmus test for its usability. Having cross-checked a few prediction algorithms based on Libre raw data (Abbott's reader, Liapp, LibreAlarm), I am somewhat mystified by Abbott's own algorithmic choices and would really love to know their design decisions, as on average they seem poorer than some of the hacked examples... At least for me personally, a linear regression of the raw data extrapolated by 15 minutes (as LibreAlarm uses, and as can be verified in its source code) seems to do a much better job at predicting both moderately rising and falling values. These predictions are still not aggressive enough for rapidly falling values, such as those I experience during sudden, massive, unaccounted-for exercise... needless to say, the Libre reader totally fails at predicting this in any way close to accurately.
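The prediction style described here – an ordinary least-squares line fitted to the recent raw history and extrapolated 15 minutes ahead – can be sketched as follows (variable names and the sample history are illustrative, not the LibreAlarm source):

```python
def predict_ahead(raw_history, horizon=15):
    """Fit an OLS line to recent raw readings (oldest first, 1-minute
    spacing) and extrapolate it `horizon` minutes past the newest sample."""
    n = len(raw_history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(raw_history) / n
    # Ordinary least-squares slope and intercept
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, raw_history))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + horizon)

# A steady fall of 2 mg/dL per minute over the last 10 minutes...
history = [140 - 2 * t for t in range(10)]  # 140, 138, ..., 122
print(predict_ahead(history))  # -> 92.0 (122 - 2 * 15)
```

On a perfectly linear trend this is exact; its weakness, as noted above, is that it assumes the current trend simply continues, which understates fast accelerating falls.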

    Now, while I had almost completely bought into these simple predictions, I came across a specific situation where this algorithm completely failed: I was entering a nightly hypo with very slowly falling values into the 40s (verified by finger prick). The sensor raw data still showed values around 70, and consequently LibreAlarm did not alarm me. The Abbott reader, however, was able to identify the abnormal values and corrected them to displayed spot values very near the actual ones. Furthermore, the historical data was also already corrected! This was revealing to me: it appears the sensor was able to deliver some accompanying information about its state, and the Libre reader was able to react to it while LibreAlarm wasn't. My personal theory is that I had been lying on the sensor, it therefore had been drifting, and some data (temperature?) was used to compensate for the drift.

    So, from these anecdotes you can see that I do see some good value in these open projects but agree on the difficulties arising from not being able to systematically validate the application space. I therefore wholeheartedly agree with the criticism of the weird hiding of the prediction algorithm in some semi-open projects, which doesn't allow for any constructive criticism or improvement. Now, as an engineer I am very much used to making useful things out of incomplete parts, so I still follow these projects closely and am willing to figure out how to overcome their deficiencies if possible, as the stock Abbott reader's spot values are really only good for validating that I am somewhat in range and stable anyway.

    With all of this criticism of Abbott's reader, however, I have to make clear that I absolutely give kudos to the company for coming up with an overall solution (including its business model) that is able to find a price point making such a therapeutic device widely available. To me, that aspect is actually just as innovative as all the technology behind it, and a high bar for any competitor to beat.

    Cheers, Markus

  2. Hi Pierre, I was linked to your blog some time ago and have not managed to read much of it because of time constraints, but you are clearly more expert than me on the actual medical side of diabetes (I'm a derivative type.)

    Your "rant alert" reminded me a lot of my own blog post from a few months ago, which came to be for similar reasons: https://blog.flameeyes.eu/2016/04/last-words-on-diabetes-and-software/

    I wonder if we can actually have a parallel community of people who actually *care* about diabetes, rather than quantified-self enthusiasts who just want something that looks nice.

    1. Lots of people do care and develop nice stuff without any commercial afterthoughts: it is just that accuracy and accuracy validations aren't very high on their list...
      Ah - and I too enjoyed your IPv6 and FOSDEM rant ;)

  3. What is the primary technical reason behind the differing lag between Dexcom & Libre?

  4. There are three components in the lag:
    - physiological lag (let's say 5 to 15 min depending on conditions, sites, etc.); all sensors are roughly equal here
    - sensing lag (1-2 min); depends on architecture, diffusion through membranes, electron transfer speed. The Libre has a small advantage on paper here (wired enzyme)
    - algorithmic lag (where most of the differences lie). The G4 would kind of average the past in order to smooth noise. The G4 505 is more clever and uses a more direct interpretation of the last value, modulated by the past (similar to what is described in the Smart Sensor paper by Facchinetti). The Libre is more dynamic, not afraid to issue "predictions" (which Abbott apparently prefers to call "delay compensation").

    Sampling frequency plays a major role as well, of course: the Dexcom returns a value every 5 minutes, the Libre every minute. From my experience with both, I'd say the Libre's wired enzyme offers a better SNR but is more prone to noise. The 5-minute limitation on the Dexcom could be either a decision (people up there feeling it is good enough) or a power or signal constraint due to the technology used. I have an opinion, but not much to back it up. Very low-level operational details are definitely in the realm of secret IP/confidential info.

    Incidentally, the predictive part can be found in the Facchinetti paper and, at least partly, in the Dexcom as well, but it is not activated.
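The cost of the "averaging the past" approach is easy to quantify: on a steadily changing signal, a trailing N-sample moving average reports a value that is (N - 1) / 2 sample intervals old. A toy illustration (the ramp and window size are made up for the example):

```python
def trailing_average(signal, n):
    """Trailing n-sample moving average; output starts at index n - 1."""
    return [sum(signal[i - n + 1:i + 1]) / n for i in range(n - 1, len(signal))]

# Glucose rising steadily at 2 mg/dL per minute for 30 minutes
signal = [100 + 2 * t for t in range(30)]
smoothed = trailing_average(signal, 5)  # 5-sample trailing average

# At the last sample the true value is 158, but the average reports the
# value from (5 - 1) / 2 = 2 minutes ago: 154.
print(signal[-1], smoothed[-1])  # -> 158 154.0
```

A delay-compensating algorithm trades that built-in lag for overshoot risk, which is consistent with the Libre behaviour described in the post above.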

  5. Hi Pierre,

    Regarding your rant ... I can halfway understand the feeling but:

    I have to say that I do care about quality, therefore I do not like the manufacturer-approved solution from Abbott. ;-)
    For our tests with n=1.2, I can see that Abbott uses a consistently pretty high gain factor/slope and overshoots many hypers and hypos by ~0.2 compared to finger pricks (Roche). I also see a big influence of sensor variation (some read very high, some only sometimes high), compared to "calibrated to selected finger values".

    My current favourite app apparently just performs a linear regression on the reference points given, and I feed it only with trustworthy values in not-too-"steep" regions. This system seems accurate enough to detect finger-prick failures (sugar residue on the hand, a broken test strip) and is most of the time within 5-10%.
    Of course, I know what I'm doing (my day job is in measurement processing): if the calibration curves change in implausible ways, I drop the reference.

    Anyway: thanks a lot for the shared information in this blog. This helped me a lot while doing my own development.