Tuesday, February 24, 2015

Dexcom G4 (non AP) - calibration in real life (part 2)

No time to read the blog post? Want a summary? Here you go: do not calibrate your Dexcom when in low range.

After looking at the impact of blood meter consistency in the first part of this look at the Dexcom calibration issues, let's have a look at the average observed errors and their importance in different ranges. Two sets of data collection processes were run here: 1) comparing the calibration value entered with the last CGM value available in the 5 previous minutes, and 2) comparing the calibration value entered in the CGM with an extrapolated value based on the slope of the 3 or 4 previous CGM values, to take into account the 15 to 20 minute delay. Initial calibrations obviously don't have prior CGM values, so their impact was evaluated solely on the calibration chain that followed.
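To make the second comparison more concrete, here is a minimal Python sketch of a slope-based extrapolation. It is only an illustration under stated assumptions, not the exact procedure used on the data: the function name `extrapolate_cgm`, the least-squares slope, the 5-minute spacing in the example and the 15-minute default delay are all choices made for this example.

```python
def extrapolate_cgm(readings, delay_min=15.0):
    """Project the CGM trace forward by `delay_min` minutes.

    readings: list of (time_in_minutes, glucose_mg_dl) tuples, oldest first.
    The slope is a least-squares fit over the supplied points (an assumption;
    a simple first-to-last difference would also be a valid choice).
    """
    n = len(readings)
    if n < 2:
        raise ValueError("need at least two CGM readings to estimate a slope")

    mean_t = sum(t for t, _ in readings) / n
    mean_g = sum(g for _, g in readings) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in readings)
    sxy = sum((t - mean_t) * (g - mean_g) for t, g in readings)
    slope = sxy / sxx  # mg/dL per minute

    # Anchor the projection on the most recent reading.
    _, last_g = readings[-1]
    return last_g + slope * delay_min


# Four readings, 5 minutes apart, rising by 1 mg/dL per minute.
previous = [(0, 100), (5, 105), (10, 110), (15, 115)]
projected = extrapolate_cgm(previous, delay_min=15.0)
```

The calibration value would then be compared with `projected` rather than with the last raw CGM value.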

When all the calibration points were compared to the immediate value before calibration, I was surprised to discover the above result. On average, the Dexcom was higher by 20 mg/dL than the blood meter in the low to 60 mg/dL range and lower by 50 mg/dL in the 300-400 mg/dL range. In fact, the bias seems really linear, as if the meters and the Dexcom had been tuned to a different response slope.

The root cause could, of course, be somewhat behavioral: for example, users could tend to correct a low and then recalibrate after a while, or act on a high and again recalibrate a bit later.

But let's have another look at the Bland-Altman plot of the Libre vs the Dexcom (which I didn't have at the time of the Dexcom calibration test).

You can't avoid noticing that the Libre's systematic bias compared to the Dexcom (on another data set) goes in the opposite direction. In other words, the Libre is a bit more trigger happy than the Dexcom in lows and a lot more trigger happy in high ranges. And that happens to be the exact behavior we observe when we compare lots of independent calibration blood meter values and the Dexcom data at that point.
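For readers not familiar with the Bland-Altman approach, here is a short Python sketch of the quantities such a plot is built on: for each pair of readings, the mean of the two devices goes on the x axis and their difference on the y axis, along with the average bias and the usual 1.96 standard deviation limits of agreement. The function name and the sample values are made up for the example and are not taken from the data sets discussed here.

```python
from statistics import mean, stdev


def bland_altman(device_a, device_b):
    """Return the points and summary lines of a Bland-Altman plot.

    device_a, device_b: paired readings in mg/dL, same length.
    """
    if len(device_a) != len(device_b):
        raise ValueError("readings must be paired")

    means = [(a + b) / 2 for a, b in zip(device_a, device_b)]  # x axis
    diffs = [a - b for a, b in zip(device_a, device_b)]        # y axis
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation
    return means, diffs, bias, (bias - spread, bias + spread)


# Hypothetical paired readings, for illustration only.
libre = [100, 200, 300]
dexcom = [90, 210, 290]
means, diffs, bias, limits = bland_altman(libre, dexcom)
```

A bias that changes with the x value, rather than staying flat, is what a range-dependent disagreement between two devices looks like on such a plot.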

Now, let's look at the error as a percentage. This can be deduced from the first chart but it doesn't hurt to visualize it.

A 20 mg/dL difference is, relative to a 50 mg/dL value, a 40% difference, while a 60 mg/dL difference on a 300 mg/dL value is only 20%.
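The arithmetic behind those two percentages can be written as a tiny Python helper (the function name is just for illustration):

```python
def relative_error_pct(difference_mg_dl, reference_mg_dl):
    """Absolute difference expressed as a percentage of the reference value."""
    return 100.0 * abs(difference_mg_dl) / reference_mg_dl


low_case = relative_error_pct(20, 50)    # 20 mg/dL off at 50 mg/dL
high_case = relative_error_pct(60, 300)  # 60 mg/dL off at 300 mg/dL
```

The same absolute error weighs much more heavily when the reference value is small, which is why the low range looks so bad in relative terms.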

In low ranges, whether for intrinsic technical reasons or for behavioral reasons, the Dexcom is much less accurate than it is in physiological or high ranges. Core technical reasons could be similar to what Emil Martinec described in his Noise, Dynamic Range and Exposure paper (a very good read). There are dozens of possible behavioral reasons. But, in our daily lives, we don't care about the real primary cause, the practical result is the same.

Yet another view of the issue can be found below.

What if we look at the MARD of all individual data sets and see if it is correlated with the frequency of calibrations in a certain range? We discover that the frequency at which you calibrate in low range is quite positively correlated with the magnitude of the average error of the data set, and that the frequency of calibrations in physiological range is negatively correlated with the magnitude of the error. In other words, the more you calibrate in normal range, the more accurate your global CGM session will be.
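As a rough sketch of this kind of analysis, the Python below computes the MARD (mean absolute relative difference) of a data set, the share of its calibrations that fall in the low range, and a Pearson correlation between the two across data sets. The 70 mg/dL low threshold, the function names and the choice of Pearson correlation are assumptions made for the example; they are not necessarily the cutoffs or statistics behind the chart.

```python
from math import sqrt


def mard(pairs):
    """MARD in percent; pairs are (cgm_mg_dl, meter_mg_dl) tuples."""
    return 100.0 * sum(abs(cgm - meter) / meter for cgm, meter in pairs) / len(pairs)


def low_fraction(pairs, low_threshold=70.0):
    """Share of calibrations whose meter value is below the (assumed) low threshold."""
    return sum(1 for _, meter in pairs if meter < low_threshold) / len(pairs)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def low_calibration_correlation(data_sets, low_threshold=70.0):
    """Correlate, across data sets, the low-calibration share with the MARD."""
    shares = [low_fraction(pairs, low_threshold) for pairs in data_sets]
    errors = [mard(pairs) for pairs in data_sets]
    return pearson(shares, errors)
```

A coefficient close to +1 from `low_calibration_correlation` would mean that data sets with more low calibrations also tend to have a higher MARD.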

After looking at the data, my working hypothesis became "low calibrations are worse than about anything else".


To test that hypothesis, I decided to run a few experiments. While I could post dozens of similar examples, here are two examples taken from my own pre-experiment Dexcom traces.

Here is a not so ideal double initial calibration in high range. The consequences aren't too bad and the Dexcom tracks the following calibrations decently. (I have multiple similar examples.)

Here is an extremely accurate initial calibration in low range (that one was triple checked, actually). The error is extremely significant on the next calibration and still significant the next evening. (I'll get back to why this happens later.)

And here are a few random samples extracted from the global data set. Look at examples 1 and 5 for large errors. (EDIT - CHART ONE DUPLICATE OF ABOVE - IMAGE UPLOAD ERROR - WILL FIX)


The main lesson for me here has been to avoid low calibrations like the plague, especially low double initial calibrations.

This is not a revelation: the typical calibration guidelines, which could be summarized as "calibrate when stable and in range", implicitly tell you not to calibrate when you are not in range. Stick with that simple rule, and you will be fine. However, that rule does imply that being out of range or unstable is equally bad, wherever you stand. The data shows that this is not the case.

If I have to choose between calibrating stable in low or calibrating unstable in high, I definitely know what I will do...

In one of the next posts, I will look at the effect of the rate of change on the calibration accuracy. Much to my surprise, it had much less impact than I would have thought.

Additional notes:

I am aware that I could go into more detail about the numbers analyzed and the statistics applied. I have a ton of spreadsheets and "pickles" of the data, ANOVA, MANOVA, etc., but the goal of this experiment was to tackle the issue from a practical point of view, not to publish a scientific paper.

I am also aware that there is a certain circularity in removing inaccurate low calibrations with large errors in order to obtain a more "correct" global file. I am not solving the issue, I am just avoiding it so it does not impact subsequent accuracy. As long as we need to calibrate CGMs, there should be a standard method to evaluate the impact of calibration (in)accuracy on the data stream that follows. I suspect some artificial pancreas teams are looking into this and I really hope they do: user calibration will be, I believe, a significant obstacle on the road to an autonomous AP.
