Update
I had made an error in the coding for the HADCRUT/C&W example - see below. The agreement with C&W is now much improved.
In my previous post, I introduced the idea of hierarchical, or nested, gridding. In earlier posts, e.g. here and here, I had described using platonic solids as a basis for grids on the sphere that are reasonably uniform, and free of the pole singularities of latitude/longitude. I gave data files for icosahedral hexagon meshes of various degrees of resolution, usually proceeding by a factor of two in cell number or length scale. And in that previous post, I emphasised the simplicity of a scheme for working out which cell a point belongs to, by finding the nearest cell centre point. I foreshadowed the idea of embedding each such grid in a coarser parent, with grid averaging proceeding downward, and using the information from coarser levels to supply estimates for empty cells.
The following graph from HADCRUT illustrates the problem. It shows July 2017 temperature anomalies on a 5°x5° grid, with colors for cells that have data, and white otherwise. They average the colored area, and omit the rest from the average. As I often argue, as a global estimate this effectively replaces the rest by the average value. HADCRUT is aware of this, because they actually average by hemispheres, which means the infilling is done with the hemisphere average rather than the global one. As they point out, this has an important benefit in earlier years, when the majority of missing cells were in the SH, which was also behaving differently, so the hemisphere average is more appropriate than the global. On the right, I show the same figure, but this time with my crude coloring in (with Paint) of that hemisphere average. You can assess how appropriate the infill is:
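To see the "effectively replaces" point concretely, here is a small sketch (Python, with made-up cell values): averaging only the observed cells gives exactly the same global number as infilling every empty cell with that observed average.

```python
import numpy as np

# Five equal-area cells; two have no data. Values are made up.
cells = np.array([0.3, 0.5, np.nan, 0.7, np.nan])

# Average over observed cells only, HADCRUT-style...
obs_mean = np.nanmean(cells)

# ...equals the average after infilling the empty cells with obs_mean.
infilled = np.where(np.isnan(cells), obs_mean, cells)
assert abs(infilled.mean() - obs_mean) < 1e-12
```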
A much-discussed paper by Cowtan and Way 2013 noted that this process led to bias, in that the areas thus infilled tended not to have the behaviour of the average, but were warming faster; the warming was therefore underestimated, particularly since 2000, because of the Arctic. They described a number of remedies, and I'll concentrate on their use of kriging, a fairly elaborate geostatistical interpolation method. When applied, HADCRUT data-based trends increased to be more in line with other indices which did some degree of interpolation.
I think the right way to look at this is getting infilling right. HADCRUT was on the right track in using hemisphere averages, but it should be much more local. Every missing cell should be assigned the best estimate based on local data. This is in the spirit of spatial averaging. The cells are chosen as regions of proximity to a finite number of measurement points, and are assigned an average from those points because of the proximity. Proximity does not end at an artificial cell boundary.
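A minimal sketch of what "local infill" could mean, on a toy 1D strip of cells (Python, with made-up values; the actual scheme uses nested hex grids rather than nearest neighbours):

```python
import numpy as np

# Eight cells, some without data; each empty cell takes the value of
# the nearest observed cell, instead of a hemispheric or global mean.
vals = np.array([0.2, np.nan, np.nan, 0.6, 0.8, np.nan, 1.1, np.nan])
obs = np.flatnonzero(~np.isnan(vals))           # indices of cells with data

# nearest observed cell for every cell position
dist = np.abs(np.arange(vals.size)[:, None] - obs)
filled = vals[obs[np.argmin(dist, axis=1)]]
assert np.allclose(filled, [0.2, 0.2, 0.6, 0.6, 0.8, 0.8, 1.1, 1.1])
```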
In the previous post, I set up a grid averaging based on an inventory of about 11000 stations (including GHCN and ERSST), but integrated not temperature but the simple function sin(latitude)^2, whose true global average is 1/3. I used averaging omitting empty cells, and showed that at coarse resolution the correct value was closely approximated, but that this degraded with refinement, because of the accession of empty cells. I'll now complete that table using nested integration with the hexagonal grid. At each successive level, if a cell is empty, it is assigned the average value of the smallest cell from a previous integration that includes it (I have fixed the which.min error here too; it made little difference).
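As a quick independent check that 1/3 is the right target: for points uniform on the sphere, z = sin(latitude) is uniform on [-1,1] (Archimedes' hat-box result), so the area-weighted mean of sin(latitude)^2 is E[z^2] = 1/3. A Monte Carlo sketch in Python:

```python
import numpy as np

# z = sin(latitude) is uniform on [-1, 1] for uniform points on the
# sphere, so the area mean of sin(latitude)^2 is E[z^2] = 1/3.
rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, 2_000_000)
mc = (z**2).mean()
assert abs(mc - 1/3) < 5e-3
```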
level  Numcells  Simple average  Nested average
  1        32        0.3292          0.3292
  2       122        0.3311          0.3311
  3       272        0.3275          0.3317
  4       482        0.3256          0.3314
  5      1082        0.3206          0.3317
  6      1922        0.3167          0.3320
  7      3632        0.3110          0.3313
  8      7682        0.3096          0.3315
The simple average shows that there is an optimum: a grid fine enough to resolve the (small) variation, but coarse enough to have data in most cells. The function is smooth, so there is little penalty for being too coarse, but a larger one for being too fine, since the areas of empty cells coincide with the function peaks at the poles. The merit of the nested average is that it removes this downside. Further refinement may not help very much, but it does no harm, because a near-local value is always used.
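The inheritance step can be sketched as follows: a toy 1D analogue in Python, with made-up station positions, where empty cells at the fine level pick up the average of their coarse parent (the post's actual code, in R, appears below):

```python
import numpy as np

# Toy 1D analogue of nested averaging on [0, 1]: a coarse grid of 2
# cells and a fine grid of 8, with made-up station positions that
# leave some fine cells empty.
stations = np.array([0.05, 0.15, 0.3, 0.55, 0.6])
values = np.sin(stations)**2                      # any integrand will do

def cell_means(vals, pos, ncells):
    idx = np.minimum((pos * ncells).astype(int), ncells - 1)
    means = np.full(ncells, np.nan)
    for c in range(ncells):
        if (idx == c).any():
            means[c] = vals[idx == c].mean()
    return means

coarse = cell_means(values, stations, 2)          # both coarse cells have data
fine = cell_means(values, stations, 8)            # several fine cells are empty
parent = np.repeat(coarse, 4)                     # parent mean under each fine cell
nested = np.where(np.isnan(fine), parent, fine)   # empty cells inherit
assert not np.isnan(nested).any()                 # the nested grid is complete
```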
The actual coding for nested averaging is quite simple, and I'll give a more complete example below.
HADCRUT and Cowtan/Way
Cowtan and Way thankfully released a very complete data set with their paper, so I'll redo their calculation (with kriging) using nested gridding and compare results. They used HADCRUT 4.1.1.0, with data ending at the end of 2012. Here is a plot of results from 1980, with nested integration of the HADCRUT gridded data at cell centres (but on a hex grid). I'm showing every even level as hier1-hier4, with hier4 being the highest resolution at 7682 cells. All anomalies are relative to 1961-90.
Update: I had made an error in coding for the HADCRUT/C&W example - see code. I had used which.min instead of which.max. This almost worked, because it consistently placed locations in the cells on the opposite side of the globe. However, the result is now much more consistent with C&W. With refining, the integrals now approach from below, and also converge much more tightly.
The HADCRUT 4 published monthly average (V4.1.1.0) is given in red, and the Cowtan and Way Version 1 kriging in black.
The nested integration makes even more difference than C&W, mainly in the period from 1995 to the early 2000's. Otherwise, as with C&W, it adheres closely to HADCRUT in earlier years, when presumably there isn't much bias associated with the missing data. C&W focussed on the effect on OLS trends, particularly since 1/1997. Here is a table, in °C/Century:
          Trend 1997-2012  Trend 1980-2012
HAD 4          0.462            1.570
Hier1          0.902            1.615
Hier2          0.929            1.635
Hier3          0.967            1.625
Hier4          0.956            1.624
C&W krig       0.970            1.689
Convergence is very good to the C&W trend I calculate (0.970), which agrees well with the nested results. In their paper, for 1997-2012, C&W give a trend of 1.08 °C/Cen (Table III). C&W used ARMA(1,1) rather than OLS, but the discrepancy seems too large for that.
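For reference, the trends in the table are plain OLS fits to the monthly series, scaled to °C/century. A sketch in Python on synthetic data with a known slope (a real anomaly series would replace `anom`):

```python
import numpy as np

# Synthetic monthly anomalies for 1997-2012 with a built-in trend of
# 0.01 degC/year, i.e. 1.0 degC/century, plus noise.
rng = np.random.default_rng(1)
t = 1997 + np.arange(16 * 12) / 12.0            # decimal years
anom = 0.01 * (t - t[0]) + rng.normal(0.0, 0.05, t.size)

slope_per_year = np.polyfit(t, anom, 1)[0]      # OLS slope
trend_century = 100 * slope_per_year            # degC/century
assert 0.7 < trend_century < 1.3                # recovers ~1 degC/century
```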
Update: Kevin Cowtan has explained the difference in a comment below.
Method and Code
This is the code for the integration of the monthly sequence. I'll omit the reading of the initial files and the graphics, and assume that we start with the HADCRUT 4.1.1.0 gridded 1980-2012 data reorganised into an array had[72,36,12,33] (lon,lat,month,year). The hexmin[[]] lists are as described and posted previously. The 4 columns of $cells are the cell centres and areas (on the sphere). The first section just makes pointer lists from the anomaly data into the grids, and from each grid into its parent. If you were doing this regularly, you would store the pointers and just reuse them as needed, since they depend only on location data. The result is the gridpointer and datapointer lists. The code takes a few seconds.
monthave=array(NA,c(12,33,8)) #array for monthly averages
datapointer=gridpointer=cellarea=list(); g=0;
for(i in 1:8){ # making pointer lists for each grid level
g0=g; # previous g
g=as.matrix(hexmin[[i]]$cells); ng=nrow(g);
cellarea[[i]]=g[,4]
if(i>1){ # pointers to coarser grid i-1
gp=rep(0,ng)
for(j in 1:ng)gp[j]=which.max(g0[,1:3]%*%g[j,1:3])
gridpointer[[i]]=gp
}
y=inv; ny=nrow(y); dp=rep(0,ny) # y is list of HAD grid centres in 3D cartesian
for(j in 1:ny) dp[j]=which.max(g[,1:3]%*%y[j,])
datapointer[[i]]=dp # datapointers into grid i
}
Update: Note the use of which.max here, which is the key instruction locating points in cells. I had originally used which.min, which actually almost worked, because it places points on the opposite side of the globe, and symmetry nearly makes that OK. But not quite. Although the idea is to minimise the distance, for unit vectors that is implemented as maximising the scalar product.
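The identity behind this: for unit vectors u and v, |u-v|^2 = 2 - 2 u.v, so the nearest centre (minimum distance) is the one with the maximum scalar product. A quick check, sketched in Python with a few arbitrary unit vectors:

```python
import numpy as np

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# A few cell centres and a test point, all on the unit sphere.
centres = np.vstack([unit([1, 0, 0]), unit([0, 1, 0]), unit([1, 1, 1])])
p = unit([0.9, 0.1, 0.2])

by_dot = np.argmax(centres @ p)                           # which.max analogue
by_dist = np.argmin(np.linalg.norm(centres - p, axis=1))  # nearest by distance
assert by_dot == by_dist                                  # same cell either way
```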
The main data loop just loops over months, counting and adding the data in each cell (using datapointer) to form cell averages. For empty cells it then inherits values from the parent grid's average vector, using gridpointer to find the match, so at each level ave is complete. There is an assumption that the coarsest level has no empty cells. The averages are then combined with area weighting (cellarea, from hexmin) to give the monthly average. Then on to the next month. The result is the array monthave[month, year, level] of global averages.
for(I in 1:33)for(J in 1:12){ # looping over months in data from 1980
if(J==1)print(Sys.time())
ave=rep(NA,8) # initialising
#tab=data.frame(level=ave,Numcells=ave,average=ave)
g=0
for(K in 1:8){ # over resolution levels
ave0=ave
integrand=c(had[,,J,I]) # HAD 4 anomalies for this month (had holds 1980-2012)
area=cellarea[[K]];
cellsum=cellnum=rep(0,length(area)) # initialising
dp=datapointer[[K]]
for(i in 1:length(integrand)){ # loop over HAD grid cells ("stations")
ii=integrand[i]
if(is.na(ii))next # no data in cell
j=dp[i]
cellsum[j]=cellsum[j]+ii
cellnum[j]=cellnum[j]+1
}
j=which(cellnum==0) # cells without data
gp=gridpointer[[K]]
if(K>1)for(i in j){cellnum[i]=1;cellsum[i]=ave0[gp[i]]}
ave=cellsum/cellnum # cell averages
Ave=sum(ave*area)/sum(area) # global average (areaweighted)
if(is.na(Ave))stop("A cell inherits no data")
monthave[J,I,K] = round(Ave,4) # weighted average
}
}# end I,J
Data
Moyhuhexmin has the hex cell data and was given in the earlier post. I have put a new zipped ascii version here.