Image Processing Reference
between bias and fly number in our estimates (Figure 1). While the raw output overestimates
the number of flies on a patch at low fly numbers, it tends to underestimate fly numbers
when there are more flies on a patch (Blob bias: est = −0.034, df = 442,173, t = −309.1, P < 0.001).
However, TABU does show evidence of a consistent bias towards over-counting, which becomes
slightly stronger at high numbers of flies (TABU bias: est = 0.0075375, df = 495,300,
t = 71.67, P < 0.001). Application of the TABU algorithm reduces the number of spurious patch
joining and leaving events to about 30% of that in the raw blob data (Table 1). However, even for
the TABU output, the number of inferred joining and leaving events is still more than twice that in the
actual data, offering potential for improvement through subsequent application of ML.
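The count-dependent bias described above can be characterized by regressing the per-fly count error against the true number of flies on a patch; a negative slope means the detector drifts from over- to under-counting as patches get crowded. The following is a minimal sketch of that regression with made-up illustrative values, not the paper's data or code:

```python
import numpy as np

# Hypothetical per-frame data: true fly counts on a patch and the
# corresponding blob-detector estimates (illustrative values only).
true_counts = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
estimated = np.array([1.2, 2.1, 3.0, 3.9, 4.7, 5.5, 6.3, 7.1, 7.8, 8.6])

# Per-fly count error D (over-counts positive, under-counts negative).
D = estimated - true_counts

# Ordinary least squares for D = intercept + slope * true_counts.
# A negative slope indicates a shift from over- to under-counting
# as the number of flies on a patch grows.
A = np.column_stack([np.ones_like(true_counts), true_counts])
(intercept, slope), *_ = np.linalg.lstsq(A, D, rcond=None)

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

With these toy numbers the fitted slope is negative, mirroring the sign of the blob-bias estimate reported above.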
FIGURE 1 Heat map of the distribution of per-fly over- and under-counts (D) as a function of
the number of flies on a patch for each frame across five test videos.
We now investigate whether application of ML methods to our TABU trajectories can identify
miscalled blob counts B_N. Threefold cross-validation model-fit results are shown in Table 2.
Here algorithms were trained using a period of 10,000 frames. We see that all models achieve an
accuracy above 0.98. The two SVM models rank highly on almost all metrics, while logistic
regression ranks poorly on most. While JAABA is not top ranked on any metric, we note
that it performs very well overall.
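Threefold cross-validation itself is straightforward to sketch: the labelled frames are split into three folds, each model is trained on two folds and scored on the held-out third, and each fold is held out exactly once. The example below is a hypothetical illustration on synthetic data; the simple threshold rule stands in for the actual classifiers (SVMs, logistic regression, JAABA), which are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labelled frames: one feature per frame and a binary
# label marking whether the blob count was miscalled.
X = rng.normal(size=300)
y = (X + rng.normal(scale=0.5, size=300) > 0).astype(int)

def train_threshold(Xtr, ytr):
    """Pick the feature threshold that best separates the classes
    on the training folds (a stand-in for a real classifier)."""
    candidates = np.linspace(Xtr.min(), Xtr.max(), 50)
    accs = [np.mean((Xtr > t).astype(int) == ytr) for t in candidates]
    return candidates[int(np.argmax(accs))]

# Threefold cross-validation: each fold is held out once.
folds = np.array_split(rng.permutation(len(X)), 3)
scores = []
for k in range(3):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(3) if j != k])
    t = train_threshold(X[train_idx], y[train_idx])
    scores.append(np.mean((X[test_idx] > t).astype(int) == y[test_idx]))

print([round(s, 3) for s in scores])  # one accuracy per held-out fold
```

Averaging the three held-out scores gives a less optimistic performance estimate than scoring on the training frames themselves.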
TABLE 2 Performance Measures of ML Algorithms for Multifly Calling on Threefold Cross-Validation
Accuracy | Sensitivity | Specificity | Precision | AUC
The accuracy, sensitivity, specificity, precision, and area under the curve (AUC) scores are shown for each model. Ranks among ML methods for each performance score are given in brackets.
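The first four scores in the table follow the standard definitions from a binary confusion matrix (AUC additionally requires ranked classifier scores and is omitted here). As a reference, a minimal sketch using made-up counts, not values from this study:

```python
# Standard binary-classification metrics from confusion-matrix counts.
# The counts below are illustrative, not taken from the study's data.
tp, fp, tn, fn = 90, 5, 880, 25

accuracy = (tp + tn) / (tp + fp + tn + fn)  # fraction of all frames called correctly
sensitivity = tp / (tp + fn)                # recall: miscalled counts actually caught
specificity = tn / (tn + fp)                # correctly called frames left alone
precision = tp / (tp + fp)                  # flagged frames that were truly miscalled

print(accuracy, sensitivity, specificity, precision)
```

A high accuracy alone can be misleading when miscalls are rare, which is why sensitivity and precision are reported alongside it.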
The critical practical question is whether models trained on one part of a video will be
equally effective when applied to later periods of the same video, or to completely new videos.
Fly behavior is known to change over time, and varies among genotypes and across social
contexts. We tested the performance of all algorithms on four videos that were not used in
training. These included different genotypes and sex ratios, as well as slightly different
lighting and focus, from the conditions under which the algorithms were trained. Results are shown
in Table 3. The performance of all ML methods dropped slightly under these new conditions.
All the ML methods improved upon the trajectory input data from TABU. The performance