1 State-of-the-art
1.2 Datasets
1.2.1 Requirements
Before presenting the different annotated tree datasets and the reasons why they were not fully usable for the project, let's enumerate the conditions and requirements I was looking for to properly train the model:
- Multiple types of data:
  - Aerial RGB images
  - LiDAR point clouds (preferably aerial)
  - Aerial infrared (CIR) images (optional)
  - Tree crown annotations or bounding boxes
- High-enough resolution:
  - For images, about 25 cm
  - For point clouds, about 10 cm
Here are the explanations for these requirements. Regarding the types of data, RGB images and point clouds are required to experiment on the ability of the model to combine the two very different kinds of information they hold. Infrared data can also improve tree detection, but it was optional for this work because RGB images are enough to study the combination. Regarding tree annotations, it is necessary to have a way to spatially identify each tree individually, using crown contours or simply bounding boxes. Since the model outputs bounding boxes, any other annotation format can easily be converted to bounding boxes, as shown in the sketch below. Finally, the resolution has to be high enough to identify all individual trees, including the smallest ones. For the point clouds especially, the whole idea is to see if and how the topology of the trees can be learnt, using at least the trunks and, if possible, the biggest branches. This is why, even though the two resolutions are not directly comparable, the requirement is stricter for the point clouds.
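As an illustration of this conversion between annotation formats, here is a minimal sketch assuming crown contours stored as polygons with the shapely library; the `crown_to_bbox` helper and the sample coordinates are hypothetical.

```python
from shapely.geometry import Polygon

def crown_to_bbox(crown: Polygon) -> tuple[float, float, float, float]:
    """Reduce a crown contour to an axis-aligned bounding box (xmin, ymin, xmax, ymax)."""
    return crown.bounds  # shapely exposes the envelope of any geometry

# Hypothetical crown contour in map coordinates (metres)
crown = Polygon([(120.0, 480.2), (123.5, 481.0), (124.1, 484.7), (119.2, 483.9)])
print(crown_to_bbox(crown))  # (119.2, 480.2, 124.1, 484.7)
```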
1.2.2 Existing datasets with annotated trees
As explained above, there are quite a lot of requirements to fulfill for a dataset with annotated trees to be suitable for the task. In practice, almost all the available datasets with annotated trees are missing something, as they mainly focus on one kind of raw data (either spectral/hyperspectral images or LiDAR point clouds) and try to make the most out of it, instead of trying to use all the types of data together.
The most comprehensive list of tree annotation datasets was published in OpenForest (Ouaknine et al. 2023). FoMo-Bench (Bountos, Ouaknine, and Rolnick 2023) also lists several interesting datasets, even though most of them can also be found in OpenForest. Without enumerating all of them, there are several kinds of datasets that all have their own flaws regarding the requirements of this work.
Firstly, there are the forest inventories. TALLO (Jucker et al. 2022) is probably the most interesting one in this category, because it contains a lot of spatial information about almost 500K trees, with their locations, their crown radii and their heights. Therefore, everything needed to localize trees is in the dataset. However, I didn’t manage to find RGB images or LiDAR point clouds of the areas where the trees are located, making it impossible to use these annotations to train tree detection.
Secondly, there are the RGB datasets. Two examples of such datasets with high image quality are ReforesTree (Reiersen et al. 2022) and MillionTrees (B. Weinstein 2023). Their only, but major, drawback is that they do not provide any kind of point cloud, which makes them unsuitable for the task.
Thirdly, there are the LiDAR datasets, such as (Kalinicheva et al. 2022) and (Puliti et al. 2023). Similarly to the RGB datasets, they lack one of the data sources needed for the task I worked on. Unlike them, however, they have the advantage that the missing data could be much easier to acquire from another source, since RGB aerial or satellite images are much more common than LiDAR point clouds. This solution was nevertheless abandoned for two main reasons. First, it is often quite challenging to find the exact locations where the point clouds were acquired. Second, even when the location is known, it is often in the middle of a forest where the quality of openly available satellite imagery is very low.
Finally, I also found two datasets with both RGB and LiDAR components. The first one is MDAS (Hu et al. 2023). This benchmark dataset encompasses RGB images, hyperspectral images and Digital Surface Models (DSM). There are however two major flaws. The obvious one is that this dataset was created with land-cover semantic segmentation tasks in mind, so there are no tree annotations. The less obvious one is that a DSM is not a point cloud, even though it is a kind of 3D information and is often derived from a LiDAR point cloud. As a consequence, this substantially limits the ability to experiment with the point cloud.
The only real dataset with both RGB and LiDAR comes from NEON (B. Weinstein, Marconi, and White 2022). This dataset contains exactly the data I was looking for, with RGB images, hyperspectral images and LiDAR point clouds. With 30975 tree annotations spanning multiple varied forests, it is also quite a large dataset. The main reason why I decided not to use it in the end is the quality of the data, which is not bad, but not as good as that of the data available for the Netherlands, which I describe in Section 1.2.3.
1.2.3 Public data
After rejecting all the available datasets I had found, the only remaining solution was to create my own dataset. I won't dive too deep into this process here, as I explain it in Chapter 3. I just want to mention all the publicly available raw data that I used or could have used to create this custom dataset.
For practical reasons, the two countries where I mostly searched for available data are France and the Netherlands. I was looking for three different data types independently:
- RGB (and if possible CIR) images
- LiDAR point clouds
- Tree annotations
These three types of data are available in similar ways in both countries, although the Netherlands have a small edge over France. RGB images are really easy to find in France with the BD ORTHO (Institut national de l'information géographique et forestière (IGN) 2021) and in the Netherlands with the Luchtfotos (Beeldmateriaal Nederland 2024), but the resolution is better in the Netherlands (8 cm vs 20 cm). Infrared (CIR) images are also available in both countries, although for those the resolution is only 25 cm in the Netherlands.
As for LiDAR point clouds, the Netherlands are also ahead of France, because they have already completed their fourth version covering the whole country with AHN4 (Actueel Hoogtebestand Nederland 2020), and are working on the fifth version. In France, data acquisition for the first LiDAR point cloud covering the whole country started a few years ago (Institut national de l'information géographique et forestière (IGN) 2020). It is not finished yet, even though the data is already available for half of the country. The other advantage of the Dutch LiDAR point clouds is that all flights are performed during winter, which allows light beams to penetrate more deeply into the trees and reach trunks and branches. This is not the case in France, where data is acquired throughout the whole year, adding a level of variation and therefore more difficulty.
The part that is missing in both countries is tree annotations. Many municipalities maintain datasets containing information about all the public trees they manage. This is for example the case for Amsterdam (Gemeente Amsterdam 2024) and Bordeaux (Bordeaux Métropole 2024). However, these datasets cannot really be used as ground truth for a custom dataset, for several reasons. First, many of them do not contain coordinates indicating the position of each tree in the city. Then, even those that do contain coordinates are most of the time missing any information from which a bounding box of the tree crown could be deduced. Finally, even if they contained everything, they only cover public trees and miss every tree located on private land. Since public and private areas are intertwined in all cities, any area we try to train the model on would be missing all the private trees, which makes training impossible because images cannot be only partially annotated.
The other tree annotation source that could have been used is the Boomregister (Coöperatief Boomregister U.A. 2014). This work covers the whole of the Netherlands, including public and private trees. However, the precision of the masks is far from perfect, and many trees are missing or incorrectly segmented, especially when they are less than 9 m high or have a crown diameter smaller than 4 m. Even though it is a very impressive piece of work, I therefore decided that it could not be used as training data for deep learning models because of its biases and imperfections. The only remaining solution was thus to annotate trees myself, creating my own dataset of annotated trees from the available data.
1.2.4 Dataset augmentation techniques
When a dataset is too small to train a model, there are several ways of artificially enlarging it.
The most common way is to randomly apply deterministic or random transformations to the data during the training process, in order to generate several unique, different and realistic data instances from one real data instance. A lot of different transformations can be applied to images, divided into two categories: pixel-level and spatial-level (Buslaev et al. 2020). Pixel-level transformations modify the value of individual pixels by applying different filters, such as random noise, color shifts and more complex effects like fog and sun flare. Spatial-level transformations modify the spatial arrangement of the image without changing the pixel values; in other words, they move the pixels in the image. They range from simple rotations and croppings to complex spatial distortions. In the end, all these transformations simply produce one artificial image out of one real image.
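As a minimal sketch of such a transformation pipeline, assuming the albumentations library cited above (Buslaev et al. 2020), pixel-level and spatial-level transformations can be composed while keeping the bounding boxes consistent; the image content, bounding box and label values are hypothetical placeholders.

```python
import numpy as np
import albumentations as A

# Pixel-level (noise, brightness) and spatial-level (flip, crop) transformations,
# each applied with a given probability at every training iteration.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.3),
        A.GaussNoise(p=0.2),
        A.RandomCrop(height=512, width=512, p=1.0),  # assumes tiles of at least 512 x 512 pixels
    ],
    # Spatial-level transformations are also applied to the bounding boxes.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = np.random.randint(0, 255, (1024, 1024, 3), dtype=np.uint8)  # stand-in for a real RGB tile
bboxes = [(34, 50, 120, 140)]  # hypothetical tree bounding box (xmin, ymin, xmax, ymax)
augmented = transform(image=image, bboxes=bboxes, labels=["tree"])
aug_image, aug_bboxes = augmented["image"], augmented["bboxes"]
```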
Another way to enlarge a dataset is to generate completely new input data sharing the same properties as the initial dataset. This can be done using Generative Adversarial Networks (GANs). These models usually have two parts, a generator and a discriminator, which are trained in parallel. The generator learns to produce realistic artificial data, while the discriminator learns to distinguish real data from the artificial data produced by the generator. If the training is successful, the generator can then be used with random seeds to generate random but realistic artificial data similar to the dataset. This method has for example been successfully used to generate artificial tree height maps (Sun et al. 2022).
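To make the adversarial setup more concrete, here is a heavily simplified PyTorch sketch of one training step, with a generator mapping a random seed to an artificial sample and a discriminator scoring real versus artificial data; the network architectures and sizes are placeholders, not the ones used in (Sun et al. 2022).

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 64, 128 * 128  # hypothetical sizes (e.g. a flattened height map)
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch: torch.Tensor) -> None:
    batch_size = real_batch.shape[0]
    fake_batch = generator(torch.randn(batch_size, latent_dim))

    # Discriminator step: real samples are labelled 1, generated samples 0.
    d_loss = bce(discriminator(real_batch), torch.ones(batch_size, 1))
    d_loss = d_loss + bce(discriminator(fake_batch.detach()), torch.zeros(batch_size, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator label generated samples as real.
    g_loss = bce(discriminator(fake_batch), torch.ones(batch_size, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```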
However, training GANs can be very unstable, and I have not found any paper applying this technique to generate LiDAR and RGB data at the same time. The artificial instances would need to be consistent between the two types of data, which might be very difficult. Therefore, I only used the random image transformations during the training process, because the chances of successfully training a GAN seemed too low to be worth it.
1.3 Algorithms and models
In this section, the different algorithms and methods are grouped according to the type of data they use as input.
1.3.1 Images only
First, there are methods that perform tree detection using only visible or hyperspectral images, or both. Several algorithms have been developed to analytically delineate tree crowns from images, using the particular shape of trees and its effect on images (Gomes and Maillard 2016). Without diving into the details, here are a few of them. The watershed algorithm treats tree crowns as inverted watersheds in the grey-scale image, and crown frontiers are found by incrementally flooding the watersheds (Vincent and Soille 1991). Local maxima filtering uses the intensity of the pixels in the grey-scale image to identify the locally brightest points and use them as treetops (Wulder, Niemann, and Goodenough 2000). Conversely, the valley-following algorithm uses the darkest pixels, which are considered as the junctions between trees, since shaded areas correspond to the lower parts of the tree crowns (Gougeon et al. 1998). Another interesting algorithm is template matching, which simulates the appearance of simple tree templates under given lighting conditions and tries to identify similar patterns in the grey-scale image (Pollock 1996). Combinations of these techniques and others have also been proposed.
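To give an idea of how these analytical methods translate into practice, here is a minimal sketch, assuming the scikit-image library, of local maxima filtering followed by a watershed segmentation on a grey-scale image; the smoothing radius, minimum distance and Otsu-based foreground mask are arbitrary assumptions, not the exact parameters of the cited works.

```python
import numpy as np
from skimage import filters
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def delineate_crowns(grey: np.ndarray) -> np.ndarray:
    """Return a label image in which each detected crown has its own integer id."""
    smoothed = filters.gaussian(grey, sigma=2)           # reduce within-crown texture
    treetops = peak_local_max(smoothed, min_distance=5)  # locally brightest points used as treetops
    markers = np.zeros(grey.shape, dtype=int)
    markers[tuple(treetops.T)] = np.arange(1, len(treetops) + 1)
    # Flood the inverted image from the treetops: crown frontiers appear
    # where the basins of neighbouring trees meet.
    vegetation = smoothed > filters.threshold_otsu(smoothed)  # rough foreground mask (assumption)
    return watershed(-smoothed, markers, mask=vegetation)
```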
But with the recent developments of deep learning in image analysis, deep learning models are increasingly used to detect trees in RGB images. In some cases, deep learning is used to extract features that then become the input of one of the algorithms described above. One example is the use of two neural networks to predict masks, outlines and distance transforms, which are then fed to a watershed algorithm (Freudenberg, Magdon, and Nölke 2022). In other cases, a deep learning model, often a CNN, is responsible for directly detecting tree masks or bounding boxes from the images (B. G. Weinstein et al. 2020).
1.3.2 LiDAR only
Conversely, some of the methods to identify individual trees use LiDAR data only. There are a lot of different ways to use and analyze point clouds, but the one mostly used for trees is based on height maps, or Canopy Height Models (CHM).
A CHM is a raster computed by subtracting the Digital Terrain Model (DTM) from the Digital Surface Model (DSM). This means that each pixel of a CHM contains the height above ground of the highest point in the corresponding area. The CHM can for example be used as the input raster of the watershed algorithm, as it contains the height values that can be used to determine local maxima (Kwak et al. 2007). The other analytical methods described in the previous section (Section 1.3.1) also have their equivalents using the CHM.
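A minimal sketch of this computation, assuming the rasterio library and DSM/DTM rasters that are already aligned on the same grid; the file names are hypothetical.

```python
import rasterio

# Hypothetical, co-registered rasters (same grid, resolution and extent)
with rasterio.open("dsm.tif") as dsm_src, rasterio.open("dtm.tif") as dtm_src:
    dsm = dsm_src.read(1)
    dtm = dtm_src.read(1)
    profile = dsm_src.profile

chm = dsm - dtm        # height above ground of the highest point per pixel
chm = chm.clip(min=0)  # remove small negative artefacts from interpolation

with rasterio.open("chm.tif", "w", **profile) as dst:
    dst.write(chm, 1)
```

The resulting raster can then directly feed the local maxima or watershed steps sketched in Section 1.3.1.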
A lot of different analytical methods and variations of the simple CHM have been proposed to perform individual tree detection, but in the end, most of them still use the concept of local maxima (Eysn et al. 2015; Wang et al. 2016). A CHM can also be used as the input of any kind of convolutional neural network (CNN), because it has exactly the same shape as an image. This makes it possible to apply a lot of techniques usually used for object detection in images.
Then, even though I finally used an approach similar to the CHM, I want to mention other kinds of deep learning techniques that exist and could potentially leverage all the information contained in a point cloud. These techniques can be divided into two categories: projection-based and point-based methods (Diab, Kashef, and Shaker 2022). The main difference between the two is that projection-based techniques are based on grids, while point-based methods take unstructured point clouds as input. Among projection-based methods, the most basic one is the 2D CNN, which is how a CHM can be processed. Then, multiview representations try to tackle the 3D aspect by projecting the point cloud in multiple directions before merging the projections together. To really deal with 3D data, the volumetric grid representation consists in using 3D occupancy grids, which are processed using 3D CNNs. Among point-based methods, there are methods based on PointNet, which can extract features and perform the classical computer vision tasks by taking point clouds as input. Finally, Convolutional Point Networks use a continuous generalization of convolutions to apply convolution kernels to arbitrarily distributed point clouds.
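As an illustration of the volumetric grid representation mentioned above, here is a minimal sketch, using only numpy, that turns an unstructured point cloud into a binary 3D occupancy grid that a 3D CNN could consume; the voxel size and the random point cloud are arbitrary placeholders.

```python
import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.5) -> np.ndarray:
    """Convert an (N, 3) array of x, y, z coordinates into a binary occupancy grid."""
    origin = points.min(axis=0)
    indices = np.floor((points - origin) / voxel_size).astype(int)
    shape = indices.max(axis=0) + 1
    grid = np.zeros(shape, dtype=np.uint8)
    grid[indices[:, 0], indices[:, 1], indices[:, 2]] = 1  # mark occupied voxels
    return grid

# Hypothetical point cloud of 10 000 points in a 20 m x 20 m x 15 m volume
points = np.random.rand(10_000, 3) * np.array([20.0, 20.0, 15.0])
occupancy = voxelize(points)  # e.g. shape (40, 40, 30) with a 0.5 m voxel size
```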
1.3.3 LiDAR and images
Let’s now talk about the models of interest for this work, which are machine learning pipelines using both LiDAR point cloud data and RGB images.
One example is a pipeline which uses a watershed algorithm to extract crown boundaries, before extracting individual tree features from the LiDAR point cloud, the hyperspectral images and the RGB images (Qin et al. 2022). These features are then fed to a random forest classifier to identify which species each tree belongs to. This pipeline therefore makes the most of all the data to identify species, but sticks to an improved variant of the watershed algorithm, which only uses a CHM raster, for individual tree segmentation.
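A schematic sketch of the classification stage of such a pipeline, assuming per-tree features have already been extracted from the point cloud and the images; the feature matrix and species labels are randomly generated placeholders, not the actual feature set of (Qin et al. 2022).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical per-tree features: height, crown area, mean NIR, point density, ...
features = np.random.rand(500, 6)
species = np.random.randint(0, 4, size=500)  # hypothetical species labels

X_train, X_test, y_train, y_test = train_test_split(features, species, test_size=0.25, random_state=0)
classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(X_train, y_train)
print("held-out accuracy:", classifier.score(X_test, y_test))
```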
Other works focused on end-to-end models able to take both the CHM and the RGB data as input and combine them to make the most of all the available data. Among other examples, there are ACE R-CNN (Li et al. 2022), an evolution of the Mask region-based convolutional neural network (Mask R-CNN), and AMF GD YOLOv8 (Zhong et al. 2024), an evolution of YOLOv8. These two models have been shown to give much better results when using both the images and the LiDAR data as a CHM than when using only one of them.
In this work, I focus on AMF GD YOLOv8, which looks promising and flexible, inheriting the flexibility of YOLOv8 and of the YOLO family of models (Redmon et al. 2016). Indeed, YOLOv8 has different variants allowing it to deal with different computer vision tasks.
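To illustrate this flexibility, here is a minimal sketch assuming the ultralytics package that distributes YOLOv8; the checkpoint names are the standard pretrained ones, the image path is hypothetical, and this only shows the off-the-shelf model, not the AMF GD modifications.

```python
from ultralytics import YOLO

# Different task variants of the same architecture family
detector = YOLO("yolov8n.pt")       # object detection (bounding boxes)
segmenter = YOLO("yolov8n-seg.pt")  # instance segmentation (masks)

results = detector("tile.png")      # hypothetical aerial tile
for box in results[0].boxes:
    print(box.xyxy, box.conf)       # predicted bounding box and confidence
```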