Mapping recreational marine traffic from Sentinel-2 imagery using YOLO object detection models
  1. Data creation
  • Introduction
  • Data creation
  • Converting the annotations to computer vision datasets
  • Inference example

Data creation

Author

Janne Mäyrä

Published

May 19, 2025

Code
import warnings
warnings.filterwarnings("ignore")

import rasterio as rio
import rasterio.plot as rio_plot
import matplotlib.pyplot as plt
import os
from pathlib import Path
import numpy as np
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import shapely
import seaborn as sns
from itertools import product

Satellite data

Mostly cloudless Sentinel-2 mosaics were acquired from the Copernicus Open Access Hub as L1C products and converted to RGB mosaics using color corrections similar to those used in the Tarkka service.
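The conversion from L1C reflectances to 8-bit RGB can be sketched as a simple brightness stretch. The gamma value and reflectance scale below are illustrative assumptions; the actual Tarkka color corrections are not reproduced here.

```python
import numpy as np

def stretch_to_uint8(reflectance, gamma=2.2, scale=10000):
    """Scale L1C digital numbers to [0, 1], apply a gamma stretch,
    and quantize to 8 bits for an RGB mosaic. `gamma` and `scale`
    are illustrative, not the Tarkka service's actual parameters."""
    norm = np.clip(reflectance.astype(np.float32) / scale, 0.0, 1.0)
    return (norm ** (1.0 / gamma) * 255.0).astype(np.uint8)
```

In practice this would be applied to the B04/B03/B02 bands read with rasterio, and the result written out as a 3-band GeoTIFF.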

The following mosaics were used as the training and validation data:

| Location | Product name | Number of annotations |
|---|---|---|
| Archipelago Sea | S2A_MSIL1C_20220515T100031_N0400_R122_T34VEM_20220515T120450 | 183 |
| Archipelago Sea | S2B_MSIL1C_20220619T100029_N0400_R122_T34VEM_20220619T104419 | 519 |
| Archipelago Sea | S2A_MSIL1C_20220721T095041_N0400_R079_T34VEM_20220721T115325 | 1518 |
| Archipelago Sea | S2A_MSIL1C_20220813T095601_N0400_R122_T34VEM_20220813T120233 | 1371 |
| Gulf of Finland | S2B_MSIL1C_20220606T095029_N0400_R079_T35VLG_20220606T105944 | 277 |
| Gulf of Finland | S2B_MSIL1C_20220626T095039_N0400_R079_T35VLG_20220626T104321 | 1205 |
| Gulf of Finland | S2B_MSIL1C_20220703T094039_N0400_R036_T35VLG_20220703T103953 | 746 |
| Gulf of Finland | S2A_MSIL1C_20220721T095041_N0400_R079_T35VLG_20220721T115325 | 971 |
| Bothnian Bay | S2A_MSIL1C_20220627T100611_N0400_R022_T34WFT_20220627T134958 | 122 |
| Bothnian Bay | S2B_MSIL1C_20220712T100559_N0400_R022_T34WFT_20220712T121613 | 162 |
| Bothnian Bay | S2B_MSIL1C_20220828T095549_N0400_R122_T34WFT_20220828T104748 | 98 |

The following mosaics were used as test data:

| Location | Product name | Number of annotations |
|---|---|---|
| Bothnian Sea | S2B_MSIL1C_20210714T100029_N0301_R122_T34VEN_20210714T121056 | 450 |
| Bothnian Sea | S2B_MSIL1C_20220619T100029_N0400_R122_T34VEN_20220619T104419 | 66 |
| Bothnian Sea | S2A_MSIL1C_20220624T100041_N0400_R122_T34VEN_20220624T120211 | 424 |
| Bothnian Sea | S2A_MSIL1C_20220813T095601_N0400_R122_T34VEN_20220813T120233 | 399 |
| Kvarken | S2B_MSIL1C_20210714T100029_N0301_R122_T34VEN_20210714T121056 | 83 |
| Kvarken | S2B_MSIL1C_20220712T100559_N0400_R022_T34VER_20220712T121613 | 184 |
| Kvarken | S2A_MSIL1C_20220826T100611_N0400_R022_T34VER_20220826T135136 | 88 |

Ship and boat data

We manually annotated all visible boats and ships in the RGB satellite mosaics of 18 Sentinel-2 products by drawing a bounding box around each vessel, and saved the annotations as vector layers, one layer per RGB mosaic. These data cover five separate grid tiles, and each tile was annotated on several separate acquisition dates in order to capture more diverse sea conditions and cloud coverage. Images from multiple acquisition dates were also used to identify smaller boats: as these cover only a couple of pixels, identification was done by comparing images from several dates, and if a candidate object was not present in all of them, it was annotated as a boat (Figure 1), provided it covered at least 4 pixels. The reasoning is that stationary features such as rocks appear in the same location in every image, whereas moving vessels do not. The grid tiles were split into train, validation and test sets so that the training and validation set contained three tiles and the test set contained two tiles. The training and validation data were further split into five equal-sized folds.

Figure 1: Example how the annotations were constructed. We compared images from three different acquisition dates in order to detect even the smallest potential boats.
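The 4-pixel minimum can be expressed as a simple area check. This is a sketch assuming the 10 m ground sampling distance of the Sentinel-2 visible bands; the threshold constant and helper name are ours, not part of the annotation tooling.

```python
from shapely.geometry import Polygon

PIXEL_SIZE = 10   # m; ground sampling distance of Sentinel-2 visible bands
MIN_PIXELS = 4    # minimum footprint required for an annotation

def meets_min_size(geom: Polygon, pixel_size: float = PIXEL_SIZE,
                   min_pixels: int = MIN_PIXELS) -> bool:
    """True if an annotation polygon covers at least `min_pixels` pixels."""
    return geom.area >= min_pixels * pixel_size ** 2
```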

As it is easier to leave the id field empty when annotating in QGIS, the label information is added to the dataset here.

Code
image_path = Path('../data/rgb_mosaics/')
shp_path = Path('../dataset')

for t_id in ['34VEM', '35VLG', '34WFT', '34VEN', '34VER']:
    mosaics = [f for f in os.listdir(image_path/t_id) if f.endswith('tif')]
    for t in mosaics:
        ships = gpd.read_file(shp_path/f'{t_id}.gpkg', layer=t.split('.')[0])
        ships['id'] = 'boat'
        ships.to_file(shp_path/f'{t_id}.gpkg', layer=t.split('.')[0])

Train and validation areas

Code
fig, axs = plt.subplots(2,2, dpi=200, figsize=(10,10))
for t, a in zip(['20220515.tif', '20220619.tif','20220721.tif', '20220813.tif'], axs.flatten()):
    a.axis('off')
    with rio.open(image_path/'34VEM'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'34VEM.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()
Figure 2: Archipelago Sea
Code
fig, axs = plt.subplots(2,2, dpi=200, figsize=(10,10))
for t, a in zip(['20220606.tif','20220626.tif', '20220703.tif', '20220721.tif'], axs.flatten()):
    a.axis('off')
    with rio.open(image_path/'35VLG'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'35VLG.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()
Figure 3: Gulf of Finland
Code
fig, axs = plt.subplots(1,3, dpi=200, figsize=(10,5))
for t, a in zip(['20220627.tif','20220712.tif', '20220828.tif'], axs):
    a.axis('off')
    with rio.open(image_path/'34WFT'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'34WFT.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()
Figure 4: Bothnian Bay

Test areas

Code
fig, axs = plt.subplots(2,2, dpi=200, figsize=(10,10))
for t, a in zip(['20210714.tif', '20220619.tif', '20220624.tif', '20220813.tif'], axs.flatten()):
    a.axis('off')
    with rio.open(image_path/'34VEN'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'34VEN.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()
Figure 5: Bothnian Sea
Code
fig, axs = plt.subplots(1,3, dpi=200, figsize=(10,5))
for t, a in zip(['20220617.tif', '20220712.tif', '20220826.tif'], axs):
    a.axis('off')
    with rio.open(image_path/'34VER'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'34VER.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()
Figure 6: Kvarken

Quick data analysis of the annotated ships

First, collate all ships into a single dataframe and project them to EPSG:3067.

Code
all_ships = None

for t_id in ['34VEM', '35VLG', '34WFT', '34VEN', '34VER']:
    mosaics = [f for f in os.listdir(image_path/t_id) if f.endswith('tif')]
    for t in mosaics:
        ships = gpd.read_file(shp_path/f'{t_id}.gpkg', layer=t.split('.')[0])
        ships['tile'] = f'{t_id}_{t[:-4]}'
        if all_ships is None: all_ships = ships.to_crs('epsg:3067')
        else: all_ships = pd.concat([all_ships, ships.to_crs('epsg:3067')])
            
def get_meas(geom, meas='diam'):
    """Measure the minimum rotated rectangle of `geom`:
    its shorter edge, longer edge, or diagonal length."""
    box = geom.minimum_rotated_rectangle
    x, y = box.exterior.coords.xy
    edge_length = (Point(x[0], y[0]).distance(Point(x[1], y[1])),
                   Point(x[1], y[1]).distance(Point(x[2], y[2])))
    if meas == 'min_edge': return min(edge_length)
    elif meas == 'max_edge': return max(edge_length)
    elif meas == 'diam': return np.sqrt(edge_length[0]**2 + edge_length[1]**2)
    
def get_aspect_ratio(geom):
    """Width-to-height ratio of the axis-aligned extent of `geom`.
    As the annotations are axis-aligned boxes, this equals the
    box's own width divided by its height."""
    box = shapely.geometry.polygon.orient(geom).minimum_rotated_rectangle
    x, y = box.exterior.coords.xy
    w = Point(min(x), min(y)).distance(Point(max(x), min(y)))
    h = Point(min(x), min(y)).distance(Point(min(x), max(y)))
    return w/h
    
all_ships['bbox_area'] = all_ships.geometry.area
all_ships['min_edge'] = all_ships.geometry.apply(lambda row: get_meas(row, 'min_edge'))
all_ships['max_edge'] = all_ships.geometry.apply(lambda row: get_meas(row, 'max_edge'))
all_ships['diam'] = all_ships.geometry.apply(lambda row: get_meas(row, 'diam'))
all_ships['aspect_ratio'] = all_ships.geometry.apply(get_aspect_ratio)

Plot histograms for bounding box area, diameter, shorter edge and longer edge (Figure 7).

Code
fig, axs = plt.subplots(2,2, dpi=150, figsize=(12,12))
for a in axs.flatten(): a.grid(True)
sns.histplot(all_ships.bbox_area, log_scale=False, ax=axs[0,0], kde=True).set_title('Bbox area (m²)')
sns.histplot(all_ships.diam, log_scale=False, ax=axs[0,1], kde=True).set_title('Diameter (m)')
sns.histplot(all_ships.min_edge, log_scale=False, ax=axs[1,0], kde=True).set_title('Shorter edge (m)')
sns.histplot(all_ships.max_edge, log_scale=False, ax=axs[1,1], kde=True).set_title('Longer edge (m)')
plt.show()
Figure 7: Histograms for bounding box areas, diameter and edge lengths.

Bounding box area

As most of the annotations also cover most of the marine vessel's wake, the bounding boxes are significantly larger than a typical boat. A few annotations exceed 100 000 m²; these are cruise or cargo ships travelling along ordinal (diagonal) rather than cardinal directions, since an axis-aligned box around a diagonally oriented vessel is much larger than one around the same vessel heading along a coordinate axis.
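The effect of heading on box size can be illustrated with shapely. The 200 m × 30 m footprint below is a made-up example, not taken from the data:

```python
from shapely.geometry import box
from shapely.affinity import rotate

# Hypothetical vessel-plus-wake footprint, heading due east
ship = box(0, 0, 200, 30)
aligned = ship.envelope.area                 # axis-aligned box: 6 000 m^2
# The same footprint heading north-east: the axis-aligned envelope balloons
diagonal = rotate(ship, 45).envelope.area    # roughly 26 450 m^2
```

A diagonal heading more than quadruples the bounding box area of the same vessel.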

Code
all_ships.bbox_area.describe()
count      8866.000000
mean       5167.597207
std       10840.414965
min         567.880229
25%        1567.112875
50%        2266.456435
75%        4970.987639
max      414795.692722
Name: bbox_area, dtype: float64

Diameter

Annotations typically have a diameter of less than 100 meters, and the largest diameters correspond to the same kinds of instances as the largest bounding box areas.

Code
all_ships.diam.describe()
count    8866.000000
mean       91.045323
std        58.957569
min        33.721374
25%        56.718731
50%        68.382853
75%       105.899522
max       913.890413
Name: diam, dtype: float64

Aspect ratio

Aspect ratio measures the ratio of a bounding box's width to its height. Ratios close to 1 (\(10^0\)) indicate near-square boxes, ratios below 1 boxes that are taller than they are wide, and ratios above 1 boxes that are wider than they are tall. Overall the distribution of aspect ratios looks approximately log-normal, and most boxes are square or close to it.

Code
sns.histplot(all_ships.aspect_ratio, log_scale=True, kde=True).set_title('Aspect ratio')
plt.show()
Figure 8: Histogram for the aspect ratio of bounding boxes.

Training and validation data generation

The geo2ml package provides utilities for converting geospatial data into formats accepted by different deep learning frameworks. In this work, YOLOv8 is used as the detection model, so the data must be converted into the YOLO annotation format.
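geo2ml handles the conversion; for reference, a YOLO label file contains one line per object of the form `class x_center y_center width height`, with all coordinates normalized by the image size. A minimal sketch of that conversion (our own helper, not geo2ml's actual API):

```python
def bbox_to_yolo(class_id: int, xmin: float, ymin: float,
                 xmax: float, ymax: float, img_w: int, img_h: int) -> str:
    """Convert a pixel-space bounding box into a YOLO annotation line:
    center coordinates and size, normalized to [0, 1]."""
    xc = (xmin + xmax) / 2 / img_w
    yc = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"
```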

The training data is also split into train and val sets spatially. We do this the following way:

  1. Split the tile area into a 5x5 grid
  2. Mark the data into five folds diagonally
  3. Use the first fold as the validation data

If needed, this scheme also makes 5-fold cross-validation possible.

Code
def make_fold_grid(poly, num_edge, crs) -> gpd.GeoDataFrame:
    """Split `poly` into a `num_edge` x `num_edge` grid and assign the
    cells diagonally into `num_edge` folds."""
    xmin, ymin, xmax, ymax = poly.bounds
    dx = (xmax-xmin)/num_edge  # true division so the grid covers the full extent
    dy = (ymax-ymin)/num_edge
    polys = []
    folds = []
    for x, y in product(range(num_edge), range(num_edge)):
        xm = xmin+x*dx
        ym = ymin+y*dy
        polys.append(shapely.geometry.box(xm, ym, xm+dx, ym+dy))
        folds.append(f'fold_{(x-y)%num_edge + 1}')
    return gpd.GeoDataFrame(data={'fold': folds, 'geometry': polys}, crs=crs)
Code
with rio.open(image_path/'34VEM/20220619.tif') as src:
    bounds = src.bounds
    crs = src.crs
Code
grid = make_fold_grid(shapely.geometry.box(*bounds), 5, crs)
grid['val'] = grid.fold.apply(lambda row: 1 if row == 'fold_1' else 0)
grid.plot(column='val', edgecolor='white')
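Each annotation can then be assigned to a fold with a spatial join. Below is a sketch using representative points so that boxes straddling cell borders land in exactly one fold; the helper is ours, and the actual assignment lives in scripts/make_yolo_data.py.

```python
import geopandas as gpd

def assign_folds(ships: gpd.GeoDataFrame, grid: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Attach a 'fold' column to ship annotations based on the fold grid."""
    pts = ships.copy()
    # Representative points guarantee each box matches exactly one grid cell
    pts['geometry'] = pts.geometry.representative_point()
    joined = gpd.sjoin(pts, grid[['fold', 'geometry']],
                       how='left', predicate='within')
    out = ships.copy()
    out['fold'] = joined['fold'].values
    return out
```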

The full data generation process can be found in scripts/make_yolo_data.py.
