Data creation

Author

Janne Mäyrä

Published

May 19, 2025

Code

import warnings
warnings.filterwarnings("ignore")

import rasterio as rio
import rasterio.plot as rio_plot
import matplotlib.pyplot as plt
import os
from pathlib import Path
import numpy as np
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import shapely
import seaborn as sns
from itertools import product

Satellite data

Mostly cloudless Sentinel 2 mosaics were acquired from Copernicus Open Access Hub as L1C products, which were then converted to RGB mosaics with similar color corrections that are used in Tarkka service.

The following mosaics were used as the training and validation data:

Location	Product name	Number of annotations
Archipelago sea	S2A_MSIL1C_20220515T100031_N0400_R122_T34VEM_20220515T120450	183
Archipelago sea	S2B_MSIL1C_20220619T100029_N0400_R122_T34VEM_20220619T104419	519
Archipelago sea	S2A_MSIL1C_20220721T095041_N0400_R079_T34VEM_20220721T115325	1518
Archipelago sea	S2A_MSIL1C_20220813T095601_N0400_R122_T34VEM_20220813T120233	1371
Gulf of Finland	S2B_MSIL1C_20220606T095029_N0400_R079_T35VLG_20220606T105944	277
Gulf of Finland	S2B_MSIL1C_20220626T095039_N0400_R079_T35VLG_20220626T104321	1205
Gulf of Finland	S2B_MSIL1C_20220703T094039_N0400_R036_T35VLG_20220703T103953	746
Gulf of Finland	S2A_MSIL1C_20220721T095041_N0400_R079_T35VLG_20220721T115325	971
Bothnian Bay	S2A_MSIL1C_20220627T100611_N0400_R022_T34WFT_20220627T134958	122
Bothnian Bay	S2B_MSIL1C_20220712T100559_N0400_R022_T34WFT_20220712T121613	162
Bothnian Bay	S2B_MSIL1C_20220828T095549_N0400_R122_T34WFT_20220828T104748	98

The following mosaics were used as test data:

Location	Product name	Number of annotations
Bothnian sea	S2B_MSIL1C_20210714T100029_N0301_R122_T34VEN_20210714T121056	450
Bothnian sea	S2B_MSIL1C_20220619T100029_N0400_R122_T34VEN_20220619T104419	66
Bothnian sea	S2A_MSIL1C_20220624T100041_N0400_R122_T34VEN_20220624T120211	424
Bothnian sea	S2A_MSIL1C_20220813T095601_N0400_R122_T34VEN_20220813T120233	399
Kvarken	S2B_MSIL1C_20210714T100029_N0301_R122_T34VEN_20210714T121056	83
Kvarken	S2B_MSIL1C_20220712T100559_N0400_R022_T34VER_20220712T121613	184
Kvarken	S2A_MSIL1C_20220826T100611_N0400_R022_T34VER_20220826T135136	88

Ship and boat data

We manually annotated all visible boats and ships from the RGB satellite mosaics from 18 Sentinel 2 products by drawing a bounding box around them, and saved them into shapefiles so that each shapefile corresponded to a single RGB mosaic. These data cover five separate grid tiles, and each grid tile was annotated from three separate time periods in order to capture more diverse conditions in the sea and different cloud coverage. Images from multiple acquisition dates were also used to identify smaller boats: As these boats only cover a couple of pixels, the identification was done by comparing several images and if a potential boat was not present in all images, it was annotated as a boat (Figure 1), as long as it covered at least 4 pixels. The grid tiles were split into train, validation and test sets so that training and validation set contained 3 tiles and test set contained 2 tiles. The training and validation sets were further split into five equal sized folds.

Figure 1: Example how the annotations were constructed. We compared images from three different acquisition dates in order to detect even the smallest potential boats.

As it is easier to just leave id empty in QGIS, add the label information to dataset.

Code

image_path = Path('../data/rgb_mosaics/')
shp_path = Path('../dataset')

for t_id in ['34VEM', '35VLG', '34WFT', '34VEN', '34VER']:
    mosaics = [f for f in os.listdir(image_path/t_id) if f.endswith('tif')]
    for t in mosaics:
        ships = gpd.read_file(shp_path/f'{t_id}.gpkg', layer=t.split('.')[0])
        ships['id'] = 'boat'
        ships.to_file(shp_path/f'{t_id}.gpkg', layer=t.split('.')[0])

Train and validation areas

Code

fig, axs = plt.subplots(2,2, dpi=200, figsize=(10,10))
for t, a in zip(['20220515.tif', '20220619.tif','20220721.tif', '20220813.tif'], axs.flatten()):
    a.axis('off')
    with rio.open(image_path/'34VEM'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'34VEM.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()

Code

fig, axs = plt.subplots(2,2, dpi=200, figsize=(10,10))
for t, a in zip(['20220606.tif','20220626.tif', '20220703.tif', '20220721.tif'], axs.flatten()):
    a.axis('off')
    with rio.open(image_path/'35VLG'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'35VLG.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()

Code

fig, axs = plt.subplots(1,3, dpi=200, figsize=(10,5))
for t, a in zip(['20220627.tif','20220712.tif', '20220828.tif'], axs):
    a.axis('off')
    with rio.open(image_path/'34WFT'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'34WFT.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()

Test areas

Code

fig, axs = plt.subplots(2,2, dpi=200, figsize=(10,10))
for t, a in zip(['20210714.tif', '20220619.tif', '20220624.tif', '20220813.tif'], axs.flatten()):
    a.axis('off')
    with rio.open(image_path/'34VEN'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'34VEN.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()

Code

fig, axs = plt.subplots(1,3, dpi=200, figsize=(10,5))
for t, a in zip(['20220617.tif', '20220712.tif', '20220826.tif'], axs):
    a.axis('off')
    with rio.open(image_path/'34VER'/t) as src:
        rio_plot.show(src, ax=a)
    ships = gpd.read_file(shp_path/'34VER.gpkg', layer=t.split('.')[0])
    ships['geometry'] = ships.geometry.centroid
    ships.plot(ax=a, color='red', markersize=.5)
    n_ships = len(ships)
    ships = None
    a.set_title(f'{t[:-4]}: {n_ships} vessels')
plt.tight_layout()
plt.show()

Quick data analysis for the annotated ships

First collate all ships into a single dataframe and project them to EPSG:3067.

Code

all_ships = None

for t_id in ['34VEM', '35VLG', '34WFT', '34VEN', '34VER']:
    mosaics = [f for f in os.listdir(image_path/t_id) if f.endswith('tif')]
    for t in mosaics:
        ships = gpd.read_file(shp_path/f'{t_id}.gpkg', layer=t.split('.')[0])
        ships['tile'] = f'{t_id}_{t[:-4]}'
        if all_ships is None: all_ships = ships.to_crs('epsg:3067')
        else: all_ships = pd.concat([all_ships, ships.to_crs('epsg:3067')])
            
def get_meas(geom, meas='diam'):
    box = geom.minimum_rotated_rectangle
    x, y = box.exterior.coords.xy
    edge_length = (Point(x[0], y[0]).distance(Point(x[1], y[1])), Point(x[1], y[1]).distance(Point(x[2], y[2])))
    if meas == 'min_edge': return min(edge_length)
    elif meas == 'max_edge': return max(edge_length)
    elif meas == 'diam': return np.sqrt(edge_length[0]**2 + edge_length[1]**2)
    
def get_aspect_ratio(geom):
    box = shapely.geometry.polygon.orient(geom).minimum_rotated_rectangle
    x, y = box.exterior.coords.xy
    w = Point(min(x), min(y)).distance(Point(max(x), min(y)))
    h = Point(min(x), min(y)).distance(Point(min(x), max(y)))
    return w/h
    
all_ships['bbox_area'] = all_ships.geometry.area
all_ships['min_edge'] = all_ships.geometry.apply(lambda row: get_meas(row, 'min_edge'))
all_ships['max_edge'] = all_ships.geometry.apply(lambda row: get_meas(row, 'max_edge'))
all_ships['diam'] = all_ships.geometry.apply(lambda row: get_meas(row, 'diam'))
all_ships['aspect_ratio'] = all_ships.geometry.apply(get_aspect_ratio)

Plot histograms for bounding box area, diameter, shorter edge and longer edge (Figure 7).

Code

fig, axs = plt.subplots(2,2, dpi=150, figsize=(12,12))
for a in axs.flatten(): a.grid(True)
sns.histplot(all_ships.bbox_area, log_scale=False, ax=axs[0,0], kde=True).set_title('Bbox area (m²)')
sns.histplot(all_ships.diam, log_scale=False, ax=axs[0,1], kde=True).set_title('Diameter (m)')
sns.histplot(all_ships.min_edge, log_scale=False, ax=axs[1,0], kde=True).set_title('Shorter edge (m)')
sns.histplot(all_ships.max_edge, log_scale=False, ax=axs[1,1], kde=True).set_title('Longer edge (m)')
plt.show()

Bounding box area

As most of the annotations cover also most of the wake of the marine vessel, the bounding boxes are significantly larger than a typical boat. There are a few annotations larger than 100 000 m², which are either cruise or cargo ships that are travelling along ordinal directions instead of cardinal directions, instead of e.g. smaller leisure boats.

Code

all_ships.bbox_area.describe()

count      8866.000000
mean       5167.597207
std       10840.414965
min         567.880229
25%        1567.112875
50%        2266.456435
75%        4970.987639
max      414795.692722
Name: bbox_area, dtype: float64

Diameter

Annotations typically have diameter less than 100 meters, and the largest diameters correspond to similar instances than the largest bounding box areas.

Code

all_ships.diam.describe()

count    8866.000000
mean       91.045323
std        58.957569
min        33.721374
25%        56.718731
50%        68.382853
75%       105.899522
max       913.890413
Name: diam, dtype: float64

Aspect ratio

Aspect ratio measures the ratio of the bounding boxes width to its height. Ratios close to 1 (\(10^0\)) mean near-square bounding boxes, ratios less than 1 correspond to boxes whose height is larger than width, and ratios larger than 1 correspond to boxes with height less than width. Overall the distribution of aspect ratios looks quite normal, and most boxes are either square or close to it.

Code

sns.histplot(all_ships.aspect_ratio, log_scale=True, kde=True).set_title('Aspect ratio')
plt.show()

Figure 8: Histogram for the aspect ratio of bounding boxes.

Training and validation data generation

geo2ml package provides utilities for converting geospatial data to formats accepted by different deep learning frameworks. In this work, YOLOv8 is used as the detection model, so the data must be converted to YOLO format.

Training data is also split train and val sets spatially. We do this the following way:

Split the tile area into a 5x5 grid
Mark the data into five folds diagonally
Use the first fold as the validation data

If needed, this also makes it possible to do 5-fold CV.

Code

def make_fold_grid(poly, num_edge, crs) -> gpd.GeoDataFrame:
    xmin, ymin, xmax, ymax = poly.bounds
    dx = (xmax-xmin)//num_edge
    dy = (ymax-ymin)//num_edge
    polys = []
    folds = []
    for x, y in product(range(0, num_edge), range(0, num_edge)):
        xm = xmin+x*dx
        ym = ymin+y*dy
        polys.append(shapely.geometry.box(*(xm, ym, xm+dx, ym+dy)))
        folds.append(f'fold_{(x-y)%num_edge + 1}')
    return gpd.GeoDataFrame(data={'fold': folds, 'geometry': polys}, crs=crs)

Code

with rio.open(image_path/'34VEM/20220619.tif') as src:
    bounds = src.bounds
    crs = src.crs

Code

grid = make_fold_grid(shapely.geometry.box(*bounds), 5, crs)
grid['val'] = grid.fold.apply(lambda row: 1 if row == 'fold_1' else 0)
grid.plot(column='val', edgecolor='white')

The full data generation process can be found in scripts/make_yolo_data.py.