geo2ml

Tabular data

Utilities for processing remote sensing image data into tabular format.

from pathlib import Path

example_data_path = Path('example_data')
example_points = example_data_path/'points/points.geojson'
example_polys = example_data_path/'polygons/polygons.geojson'
example_raster = example_data_path/'s2_lataseno_ex.tif'

Data conversion and processing

Utility functions to process dataframes.


array_to_longform

 array_to_longform (a:pandas.core.frame.DataFrame, columns:list)

Convert the pd.DataFrame a to a long-form array.
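To make "long-form" concrete, here is a minimal sketch built on pandas' own melt. It assumes that long-form means one output row per (row index, column, value) triple; it is not the implementation of array_to_longform, and the toy dataframe and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per sample, one column per band.
wide = pd.DataFrame({"band_0": [334, 183], "band_1": [591, 359]})

# Keep the row index as an identifier, then stack the band columns into
# (band, value) pairs so that every original cell becomes one row.
long = wide.reset_index().melt(
    id_vars="index",
    value_vars=["band_0", "band_1"],
    var_name="band",
    value_name="value",
)
print(long)
```

The result has four rows (two samples times two bands) and the columns index, band and value.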


drop_small_classes

 drop_small_classes (df:pandas.core.frame.DataFrame, min_class_size:int,
                     target_column:str|int=0)

Drop rows from the dataframe if their target_column value has fewer instances than min_class_size.

Generate random data to test.

import numpy as np
import pandas as pd
ex_df = pd.DataFrame({'label': np.random.randint(1, 10, 200)})
ex_df.label.value_counts()
label
4    32
3    23
2    22
9    21
1    21
6    21
5    21
8    21
7    18
Name: count, dtype: int64

The target column can be specified either as a string or as an int. If it is not provided, it defaults to the first column.

filtered = drop_small_classes(ex_df, 20, 'label')
assert filtered.label.value_counts().min() >= 20
filtered.label.value_counts()
label
4    32
3    23
2    22
9    21
1    21
6    21
5    21
8    21
Name: count, dtype: int64

If target_column is not specified, it defaults to the first column.

filtered = drop_small_classes(ex_df, 20)
filtered.label.value_counts()
label
4    32
3    23
2    22
9    21
1    21
6    21
5    21
8    21
Name: count, dtype: int64
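The filtering shown above can be reproduced with plain pandas. The following is a minimal sketch of one way to do it, not the library's actual implementation; the function name drop_small_classes_sketch is hypothetical, and treating an int target_column as a column position is an assumption.

```python
import pandas as pd

def drop_small_classes_sketch(df, min_class_size, target_column=0):
    """Keep only rows whose class has at least `min_class_size` members."""
    # Assumption: an int selects a column by position, a str by name.
    if isinstance(target_column, int):
        target_column = df.columns[target_column]
    # Map every row to the size of its class, then filter on that size.
    class_sizes = df[target_column].map(df[target_column].value_counts())
    return df[class_sizes >= min_class_size]

# Hypothetical data: class 'b' has only two rows and should be dropped.
demo = pd.DataFrame({"label": ["a"] * 5 + ["b"] * 2 + ["c"] * 3})
kept = drop_small_classes_sketch(demo, 3)
print(kept.label.value_counts())
```

Mapping value_counts back onto the column keeps the original row order and index, which a groupby-and-concatenate approach would not necessarily preserve.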

Sampling utilities

These functions enable sampling of raster values using either point or polygon features.


sample_raster_with_points

 sample_raster_with_points (sampling_locations:pathlib.Path,
                            input_raster:pathlib.Path, target_column:str,
                            gpkg_layer:str=None,
                            band_names:list[str]=None,
                            rename_target:str=None)

Extract values from input_raster using points from sampling_locations. Returns a gpd.GeoDataFrame with the columns target_column and geometry, plus one column per band.

sample_raster_with_points is a utility for sampling point values from a raster and returning the results as a gpd.GeoDataFrame.

out_gdf = sample_raster_with_points(example_points, example_raster, 'id')
out_gdf.head()
      id                        geometry  band_0  band_1  band_2  band_3  band_4  band_5  band_6  band_7  band_8
0   10.0  POINT (311760.599 7604880.391)     334     591     439    1204    2651    3072    3177    2070    1046
1   14.0  POINT (312667.464 7605426.442)     183     359     282     759    1742    2002    2037    1392     669
2  143.0  POINT (313619.160 7604550.762)     281     478     427     900    1976    2315    2423    2139    1069
3  172.0  POINT (311989.967 7605411.190)     287     530     393    1078    2446    2761    2978    1949     950
4  224.0  POINT (313386.009 7604304.917)     204     379     327     753    1524    1747    1771    1322     663

It is also possible to provide band_names to rename the columns.

band_names = ['blue', 'green', 'red', 'red_edge1', 'red_edge2', 'nir', 'narrow_nir', 'swir1', 'swir2']
out_gdf = sample_raster_with_points(example_points, example_raster, 'id', band_names=band_names)
out_gdf.head()
      id                        geometry  blue  green  red  red_edge1  red_edge2   nir  narrow_nir  swir1  swir2
0   10.0  POINT (311760.599 7604880.391)   334    591  439       1204       2651  3072        3177   2070   1046
1   14.0  POINT (312667.464 7605426.442)   183    359  282        759       1742  2002        2037   1392    669
2  143.0  POINT (313619.160 7604550.762)   281    478  427        900       1976  2315        2423   2139   1069
3  172.0  POINT (311989.967 7605411.190)   287    530  393       1078       2446  2761        2978   1949    950
4  224.0  POINT (313386.009 7604304.917)   204    379  327        753       1524  1747        1771   1322    663

The target column can also be renamed with rename_target.

out_gdf = sample_raster_with_points(example_points, example_raster, 'id', rename_target='target')
out_gdf.head()
   target                        geometry  band_0  band_1  band_2  band_3  band_4  band_5  band_6  band_7  band_8
0    10.0  POINT (311760.599 7604880.391)     334     591     439    1204    2651    3072    3177    2070    1046
1    14.0  POINT (312667.464 7605426.442)     183     359     282     759    1742    2002    2037    1392     669
2   143.0  POINT (313619.160 7604550.762)     281     478     427     900    1976    2315    2423    2139    1069
3   172.0  POINT (311989.967 7605411.190)     287     530     393    1078    2446    2761    2978    1949     950
4   224.0  POINT (313386.009 7604304.917)     204     379     327     753    1524    1747    1771    1322     663
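For intuition about what sampling a raster at points involves, the sketch below converts map coordinates to pixel indices and reads every band at those indices. It uses only NumPy and assumes a north-up raster with square pixels; the function sample_at_points, its parameters and the toy raster are hypothetical, and this is not how sample_raster_with_points is implemented.

```python
import numpy as np

def sample_at_points(raster, origin_x, origin_y, pixel_size, points):
    """Return one row of band values per (x, y) point.

    raster: array shaped (bands, rows, cols), north-up, square pixels.
    origin_x, origin_y: map coordinates of the raster's top-left corner.
    """
    xs, ys = np.asarray(points, dtype=float).T
    # Columns grow eastwards and rows grow southwards from the top-left corner.
    cols = np.floor((xs - origin_x) / pixel_size).astype(int)
    rows = np.floor((origin_y - ys) / pixel_size).astype(int)
    # Index all bands at once, then transpose to (n_points, bands).
    return raster[:, rows, cols].T

# Hypothetical 2-band, 3x3 raster with 10 m pixels and top-left corner at (0, 30).
raster = np.arange(18).reshape(2, 3, 3)
values = sample_at_points(raster, 0.0, 30.0, 10.0, [(5.0, 25.0), (25.0, 5.0)])
print(values)
```

The sketch does no bounds checking, so points outside the raster extent would need to be filtered out first.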

sample_raster_with_polygons

 sample_raster_with_polygons (sampling_locations:pathlib.Path,
                              input_raster:pathlib.Path,
                              target_column:str=None, gpkg_layer:str=None,
                              band_names:list[str]=None,
                              rename_target:str=None,
                              stats:list[str]=['min', 'max', 'mean',
                              'count'], categorical:bool=False)

Extract values for all bands of input_raster using polygons from sampling_locations, via rasterstats.zonal_stats.

The example polygons are the previous points buffered by 40 meters.

out_gdf = sample_raster_with_polygons(example_polys, example_raster, 'id')
out_gdf.iloc[0]
id                                                           10.0
geometry        MULTIPOLYGON (((311800.59915342694 7604880.390...
band_0_min                                                  266.0
band_0_max                                                  415.0
band_0_mean                                            335.708333
band_0_count                                                   48
band_1_min                                                  351.0
band_1_max                                                  696.0
band_1_mean                                              582.1875
band_1_count                                                   48
band_2_min                                                  412.0
band_2_max                                                  699.0
band_2_mean                                            524.520833
band_2_count                                                   48
band_3_min                                                  885.0
band_3_max                                                 1462.0
band_3_mean                                              1237.125
band_3_count                                                   48
band_4_min                                                 1310.0
band_4_max                                                 2888.0
band_4_mean                                           2479.291667
band_4_count                                                   48
band_5_min                                                 1565.0
band_5_max                                                 3317.0
band_5_mean                                               2880.25
band_5_count                                                   48
band_6_min                                                 1579.0
band_6_max                                                 3665.0
band_6_mean                                           3127.166667
band_6_count                                                   48
band_7_min                                                 1860.0
band_7_max                                                 2214.0
band_7_mean                                           2076.895833
band_7_count                                                   48
band_8_min                                                 1024.0
band_8_max                                                 1144.0
band_8_mean                                              1075.375
band_8_count                                                   48
Name: 0, dtype: object
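The band_N_stat column layout above can be illustrated with a small NumPy sketch of zonal statistics. It assumes the polygon has already been rasterised into a boolean mask, a step rasterstats handles internally and that is omitted here; the function zonal_stats_sketch and the toy arrays are hypothetical and do not come from geo2ml or rasterstats.

```python
import numpy as np

def zonal_stats_sketch(raster, mask, stats=("min", "max", "mean", "count")):
    """Summarise each band over the pixels where `mask` is True."""
    funcs = {"min": np.min, "max": np.max, "mean": np.mean, "count": np.size}
    out = {}
    for band_idx, band in enumerate(raster):
        inside = band[mask]  # 1-D array of the pixels inside the zone
        for stat in stats:
            value = funcs[stat](inside)
            # Counts stay integers, every other statistic becomes a float.
            out[f"band_{band_idx}_{stat}"] = int(value) if stat == "count" else float(value)
    return out

# Hypothetical 2-band, 2x2 raster; the zone covers three of the four pixels.
raster = np.array([[[1, 2], [3, 4]], [[10, 20], [30, 40]]])
mask = np.array([[True, True], [False, True]])
row = zonal_stats_sketch(raster, mask)
print(row)
```

Each band contributes one key per requested statistic, which mirrors the one-column-per-band-and-statistic layout of the output above.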

Because sample_raster_with_polygons uses rasterstats.zonal_stats, any statistic that rasterstats supports can be requested with the stats parameter. See the rasterstats documentation for the full list.

out_gdf = sample_raster_with_polygons(example_polys, example_raster, 'id', stats=['min', 'max', 'sum', 'median', 'range'])
out_gdf.iloc[0]
id                                                            10.0
geometry         MULTIPOLYGON (((311800.59915342694 7604880.390...
band_0_min                                                   266.0
band_0_max                                                   415.0
band_0_sum                                                 16114.0
band_0_median                                                338.0
band_0_range                                                 149.0
band_1_min                                                   351.0
band_1_max                                                   696.0
band_1_sum                                                 27945.0
band_1_median                                                590.0
band_1_range                                                 345.0
band_2_min                                                   412.0
band_2_max                                                   699.0
band_2_sum                                                 25177.0
band_2_median                                                524.0
band_2_range                                                 287.0
band_3_min                                                   885.0
band_3_max                                                  1462.0
band_3_sum                                                 59382.0
band_3_median                                               1255.5
band_3_range                                                 577.0
band_4_min                                                  1310.0
band_4_max                                                  2888.0
band_4_sum                                                119006.0
band_4_median                                               2625.5
band_4_range                                                1578.0
band_5_min                                                  1565.0
band_5_max                                                  3317.0
band_5_sum                                                138252.0
band_5_median                                               3020.5
band_5_range                                                1752.0
band_6_min                                                  1579.0
band_6_max                                                  3665.0
band_6_sum                                                150104.0
band_6_median                                               3270.5
band_6_range                                                2086.0
band_7_min                                                  1860.0
band_7_max                                                  2214.0
band_7_sum                                                 99691.0
band_7_median                                               2087.0
band_7_range                                                 354.0
band_8_min                                                  1024.0
band_8_max                                                  1144.0
band_8_sum                                                 51618.0
band_8_median                                               1070.0
band_8_range                                                 120.0
Name: 0, dtype: object
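The two outputs for the first polygon can be cross-checked against each other: the mean should equal sum divided by count, and the range should equal max minus min. The short check below uses values copied from the outputs above for the first three bands; the variable names are only illustrative.

```python
# Values copied from the outputs above for the first polygon (id 10.0).
count = 48
sums = {"band_0": 16114.0, "band_1": 27945.0, "band_2": 25177.0}
means = {"band_0": 335.708333, "band_1": 582.1875, "band_2": 524.520833}
mins = {"band_0": 266.0, "band_1": 351.0, "band_2": 412.0}
maxs = {"band_0": 415.0, "band_1": 696.0, "band_2": 699.0}
ranges = {"band_0": 149.0, "band_1": 345.0, "band_2": 287.0}

# Recompute the derived statistics from the raw ones.
derived_means = {band: sums[band] / count for band in sums}
derived_ranges = {band: maxs[band] - mins[band] for band in maxs}

for band in sums:
    # The printed means are rounded to six decimals, so allow a small tolerance.
    assert abs(derived_means[band] - means[band]) < 1e-6
    assert derived_ranges[band] == ranges[band]
```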