Commit 78e1f88

Merge pull request #1 from BioDT/species_mod
Final touches
2 parents f98b06d + f5cf5ca commit 78e1f88

26 files changed

Lines changed: 2973 additions & 81 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 slurm-*.out
+git-lfs/
 
 ### Python ###
 # Byte-compiled / optimized / DLL files

.pre-commit-config.yaml

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ repos:
       - id: end-of-file-fixer
       - id: check-yaml
       - id: check-json
+        exclude_types: [jupyter]
      - id: check-added-large-files
 
  - repo: https://github.com/psf/black

README.md

Lines changed: 37 additions & 2 deletions
@@ -3,7 +3,8 @@
 
 
 ## Description
-This repository contains the code used to engineer **BioCube: A Multimodal Dataset for Biodiversity**. The paper accompanying this repository can be found at: TBA
+This repository contains the code used to engineer **BioCube: A Multimodal Dataset for Biodiversity Research**. The produced dataset can be found on the [BioCube's Hugging Face](https://huggingface.co/datasets/BioDT/BioCube) page, with detailed descriptions of the modalities it contains.
+
 
 This codebase offers the below core functionalities:
 - Download
@@ -61,8 +62,10 @@ At this point, we can select any kind of modalities and slice them for specific
 
 ![Alt text](img/data_batch.png "Data Batch Description")
 
+Creating **Batches** can be done in two settings based on the sampling frequency (daily or monthly) and requires that you have downloaded BioCube and set up the path variables appropriately.
 
-To create Batches, just call the function:
+### Daily
+To create daily Batches, just call the function:
 
 ```python
 create_dataset(
@@ -76,6 +79,24 @@ create_dataset(
 )
 ```
 
+### Monthly
+
+To download BioCube and create monthly Batches, just run the script below:
+```bash
+bfm_data/dataset_creation/batch_creation/create_batches.sh
+```
+Or follow the step-by-step workflow:
+```bash
+# First run
+python bfm_data/dataset_creation/batch_creation/scan_biocube.py --root biocube_data/data --out catalog_report.parquet
+# Then run
+python bfm_data/dataset_creation/batch_creation/build_batches_monthly.py
+```
+
+You can inspect the created Batches with `streamlit run batch_viewer.py --data_dir ./batches`; the viewer is located in the same folder as the previous scripts.
+
+To produce statistics from the Batches that can be used for downstream tasks (e.g. normalization), just run `python batch_stats.py --batch_dir batches --out batches_stats.json`.
+
 ## Storage
 
 `Data` folder contains raw data.
@@ -99,6 +120,20 @@ This publication is part of the project Biodiversity Foundation Model of the res
 This work used the Dutch national e-infrastructure with the support of the SURF Cooperative using grant no. EINF-10148*
 
 
+## Citation
+
+If you find our work useful, please consider citing us!
+
+```
+@article{stasinos2025biocube,
+  title={BioCube: A Multimodal Dataset for Biodiversity Research},
+  author={Stasinos, Stylianos and Mensio, Martino and Lazovik, Elena and Trantas, Athanasios},
+  journal={arXiv preprint arXiv:2505.11568},
+  year={2025}
+}
+```
+
+
 ## Useful commands
 
 Copy files between clusters: cluster_1=a cluster , cluster_2=SURF Snellius
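
The README hunk above adds a `batch_stats.py` step whose output feeds normalization. The schema of `batches_stats.json` isn't shown in this commit; a minimal consumer sketch, assuming the file maps each variable name to its `mean` and `std` (that layout is an assumption):

```python
import json

import numpy as np

# Assumed layout (not shown in this commit):
# {"ndvi": {"mean": 0.42, "std": 0.18}, ...}
with open("batches_stats.json") as f:
    stats = json.load(f)

def normalize(values: np.ndarray, variable: str) -> np.ndarray:
    """Standardize one variable using the precomputed batch statistics."""
    return (values - stats[variable]["mean"]) / stats[variable]["std"]

print(normalize(np.array([0.3, 0.6]), "ndvi"))
```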

bfm_data/data_ingestion/ingestion_scripts/copernicus_land.py

Lines changed: 122 additions & 55 deletions
@@ -3,11 +3,13 @@
 import csv
 import os
 from concurrent.futures import ThreadPoolExecutor
-
+import geopandas as gpd
 import netCDF4 as nc
 import numpy as np
+import pandas as pd
 import requests
 import reverse_geocode
+from shapely.geometry import Point
 
 from bfm_data.config import paths
 from bfm_data.utils.geo import (
@@ -63,6 +65,10 @@ def load_region_bounding_boxes(self, region: str):
             region (str): The region name (e.g., 'Europe', 'Latin America').
         """
         _, iso_codes = get_countries_by_continent(region)
+
+        if 'CY' not in iso_codes:
+            iso_codes.append('CY')
+
         self.country_rectangles = get_bounding_boxes_for_countries(iso_codes)
 
     def filter_11th_day_files(self, file_urls: list) -> list:
@@ -140,61 +146,122 @@ def extract_ndvi_locations(self, nc_file_path: str):
            lon_points = np.arange(-180, 180 + 0.25, 0.25)
 
            if self.global_mode:
-                ndvi_data = {}
-
-                for lat_val in lat_points:
-                    for lon_val in lon_points:
-                        i = np.abs(lat - lat_val).argmin()
-                        j = np.abs(lon - lon_val).argmin()
-                        ndvi_value = ndvi[i, j]
-
-                        if ndvi_value != 255 and ndvi_value > self.ndvi_threshold:
-                            transformed_lon = lon_val if lon_val >= 0 else lon_val + 360
-                            coord = (lat_val, transformed_lon)
-                            country = reverse_geocode.get(coord)["country"]
-                            if country not in ndvi_data:
-                                ndvi_data[country] = []
-                            ndvi_data[country].append(
-                                (lat_val, transformed_lon, ndvi_value)
-                            )
+                world_gdf = gpd.read_file("/projects/prjs1134/data/projects/biodt/storage/geoBoundaries/geoBoundaries CGAZ ADM0.geojson").set_crs("EPSG:4326")
+                world_gdf["shapeName"] = world_gdf["shapeName"].str.strip()
+
+                lat_points = np.arange(-90, 90.25, 0.1)
+                lon_points = np.arange(-180, 180.25, 0.1)
+
+                grid_points = [Point(lon, lat) for lat in lat_points for lon in lon_points]
+                grid_df = pd.DataFrame({
+                    "Latitude": [pt.y for pt in grid_points],
+                    "Longitude": [pt.x for pt in grid_points],
+                    "geometry": grid_points
+                })
+                grid_gdf = gpd.GeoDataFrame(grid_df, geometry="geometry", crs="EPSG:4326")
+
+                joined = gpd.sjoin(grid_gdf, world_gdf[["geometry", "shapeName"]], predicate="within", how="inner")
+
+                def snap_to_grid(x, res=0.25):
+                    return np.round(x / res) * res
+
+                ndvi_data = {country: [] for country in joined["shapeName"].unique()}
+                tmp_country_data = {country: [] for country in joined["shapeName"].unique()}
+
+                for _, row in joined.iterrows():
+                    lat_val = row["Latitude"]
+                    lon_val = row["Longitude"]
+                    country = row["shapeName"]
+
+                    i = np.abs(lat - lat_val).argmin()
+                    j = np.abs(lon - lon_val).argmin()
+                    ndvi_value = ndvi[i, j]
+
+                    if ndvi_value != 255 and ndvi_value > self.ndvi_threshold:
+                        lat_025 = snap_to_grid(lat_val, 0.25)
+                        lon_025 = snap_to_grid(lon_val, 0.25)
+                        lon_025 = lon_025 if lon_025 >= 0 else lon_025 + 360
+                        tmp_country_data[country].append((lat_025, lon_025, ndvi_value))
+
+                for country, values in tmp_country_data.items():
+                    if values:
+                        df = pd.DataFrame(values, columns=["lat", "lon", "ndvi"])
+                        df = df.groupby(["lat", "lon"], as_index=False)["ndvi"].mean()
+                        ndvi_data[country] = list(df.itertuples(index=False, name=None))
+
+                return month_year, ndvi_data
 
            else:
-                ndvi_data = {
-                    get_country_name_from_iso(country): []
-                    for country in self.country_rectangles.keys()
+                world_gdf = gpd.read_file("/projects/prjs1134/data/projects/biodt/storage/geoBoundaries/geoBoundaries CGAZ ADM0.geojson").set_crs("EPSG:4326")
+                world_gdf["shapeName"] = world_gdf["shapeName"].str.strip()
+
+                target_names = [get_country_name_from_iso(code).strip() for code in self.country_rectangles]
+                if "Cyprus" not in target_names:
+                    target_names.append("Cyprus")
+
+                region_gdf = world_gdf[world_gdf["shapeName"].isin(target_names)].reset_index(drop=True)
+
+                if "Cyprus" not in region_gdf["shapeName"].values:
+                    cyprus_row = world_gdf[world_gdf["shapeName"] == "Cyprus"]
+                    if not cyprus_row.empty:
+                        region_gdf = pd.concat([region_gdf, cyprus_row], ignore_index=True)
+
+                manual_countries = {
+                    "Bosnia and Herzegovina": "Bosnia and Herzegovina",
+                    "North Macedonia": "North Macedonia",
+                    "Moldova": "Moldova"
                 }
+                for country in manual_countries:
+                    if country not in region_gdf["shapeName"].values:
+                        row = world_gdf[world_gdf["shapeName"] == country]
+                        if not row.empty:
+                            region_gdf = pd.concat([region_gdf, row], ignore_index=True)
+
+                lat_points = np.arange(32, 72, 0.1)
+                lon_points = np.arange(-25, 45, 0.1)
+
+                grid_points = [Point(lon, lat) for lat in lat_points for lon in lon_points]
+                grid_df = pd.DataFrame({
+                    "Latitude": [pt.y for pt in grid_points],
+                    "Longitude": [pt.x for pt in grid_points],
+                    "geometry": grid_points
+                })
+                grid_gdf = gpd.GeoDataFrame(grid_df, geometry="geometry", crs="EPSG:4326")
+
+                joined = gpd.sjoin(grid_gdf, region_gdf[["geometry", "shapeName"]], predicate="within", how="inner")
+
+                matched_countries = sorted(joined["shapeName"].unique())
+                expected_countries = sorted(region_gdf["shapeName"].unique())
+
+                def snap_to_grid(x, res=0.25):
+                    return np.round(x / res) * res
+
+                ndvi_data = {country: [] for country in expected_countries}
+                tmp_country_data = {country: [] for country in expected_countries}
+
+                for _, row in joined.iterrows():
+                    lat_val = row["Latitude"]
+                    lon_val = row["Longitude"]
+                    country = row["shapeName"]
+
+                    i = np.abs(lat - lat_val).argmin()
+                    j = np.abs(lon - lon_val).argmin()
+                    ndvi_value = ndvi[i, j]
+
+                    if ndvi_value != 255 and ndvi_value > self.ndvi_threshold:
+                        lat_025 = snap_to_grid(lat_val, 0.25)
+                        lon_025 = snap_to_grid(lon_val, 0.25)
+                        lon_025 = lon_025 if lon_025 >= 0 else lon_025 + 360
+                        tmp_country_data[country].append((lat_025, lon_025, ndvi_value))
+
+                for country, values in tmp_country_data.items():
+                    if values:
+                        df = pd.DataFrame(values, columns=["lat", "lon", "ndvi"])
+                        df = df.groupby(["lat", "lon"], as_index=False)["ndvi"].mean()
+                        ndvi_data[country] = list(df.itertuples(index=False, name=None))
 
-                for country_iso, bbox in self.country_rectangles.items():
-                    min_lon, min_lat, max_lon, max_lat = bbox
-
-                    for lat_val in lat_points:
-                        for lon_val in lon_points:
-                            if min_lon > max_lon:
-                                in_lon_range = lon_val >= min_lon or lon_val <= max_lon
-                            else:
-                                in_lon_range = min_lon <= lon_val <= max_lon
-
-                            if min_lat <= lat_val <= max_lat and in_lon_range:
-                                i = np.abs(lat - lat_val).argmin()
-                                j = np.abs(lon - lon_val).argmin()
-                                ndvi_value = ndvi[i, j]
-
-                                if (
-                                    ndvi_value != 255
-                                    and ndvi_value > self.ndvi_threshold
-                                ):
-                                    transformed_lon = (
-                                        lon_val if lon_val >= 0 else lon_val + 360
-                                    )
-                                    country_name = get_country_name_from_iso(
                                        country_iso
-                                    )
-                                    ndvi_data[country_name].append(
-                                        (lat_val, transformed_lon, ndvi_value)
-                                    )
-
-            dataset.close()
-            return month_year, ndvi_data
+                dataset.close()
+                return month_year, ndvi_data
 
    except Exception as e:
        print(f"Error processing the file {nc_file_path}: {e}")
@@ -313,12 +380,12 @@ def run_data_download(
    land_dir = paths.LAND_DIR
 
    if global_mode:
-        csv_file = f"{land_dir}/global_ndvi.csv"
+        csv_file = f"{land_dir}/global_ndvi_data.csv"
    elif region:
        region_cleaned = region.replace(" ", "_")
-        csv_file = f"{land_dir}/{region_cleaned}_ndvi_test.csv"
+        csv_file = f"{land_dir}/{region_cleaned}_ndvi_data.csv"
    else:
-        csv_file = f"{land_dir}/default_ndvi.csv"
+        csv_file = f"{land_dir}/default_ndvi_data.csv"
 
    downloader = CopernicusLandDownloader(
        links_url=links_url,
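
The rewritten `extract_ndvi_locations` above replaces the per-country bounding-box scan with a vector spatial join: build a regular 0.1° point grid, join each point to the country polygon containing it, then snap hits to the coarser 0.25° grid and average duplicates. A self-contained sketch of that technique, using a toy square polygon in place of the geoBoundaries file and random values in place of real NDVI (both assumptions, for illustration only):

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import Point, box

# Toy stand-in for the geoBoundaries ADM0 layer: one square "country".
world_gdf = gpd.GeoDataFrame(
    {"shapeName": ["Testland"]}, geometry=[box(0.0, 0.0, 1.0, 1.0)], crs="EPSG:4326"
)

# Regular 0.1-degree point grid, as in the diff.
lats = np.arange(0.05, 1.0, 0.1)
lons = np.arange(0.05, 1.0, 0.1)
grid = gpd.GeoDataFrame(
    {"Latitude": np.repeat(lats, len(lons)), "Longitude": np.tile(lons, len(lats))},
    geometry=[Point(lon, lat) for lat in lats for lon in lons],
    crs="EPSG:4326",
)

# Keep only the points that fall inside a country polygon.
joined = gpd.sjoin(grid, world_gdf[["geometry", "shapeName"]], predicate="within", how="inner")

# Snap each hit to the coarser 0.25-degree grid and average duplicates,
# mirroring snap_to_grid plus the groupby-mean in the diff.
res = 0.25
joined["lat"] = np.round(joined["Latitude"] / res) * res
joined["lon"] = np.round(joined["Longitude"] / res) * res
joined["ndvi"] = np.random.default_rng(0).uniform(0, 250, len(joined))  # fake NDVI values
cells = joined.groupby(["shapeName", "lat", "lon"], as_index=False)["ndvi"].mean()
print(cells)
```

Compared with the old approach, the join assigns each point to a real country polygon instead of a rectangular bounding box, which avoids double-counting points where boxes overlap.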

bfm_data/data_ingestion/ingestion_scripts/forest.py

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ def run_forest_data_processing(region: str = None, global_mode: bool = True):
    data_file = paths.FOREST_LAND_FILE
 
    if region:
-        output_csv = f"{data_dir}/{region}_forest_data_test.csv"
+        output_csv = f"{data_dir}/{region}_forest_data.csv"
    else:
        output_csv = f"{data_dir}/global_forest_data.csv"
 
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+import pandas as pd
+from bfm_data.config import paths
+
+mode = "europe"  # or "global"
+
+input_files = {
+    "Agriculture": f"{mode.capitalize()}_agriculture_data.csv",
+    "Agriculture_Irrigated": f"{mode.capitalize()}_agriculture_irrigated_data.csv",
+    "Arable": f"{mode.capitalize()}_arable_data.csv",
+    "Cropland": f"{mode.capitalize()}_cropland_data.csv"
+}
+
+all_dataframes = []
+
+for variable_name, filepath in input_files.items():
+    df = pd.read_csv(filepath)
+    df.insert(0, "Variable", variable_name)
+    all_dataframes.append(df)
+
+combined_df = pd.concat(all_dataframes, ignore_index=True)
+
+output_filename = f"{mode.capitalize()}_combined_agriculture_data.csv"
+output_path = paths.AGRICULTURE_DIR / output_filename
+combined_df.to_csv(output_path, index=False)
+
+print(f"Saved to '{output_path}'")
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+import pandas as pd
+from bfm_data.config import paths
+
+mode = "europe"  # or "global"
+
+land_file = f"{mode.capitalize()}_land_data.csv"
+ndvi_file = f"{mode.capitalize()}_ndvi_monthly_un_025.csv"
+
+land_df = pd.read_csv(land_file)
+ndvi_df = pd.read_csv(ndvi_file)
+
+for df in [land_df, ndvi_df]:
+    df["Latitude"] = df["Latitude"].apply(lambda x: round(float(x), 4))
+    df["Longitude"] = df["Longitude"].apply(lambda x: round(float(x), 4))
+
+merged_df = pd.merge(
+    land_df,
+    ndvi_df,
+    on=["Country", "Latitude", "Longitude"],
+    how="outer",  # Use 'inner' if only common coordinates are needed
+    suffixes=("", "_ndvi")
+)
+
+output_filename = f"{mode.capitalize()}_combined_land_data.csv"
+output_path = paths.LAND_DIR / output_filename
+merged_df.to_csv(output_path, index=False)
+
+print(f"Merged file saved to '{output_path}'")
