Intro

I like fish. In every sense of the word. The EU keeps some stats about them, which I have found here.

Let’s see how ML can help identify which country will catch the most fish, based on the length of its coastline.

Norway, has to be Norway.

Well, after some quick plots, it’s clearly visible that the amount of fish caught is somehow, weirdly, related to the length of the country’s coastline. Norway peaks!

But we also have some over-achievers. Namely: Denmark, Spain, France and The Netherlands. These countries always peak above the mean() on all tested datasets.
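To make the "above the mean()" check concrete, here is a minimal sketch of how it could be done with pandas. The numbers are toy values invented for illustration (not the EU figures), the countries are just letters, and the column names 'country', 'coastline', '2021' and '2022' are my assumptions about how the CSV is laid out.

```python
import pandas as pd

# Toy numbers, invented for illustration only; they are NOT the EU figures.
# The column names are assumptions about the CSV layout.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "coastline": [100.0, 200.0, 400.0, 800.0, 1600.0],  # km
    "2021": [60.0, 250.0, 500.0, 700.0, 300.0],         # catch
    "2022": [50.0, 300.0, 150.0, 900.0, 600.0],         # catch
})

year_cols = ["2021", "2022"]  # extend with '2011', '2012', ... for the real data

# A country counts as an over-achiever if it beats the mean in every tested year.
above_mean = df[year_cols].gt(df[year_cols].mean())
over_achievers = df.loc[above_mean.all(axis=1), "country"].tolist()

# Catch per km of coastline is another way to compare countries of different size.
df["catch_per_km"] = df["2022"] / df["coastline"]

print(over_achievers)
print(df.sort_values("catch_per_km", ascending=False)[["country", "catch_per_km"]])
```

With the real data, the same two lines (the comparison against the mean and the per-km ratio) should be enough to see who peaks and who merely has a long coast.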

Linear Regression

The dataset contains data from 2011 to about 2022. (?) Why not 2023, I don’t know. Anyhow, as a silly exercise, I’ve set up a small pipeline to keep these numbers under control over time. At the moment, the evil machine brain is set on LinearRegression(), but I plan to try several other models on this, how to say, limited dataset.

Python

import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


# Get the dataset
df = pd.read_csv('fisheu.csv')

# Split in X (coastline plus the yearly catch for 2011-2021) and y (the 2022 catch)
X = df[['coastline', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021']]
y = df['2022']

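# Split again into train and test sets. The random_state is itself drawn at random
# (5, 10, ..., 55), so the split, and therefore the result, changes between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random.randrange(5, 60, 5))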

# Define the model
model = LinearRegression()


# Fit the model
# Currently the model is underfit, due to limited data
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

Result

Lots of negative values. This MAY mean a lot of things, but it generally points towards not having enough data for Linear Regression.
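Since the split seed changes on every run, one way to see how shaky the result is would be to loop over all the seeds and look at the scores side by side. The sketch below does that on synthetic stand-in data (25 rows and 12 columns are my assumption, the values are random, not the EU numbers). It reports R², the mean absolute error, and how many predictions came out below zero, whichever of those "negative" turns out to mean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real table: 25 rows, 12 features
# (coastline plus 11 yearly columns). Values are random, NOT EU data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(25, 12))
y = X[:, -1] * 0.9 + rng.normal(0, 50, size=25)  # last year roughly predicts the next

results = []
for seed in range(5, 60, 5):  # the same seeds random.randrange(5, 60, 5) can produce
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    y_pred = LinearRegression().fit(X_train, y_train).predict(X_test)
    results.append({
        "seed": seed,
        "r2": r2_score(y_test, y_pred),              # can be negative on bad splits
        "mae": mean_absolute_error(y_test, y_pred),
        "negative_predictions": int((y_pred < 0).sum()),
    })

for row in results:
    print(row)
```

A negative R² just says the model did worse than always predicting the mean of the test set. With a test set of only a handful of rows, that can happen on some splits and not on others, so the spread across seeds tells more than any single run.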

BUT, the important part with ML is to understand the start and end point. Reading data is not easy, and mistakes can be made along the way. Once the pipeline is set up, and the basic things work (do they?), one can start looking at how to improve the model, the pipeline, the whole process. And that’s the real power of Machine Learning. Building it, brick by brick.

To do

Simple:

  • Gather larger datasets
  • Fine-tune the model
  • Check other models
  • Rework the output to prepare proper stats per country.
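For the "check other models" item, a first step could look like the sketch below: LinearRegression against ElasticNet, scored with 5-fold cross-validation instead of a single split. Again, this runs on synthetic stand-in data (the shape and the values are my assumptions), and the ElasticNet settings are untuned starting values, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 25 rows, 12 features. Random values, NOT EU data.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1000, size=(25, 12))
y = X[:, -1] * 0.9 + rng.normal(0, 50, size=25)

# Scaling matters for ElasticNet because its penalty depends on coefficient size.
models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "elasticnet": make_pipeline(
        StandardScaler(),
        ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000),  # untuned starting values
    ),
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    # scikit-learn returns negated errors so that higher is always better
    neg_mae = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    scores[name] = -neg_mae.mean()

for name, mae in scores.items():
    print(f"{name}: mean absolute error {mae:.1f}")
```

Cross-validation uses every row for testing exactly once, which seems a better fit for a table this small than a single 80/20 split.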