[Data preview: first rows of the joined FishBase–IUCN trait table (196 columns), spanning taxonomy and habitat (SpecCode, GenusSpecies, BodyShapeI, DemersPelag, depth ranges), morphology (fin, scale, stripe/spot, and teeth characters), reproduction, and the outcome variable IsOfConcern. Example species include Engraulis ringens (Peruvian Anchoveta), Coryphaena hippurus (Common Dolphinfish), and Epinephelus itajara (Atlantic Goliath Grouper).]
Description: In this qmd, I evaluate different supervised machine learning algorithms for predicting IUCN Red List status of fish based on ecological and morphological characteristics. These characteristics were retrieved from FishBase and joined with the IUCN data in a separate script.
Introduction
Global human activity threatens many species with extinction. According to the International Union for Conservation of Nature (IUCN), “More than 41,000 species are threatened with extinction. That is still 28% of all assessed species.”1. Increased extinction and loss of biodiversity can have severe ecological, economic, and cultural impacts. Cardinale et al.’s synthesis of biodiversity and ecosystem services research concludes that biodiversity loss reduces the efficiency, stability, and productivity of ecological communities. Decreased productivity of ecosystem services, in turn, carries economic costs2. Additionally, cultures worldwide have strong ties to local flora and fauna, much of which now faces extinction risk. Improving our understanding of extinction risk is therefore ecologically, economically, and culturally important.
The IUCN Red List classifies species into categories based on how vulnerable they are to extinction. The Red List also contains many species listed as “Data Deficient” or “Not Evaluated,” and filling in these data gaps is extremely important for conservation. Evaluating marine populations can prove challenging, so it helps to build on existing knowledge to inform where evaluation resources should be spent. Here, I build several machine learning models that predict binary Red List status of saltwater fish based on their ecological and morphological traits according to FishBase. I then apply the most successful model to Red List Data Deficient and Not Evaluated species.
This work builds on my previous work, Identifying Key Traits in Hawaiian Fish that Predict Risk of Extinction. Here, however, I look at all fish listed on the IUCN Red List – not just those in Hawaii – and I use a Tidymodels machine learning approach.
The Data
For my analysis I use IUCN Red List data accessed via the IUCN Red List API1 and the package rredlist3. Consistent with Munstermann et al., living species listed as ‘Vulnerable’, ‘Endangered’, or ‘Critically Endangered’ were categorized as ‘Threatened’, while living species listed as ‘Least Concern’ or ‘Near Threatened’ were categorized as ‘Nonthreatened’4. I also chose to add ‘Extinct in the Wild’ to the ‘Threatened’ category. Fully extinct species were not included. The IUCN Red List data are limited in that many marine species have not yet been listed or are too data deficient to be evaluated. This lack of data on elusive fish may introduce bias into the models.
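The recoding described above happens in a separate joining script, but the logic can be sketched as follows. This is an illustrative helper, not the actual script; the function name `collapse_status` is mine:

```r
# Illustrative sketch (not the actual joining script): collapse IUCN Red List
# categories into the binary outcome used later. Fully extinct species are
# dropped upstream; Data Deficient / Not Evaluated species stay NA so they
# can be predicted later rather than trained on.
collapse_status <- function(category) {
  threatened <- c("Vulnerable", "Endangered", "Critically Endangered",
                  "Extinct in the Wild")
  nonthreatened <- c("Least Concern", "Near Threatened")
  ifelse(category %in% threatened, 1,
         ifelse(category %in% nonthreatened, 0, NA))
}

collapse_status(c("Least Concern", "Endangered", "Data Deficient")) # returns 0 1 NA
```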
Fish ecological data were accessed from FishBase5 via the package rfishbase6. Different species in the FishBase data were originally described by different people, possibly leading to errors or biases. Measurement error in length may be present, as there are several common ways to measure the length of a fish. The species recorded in FishBase may also be biased toward fish with commercial value. Data were wrangled in R and formatted in a tidy data table (previewed at the top of this document).
Methods
To get started, I load several packages. Tidyverse packages help with further cleaning and preparing the data. Tidymodels packages provide almost everything I need for the machine learning steps. kknn supports my k-nearest neighbors (KNN) model. knitr is used to create kable tables. baguette is used in my bagging model. doParallel allows for parallel computing on my laptop. vip helps identify variable importance.
```r
# Load libraries
library(tidyverse)
library(tidymodels)
library(kknn)
library(knitr)
library(baguette)
library(doParallel)
library(vip)
```
First I read in the data. These data were cleaned and joined in a separate script, but they still need a bit of preprocessing. The outcome variable in this dataset is labeled IsOfConcern and indicates whether the species is at risk of extinction (1) or not (0). I start by exploring the data dimensions.
```r
# Read in data
fish_dat_full <- read_csv("/Users/elkewindschitl/Documents/data-sci/fish_data.csv")
fish_dat <- fish_dat_full %>%
  filter(!is.na(IsOfConcern)) # remove rows that don't have the outcome variable

# Explore some characteristics of the dataset
cols <- ncol(fish_dat)
rows <- nrow(fish_dat)
df_chars <- data.frame(
  Metric = c("Number of Columns", "Number of Rows"),
  Count = c(ncol(fish_dat), nrow(fish_dat)))
kable(df_chars,
      col.names = c("", "Count"))
```
 | Count |
---|---|
Number of Columns | 196 |
Number of Rows | 9875 |
```r
fish_dat %>%
  group_by(IsOfConcern) %>%
  count() %>%
  kable(col.names = c("Species threat is of concern", "Count"))
```
Species threat is of concern | Count |
---|---|
0 | 9156 |
1 | 719 |
Data Prep
There are a lot of NA values in this dataset. Since I already have many columns, I can reduce both problems by removing columns with a high proportion of NA values. Here I keep only columns where less than 20% of rows have NA values.
```r
# Calculate the proportion of NA values in each column
na_proportion <- colMeans(is.na(fish_dat))

# I want to remove columns with extreme NA counts (more than 20%)
# Define the threshold (20% or 0.20)
threshold <- 0.20

# Find columns at or below the threshold proportion of NA values
columns_meeting_threshold <- names(na_proportion[na_proportion <= threshold])

# Print the column names that meet the threshold
columns_meeting_threshold %>% kable(col.names = "Columns that are below NA threshold")
```
Columns that are below NA threshold |
---|
SpecCode |
GenusSpecies |
BodyShapeI |
DemersPelag |
AirBreathing |
DepthRangeDeep |
Length |
PriceCateg |
UsedforAquaculture |
GameFish |
Dangerous |
Electrogenic |
MainCommonName |
IsOfConcern |
```r
# Select just those columns that meet my criteria
fish_short <- fish_dat %>%
  select(all_of(columns_meeting_threshold))
```
There is still more to be done. I want to make sure numeric columns are numeric and that character columns are treated as factors; my outcome variable needs to be a factor as well. The data frame is small enough that I can do this by inspecting which columns are character and mutating them to factors. I want unordered factors, and I need to remove columns that are identifiers rather than features. Because some of the algorithms used here do not handle missing data well, I chose to remove all rows with NA values. Unfortunately, this cut down on the amount of data available to train and test the algorithms.
```r
# Find character columns that need to be converted to factor
sapply(fish_short, class) %>% kable(col.names = c("Column", "Class"))
```
Column | Class |
---|---|
SpecCode | numeric |
GenusSpecies | character |
BodyShapeI | character |
DemersPelag | character |
AirBreathing | character |
DepthRangeDeep | numeric |
Length | numeric |
PriceCateg | character |
UsedforAquaculture | character |
GameFish | numeric |
Dangerous | character |
Electrogenic | character |
MainCommonName | character |
IsOfConcern | numeric |
```r
# List of character columns to convert to factors
character_columns_to_convert <- c("GenusSpecies", "BodyShapeI", "DemersPelag",
                                  "AirBreathing", "PriceCateg", "UsedforAquaculture",
                                  "Dangerous", "Electrogenic", "MainCommonName")

# Convert the specified character columns to factors
fish <- fish_short %>%
  mutate(across(all_of(character_columns_to_convert), as.factor))

# If a feature is a factor DON'T order it; remove identifying columns
fish <- fish %>% mutate_if(is.ordered, .funs = factor, ordered = F) %>%
  select(-GenusSpecies) %>%
  select(-SpecCode) %>%
  select(-MainCommonName)

# Make outcome a factor
fish$IsOfConcern <- as.factor(fish$IsOfConcern)

# Remove rows with any remaining missing values
fish <- na.omit(fish)

# Check the new df
sapply(fish, class) %>% kable(col.names = c("Column", "Class"))
```
Column | Class |
---|---|
BodyShapeI | factor |
DemersPelag | factor |
AirBreathing | factor |
DepthRangeDeep | numeric |
Length | numeric |
PriceCateg | factor |
UsedforAquaculture | factor |
GameFish | numeric |
Dangerous | factor |
Electrogenic | factor |
IsOfConcern | factor |
After my data are prepped, I split them into training and testing sets using a 70/30 split. Because the data are unbalanced, I stratify by my outcome variable, IsOfConcern.
```r
set.seed(123)

# Initial split of data, 70/30
fish_split <- initial_split(fish, prop = 0.7, strata = IsOfConcern)
fish_train <- training(fish_split) # Training data
fish_test <- testing(fish_split)   # Test data
```
Preprocessing
I create a recipe for the preprocessing steps. I use dummy columns so that each level of the factor (categorical) variables gets its own column, and I remove columns with no variation in the data. Then I normalize the numeric columns, because the lasso and KNN algorithms require normalization to keep certain features from dominating the model. I use the same preprocessing steps for all algorithms so they can be compared fairly.
```r
set.seed(123)

# Preprocess the data within the recipe
fish_recipe <- recipe(IsOfConcern ~ ., data = fish_train) %>%
  step_dummy(all_factor(), -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric(), -all_outcomes())

# Check that the test and train dfs look as expected
prepped <- fish_recipe %>%
  prep()
fish_baked_train <- bake(prepped, fish_train)
fish_baked_test <- bake(prepped, fish_test)

# Use below to check for NA values in the entire dataframes
# any(is.na(fish_baked_train))
# any(is.na(fish_baked_test))
```
Dummy Classifier
Because my data are unbalanced, with many more non-threatened species, a model that always chose non-threatened would have high accuracy. Of course, that is not very helpful when trying to predict which species are or might be threatened. Here I derive a dummy accuracy by calculating the accuracy of a model that always predicts non-threatened. This serves as a baseline for whether a model is performing well (better than the dummy) or not. However, because this dataset is so imbalanced, I will use area under the ROC curve (AUC) for model selection.
```r
# Calculate dummy classifier for baseline comparison
# Calculate the number of rows where IsOfConcern is 0
num_is_0 <- sum(fish_test$IsOfConcern == 0)

# Calculate the number of rows where IsOfConcern is not 0
num_is_not_0 <- nrow(fish_test) - num_is_0

# Calculate the accuracy of the dummy classifier (always predicting the majority class)
dummy <- num_is_0 / nrow(fish_test)
```
The dummy classifier accuracy is 0.94. This serves as the baseline for the other algorithms. Now I proceed with building various models and training them on the training data: Lasso, K-Nearest Neighbors, Decision Tree, Bagged Decision Tree, Random Forest, and Gradient Boosted Decision Tree.
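As a quick toy illustration of why accuracy alone is misleading here (the counts below are made up, chosen only to roughly match this dataset's class balance):

```r
# Toy example with hypothetical counts: 940 nonthreatened (0) and
# 60 threatened (1) species in a test set of 1000.
truth <- c(rep(0, 940), rep(1, 60))

# A "dummy" model that always predicts the majority class
pred <- rep(0, 1000)

# Accuracy looks impressive...
acc <- mean(pred == truth) # 0.94

# ...but a constant score cannot rank any threatened species above a
# nonthreatened one, so its ROC curve is the diagonal and its AUC is 0.5.
# AUC therefore separates real models from the dummy; accuracy barely does.
acc
```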
Lasso for Classification
```r
set.seed(123)

# Set up k-fold cross-validation with 10 folds; this can be reused for all the algorithms
fish_cv <- fish_train %>%
  vfold_cv(v = 10,
           strata = IsOfConcern)

# Set specifications
tune_l_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

# Define a workflow
wf_l <- workflow() %>%
  add_model(tune_l_spec) %>%
  add_recipe(fish_recipe)

# Set grid
lambda_grid <- grid_regular(penalty(), levels = 50)

doParallel::registerDoParallel()
set.seed(123)

# Tune lasso model (the workflow already carries the model spec and recipe)
lasso_grid <- wf_l %>%
  tune_grid(
    resamples = fish_cv,
    grid = lambda_grid
  )
```
```r
# Plot the mean accuracy and AUC at each penalty
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(ymin = mean - std_err,
                    ymax = mean + std_err),
                alpha = 0.5) +
  geom_line(size = 1.5) +
  facet_wrap(~.metric,
             scales = "free",
             strip.position = "left",
             nrow = 2,
             labeller = as_labeller(c(`accuracy` = "Accuracy",
                                      `roc_auc` = "Area under ROC curve"))) +
  scale_x_log10(name = "Penalty") +
  scale_y_continuous(name = "") +
  scale_color_manual(values = c("#4a6c75", "#57ba72")) +
  theme_minimal() +
  theme(
    strip.placement = "outside",
    legend.position = "none",
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Results of penalty tuning")
```
```r
# View table
lasso_grid %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the penalty parameter values.")
```
penalty | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|
0.0022230 | roc_auc | binary | 0.8411065 | 10 | 0.0163830 | Preprocessor1_Model37 |
0.0013895 | roc_auc | binary | 0.8406888 | 10 | 0.0174560 | Preprocessor1_Model36 |
0.0008685 | roc_auc | binary | 0.8404846 | 10 | 0.0176116 | Preprocessor1_Model35 |
0.0035565 | roc_auc | binary | 0.8397288 | 10 | 0.0156837 | Preprocessor1_Model38 |
0.0005429 | roc_auc | binary | 0.8392379 | 10 | 0.0172147 | Preprocessor1_Model34 |
```r
# Select the model with the highest AUC
best_lasso <- lasso_grid %>%
  select_best("roc_auc")

final_l_wf <- wf_l %>%
  finalize_workflow(best_lasso)

# Perform a last fit to see how the model performs on the test data
final_lasso_fit <- last_fit(final_l_wf, fish_split)

# Collect metrics on the test data
tibble_lasso <- final_lasso_fit %>% collect_metrics()
tibble_lasso %>%
  kable(caption = "Accuracy and area under the receiver operating characteristic curve of the final fit.")
```
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.9433472 | Preprocessor1_Model1 |
roc_auc | binary | 0.8017930 | Preprocessor1_Model1 |
# Grab the model accuracy on the testing data
final_lasso_accuracy <- tibble_lasso %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)

# Grab the model AUC on the testing data
final_lasso_auc <- tibble_lasso %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
lasso_test_rs <- cbind(fish_test, final_lasso_fit$.predictions)[, -16] # Remove duplicate column

# Compute a confusion matrix
cm_lasso <- lasso_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class)
# Create a custom color palette
custom_palette <- scale_fill_gradient(
  high = "#4a6c75",
  low = "#d3e6eb"
)

# Create the confusion matrix heatmap plot
autoplot(cm_lasso, type = "heatmap") +
  custom_palette + # Apply the custom color palette
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of lasso predictions on test data")

# Calculate rates of true positives, false negatives, etc. from the confusion matrix
TP_las <- cm_lasso$table[2, 2]
FP_las <- cm_lasso$table[2, 1]
TN_las <- cm_lasso$table[1, 1]
FN_las <- cm_lasso$table[1, 2]
TPR_las <- TP_las / (TP_las + FN_las) # True positive rate
FPR_las <- FP_las / (FP_las + TN_las) # False positive rate
TNR_las <- TN_las / (TN_las + FP_las) # True negative rate
FNR_las <- FN_las / (TP_las + FN_las) # False negative rate

# Create cm df to hold all of the rate metrics
lasso_cm_vec <- c(TPR_las, FPR_las, TNR_las, FNR_las)
row_names <- c("True positive rate", "False positive rate", "True negative rate", "False negative rate")
cm_df <- bind_cols(Metric = row_names, Lasso = lasso_cm_vec)
The accuracy of the lasso model was 0.943, which is slightly better than our dummy classifier's accuracy of 0.94. The model had an AUC of 0.802.
K-Nearest Neighbors
set.seed(123)

# Define the KNN model with tuning
knn_spec_tune <- nearest_neighbor(neighbors = tune()) %>% # tune k
  set_mode("classification") %>%
  set_engine("kknn")

# Define a new workflow
wf_knn_tune <- workflow() %>%
  add_model(knn_spec_tune) %>%
  add_recipe(fish_recipe)

# Fit the workflow on the predefined folds and hyperparameters
fit_knn_cv <- wf_knn_tune %>%
  tune_grid(
    fish_cv,
    grid = data.frame(neighbors = c(1, 5, 10, 15, seq(20, 200, 10))))

# Use autoplot() to examine how different parameter configurations relate to accuracy
autoplot(fit_knn_cv) +
  theme_light() +
  labs(
    x = "Number of neighbors (K)",
    title = "Results of neighbor tuning"
  ) +
  theme(
    legend.position = "none",
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  facet_wrap(
    ~.metric,
    nrow = 2,
    labeller = labeller(.metric = c("accuracy" = "Accuracy", "roc_auc" = "Area under ROC curve"))
  )

# View table
fit_knn_cv %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the number of neighbors parameter values.")
neighbors | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|
150 | roc_auc | binary | 0.8630058 | 10 | 0.0132910 | Preprocessor1_Model18 |
130 | roc_auc | binary | 0.8628657 | 10 | 0.0131475 | Preprocessor1_Model16 |
140 | roc_auc | binary | 0.8628149 | 10 | 0.0132096 | Preprocessor1_Model17 |
160 | roc_auc | binary | 0.8627477 | 10 | 0.0133205 | Preprocessor1_Model19 |
170 | roc_auc | binary | 0.8625283 | 10 | 0.0133526 | Preprocessor1_Model20 |
# Select the model with the highest AUC
best_knn <- fit_knn_cv %>%
  select_best("roc_auc")

# The final workflow for our KNN model
final_knn_wf <- wf_knn_tune %>%
  finalize_workflow(best_knn)

# Use the last_fit() approach to apply the model to the test data
final_knn_fit <- last_fit(final_knn_wf, fish_split)

# Collect metrics on the test data
tibble_knn <- final_knn_fit %>% collect_metrics()
tibble_knn %>%
  kable(caption = "Accuracy and area under the receiver operating characteristic curve of the final fit.")
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.9407484 | Preprocessor1_Model1 |
roc_auc | binary | 0.8511741 | Preprocessor1_Model1 |
# Store accuracy and AUC
final_knn_accuracy <- tibble_knn %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)

final_knn_auc <- tibble_knn %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
knn_test_rs <- cbind(fish_test, final_knn_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_knn <- knn_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class)

# Create the confusion matrix heatmap plot
autoplot(cm_knn, type = "heatmap") +
  custom_palette +
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of knn predictions on test data")

# Calculate rates from the confusion matrix
TP_knn <- cm_knn$table[2, 2]
FP_knn <- cm_knn$table[2, 1]
TN_knn <- cm_knn$table[1, 1]
FN_knn <- cm_knn$table[1, 2]
TPR_knn <- TP_knn / (TP_knn + FN_knn) # True positive rate
FPR_knn <- FP_knn / (FP_knn + TN_knn) # False positive rate
TNR_knn <- TN_knn / (TN_knn + FP_knn) # True negative rate
FNR_knn <- FN_knn / (TP_knn + FN_knn) # False negative rate

# Add rates to cm df
knn_cm_vec <- c(TPR_knn, FPR_knn, TNR_knn, FNR_knn)
cm_df$KNN <- knn_cm_vec
The k-nearest neighbors model had nearly the same accuracy as the dummy classifier at predicting threat status. The accuracy of the model was 0.941, and its AUC of 0.851 is better than the lasso model's.
Decision Tree
# Tell the model that we are tuning hyperparameters
tree_spec_tune <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Set up grid
tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5)

# Check grid
#tree_grid

# Define a workflow with the recipe and specification
wf_tree_tune <- workflow() %>%
  add_recipe(fish_recipe) %>%
  add_model(tree_spec_tune)

doParallel::registerDoParallel(cores = 3) # build trees in parallel

# Tune
tree_rs <- tune_grid(
  wf_tree_tune,
  IsOfConcern~.,
  resamples = fish_cv,
  grid = tree_grid,
  metrics = metric_set(roc_auc)
)
# Use autoplot() to examine how different parameter configurations relate to auc
autoplot(tree_rs) +
theme_light() +
scale_color_manual(values = c("#4a6c75", "#57ba72", "#d596e0", "#e06d53", "#d6cf81")) +
labs(x = "Cost-complexity parameter",
y = "Area under the ROC curve",
title = "Results of tree tuning") +
theme(
plot.title = element_text(size = 16, hjust = 0.5),
panel.background = element_blank(),
plot.background = element_blank()
)
# View table
tree_rs %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the tuned tree parameter values.")
cost_complexity | tree_depth | min_n | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
0.0000000 | 11 | 11 | roc_auc | binary | 0.8010909 | 10 | 0.0156204 | Preprocessor1_Model041 |
0.0000000 | 11 | 11 | roc_auc | binary | 0.8010909 | 10 | 0.0156204 | Preprocessor1_Model042 |
0.0000032 | 11 | 11 | roc_auc | binary | 0.8010909 | 10 | 0.0156204 | Preprocessor1_Model043 |
0.0005623 | 11 | 11 | roc_auc | binary | 0.8010909 | 10 | 0.0156204 | Preprocessor1_Model044 |
0.0005623 | 15 | 11 | roc_auc | binary | 0.7960204 | 10 | 0.0160636 | Preprocessor1_Model049 |
# Finalize the model specs with the best hyperparameter result
final_tree <- finalize_model(tree_spec_tune, select_best(tree_rs))

# Final fit to test data
final_tree_fit <- last_fit(final_tree, IsOfConcern~., fish_split) # does the training fit then the final prediction as well

# Collect metrics from fit
tibble_tree <- final_tree_fit %>% collect_metrics()
tibble_tree %>% kable(caption = "Accuracy and area under the receiver operating characteristic curve of the final fit.")
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.9438669 | Preprocessor1_Model1 |
roc_auc | binary | 0.7751052 | Preprocessor1_Model1 |
# Store accuracy and AUC metrics
final_tree_accuracy <- tibble_tree %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)

final_tree_auc <- tibble_tree %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
tree_test_rs <- cbind(fish_test, final_tree_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_tree <- tree_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class)

# Create the confusion matrix heatmap plot
autoplot(cm_tree, type = "heatmap") +
  custom_palette +
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of decision tree predictions on test data")

# Calculate rates from the confusion matrix
TP_tree <- cm_tree$table[2, 2]
FP_tree <- cm_tree$table[2, 1]
TN_tree <- cm_tree$table[1, 1]
FN_tree <- cm_tree$table[1, 2]
TPR_tree <- TP_tree / (TP_tree + FN_tree) # True positive rate
FPR_tree <- FP_tree / (FP_tree + TN_tree) # False positive rate
TNR_tree <- TN_tree / (TN_tree + FP_tree) # True negative rate
FNR_tree <- FN_tree / (TP_tree + FN_tree) # False negative rate

# Add rates to cm df
tree_cm_vec <- c(TPR_tree, FPR_tree, TNR_tree, FNR_tree)
cm_df$DecisionTree <- tree_cm_vec
The decision tree model had a higher accuracy at predicting threat status than the dummy classifier. The accuracy of the decision tree was 0.944. Its AUC of 0.775, however, is lower than both the lasso and the KNN models.
Bagging
set.seed(123)

# Set bagging tuning specifications
bag_spec <-
  bag_tree(cost_complexity = tune(),
           tree_depth = tune(),
           min_n = tune()) %>%
  set_engine("rpart", times = 75) %>% # 75 ensemble members
  set_mode("classification")

# Set up tuning grid
bag_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 5)

# Check the grid space
#bag_grid

# Set up bagging workflow
wf_bag <- workflow() %>%
  add_recipe(fish_recipe) %>%
  add_model(bag_spec)

doParallel::registerDoParallel() # build trees in parallel

# Run tuning
bag_rs <- tune_grid(
  wf_bag,
  IsOfConcern~.,
  resamples = fish_cv,
  grid = bag_grid,
  metrics = metric_set(roc_auc)
)
# Use autoplot() to examine how different parameter configurations relate to auc
autoplot(bag_rs) +
theme_light() +
scale_color_manual(values = c("#4a6c75", "#57ba72", "#d596e0", "#e06d53", "#d6cf81")) +
labs(x = "Cost-complexity parameter",
y = "Area under the ROC curve",
title = "Results of bagging tuning") +
theme(
plot.title = element_text(size = 16, hjust = 0.5),
panel.background = element_blank(),
plot.background = element_blank()
)
# View table
bag_rs %>%
  tune::show_best() %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the tuned tree parameter values.")
cost_complexity | tree_depth | min_n | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
0.0005623 | 15 | 11 | roc_auc | binary | 0.8820775 | 10 | 0.0098248 | Preprocessor1_Model049 |
0.0005623 | 15 | 21 | roc_auc | binary | 0.8812587 | 10 | 0.0123964 | Preprocessor1_Model074 |
0.0000032 | 15 | 11 | roc_auc | binary | 0.8791856 | 10 | 0.0137413 | Preprocessor1_Model048 |
0.0000032 | 11 | 21 | roc_auc | binary | 0.8787266 | 10 | 0.0122258 | Preprocessor1_Model068 |
0.0000000 | 15 | 11 | roc_auc | binary | 0.8779793 | 10 | 0.0117088 | Preprocessor1_Model047 |
# Finalize the model specs with the best performing model
final_bag <- finalize_model(bag_spec, select_best(bag_rs))

# Perform a last fit of the model on the testing data
final_bag_fit <- last_fit(final_bag, IsOfConcern~., fish_split)

# Collect metrics from fit
tibble_bag <- final_bag_fit %>% collect_metrics()
tibble_bag %>%
  kable(caption = "Accuracy and area under the receiver operating characteristic curve of the final fit.")
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.9438669 | Preprocessor1_Model1 |
roc_auc | binary | 0.8568390 | Preprocessor1_Model1 |
# Store model accuracy and AUC on testing data
final_bag_accuracy <- tibble_bag %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)

final_bag_auc <- tibble_bag %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
bag_test_rs <- cbind(fish_test, final_bag_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_bag <- bag_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class)

# Create the confusion matrix heatmap plot
autoplot(cm_bag, type = "heatmap") +
  custom_palette + # Apply the custom color palette
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of bagging predictions on test data")

# Calculate rates from the confusion matrix
TP_bag <- cm_bag$table[2, 2]
FP_bag <- cm_bag$table[2, 1]
TN_bag <- cm_bag$table[1, 1]
FN_bag <- cm_bag$table[1, 2]
TPR_bag <- TP_bag / (TP_bag + FN_bag) # True positive rate
FPR_bag <- FP_bag / (FP_bag + TN_bag) # False positive rate
TNR_bag <- TN_bag / (TN_bag + FP_bag) # True negative rate
FNR_bag <- FN_bag / (TP_bag + FN_bag) # False negative rate

# Add rates to cm df
bag_cm_vec <- c(TPR_bag, FPR_bag, TNR_bag, FNR_bag)
cm_df$Bagging <- bag_cm_vec
The bagging model had a similar accuracy at predicting threat status to the decision tree and lasso models. The accuracy of the bagging model was 0.944, and its AUC of 0.857 is similar to the KNN model.
Random Forest
set.seed(123)

# Set forest specifications
forest_spec <-
  rand_forest(min_n = tune(),
              mtry = tune(),
              trees = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Set grid for tuning
forest_grid <- grid_regular(min_n(), mtry(c(1, 44)), trees(), levels = 5)

# Check the grid space
#forest_grid

# Set the workflow with the recipe and forest specs
wf_forest <- workflow() %>%
  add_recipe(fish_recipe) %>%
  add_model(forest_spec)

doParallel::registerDoParallel()

# Perform tuning
forest_rs <- tune_grid(
  wf_forest,
  IsOfConcern~.,
  resamples = fish_cv,
  grid = forest_grid,
  metrics = metric_set(roc_auc)
)
# Use autoplot() to examine how different parameter configurations relate to auc
autoplot(forest_rs) +
theme_light() +
scale_color_manual(values = c("#4a6c75", "#57ba72", "#d596e0", "#e06d53", "#d6cf81")) +
labs(x = "# Randomly selected predictors",
y = "Area under the ROC curve",
title = "Results of random forest tuning") +
theme(
plot.title = element_text(size = 16, hjust = 0.5),
panel.background = element_blank(),
plot.background = element_blank()
)
# View table
forest_rs %>%
  tune::show_best() %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the tree and forest parameter values.")
mtry | trees | min_n | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
11 | 2000 | 11 | roc_auc | binary | 0.8855631 | 10 | 0.0134102 | Preprocessor1_Model107 |
11 | 500 | 2 | roc_auc | binary | 0.8850774 | 10 | 0.0133817 | Preprocessor1_Model031 |
11 | 1500 | 11 | roc_auc | binary | 0.8850390 | 10 | 0.0138726 | Preprocessor1_Model082 |
11 | 1500 | 2 | roc_auc | binary | 0.8847734 | 10 | 0.0135810 | Preprocessor1_Model081 |
11 | 1000 | 11 | roc_auc | binary | 0.8846533 | 10 | 0.0141003 | Preprocessor1_Model057 |
# Finalize model with specs and best hyperparameters
final_forest <- finalize_model(forest_spec, select_best(forest_rs))

# Perform last fit
final_forest_fit <- last_fit(final_forest, IsOfConcern~., fish_split)

# Collect performance metrics
tibble_forest <- final_forest_fit %>% collect_metrics()
tibble_forest %>%
  kable(caption = "Accuracy and area under the receiver operating characteristic curve of the final fit.")
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.9433472 | Preprocessor1_Model1 |
roc_auc | binary | 0.8240705 | Preprocessor1_Model1 |
# Store accuracy and AUC metrics
final_forest_accuracy <- tibble_forest %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)

final_forest_auc <- tibble_forest %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
forest_test_rs <- cbind(fish_test, final_forest_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_forest <- forest_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class)

# Create the confusion matrix heatmap plot
autoplot(cm_forest, type = "heatmap") +
  custom_palette + # Apply the custom color palette
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of random forest predictions on test data")

# Calculate rates from the confusion matrix
TP_forest <- cm_forest$table[2, 2]
FP_forest <- cm_forest$table[2, 1]
TN_forest <- cm_forest$table[1, 1]
FN_forest <- cm_forest$table[1, 2]
TPR_forest <- TP_forest / (TP_forest + FN_forest) # True positive rate
FPR_forest <- FP_forest / (FP_forest + TN_forest) # False positive rate
TNR_forest <- TN_forest / (TN_forest + FP_forest) # True negative rate
FNR_forest <- FN_forest / (TP_forest + FN_forest) # False negative rate

# Add rates to cm df
forest_cm_vec <- c(TPR_forest, FPR_forest, TNR_forest, FNR_forest)
cm_df$RandomForest <- forest_cm_vec
The accuracy of the random forest was 0.943, again similar to the other models. This model had an AUC of 0.824.
Boosting
# Tune the learning rate first
# Set up specs for learning rate tuning
lr_spec <- parsnip::boost_tree(mode = "classification",
                               engine = "xgboost",
                               trees = 3000,
                               learn_rate = tune())

# Set up tuning grid
lr_grid <- expand.grid(learn_rate = seq(0.0001, 0.5, length.out = 50))

# Set up workflow
wf_lr_tune <- workflow() %>%
  add_recipe(fish_recipe) %>%
  add_model(lr_spec)

doParallel::registerDoParallel()
set.seed(123)

# Tune
lr_rs <- tune_grid(
  wf_lr_tune,
  IsOfConcern~.,
  resamples = fish_cv,
  grid = lr_grid
)
# Use autoplot() to examine how different parameter configurations relate to accuracy
autoplot(lr_rs) +
  theme_light() +
  labs(
    x = "Learning rate",
    title = "Results from learning rate tuning"
  ) +
  theme(
    legend.position = "none",
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  facet_wrap(
    ~.metric,
    nrow = 2,
    labeller = labeller(.metric = c("accuracy" = "Accuracy", "roc_auc" = "Area under ROC curve"))
  )
# Identify best values from the tuning process
lr_rs %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the learning rate parameter values.")
learn_rate | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|
0.0103020 | roc_auc | binary | 0.8697860 | 10 | 0.0125356 | Preprocessor1_Model02 |
0.0307061 | roc_auc | binary | 0.8587251 | 10 | 0.0141796 | Preprocessor1_Model04 |
0.0205041 | roc_auc | binary | 0.8584129 | 10 | 0.0126799 | Preprocessor1_Model03 |
0.0511102 | roc_auc | binary | 0.8574748 | 10 | 0.0135668 | Preprocessor1_Model06 |
0.0409082 | roc_auc | binary | 0.8572797 | 10 | 0.0133747 | Preprocessor1_Model05 |
# Select best learning rate hyperparameters
best_learn <- lr_rs %>%
  tune::select_best("roc_auc")

# Tune tree parameters next
# Create a new specification, setting the learning rate and tuning the tree parameters
boost_tree_spec <- parsnip::boost_tree(mode = "classification",
                                       engine = "xgboost",
                                       trees = 3000,
                                       learn_rate = best_learn$learn_rate,
                                       min_n = tune(),
                                       tree_depth = tune(),
                                       loss_reduction = tune()
)

# Define parameters to be tuned
boost_params <- dials::parameters(
  min_n(),
  tree_depth(),
  loss_reduction()
)

# Set up a tuning grid using grid_max_entropy() to get a representative sampling of the parameter space
boost_tree_grid <- dials::grid_max_entropy(boost_params, size = 50)

# Set up workflow
wf_boost_tree_tune <- workflow() %>%
  add_recipe(fish_recipe) %>%
  add_model(boost_tree_spec)
set.seed(123)
doParallel::registerDoParallel()

# Tune
boost_tree_rs <- tune_grid(
  wf_boost_tree_tune,
  IsOfConcern~.,
  resamples = fish_cv,
  grid = boost_tree_grid
)

# Identify best values from the tuning process
boost_tree_rs %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the tree parameter values.")
min_n | tree_depth | loss_reduction | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
5 | 13 | 1.4059617 | roc_auc | binary | 0.8826142 | 10 | 0.0139348 | Preprocessor1_Model11 |
5 | 6 | 0.5640640 | roc_auc | binary | 0.8751618 | 10 | 0.0139514 | Preprocessor1_Model25 |
10 | 3 | 0.0000086 | roc_auc | binary | 0.8744951 | 10 | 0.0127641 | Preprocessor1_Model34 |
11 | 3 | 0.0000000 | roc_auc | binary | 0.8741408 | 10 | 0.0133400 | Preprocessor1_Model21 |
15 | 8 | 0.0000000 | roc_auc | binary | 0.8734961 | 10 | 0.0128297 | Preprocessor1_Model50 |
# Select best tree hyperparameters
boost_best_trees <- boost_tree_rs %>%
  tune::select_best("roc_auc")

# Tune stochastic parameters
# Create another new specification, setting the learning rate and tree parameters and tuning the stochastic parameters
boost_stoc_spec <- parsnip::boost_tree(mode = "classification",
                                       engine = "xgboost",
                                       trees = 3000,
                                       learn_rate = best_learn$learn_rate,
                                       min_n = boost_best_trees$min_n,
                                       tree_depth = boost_best_trees$tree_depth,
                                       mtry = tune(),
                                       loss_reduction = boost_best_trees$loss_reduction,
                                       sample_size = tune(),
                                       stop_iter = tune()
)

# Set up a tuning grid using grid_max_entropy() again
# Define parameters to be tuned
boost_stoc_params <- dials::parameters(
  finalize(mtry(),
           select(fish_baked_train, -IsOfConcern)),
  sample_size = sample_prop(c(.4, .9)),
  stop_iter())

# Set up grid
boost_stoc_grid <- dials::grid_max_entropy(boost_stoc_params, size = 50)

# Set up workflow
wf_boost_stoc <- workflow() %>%
  add_recipe(fish_recipe) %>%
  add_model(boost_stoc_spec)

set.seed(123)
doParallel::registerDoParallel()
# Tune
boost_stoc_rs <- tune_grid(
  wf_boost_stoc,
  IsOfConcern~.,
  resamples = fish_cv,
  grid = boost_stoc_grid
)

# Identify best values from the tuning process
boost_stoc_rs %>%
  tune::show_best(metric = "roc_auc") %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the stochastic parameter values.")
mtry | sample_size | stop_iter | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
8 | 0.7172139 | 6 | roc_auc | binary | 0.8839072 | 10 | 0.0144816 | Preprocessor1_Model36 |
6 | 0.7955117 | 11 | roc_auc | binary | 0.8838613 | 10 | 0.0143617 | Preprocessor1_Model14 |
4 | 0.8387867 | 18 | roc_auc | binary | 0.8826144 | 10 | 0.0145272 | Preprocessor1_Model11 |
12 | 0.8546558 | 8 | roc_auc | binary | 0.8821623 | 10 | 0.0136444 | Preprocessor1_Model22 |
7 | 0.6991360 | 10 | roc_auc | binary | 0.8820257 | 10 | 0.0143307 | Preprocessor1_Model46 |
# Select best hyperparameters from tuning
boost_best_stoch <- boost_stoc_rs %>%
  tune::select_best("roc_auc")

# Finalize workflow
# Assemble the final specification with all of the optimized parameters and do a final fit
boost_final_spec <- parsnip::boost_tree(mode = "classification",
                                        engine = "xgboost",
                                        trees = 1000,
                                        learn_rate = best_learn$learn_rate,
                                        min_n = boost_best_trees$min_n,
                                        tree_depth = boost_best_trees$tree_depth,
                                        mtry = boost_best_stoch$mtry,
                                        loss_reduction = boost_best_trees$loss_reduction,
                                        stop_iter = boost_best_stoch$stop_iter,
                                        sample_size = boost_best_stoch$sample_size
)

# Set up workflow
wf_boost_final <- workflow() %>%
  add_recipe(fish_recipe) %>%
  add_model(boost_final_spec)

# Fit to just the training data (needed later)
final_simple_fit <- wf_boost_final %>%
  fit(data = fish_train)

# Final fit
final_boost_fit <- last_fit(boost_final_spec, IsOfConcern~., fish_split)

# Collect metrics from the fit
tibble_boost <- final_boost_fit %>% collect_metrics()
tibble_boost %>%
  kable(caption = "Accuracy and area under the receiver operating characteristic curve of the final fit.")
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.9438669 | Preprocessor1_Model1 |
roc_auc | binary | 0.8593097 | Preprocessor1_Model1 |
# Store accuracy and AUC metrics
final_boost_accuracy <- tibble_boost %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)

final_boost_auc <- tibble_boost %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)

# Bind predictions and original data
boost_test_rs <- cbind(fish_test, final_boost_fit$.predictions)[, -16]

# Compute a confusion matrix
cm_boost <- boost_test_rs %>% yardstick::conf_mat(truth = IsOfConcern, estimate = .pred_class)

# Create the confusion matrix heatmap plot
autoplot(cm_boost, type = "heatmap") +
  custom_palette + # Apply the custom color palette
  theme(
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    axis.title = element_text(size = 14),
    panel.background = element_blank(),
    plot.background = element_blank()
  ) +
  labs(title = "Confusion matrix of boosting predictions on test data")

# Calculate rates from the confusion matrix
TP_boost <- cm_boost$table[2, 2]
FP_boost <- cm_boost$table[2, 1]
TN_boost <- cm_boost$table[1, 1]
FN_boost <- cm_boost$table[1, 2]
TPR_boost <- TP_boost / (TP_boost + FN_boost) # True positive rate
FPR_boost <- FP_boost / (FP_boost + TN_boost) # False positive rate
TNR_boost <- TN_boost / (TN_boost + FP_boost) # True negative rate
FNR_boost <- FN_boost / (TP_boost + FN_boost) # False negative rate

# Add rates to cm df
boost_cm_vec <- c(TPR_boost, FPR_boost, TNR_boost, FNR_boost)
cm_df$Boosting <- boost_cm_vec
The accuracy of the boosting model was 0.944, also similar to the other models. Its AUC of 0.859 is the best of all the models by a small margin.
Model Results
Model selection
I want to compare accuracy and area under the curve of all models created.
# Name models in vec
models <- c("Dummy", "Lasso", "KNN", "Decision Tree", "Bagging", "Random Forest", "Boosting")

# Create accuracy vec
accuracy <- c(dummy, final_lasso_accuracy, final_knn_accuracy, final_tree_accuracy, final_bag_accuracy, final_forest_accuracy, final_boost_accuracy)

# Make df
accuracy_df <- data.frame(models, accuracy)

# Create a factor with the desired order for models
accuracy_df$models <- factor(accuracy_df$models, levels = c("Dummy", "Lasso", "KNN", "Decision Tree", "Bagging", "Random Forest", "Boosting"))
# Create the plot
ggplot(accuracy_df, aes(x = models, y = accuracy)) +
geom_col(fill = "#4a6c75") +
theme_minimal() +
labs(title = "Accuracy was similar across all models",
x = "Model",
y = "Accuracy") +
geom_text(aes(label = round(accuracy, 3)), vjust = -0.5) +
theme(plot.background = element_blank(),
panel.background = element_blank())
# Create AUC vec
auc <- c(final_lasso_auc, final_knn_auc, final_tree_auc, final_bag_auc, final_forest_auc, final_boost_auc)

# Make df (name the column so it can be referenced below)
auc_df <- data.frame(models = models[-1], auc)

# Create a factor with the desired order for models
auc_df$models <- factor(auc_df$models, levels = c("Lasso", "KNN", "Decision Tree", "Bagging", "Random Forest", "Boosting"))
ggplot(auc_df, aes(x = models, y = auc)) +
geom_col(fill = "#4a6c75") +
theme_minimal() +
labs(title = "Boosting was the highest performing model",
x = NULL,
y = "Area under ROC curve") +
geom_text(aes(label = round(auc, 3)), vjust = -0.5) +
theme(plot.background = element_blank(),
panel.background = element_blank())
Because the dataset is heavily imbalanced, with only a small proportion of threatened species, the accuracies of the models vary little. It is difficult to be much more accurate than always predicting "not threatened" (the dummy classifier), which already has an accuracy of 0.94. All models performed better than the dummy classifier, though, which is encouraging. Looking at AUC, we see that boosting decision trees was the most successful of the models when applied to the testing data.
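To make the imbalance point concrete, here is a small base-R sketch (using a made-up 94:6 class split that mirrors these data, not the actual test set) showing why a classifier that always predicts "not threatened" already scores 0.94 accuracy while its balanced accuracy is only 0.5:

```r
# Hypothetical class counts mirroring the ~94:6 imbalance in these data
n_not_threatened <- 940
n_threatened <- 60

# A dummy classifier that always predicts "not threatened"
TP <- 0; FN <- n_threatened      # misses every threatened species
TN <- n_not_threatened; FP <- 0  # gets every non-threatened species right

accuracy <- (TP + TN) / (TP + TN + FP + FN)                # 0.94
balanced_accuracy <- (TP / (TP + FN) + TN / (TN + FP)) / 2 # 0.5
```

Balanced accuracy (like the AUC used above) only rewards a model for actually separating the classes, which is why AUC is the more informative comparison here.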
I also want to compare the confusion matrix values of each model.
# Print table
cm_df %>%
  kable()
Metric | Lasso | KNN | DecisionTree | Bagging | RandomForest | Boosting |
---|---|---|---|---|---|---|
True positive rate | 0.1826087 | 0.0086957 | 0.3217391 | 0.2521739 | 0.2434783 | 0.2521739 |
False positive rate | 0.0082919 | 0.0000000 | 0.0165837 | 0.0121614 | 0.0121614 | 0.0121614 |
True negative rate | 0.9917081 | 1.0000000 | 0.9834163 | 0.9878386 | 0.9878386 | 0.9878386 |
False negative rate | 0.8173913 | 0.9913043 | 0.6782609 | 0.7478261 | 0.7565217 | 0.7478261 |
# Create a named vector with colors for each metric
color_palette <- c("True positive rate" = "#8ae68d",
                   "False positive rate" = "#F3A6DE",
                   "True negative rate" = "#1b4a1c",
                   "False negative rate" = "#B3297A")

# Plot the true positive etc. rates
cm_df %>%
  pivot_longer(cols = Lasso:Boosting) %>%
ggplot(aes(x = name, y = value, fill = Metric)) +
geom_col() +
theme_minimal() +
labs(x = NULL,
y = "Value",
title = "Rates of true positive, true negative, etc. for each model") +
theme(plot.background = element_blank(),
panel.background = element_blank()) +
scale_fill_manual(values = color_palette)
The decision tree had the highest true positive rate of all the models and the lowest false negative rate. However, it also had the largest false positive rate (although all false positive rates were generally small). In the context of investigating fish populations, false positives could lead to a misuse of resources. The true negative rate is comparable across all models. Based on these metrics and the AUC, either boosting or a decision tree appears to be a good option. Because I want to minimize false positive outcomes, I will proceed with the boosted model for predictions.
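As an aside, the per-model rate calculations above could be collapsed into a small helper. This is just a sketch, assuming the same table layout produced by `yardstick::conf_mat()` in the blocks above (predictions in rows, truth in columns, with the positive class in position 2):

```r
# Compute TPR, FPR, TNR, and FNR from a yardstick confusion matrix,
# assuming the positive class occupies row/column 2 as in the blocks above
cm_rates <- function(cm) {
  TP <- cm$table[2, 2]; FP <- cm$table[2, 1]
  TN <- cm$table[1, 1]; FN <- cm$table[1, 2]
  c("True positive rate"  = TP / (TP + FN),
    "False positive rate" = FP / (FP + TN),
    "True negative rate"  = TN / (TN + FP),
    "False negative rate" = FN / (TP + FN))
}

# e.g. cm_df$Boosting <- cm_rates(cm_boost)
```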
Variable Importance
I want to look at the variable importance of the most successful model, boosting.
# Variable importance
var_imp_boost <- wf_boost_final %>%
  fit(fish_train) %>%
  pull_workflow_fit() %>%
  vi() %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance))

var_imp_boost %>%
  ggplot(aes(x = Importance, y = Variable)) +
geom_col(fill = "#57ba72") +
theme_minimal() +
scale_x_continuous(expand = c(0, 0)) +
labs(y = "Feature",
x = "Importance",
title = "Feature importance of boosting model") +
theme(plot.background = element_blank(),
panel.background = element_blank())
It looks like length, other body shape, depth range (deep), and having no special electrogenic ability are driving the model the most.
Prediction Results
I want to use the boosted model to predict threatened/non-threatened status for species that are listed as 'Data Deficient' or NA on the IUCN Red List. I read in the uncategorized species, apply the same preparation steps to them, and then apply the model to the new data.
set.seed(123)

# Read in data for predictions
prediction_dat <- read_csv("/Users/elkewindschitl/Documents/data-sci/predict_data.csv")

# Remove columns not used above
predict_short <- prediction_dat %>%
  select(all_of(columns_meeting_threshold[1:13]))

# Convert the specified character columns to factors
predict_fac <- predict_short %>%
  mutate(across(all_of(character_columns_to_convert), as.factor))

predict_fac$Index <- 1:nrow(predict_fac)

# If a feature is a factor, DON'T order it; remove identifying columns
predict_fac <- predict_fac %>%
  mutate_if(is.ordered, .funs = factor, ordered = F)

predict_features <- predict_fac %>%
  select(-GenusSpecies) %>%
  select(-SpecCode) %>%
  select(-MainCommonName)

# Remove rows with any remaining missing values
predict_features_no_na <- na.omit(predict_features)

# Remove index column
predict_no_index <- predict_features_no_na[-11]

# Apply the model
predictions <- final_simple_fit %>% predict(new_data = predict_no_index)

# Pull name columns
species_names_p <- predict_fac %>%
  select(GenusSpecies, MainCommonName, Index)

# Bind to data and join with names
predicted_df <- cbind(predict_features_no_na, predictions) %>%
  left_join(y = species_names_p,
            by = join_by(Index))

# Isolate threatened species
predicted_threatened <- predicted_df %>%
  filter(.pred_class == 1) %>%
  select(GenusSpecies, MainCommonName)

# Display threatened species
predicted_threatened %>%
  kable(col.names = c("Genus Species", "Main Common Name"),
        caption = "Data deficient species predicted to be threatened.")
Genus Species | Main Common Name |
---|---|
Makaira mazara | NA |
Cephaloscyllium isabella | NA |
Carcharhinus perezii | NA |
Raja texana | NA |
Hippocampus guttulatus | Long-snouted Seahorse |
Anarhichas lupus | NA |
Pristis microdon | NA |
Zapteryx exasperata | Banded Guitarfish |
Myliobatis californica | NA |
Mobula japanica | NA |
Anarhichas orientalis | NA |
Epinephelus lanceolatus | Giant Grouper |
Dasyatis brevis | NA |
Raja cervigoni | NA |
Rhina ancylostomus | NA |
Raja herwigi | Cape Verde Skate |
Raja rouxi | NA |
Dasyatis tortonesei | Tortonese’s Stingray |
Pseudobatos glaucostigma | NA |
Echinophryne crassispina | NA |
Hemitrygon bennettii | NA |
Rhinobatos formosensis | NA |
Acroteriobatus ocellatus | Speckled Guitarfish |
Hippocampus coronatus | High-crowned Seahorse |
Lophius litulon | Yellow Goosefish |
Echinophryne reynoldsi | NA |
Mobula eregoodootenkee | NA |
Dipturus trachyderma | NA |
Tetronarce semipelagica | NA |
Hippocampus colemani | Coleman’s Pygmy Seahorse |
Squatina punctata | NA |
Of the 3560 data-deficient species with complete feature information, my model flagged 31 species as likely threatened.
Conclusions
Machine learning algorithms can be a helpful tool for classifying data that may otherwise be missing. Here, I predicted which data-deficient IUCN Red List saltwater fish species might be threatened based on a set of their ecological and morphological characteristics. However, it is important to note the limitations of my approach. Because some of the algorithms I used are sensitive to missing values in the features, I only worked with species that had complete data. Elusive fish may not be represented in FishBase and therefore might not be represented in my model. I also cannot apply the model to fish that are lacking data on both the Red List and FishBase. Additionally, these data were very imbalanced; a continuation of this analysis might explore techniques that help address imbalanced data.
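One such technique is oversampling the minority class with SMOTE, which slots into a tidymodels recipe via the themis package. Below is a minimal sketch of the idea, not code from this analysis: the toy data frame and its column names (`Threatened`, `Length`, `DepthRangeDeep`) are illustrative stand-ins for the real training data, and the recipe would replace the preprocessing used above.

```r
library(recipes)
library(themis)  # recipe steps for class imbalance, e.g. step_smote()

set.seed(123)

# Toy stand-in for the training data: 40 nonthreatened vs. 10 threatened
toy <- data.frame(
  Threatened     = factor(c(rep("0", 40), rep("1", 10))),
  Length         = rnorm(50, 50, 10),
  DepthRangeDeep = rnorm(50, 200, 50)
)

# SMOTE synthesizes new minority-class rows by interpolating between
# existing minority rows; predictors must be numeric and complete.
balanced_rec <- recipe(Threatened ~ ., data = toy) %>%
  step_smote(Threatened)

# Prep and bake to inspect the rebalanced training set
balanced <- prep(balanced_rec) %>% bake(new_data = NULL)
table(balanced$Threatened)  # minority class is oversampled toward balance
```

A recipe like this could be swapped into the existing `workflow()` in place of the current preprocessing, so the boosting model trains on balanced classes without touching the test set.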
References
1 “IUCN,” IUCN Red List of Threatened Species. Version 2022-1, 2022. https://www.iucnredlist.org/ (accessed Dec. 02, 2022).
2 B. J. Cardinale et al., “Biodiversity loss and its impact on humanity,” Nature, vol. 486, no. 7401, Art. no. 7401, Jun. 2012, doi: 10.1038/nature11148.
3 “IUCN,” IUCN Red List of Threatened Species. Version 2022-1, 2015. www.iucnredlist.org
4 M. J. Munstermann et al., “A global ecological signal of extinction risk in terrestrial vertebrates,” Conserv. Biol., vol. 36, no. 3, p. e13852, 2022, doi: 10.1111/cobi.13852.
5 R. Froese and D. Pauly, “FishBase,” 2022. www.fishbase.org
6 C. Boettiger, D. Temple Lang, and P. Wainwright, “rfishbase: exploring, manipulating and visualizing FishBase data from R.,” J. Fish Biol., 2012, doi: https://doi.org/10.1111/j.1095-8649.2012.03464.x.
Citation
@online{windschitl2023,
author = {Windschitl, Elke},
title = {ML {Model} {Comparison} for {Marine} {Fish} {Extinction}
{Risk}},
date = {2023-10-17},
url = {https://elkewind.github.io/posts/2023-10-17-ml-model-comparison},
langid = {en}
}