Introduction

Anacardium occidentale L. (cashew, caju, cajuí) is a species adapted to nutrient-poor soils and habitats that experience high temperatures and water stress (Serrano and Pessoa 2016). It belongs to the plant family Anacardiaceae and was last revised taxonomically by Mitchell and Mori (1987). Because of the high nutritive value of its seeds it was probably distributed widely in neotropical regions by human communities prior to 1500 (Mitchell and Mori 1987). Today this species has global economic importance and has spread by human agency to all regions of the tropics and subtropics, both cultivated in plantations and as naturalized self-reproducing populations, for example in India (Archak et al. 2009). It is characterized by the fruit ("double fruit" of Mitchell and Mori 1987) in which the peduncle develops into a yellow to red, fleshy, sweet and astringent hypocarp (cashew apple, maçã) supporting a hard-walled drupe (nut, castanha).

In Brazil, the agronomic production of cashew is concentrated in the Northeast region (Brainer 2021), where it began in the 1950s (Serrano and Pessoa 2016). In this region, in addition to the importance of the drupe as food, the hypocarp is used in juices, sweet conserves, cordials (cajuína), wine (Serrano and Pessoa 2016) and beer (Pereira et al. 2020). Another cashew product is cashew nut shell liquid (CNSL), used in industry and in research as a repellent and insecticide and for many other applications (Brainer and Vidal 2018; Silveira et al. 2019).

For centuries A. occidentale has been widely cultivated, selected and manipulated to obtain desirable characteristics, and random crossing has taken place between cultivated and wild populations, with naturalization of their descendants. These processes have resulted in great difficulty in determining the natural area of distribution of the species and for determining the natural patterns of infraspecific variation (Mitchell and Mori 1987; Andrade et al. 2019). According to Mitchell and Mori (1987), natural populations of A. occidentale occur in the cerrado, in littoral regions in restinga vegetation and in transitional areas with caatinga. These wild plants are commonly differentiated from domestic forms by their smaller fruits weighing no more than 3 g (Barros et al. 2002), and they are called cajuí instead of caju (Andrade et al. 2019). The wild fruits are collected and used by local populations both in the littoral region and in the interior (Farias et al. 2020).

The need to study the morphological and genetic variation of wild populations of A. occidentale (cajuizeiro) has become more urgent with growing human pressures within the ecotonal region of Piauí and Maranhão, not only because of the extractive harvesting of the fruits, but also because of the destructive collection of wood and even the clearing of individuals to enable other kinds of land use (Rufino et al. 2007). Uncontrolled extractive activities can alter the morphological characteristics of plants and affect their reproductive and vegetative structures (Vieira et al. 2016). The resulting smaller population sizes also reduce genetic variability and capacity for adaptation to future environmental changes, impacting the natural bank of variability of a species of great economic importance. There are to date relatively few molecular genetic studies of wild A. occidentale populations (Borges 2015; Borges et al. 2018; Santos et al. 2019; Gomes et al. 2021), and the possible erosion of its genetic variability by extractivism remains little studied. For the related cerrado species A. humile A.St.-Hil. there is a similar number of genetic studies of wild populations (Carvalho et al. 2012; Cota et al. 2017; Santos et al. 2020).

Brazil is considered to be the most important centre of natural diversity of the species (Mitchell and Mori 1987; Barros and Crisóstomo 1995; Silva-Luz et al. 2023), and the morphological variation observed there has led to the description of several distinct species (e.g. A. microcarpum Ducke and A. othonianum Rizzini) in cerrado, caatinga and restinga vegetation (see Andrade et al. 2019 for a more detailed discussion). However, in the opinion of taxonomic specialists of the genus (Mitchell and Mori 1987; Silva-Luz et al. 2023) and many ecologists who study cerrado vegetation (Castro et al. 2009; Castro 2013; Ratter et al. 2017), these names are all synonymous with A. occidentale and these authors recognize only a single variable species.

In Piauí and in eastern Maranhão there is a diversity of vegetation types including cerrado, caatinga, restinga and various transitional types (Castro 2013), in which extensive natural populations of A. occidentale trees occur, locally known as cajuizeiros. These two Brazilian states thus offer a suitable context for a study of infraspecific variation. Various research studies have been carried out in this region primarily aimed at differentiating natural restinga populations (restinga ecotype) from domesticated ones using morphometrics (Andrade et al. 2019), genetic variation (Santos et al. 2019) and bioactivity (Ribeiro et al. 2021). Souza et al. (2021) studied the ecology of wild populations in the restingas of Piauí in relation to dune migration.

Morphometric methods have previously been used in studies of A. occidentale for recognition and classification of accessions in germplasm collections representing different genotypes (see references in Andrade et al. 2019). Morphometric techniques have been shown to be very useful generally for delimitation of species (Marhold 2011), providing support for conservation inventories, recuperation of plant communities, species cataloguing and field identification (Barreto and Ferreira 2011).

Commenting on the lack of systematic studies of the infraspecific variation of wild populations of A. occidentale, Mitchell and Mori (1987) proposed an informal subdivision of the species into two kinds of wild cashews: the restinga ecotype occurring in coastal areas in eastern and northeastern Brazil, and the cerrado ecotype occurring in the savannas of central and Amazonian Brazil, Colombia, Venezuela and the Guianas as well as in caatingas transitional to cerrado in northeastern Brazil (Andrade et al. 2019). For Mitchell and Mori (1987), the main differences between the two ecotypes are morphological characters of the leaves, the size and acidity of the fruits, and the architecture of the canopy (Mitchell and Mori 1987 p. 40). Hitherto, however, this hypothesis has not been tested by more detailed analytical methods. Our objective in the present study was thus to verify whether the two ecotypes could be recognized by morphometric study of wild populations and determine their most diagnostic characters.

Material and methods

Study design

This study compared 180 individuals from six wild populations, three from restinga areas and three from cerrado areas. Two partition models of the individuals were evaluated in the analyses: (1) the six populations, justifiable as natural groups on population genetic theoretical grounds (Picó et al. 2021: 4); (2) the two ecotypes, modelling the hypothesis of Mitchell and Mori (1987), each represented by three populations that were assigned to ecotypes based on their geographical location (littoral zone/interior of continent) and vegetation type (restinga/cerrado-caatinga transition).

The analyses sought to quantify how well the individuals would partition into the two ecotypes and six populations using cross-validation procedures applied to classification algorithms and also establish the relative similarities of the six populations using cluster analysis of population mean vectors. Diagnostic morphological characters of the two ecotypes were determined using univariate methods.

Field collection authorization

This research is registered in the National System of Genetic Patrimony and Associated Traditional Knowledge (SISGEN) with the registration code A8B44BB. The collecting license was authorized by the System of Authorization and Information on Biodiversity (SISBIO) of the Chico Mendes Institute for the Conservation of Biodiversity, ICMBIO (no. 64340-1).

Geography and ecology of sample sites

The association of the cerrado and restinga vegetation types with the division of A. occidentale into two ecotypes was clearly intended by Mitchell and Mori (1987) to be broadly interpreted; thus their Table IV (p. 40) uses the terms "Coastal Brazil" for the restinga ecotype and "Central Brazil" for the cerrado ecotype. Our interpretation of the ecotypes follows this looser terminological meaning. Both cerrado and restinga are very heterogeneous vegetationally and are understood by ecologists as variable domains rather than highly specific formations (e.g. Castro 2013; Santos-Filho 2009; Santos-Filho et al. 2010; Santos-Filho and Zickel 2013). Andrade et al. (2019) experimented with further subdivisions by categorizing one set of coastal populations as "dune" because of their occurrence on Quaternary dune systems and their highly characteristic dwarfed physiognomy, but there are other coastal populations located on Tertiary geology which are taller and yet exposed to maritime influences. We have termed all these as restinga ecotype populations. Similarly, in the case of the cerrado ecotype, we have grouped under this heading populations from typical cerrado and those from vegetation showing clear transitional features to caatinga.

The distance between study sites varied from a minimum of 13 km to a maximum of 250 km. The three restinga sites are located in the municipalities of Araioses (Maranhão state) and Parnaíba and Cajueiro da Praia (both in Piauí state). The cerrado study areas are located in the municipalities of Campo Maior, José de Freitas and in the Sete Cidades National Park (Fig. 1, Table 1), all in Piauí state. The climate over the study region is Aw type in the Köppen-Geiger classification (Kottek et al. 2006), characterised by a hot rainy summer season, a dry winter season, highest rainfall between January and June and mean annual rainfall varying between 1000 and 1600 mm (Lima et al. 2017).

Fig. 1 Sampling localities for wild Anacardium occidentale (cajuí): restinga and cerrado ecotypes

Table 1 Collecting localities, sample sizes and voucher specimens for each study population

All the restinga sites lie within the Environmental Protection Area of the Rio Parnaíba Delta, which extends over 311,731 hectares with a perimeter of 1,473 km and includes coastal municipalities of the states of Maranhão, Piauí and Ceará, with altitudes from 0 to 45 m (Ivanov 2020). The geomorphology of the Canárias and Labino areas consists of Quaternary formations of quartzitic sands (Jacomine et al. 1986), whereas that of the region further east, where Cajueiro da Praia is situated, is dominated by the Tertiary formation of Barreiras sandstones (MME 2006). Besides restinga, the vegetation of the Delta includes caatinga, cerrado, mangrove and tabuleiro plant communities (Fernandes et al. 1996; Castro 2007) and vegetation transitional between cerrado and caatinga which is typical of a large part of the northern region of Piauí (Santos-Filho et al. 2010).

The cerrado localities lie in the continental interior and cover a wider area, but all within the northern part of Piauí state (Fig. 1). The Sete Cidades National Park has an area of 6,221 ha and a perimeter of 36 km and is a Federal Conservation Unit of Total Protection, located between the municipalities of Piracuruca and Brasileira (04°02′-08′ S, 41°40′-45′ W) (ICMBIO 2018), between altitudes of 100 and 290 m. The vegetation is a mosaic of various types of cerrado, and the sample site lies within cerrado in the strict sense, which occupies the greatest area within the Park (Oliveira et al. 2010). The Campo Maior site lies within the Campo Maior microregion, which has an area of 1,657 km² with vegetation transitional between cerrado and caatinga and including gallery forests with carnaúba (Copernicia prunifera (Mill.) H.E.Moore), periodically flooded caatinga, hyperxerophilous caatinga, sub-deciduous cerrado, sub-deciduous forest and carrasco (Farias and Castro 2004). The sampled area was in a caatinga-cerrado transitional vegetation with greater similarity to typical cerrado (Sousa et al. 2021). The José de Freitas site lies within the eponymous municipality in the Teresina microregion, which extends over an area of 1,538 km² (IBGE 2018; Aguiar and Gomes 2004) with vegetation similar to that in the Campo Maior region; the sample site was similarly selected in an area of more typical cerrado.

Field sampling

Field collections were carried out mainly in the dry season (July to December) of the years 2018, 2019, 2020 and 2021 when this species flowers and fruits. Each population was visited twice for field observation and collection of flowering and mature fruiting material. The same marked trees were sampled on successive visits.

For the morphometric analyses, thirty individuals from each of the six populations were sampled by traversing each study area and selecting trees more than 10 m apart for collection of sample material. From each individual, five branches that included leaves, flowers and fruits were selected, distributed as far as possible in different parts of the canopy, making a total of 150 branches from each population. The branches were sealed in plastic bags at the collecting sites and then transferred to a refrigerator at the HDELTA herbarium in the Federal University of the Parnaíba Delta, Parnaíba, Piauí (UFDPar). Digital images of bark, leaves, flowers and fruits (Figs. 2, 3) were made in the field.

Fig. 2 Vegetative and reproductive organs used in the morphometric study of Anacardium occidentale (cajuí). Cerrado ecotype in Serra de Passa-Tempo, Campo Maior, Piauí. a General view of habit. b Colour of trunk. c Fruits showing yellow hypocarps. d Paniculate inflorescence. e Preparation of herbarium specimen voucher. f Measurement of hypocarp and drupe

Fig. 3 Vegetative and reproductive organs used in the morphometric study of Anacardium occidentale (cajuí). Restinga ecotype in Labino, Piauí. a General view of habit. b Paniculate inflorescences and leaves. c Trunk. d Colour of trunk and bark slash

Voucher herbarium specimens representing each population with leaves, flowers and fruits were collected separately in each sampling area. The specimens were prepared according to the methodology of Mori et al. (1989) and incorporated into the HDELTA herbarium at the Ministro Reis Velloso Campus of the Federal University of the Parnaíba Delta (UFDPar) in Parnaíba, Piauí, Brazil (Table 1).

Morphometric characters and data collection

Measurements of plant height, crown width, trunk length and perimeter, and recording of internal bark colour were all made in the field. Plant height was measured with a measuring rod subdivided into metre units (Fig. 3a). A metric measuring tape was used to measure trunk perimeter. Measurement and recording of all other variables and additional photography (Online Resource 1a) were carried out in the HDELTA herbarium, where a millimetre ruler was used for dimensions of the leaf, flower, inflorescence, hypocarp and drupe and a caliper was used for measuring leaf thickness. The leaves and inflorescences were pressed and dried and then measured from dried material. The flowers were heated in distilled water and measured once rehydrated. The fruit (drupe and hypocarp) was measured using fresh material or material kept in a refrigerator. Each of the five replicate branches of each sampled individual was examined in turn. One mature leaf at the third or fourth node below the inflorescence was selected for measurement. Secondary vein number on the left and right sides of the leaf blade was recorded separately and then combined into the single variable of total number of secondary veins per leaf. Five masculine and five hermaphrodite flowers and five mature compound fruits (drupe and hypocarp) were measured from each individual. The terminology of vegetative and reproductive characters was based on that of Mitchell and Mori (1987). A total of 42 quantitative and five qualitative characters were recorded (Table 2, Online Resource 1g). The qualitative characters were selected based on information from local informants who collect fruits from wild populations, and we added them as they were said to be indicative of different variants. They are internal bark colour (red/brown red), leaf blade overall shape (elliptic/obovate), apex shape (acute/rounded/emarginate), base shape (acute-convex/acute-cuneate/obtuse) and hypocarp colour (yellow/red).

Table 2 Morphometric analysis of wild Anacardium occidentale

Data analysis

The univariate analyses, based on all 42 recorded quantitative variables (Online Resource 1g), showed that most variables had non-normal distributions and heterogeneous variances between populations (Online Resource 1b, Online Resource 2). Because bract, style and pistil length showed no significant difference between either populations or ecotypes, they were excluded from the data set for multivariate analyses (Online Resource 1g). On the other hand, variables that showed no significant difference between ecotypes but did show significant differences between populations were included in the multivariate analyses (Table 2, Online Resource 2).

The data sets used for multivariate analyses differed according to the method used, each of which has its own limitations. For principal component analysis, linear discriminant analysis of ecotypes and k-nearest neighbour analysis the matrix of 39 quantitative variables was used (Table 2). For linear discriminant analysis of populations a reduced matrix of 29 quantitative variables was used (Table 2 indicates the ten characters not used), in order that the number of variables was less than the sample size of the populations (N = 30 in all cases). For the classification and regression tree analysis the five qualitative variables were added to the 39 quantitative ones (Table 2). In the data matrices used for multivariate analysis, the mean value of the five replicate measurements was used for quantitative variables, and the modal state for qualitative ones since the character states were non-numerical.
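The reduction of the five replicate measurements to one value per individual can be illustrated with a minimal Python sketch. It is not the authors' R code; the function name, the variable names and the replicate values are hypothetical and serve only to show the mean/mode rule described above.

```python
from collections import Counter
from statistics import mean


def summarize_individual(quantitative, qualitative):
    """Collapse replicate measurements of one tree into a single data row.

    quantitative: dict mapping variable name -> list of replicate values
    qualitative:  dict mapping variable name -> list of replicate states
    """
    # Quantitative variables: arithmetic mean of the replicates.
    row = {name: mean(values) for name, values in quantitative.items()}
    for name, states in qualitative.items():
        # Qualitative variables: modal (most frequent) state. Ties are
        # resolved by first occurrence, an arbitrary choice of this sketch.
        row[name] = Counter(states).most_common(1)[0][0]
    return row


# Hypothetical replicate values for one tree (not data from the study).
tree = summarize_individual(
    {"leaf_blade_length_cm": [10.0, 11.0, 12.0, 11.0, 11.0]},
    {"hypocarp_colour": ["yellow", "yellow", "red", "yellow", "yellow"]},
)
print(tree)
```

The source does not say how ties between qualitative states were handled, so the tie rule here is an assumption.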

The data analyses were based on those described by Andrade et al. (2019) and implemented using the statistical language R (R Core Team 2018). Univariate analyses were carried out at three levels: entire data set, the two ecotypes and the six populations. These determined extreme values, distribution, mean, median, maximum and minimum values, standard deviation (SD), standard error of the mean (SE) and coefficient of variation (CV). Tests were computed at each level for normality, variance homogeneity and difference of means and medians (t-test, Wilcoxon rank sum, multiple comparisons). The results of these analyses were used to determine the most important diagnostic characters of the predetermined ecotype groups (Online Resource 2).

The major dimensions of variation of the overall data set were determined using principal component analysis (PCA) of the 39 quantitative variables (Table 2), and this showed which variables were most influential on each PC (principal component). The significance of the principal components was computed using the function evplot (Borcard et al. 2011: 121–124) which applies the Kaiser-Guttman test and the broken-stick model to provide two views of the number of PCs to be interpreted. The estimation of the minimum number of informative principal components in a PCA is inherently empirical (for further discussion see Borcard et al. 2011; Legendre and Legendre 2012). The Kaiser-Guttman and broken-stick methods usually give different results, the former being generally more generous than the latter, and thus together they serve as a guide to the likely range of interpretable PCs. The PCA was carried out using the function prcomp in the stats package in R (R Core Team 2018) with scaled data (argument: scale = TRUE).
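The two stopping rules can be sketched in a few lines of Python. This is an illustration and not the evplot function itself: the eigenvalues are hypothetical, and counting only the leading components that exceed the broken-stick expectation is one common reading of that model, assumed here.

```python
def kaiser_guttman(eigenvalues):
    """Number of components whose eigenvalue exceeds the mean eigenvalue."""
    mean_ev = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for ev in eigenvalues if ev > mean_ev)


def broken_stick(p):
    """Expected proportion of variance of each of p components (broken-stick model)."""
    return [sum(1.0 / i for i in range(k, p + 1)) / p for k in range(1, p + 1)]


def broken_stick_count(eigenvalues):
    """Number of leading components whose share of variance exceeds the expectation."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    count = 0
    for ev, expected in zip(ordered, broken_stick(len(ordered))):
        if ev / total <= expected:
            break  # stop at the first component that fails the criterion
        count += 1
    return count


# Hypothetical eigenvalues from a PCA of five scaled variables.
eigs = [2.5, 1.2, 0.7, 0.4, 0.2]
print(kaiser_guttman(eigs), broken_stick_count(eigs))
```

With these hypothetical eigenvalues the Kaiser-Guttman rule retains two components and the broken-stick model one, the same direction of difference as described in the text.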

We used three different classification methods (LDA, KNN, CART) in exploratory mode to obtain multiple views of the similarity relationships of the ecotype and population categories. We evaluated the consistency of these a priori categories using nonparametric methods, that is, the degree to which the individuals of each category grouped together in each analysis. For LDA and KNN, the results were quantified using leave-one-out cross-validation (James et al. 2013). In this procedure each individual in turn is removed from the matrix and treated as a test set; the classification method (LDA or KNN) is then fitted to the reduced matrix and the test individual is assigned to the predicted population or ecotype. For each such category, the percentage of erroneous assignments of its individuals (i.e. to populations or ecotypes other than those to which they belong) indicates its degree of morphological distinctness, and the destinations of the misclassified individuals reveal the other categories with which it overlaps, according to the classification method used. For CART, the consistency of the population and ecotype categories was evaluated by computing confusion matrices which showed the percentage of individuals erroneously assigned to categories other than their own.

Linear discriminant analysis is a very widely used technique based on covariance matrices of the data, which seeks the maximum separation of the centroids of each a priori group (populations, ecotypes) while minimizing within-group variance. It uses eigenanalysis to produce axes (the discriminant functions) on which the individuals can be plotted, analogous to PCA. Although our data did not meet the LDA assumptions of multivariate normality and homogeneity of covariance matrices, we used it because one result of its model is to show which variables are most influential in separating the groups, which provides a useful guide for other studies. Preliminary analyses showed that the populations were rather distinct using the nonparametric techniques (KNN, CART); because the results of the three techniques were broadly similar, we consider the use of LDA justified, while requiring a critical attitude to results which diverge noticeably from those of the other techniques and could thus be artefactual. In the LDA of populations, the 39 quantitative variables were reduced to 29 (Table 2) by deleting the ten variables with the least significant univariate difference between ecotypes (Online Resource 1g); this was to ensure that the number of variables was less than the sample size of the smallest group (30 individuals), following Legendre and Legendre (2012). The homogeneity of the covariance matrices of the populations was tested using functions from the vegan package (Oksanen et al. 2018), and the LDAs were carried out using the lda function in the MASS package (Ripley et al. 2022). Scaled (as in PCA) data were used to compute the discriminant function axes and unscaled data to carry out the cross-validation analysis (following Legendre and Legendre 2012: 675–676). The function ELLI (Claude 2008: 85) was used to plot 95% confidence ellipses around each population in the ordination plots.

For the LDA with ecotypes as categorical variable, the full quantitative data set of 39 variables was used. Leave-one-out cross-validation was carried out using the CV = TRUE argument in the lda function. The results of the cross-validation were visualized using a bar chart in which each individual is represented by a vertical bar and the posterior probability of belonging to different populations is indicated by the number and length of the coloured segments of each bar.

The k-nearest neighbour (KNN) analysis used all 39 quantitative variables (Table 2) to classify each individual into populations and ecotypes. The method computes the Euclidean distance between each individual and all others and selects its k nearest neighbours. The individual is then allocated to the population which has the majority representation among these k neighbours. The data set was scaled, following Varmuza and Filzmoser (2009). The optimal value of k (the neighbourhood) was first established by testing a range of values (k = 1 to 40 or 60) and determining which value of k produced the smallest number of allocation errors. This test was carried out separately for the two categorical variables used: populations and ecotypes. Once the optimal value of k was established, the result of the KNN analysis was tested by cross-validation. The KNN analyses were carried out using the function knn.cv in the class package (Ripley and Venables 2022).
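The KNN procedure with leave-one-out cross-validation can be sketched as follows. This is a standard-library Python illustration, not the knn.cv function or the authors' scripts; the function names and the toy measurements are hypothetical, and ties in the vote are broken deterministically here, which is a simplification.

```python
from collections import Counter
from math import dist
from statistics import mean, stdev


def scale_columns(rows):
    """Centre each variable and divide by its sample standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def knn_loo_errors(rows, labels, k):
    """Count leave-one-out misclassifications of a k-nearest-neighbour vote."""
    errors = 0
    for i, row in enumerate(rows):
        # Euclidean distances from the held-out individual to all others.
        others = [(dist(row, r), lab)
                  for j, (r, lab) in enumerate(zip(rows, labels)) if j != i]
        others.sort(key=lambda pair: pair[0])
        votes = Counter(lab for _, lab in others[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted != labels[i]:
            errors += 1
    return errors


def best_k(rows, labels, k_values):
    """Value of k with the fewest errors (the smallest k wins ties)."""
    return min(k_values, key=lambda k: (knn_loo_errors(rows, labels, k), k))


# Hypothetical two-variable measurements for eight trees.
data = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [1.1, 2.2],
        [5.0, 6.0], [5.2, 5.9], [4.9, 6.1], [5.1, 6.2]]
groups = ["cerrado"] * 4 + ["restinga"] * 4
scaled = scale_columns(data)
print(knn_loo_errors(scaled, groups, 3), best_k(scaled, groups, [1, 3, 5]))
```

The same error count, computed per category, gives the percentage of erroneous assignments used to express the distinctness of each population or ecotype.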

The classification and regression tree analysis (CART) data set included five qualitative variables together with the 39 quantitative variables used in the PCA, LDA and KNN analyses (Table 2). The algorithm for classification trees was used; this divides the whole set of individuals (the root node) by each variable and selects that which produces the least impure (heterogeneous) pair of subgroups (nodes); node impurity was measured using the Gini index (Varmuza and Filzmoser 2009; Foulkes 2009). The two subgroups are then each divided in the same way into two further subgroups after a new search for the variable which results in the least impure subgroups and this procedure continues until some stopping point is reached, e.g. terminal groups (leaves) containing an arbitrary minimum number of individuals. Each terminal group is named by the population which is represented by the greatest number of individuals. The final number of nodes in the tree is determined by an optimization procedure called cost complexity pruning (Foulkes 2009), which finds an optimal number of nodes by reconciling maximum node purity (minimum level of heterogeneity) with the minimum number of terminal nodes (maximum tree simplicity). The CART analyses were carried out using the rpart (Therneau and Atkinson 2019) and rpart.plot (Milborrow 2020) packages.
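The node-splitting step can be illustrated with a short Python sketch for a single quantitative variable. It is not the rpart implementation; the function names and the drupe lengths are hypothetical, and using midpoints between successive observed values as candidate thresholds is an assumption of this sketch.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity of the two child nodes of a binary split."""
    left = [lab for v, lab in zip(values, labels) if v < threshold]
    right = [lab for v, lab in zip(values, labels) if v >= threshold]
    n = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_threshold(values, labels):
    """Threshold (midpoint between observed values) giving the least impure split."""
    uniq = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


# Hypothetical drupe lengths (mm) for six trees of two ecotypes.
lengths = [2.0, 2.5, 3.0, 6.0, 6.5, 7.0]
ecotypes = ["cerrado"] * 3 + ["restinga"] * 3
print(gini(ecotypes), best_threshold(lengths, ecotypes))
```

A full tree repeats this search over every variable at every node and is then pruned; both steps are omitted here.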

Population variability was estimated as the average value computed from the coefficients of variation of each of the 39 quantitative variables in each population. The significance of the morphological differences between populations and ecotypes was computed using the adonis function in package vegan (Oksanen et al. 2018) which carries out a nonparametric multivariate analysis of variance (NPMANOVA). The multiple comparisons of the populations based on the NPMANOVA were computed using the pairwise.perm.manova function of the RVAideMemoire package (Hervé 2018).
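The population variability measure can be sketched in Python as below. The function names and the values are hypothetical, and the use of the sample standard deviation with the CV expressed as a percentage is an assumption, since the source does not specify either.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * stdev(values) / mean(values)


def population_variability(columns):
    """Average CV over all quantitative variables of one population."""
    return mean(coefficient_of_variation(col) for col in columns)


# Hypothetical measurements: two variables recorded on three trees.
columns = [[10.0, 12.0, 14.0], [2.0, 2.0, 2.0]]
print(population_variability(columns))
```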

Hierarchical cluster analysis (UPGMA: unweighted pair group method with arithmetic mean) was carried out on a matrix of population mean vectors computed from the data set of 39 quantitative variables. This was to provide a separate view of the similarities of the six populations and to verify if they formed two clusters corresponding to the ecotypes. A Euclidean distance matrix was computed using the function dist, and the tree was computed with the function hclust using the method argument "average". Both functions belong to the R base package stats.
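Average-linkage clustering of mean vectors can be sketched in Python as follows. It is a minimal illustration and not the hclust function: the function name, the labels A to D and the mean vectors are hypothetical, and the distance between two clusters is taken as the mean of all pairwise Euclidean distances between their members.

```python
from itertools import combinations
from math import dist


def upgma(points):
    """Average-linkage agglomerative clustering of named mean vectors.

    points: dict mapping name -> vector.
    Returns the merges in order as (members_a, members_b, height).
    """
    def avg_distance(pair):
        a, b = pair
        # Mean of all pairwise distances between members of the two clusters.
        return sum(dist(points[x], points[y]) for x in a for y in b) / (len(a) * len(b))

    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=avg_distance)
        merges.append((a, b, avg_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical mean vectors for four populations.
means = {"A": [0.0, 0.0], "B": [1.0, 0.0], "C": [10.0, 0.0], "D": [11.0, 0.0]}
merges = upgma(means)
for a, b, height in merges:
    print(a, b, round(height, 2))
```

In this toy example the last merge joins the clusters {A, B} and {C, D}, the kind of two-cluster structure that would correspond to two ecotypes.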

The classification of the populations into the two ecotypes was also tested by running cross-validated KNN analyses on all ten distinct partitions of the six populations into two groups of three and comparing the error rates of the partitions. A range of other two-group partitions were also tested. R scripts were written for these analyses (see Online Resource 3 for all R scripts used in this study).
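The count of ten distinct partitions follows from fixing one population in the first group and choosing its two companions from the remaining five, C(5,2) = 10. A short Python sketch (not one of the authors' R scripts; the function name is hypothetical) enumerates them:

```python
from itertools import combinations

# Population names as used in the text and figure captions.
populations = ["Campo Maior", "José de Freitas", "Sete Cidades",
               "Barrinha", "Ilha das Canárias", "Labino"]


def three_three_partitions(items):
    """All distinct splits of six items into two unordered groups of three."""
    first, rest = items[0], items[1:]
    partitions = []
    # Fixing the first item in group A avoids counting each split twice.
    for pair in combinations(rest, 2):
        group_a = (first,) + pair
        group_b = tuple(x for x in items if x not in group_a)
        partitions.append((group_a, group_b))
    return partitions


partitions = three_three_partitions(populations)
print(len(partitions))
for group_a, group_b in partitions:
    print(group_a, "|", group_b)
```

Each partition can then be passed as the grouping variable to a cross-validated KNN run and the error rates compared; the ecotype hypothesis corresponds to the first partition listed.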

Results

Univariate diagnostics of the ecotypes

Tests showed significant differences between ecotypes in 31 of the quantitative variables (Online Resource 1g). Twenty-two of the quantitative variables were found to differ between the ecotypes with high significance (Table 3). Of these, the eight vegetative variables showed higher values in the cerrado ecotype and the 14 flower and fruit variables had higher values in the restinga ecotype. To mention only the differences of greatest statistical significance, the cerrado populations had longer trunks, longer leaf blades, wider petioles and higher ratio of leaf blade length to petiole length; the restinga populations had larger hypocarps and drupes and more stamens per flower (in both flower types).

Table 3 Univariate comparison of the 22 most diagnostic quantitative variables of the cerrado and restinga ecotypes

Multidimensional variation

The major univariate differences were reflected in the principal component analysis. The number of interpretable principal components ranged from five (broken-stick model) to ten (Kaiser-Guttman test). The first ten principal components expressed only 71.37% of the total variance (the first five expressed 53.81%), indicating a relatively low level of correlation among the 39 quantitative variables. The ordination of the first two principal components (Online Resource 1c), representing 32.42% of total variance, suggested that within-population variability is lower in restinga populations (the 95% confidence ellipses are smaller) than in the cerrado. The biplot (Fig. 4) and the loadings (Online Resource 1d) showed that vegetative and floral characters are equally important in the overall variation of the data set.

Fig. 4 Biplot superimposed on first two principal components (32.42% of total variance) of PCA using 39 quantitative vegetative and reproductive characters (red arrows) from 180 individuals (grey numbers) in six wild populations of Anacardium occidentale. Cerrado ecotype: Campo Maior, José de Freitas, Parque Nacional de Sete Cidades. Restinga ecotype: Barrinha, Ilha das Canárias, Labino. Computed with R (R Core Team 2018)

On PC1, individuals with the greatest positive scores (mostly restinga plants) tend to have longer and wider hypocarps and drupes and in general floral variables with higher values. Individuals with negative scores (predominantly from the cerrado) tend to have longer leaf blades, wider petioles, taller trunks and leaf blades which are relatively longer than wide, among other characters (Online Resource 1d). On the PC2 axis the individuals which have positive scores (mostly from restinga populations and Sete Cidades) tend to have more stamens in the masculine and hermaphrodite flowers, longer petioles, wider inflorescences and tree crowns, wider hypocarps and longer inflorescences, while those with negative scores have larger leaf blade length/petiole length ratios, longer leaf blades, greater corolla diameter in hermaphrodite flowers, higher total number of leaf secondary veins and wider petioles.

Classification of the populations and ecotypes

Populations

The distinctness of the populations was evaluated by linear discriminant analysis (LDA), and their separation on the first two discriminant function axes is shown in Fig. 5.

Fig. 5 Linear discriminant analysis (LDA) of 29 quantitative vegetative and reproductive characters from 180 individuals in six wild populations of Anacardium occidentale. Ordination on the first two discriminant function axes (80.0% of total variance). Cerrado populations: Campo Maior, José de Freitas, Parque Nacional de Sete Cidades. Restinga populations: Barrinha, Ilha das Canárias, Labino. Computed with R (R Core Team 2018)

In this two-dimensional space, representing 80.0% of total variance, the three cerrado populations overlap and together are partially separated from the restinga group, of which Barrinha is the most distinct. The greatest overlap between ecotypes is between the Labino (restinga) and José de Freitas (cerrado) populations. The results of the cross-validation test are given in Table 4 and Fig. 6 and show a total of 22 misclassifications (12.2% of total) among the populations.

Table 4 Cross-validation test of the LDA using populations as categorical variable and data set of 29 variables

Fig. 6

Cross-validation test of linear discriminant analysis (LDA) of 29 quantitative vegetative and reproductive characters from 180 individuals in six wild populations of Anacardium occidentale. Each bar represents an individual; in bars with more than one coloured segment, segment length corresponds to the probability that the individual belongs to the population of that colour. Cerrado populations: Campo Maior, Sete Cidades and José de Freitas. Restinga populations: Barrinha, Ilha das Canárias and Labino. Computed with R (R Core Team 2018)

The proportion of correctly classified individuals (bold values in Table 4) is lowest in Campo Maior and Labino, which are the populations showing greatest interpopulation overlap in Fig. 5. However, while Campo Maior (cerrado) has a single individual allocated to a restinga population, Labino (restinga) has misclassifications to all other populations except Sete Cidades, also reflected in the relative positions of the populations in the ordination of Fig. 5.

The most distinct populations are Sete Cidades and Barrinha, in which the mis-allocated individuals are assigned to other populations of the same ecotype (Fig. 6). Table 4 and Fig. 6 also show that the overlap of José de Freitas with Campo Maior is lower than with Labino, contradicting the impression given in Fig. 5. This probably results from distortion introduced by the two-dimensional ordination, whereas the cross-validation uses all five discriminant functions of the LDA. Figure 6 expresses the similarities of individuals in more detail than Table 4, since individual bars with two or more colours reveal the range of populations to which they are most similar.

The KNN cross-validation analysis (Online Resource 1h, e) generated 28 errors overall (15.6%) among the populations. The KNN and LDA analyses agreed at population level in showing higher consistency for the Sete Cidades and Barrinha populations, but the KNN result differed in the lower consistency of José de Freitas and the higher consistency of Labino. As in the LDA, Campo Maior showed relatively low consistency and Canárias had an intermediate level. In both the LDA and KNN analyses, the Labino population as reconstituted by cross-validation showed the greatest mixture of individuals from other populations (Table 4, Fig. 6, Online Resource 1h and 1e).
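As an illustration of how a cross-validated KNN error count of this kind can be obtained, the following minimal Python sketch holds out each individual in turn and classifies it from the remaining ones. It is not the authors' R code: the value of k, the Euclidean distance, the function names and the two-variable toy data are all assumptions made for the example, and the toy values are invented rather than taken from the study.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_errors(samples, k=3):
    """Count individuals misclassified when each one is held out in turn."""
    errors = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # everything except individual i
        if knn_predict(train, features, k) != label:
            errors += 1
    return errors


# Invented toy measurements (two variables per individual), NOT data from the study.
# The last "B" individual sits inside the "A" cloud and is expected to be misassigned.
toy = [
    ((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.8, 1.1), "A"), ((1.1, 1.3), "A"),
    ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.8, 3.1), "B"), ((1.3, 1.1), "B"),
]

n_errors = leave_one_out_errors(toy, k=3)
print(f"{n_errors} of {len(toy)} misclassified ({100 * n_errors / len(toy):.1f}%)")
```

With real morphometric data the variables would normally be standardized first, so that characters measured on larger scales do not dominate the distances.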

The CART analysis (Fig. 7, Online Resource 1i) resulted in 26 misclassifications (14.4% of total) among the populations. The first node, i.e. the whole set of individuals, is divided by the colour of the internal bark, a qualitative character. The subgroups are then divided by quantitative fruit and floral variables (hypocarp width, length of long stamen in male flower, length of staminal tube in male flower, corolla diameter of male flower). Table SM3 shows a high level of correct classification except in Campo Maior (cerrado), where only ten were correct and 20 were assigned to other nodes, including Canárias (restinga). According to this analysis, the Campo Maior population is the least distinct in its morphology. In the CART analysis of the two ecotypes the single division was made on the quantitative variable hypocarp width (Online Resource 3).

Fig. 7

Optimized classification tree (CART) based on 39 quantitative and five qualitative vegetative and reproductive characters from 180 individuals in six wild populations of Anacardium occidentale. Gini index used as impurity measure. Computed with R (R Core Team 2018). Numerical values at nodes are in mm; terminal groups named for population with most included individuals; paired numbers represent number of correct classifications from named population/total individuals classified in group; percentage represents proportion of total study sample (180)

Overall, the LDA cross-validation generated the fewest errors, followed by CART and then KNN. The CART analysis agreed with the others in showing high consistency in Sete Cidades and Barrinha, but differed from them in showing a much lower consistency for Campo Maior and in identifying reconstituted Canárias as the most heterogeneous terminal group.
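The percentages quoted for the three population-level analyses follow directly from the error counts and the sample size of 180 individuals. The short Python sketch below simply reproduces that arithmetic and the resulting ranking; the variable and function names are illustrative only.

```python
# Misclassification counts reported in the text for the population-level analyses.
N_INDIVIDUALS = 180  # total sample size stated in the figure captions

errors = {"LDA": 22, "CART": 26, "KNN": 28}


def error_rate(n_errors, n_total=N_INDIVIDUALS):
    """Return the misclassification rate as a percentage."""
    return 100.0 * n_errors / n_total


# Round to one decimal place, as in the text.
rates = {method: round(error_rate(n), 1) for method, n in errors.items()}

# Rank the methods from fewest to most errors.
ranking = sorted(rates, key=rates.get)

for method in ranking:
    print(f"{method}: {errors[method]} errors, {rates[method]}%")
```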

The cerrado populations were individually more variable than the restinga ones (Online Resource 1j) when the variable trunk height was omitted from estimation of overall population coefficient of variation, Sete Cidades being the most variable and Barrinha the least. The multiple comparisons computed from nonparametric MANOVA (NPMANOVA, Online Resource 1k) showed that the cerrado populations were not significantly different from one another morphologically, but were in all cases significantly different from the restinga populations. On the other hand, the restinga populations differed more from one another, and only the pair Labino and Canárias was not significantly different. NPMANOVA also showed a significant morphological difference between the ecotypes (Online Resource 1m, P < 0.001).

The UPGMA analysis of population mean vectors clearly grouped the cerrado and restinga populations into separate clusters; within each cluster, one population (José de Freitas in the cerrado, Barrinha in the restinga) was distinct from the other two (Fig. 8).

Fig. 8

Cluster analysis (UPGMA) of the six population mean vectors based on 39 quantitative vegetative and reproductive characters from 180 individuals in six wild populations of Anacardium occidentale. Computed with R (R Core Team 2018)

Ecotypes

The LDA differentiation of the two ecotypes is shown in Fig. 9 as a frequency distribution of the individuals along a single discriminant function axis. Almost all negative scores belong to individuals from the cerrado, and the great majority of the positive scores are from restinga plants; there is almost no overlap between the two ecotypes.

Fig. 9

Linear discriminant analysis of 39 quantitative vegetative and reproductive characters of 180 individuals from six wild populations of Anacardium occidentale. Frequency distribution of the scores on the single discriminant function axis using ecotype as the categorical variable. The axis in the two diagrams represents the same discriminant function. Cerrado ecotype: Campo Maior, José de Freitas and Parque Nacional de Sete Cidades. Restinga ecotype: Barrinha, Ilha das Canárias and Labino. Computed with R (R Core Team 2018)

Cross-validation (Table 5) showed that 95.6% of cerrado individuals and 98.9% of restinga individuals were correctly classified, resulting in 2.8% misclassifications.
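The overall figure can be checked against the two per-ecotype percentages. The Python sketch below does so under the assumption, not stated explicitly in this section, that each ecotype is represented by 90 individuals (three populations of 30, as implied by the 180 individuals in six populations and the Campo Maior counts in the CART analysis).

```python
# Assumption: 30 individuals x 3 populations per ecotype (not stated in this section).
N_PER_ECOTYPE = 90

# Percentages of correctly classified individuals reported for the LDA (Table 5).
reported_correct_pct = {"cerrado": 95.6, "restinga": 98.9}

# Convert the reported percentages back to whole individuals.
correct = {
    ecotype: round(pct / 100 * N_PER_ECOTYPE)
    for ecotype, pct in reported_correct_pct.items()
}
misclassified = {ecotype: N_PER_ECOTYPE - n for ecotype, n in correct.items()}

# Overall misclassification rate across both ecotypes.
total_errors = sum(misclassified.values())
overall_error_pct = round(100 * total_errors / (2 * N_PER_ECOTYPE), 1)

print(misclassified, f"overall error: {overall_error_pct}%")
```

Under this assumption the percentages correspond to four misclassified cerrado individuals and one misclassified restinga individual, which is consistent with the reported overall rate.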

Table 5 Results of cross-validation of LDA, KNN and CART analyses of the two ecotypes of six wild populations of Anacardium occidentale

The KNN analysis correctly classified all restinga individuals, but there were more cerrado misclassifications, resulting in an overall error of 5% (Table 5). When the KNN cross-validation was carried out assigning the individuals of each population to ecotype (Online Resource 1n, f), all individuals of the restinga populations were correctly assigned, and most cerrado misclassifications to restinga were shown to be from José de Freitas. The CART analysis showed lower differentiation, with 11.7% misclassifications (Table 5); there were more misclassifications from the cerrado populations than from the restinga. Overall, the three classification approaches agreed in assigning greater consistency to the restinga group. Cross-validated KNN analyses of the ten distinct partitions of the six populations into two groups of three showed that the ecotype partition had a lower misclassification rate than any other such combination (Online Resource 1p). The two-group combination consisting of Barrinha versus all the other five had an even lower error rate, which reinforces the distinctness previously mentioned for this population. The combination uniting José de Freitas with the restinga populations also had a low error rate, which reflects the LDA and CART cross-validations (Table 4, Online Resource 1i) in which the misclassifications of this population were assigned to restinga populations rather than cerrado ones.
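The count of ten distinct partitions follows from choosing three of six populations and ignoring which group is listed first (20 / 2 = 10). A minimal Python sketch of the enumeration is given below; the function name is hypothetical and the code only lists the partitions, it does not rerun the KNN analyses.

```python
from itertools import combinations

populations = [
    "Campo Maior", "José de Freitas", "Sete Cidades",  # cerrado
    "Barrinha", "Ilha das Canárias", "Labino",         # restinga
]


def three_three_partitions(items):
    """Yield each unordered split of six items into two groups of three.

    Fixing the first item in group A avoids counting every split twice.
    """
    first = items[0]
    for rest in combinations(items[1:], 2):
        group_a = (first, *rest)
        group_b = tuple(p for p in items if p not in group_a)
        yield group_a, group_b


partitions = list(three_three_partitions(populations))
ecotype_split = (tuple(populations[:3]), tuple(populations[3:]))

print(len(partitions))
for group_a, group_b in partitions:
    tag = "  <- ecotype partition" if (group_a, group_b) == ecotype_split else ""
    print(" + ".join(group_a), "|", " + ".join(group_b), tag)
```

Each of these partitions would then be used in turn as the two-class grouping variable in the cross-validated classification.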

Discussion

The ecotypes

All the classification methods we used are based on the assumption that individuals of each population will resemble each other more than those of other populations. The cross-validation tests demonstrate that this supposition is frequently violated. The variability within each population usually overlaps with at least some of the others (Online Resource 1b). Nevertheless, as stated earlier, a key premise of this study is that the individuals of each population are genetically closer to each other than to any other population, which is to say there is an a priori reason for treating each population as a biologically meaningful group. When the populations are arranged into the sets corresponding to the ecotypes (defined on geography and ecology), they show a stronger morphological difference than in any other arrangement of two sets of three populations each (Online Resource 1n, p, f).

This study is the first to test the distinction between the two ecotypes of A. occidentale and differs from previous morphometric studies in using a larger range of morphological variables, sampled exclusively from wild populations. Our results generally corroborate the recognition of the cerrado and restinga ecotypes proposed by Mitchell and Mori (1987). Bearing in mind its limited geographical scope compared to the full range of the species, our study indicated that the restinga ecotype is morphologically somewhat more distinct than the cerrado ecotype and has consistently larger floral parts, drupes and hypocarps (see also CART analysis of ecotypes: Online Resource 3) and more numerous stamens. On the other hand, cerrado plants are taller and larger in most vegetative variables observed, except crown width, trunk circumference and petiole length (Table 3, Online Resource 1g).

We confirmed that four of the five differentiating variables tabulated by Mitchell and Mori (1987: 40) were among the strongest diagnostic characters of the ecotypes: cerrado populations had taller trunks, longer and wider leaf blades with higher length/width ratios, and blades that were longer relative to the petiole. Secondary vein number was also greater, and we confirmed that the petiole was thicker in the cerrado plants (Table 3). We observed that although restinga plants had significantly longer petioles, this difference was weaker than those in the other characters mentioned here, and there was no significant difference in leaf thickness between ecotypes (Online Resource 1g).

Variability of characters

There were substantial differences in variation measures (standard deviation and coefficient of variation) between our results (Online Resource 1q) and those of Mitchell and Mori (1987: Table IV), probably because their data were obtained by measuring numerous herbarium specimens gathered over a continent-sized area, each one likely originating from a different population.

Trunk height varied considerably in restinga populations from zero, that is, the lowest major branches emerging at ground level, to 2 m long. Since these plants are often partly buried because of the action of dunes (Souza et al. 2021), these measurements may not reflect the true length of the trunks and so our estimates of mean population coefficient of variation are better viewed with this variable excluded (Online Resource 1j). In this case it appears that the cerrado populations are consistently more variable overall than the restinga ones. In addition to this, the cerrado populations, despite their greater geographical spread (Fig. 1), were not significantly different from one another, whereas the geographically more concentrated restinga populations were more distinct from one another as well as less variable individually (Online Resource 1k). This result suggests the possibility that, pending the availability of molecular genetic data, phenotypic plasticity may play a greater role in the differentiation of restinga populations, perhaps linked to fine-grained habitat heterogeneity and dynamic change in dune-related habitats (Lázaro-Nagal et al. 2015; Mata et al. 2022; Schneider 2022; Souza et al. 2021). Unlike the results of Andrade et al. (2019), the Labino population did not emerge as the morphologically least variable population in our study, which contrasts with the relatively low genetic diversity for Labino plants reported by Santos et al. (2019); instead the least variable population was Barrinha (Online Resource 1j).

Certain quantitative variables of the inflorescence (length, width, bract length) and flower (style and pistil length in hermaphrodite flowers) showed no significant differences between ecotypes or populations. Variability in inflorescence length and width was quite high (coefficient of variation 28% and 35%, respectively), and since these traits are important for plant fitness (Pereira et al. 2023), selection may play a role in preventing their divergence in different populations. Qualitative characters were used only in the CART analysis; of these five characters, internal bark colour played a role in dividing the data, but only in the population analysis. This finding suggests that analyses that include qualitative characters, or a greater number of them, may also be useful in differentiating populations.

Heterogeneity in restinga ecotype

The distinctiveness of the Barrinha population and the NPMANOVA results (Online Resource 1j, k) suggest stronger differentiation among restinga than cerrado populations. In the restinga case, this may be because of geological and related ecological factors, since Barrinha occurs over Tertiary sandstones, whereas the other two restinga populations lie on Quaternary sands. This variation within the restinga ecotype category should be followed up in future studies and may be related to adaptations to particular habitat conditions. A strong dominance of A. occidentale has been reported in the restinga vegetation of Pará (Silva et al. 2010), Ceará (Johnson 1973) and Piauí (Santos-Filho et al. 2010, 2016), which may be partly attributable to its ability to resist dune burial, among other factors (Souza et al. 2021). Santos-Filho (2009) also noted that in restinga, A. occidentale has low tolerance to competition from other species. Future research may be able to confirm that the restinga ecotype has a degree of adaptation to this marine-influenced environment which is analogous to that of cerrado populations, which occur in a vegetation characterized by a much richer flora (Barros et al. 2003) and a wide range of different soil types, and which are subjected to periodic natural fires, among other environmental factors (Ribeiro and Walter 2008; Mendes et al. 2012; Oliveira et al. 2012; Barros et al. 2020).

Molecular studies

No molecular genetic study has yet directly compared the two ecotypes of A. occidentale. The three previous published studies on wild populations were all carried out in Piauí state using ISSR markers. Borges et al. (2018) studied two populations in coastal Piauí representing the restinga ecotype (Ilha Grande, Parnaíba) and found high within-population variability and very little differentiation between them. Santos et al. (2019) compared wild and domesticated populations over a wider area of northern Piauí and reported higher within-population diversity in wild plants, significant between-population structure but only weak distinction between the wild and domesticated groups. Gomes et al. (2021) studied three wild cerrado populations in northern Piauí and found no significant genetic differentiation between them. Santos et al. (2019) concluded that morphological patterns reported by Andrade et al. (2019) for the same region were not a reliable guide to genetic diversity, implying that environmental factors may play a significant role in morphological differentiation. The results of these genetic investigations are consistent with knowledge of the reproductive biology of wild cashew, which is an outcrossing, bee-pollinated (Freitas and Paxton 1996, 199; Freitas et al. 2014, 2002; Holanda-Neto et al. 2002) and predominantly bat-dispersed species (Mitchell and Mori 1987) in which significant inter-population gene flow is to be expected. Cota et al. (2017), in a study of the related cerrado species A. humile, reported significant inbreeding between naturally clumped plants within a population, which may also be a factor in explaining the genetic structure of populations, especially in strongly clumped restinga populations (Andrade et al. 2019; Santos et al. 2019).

Floral biology

Floral characters can be influenced by different kinds of pollinator, which may differ between populations. Studies of the floral biology of cashew have focused more on domesticated plants (Masawe et al. 1996), but there is also work on wild populations notably by Freitas and Paxton (1998) and Freitas et al. (2014). Anacardium occidentale is andromonoecious with a predominance of male flowers and is pollinated mainly by ants, bees and wasps. Within the natural range of A. occidentale in northeastern Brazil, the common honey bee Apis mellifera L. is the major pollinator in commercially cultivated orchards, whereas the solitary native bee Centris [Hemisiella] tarsata has been described as potentially the pollinator of wild cashew. Both wild and domesticated trees are thought to have very similar floral attributes (Freitas and Paxton 1998). Most fruits arise from cross-fertilization since there is a selective abortion mechanism that rejects self-pollinated flowers (Holanda-Neto et al. 2002). Linked to this, the proximity of forest fragments increases the number of native pollinator visits and these are essential for improving productivity (Freitas et al. 2014). Maintenance of nearby native forests is thus recommended for domesticated plantations in order to encourage cross-fertilization, and this stimulates gene flow between native and domesticated forms, producing hybrids between them. Our study suggests the possibility of a difference in the pollination ecology of cerrado plants, since we observed their floral variables had consistently and significantly lower values.

Dispersal biology

Dispersal vectors such as mammals are said to have fed on cashew hypocarps as early as the Eocene, more than 45 million years ago, and are still important biotic dispersal agents (Manchester et al. 2007). Mitchell and Mori (1987) state that frugivorous bats are dispersal agents of A. occidentale, carrying the drupes long distances. Humans have been an important agent of seed dispersal (Mitchell and Mori 1987), because of the economic and nutritional importance of the species, and have distributed the fruit over wide areas and long distances, bringing about new forms of gene exchange with local populations (Johnson 1973).

Domesticated and wild forms

There are many publications on the morphological variation of domesticated and wild populations of cashew, and various names have been applied to the plants studied; these have been reviewed up to 2017 by Andrade et al. (2019, including their Supplementary Table 1) from the viewpoint of systematics, and difficulties in taxon recognition and nomenclature are discussed there. Many of these studies were carried out in germplasm collections and their purpose was more to identify variation in characteristics for plant breeding than to explore the biosystematics of wild forms. Some, however, are either field-sampled from wild populations (Lucena 2006; Rufino et al. 2008; Gomes et al. 2011; Andrade et al. 2019; Carneiro et al. 2019; França et al. 2020) or based on multiple samples from wild populations cultivated in germplasm collections (Pessoni 2007; Maia et al. 2012; Gomes et al. 2013; Borges 2015; Belo et al. 2019; Matos Filho et al. 2019) and thus provide information on natural within-population variability. Most focus on drupe and hypocarp variables because of their importance for commercial products; the drupe is the main cashew export product around the world (Aliyu and Awopetu 2011; Rossetti et al. 2019) and drupe size and weight are the most important characters in selecting plants for genetic improvement of cultivated cashews (Rossetti et al. 2019).

The research programme on wild cashew based in northern Piauí of which the present study forms a part has been previously focused mainly on differences between wild restinga populations and domesticated populations (Vieira et al. 2014; Andrade et al. 2019; Santos et al. 2019; Souza et al. 2021; Ribeiro et al. 2021). The domesticated plants sampled were very heterogeneous in status, varying from plantations of agronomic clones to traditional orchards which probably consisted of mixtures of informally planted trees and naturalized individuals resulting from local introgression with nearby wild populations. Nevertheless, these studies showed some differences between the two groups. Vieira et al. (2014), in a geometric morphometric study of leaf blade outline shape (see also Andrade et al. 2010; Nascimento et al. 2021), showed a statistically significant difference between wild and domesticated populations in coastal Piauí. Andrade et al. (2019) compared wild and domesticated populations from coastal Piauí and domesticated populations at locations further inland and found that coastal individuals had shorter, relatively broader leaves, smaller drupes and hypocarps and fewer leaf secondary veins, making them similar vegetatively to the restinga ecotype studied here. However, since they were compared to only domesticated plants from non-coastal locations, it was to be expected that the fruits of the latter would be larger. The present study compared only wild populations and showed that wild cerrado plants have even smaller fruits than those of the restinga, and that there is a much clearer distinction between the restinga and cerrado ecotypes than between the former and domesticated plants.

Ecotypes and taxonomy

The ecotype concept is not one of the formal taxonomic categories of the Nomenclatural Code for plants (Turland et al. 2018), but has been used mainly in biosystematic studies to differentiate groups of populations that are interfertile, phylogenetically close and partially differentiated ecologically, physiologically or morphologically (Stuessy 2009). The concept of cryptic species might also be applicable to the present case (Monro and Mayo 2022), once there is sufficient molecular evidence available. For the present, however, we conclude that the ecotype concept continues to be useful until a more detailed picture of intra-specific diversity has been attained across its extensive geographical range. From the taxonomic standpoint, this situation continues to be less than ideal, but the lack of a more general view of the intra-specific variation in A. occidentale hampers the development of a formal classification of those taxa considered to belong to this species, as currently delimited in the wide sense by taxonomists such as Mitchell and Mori (1987) and Silva-Luz et al. (2023). One example is provided by the contrasting views on recognizing the species Anacardium othonianum, accepted by various modern researchers (e.g. Belo et al. 2019), especially those working on the cerrado taxa of central Brazil, whereas for Mitchell and Mori (1987) this taxon represents the cerrado ecotype of A. occidentale. Other taxonomic problems worthy of note are the continuing confusion over the status of A. microcarpum Ducke (1922), an Amazonian species which has still not been sufficiently investigated in its type locality (Andrade et al. 2019; Pereira et al. 2021), and the fact that the type of A. occidentale itself is from Sri Lanka (Mitchell and Mori 1987). Since the type is not from a wild population, there remains the possibility of restricting this name to domesticated forms once a more comprehensive picture of the systematics of A. occidentale sensu lato is available, but this cannot be achieved without the perspective provided by molecular data. Morphometric techniques are a good way to classify phenotypes, and important for breeding programmes and genetic improvement, but detailed knowledge of the evolutionary lineages can only come from genetic studies, and a general understanding of the species requires both approaches. There is a clear need for more morphological, anatomical (e.g. Ramos et al. 2016), ecological and molecular phylogeographic research on wild forms of A. occidentale, not only in northeastern Brazil where its greatest natural diversity of forms occurs (Paiva et al. 2003), but also elsewhere throughout its range.

Conclusions

This study successfully differentiated the two ecotypes proposed by Mitchell and Mori (1987), although this result should be treated with caution since we studied only a small number of populations. The stronger distinction of the restinga populations from one another and their greater internal uniformity suggest that conservation strategies should maximize the number of protected populations in the coastal regions, where they are probably under the greatest human pressure. However, it seems likely that in both cerrado and restinga domains much greater population diversity can be expected when more wild populations have been investigated. Whether the two-ecotype hypothesis will survive more extensive study remains to be seen; patterns of diversity across the whole species range may be more mosaic-like than this hypothesis allows. Future studies should, if possible, target more remote and undisturbed populations, before they shrink or disappear or become genetically mixed with domesticated forms resulting from rapid expansion of human occupation.

The morphological divergence of the ecotypes may well reflect ecological and environmental factors as much as genetic ones, and it is here that longer-term studies of germplasm collections are of great importance for investigating whether the differentiating phenotypic traits we have discussed have a significant genetic basis, as well as their role in generating cultivars with phenotypic traits of potential ecological, economic and agronomic significance (Pessoni 2007; Borges et al. 2018; Andrade et al. 2019; Gomes et al. 2021). Vegetative characters such as height, crown architecture and leaf attributes are likely to be influenced by nutrient levels and by wind in the restinga, where salt levels are higher and the soils more oligotrophic. Fruit variables may be influenced by gene flow from domesticated cashews, especially in more populated areas, but also by animal and human dispersal of the compound fruits (drupe and hypocarp).

In addition, a much broader program of molecular genetic studies of wild cashew populations could at least provide indexes of within- and between-population genetic diversity across the species range and thus build more rapidly a general framework for conservation strategies of this very important species. In the face of growing human pressure on natural resources, a coordinated approach is needed which combines research on morphological, ecological and molecular markers in wild populations with cultivar development in order to ensure a wider genetic basis for cashew agronomics, given the future impact of climate and habitat changes. Characterizing the morphological diversity of wild cashew populations is thus one of the fundamental elements to understanding the natural variation of the species and its adaptive capacity in the light of future climate change given the continuing human need for its sustainable use. The present study is a small but significant contribution to this end.