Modeling visual perception of Chinese classical private gardens with image parsing and interpretable machine learning

Research workflow

This study proposes a framework for modeling landscape perception in Chinese classical private gardens that combines objective visual feature indicators with subjective perceptual evaluations. The workflow comprised three steps: data acquisition and image preprocessing, subjective and objective evaluation, and modeling and interpretation of results (Fig. 1).

(1) Data acquisition and image preprocessing. Image acquisition locations were selected along the garden path network at an average spacing of 10 m. Images were preprocessed in Photoshop using a macro function.

(2) Subjective and objective evaluations. Objective visual features were quantified and subjective perception was evaluated.

Fig. 1: Framework of the study.

Research framework including data acquisition, feature extraction, perception evaluation, and model analysis.

The objective visual features comprised 35 indicators covering landscape elements, depth of view, color, and texture, all quantified using computer vision techniques. The semantic segmentation network (SegNet) model was used to quantify the landscape element features. The DINOv2 model and k-means clustering were used to quantify the depth of view features. Python (OpenCV and Pillow libraries) and k-means clustering were used to quantify the color features. The texture features were quantified by fractal dimension (FD) analysis in Fiji (ImageJ).

The four dimensions of Kaplan’s preference matrix (coherence, complexity, legibility, and mystery) were used for the subjective perception assessment. We used a subjective evaluation scale based on the semantic differential (SD) method and invited respondents to rate the sample images on a 5-point scale using an online questionnaire. The average value of the four-dimensional perceptual indicators was used for each image.

(3) Modeling and interpretation of results. Descriptive statistics of the objective visual features and subjective perception scores were obtained. The HGBoost model, combined with Shapley additive explanations (SHAP) and partial dependence plots (PDPs), was used for variable importance assessment and nonlinear response analysis. The impact of visual features shaped by different spatial organization strategies on perception was then analyzed.

Study area

Three Chinese classical private gardens in the Gusu District of Suzhou City, China, namely the Garden of Cultivation, the Canglang Pavilion, and the Net Master’s Garden (Fig. 2), were selected as empirical case studies to examine the visual characteristics associated with different spatial organization types.

Fig. 2: Study area maps.

(a) Location of Suzhou, Jiangsu Province, China. (b) Locations of three historic gardens in the Gusu District, Suzhou. Site plan of (c) the Garden of Cultivation, (d) the Canglang Pavilion, and (e) the Net Master’s Garden.

All three gardens are inscribed on the World Heritage List and represent Chinese classical private gardens with distinct gardening traditions and spatial compositions. The Garden of Cultivation (Ming Dynasty, 5450 m²) is a comprehensive landscape garden combining mountain and water features. The Canglang Pavilion (Northern Song Dynasty, 14,460 m²) is characterized by mountain-oriented spatial organization. The Net Master’s Garden (Southern Song Dynasty, 5970 m²) is dominated by waterscape features42,43.

The three gardens exhibit complementary characteristics in dominant landscape elements, spatial organization, and visitor movement rhythms, representing typical spatial types of Chinese classical private gardens. Their comparable site scales also help control potential scale-related effects in subsequent visual feature extraction and perceptual modeling. This study focused on the courtyards and some of the open buildings (e.g., pavilions and corridors).

Data acquisition and image preprocessing

Garden paths are core elements of historic gardens44. They connect landscape nodes and provide a rhythm for tourist movement and perceptual experience. We constructed a self-photographed image dataset based on the existing outdoor circulation routes of the gardens. The sampling paths covered courtyard spaces and selected open buildings (e.g., pavilions and corridors), forming a continuous and accessible path network to capture variations in visual characteristics along visitor routes.

The outdoor path networks of the three gardens were extracted. Sampling points were placed at regular intervals of ~10 m along the paths to ensure spatial coverage and reproducibility. The unified interval controlled scale effects and enhanced cross-garden comparability, supported by previous findings showing high consistency in path network metrics among small-scale classical private gardens44.
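The regular-interval placement described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes each path is available as a polyline of planar coordinates in metres, and the function name `sample_along_path` and the example path are hypothetical.

```python
import math

def sample_along_path(vertices, spacing=10.0):
    """Return points spaced `spacing` metres apart along a polyline.

    `vertices` is a list of (x, y) tuples in metres (assumed input format).
    The first vertex is always included as the first sampling point.
    """
    points = [vertices[0]]
    next_dist = spacing  # distance along the path at which the next sample is due
    walked = 0.0         # path length covered by the segments already processed
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        # Place every sample that falls on the current segment.
        while seg_len > 0 and next_dist <= walked + seg_len + 1e-9:
            t = (next_dist - walked) / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            next_dist += spacing
        walked += seg_len
    return points

# Hypothetical L-shaped path, 35 m long in total.
path = [(0.0, 0.0), (25.0, 0.0), (25.0, 10.0)]
samples = sample_along_path(path, spacing=10.0)
```

Distances are measured along the path rather than as straight lines, so samples remain evenly spaced around corners.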

Visual variation in garden spaces follows path-based movement, requiring sampling intervals that balance perceptual sensitivity and redundancy. Perception-based studies45 have indicated that in the Net Master’s Garden, salient viewpoints associated with the experience of “changing scenery with each step” are concentrated around watersides and building entrances, while the average spacing between such viewpoints along other path segments is ~10 m. This suggests that a 10 m scale captures major perceptual changes along the path sequence. In addition, visual perception studies of traditional garden spaces have adopted path sampling intervals of 7–8 m between adjacent viewpoints30, which is comparable to the sampling strategy used in this study and provides methodological support for the chosen interval.

Images were acquired on November 17–18, 2023, between 13:00 and 16:00 to ensure optimal light conditions. The camera lens was set at eye level (160 cm). At each sampling point, four images were taken along the tangent and perpendicular directions of the path centerline to capture the spatial composition around the point (Fig. 3). The experimental dataset comprised 300 valid photos. The images were uniformly resized using Photoshop macros for subsequent batch processing and feature extraction.

Fig. 3: Image sampling strategy.

Schematic illustration of four-directional image acquisition at each sampling point.
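The four-directional acquisition can be expressed as a small geometric calculation. The sketch below assumes that the path tangent at a sampling point is estimated from the neighboring points on the centerline and that headings are measured in degrees counter-clockwise from the x-axis; both conventions and the name `view_headings` are illustrative assumptions.

```python
import math

def view_headings(p_prev, p_next):
    """Four camera headings at a sampling point, in degrees.

    The tangent is approximated by the direction from the previous to the
    next point on the path centerline (an assumed estimate). The returned
    order is: forward, left, backward, right.
    """
    tangent = math.degrees(math.atan2(p_next[1] - p_prev[1],
                                      p_next[0] - p_prev[0]))
    return [(tangent + offset) % 360.0 for offset in (0.0, 90.0, 180.0, 270.0)]

# Hypothetical sampling point on a path running along the x-axis.
headings = view_headings((0.0, 0.0), (20.0, 0.0))
```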

Quantification of landscape element features

Chinese classical private gardens consist of plants, rockery, water, buildings, and pathways. These elements provide the visual structure and spatial hierarchy of the garden space23. Plants represent nature in gardens; they can serve as a focal point or background to enhance visual composition and can shape the enclosure and transparency of a space by forming a continuous interface33. Water and rockery are artistic imitations of the natural landscape. Architecture represents human intervention in the landscape and is an essential means of enclosing space and organizing sightlines. Garden paths connect these elements and guide the visitors’ perspective and rhythm; their scale and degree of openness affect people’s movement and viewing. Based on the existing literature and expert evaluation46,47, we selected seven indicators of landscape element features (Table 1). The proportion of each element in the image was determined. Following environmental psychology theory, Shannon’s information entropy was used to measure the complexity of the landscape elements in the image18,48.

Table 1 Equations and descriptions of landscape element features

The SegNet model was used to quantify the proportion of different landscape elements in the images (see Supplementary A for more details). The method performs pixel-level analysis to determine the indicators of the landscape elements.
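Once a segmentation model has assigned a class to every pixel, the element proportions and the entropy-based complexity follow from simple counting. The sketch below assumes the segmentation output is a 2-D array of class labels and uses the natural logarithm; the exact formulation in Table 1 may differ, and the toy label map is invented for illustration.

```python
import math
from collections import Counter

def element_proportions(label_map):
    """Share of pixels belonging to each class in a 2-D list of labels."""
    counts = Counter(label for row in label_map for label in row)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def shannon_entropy(proportions):
    """Shannon entropy H = -sum(p * ln p) over classes with p > 0."""
    return -sum(p * math.log(p) for p in proportions.values() if p > 0)

# Toy 2 x 4 label map standing in for a segmented garden image.
label_map = [
    ["plant", "plant", "water", "building"],
    ["plant", "plant", "water", "building"],
]
proportions = element_proportions(label_map)
entropy = shannon_entropy(proportions)
```

An image dominated by a single element yields an entropy near zero, whereas an even mix of elements yields the maximum value for that number of classes.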

Quantification of depth of view features

Depth perception of space is critical for the visual experience of a garden49,50. Chinese classical private gardens are characterized by “seeing the big in the small” and by organizing spatial elements and spatial units. A spatial hierarchy and sense of extension are created within a limited scale to improve depth perception. The spatial unit’s structure and visual information significantly affect the depth perception of the landscape51,52. Although human depth perception is based on multiple factors, the depth in static images is critical for interpreting the spatial structure.

Due to the high cost of depth measurement methods (e.g., LiDAR or depth camera) and the difficulty of large-scale applications53, this study used a monocular depth estimation technique54. The DINOv2 model35, which has excellent generalization ability, was used to estimate the depth in the landscape images. A single-channel depth map (pixel value range of 0–255) was generated using the DINOv2 model. The K-means clustering algorithm was used to stratify the depth pixel values nonlinearly, and the image was divided into four regions: foreground, midground, background, and far background. The proportion of pixels in the four regions was extracted, and the average depth (AD) and its standard deviation were calculated to quantify the depth of view features (Table 2).
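The stratification step can be sketched with a one-dimensional k-means on the depth values. This is an illustrative stand-in rather than the authors' implementation: it uses a plain Lloyd iteration with evenly spaced starting centers, and it assumes that larger pixel values mean greater distance, which depends on the convention of the depth map.

```python
import statistics

def assign(values, centres):
    """Index of the nearest centre for every value."""
    return [min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            for v in values]

def kmeans_1d(values, k=4, iterations=50):
    """Lloyd's algorithm on scalars; centres start evenly spaced over the range."""
    lo, hi = min(values), max(values)
    centres = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    for _ in range(iterations):
        labels = assign(values, centres)
        for i in range(k):
            members = [v for v, lab in zip(values, labels) if lab == i]
            if members:  # keep the old centre if a cluster is empty
                centres[i] = statistics.fmean(members)
    return centres, assign(values, centres)

def depth_features(depth_pixels):
    """Layer shares plus mean and standard deviation of depth.

    Assumption: larger values are farther away, so cluster 0 is the foreground.
    """
    _, labels = kmeans_1d(depth_pixels, k=4)
    names = ["foreground", "midground", "background", "far_background"]
    n = len(depth_pixels)
    shares = {name: labels.count(i) / n for i, name in enumerate(names)}
    return shares, statistics.fmean(depth_pixels), statistics.pstdev(depth_pixels)

# Toy depth values (0-255) standing in for a flattened depth map.
depth = [10, 12, 14, 80, 82, 150, 152, 154, 240, 250]
shares, avg_depth, sd_depth = depth_features(depth)
```

Because the cluster boundaries are learned from each image, the four layers adapt to the depth distribution instead of using fixed thresholds.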

Table 2 Equations and descriptions of depth of view features

Quantification of color features

We conducted a quantitative analysis using the hue, saturation, and brightness (HSB) color model, which is closer to human color perception than the cyan, magenta, yellow, and key (CMYK), red, green, and blue (RGB), and other color models55. We divided the color characteristics into two categories. (1) Color attributes: the brightness, contrast, saturation, and number of colors (NOC) were extracted using OpenCV with the Pillow library (Table 3). (2) Color composition: studies have shown that the ratio of warm and cool colors and the representative colors of an image affect visual perception38,56. Therefore, we quantified the ratio of warm and cool colors and the hue, saturation, brightness, and proportion of the dominant and accent colors. After the image was converted into the HSB color space, red to yellow colors (H = 0–60 and 150–179) were designated as warm colors based on the hue (H) value, and green to blue colors (H = 60–150) were categorized as cool colors. The warm-to-cool color ratio (WCR) was then calculated. K-means clustering was used to classify the color pixels. The optimal color perception occurs for 4–8 clusters57. We used a K value of 8 and took the two clusters with the highest pixel percentages as the dominant colors and the two clusters with the lowest percentages as the accent colors. We extracted their H, S, B, and area proportion to reflect the color dominance of the scene and the intensity of local visual stimuli (Table 3).
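The warm and cool classification by hue can be sketched as below. Several points are assumptions: the hue ranges are read as lying on a 0–179 scale (inferred from the upper bound of 179), the shared boundary values are resolved by assigning 60 to cool and 150 to warm, and the WCR is taken as the count of warm pixels divided by the count of cool pixels. The exact definition is given in Table 3 and may differ.

```python
import colorsys

def warm_cool_ratio(rgb_pixels):
    """Ratio of warm to cool pixels based on hue thresholds.

    `rgb_pixels` is an iterable of (R, G, B) tuples with 8-bit channels.
    Hue is rescaled to a 0-179 style scale (assumed convention).
    Note: achromatic pixels have hue 0 and therefore count as warm here.
    """
    warm = cool = 0
    for r, g, b in rgb_pixels:
        h, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        hue = h * 180  # colorsys returns hue in [0, 1)
        if 60 <= hue < 150:  # assumed handling of the boundary values
            cool += 1
        else:
            warm += 1
    return warm / cool if cool else float("inf")

# Toy pixels: red, yellow, and pinkish red are warm; blue and cyan are cool.
pixels = [(255, 0, 0), (255, 255, 0), (255, 0, 128), (0, 0, 255), (0, 255, 255)]
wcr = warm_cool_ratio(pixels)
```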

Table 3 Equations and descriptions of color features

Quantification of texture features

Image texture represented by contour elements can affect landscape perception58. We quantified the texture features of Chinese classical private garden images using FD analysis. The FD reflects textural variation and irregularity and is widely used as a measure of landscape complexity59, visual diversity60, and naturalness61. The box-counting method was used to analyze the self-similarity of images at different spatial scales using Fiji (ImageJ) software62,63. The FD of an image is defined as:

$${FD}=\mathop{\mathrm{lim}}\limits_{\varepsilon \to 0}\frac{\log N\left(\varepsilon \right)}{\log \left(\frac{1}{\varepsilon }\right)}$$

(1)

where ${FD}$ is the box-counting fractal dimension; $\varepsilon$ is the side length of the boxes; and $N\left(\varepsilon \right)$ is the number of boxes of side length $\varepsilon$ needed to cover the image contours.
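In practice, the limit is approximated by the slope of a straight line fitted to log N against log(1/ε) over a range of box sizes. The sketch below illustrates this on small binary grids; it is not Fiji's implementation, and it assumes a square binary image in which nonzero cells are contour pixels.

```python
import math

def box_count(grid, size):
    """Number of size x size boxes that contain at least one nonzero cell."""
    n = len(grid)
    count = 0
    for r in range(0, n, size):
        for c in range(0, n, size):
            if any(grid[i][j]
                   for i in range(r, min(r + size, n))
                   for j in range(c, min(c + size, n))):
                count += 1
    return count

def fractal_dimension(grid, sizes=(1, 2, 4, 8)):
    """Least-squares slope of log N(eps) against log(1/eps)."""
    xs = [math.log(1 / s) for s in sizes]
    ys = [math.log(box_count(grid, s)) for s in sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Sanity checks on shapes with known dimension.
n = 16
filled = [[1] * n for _ in range(n)]                                 # plane
diagonal = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # line
fd_filled = fractal_dimension(filled)
fd_line = fractal_dimension(diagonal)
```

A filled square gives a dimension of 2 and a straight line gives 1; contour images of garden scenes would be expected to fall between these values.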

Subjective visual perception evaluation

A preference matrix was used to select the subjective perception evaluation indicators. This theory states that people have two basic needs in an environment: to understand and to explore. Depending on the speed of information processing, four information variables exist: coherence (immediate understanding), complexity (immediate exploration), legibility (inferred understanding), and mystery (inferred exploration), comprising the preference matrix18,64. According to the theory and related studies18,19,20,21, the definitions of the four perceptual dimensions were refined (Table 4). The SD method was used to quantify the subjects’ perceptions of coherence, complexity, legibility, and mystery of the historic garden scene57,65,66. Each dimension was defined in adjective pairs to improve the comprehensibility and measurement validity of the scale.

Table 4 Definitions and adjective pairs for subjective perception evaluation indicators

An online questionnaire and real-life images were used to collect data on tourists’ subjective perceptions of Chinese classical private garden landscapes. The 300 images were evenly divided into six groups according to their shooting sequence to reduce respondent burden and improve questionnaire quality, with participants randomly assigned to one group. All questionnaires used a standardized format and scoring system to ensure the consistency of the experiment (Supplementary B).

Before the formal survey, the definitions of the four perceptual dimensions and the adjective pairs were explained to the subjects to ensure that they understood the evaluation criteria. A 5-point Likert scale was used for scoring, where 1 indicated the strongest agreement with the left-hand adjective of a pair and 5 the strongest agreement with the right-hand adjective. Demographic information of the subjects, including age group and professional background, was also collected.

The survey was administered through an online platform (https://www.wjx.cn/) during the data collection phase and yielded 267 valid questionnaires. Each group received 43–46 valid questionnaires, meeting commonly accepted standards for experimental reliability in psychological research67. The age structure of the 267 subjects was as follows: 111 were aged 18–25, 73 were aged 26–35, 34 were aged 36–45, 35 were aged 46–55, and 14 were aged 56–65. Regarding professional background, 103 subjects worked or studied in landscape architecture or related fields, and 164 were in unrelated professions, providing a relatively diverse sample in terms of age and professional background. For each image, the ratings in each of the four dimensions were averaged across respondents to create the subjective perception evaluation dataset.
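The aggregation of individual ratings into per-image scores can be sketched as follows. The record layout (image identifier, dimension, score) and the identifiers used are hypothetical; the sketch only illustrates averaging across respondents within each perceptual dimension.

```python
from collections import defaultdict
from statistics import fmean

def mean_scores(ratings):
    """Average the ratings per image and per perceptual dimension.

    `ratings` is an iterable of (image_id, dimension, score) records,
    one record per respondent, image, and dimension (assumed layout).
    """
    buckets = defaultdict(list)
    for image_id, dimension, score in ratings:
        buckets[(image_id, dimension)].append(score)
    result = defaultdict(dict)
    for (image_id, dimension), scores in buckets.items():
        result[image_id][dimension] = fmean(scores)
    return dict(result)

# Toy records from two respondents for one hypothetical image.
ratings = [
    ("img001", "mystery", 4),
    ("img001", "mystery", 2),
    ("img001", "coherence", 5),
    ("img001", "coherence", 4),
]
scores = mean_scores(ratings)
```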

The questionnaire scales used in this study were constructed based on Kaplan’s preference matrix. Given the clear theoretical foundation and established validity of the perceptual dimensions, the intraclass correlation coefficient (ICC) was used to assess the within-group consistency of the ratings across the four perceptual dimensions for each of the six questionnaire groups68 (Supplementary C). Based on this assessment, the average scores of the four perceptual dimensions were calculated for each image and used as the basic indicators of subjective perception in subsequent analyses.
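The text does not state which form of the ICC was used (the details are in Supplementary C). As one plausible illustration, the sketch below computes the two-way random-effects, average-measures form, ICC(2,k), from a complete images-by-raters score matrix; the choice of this form and the toy matrices are assumptions.

```python
def icc2k(matrix):
    """ICC(2,k): two-way random effects, absolute agreement, average measures.

    `matrix[i][j]` is the score given to image i by rater j; every rater
    is assumed to have scored every image.
    """
    n, k = len(matrix), len(matrix[0])
    grand = sum(map(sum, matrix)) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between images
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)

# Perfect agreement among three raters.
icc_perfect = icc2k([[1, 1, 1], [3, 3, 3], [5, 5, 5]])
# Same ranking, but the second rater scores one point higher throughout.
icc_offset = icc2k([[1, 2], [2, 3], [3, 4]])
```

The second example shows that a constant offset between raters lowers this form of the ICC even when the ranking of images is identical, because it measures absolute agreement.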

To assess rating-scale consistency across the six questionnaire groups, an anchor-image validation experiment was conducted using linear mixed-effects models (LMMs). The 300 images were clustered based on their mean scores across the four perceptual dimensions using K-means clustering (K = 5). Representative images closest to each cluster centroid were selected as anchor images (30 images in total; five images per questionnaire group) to ensure balanced coverage of the perceptual space.

An independent sample of participants (n = 40) was then recruited to re-evaluate the anchor images using the same rating protocol as the original survey, with a comparable demographic composition. Separate LMMs were then fitted for each perceptual dimension, specifying questionnaire group as a fixed effect and image ID and participant ID as random intercepts, to test for potential systematic scale differences while controlling for image and individual level variability (Supplementary D).
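For one perceptual dimension, the model described above can be written out as follows. The notation is introduced here for illustration and is not taken from the source: $y_{pi}$ is the rating given by participant $p$ to anchor image $i$, and $g(i)$ is the questionnaire group to which image $i$ originally belonged.

$$y_{pi}=\beta_{0}+\beta_{g(i)}+u_{i}+v_{p}+\varepsilon_{pi},\qquad u_{i}\sim N\left(0,\sigma_{u}^{2}\right),\; v_{p}\sim N\left(0,\sigma_{v}^{2}\right),\; \varepsilon_{pi}\sim N\left(0,\sigma^{2}\right)$$

Here $\beta_{g(i)}$ is the fixed effect of the questionnaire group, $u_{i}$ and $v_{p}$ are the random intercepts for image and participant, and $\varepsilon_{pi}$ is the residual. Under this formulation, group effects that do not differ significantly from zero would indicate no systematic scale difference between questionnaire groups.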

Interpretable machine learning model

To improve regression fitting accuracy, we compared five widely used machine learning regression algorithms: categorical boosting (CatBoost), HGBoost, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and random forest. All models were fine-tuned using Bayesian optimization and trained on the same dataset under an identical training procedure to ensure comparability. Model performance was evaluated using fivefold cross-validation and an independent test set, with four standard metrics: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination (R²). HGBoost showed more stable performance than the other models and was selected for subsequent analyses (Supplementary E). We applied the SHAP method to interpret the results of the HGBoost model and determine the relative contributions of the variables. PDPs were used to visualize the nonlinear relationships between key features and perceptual outcomes (more details in Supplementary F).
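The four evaluation metrics can be computed directly from observed and predicted scores. The sketch below uses their standard definitions; the function name and the toy values are illustrative and unrelated to the study's results.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and R2 for paired observed and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    ss_res = mse * n                                    # residual sum of squares
    return {
        "MAE": mae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1 - ss_res / ss_tot,
    }

# Toy perception scores (observed) and model predictions.
metrics = regression_metrics([3.0, 4.0, 5.0, 2.0], [2.5, 4.0, 4.5, 3.0])
```

In a fivefold cross-validation, these metrics would be computed on each held-out fold and then averaged, so that every model is compared on the same partitions.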