
Random Forest is one of my favorite data mining algorithms. First, it is incredibly versatile: it can be used to solve both regression and classification problems, to search for anomalies, and to select predictors. Second, it is an algorithm that is genuinely difficult to apply incorrectly, simply because, unlike other algorithms, it has few tunable parameters. It is also surprisingly simple in nature. And at the same time, it is amazingly accurate.

What is the idea behind such a wonderful algorithm? The idea is simple: suppose we have some very weak learner, say a decision tree. If we build many different models using this weak learner and average the results of their predictions, the final result will be significantly better. This is ensemble learning in action. Hence the name "Random Forest": for the given data, the algorithm creates many decision trees and then averages the results of their predictions. An important point: there is an element of randomness in the creation of each tree. Clearly, if we created many identical trees, averaging them would give only the accuracy of a single tree.
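As a minimal sketch (assuming the randomForest package is installed), training such an ensemble in R takes a couple of lines:

library(randomForest)
# Fit 500 trees to predict Species from the other columns of R's built-in iris data
model <- randomForest(Species ~ ., data = iris, ntree = 500)
print(model)  # out-of-bag error estimate and confusion matrix

The out-of-bag error printed here comes almost for free: each tree is evaluated on the rows that its bootstrap sample left out.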

How does it work? Let's assume we have some input data. Each column corresponds to some feature, and each row corresponds to a data element.

We can randomly select a certain number of columns and rows from the entire data set and build a decision tree based on them.
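A sketch of this per-tree randomness in R (the data frame df and the target column y are hypothetical names; strictly speaking, Random Forest re-draws the feature subset at every split of a tree, not once per tree, but the idea is the same):

set.seed(42)
# Bootstrap the rows: sample with replacement, same size as the original data
row_idx <- sample(nrow(df), size = nrow(df), replace = TRUE)
# Draw a random subset of the feature columns (sqrt of the feature count is a common default)
feature_cols <- setdiff(names(df), "y")
col_idx <- sample(feature_cols, size = floor(sqrt(length(feature_cols))))
subset_df <- df[row_idx, c(col_idx, "y")]
# A single decision tree would then be grown on subset_df, e.g. rpart::rpart(y ~ ., data = subset_df)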




That's all. The 17-hour flight is over, and Russia is now on the other side of the ocean. Through the window of a cozy two-bedroom apartment, San Francisco looks in at us: the famous Silicon Valley, California, USA. Yes, this is the very reason I haven't written much lately. We moved.

It all started back in April 2011, when I had a phone interview with Zynga. Back then it all seemed like some kind of game unrelated to reality, and I could not even imagine what it would lead to. In June 2011, Zynga came to Moscow and conducted a series of interviews: about 60 candidates who had passed a phone screen were considered, and about 15 people were selected (I don't know the exact number; some later changed their minds, others refused right away). The interview turned out to be surprisingly simple. No programming problems, no tricky questions about why manhole covers are round, mostly a test of your ability to chat. And knowledge, in my opinion, was assessed only superficially.

And then the rigmarole began. First we waited for the results, then the offer, then the LCA approval, then the approval of the visa petition, then documents from the USA, then the queue at the embassy, then additional checks, then the visa. At times it seemed to me that I was ready to drop everything and forget about it. At times I doubted whether we needed this America at all; Russia is not so bad either. The whole process took about six months; in the end, in mid-December, we received our visas and began preparing for departure.

Monday was my first working day at the new place. The office has everything you need not just for working but for living: breakfasts, lunches, and dinners from in-house chefs, plenty of varied food tucked into every corner, a gym, massage, and even a hairdresser. All of it is completely free for employees. Many people commute by bicycle, and several rooms are set up for storing bikes. I have never seen anything like this in Russia. However, everything has its price: we were warned right away that we would have to work a lot. What "a lot" means by their standards is not yet clear to me.

I hope, however, that despite the amount of work, in the foreseeable future I will be able to resume blogging and perhaps tell you something about American life and working as a programmer in America. Wait and see. In the meantime, Happy New Year and Merry Christmas to everyone, and see you again!


As an example of its use, let's compute the dividend yield of Russian companies. As the base price, we take the closing price of the stock on the day the shareholder register closes. For some reason this information is not available on the Troika website, although it is much more interesting than the absolute dividend values.
Attention! The code takes quite a long time to execute, because for each stock a request must be made to the Finam servers to obtain its price.

# Assumes a data frame `divs` with columns Symbol, Name, RegistryDate and Divs,
# and getSymbols()/Cl() available (quantmod/rusquant with the Finam source)
result <- NULL
for (i in 1:nrow(divs)) {
  d <- divs[i, ]
  if (d$Divs > 0) {
    try({
      quotes <- getSymbols(d$Symbol, src = "Finam",
                           from = "2010-01-01", auto.assign = FALSE)
      if (!is.null(quotes)) {
        price <- Cl(quotes)[d$RegistryDate]  # closing price on the registry-close date
        if (length(price) > 0) {
          dd <- d$Divs
          result <- rbind(result,
                          data.frame(d$Symbol, d$Name, d$RegistryDate,
                                     as.numeric(dd) / as.numeric(price),
                                     stringsAsFactors = FALSE))
        }
      }
    }, silent = TRUE)
  }
}
colnames(result) <- c("Symbol", "Name", "RegistryDate", "Divs")
result


Similarly, you can build statistics for previous years.

Introduction

Chapter 1. Theoretical foundations of Big Data analysis

1.1 About Big Data

1.2 Map-Reduce

1.3 Data Mining for working with Big Data

1.4 Problems solved by Data Mining methods

Conclusion to the first chapter

Chapter 2. Cluster analysis for Big Data

2.1 Selecting a clustering method

2.2 Hierarchical methods

2.3 Non-hierarchical methods

2.4 Comparison of types of clustering

2.5 Statistics associated with cluster analysis

Conclusion to the second chapter

Chapter 3. Algorithm for splitting retail outlets

3.1 Client profile

3.2 Correspondence analysis

3.3 The main idea of cluster analysis

3.4 Features for clustering

3.5 Identification of points of homogeneous location

3.5.1 Final division into strata

3.6 Clustering objects into homogeneous groups

3.7 Clustering of the assortment of retail outlets

Conclusion to the third chapter

Conclusion

Bibliography

Introduction

Humanity in its development uses material, energy, instrumental and information resources. Information about events of the past, present and possible future is of great interest for analyzing what is happening. As the ancients said: Praemonitus praemunitus - "forewarned is forearmed".

The modern development of society is characterized by an unprecedented growth of information flows in industry, trade, and the financial markets. A society's ability to store and quickly process information largely determines a country's level of development.

The problem of collecting, storing, and processing information receives a great deal of attention in modern society. At the moment, however, there is a clear contradiction. On the one hand, human civilization is experiencing an information explosion: the volume of information grows significantly every year. On the other hand, this growth exceeds the individual's ability to assimilate it. The existence of such problems drives the massive development of technologies and technical means for handling these flows.

The extremely important role of information in the modern world has led to the identification of information as its own resource, as important and necessary as energy, financial, and raw materials.

Society's needs for collecting, storing and processing information as a commodity have created a new range of services - the information technology market.

For the fullest and most integral use of information technologies, information must be collected and processed; places for its storage and accumulation must be created, along with transmission systems and access-control systems; and finally, information must be systematized. The last problem has become especially relevant recently: a large, even huge, amount of information entering global storage arrays without systematization can lead to an information collapse, where accessing or searching for the necessary information becomes like looking for a needle in a haystack.

The purpose of this work is a comparative analysis of cluster analysis methods in solving grouping problems.

The task is to analyze approaches to using cluster analysis in problems of typing large sets of data.

In the course of the work, various methods of cluster analysis will be used in order to identify the advantages and disadvantages of each of them and to select the one best suited to the assigned tasks. The central question of cluster analysis, the question of the number of clusters, will also be raised, and recommendations will be given for answering it. The relevance of this work stems from the urgent need to determine optimal methods for processing large volumes of data and to solve data systematization problems in the shortest possible time. The widespread practical application of data obtained through cluster analysis also determines the relevance of this study. My thesis is devoted to certain aspects of these problems in the modern development of information technologies.

Chapter 1. Theoretical foundations of Big Data analysis

1.1 About Big Data

The term “Big Data” describes collections of data with potential exponential growth that are too large, too unformatted, or too unstructured to be analyzed by traditional methods.

Big Data technologies are a series of approaches, tools and methods for processing structured and unstructured data of huge volumes and significant diversity. These technologies are used to obtain human-perceivable results that are effective in conditions of continuous growth and distribution of information across numerous nodes of a computer network. They were formed in the late 2000s as an alternative to traditional database management systems and Business Intelligence solutions. Currently, most of the largest information technology providers to organizations use the concept of “big data” in their business strategies, and major information technology market analysts devote dedicated research to the concept.

Currently, a significant number of companies are closely following the development of these technologies. According to the McKinsey Global Institute report "Big data: The next frontier for innovation, competition, and productivity", data has become an important factor of production alongside labor and capital. The use of Big Data is becoming a basis of competitive advantage and growth for companies.

In modern conditions, organizations and companies create a huge amount of unstructured data: text, various documents, images, videos, machine codes, tables and the like. All this information is hosted and stored in multiple repositories, often outside the organization.

Organizations may have access to a huge amount of their own data, but may not have the necessary tools with which to actually establish relationships between all this data and draw meaningful conclusions based on them. Given the rapid and continuous growth of data, it becomes urgently necessary to move from traditional methods of analysis to more advanced technologies of the Big Data class.

Characteristics. Modern sources define Big Data as data on the order of terabytes and beyond. The defining traits of Big Data are the "three Vs": volume; variety; velocity (the need for very fast processing).

Figure 1 Signs of big data

· Volume. The rapid development of technology and the popularization of social networks contribute to the very rapid growth of data volumes. This data, generated by both humans and machines, is distributed in various places and formats in huge volumes.

· Velocity. This trait is the speed of data generation. Getting the needed data as quickly as possible is an important competitive advantage for solution developers, not least because different applications have different latency requirements.

· Variety. Variety refers to the different formats in which data is stored. Today, significant amounts of unstructured data are generated in the world, on top of the structured data that enterprises receive. Before the era of Big Data technology, the industry had no powerful and reliable tools capable of working with the volume of unstructured data we see today.

Consuming vast amounts of structured data generated both internally and externally is a necessity for organizations in today's world in order to remain competitive.

The "category" of Big Data traditionally includes not only ordinary spreadsheets but also unstructured data that may be stored as images, audio files, video files, web logs, sensor data, and much else. In the world of big data, this aspect of differing data formats is called variety.

Below in Figure 2 there is a comparative description of the traditional database and the Big Data database.

There are a number of industries in which data is collected and accumulated very intensively. For applications of this class, in which there is a need to store data for years, the accumulated data is classified as Extremely Big Data.

There is also an increase in the number of Big Data applications in the commercial and government sectors, the volume of data of these types of applications is located in storage facilities and often amounts to hundreds of petabytes.

Figure 2 Comparative data characteristics

The development of certain technologies makes it possible to "track" people, their habits, interests, and consumer behavior through various channels. Examples include the use of the Internet in general and shopping in online stores such as Walmart in particular (according to Wikipedia, Walmart's data storage is estimated at more than 2 petabytes), as well as traveling and moving around with mobile phones, making calls, writing letters, taking photographs, and logging into social network accounts from different parts of the planet. All of this accumulates in databases and can be put to good use thanks to fast processing of big data.

Likewise, modern medical technologies generate large amounts of data relevant to the delivery of health care (images, videos, real-time monitoring).

Big data sources. Just as data storage formats have changed, data sources have also evolved and are constantly expanding. There is a need to store data in a wide variety of formats.

With the development and advancement of technology, the amount of data that is generated is constantly growing. Big data sources can be divided into six different categories, as shown below.

Figure 3 Big Data Sources

· Enterprise data. Enterprises hold large amounts of data in different formats. Common formats include flat files, emails, Word documents, spreadsheets, presentations, HTML pages, PDF documents, XML files, legacy formats, and so on. This data, distributed throughout the organization in different formats, is called enterprise data.

· Transactional data. Every business has its own applications which involve performing different types of transactions such as web applications, mobile applications, CRM systems and many more.

To support transactions, these applications typically use one or more relational databases as the underlying infrastructure. This is mainly structured data and is called transactional data.

· Social media. Social networks such as Twitter, Facebook, and many others generate a large amount of data. Typically, social networks use unstructured data formats, including text, images, audio, and video. This category of data sources is called social media.

· Activity-generated data. This includes data from medical devices, sensor data, surveillance video, satellites, cell phone towers, industrial equipment, and other data generated primarily by machines. These types of data are called activity-generated data.

· Public data. This includes data that is publicly available, such as data published by governments, research data published by research institutes, data from meteorological departments, census data, Wikipedia, open-source data samples, and other data freely available to the public. This type is called public data.

· Archive. Organizations archive a lot of data that is either no longer needed or only rarely needed. In today's world, as hardware gets cheaper, no organization wants to delete any data; they want to store as much of it as possible. This type of data, which is accessed less frequently, is called archived data.

Examples of implementation. As an example of the implementation of this technology, the Hadoop project is most often cited, which is designed to implement distributed computing used to process impressive amounts of data.

This project is being developed within the Apache Software Foundation. Cloudera supports this project commercially.

Developers from different countries of the world are involved in the project.

Technologically, Apache Hadoop can be called a free Java framework that supports the execution of distributed applications running on large clusters built on standard hardware.

Since data processing is performed on a cluster of servers, if one of them fails, the work will be redistributed among other working ones.

It is also necessary to talk about the implementation of MapReduce technology in Hadoop, the main task of which is the automatic parallelization of data and its processing on clusters.

The core of Hadoop is the fault-tolerant distributed file system HDFS (Hadoop Distributed File System), which manages the storage layer.

The essence of the system is to split incoming data into blocks, each of which is assigned a designated position in the server pool. The system allows applications to scale to the level of thousands of nodes and petabytes of data.

1.2 Map-Reduce

In this paragraph we will talk about the Map-Reduce algorithm, which is a model for distributed computing.

The principles of its operation are based on the distribution of input data to worker nodes of a distributed file system for pre-processing (map step) and then the convolution (merging) of the pre-processed data (reduce step).

For example, to compute a total across the data, the algorithm calculates the running sums on each node of the distributed file system (the map step), then sums these running sums to obtain the final total (the reduce step).
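A single-machine sketch of this idea in R, where four "nodes" are emulated with chunks of a vector:

chunks <- split(1:1000, rep(1:4, each = 250))  # emulate splitting the data across 4 worker nodes
partial_sums <- lapply(chunks, sum)            # map step: each node computes its running sum
total <- Reduce(`+`, partial_sums)             # reduce step: merge the partial results
total                                          # 500500, the same as sum(1:1000)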

Magic Quadrant of solution providers in the field of data science platforms (Gartner, February 2017)

Figure 4 Leaders

Companies:

· Leaders: IBM, SAS, RapidMiner, KNIME

· Challengers: MathWorks, Quest (formerly Dell), Alteryx, Angoss

· Visionaries: Microsoft, H2O.ai, Dataiku, Domino Data Lab, Alpine Data

· Niche players: FICO, SAP, Teradata

1.3 Data Mining for working with Big Data

Data Mining (DM) is "a technology designed to search large volumes of data for non-obvious, objective, and practically useful patterns."

A special feature of Data Mining is the combination of a wide range of mathematical tools (from classical statistical analysis to new cybernetic methods) and the latest achievements in the field of information technology.

This technology combines strictly formalized methods and methods of informal analysis, i.e. quantitative and qualitative data analysis.

1.4 Problems solved by Data Mining methods

· Correlation - establishing a statistical dependence of continuous outputs on input variables.

· Clustering is a grouping of objects (observations, events) based on data (properties) that describe the essence of these objects. Objects within a cluster must be “similar” to each other and at the same time have differences from objects that fall into other clusters.

The clustering accuracy will be higher if the objects within the cluster are as similar as possible, and the clusters are as different as possible.

· Classification is the assignment of objects (observations, events) to one of the previously known classes.

· Association - identifying patterns between related events. An example of such a pattern is a rule indicating that event X follows from event Y. Such rules are called associative.
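As an illustration of the association task, here is a minimal sketch using the arules package (assumed installed) and its bundled Groceries transaction data:

library(arules)
data(Groceries)  # built-in retail transaction data
# Mine rules with minimum support 1% and minimum confidence 30%
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.3))
inspect(head(sort(rules, by = "lift"), 3))  # a few of the strongest rules of the form X => Y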

Conclusion to the first chapter

Big data is not just more hype on the IT market; it is a systematic, qualitative transition to building value chains based on knowledge.

In terms of its effect, it can be compared with the advent of affordable computer technology at the end of the last century.

While short-sighted conservatives will use deeply outdated approaches, enterprises that are already using Big Data technologies will in the future find themselves in leading positions and gain competitive advantages in the market. There is no doubt that all major organizations will implement this technology in the coming years, as it is both the present and the future.

Chapter 2. Cluster analysis for Big Data

Cluster analysis is a class of methods that are used to classify objects or events into fairly homogeneous groups, which will be called clusters.

The fundamental thing is that objects in clusters must be similar to each other, but at the same time they must be different from objects located in other clusters.

Figure 5 illustrates an ideal clustering situation, in which each cluster is clearly separated from the others based on differences in two variables: quality orientation (X) and price sensitivity (Y).

Figure 5 Ideal clustering situation

It should be noted that absolutely every consumer falls into one of the clusters, and there are no overlapping areas.

However, the illustration below shows the clustering situation that is most often encountered in practice.

As Figure 6 shows, the boundaries of the clusters are extremely vague, and it is not entirely clear which consumers should be assigned to which cluster, since a significant portion of them cannot be unambiguously grouped into one cluster or another.

Figure 6 Real clustering situation

In cluster analysis, groups or clusters are identified from the collected data themselves rather than in advance. Thus, there is no need to prepare preliminary information about the cluster membership of any of the objects.

Market segmentation. For example, consumers can be divided into clusters based on the benefits they expect from purchasing a given product. Each cluster then contains consumers seeking similar benefits. This approach is usually called the benefit segmentation method.

Understanding buyer behavior. Cluster analysis is used when it is necessary to identify homogeneous categories of buyers.

Determining new product opportunities. The determination of competitive groups and sets within a given market is also carried out by clustering brands and products.

Selecting Test Markets. The selection of similar cities for the purpose of testing multiple marketing strategies is done by grouping cities into homogeneous clusters.

Data dimension reduction. Cluster analysis is also used as a primary tool for reducing data dimensionality by creating clusters or subgroups of data that are more useful for analysis than individual observations. Further multivariate analysis is then performed on the clusters rather than on individual observations.

2.1 Clustering methods

There are two types of clustering methods: hierarchical and non-hierarchical.

Figure 7 Cluster analysis methods

2.2 Hierarchical methods

Hierarchical methods are divided into two types: agglomerative and divisive.

Agglomerative clustering starts with each object in a separate cluster. Objects are grouped into increasingly larger clusters. This process will continue until all objects become members of one single cluster.

Divisive clustering should also be highlighted: it starts with all objects grouped in a single cluster, and clusters are divided until each object ends up in a separate cluster. Agglomerative methods are most often used in research; they include the linkage methods as well as the variance and centroid methods.

The linkage methods include the single linkage method, the complete linkage method, and the average linkage method. Linkage methods are agglomerative hierarchical clustering methods that combine objects into a cluster based on the calculated distance between them.

Figure 8 Single linkage method

The single linkage method is based on the minimum distance, or the nearest neighbor rule: the distance between two clusters A and B is taken as d(A, B) = min { ρ(a, b) : a ∈ A, b ∈ B } (formula 1).

When forming a cluster, the first to combine are two objects, the distance between which is minimal. Next, the next shortest distance is determined, and a third object is introduced into the cluster with the first two objects.

At each stage, the distance between two clusters is the distance between their closest points. At any stage, two clusters are combined according to the shortest single distance between them.

This process continues until all objects are combined into one cluster. If the clusters are poorly defined, the single linkage method does not work well.

Figure 9 Complete linkage method

The complete linkage method is based on the maximum distance between objects, or the farthest neighbor rule. Here the distance between two clusters is calculated as the distance between their two most distant points.

Figure 10 Average linkage method

In the average linkage method, the distance between two clusters is defined as the average of all distances measured between objects of the two clusters, where each pair consists of objects from different clusters. The average linkage method uses information about all pairwise distances, not just the minimum or maximum. For this reason it is generally preferred over the single and complete linkage methods.

Variance methods form clusters in such a way as to minimize within-cluster variance.

Figure 11 Ward method

A well-known variance method used for this purpose is Ward's method, in which clusters are formed so as to minimize the sum of squared Euclidean distances to the cluster means.

For each cluster, the averages of all variables are calculated. Then, for each object, the squared Euclidean distances to the cluster means are calculated.

These squared distances are summed for all objects. At each stage, the two clusters with the smallest increase in the total intracluster variance are combined.

Figure 12 Centroid method

In the centroid method, the distance between two clusters is the distance between their centroids (the means of all variables).

The centroid method is a variance method for hierarchical clustering. Each time objects are merged, a new centroid is calculated.

Ward's method and average linkage show the best results among all hierarchical methods.
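A minimal sketch of these methods in base R, using the built-in USArrests data:

d <- dist(USArrests, method = "euclidean")   # pairwise distance matrix
fit_single <- hclust(d, method = "single")   # single linkage (nearest neighbor rule)
fit_ward <- hclust(d, method = "ward.D2")    # Ward's minimum-variance method
plot(fit_ward)                               # dendrogram (tree diagram)
groups <- cutree(fit_ward, k = 4)            # cut the tree into 4 clusters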

2.3 Non-hierarchical methods

Another type of clustering procedure is non-hierarchical clustering, often called k-means clustering. The k-means method determines cluster centers and then groups all objects lying within a specified threshold distance of a center. These methods include the sequential threshold method, the parallel threshold method, and optimizing partitioning. The k-means method minimizes the total squared deviation of the objects from their cluster centers:

V = Σ_{i=1}^{k} Σ_{x ∈ S_i} (x − μ_i)²,

where k is the number of clusters, S_i are the resulting clusters, i = 1, 2, …, k, and μ_i are the centers of mass of the vectors x ∈ S_i.

Figure 13 Example of the k-means algorithm (k=2)
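The situation in the figure can be reproduced with a minimal base-R sketch (stats::kmeans) on synthetic two-dimensional data:

set.seed(1)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 3), ncol = 2))  # two synthetic groups
fit <- kmeans(pts, centers = 2, nstart = 25)  # k = 2, with 25 random initializations
fit$centers                                   # the two cluster centroids
fit$tot.withinss                              # total within-cluster sum of squares

The nstart argument re-runs the iterative process from several random starting centers and keeps the best solution, which mitigates the sensitivity of k-means to initialization.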

In the sequential threshold method, objects lying within a threshold distance of a given center are grouped together.

The next step is to determine a new cluster center, and this process will be repeated for ungrouped points. After placing an object into a cluster with a new center, it will no longer be considered as an object for further clustering.

The parallel threshold method works similarly, but with one important difference: several cluster centers are selected simultaneously, and objects within the threshold level are grouped with the nearest center.

The optimizing partitioning method differs from the two previous threshold methods in that objects can subsequently be reassigned to other clusters (redistributed) in order to optimize an overall criterion, such as the average within-cluster distance for a given number of clusters.

The BIRCH algorithm. Thanks to generalized representations of clusters, clustering speed increases, and the algorithm is highly scalable. It implements a two-stage clustering process.

The first stage forms a preliminary set of clusters. The second stage applies other clustering algorithms, suitable for operating in RAM, to the identified clusters.

Imagine each data element as a bead lying on the surface of a table: the preliminary clusters can then be "replaced" with tennis balls, and the study moves on to examining the clusters of tennis balls in more detail.

The number of beads can be quite large, but the diameter of the tennis balls can be chosen so that at the second stage, using traditional clustering algorithms, it becomes possible to determine the actual complex shape of the clusters.

Among the new scalable algorithms, one can also note CURE, a hierarchical clustering algorithm in which the concept of a cluster is formulated using the concept of density. Many researchers are now actively working on scalable methods, whose main task is to overcome the shortcomings of the algorithms existing today.

2.4 Comparison of types of clustering

The table lists the advantages and disadvantages of the following methods: the CURE algorithm, BIRCH, MST, k-means, PAM, CLOPE, Kohonen self-organizing maps, HCM (Hard C-Means), and Fuzzy C-Means.

2.5 Statistics related to cluster analysis

The following statistics and concepts are related to cluster analysis:

1. Cluster centroid. The average value of the variables for all cases or objects in a particular cluster.

2. Cluster centers. Initial starting points in non-hierarchical clustering. Clusters are built around these centers, or clustering grains.

3. Belonging to a cluster. Indicates the cluster to which each case or object belongs.

4. Tree diagram (dendrogram) - a graphical tool for displaying clustering results. Vertical lines represent clusters being merged. The position of a vertical line on the distance scale shows the distance at which the clusters were combined. The diagram is read from left to right.

5. Coefficient of variation. A check on the quality of clustering; the ratio of the standard deviation to the mean.

7. Icicle diagram. This is a graphical display of the clustering results.

8. Similarity matrix / matrix of distances between merged objects - a lower triangular matrix containing the distances between pairs of objects or cases.

Conclusion to the second chapter

Cluster analysis can rightly be called the most convenient tool for identifying market segments. The use of these methods has become especially important in the age of high technology, in which it is so important to speed up labor-intensive and time-consuming processes. The variables used as the basis for clustering should be selected based on the experience of previous studies, theoretical premises, hypotheses under test, and the judgment of the researcher. An appropriate measure of similarity should also be chosen. A distinctive feature of hierarchical clustering is the development of a hierarchical structure. Two types of hierarchical clustering methods exist and are used: agglomerative and divisive.

Agglomerative methods include the single, complete, and average linkage methods. The most common variance method is Ward's method. Non-hierarchical clustering methods are often called k-means methods. The choice of clustering method and the choice of distance measure are interrelated. In hierarchical clustering, an important criterion for deciding on the number of clusters is the distance at which clusters merge. The relative sizes of clusters should be such that it makes sense to preserve a given cluster rather than merge it with others. Clusters are interpreted in terms of their centroids. Clusters are often also interpreted by profiling them through variables that were not the basis for clustering. The reliability and validity of clustering solutions are assessed in various ways.

Chapter 3. Algorithm for splitting retail outlets

A retail enterprise with 36,651 outlets selling confectionery products was taken as the object under study. The list of goods sold by the company includes over 350 units of products.

The purpose of this study is a comparative analysis of cluster analysis methods in solving the following problems:

1. Studying the client profile and analyzing the correspondence of the specified characteristics;

2. Division into clusters: identification of homogeneous groups;

3. Dividing the assortment of the trading enterprise into homogeneous groups.

3.1 Client profile

According to a Galileo study conducted in the second half of 2016, approximately 42 million consumers of confectionery products were covered by the survey.

From this survey it follows that the main consumers of confectionery products are women.

This can be attributed to the fact that women traditionally receive chocolate products as gifts, and the majority of confectionery lovers are women. This can be clearly seen in Figure 14.

· under 16 years of age - the main consumers of chocolate in the form of figures;

· from 16 to 24 years old - the main consumers of chocolate bars;

· chocolate bars are in most cases purchased by women aged 25 to 34 years;

· people from 25 to 45 years old are the main buyers of boxed sweets;

· People 45 and older prefer loose sweets.

Figure 14 Confectionery consumption by gender

Figure 15 shows the distribution of total consumption into 3 groups depending on income: A - low, B - medium, C - high. The lion's share of consumers falls in the middle-income group (54%), followed by the low-income group (29%); the smallest contribution comes from the high-income group (17%).

Figure 15 Confectionery consumption depending on income

This graph illustrates the audience's preferences in choosing a place of purchase; let us also consider the distribution depending on income. Obviously, the largest number of purchases is made in hypermarkets and supermarkets, and this holds for every income group.

The share of purchases in supermarkets is almost half (46%) for group C, based on which we can conclude that it is advisable to expand the line of goods popular among people with high incomes.

People with average incomes account for 41% of purchases in supermarkets, while people with low incomes have the smallest share - 37%. Next comes the share of purchases in small self-service stores; purchases in such stores are made by all three groups in equal proportions. The smallest share falls on markets and stalls, where the main contribution is made by representatives of group A, which includes a large number of pensioners who often make purchases at the market “out of habit.”

Figure 16 Places of purchase of confectionery products depending on income

The following graph clearly illustrates the degree of importance of particular product attributes for each of the three income groups. For groups A and B, the most important factor is price, while the appearance of the packaging and the country of manufacture matter little. The behavior of the high-income group is somewhat different: there, in addition to price, the brand, the appearance, and the country of production of the product are important.

Figure 17 Priorities when choosing confectionery products of different income groups

3.2 Correspondence analysis

Correspondence analysis is used to visualize contingency tables. The method makes it possible to reveal the relationship between the characteristics in the columns and rows of a table; a sketch is given below.
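A minimal sketch in R using MASS::corresp (the MASS package ships with R); the contingency table here is made up purely for illustration:

library(MASS)
# Hypothetical counts: product category by age group
tab <- matrix(c(20,  5,  3,
                15, 12,  6,
                 4, 10, 18),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("bars", "boxed sweets", "loose sweets"),
                              c("16-24", "25-44", "45+")))
res <- corresp(tab, nf = 2)  # two-dimensional correspondence analysis
biplot(res)                  # joint map: nearby row and column points indicate association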

Let us now consider the correspondence analysis of confectionery consumption by gender and age, illustrated in Figure 14, as well as Figure 15, which shows the consumption of various product categories depending on consumer income.

First, let's look at the preferences of three groups of men: aged 16-19, 20-24 and 25-34, since their consumer preferences can be characterized as almost identical.

Figure 18 Analysis of the correspondence of popular candies by age and gender

Men in these age groups prefer Snickers, Mars, Nuts, Twix, Picnic, and Kinder Bueno chocolate bars and M&M's candies. Products of this type fall into the category of "chocolate bars and other chocolate in small packages" and are most popular among people with low incomes.

This is followed by the four remaining age groups of men: 35-44, 45-54, 55-64, and 65-74. They also show approximately the same consumer behavior and are extremely passive consumers. For these groups it is fair to say that as income rises, consumption changes in inverse proportion: among men aged 35-74 with high incomes, consumer activity is the lowest.

Obviously, the niche that includes solvent men 35-74 is very promising and at the same time unoccupied, but the existing set of products is not able to satisfy the needs of this category of consumers. Based on the above, we can conclude that it makes sense to influence this target audience with some completely new product that can attract consumers.

The next step is to describe the groups of women aged 16-19, 20-24, and 25-34, who have similar consumer behavior. These groups, as a rule, prefer chocolate bars, partly the same ones preferred by men of the same age (Picnic, Twix, Nuts, etc.); the Tempo, Bounty, Kit Kat, Milky Way, Kinder Country, and "Ordinary Miracle" (Obyknovennoye Chudo) bars are also highly popular among women.

For these groups, the low-income rule also holds: as income increases, the popularity of chocolate bars decreases. Next comes the group of women aged 35-44; for them, the most popular choice is Alpen Gold, followed by Geisha sweets and mini Prichuda ("Whimsy") cakes, and this holds equally for those with low and middle incomes. With increasing age (the groups 45-54, 55-64, 65-74), preference shifts to Alenka, Korovka, Sladko, candies of the Krupskaya factory, and other domestic brands; this is most true for people with average incomes. Assessing the consumption of confectionery products as a whole, it should be noted that two thirds of all consumption falls on the female share of the population.

3.3 The main idea of cluster analysis

Before applying the clustering algorithm, all retail outlets are divided into strata. The algorithm is applied separately to each of the resulting strata. The clusters obtained for individual groups are then combined into one final set of clusters.

Let us describe the details of the clustering algorithm. Denote the number of retail outlets to which the algorithm is applied by n, the set of retail outlets by T, the Euclidean metric by ρ, and the number of features by m. The set of features and, consequently, their number depend on the stratum.

First of all, the values of all features are standardized. Standardization transforms a feature by subtracting its mean and dividing by its standard deviation. The means and standard deviations are calculated once, on the data being clustered, and become part of the clustering model; see the sketch below.
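In R, standardization is a single call (the matrix name features is a hypothetical placeholder):

z <- scale(features, center = TRUE, scale = TRUE)  # subtract means, divide by standard deviations
attr(z, "scaled:center")  # the stored means, kept as part of the clustering model
attr(z, "scaled:scale")   # the stored standard deviations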

We use the KMeans algorithm as the clustering algorithm. This algorithm requires specifying the number of clusters and the number of initializations of the iterative clustering process (or the initial centroids). The number of initializations depends on the time available for clustering. To determine the number of clusters, we run KMeans with the number of clusters k ranging from 2 to 75. Denote the resulting clustering models by M_k and, for each outlet t, let c(t) be the centroid of the cluster containing t. For each k we define the measure of within-cluster scatter

W_k = Σ_{t ∈ T} ρ(t, c(t))².

One can also consider the clustering model for the case k = 1. In this case there is a single centroid, defined as the element-wise average of all feature vectors. The resulting measure of within-cluster scatter, W_1, is called the measure of the total scatter of the retail outlets.

The ratio

W_k / W_1

can be interpreted as the proportion of unexplained differences between outlets within clusters. This ratio decreases as k grows. We define the optimal number of clusters as

k* = min { k : W_k / W_1 ≤ 0.2 }.

In other words, we select the minimum number of clusters for which the proportion of unexplained differences is no more than 20%.

Note. Instead of the value 0.2, any value from 0 to 1 can be taken. The choice depends on the restrictions on the number of clusters, as well as on the shape of the graph of W_k / W_1 versus k. Moreover, if the maximum allowable proportion of unexplained differences is fixed before clustering begins, then to find k* it is not necessary to build cluster models for every k from 2 to 75: since the ratio decreases in k, binary search can be used, which significantly speeds up the clustering.
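A simplified sketch of this selection rule in R (a linear scan is shown for clarity; a binary-search variant follows the same idea):

choose_k <- function(data, k_max = 75, threshold = 0.2) {
  total_ss <- kmeans(data, centers = 1)$tot.withinss  # k = 1: measure of total scatter, W_1
  for (k in 2:k_max) {
    fit <- kmeans(data, centers = k, nstart = 10)
    if (fit$tot.withinss / total_ss <= threshold) return(k)  # first k with ratio <= threshold
  }
  k_max
}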

As a result of clustering, we obtain the following components of the complete clustering model:

· μ, the average values of the features for the given stratum and outlet type;

· σ, the standard deviations of the features for the given stratum and type;

· k*, the optimal number of clusters for the given stratum and type;

· M_{k*}, the clustering model obtained with the optimal number of clusters for the given stratum and type.

The algorithm for applying the full clustering model is as follows. Let there be a retail outlet of a given type belonging to a given stratum, specified by a feature vector x. From x we determine the standardized vector z with elements z_j = (x_j − μ_j) / σ_j.

We apply the clustering model M_{k*} to the resulting vector and obtain the cluster number. Thus, the "cluster number" within the full clustering model consists of three parts:

· the stratum;

· the type of outlet;

· the cluster number according to the clustering model for that stratum and type (hereinafter this number will simply be called the cluster number).

3.4 Features for clustering

For clustering, it is necessary to compile a list of features that describe retail outlets. The following indicators were used to characterize retail outlets:

· Distances to places of population attraction (hereinafter referred to as MPN);

· Competitive environment. Distance to transport infrastructure facilities and other retail outlets of KA-networks and non-KA-networks (the distance to the nearest facility and the number of facilities within a radius of 1000 meters are determined);

· Solvency of the population in the vicinity of the retail outlet.

Formally, the characteristics also include stratum and type of outlet. However, clustering based on these characteristics is not carried out.

List of features for retail outlets:

1) population income (income);

2) average cost of 1 square meter of housing (sqm_price);

3) average cost of renting a one-room apartment (rent_price);

4) number of MPNs of any type within a radius of 1000 meters (num_in_radius_mpn_all);

5) number of retail outlets of non-KA chains within a radius of 1000 meters (num_in_radius_tt);

6) number of KA-chain retail outlets within a radius of 1000 meters (num_in_radius_ka);

7) number of railway stations within a radius of 1000 meters (num_in_radius_railway_station);

8) number of metro stations within a radius of 1000 meters (num_in_radius_subway_station);

9) number of ground public transport stops within a radius of 1000 meters (num_in_radius_city);

10) distance to the nearest MPN of any type (dist_to_closest_mpn);

11) distance to the nearest railway station (pts_railway_station_d01_distance);

12) distance to the nearest metro station (pts_subway_station_d01_distance);

13) distance to the nearest ground public transport stop (pts_city_d01_distance);

14) distance to the nearest non-KA-network outlet (tt_to_tt_d001_distance);

15) distance to the nearest KA-network outlet (ka_d01_distance).

3.5 Identification of points of homogeneous location

As part of the data preparation, all data were divided into homogeneous strata based on population size; this is necessary for subsequent high-quality clustering. When dividing into strata, the method of comparing means was used, and the quality of the partition was checked by the degree of difference between strata using a nonparametric analysis of variance (a minimal sketch of such a check is given immediately below). The results of applying it are as follows:
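For instance, the income comparison of the next paragraph could be run as follows (the data frame outlets and its columns are hypothetical names):

# Kruskal-Wallis test: a nonparametric one-way analysis of variance across the strata
kruskal.test(income ~ stratum, data = outlets)
# A small p-value rejects the hypothesis that income is identically distributed in all strata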

1. Population income . The hypothesis of income equality for the 4 strata was rejected (see Table 1).

Table 1 Hypothesis about population income


As can be seen from Figure 20, there is a noticeable difference in the average income. In the first stratum it is significantly higher than in the others. The lowest income was noted in the fourth stratum.

Figure 20 Comparisons between strata (population income)

2. Average cost of one square meter of housing . The hypothesis about the equality of the cost of 1 sq. meters of housing for 4 strata was rejected (see Table 2).

Table 2. Hypothesis about the average cost of 1 square meter of housing


As can be seen from Figure 21, there is a noticeable difference in the average cost of 1 sq. meters of housing. In the first stratum it is significantly higher than in the others. The lowest value is in the second stratum. Strata 3 and 4 have approximately the same cost.

Figure 21 Comparisons between strata (cost of 1 square meter of housing)

3. Average cost of renting a one-room apartment . The hypothesis about the equality of rental costs for the 4 strata was rejected (see Table 3).

Table 3 Hypothesis about the average rental cost


As can be seen from Figure 22, there is a noticeable difference in the average cost of rental housing. In the first stratum it is significantly higher than in the others. The lowest value is in the second stratum.

Figure 22 Comparisons between strata (average rental price)

4. Number of MPNs of any type within a radius of 1000 meters . The hypothesis was rejected for 4 strata (see Table 4).

Table 4 Hypothesis about the number of MPNs


As can be seen from Figure 23, there is a noticeable difference in the average number of MPNs. In the first stratum it is significantly higher than in the others. The smallest number of MPNs is in the fourth stratum.

Figure 23 Comparisons between strata (number of MPNs)

5. Number of retail outlets of non-KA networks within a radius of 1000 meters. The hypothesis was rejected for the 4 strata (see Table 5).

Table 5 Hypothesis about the number of retail outlets of non-KA chains


As can be seen from Figure 24, there is a noticeable difference in the average values. In the second stratum the average value is significantly higher than in the others. The lowest value is in the fourth stratum.

Figure 24 Comparisons between strata (Number of TT non-KA networks)

6. Number of KA-network outlets within a radius of 1000 meters. The hypothesis was rejected for the 4 strata (see Table 6).

Table 6 Hypothesis about the number of retail outlets of CA chains


As can be seen from Figure 25, there is a noticeable difference in the average values.

In the second stratum the average value is higher than in the others, and the lowest in the fourth stratum.

Figure 25 Comparisons between strata (Number of TT KA networks)

7. Number of railway stations within a radius of 1000 meters. The hypothesis was rejected for the 4 strata (see Table 7).

Table 7 Hypothesis about the number of railway stations


As can be seen from Figure 26, there is a noticeable difference in the average values.

In the first stratum the average value is higher than in the others.

The smallest number of railway stations is in the third and fourth stratum.

8. Number of ground public transport stops within a radius of 1000 meters. The hypothesis was rejected for 4 strata (see Table 8).

Table 8 Hypothesis about the number of ground transport stops


As can be seen from Figure 27, there is a noticeable difference in the average values. In the first stratum the average value is higher than in the others, the lowest value is in the 4th stratum.

Figure 27 Comparisons between strata (number of ground transport stops)

9. Distance to the nearest MPN of any type. The hypothesis was rejected for 4 strata (see Table 9).

Table 9 Hypothesis about the distance to the nearest MPN


As can be seen from Figure 28, there is a noticeable difference in the average values. In the fourth stratum the average value is higher than in the others. The lowest value was noted in the first and second stratum.

Figure 28 Comparisons between strata (distance to the nearest MPN)

10. Distance to the nearest railway station. The hypothesis was rejected for the 4 strata (see Table 10).

Table 10 Hypothesis about the distance to the nearest railway station


As can be seen from Figure 29, there is a noticeable difference in the average values. In the fourth stratum the average value is higher than in the others. The lowest value was noted in the first stratum.

Figure 29 Comparisons between strata (distance to nearest railway station)

11. Distance to the nearest metro station . The hypothesis was rejected for 4 strata (see Table 11).

Table 11 Hypothesis about the distance to the metro station


As can be seen from Figure 30, there is a noticeable difference in the average values. In the second, third and fourth stratum the average value is higher, and the lowest value is noted in the first stratum.

Figure 30 Comparisons between strata (distance to nearest metro station)

12. Distance to the nearest ground public transport stop. The hypothesis was rejected for 4 strata (see Table 12).

Table 12 Hypothesis about the distance to the nearest ground transport stop


As can be seen from Figure 31, there is a noticeable difference in the average values. In the fourth stratum, the average value is higher, and the lowest value is noted in stratum 1.

Figure 31 Comparisons between strata (distance to nearest ground transport stop)

13. Distance to the nearest non-KA-network retail outlet. The hypothesis was rejected for the 4 strata (see Table 13).

Table 13 Hypothesis about the distance to the nearest non-KA retail outlet


As can be seen from Figure 32, there is a noticeable difference in the average values. In the third stratum the average value is higher, and the lowest values are noted in the first and second strata.

Figure 32 Comparisons between strata (distance to the nearest non-CA network outlet)

14. Distance to the nearest KA-network retail outlet (see Table 14).

Table 14 Hypothesis about the distance to the nearest retail outlet of the KA network


As can be seen from Figure 33, there is a noticeable difference in the average values. In the third stratum the average value is higher, and the lowest values are noted in the first and second strata.

Figure 33 Comparisons between strata (distance to the nearest outlet of the CA network)

Thus, in the end, the results of the similarity of strata were obtained (see Table 15).

Table 15 Comparison between strata

3.5.1 Final division into strata

As a result, a division into 4 strata was chosen, with satellite cities assigned to their main cities. The stratum (field pop_strata) is determined by the population of the locality in which the retail outlet is located:

· stratum 1 - large cities with a population of more than 1 million people;

· stratum 2 - cities with a population of more than 250 thousand and up to 1 million people;

· stratum 3 - cities with a population of more than 100 thousand and less than 250 thousand people;

· stratum 4 - cities with a population of less than 100 thousand people.

3.6 Clustering objects into homogeneous groups

To identify retail outlets with similar locations, we cluster the objects within each stratum. Before applying clustering, it is necessary to separate out the more homogeneous retail outlets by location. The coefficient of variation was used to assess the quality of clustering. As a result, 36,651 retail outlets were divided into 15 clusters (36,598 outlets) plus a 16th cluster consisting of 53 anomalous outlets; by anomalous we mean points with very high sales.

To characterize the clusters, the following 7 indicators from descriptive statistics were used:

· Minimum, the lowest sales value;

· Percentile 5%;

· Percentile 25%;

· Median - the point on the scale of measured sales values above and below which half of all measured sales values lie;

· Percentile 75%;

· Percentile 95%;

· Maximum, highest sales value.

Table 16 Final division into clusters

Table 16 clearly shows the final distribution of clusters within the strata. The largest number of retail outlets belongs to the fourth stratum, and the smallest to the third.

· Stratum 1. For the first stratum (4,402 retail outlets), applying the k-means method (chapter 2, paragraph 2.3) yielded an optimal division into 4 clusters based on 15 features. The number of clusters was chosen by optimizing the Akaike criterion.

· 1st cluster - includes retail outlets whose areas are close to the center of large cities, or outlets located in shopping centers.

Cluster profile: This cluster is characterized by a significant number of places of population attraction (MPN), a high concentration of shopping areas, and developed infrastructure.

Figure 34 Proportion of clusters in the first stratum

It accounts for 61.5% of the retail outlets of the stratum, with 2,708 retail outlets in the cluster. Average monthly sales at the outlets of this cluster are estimated in the range of 3 to 7 thousand rubles. The average income of the population is 34-36 thousand rubles, which is above average and ahead of most other clusters on this indicator.

The average cost of 1 square meter of housing is 63-64 thousand rubles, which can be called average. The average cost of renting a one-room apartment is estimated at 14-15 thousand rubles, which can also be described as average in comparison with other clusters.

The number of places of attraction for the population of any type within a radius of 1000 meters is from 32 to 47, which is above average, and the number of retail outlets of non-KA chains within a radius of 1000 meters is about 40 to 53, which is also above average. Retail outlets of KA chains within a radius of 1000 meters are represented on average by 10 units. The presence of railway stations within a radius of 1000 meters is estimated as no more than two.

This cluster is characterized by the complete absence of metro stations within a radius of 1000 meters. The number of ground public transport stops within a radius of 1000 meters is 13-20 units.

Geographic characteristics of the cluster: The distance to the nearest place of attraction of the population of any type is minimal (nearby). The distance to the nearest railway station can be described as high (far). There is no metro station in the area. The distance to the nearest ground public transport stop is low (nearby). The distance to the nearest non-KA-network outlet is minimal (nearby), and the distance to the nearest KA-network outlet is a little greater but also small (close).

· 2nd cluster - These are residential (dormitory) areas of large cities.

Cluster Profile : A small number of MPNs, low concentration of human flow, shopping areas.

Main quantitative and qualitative characteristics of the cluster: Accounts for 12.2% of the retail outlets in the stratum, with 539 retail outlets in the cluster. Average monthly sales are estimated in the range of 3 to 8 thousand rubles. The average income of the population is estimated at 34 thousand rubles, which is similar to the indicators of the 1st and 3rd clusters of this stratum but higher than the indicators of most clusters in other strata.

The average cost of 1 square meter of housing is 61-63 thousand rubles, and the average cost of renting a one-room apartment is 14-15 thousand rubles, as in the first cluster. The number of places of attraction of the population of any type within a radius of 1000 meters is 7-8, and retail outlets of non-KA chains within a radius of 1000 meters number from 24 to 43. The number of KA-network outlets within a radius of 1000 meters is 2, and there are no more than two railway stations. An important characteristic is the absence of metro stations within a radius of 1000 meters. The number of ground public transport stops within a radius of 1000 meters is on average 3-4.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is quite low (close). The distance to the nearest railway station is high (far). There is a complete absence of metro stations. A characteristic that differs from cluster 1 is the large distance to the nearest public transport stop (far). The distance to the nearest non-KA-chain retail outlet is low (nearby), while the distance to the nearest KA-network outlet is high (far).

· 3rd cluster - the center of large cities.

Cluster Profile : The highest values ​​for the number of places of population attraction, indicators of trade activity and other places indicating a high level of economic activity and human flow.

Main quantitative and qualitative characteristics of the cluster: Accounts for 25.9% of the retail outlets in the stratum, or 1,139 outlets. Average monthly sales range from 3.2 to 10 thousand rubles. The average income of the population is 36 thousand rubles, a fairly good indicator (above average).

The average cost of one square meter of housing is estimated at 63-68 thousand rubles, and the average rent for a one-room apartment at roughly 14-15 thousand rubles, no different from the 1st and 2nd clusters. The number of MPNs of any type within a radius of 1000 meters is high, at 51-66, and there are 46-55 non-KA-chain outlets within the same radius, also a high figure.

The number of KA-chain outlets within a radius of 1000 meters is about 15, which is high. There are one or two railway stations within a radius of 1000 meters. The number of metro stations within a radius of 1000 meters is on average one, and no more than three. The number of ground public transport stops within a radius of 1000 meters is 20-30, a very high figure.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is low (nearby). The distance to the nearest railway station is also low (close), as is the distance to the nearest metro station (close).

The nearest ground public transport stop is a very short distance away (nearby). The distance to the nearest non-KA-chain outlet is low (nearby), and the nearest KA-chain outlet is also very close.

· 4th cluster - expensive residential areas and private housing remote from the center.

Cluster Profile: The highest values of the cost characteristics (income, real estate) and the lowest values of MPN counts and trade indicators. It makes up only 0.4% of the retail outlets in the stratum.

Main quantitative and qualitative characteristics of the cluster: The cluster includes only 16 retail outlets and is the smallest in the stratum. Monthly sales range from 4 to 40 thousand rubles. The average monthly income of the population is 49-66 thousand rubles, a very high figure. The average cost of one square meter of housing is also very high, at 85-124 thousand rubles. The average rent for a one-room apartment is higher than in the other clusters of this stratum, at 21-34 thousand rubles. The number of MPNs of any type within a radius of 1000 meters is low, from 4 to 20. There are no non-KA-chain outlets within a radius of 1000 meters. There are about 2 KA-chain outlets, no more than one railway station, and no more than two metro stations within a radius of 1000 meters. There is only one ground public transport stop within a radius of 1000 meters.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is low (close). The distance to the nearest railway station is high (far). There are no metro stations nearby. The distance to the nearest ground public transport stop is high (far). The distance to the nearest non-KA-chain outlet is very high (far), and the cluster is characterized by the absence of KA-chain outlets nearby.

For the second stratum (9,269 retail outlets), applying the k-means method (Chapter 2, paragraph 2.3) yielded an optimal division into 5 clusters based on 15 characteristics. The number of clusters was chosen by optimizing the Akaike criterion.
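As a rough illustration of this selection step, the sketch below scores candidate cluster counts with the Akaike criterion. Scikit-learn's k-means exposes no AIC of its own, so a Gaussian mixture model (whose aic() method is built in) stands in for it here; the data matrix is synthetic and purely illustrative.

```python
# A minimal sketch of AIC-based selection of the number of clusters.
# Assumption: a Gaussian mixture serves as a stand-in for k-means.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))        # synthetic stand-in for 15 outlet features

best_k, best_aic = None, np.inf
for k in range(2, 11):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    aic = gm.aic(X)                   # Akaike criterion for this k
    if aic < best_aic:
        best_k, best_aic = k, aic
print("AIC-optimal number of clusters:", best_k)
```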

Figure 35 Proportion of clusters in the second stratum

· 5th cluster - These are the outskirts of cities, small settlements.

Cluster Profile: Average values of infrastructure development indicators (railway stations and transport stops are present). Trading activity comes only from some non-KA chains. The lowest values of economic activity indicators in the stratum.

Main quantitative and qualitative characteristics of the cluster: Accounts for 10% of the retail outlets in the stratum, or 892 outlets. Average monthly sales are estimated at 2.4 to 6 thousand rubles. The income of the population averages 27 thousand rubles, which is low in comparison with the clusters of the first stratum.

The average cost of one square meter of housing fluctuates around 47-53 thousand rubles, also lower than in the first stratum, and the average rent for a one-room apartment is 12 thousand rubles. The number of MPNs of any type within a radius of 1000 meters ranges from 2 to 5. There are 9-30 non-KA-chain outlets within a radius of 1000 meters, and a complete absence of KA-chain outlets. There are no more than two railway stations and on average two ground public transport stops within a radius of 1000 meters.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is low (not far). The distance to the nearest railway station is high (far), as is the distance to the nearest ground public transport stop (far). The distance to the nearest non-KA-chain outlet is insignificant (nearby), while the nearest KA-chain outlets are far away.

· 6th cluster - residential (dormitory) areas of cities.

Cluster Profile: Average indicators of trading activity, driven by non-KA chains, and average indicators of economic activity, driven by nearby MPNs.

Main quantitative and qualitative characteristics of the cluster: The cluster makes up 15% of the retail outlets in the stratum and includes 1,345 outlets. Monthly sales are estimated at 3-6 thousand rubles. The average income of the population is 26 thousand rubles, average for this stratum. The average cost of one square meter of housing is 53 thousand rubles, and the average rent for a one-room apartment is 12 thousand rubles, as in the previous cluster. The number of MPNs of any type within a radius of 1000 meters is 18-25, and the number of non-KA-chain outlets is 30 to 44. The number of KA-chain outlets within a radius of 1000 meters averages 6-9, a high figure. There are no more than two railway stations within a radius of 1000 meters, and a complete absence of ground public transport stops.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is low (nearby), and the nearest railway station is also close. The distance to the nearest ground public transport stop is high (far). The nearest non-KA-chain outlet is close, as is the nearest KA-chain outlet.

· 7th cluster - areas close to city centers, near highways.

Cluster Profile: High indicators of trade activity and infrastructure development (ground transport stops), average MPN indicators.

Main quantitative and qualitative characteristics of the cluster: It makes up 34% of the retail outlets in the stratum. This cluster includes 3,194 outlets and, along with the 8th cluster, is the largest in the stratum. Monthly sales are estimated at 2 to 6 thousand rubles, and the average income of the population is 28 thousand rubles. The average cost of one square meter of housing is 42-49 thousand rubles, lower than in the 5th and 6th clusters, while the average rent for a one-room apartment is practically no different from the clusters considered earlier, at 11-12 thousand rubles. The number of MPNs of any type within a radius of 1000 meters is 21-33, the number of non-KA-chain outlets is about 50, and the number of KA-chain outlets averages 7-10. There are no railway stations within a radius of 1000 meters, and about 14 ground public transport stops.

Geographic characteristics of the cluster: The distance to the nearest MPN of any type is low, while the distance to the nearest railway station is high. The nearest ground public transport stop is close. The distance to the nearest non-KA-chain outlet is low (nearby), and the nearest KA-chain outlet is also close.

· 8th cluster - these are the centers of small cities (~500 thousand people).

Cluster Profile : A significant number of MPNs, a high concentration of shopping areas, and low infrastructure indicators.

Main quantitative and qualitative characteristics of the cluster: It makes up 34% of the retail outlets in the stratum. This cluster includes 3,191 outlets and, along with the 7th cluster, is the largest in the stratum. Average monthly sales are 3-8 thousand rubles. The average monthly income of the population is estimated at 28 thousand rubles. The average cost of one square meter of housing is 47-50 thousand rubles, and the average rent for a one-room apartment is 12 thousand rubles. The number of MPNs of any type within a radius of 1000 meters averages 28-40, the number of non-KA-chain outlets is 38 to 52, and the number of KA-chain outlets is 7 to 11. There are no railway stations within a radius of 1000 meters, and ground public transport stops are almost entirely absent.

Geographic characteristics of the cluster: The nearest MPN of any type is nearby. The distance to the nearest railway station is high (far), as is the distance to the nearest ground public transport stop (far). The nearest non-KA-chain outlet is close, and so is the nearest KA-chain outlet.

· 9th cluster - These are city centers with a population of up to 1 million people.

Cluster Profile : The highest values ​​of economic and trade activity indicators in the stratum.

Main quantitative and qualitative characteristics of the cluster: It makes up 7% of the retail outlets in the stratum. This cluster includes 647 outlets and is the smallest in the stratum. Monthly sales of 6-8 thousand rubles are higher than in the other clusters of this stratum. The income of the population, as in the other clusters of the stratum, is estimated at 28 thousand rubles. The average cost of one square meter of housing is 50-53 thousand rubles, and the average rent for a one-room apartment, at 12 thousand rubles, does not differ from the other clusters of the stratum.

The number of MPNs of any type within a radius of 1000 meters is about 90, a very high indicator, and the number of non-KA-chain outlets is about 155, also very high. The number of KA-chain outlets within a radius of 1000 meters is 20-21. There are no railway stations within a radius of 1000 meters.

The number of ground public transport stops within a radius of 1000 meters is about 15-18.

Geographic characteristics of the cluster: The nearest MPN of any type is nearby, but the nearest railway station is far away. The nearest public transport stop is not far. The distance to the nearest non-KA-chain outlet is low (nearby), and the nearest KA-chain outlet is also close.

For the third stratum (1,958 retail outlets), applying the k-means method (Chapter 2, paragraph 2.3) yielded an optimal division into 2 clusters based on 13 characteristics, since this stratum has no retail outlets close to the metro. The number of clusters was chosen by optimizing the Akaike criterion.

Figure 36 Proportion of clusters in the third stratum

· 10th cluster - These are remote areas and cities with smaller populations.

Cluster Profile : Low economic activity, average degree of trading activity.

Main quantitative and qualitative characteristics of the cluster: It makes up 55% of the retail outlets in the stratum, or 1,084 outlets. The income of the population is estimated at 24 thousand rubles, lower than in the 1st and 2nd strata. Average monthly sales are estimated at 18 thousand rubles, significantly higher than in the 1st and 2nd strata. The cluster is characterized by the absence of MPNs of any type within a radius of 1000 meters. The number of non-KA-chain outlets within a radius of 1000 meters is 15 to 40, and there are about 3 KA-chain outlets. As a rule, there are no railway stations within a radius of 1000 meters. 75% of the points have no ground public transport stops within a radius of 1000 meters; the remaining 25% have up to 20.

Geographic characteristics of the cluster: There are no MPNs of any type nearby, and there are no railway stations either. There are no public transport stops nearby. The distance to the nearest non-ka-network outlet is low - it is nearby, and the nearest ka-network outlet is also close.

· 11th cluster - centers of small towns, shopping areas.

Cluster profile: Significant degree of economic and trade activity.

As a rule, there are no railway stations within a radius of 1000 meters. As for ground public transport stops within a radius of 1000 meters, 75% of the points have none; the remaining 25% have up to 22.

Geographic characteristics of the cluster : The distance to the nearest MPN of any type is low, and there are no railway stations nearby, as well as ground public transport stops. The distance to the nearest non-ka-network outlet is low, the points are located nearby. The distance to the nearest ka-network outlet is also low.

For the fourth stratum (20,969 retail outlets), applying the k-means method (Chapter 2, paragraph 2.3) yielded an optimal division into 4 clusters based on 12 characteristics, since this stratum has no retail outlets close to transport infrastructure. The number of clusters was chosen by optimizing the Akaike criterion.

Figure 37 Proportion of clusters in the fourth stratum

· 12th cluster - outskirts of small settlements.

Cluster Profile: The lowest income levels, no transport infrastructure, few stores.

Main quantitative and qualitative characteristics of the cluster: It accounts for 37% of the retail outlets in the stratum, or 7,682 outlets. The income of the population is estimated at 18-20 thousand rubles, significantly lower than in the other strata.

Monthly sales amount to 19-35 thousand rubles. There are no MPNs of any type within a radius of 1000 meters. The number of non-KA-chain outlets within a radius of 1000 meters is 3-8, and KA-chain outlets are absent. There are no railway stations or ground public transport stops within a radius of 1000 meters. Geographic characteristics of the cluster: The distance to the nearest MPN of any type is large (far), as is the distance to the nearest railway station and to the nearest ground public transport stop. The nearest non-KA-chain outlet is close, but the nearest KA-chain outlet is far away.

· 13th cluster - shopping areas of small towns.

Cluster Profile : Average indicators of trading activity, weak signs of the presence of transport infrastructure.

Main quantitative and qualitative characteristics of the cluster: It makes up 31% of the retail outlets in the stratum, or 6,514 outlets. The population's income is estimated at 21-24 thousand rubles, significantly lower than in the other strata but higher than in the 12th cluster of this stratum.

Monthly sales are 21-46 thousand rubles. There are no MPNs of any type within a radius of 1000 meters. The number of non-ka-network outlets within a radius of 1000 meters is 18-28. There are 2-3 ka-network outlets within a radius of 1000 meters. There are no railway stations within a radius of 1000 meters.

Most have no ground public transport stops within a radius of 1000 meters, some have up to 3.

Geographic characteristics of the cluster: The nearest MPN of any type is far away, as are the nearest railway station and the nearest ground public transport stop. The nearest non-KA-chain outlet is nearby, and the distance to the nearest KA-chain outlet is low (not far, up to 1 km).

· 14th cluster - small towns with the lowest level of trade activity.

Cluster Profile : The lowest indicators of trading activity, with a minimum set of stores. Average income level of the population.

Main quantitative and qualitative characteristics of the cluster: It makes up 20% of the retail outlets in the stratum, or 4,188 outlets. The population's income is estimated at 24-26 thousand rubles, significantly lower than in the other strata but higher than in the 12th and 13th clusters of this stratum. Monthly sales amount to 21-38 thousand rubles.

Complete absence of MPN of any type within a radius of 1000 meters.

The number of non-KA-chain outlets within a radius of 1000 meters is from 1 to 4, and there are no KA-chain outlets. There are also no railway stations or ground public transport stops within a radius of 1000 meters.

Geographic characteristics of the cluster: The nearest MPN of any type is far away, as are the nearest railway station and the nearest ground public transport stop. For the nearest non-KA-chain outlet, half the points are within 400 m and the rest are far away. The nearest KA-chain outlet is far.

· 15th cluster - economically active settlements with a population of less than 100 thousand people.

Cluster Profile : The only cluster where there are signs of economic activity in the stratum. The highest rates of trading activity.

Main quantitative and qualitative characteristics of the cluster: It makes up 12% of the retail outlets in the stratum, or 2,585 outlets. The income of the population is 25-28 thousand rubles, significantly lower than in the other strata but higher than in the other clusters of this stratum. Monthly sales of 24-52 thousand rubles are the highest figure across all strata.

There are 2-7 MPNs of any type within a radius of 1000 meters. The number of non-KA-chain outlets within a radius of 1000 meters is 14 to 28, and the number of KA-chain outlets is 1 to 4. There are no railway stations within a radius of 1000 meters. Most points have no ground public transport stops within a radius of 1000 meters; some have up to 7.

Geographic characteristics of the cluster: The nearest MPN of any type is close, but the nearest railway station and the nearest ground public transport stop are far. The distance to the nearest non-KA-chain outlet is low (nearby). The nearest KA-chain outlet is within 500 m for half the points and far for the rest.

3.7 Clustering of the assortment of retail outlets

Figure 38 Number of TTs with grouped assortment

By applying a two-step cluster analysis method, the assortment of retail outlets was divided into 5 clusters. The silhouette measure is 0.2, indicating average quality of cluster separation. The size of each cluster can be seen in the figure below. The largest is the first cluster, which accounts for almost 59% of all retail outlets (17,622 outlets); the smallest is cluster 5, at almost 2% (452 outlets). The difference from the clustering of retail outlets: here goods were divided into groups that are as dissimilar from each other as possible, whereas TTs (retail outlets) were combined on the principle of similarity; a sketch of the silhouette check follows the figures below.

Table 17 Share of each cluster


Figure 39 Breadth of assortment in each cluster
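For readers reproducing this step outside SPSS, the sketch below computes the silhouette measure quoted above. The two-step (TwoStep) procedure has no direct scikit-learn equivalent, so k-means labels on synthetic data stand in for the real assortment clustering.

```python
# A hedged sketch of the silhouette check; data and labels are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 8))        # stand-in for assortment features
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", round(silhouette_score(X, labels), 2))
```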

· First cluster - This is the assortment group with the smallest selection: sweets and chocolate bars in small packages. Such products are most likely sold at gas stations or in small kiosks. The five best-selling products in this cluster are: Babaevsky bitter chocolate 100 grams, Alenka chocolate 15 grams, Alenka chocolate 100 grams, “Good Company” confectionery bar with wafer crumbs 80 grams, and “Good Company” chocolate bar with peanuts 80 grams.

· Second cluster - This assortment group with an average selection corresponds to stores in cities with a population of more than 250 thousand people. The five best-selling products in this cluster are: “Good Company” confectionery bar with wafer crumbs 80 grams, Alenka chocolate 20 grams, Alenka chocolate with lots of milk 100 grams, “Good Company” chocolate bar with peanuts 80 grams, and Alenka milk chocolate with multi-colored dragees.

· Third cluster - This group offers a small selection of products, mainly chocolate and waffle cakes. This category covers stores in small towns and villages. The five best-selling products in this cluster are: Alenka chocolate 100 grams, Alenka chocolate 15 grams, Alenka chocolate 20 grams, Moskvichka caramel, and Babaevsky bitter chocolate 100 grams.

· Fourth cluster - This cluster has a wide assortment. This group of products corresponds to large branded confectionery stores in large cities. The five best-selling products in this cluster are: Alenka chocolate 100 grams, Moskvichka caramel, Babaevsky bitter chocolate 100 grams, Korovka wafers with baked milk flavor, and Romashka candies.

· Fifth cluster - This cluster has the widest assortment. This group of products corresponds to large branded confectionery stores in satellite cities. The five best-selling products in this cluster are: Bird's Milk candies, Moskvichka caramel, Alenka chocolate 100 grams, Babaevsky bitter chocolate 100 grams, and Korovka wafers with baked milk flavor.

We can conclude that the most popular product is Alenka chocolate. It is this product that is found among the leaders in each cluster.

Conclusion to the third chapter

Research conducted using cluster analysis helped divide retail outlets into strata by location, after which each stratum was divided into clusters. This cluster analysis reduced heterogeneity by a factor of 1.77. The relationships between socio-demographic indicators (gender, age, income) and consumer behavior were analyzed and identified. Clustering of the assortment of retail outlets was also carried out, which revealed that the cluster with the largest number of outlets has the smallest assortment.

Conclusion

Big data is not just another hype in the IT market, it is a systematic, high-quality transition to the creation of value chains based on knowledge. In terms of its effect, it can be compared with the advent of affordable computer technology at the end of the last century. While short-sighted conservatives will use deeply outdated approaches, enterprises that are already using Big Data technologies will in the future find themselves in leading positions and gain competitive advantages in the market. There is no doubt that all major organizations will implement this technology in the coming years, as it is both the present and the future.

This thesis represents a scientific, systematic approach to choosing the location of retail outlets, and the methods of obtaining and analyzing information to obtain the final result are very low-cost, allowing such a procedure to be carried out even by individual entrepreneurs with a small turnover of funds.

Given the growing pace of information accumulation, there is an urgent need for data analysis technologies, which, in this regard, are also rapidly developing. The development of these technologies in recent years has made it possible to move from segmenting customers into groups with similar preferences to building models in real time, based, among other things, on their Internet requests and visits to certain pages. It becomes possible to target specific offers and advertisements based on an analysis of consumer interests, making these offers much more targeted. It is also possible to adjust and reconfigure the model in real time.

Cluster analysis can truly be called the most convenient and optimal tool for identifying market segments. The use of these methods has become especially important in the age of high technology, when it is so important to speed up labor-intensive and time-consuming processes. The variables used as the basis for clustering should be selected based on the experience of previous studies, theoretical premises, and tested hypotheses, as well as the judgment of the researcher, and an appropriate similarity measure should be chosen. A distinctive feature of hierarchical clustering is that it develops a hierarchical structure. The most common and effective variance-based method is Ward's method. Non-hierarchical clustering methods are often called k-means methods. The choice of clustering method and the choice of distance measure are interrelated. In hierarchical clustering, an important criterion for deciding on the number of clusters is the distances at which clusters merge. Cluster sizes should be such that it makes sense to keep a cluster rather than merge it with others. The reliability and validity of clustering solutions are assessed in different ways.

Research conducted using cluster analysis helped divide retail outlets into strata by location, after which each stratum was divided into clusters. This cluster analysis reduced heterogeneity by a factor of 1.77. The relationships between socio-demographic indicators (gender, age, income) and consumer behavior were analyzed and identified. Clustering of the assortment of retail outlets was also carried out, which revealed that the cluster with the largest number of outlets has the smallest assortment.

Bibliography

1. StatSoft. Electronic Textbook on Statistics.
2. Mandel I. D. Cluster Analysis. 1988.
3. Paklin N. "Data Clustering: A Scalable CLOPE Algorithm."
4. Aldenderfer M. S., Blashfield R. K. Cluster Analysis // Factor, Discriminant and Cluster Analysis: trans. from English; ed. I. S. Enyukov. M.: Finance and Statistics, 1989. 215 p.
5. Fasulo D. "An Analysis of Recent Work on Clustering Algorithms."
6. Duran B., Odell P. Cluster Analysis. M.: Statistics, 1977.
7. Jambu M. Hierarchical Cluster Analysis and Correspondences. 1988.
8. Khaidukov D. S. Application of Cluster Analysis in Public Administration // Philosophy of Mathematics: Current Problems. M.: MAKS Press, 2009. 287 p.
9. Classification and Clustering / ed. J. Van Ryzin. M.: Mir, 1980.
10. Tryon R. C. Cluster Analysis. London, 1939. 139 p.
11. Berikov V. S., Lbov G. S. Modern Trends in Cluster Analysis. 2008. 67 p.
12. Vyatchenin D. A. Fuzzy Methods of Automatic Classification. Minsk: Technoprint, 2004. 320 p.
13. Chubukova I. A. Data Mining: A Tutorial. M.: Internet University of Information Technologies.
14. Paklin N. "Clustering of Categorical Data: A Scalable CLOPE Algorithm."
16. Guha S., Rastogi R., Shim K. "CURE: An Efficient Clustering Algorithm for Large Databases." Electronic edition.
17. Zhang T., Ramakrishnan R., Livny M. "BIRCH: An Efficient Data Clustering Method for Very Large Databases."
18. Paklin N. "Clustering Algorithms for the Data Mining Service."
19. Janson J. "Modeling."
20. Chubukova I. A. Data Mining: A Textbook. 2006.
21. Maheshwari A. "Data Analytics Made Accessible."
22. Mayer-Schönberger V., Cukier K. "Big Data: A Revolution That Will Transform How We Live, Work, and Think."
23. O'Neil C., Schutt R. "Doing Data Science."

There are two main types of cluster analysis in statistics (both available in SPSS): hierarchical and k-means. In the first case, an automated statistical procedure independently determines the optimal number of clusters and a number of other parameters required for the analysis. The second type has significant limitations in practical applicability: for it, the researcher must independently specify the exact number of clusters to be extracted, the initial values of the cluster centers (centroids), and some other statistics. With the k-means method, these problems are solved by first conducting a hierarchical cluster analysis and then, based on its results, calculating a cluster model using k-means; in most cases this not only fails to simplify, but, on the contrary, complicates the work of the researcher (especially an untrained one).

In general, because hierarchical cluster analysis is very demanding on computing resources, k-means cluster analysis was introduced in SPSS for processing very large data sets of many thousands of observations (respondents) under conditions of insufficient computing power. Sample sizes used in marketing research rarely exceed four thousand respondents. The practice of marketing research shows that it is the first type of cluster analysis, hierarchical, that is recommended in all cases as the most relevant, universal, and accurate. However, it must be emphasized that the selection of relevant variables is important when conducting cluster analysis: including even one irrelevant variable can cause the entire statistical procedure to fail.
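Outside SPSS, the hierarchical-then-k-means workflow described above looks roughly like the sketch below: a hierarchical pass suggests the number of clusters and starting centroids, which then seed k-means. The data and the choice of three clusters are assumptions for illustration only.

```python
# A sketch of seeding k-means with a hierarchical solution.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 7))                       # synthetic ratings matrix

Z = linkage(X, method="average")                    # between-groups (average) linkage
labels = fcluster(Z, t=3, criterion="maxclust")     # assume 3 clusters were chosen
centroids = np.vstack([X[labels == c].mean(axis=0) for c in np.unique(labels)])

km = KMeans(n_clusters=3, init=centroids, n_init=1).fit(X)   # refine with k-means
```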

We will describe the methodology for conducting cluster analysis using the following example from the practice of marketing research.

Initial data:

During the study, 745 air passengers flying with one of 22 Russian and foreign airlines were surveyed. The passengers were asked to rate, on a five-point scale from 1 (very bad) to 5 (excellent), seven aspects of the work of airline ground staff during check-in: politeness, professionalism, efficiency, willingness to help, queue management, appearance, and the work of the staff in general.

Required:

To segment the airlines under study according to the level of quality of work of ground staff perceived by air passengers.

So, we have a data file consisting of seven interval variables with ratings of the quality of work of the ground staff of various airlines (q13-q19), given on a common five-point scale. The file also contains a categorical variable q4 encoding the airlines chosen by respondents (22 names in total). We will conduct a cluster analysis and determine into which target groups these airlines can be divided.

Hierarchical cluster analysis is carried out in two stages. The result of the first stage is the number of clusters (target segments) into which the sample should be divided. The cluster analysis procedure cannot determine the optimal number of clusters on its own; it can only suggest it. Since determining the optimal number of segments is a key task, it is usually solved at a separate stage of the analysis. At the second stage, observations are actually clustered into the number of clusters determined at the first stage. Let us now consider these steps of cluster analysis in order.

The cluster analysis procedure is launched via the menu Analyze > Classify > Hierarchical Cluster. In the dialog box that opens, from the left-hand list of all variables in the data file, select the variables that serve as segmentation criteria. In our case there are seven of them, holding the ratings of ground staff performance, q13-q19 (Fig. 5.44). Specifying the set of segmentation criteria is sufficient for the first stage of cluster analysis.

Fig. 5.44.

By default, in addition to the table with the results of cluster formation, on the basis of which we will determine their optimal number, SPSS also displays a special inverted icicle plot which, according to the program's creators, helps determine the optimal number of clusters; charts are enabled via the Plots button (Fig. 5.45). However, if we leave this option on, we will spend a long time processing even a relatively small data file. In addition to Icicle, the faster Dendrogram chart can be selected in the Plots window. It consists of horizontal bars reflecting the process of cluster formation. In theory, with a small number of respondents (up to 50-100), this diagram really does help choose the optimal number of clusters. However, in almost all marketing research examples the sample size exceeds this value, and the dendrogram becomes useless: even with a relatively small number of observations it is a very long sequence of row numbers from the data file, connected by horizontal and vertical lines. Most SPSS textbooks use examples of cluster analysis on precisely such artificial, small samples. In this manual, we show how to work most effectively with SPSS in a practical setting, using real-life marketing research as an example.

Fig. 5.45.

As we have established, neither Icicle nor Dendrogram is suitable for practical purposes. Therefore, do not display plots: in the main Hierarchical Cluster Analysis dialog box, deselect the Plots option in the Display area, as shown in Fig. 5.44. Now you are ready to perform the first stage of cluster analysis. Start the procedure by clicking the OK button.

After a while, the results will appear in the SPSS Viewer window. As mentioned above, the only result from the first stage that matters to us is the Average Linkage (Between Groups) agglomeration table shown in Fig. 5.46. Based on this table, we must determine the optimal number of clusters. Note that there is no single universal method for doing so; in each specific case, the researcher must determine this number himself.

Based on existing experience, the author proposes the following scheme. First, apply the most common standard method for determining the number of clusters: using the Average Linkage (Between Groups) table, determine at which step of the cluster formation process (Stage column) the first relatively large jump in the agglomeration coefficient occurs (Coefficients column). This jump means that up to that point, observations at fairly small distances from each other (in our case, respondents with similar levels of ratings on the analyzed parameters) were being merged into clusters, and from that stage onward, more distant observations are merged.

In our case, the coefficients increase smoothly from 0 to 7.452; that is, the differences between coefficients at steps 1 through 728 were small (for example, 0.534 between steps 727 and 728). Starting from step 729, the first significant jump occurs: from 7.452 to 10.364 (by 2.912). The step at which the first jump occurs is therefore 729. To determine the optimal number of clusters, subtract this value from the total number of observations (the sample size). The total sample size in our case is 745 people, so the optimal number of clusters is 745 - 729 = 16.


Fig. 5.46.
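The jump rule is easy to mechanize. The sketch below rebuilds the agglomeration schedule with scipy and applies the subtraction described above; as an assumption, the single largest jump stands in for the "first relatively large" one, and the data are synthetic blobs rather than the survey file.

```python
# A sketch of the agglomeration-coefficient jump rule.
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=745, n_features=7, centers=16, random_state=0)

Z = linkage(X, method="average", metric="sqeuclidean")
coeffs = Z[:, 2]                     # agglomeration coefficient at each stage
jumps = np.diff(coeffs)
stage = int(np.argmax(jumps)) + 2    # 1-based stage where the largest jump lands
print("suggested number of clusters:", len(X) - stage)
```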

We received a fairly large number of clusters, which will be difficult to interpret in the future. Therefore, now you should examine the resulting clusters and determine which of them are significant and which you should try to reduce. This problem is solved at the second stage of cluster analysis.

Open the main dialog box of the cluster analysis procedure (menu Analyze > Classify > Hierarchical Cluster). The seven analyzed variables are already in the variables field. Click the Save button. The dialog box that opens (Fig. 5.47) allows you to create a new variable in the source data file assigning respondents to target groups. Select the Single Solution option and enter the required number of clusters, 16 (determined at the first stage), in the corresponding field. Click Continue to return to the main dialog box, then click OK to start the cluster analysis procedure.

Before continuing with the description of the cluster analysis process, it is necessary to provide a brief description of other parameters. Among them there are both useful features and actually unnecessary ones (from the point of view of practical marketing research). For example, the main Hierarchical Cluster Analysis dialog box contains a Label Cases by field in which you can optionally place a text variable identifying the respondents. In our case, the variable q4, encoding the airlines selected by respondents, can serve for these purposes. In practice, it is difficult to come up with a rational explanation for using the Label Cases by field, so you can safely always leave it empty.

Fig. 5.47.

The Statistics dialog box, opened by the button of the same name in the main dialog, is rarely used in cluster analysis. It lets you output to the SPSS Viewer a Cluster Membership table, in which each respondent in the data file is matched with a cluster number. With a sufficiently large number of respondents (as in almost all marketing research examples), this table is useless: it is a long sequence of respondent number/cluster number pairs that cannot be interpreted in this form. Technically, the goal of cluster analysis is always to create an additional variable in the data file reflecting the division of respondents into target groups (via the Save button in the main dialog box); this variable, together with the respondent numbers, contains the same information as the Cluster Membership table. The only practically useful option in the Statistics window is the display of the Average Linkage (Between Groups) table, and it is already set by default. So using the Statistics button to output a separate Cluster Membership table is not practical.

The Plots button has already been mentioned above: it should be deactivated by deselecting the Plots option in the main cluster analysis dialog box.

In addition to these rarely used features of the cluster analysis procedure, SPSS offers some very useful parameters. Among them, first of all, is the Save button, which allows you to create a new variable in the original data file that distributes respondents into clusters. Also in the main dialog box there is an area for selecting a clustering object: respondents or variables. This possibility was discussed above in section 5.4. In the first case, cluster analysis is used mainly to segment respondents according to some criteria; in the second, the purpose of cluster analysis is similar to factor analysis: classification (reduction of the number) of variables.

As can be seen from Fig. 5.44, the only capability of cluster analysis not yet considered is the Method button for choosing the statistical procedure. Experimenting with this parameter allows you to achieve greater accuracy in determining the optimal number of clusters. The general view of this dialog box with default settings is shown in Fig. 5.48.

Fig. 5.48.

The first thing set in this window is the method for forming clusters (that is, for combining observations). Of all the statistical methods offered by SPSS, you should choose either the default Between-groups linkage method or Ward's procedure (Ward's method). The first method is used more often because of its versatility and the relative simplicity of the underlying statistical procedure: the distance between clusters is calculated as the average of the distances between all possible pairs of observations, one taken from each cluster. The Ward method is harder to grasp and used less frequently; at each step it merges the pair of clusters that yields the smallest increase in the total within-cluster sum of squared deviations from the cluster means. For practical marketing research problems, we recommend always using the default Between-groups linkage method.
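Both linkage rules exist outside SPSS as well; a brief sketch with scipy's names in place of SPSS's, on synthetic data:

```python
# Average (between-groups) linkage versus Ward's method in scipy.
# Note: scipy's "ward" requires plain Euclidean distances.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=5, centers=3, random_state=0)
labels_avg = fcluster(linkage(X, method="average"), t=3, criterion="maxclust")
labels_ward = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
```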

After selecting the clustering procedure, you must select the method for calculating distances between observations (the Measure area in the Method dialog box). Different distance methods exist for the three types of variables involved in cluster analysis (segmentation criteria): interval (Interval), nominal (Counts), and dichotomous (Binary). A dichotomous (Binary) scale covers only variables reflecting the occurrence or non-occurrence of an event (bought/did not buy, yes/no, etc.). Other dichotomous variables (e.g., male/female) should be treated and analyzed as nominal (Counts).

The most commonly used distance measure for interval variables is the Squared Euclidean Distance, which is the default; it has proven itself in marketing research as the most accurate and universal. However, it is not suitable for dichotomous variables, where observations take only two values (for example, 0 and 1): it accounts only for interactions of the type X = 1, Y = 0 and X = 0, Y = 1 (where X and Y are variables) and ignores other types. The most comprehensive distance measure for dichotomous variables, accounting for all important types of interaction, is the Lambda method, which we recommend for its versatility. There are also other methods, such as Shape, Hamann, or Anderberg's D.

When specifying the distance method for dichotomous variables, you must indicate in the corresponding fields the specific values the variables can take: in the Present field, the code for Yes, and in the Absent field, the code for No. The field names Present and Absent reflect the fact that the Binary group of methods assumes dichotomous variables describing the occurrence or non-occurrence of an event. For the Interval and Binary variable types there are several distance methods; for nominal variables, SPSS offers only two: the Chi-square measure and the Phi-square measure. We recommend the first as the more common.
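A small sketch of the distance computations themselves. Squared Euclidean distance is available directly in scipy; SPSS's Lambda measure for binary data has no scipy equivalent, so the Jaccard distance stands in below purely as an example of a binary measure.

```python
# Distance matrices for interval and dichotomous data.
import numpy as np
from scipy.spatial.distance import pdist, squareform

ratings = np.array([[5, 4, 4], [1, 2, 1], [5, 5, 4]], dtype=float)
d_interval = squareform(pdist(ratings, metric="sqeuclidean"))  # SPSS default for interval data

bought = np.array([[1, 0, 1], [1, 1, 0]], dtype=bool)          # occurrence/non-occurrence
d_binary = squareform(pdist(bought, metric="jaccard"))         # stand-in for SPSS's Lambda
print(d_interval, d_binary, sep="\n")
```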

The Method dialog box also has a Transform Values area containing a Standardize field. It is used when cluster analysis involves variables with different scale types (for example, interval and nominal). To use such variables together, they should be standardized to a single interval-type scale. The most common standardization is Z-standardization (Z scores): all variables are rescaled to mean 0 and standard deviation 1, so that after the transformation they are interval and almost all values fall between -3 and +3.
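Z-standardization itself is a one-liner; a sketch with two toy variables on very different scales:

```python
# Column-wise Z-scores: each variable gets mean 0 and standard deviation 1.
import numpy as np
from scipy.stats import zscore

X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0]])   # toy data, different scales
print(zscore(X, axis=0))
```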

Since the best methods (for clustering and for determining distances) are set by default, it makes sense to use the Method dialog box only to indicate the type of variables being analyzed and, where needed, to request Z-standardization.

So, we have described all the main features SPSS provides for cluster analysis. Let us return to the cluster analysis carried out to segment the airlines. Recall that we settled on a sixteen-cluster solution and created a new variable clu16_1 in the source data file, which assigns each respondent to a cluster.

To check how well we determined the optimal number of clusters, let us build a frequency table of the clu16_1 variable (menu Analyze > Descriptive Statistics > Frequencies). As can be seen in Fig. 5.49, clusters 5-16 contain from 1 to 7 respondents each. Along with the universal method described above (the difference between the total number of respondents and the step of the first jump in the agglomeration coefficient), there is an additional recommendation: cluster sizes should be statistically meaningful and practically acceptable. With our sample size, this critical value can be set at no less than 10. Only clusters 1-4 meet this condition, so the cluster analysis procedure must now be recalculated with a four-cluster solution (creating a new variable clu4_1).


Fig. 5.49.

Having built the frequency table for the newly created variable clu4_1, we see that only two clusters (1 and 2) have a practically significant number of respondents. We need to rebuild the cluster model once more, now for a two-cluster solution, and then build the distribution of the variable clu2_1 (Fig. 5.50). As the table shows, the two-cluster solution has a statistically and practically significant number of respondents in each cluster: 695 respondents in cluster 1 and 40 in cluster 2. So, we have determined the optimal number of clusters for our task and segmented the respondents on the seven selected criteria. We can now consider the main goal achieved and proceed to the final stage of cluster analysis: interpretation of the resulting target groups (segments).


Fig. 5.50.
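The size check performed in this section can be scripted as well. The sketch below cuts a hierarchical tree at 16, 4, and 2 clusters in turn and accepts the first solution where every cluster holds at least 10 cases; the data are synthetic, so the accepted k will not necessarily be 2 as in the survey.

```python
# A sketch of iteratively shrinking the cluster count until all clusters
# reach a practically meaningful size (here, at least 10 cases).
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=745, n_features=7, centers=2, cluster_std=3.0, random_state=0)
Z = linkage(X, method="average", metric="sqeuclidean")

for k in (16, 4, 2):
    sizes = pd.Series(fcluster(Z, t=k, criterion="maxclust")).value_counts()
    if (sizes >= 10).all():
        print(f"accepted {k}-cluster solution:")
        print(sizes.sort_index())
        break
```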

The resulting solution differs somewhat from those you may have seen in SPSS textbooks. Even the most practically oriented textbooks give artificial examples in which clustering produces ideal target groups of respondents; in some cases (5) the authors even state the artificial origin of the examples directly. In this manual, we use a real example from practical marketing research, with far-from-ideal proportions, to illustrate cluster analysis. This lets us show the most common difficulties in conducting cluster analysis, as well as the best ways of dealing with them.

Before we begin to interpret the resulting clusters, let's summarize. We have the following scheme for determining the optimal number of clusters.

¦ At stage 1, we determine the number of clusters using the mathematical method based on the agglomeration coefficient.

¦ At stage 2, we cluster the respondents into the resulting number of clusters and then build a frequency table for the newly formed variable (clu16_1). Here we determine how many clusters consist of a statistically significant number of respondents; as a rule, the minimum significant cluster size should be set at no fewer than 10 respondents.

¦ If all clusters satisfy this criterion, we move on to the final stage of cluster analysis: interpretation of clusters. If there are clusters with an insignificant number of observations that make them up, we establish how many clusters consist of a significant number of respondents.

¦ We recalculate the cluster analysis procedure by specifying in the Save dialog box the number of clusters consisting of a significant number of observations.

¦ We construct a linear distribution for a new variable.

This sequence of actions is repeated until a solution is found in which all clusters consist of a statistically significant number of respondents. After this, you can move on to the final stage of cluster analysis - interpretation of clusters.

It should be especially noted that the criterion of practical and statistical significance of the number of clusters is not the only criterion by which the optimal number of clusters can be determined. The researcher can independently, based on his experience, propose the number of clusters (the significance condition must be satisfied). Another option is a fairly common situation when, for research purposes, a condition is set in advance to segment respondents according to a given number of target groups. In this case, you simply need to conduct a hierarchical cluster analysis once, maintaining the required number of clusters, and then try to interpret what you get.

In order to describe the resulting target segments, one should use the procedure for comparing the average values ​​of the studied variables (cluster centroids). We compare the average values ​​of the seven segmentation criteria considered in each of the two resulting clusters.

The mean-comparison procedure is called up via the menu Analyze > Compare Means > Means. In the dialog box that opens (Fig. 5.51), select from the left-hand list the seven variables chosen as segmentation criteria (q13-q19) and move them to the Dependent List field. Then move the variable clu2_1, which reflects the division of respondents into clusters in the final (two-cluster) solution, from the left-hand list to the Independent List field. After that, click the Options button.

Fig. 5.51.

The Options dialog box will open; in it, select the statistics needed for comparing the clusters (Fig. 5.52). In the Cell Statistics field, keep only the Mean and remove the other default statistics. Close the Options dialog box with the Continue button and, from the main Means dialog box, run the procedure (OK button).

Fig. 5.52.

The results of the mean-comparison procedure will appear in the SPSS Viewer window. We are interested in the Report table (Fig. 5.53), which shows on what basis SPSS divided the respondents into two clusters. In our case, the criterion is the level of ratings on the analyzed parameters. Cluster 1 consists of respondents whose average scores on all segmentation criteria are relatively high (4.40 points and above); cluster 2 consists of respondents who rated the criteria quite low (3.35 points and below). Thus, 93.3% of respondents, forming cluster 1, rated the analyzed airlines as generally good on all criteria; 5.4% rated them quite low; and 1.3% found it difficult to answer (see Fig. 5.50). From Fig. 5.53 one can also conclude what level of rating counts as high and what counts as low for each parameter separately (and this conclusion is made by the respondents themselves, which makes high classification accuracy possible). For example, the Report table shows that for the queue management variable an average score of 4.40 is considered high, while for appearance it is 4.72.


Fig. 5.53.
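In pandas, the same centroid comparison is a one-line groupby. The frame below is synthetic; only the column naming (q13-q19 plus the saved cluster variable clu2_1) follows the example above.

```python
# Average each segmentation criterion within each cluster (synthetic data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.integers(1, 6, size=(745, 7)),
                  columns=[f"q{i}" for i in range(13, 20)])
df["clu2_1"] = rng.integers(1, 3, size=745)    # hypothetical cluster membership

print(df.groupby("clu2_1").mean().round(2))    # cluster centroids
```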

It may turn out that in a similar case, a high score for parameter X is 4.5, but only 3.9 for parameter Y. This will not be a clustering error, but, on the contrary, will allow us to draw an important conclusion regarding the importance of the parameters under consideration for respondents. Thus, for parameter Y, already 3.9 points is a good score, while respondents have more stringent requirements for parameter X.

We have identified two significant clusters differing in the level of average scores on the segmentation criteria. Now we can label the resulting clusters: cluster 1 -- airlines that meet respondents' requirements (on the seven analyzed criteria); cluster 2 -- airlines that do not. We can now see which specific airlines (coded in variable q4) meet the respondents' requirements and which do not. To do this, build a cross-tabulation of variable q4 (the analyzed airlines) against the clustering variable clu2_1. The results of this analysis are presented in Fig. 5.54.

From this table, the following conclusions can be drawn regarding the membership of the studied airlines in the selected target segments.


Fig. 5.54.

1. Airlines that fully meet the requirements of all clients in terms of the work of ground staff (included only in the first cluster):

¦ Vnukovo Airlines;

¦ American Airlines;

¦ Delta Airlines;

¦ Austrian Airlines;

¦ British Airways;

¦ Korean Airlines;

¦ Japan Airlines.

2. Airlines that meet the requirements of the majority of their customers regarding the work of ground staff (the majority of respondents flying with these airlines are satisfied with the work of ground staff):

¦ Transaero.

3. Airlines that do not meet the requirements of the majority of their customers regarding the work of ground staff (the majority of respondents flying with these airlines are not satisfied with the work of ground staff):

¦ Domodedovo Airlines;

¦ Pulkovo;

¦ Siberia;

¦ Ural Airlines;

¦ Samara Airlines;

Thus, three target segments of airlines were obtained according to the level of average ratings, characterized by varying degrees of satisfaction of respondents with the work of ground staff:

  • 1. the most attractive airlines for passengers in terms of the level of work of ground staff (14);
  • 2. rather attractive airlines (1);
  • 3. rather unattractive airlines (7).

We have successfully completed all stages of cluster analysis and segmented airlines according to seven selected criteria.

Now we describe the cluster analysis technique paired with factor analysis, using the problem statement from section 5.2.1 (factor analysis). As already mentioned, in segmentation problems with a large number of variables it is advisable to precede cluster analysis with factor analysis, reducing the segmentation criteria to the most significant ones. In our case, the data file contains 24 variables; factor analysis reduced them to 5 factors, which can now be used effectively as segmentation criteria for cluster analysis.

If we face the task of segmenting respondents by their assessment of various aspects of airline X's current competitive position, we can conduct a hierarchical cluster analysis on the five selected criteria (variables nfac1_1-nfac5_1). In our case, the variables were measured on scales pointing in different directions. For example, a rating of 1 for the statement "I would not like the airline to change" and the same rating for the statement "Changes in the airline would be a positive thing" are diametrically opposed in meaning: in the first case, a score of 1 (strongly disagree) means the respondent welcomes changes in the airline; in the second, it means the respondent rejects them. When interpreting clusters we would inevitably run into difficulties, since variables with such opposite meanings can fall into the same factor. For segmentation purposes, therefore, it is recommended first to align the directions of the scales, then recalculate the factor model, and only then run the cluster analysis on the resulting factor variables. We will not describe the factor and cluster analysis procedures in detail again (this was done above in the relevant sections). We note only that this methodology yielded three target groups of air passengers differing in their level of assessment of the selected factors (that is, groups of variables): lowest, average, and highest.

A very useful application of cluster analysis is the division of frequency tables into groups. Suppose we have a linear distribution of answers to the question: What brands of antivirus are installed in your organization? To draw conclusions from this distribution, the antivirus brands need to be divided into several groups (usually 2-3). To divide all brands into three groups (the most popular brands, moderately popular brands and unpopular brands), it is best to use cluster analysis, although, as a rule, researchers divide the elements of frequency tables by eye, based on subjective considerations. In contrast to that approach, cluster analysis makes it possible to justify the grouping scientifically. To do this, enter the values of each parameter into SPSS (it is useful to express these values as percentages) and then perform cluster analysis on this data. By saving the cluster solution for the required number of groups (in our case 3) as a new variable, we obtain a statistically valid grouping.
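
A sketch of the same idea in Python rather than SPSS (the brand names and shares are hypothetical): hierarchical clustering of a one-column percentage table into three popularity groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical shares (%) of antivirus brands from a frequency table.
brands = ["Brand A", "Brand B", "Brand C", "Brand D", "Brand E", "Brand F"]
shares = np.array([42.0, 35.0, 11.0, 6.0, 4.0, 2.0]).reshape(-1, 1)

# Hierarchical clustering of the one-column table into 3 groups.
z = linkage(shares, method="ward")
groups = fcluster(z, t=3, criterion="maxclust")

for brand, share, g in zip(brands, shares[:, 0], groups):
    print(f"{brand}: {share:.0f}% -> group {g}")
```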

We will devote the final part of this section to describing the use of cluster analysis to classify variables and comparing its results with the results of the factor analysis carried out in section 5.2.1. To do this, we will again use the condition of the problem about assessing the current position of airline X in the air transportation market. The methodology for conducting cluster analysis almost completely repeats that described above (when respondents were segmented).

So, in the original data file we have 24 variables describing respondents' attitudes towards various aspects of the current competitive position of airline X. Open the main Hierarchical Cluster Analysis dialog box and place the 24 variables (q1-q24) in the Variable(s) field (Fig. 5.55). In the Cluster area, specify that you are classifying variables (check the Variables option). You will see that the Save button is disabled: unlike factor analysis, cluster analysis of variables cannot save group scores for all respondents. Avoid displaying charts by deactivating the Plots parameter. At the first stage you do not need any other parameters, so simply click the OK button to start the cluster analysis procedure.

The Agglomeration Schedule table appears in the SPSS Viewer window; from it we determine the optimal number of clusters using the method described above (Fig. 5.56). The first jump in the agglomeration coefficient occurs at step 20 (from 18834.000 to 21980.967). Given that the total number of analyzed variables is 24, the optimal number of clusters is 24 - 20 = 4.
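
This "jump" heuristic is easy to reproduce outside SPSS. A sketch under assumed data (random numbers stand in for the real ratings): cluster the 24 variables, read the agglomeration coefficients from the linkage matrix, and subtract the step with the largest jump from the number of variables.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Random numbers stand in for the real data: 200 respondents, 24 variables.
rng = np.random.default_rng(1)
data = rng.normal(size=(200, 24))

# R-type analysis: cluster the variables, so the matrix is transposed.
z = linkage(data.T, method="average")

# Column 2 of the linkage matrix holds the agglomeration coefficients;
# the largest jump between consecutive steps marks the cut point.
heights = z[:, 2]
step = int(np.argmax(np.diff(heights))) + 1   # step where the jump occurs
n_clusters = data.shape[1] - step             # e.g. 24 - 20 = 4 in the text
print(step, n_clusters)
```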

Fig. 5.55.


Fig. 5.56.

When classifying variables, even a cluster consisting of a single variable is practically and statistically meaningful. Therefore, since we have obtained an acceptable number of clusters by a mathematical method, no further checks are required. Instead, open the main cluster analysis dialog again (all the data used at the previous step are preserved) and click the Statistics button to display the classification table. You will see a dialog box of the same name, where you need to specify the number of clusters into which the 24 variables are to be divided (Fig. 5.57). To do this, select the Single solution option and enter the required number of clusters in the corresponding field: 4. Now close the Statistics dialog box by clicking Continue and run the procedure from the main cluster analysis window.

As a result, the Cluster Membership table will appear in the SPSS Viewer window, distributing the analyzed variables into four clusters (Fig. 5.58).

Fig. 5.58.

Using this table, each variable under consideration can be classified into a specific cluster as follows.

Cluster 1

q1. Airline X has a reputation for excellent passenger service.

q2. Airline X can compete with the best airlines in the world.

q3. I believe that Airline X has a promising future in global aviation.

q5. I am proud to work for X Airline.

q9. We have a long way to go before we can claim to be a world class airline.

q10. Airline X truly cares about its passengers.

q13. I like the way Airline X is currently being presented visually to the general public (in terms of color scheme and branding).

q14. Airline X is the face of Russia.

q16. Airline X's service is consistent and recognizable throughout

q18. Airline X needs to change in order to exploit its full potential.

q19. I think airline X needs to present itself in a more modern way visually.

q20. Changes at airline X will be a positive development.

q21. Airline X is an efficient airline.

q22. I would like to see airline X's image improve from the point of view of foreign passengers.

q23. Airline X is better than many people think it is.

q24. It is important that people all over the world know that we are a Russian airline.

Cluster 2

q4. I know what the development strategy of airline X will be in the future.

q6. Airline X has good communication between departments.

q7. Every employee at the airline works hard to ensure its success.

q8. Now airline X is improving rapidly.

q11. There is a high degree of job satisfaction among airline employees.

q12. I believe that senior managers are working hard to make the airline successful.

Cluster 3

q15. We look like yesterday compared to other airlines.

Cluster 4

q17. I wouldn't want airline X to change.

Comparing the results of the factor (section 5.2.1) and cluster analyses, you will see that they differ significantly. Cluster analysis not only offers fewer options when clustering variables (for example, group scores cannot be saved) than factor analysis, it also produces much less clear-cut results. In our case, while clusters 2, 3 and 4 lend themselves to logical interpretation, cluster 1 contains statements that are completely different in meaning. In this situation you can either try to describe cluster 1 as it is, or rebuild the statistical model with a different number of clusters. In the latter case, to find an optimal number of clusters that can be described logically, use the Range of solutions parameter in the Statistics dialog box (see Fig. 5.57), specifying the minimum and maximum number of clusters in the corresponding fields (in our case, 4 and 6, respectively). SPSS will then rebuild the Cluster Membership table for each number of clusters. The analyst's task is to try to select a classification model in which all clusters are interpreted unambiguously. In order to demonstrate the capabilities of the cluster analysis procedure for clustering variables, we will not rebuild the cluster model, but will limit ourselves to what was said above.
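
Something similar to Range of solutions can be sketched in Python (assumed data again): the same variable-clustering tree is cut into 4, 5 and 6 clusters, and the analyst compares the resulting memberships.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
data = rng.normal(size=(200, 24))      # hypothetical ratings for q1-q24

z = linkage(data.T, method="average")  # cluster the 24 variables

# Analogue of Range of solutions: one candidate membership per k.
for k in range(4, 7):
    labels = fcluster(z, t=k, criterion="maxclust")
    print(f"{k} clusters:", labels)    # variable i goes to cluster labels[i]
```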

It should be noted that, despite the apparent simplicity of cluster analysis compared to factor analysis, in almost all cases of marketing research, factor analysis turns out to be faster and more effective than cluster analysis. Therefore, to classify (reduce) variables, we strongly recommend using factor analysis and leaving the use of cluster analysis to classify respondents.

Classification analysis is perhaps one of the most complex statistical tools from the point of view of an untrained user. This explains its very low prevalence among marketing companies. At the same time, this particular group of statistical methods is among the most useful for practitioners in the field of marketing research.

Many of us have heard the phrase “cluster analysis,” but not everyone understands what it means; besides, it sounds more than mysterious! In fact, it is simply the name of a method for dividing a data sample into categories of elements according to certain criteria. For example, cluster analysis allows you to divide people into groups with high, medium and low self-esteem. Simply put, a cluster is a group of objects that are similar in a certain way.

Cluster analysis: problems in use

Having decided to use this method in your research, remember that the clusters identified during a study may be unstable. Therefore, as with factor analysis, it is necessary to check the results on another group of objects, or after a certain period of time, to estimate the measurement error. Moreover, it is best to apply cluster analysis to large samples selected by randomization or stratification, because only then can a scientifically sound inductive conclusion be drawn. The method performs best at testing hypotheses rather than creating them from scratch.

Hierarchical cluster analysis

If you need to classify a set of elements quickly, you can start by treating each of them as a separate cluster. This is the essence of one of the easiest types of cluster analysis to understand. Using it, the researcher at the second stage forms pairs of elements that are similar in the required attribute and then joins them together the required number of times: clusters located at the minimum distance from each other are merged in an iterative procedure (a minimal sketch of this loop follows the list below). This is repeated until one of the following criteria is met:

  • obtaining a pre-planned number of clusters;
  • each of the clusters contains the required number of elements;
  • each group has the required ratio of heterogeneity and homogeneity within it.
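
A toy sketch of the procedure just described, on one-dimensional data and assuming single linkage as the merge rule; it stops at a pre-planned number of clusters (the first criterion above).

```python
import itertools

def agglomerate(points, n_clusters):
    """Minimal agglomerative clustering: every point starts as its own
    cluster, and the two closest clusters (single linkage) are merged
    until the pre-planned number of clusters remains."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters at minimum single-linkage distance.
        i, j = min(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda ij: min(abs(a - b)
                               for a in clusters[ij[0]]
                               for b in clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))   # merge cluster j into cluster i
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.8], n_clusters=2))
```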

In order to calculate the distance between clusters correctly, the following techniques are most often used:

  • single and complete linkage;
  • King's average linkage;
  • the centroid method;
  • the group average method.

To evaluate the clustering results, the following criteria are used:

  • the clarity index;
  • the partition coefficient;
  • ordinary, normalized and modified entropy;
  • the second and third Rubens functionals.

Cluster analysis methods

Most often, when analyzing a sample of objects, the minimum distance method is used: elements whose similarity coefficient exceeds a threshold value are combined into a cluster. With the local distance method, two clusters are distinguished: the distance between the points of the first is maximal, and between the points of the second, minimal. The centroid method of clustering involves calculating the distances between the mean values of the indicators in the groups. And Ward's method is the most rational choice for grouping clusters that are close in terms of the parameter under study.
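
These techniques correspond directly to the method parameter of SciPy's hierarchical clustering; a quick sketch on hypothetical 2-D points shows how the choice of rule is made in code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two hypothetical point clouds in the plane.
rng = np.random.default_rng(3)
points = np.vstack([rng.normal(0, 0.5, (20, 2)),
                    rng.normal(5, 0.5, (20, 2))])

# Single and complete linkage, group average, centroid and Ward's method.
for method in ("single", "complete", "average", "centroid", "ward"):
    z = linkage(points, method=method)
    labels = fcluster(z, t=2, criterion="maxclust")
    print(method, np.bincount(labels)[1:])    # cluster sizes per method
```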

Cluster analysis

Most researchers are inclined to believe that the term “cluster analysis” (from the English cluster: bunch, clot, swarm) was first proposed by the mathematician R. Tryon. Subsequently, a number of terms arose that are currently considered synonymous with cluster analysis: automatic classification; botryology.

Cluster analysis is a multivariate statistical procedure that takes data containing information about a sample of objects and arranges the objects into relatively homogeneous groups (clusters) (Q-clustering, or Q-technique, cluster analysis proper). A cluster is a group of elements characterized by a common property; the main goal of cluster analysis is to find groups of similar objects in a sample. The range of applications of cluster analysis is very wide: it is used in archaeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, marketing, sociology and other disciplines. However, this universality has led to a large number of incompatible terms, methods and approaches, which makes it difficult to use cluster analysis unambiguously and to interpret it consistently. A. I. Orlov, among others, has proposed ways of drawing these distinctions.

Objectives and conditions

Cluster analysis pursues the following main goals:

  • Development of a typology or classification.
  • An exploration of useful conceptual schemes for grouping objects.
  • Generating hypotheses based on data exploration.
  • Hypothesis testing or research to determine whether the types (groups) identified in one way or another are actually present in the available data.

Regardless of the subject of study, the use of cluster analysis involves the following steps:

  • Selecting a sample for clustering. The implication is that it makes sense to cluster only quantitative data.
  • Determining the set of variables by which objects in the sample will be assessed, that is, the feature space.
  • Calculation of the values ​​of a particular measure of similarity (or difference) between objects.
  • Using the cluster analysis method to create groups of similar objects.
  • Checking the reliability of the cluster solution results.

Cluster analysis imposes the following requirements on the data:

  1. indicators should not correlate with each other;
  2. indicators should not contradict measurement theory;
  3. the distribution of indicators should be close to normal;
  4. indicators must meet the requirement of “stability”, meaning that their values are not influenced by random factors;
  5. the sample must be homogeneous and not contain “outliers”.

You can find a description of two fundamental requirements for data - homogeneity and completeness:

Homogeneity requires that all entities represented in the table be of the same nature. The completeness requirement means that the sets I and J provide a complete inventory of the manifestations of the phenomenon under consideration. If we consider a table in which I is a set of objects and J is a set of variables describing them, then I must be a representative sample from the population being studied, and the system of characteristics J must give a satisfactory vector representation of each individual i from the researcher's point of view.

If cluster analysis is preceded by factor analysis, the sample does not need to be “repaired”: the stated requirements are fulfilled automatically by the factor modeling procedure itself (there is another advantage - z-standardization without negative consequences for the sample; carrying it out directly before cluster analysis can entail a decrease in the clarity of the separation of groups). Otherwise, the sample needs to be adjusted.

Typology of clustering problems

Input types

In modern science, several algorithms for processing input data are used. Analysis that compares objects on the basis of their characteristics (most common in the biological sciences) is called Q-type analysis; in the opposite case, when characteristics are compared on the basis of objects, it is called R-type analysis. There have been attempts to use hybrid types of analysis (for example, RQ-analysis), but this methodology has not yet been properly developed.

Goals of Clustering

  • Understanding data by identifying cluster structure. Dividing the sample into groups of similar objects makes it possible to simplify further data processing and decision-making by applying a different method of analysis to each cluster (the “divide and conquer” strategy).
  • Data compression. If the original sample is excessively large, it can be reduced by leaving one most typical representative from each cluster (a sketch follows below).
  • Detection of novelty (novelty detection). Atypical objects are identified that cannot be attached to any of the clusters.

In the first case, they try to make the number of clusters smaller. In the second case, it is more important to ensure a high degree of similarity of objects within each cluster, and there can be any number of clusters. In the third case, the most interesting are individual objects that do not fit into any of the clusters.
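
The data-compression goal from the list above can be sketched as follows (assumed data; the "most typical representative" is taken to be the object nearest to each cluster centre).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(4)
sample = rng.normal(size=(1000, 3))       # hypothetical oversized sample

# Compress the sample to 10 "most typical" representatives, one per
# cluster: the object that lies closest to each cluster centre.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(sample)
rep_idx = pairwise_distances_argmin(kmeans.cluster_centers_, sample)
representatives = sample[rep_idx]         # 10 objects instead of 1000
print(representatives.shape)
```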

In all these cases, hierarchical clustering can be used, when large clusters are divided into smaller ones, which in turn are divided into even smaller ones, etc. Such problems are called taxonomy problems. The taxonomy results in a tree-like hierarchical structure. In this case, each object is characterized by listing all the clusters to which it belongs, usually from large to small.

Clustering methods

There is no generally accepted classification of clustering methods, but a solid attempt by V. S. Berikov and G. S. Lbov can be noted. If the various classifications of clustering methods are generalized, a number of groups can be distinguished (some methods can be assigned to several groups at once, so this typology should be considered an approximation to a real classification of clustering methods):

  1. Probabilistic approach. It is assumed that each object under consideration belongs to one of k classes. Some authors (for example, A. I. Orlov) believe that this group does not relate to clustering at all and contrast it under the name “discrimination”, that is, the assignment of objects to one of several known groups (training samples).
  2. Approaches based on artificial intelligence systems. A very conditional group, since there are a lot of AI methods and methodologically they are very different.
  3. Logical approach. The dendrogram is constructed using a decision tree.
  4. Graph Theoretic Approach.
    • Graph Clustering Algorithms
  5. Hierarchical approach. The presence of nested groups (clusters of different orders) is assumed. Algorithms, in turn, are divided into agglomerative (merging) and divisive (splitting). Based on the number of characteristics, monothetic and polythetic classification methods are sometimes distinguished.
    • Hierarchical divisive clustering, or taxonomy. Clustering problems of this kind are addressed in quantitative taxonomy.
  6. Other methods. Not included in previous groups.
    • Statistical clustering algorithms
    • Ensemble of clusterizers
    • KRAB family algorithms
    • Algorithm based on sifting method
    • DBSCAN et al.

Approaches 4 and 5 are sometimes combined under the name of the structural or geometric approach, which has a more formalized concept of proximity. Despite the significant differences between the listed methods, they all rely on the original “compactness hypothesis”: in the object space, all close objects must belong to the same cluster, and all distinct objects, accordingly, must be in different clusters.

Formal formulation of the clustering problem

Let X be a set of objects and Y a set of cluster numbers (names, labels). A distance function ρ(x, x′) between objects is given, and there is a finite training sample X^m = {x_1, …, x_m} ⊂ X. It is required to split the sample into disjoint subsets, called clusters, so that each cluster consists of objects that are close in the metric ρ, while objects of different clusters differ significantly. Each object x_i ∈ X^m is assigned a cluster number y_i.

A clustering algorithm is a function a: X → Y that assigns a cluster number y ∈ Y to any object x ∈ X. In some cases the set Y is known in advance, but more often the task is to determine the optimal number of clusters from the point of view of one or another clustering quality criterion.

Clustering (unsupervised learning) differs from classification (supervised learning) in that the labels y_i of the original objects are not specified initially, and the set Y itself may even be unknown.
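
In code, this formal statement amounts to constructing a function a: X → Y. A minimal sketch, with k-means standing in for the clustering algorithm and random points for the sample (both are assumptions of the example):

```python
import numpy as np
from sklearn.cluster import KMeans

# A finite sample X^m of objects; here, random points in the plane.
sample = np.random.default_rng(5).normal(size=(100, 2))

# k-means plays the role of the clustering algorithm; Y = {0, 1, 2}.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample)

def a(x):
    """The clustering algorithm as a function a: X -> Y that assigns
    a cluster number y to an arbitrary object x."""
    return int(model.predict(np.asarray(x).reshape(1, -1))[0])

print(a([0.1, -0.3]))
```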

The solution to the clustering problem is fundamentally ambiguous, and there are several reasons for this (as a number of authors believe):

  • There is no clear best criterion for clustering quality. A number of heuristic criteria are known, as well as a number of algorithms that do not have a clearly defined criterion, but carry out fairly reasonable clustering “by construction”. All of them can give different results. Therefore, to determine the quality of clustering, a domain expert is required who can assess the meaningfulness of cluster selection.
  • the number of clusters is usually unknown in advance and is set in accordance with some subjective criterion. This is true only for discrimination methods, since in clustering methods, clusters are identified through a formalized approach based on proximity measures.
  • the result of clustering significantly depends on the metric, the choice of which, as a rule, is also subjective and determined by an expert. But it is worth noting that there are a number of recommendations for choosing proximity measures for various tasks.

Application

In biology

In biology, clustering has many applications in a wide variety of fields. For example, in bioinformatics it is used to analyze complex networks of interacting genes, sometimes consisting of hundreds or even thousands of elements. Cluster analysis makes it possible to identify subnetworks, bottlenecks, hubs and other hidden properties of the system being studied, which ultimately makes it possible to find out the contribution of each gene to the formation of the phenomenon under study.

In the field of ecology, it is widely used to identify spatially homogeneous groups of organisms, communities, etc. Less commonly, cluster analysis methods are used to study communities over time. The heterogeneity of the community structure leads to the emergence of non-trivial methods of cluster analysis (for example, the Chekanovsky method).

In general, it is worth noting that historically, measures of similarity rather than measures of difference (distance) are often used as measures of proximity in biology.

In sociology

When analyzing the results of sociological research, it is recommended to use methods of the hierarchical agglomerative family, namely Ward's method, which minimizes the dispersion within clusters and ultimately creates clusters of approximately equal size. Ward's method is the best suited to analyzing sociological data. The squared Euclidean distance is a better measure of difference, as it helps increase the contrast between clusters. The main result of hierarchical cluster analysis is a dendrogram, or “icicle diagram”. When interpreting it, researchers face the same kind of problem as with interpreting the results of factor analysis: the lack of unambiguous criteria for identifying clusters. Two main methods are recommended: visual analysis of the dendrogram and comparison of clustering results obtained by different methods.

Visual analysis of the dendrogram involves “trimming” the tree at the optimal level of similarity of the sample elements. It is advisable to “cut the grape branch” (in the terminology of M. S. Aldenderfer and R. K. Blashfield) at level 5 of the Rescaled Distance Cluster Combine scale; an 80% level of similarity is then achieved. If identifying clusters at this level is difficult (several small clusters merge into one large one), another level can be chosen. This technique is proposed by Aldenderfer and Blashfield.
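
A sketch of this "trimming" in Python (hypothetical survey data; SciPy reports raw merge distances, so they are rescaled here to the 0-25 axis used by SPSS before cutting at level 5):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
answers = rng.normal(size=(150, 6))     # hypothetical survey responses

# Ward's method, as recommended above (SciPy's ward uses Euclidean
# distances and minimizes within-cluster variance).
z = linkage(answers, method="ward")

# Mimic the 0-25 Rescaled Distance axis and "trim" the tree at level 5.
heights = z[:, 2]
rescaled = 25 * heights / heights.max()
cut = heights[rescaled <= 5].max()      # raw height matching level 5
labels = fcluster(z, t=cut, criterion="distance")
print(np.unique(labels).size, "clusters")
```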

Now the question arises of the stability of the adopted cluster solution. In essence, checking the stability of a clustering comes down to checking its reliability. There is a rule of thumb here: a stable typology is preserved when the clustering method is changed. The results of hierarchical cluster analysis can be verified by iterative cluster analysis using the k-means method. If the compared classifications of the groups of respondents coincide in more than 70% of cases (more than 2/3 of matches), the cluster solution is accepted.
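
A sketch of this 70% rule of thumb (synthetic data with three obvious groups): hierarchical Ward labels are compared with k-means labels; since cluster numbers are arbitrary, the two labelings are aligned optimally before counting matches.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

# Synthetic data with three well-separated groups.
rng = np.random.default_rng(7)
data = np.vstack([rng.normal(i * 4, 1.0, (50, 2)) for i in range(3)])

hier = fcluster(linkage(data, method="ward"), t=3, criterion="maxclust") - 1
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

# Cluster numbers are arbitrary: align the labelings, then count matches.
cm = confusion_matrix(hier, km)
rows, cols = linear_sum_assignment(-cm)        # maximize the agreement
agreement = cm[rows, cols].sum() / len(data)
print(f"agreement: {agreement:.0%}")           # accept the solution if > 70%
```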

It is impossible to check the adequacy of a solution without resorting to another type of analysis; at least in theoretical terms, this problem has not been solved. The classic book Cluster Analysis by Aldenderfer and Blashfield discusses in detail, and ultimately rejects, five additional methods for testing robustness.

In computer science

  • Clustering of search results - used for “intelligent” grouping of results when searching files, websites and other objects, giving the user the ability to navigate quickly, select an obviously more relevant subset and exclude an obviously less relevant one, which can improve the usability of the interface compared with output as a simple list sorted by relevance.
    • Clusty is a clustering search engine from Vivísimo
    • Nigma - Russian search engine with automatic clustering of results
    • Quintura - visual clustering in the form of a keyword cloud
  • Image segmentation - clustering can be used to split a digital image into separate regions for the purposes of edge detection or object recognition.
  • Data mining - clustering becomes valuable in data mining when it serves as one of the stages of analysis and of building a complete analytical solution. It is often easier for an analyst to identify groups of similar objects, study their features and build a separate model for each group than to create one general model for all the data. This technique is constantly used in marketing: identifying groups of clients, buyers or products and developing a separate strategy for each of them (a sketch follows this list).
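
A sketch of that "separate model per group" pattern (all data and names here are hypothetical): customers are segmented by k-means, then an independent regression is fitted inside each segment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
features = rng.normal(size=(300, 4))   # hypothetical customer attributes
spend = rng.normal(size=300)           # hypothetical target variable

# Step 1: identify groups of similar customers.
segment = KMeans(n_clusters=3, n_init=10,
                 random_state=0).fit_predict(features)

# Step 2: build a separate model for each group instead of one global one.
models = {}
for s in np.unique(segment):
    mask = segment == s
    models[s] = LinearRegression().fit(features[mask], spend[mask])
print({s: np.round(m.coef_, 2).tolist() for s, m in models.items()})
```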
