Network Structure Detection and Analysis of Shanghai Stock Market

Purpose: To investigate the community structure of the component stocks of the SSE (Shanghai Stock Exchange) 180-index, a stock correlation network is built to examine intra-community and inter-community relationships. Design/methodology/approach: The stock correlation network takes stocks as vertices and the correlation coefficients of the logarithm returns of stock prices as edges. It is first built as an undirected weighted network. The GN algorithm is then used to detect community structure after the network is converted into an unweighted one with different thresholds. Findings: The community structure analysis shows that the stock market has obvious industrial characteristics. Most of the stocks in the same industry or in the same supply chain are assigned to the same community. Stock price fluctuations are more closely correlated within a community than between communities. The result of community structure detection also reflects correlations among different industries. Originality/value: Based on the analysis of the community structure of the Shanghai stock market, the result reflects industrial characteristics that have reference value for understanding the relationships among industries or sub-sectors of listed companies.


Introduction
Complex networks are abstractions of complex systems, such as computer networks, biological networks, social networks and transportation networks. The complexity of a network system is reflected in the complexity of its structure, the complexity of its vertices and the interactions among these complex factors.
In recent years, research on complex networks has gradually expanded from mathematics, physics and biology to sociology and economics. Rubinov and Sporns (2010) discussed the construction of brain networks from connectivity data and described the most commonly used network measures of structural and functional connectivity. Morales, Borondo, Losada and Benito (2014) built a complex network to understand the collective reaction to individual actions on Twitter. Gao and Gu (2011) applied social network analysis to study tie strength, small groups and coreness, using quantitative description and measurement of the knowledge network to detect problems in knowledge flow within groups. Li, He, Zhuang and Shi (2011) studied the impact of stocks on the stability of the Chinese inter-bank network from the perspective of complex networks.
In these studies, complex networks revealed some obvious statistical characteristics, including the small-world property (Milgram, 1967), the scale-free property (Pool & Kochen, 1987) and so on. A network that has a large clustering coefficient and a small average distance has the small-world property and is called a small-world network. A network with a power-law degree distribution is called a scale-free network. The degree distribution is a basic topological property of a network: P(k) is the probability that a randomly selected node has degree k. The degree distribution of many networks can be described by a power law, $P(k) \propto k^{-r}$, which is also known as the scale-free distribution.
Another important characteristic of complex networks is community structure (Albert, Jeong & Barabasi, 1999). Communities can be defined as groups of nodes such that there is a higher density of edges within groups than between them (Boccaletti, Latora, Moreno, Chavez & Hwang, 2006). Many studies indicate that real networks often have community structure: the whole network consists of many communities, the connections between different communities are relatively sparse, and the connections within the same community are relatively dense. Community detection uses the topological structure of the graph to analyze the modular community structure of a complex network. This gives insight into the modules, function and evolution of the whole network, and helps in understanding the organizing principles, topologies and dynamic characteristics of complex systems.
Detecting community structure in complex networks has attracted much research attention. Hierarchical partitioning is a common class of algorithms for community structure detection. According to whether edges are added to or removed from the network, these algorithms are classified into agglomerative algorithms and divisive algorithms (Scott, 2002).
The basic idea of agglomerative algorithms is to take every node in the network as a community and then merge the nodes with the highest similarity into one community. This step is repeated until all the nodes are in one community or another termination condition is satisfied. Common agglomerative algorithms include the Newman fast algorithm (Newman, 2004), the CNM algorithm (Clauset, Newman & Moore, 2004), the clustering algorithm combined with spectral analysis (Donetti & Munoz, 2004) and so on. In contrast, divisive algorithms put all the nodes in one community at first and then remove the edges between the node pairs with the lowest similarity. As this step is repeated, the whole network is divided into smaller and smaller communities until every node becomes an independent community or another termination condition is satisfied. The well-known GN algorithm (Girvan & Newman, 2001) is a divisive algorithm.
Since Mantegna (1999) analyzed the hierarchical structure of the correlation network of the Standard and Poor's 500 index stocks, scholars around the world have studied the topologies of stock market networks, and their conclusions have important reference value for revealing trends in financial markets. Kim et al. (2002) established the weighted correlation network of the Standard and Poor's 500 index stocks to study its scale-free characteristics. Boginski, Butenko and Pardalos (2005) studied the American stock market and found that the stock price correlation network is consistent with the scale-free property. By constructing an American stock network, Tse, Liu and Lau (2010) found that the variation of stock prices was strongly influenced by a relatively small number of stocks. Caraiani (2012) investigated the properties of the returns of the main emerging European stock markets by means of complex networks. Besides the studies abroad, the application of complex networks in China has also developed quickly in recent years. Zhuang, Min and Chen (2007) analyzed the topology of the Shanghai stock market and found that the stock market network had the typical statistical characteristics of complex networks. Huang, Zhuang and Yao (2009) used a threshold method to construct China's stock correlation network and then studied the network's structural properties and topological stability. Han and Wang (2010) used an improved CNM algorithm and Li and Chen (2013) used a multi-gene method; both found obvious community structure in the stock market. The results of these studies are beneficial for economic forecasting and financial supervision.
In this paper, closely related communities are found in the Shanghai stock market through community detection on the correlation network. The price fluctuations of stocks in the same community are highly correlated, which illustrates that these stocks are influenced by the same or similar economic factors. The analysis of the industrial characteristics of stocks has reference value for understanding the relationships among industries or sub-sectors of listed companies.

Construction of the network
The initial network is built as a weighted undirected network. The vertices represent stocks and the edges represent the correlation coefficients of the logarithm returns of stock prices (Wu, Tuo & Xiong, 2014).
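Assume that there are N stocks in the network and that the observation period is $[t_0, t_0 + T]$. At any time point t, the logarithm return of stock i is

$$ r_i(t) = \ln P_i(t) - \ln P_i(t - \Delta t) \qquad (1) $$

where $P_i(t)$ is the price of stock i at time point t and $\Delta t$ is the time difference used to calculate the logarithm returns. We choose the closing price on Friday of every week as the source data, hence the time difference is one week. Then we calculate the correlation coefficient $r_{ij}$ between any two stocks i and j from their logarithm return series, namely

$$ r_{ij} = \frac{E[r_i r_j] - E[r_i]\,E[r_j]}{\sqrt{\left(E[r_i^2] - E[r_i]^2\right)\left(E[r_j^2] - E[r_j]^2\right)}} \qquad (2) $$

where $r_i$ is the logarithm return series of stock i, $r_j$ is the logarithm return series of stock j, and $E[\cdot]$ is the mathematical expectation over the observation period. The correlation coefficients form the matrix

$$ C = (c_{ij})_{N \times N}, \qquad c_{ij} = r_{ij} \qquad (3) $$

In formula (3), $c_{ij}$ represents the element in the i-th row and j-th column of matrix C.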
As formula (3) shows, $r_{ij} = r_{ji}$. Therefore, matrix C is symmetric, and we only need to study its lower triangular part.
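The construction described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the tickers and prices are made up, the helper names (`log_returns`, `correlation`, `correlation_matrix`) are hypothetical, and the expectation is taken as the sample mean over the observation window.

```python
import math

def log_returns(prices):
    """Logarithm returns r(t) = ln P(t) - ln P(t - dt) for one price series."""
    return [math.log(b) - math.log(a) for a, b in zip(prices, prices[1:])]

def correlation(x, y):
    """Correlation coefficient (E[xy] - E[x]E[y]) / sqrt(Var[x] * Var[y])."""
    n = len(x)
    ex, ey = sum(x) / n, sum(y) / n
    exy = sum(a * b for a, b in zip(x, y)) / n
    var_x = sum(a * a for a in x) / n - ex * ex
    var_y = sum(b * b for b in y) / n - ey * ey
    return (exy - ex * ey) / math.sqrt(var_x * var_y)

def correlation_matrix(price_table):
    """Lower-triangular matrix C (nested lists) for a dict {stock: prices}."""
    names = sorted(price_table)
    returns = {s: log_returns(price_table[s]) for s in names}
    lower = [[correlation(returns[a], returns[b]) for b in names[: i + 1]]
             for i, a in enumerate(names)]
    return names, lower

# Hypothetical weekly (Friday) closing prices for three made-up tickers.
prices = {
    "AAA": [10.0, 10.5, 10.2, 10.8, 11.0],
    "BBB": [20.0, 21.0, 20.4, 21.6, 22.0],  # always twice the AAA price
    "CCC": [5.0, 4.9, 5.1, 4.8, 5.0],
}
names, C = correlation_matrix(prices)
for name, row in zip(names, C):
    print(name, [round(c, 3) for c in row])
```

Because only the lower triangle is stored, row i holds i + 1 entries and the last entry of each row is the self-correlation of that stock.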

GN Algorithm
The GN algorithm is a community detection algorithm based on the divisive idea. It repeatedly deletes the edge with the largest betweenness. The steps are:
• Calculate the betweenness of all the edges.
• Find the edge with the largest betweenness and delete it from the network.
• Recalculate the betweenness of the remaining edges and repeat step 2 until every node in the network becomes an independent community.
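The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this study: the function names are hypothetical, edge betweenness is computed with a Brandes-style breadth-first search, and the toy graph (two triangles joined by one bridge) is made up.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge in an unweighted,
    undirected graph given as {node: set(neighbours)}."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:                       # BFS that counts shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):          # accumulate path dependencies
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return {e: b / 2.0 for e, b in score.items()}  # each pair was seen twice

def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def girvan_newman(adj):
    """Yield the communities each time an edge removal splits the network."""
    adj = {v: set(nb) for v, nb in adj.items()}    # work on a copy
    n_parts = len(components(adj))
    while any(adj.values()):
        bet = edge_betweenness(adj)                # recalculated every round
        u, v = max(bet, key=bet.get)               # edge with largest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > n_parts:
            n_parts = len(parts)
            yield parts

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
first_split = next(girvan_newman(adj))
print(first_split)
```

In this toy graph the bridge carries every shortest path between the two triangles, so it is removed first and the first split separates the two triangles.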
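Since it is difficult to judge when to stop the GN algorithm if the number of communities is unknown, Newman and Girvan (2004) proposed the concept of modularity Q to solve this problem. For a division of the network into k communities, let e be the k × k symmetric matrix whose element $e_{ij}$ is the fraction of all edges in the network that link vertices in community i to vertices in community j. Then

$$ Q = \sum_i \left(e_{ii} - a_i^2\right) = \mathrm{Tr}\, e - \lVert e^2 \rVert \qquad (4) $$

In formula (4), $a_i = \sum_j e_{ij}$ represents the sum of row (or column) i of e, $\mathrm{Tr}\, e$ represents the sum of the elements on the diagonal, and $\lVert x \rVert$ represents the sum of all the elements in matrix x.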
Typically, Q is calculated for each split of a network into communities as the algorithm moves down the dendrogram. Local peaks in the value of Q indicate particularly satisfactory splits, and the height of a peak is a measure of the strength of the community division.
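As a small worked example of the modularity measure, the sketch below builds the matrix e for a given partition and evaluates Q as the sum of $e_{ii} - a_i^2$. The function name `modularity` and the toy graph are assumptions made for illustration only.

```python
def modularity(adj, communities):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    adj: {node: set(neighbours)}; communities: list of disjoint node sets
    that together cover every node.
    """
    label = {v: i for i, part in enumerate(communities) for v in part}
    k = len(communities)
    e = [[0.0] * k for _ in range(k)]
    total_ends = sum(len(nb) for nb in adj.values())   # 2 * number of edges
    for u, nbrs in adj.items():
        for v in nbrs:                 # every edge is visited from both ends
            e[label[u]][label[v]] += 1.0 / total_ends
    a = [sum(row) for row in e]        # a_i is the i-th row sum of e
    return sum(e[i][i] - a[i] ** 2 for i in range(k))

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the bridge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
q_two = modularity(adj, [{0, 1, 2}, {3, 4, 5}])
q_one = modularity(adj, [{0, 1, 2, 3, 4, 5}])
print(round(q_two, 4), round(q_one, 4))
```

For the two-triangle split, $e_{11} = e_{22} = 6/14$ and $a_1 = a_2 = 1/2$, so Q = 2(6/14 - 1/4) = 5/14, while putting all nodes in one community gives Q = 0. The split with the higher Q is the more satisfactory one.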

Community Structure Detection
The GN algorithm applies to unweighted networks, so we choose a threshold to convert the weighted network into an unweighted one. Assuming the threshold is d, the conversion is as follows:

$$ \text{edge}(i, j) = \begin{cases} 1, & |c_{ij}| \ge d \\ 0, & |c_{ij}| < d \end{cases} \qquad (5) $$

That is, when $|c_{ij}| \ge d$, vertices i and j are linked; otherwise there is no edge between them. We regard the influence between two stocks as symmetric and do not consider its direction. The key to the conversion is the choice of threshold. Because strong correlation is the only factor of interest in this study, we set the threshold according to the distribution of the correlation coefficients (Newman & Girvan, 2004; Newman, Strogatz & Watts, 2001).
When the GN algorithm is used, the threshold d determines the number of edges and therefore the structure of the stock network. To a large extent, it also determines the result of community structure detection. Different stock networks have different structures. The multi-threshold method probes the distribution of the network with several thresholds to find the most reasonable community structure.
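The thresholding rule and the multi-threshold comparison can be sketched as follows. The correlation values and stock labels are invented for illustration, and `threshold_network` and `retention` are hypothetical helper names.

```python
def threshold_network(names, lower, d):
    """Undirected edges (a, b) with |c_ij| >= d, read from a lower-triangular
    correlation matrix; the diagonal (self-correlation) is skipped."""
    edges = set()
    for i, row in enumerate(lower):
        for j, c in enumerate(row[:i]):        # strictly below the diagonal
            if abs(c) >= d:
                edges.add((names[j], names[i]))
    return edges

def retention(names, lower, thresholds):
    """Fraction of the possible edges that survives each threshold."""
    total = len(names) * (len(names) - 1) // 2
    return {d: len(threshold_network(names, lower, d)) / total
            for d in thresholds}

# Made-up lower-triangular correlation matrix for four hypothetical stocks.
names = ["A", "B", "C", "D"]
lower = [
    [1.0],
    [0.82, 1.0],
    [0.45, 0.55, 1.0],
    [-0.10, 0.38, 0.71, 1.0],
]
kept = retention(names, lower, [0.4, 0.5, 0.6, 0.7])
strong_edges = threshold_network(names, lower, 0.7)
print(kept, strong_edges)
```

Raising d can only remove edges, so the retained fraction never increases with the threshold. The choice of d is therefore a trade-off between keeping only strong correlations and keeping enough edges to see any structure.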

Data Preparation
The SSE 180-index was launched on July 1st, 2002. It selects the 180 sample stocks that are most representative of the market from the Shanghai A-share Composite. In the selection process, stocks are excluded under the following conditions:
• stocks listed for less than a quarter;
• suspended stocks;
• stocks with operational abnormalities or serious losses in the most recent financial report;
• stocks whose prices are sharply volatile or whose markets are obviously manipulated;
• other stocks identified by the experts committee.
The basic data for the empirical analysis are selected from the SSE 180-index component stocks.

The Frequency Distribution of Correlation Coefficients
There are 13,530 correlation coefficients among the 164 stocks, that is, the 164 × 165 / 2 entries of the lower triangular matrix including the diagonal. The largest one is 1 and the smallest is -0.2041. The mean value is 0.3834 and the standard deviation is 0.1771. The frequency distribution histogram and the probability density distribution curve of the correlation coefficients are shown in Figure 1. In the P-P plot produced with SPSS, the data points lie approximately on a straight line, as Figure 2 shows. This indicates that the frequency distribution of the correlation coefficients among the 164 stocks is approximately normal.
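The idea behind a P-P plot check can be sketched in Python as follows. This is not the SPSS procedure itself: the plotting position (i + 0.5)/n is one common convention and is an assumption here, the samples are synthetic rather than the coefficients of this study, and `pp_points` and `max_deviation` are hypothetical helper names.

```python
import random
from statistics import NormalDist, fmean, pstdev

def pp_points(sample):
    """Pairs (empirical cumulative probability, fitted normal probability)."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(fmean(xs), pstdev(xs))
    return [((i + 0.5) / n, fitted.cdf(x)) for i, x in enumerate(xs)]

def max_deviation(points):
    """Largest vertical distance of a P-P point from the diagonal y = x."""
    return max(abs(emp - theo) for emp, theo in points)

rng = random.Random(0)
# Synthetic data: a normal sample using the mean and standard deviation
# reported above, and a strongly skewed sample for contrast.
normal_like = [rng.gauss(0.3834, 0.1771) for _ in range(2000)]
skewed = [rng.expovariate(1.0) for _ in range(2000)]

dev_normal = max_deviation(pp_points(normal_like))
dev_skewed = max_deviation(pp_points(skewed))
print(dev_normal < dev_skewed)
```

Points that stay close to the diagonal, as for the synthetic normal sample, are what an approximately normal distribution looks like in a P-P plot; the skewed sample departs from the diagonal much more.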

The Community Structure Analysis
After removing the 164 diagonal entries, whose correlation coefficient is 1, there are 13,366 edges left. By using different thresholds to convert the network into an unweighted one, we try to find the most reasonable community structure. Table 2 shows the number and percentage of edges retained under different thresholds. As Table 2 shows, the larger the threshold, the fewer edges remain in the network, and the better the remaining edges reflect close correlation between vertices.
However, too few remaining edges are not enough to reflect the correlation among stocks, so we choose several reasonable thresholds and analyze the resulting community structures. With a threshold of 0.4, 45.41% of the edges are left, as Figure 3 shows. The community structure is not clear because there are too many edges. The nodes in the upper left corner of Figure 3 are outliers, that is, nodes that have no edges to other nodes; there are three of them in this network. As the threshold gets larger, the number of outliers gets larger too. After investigation, we found that these three outliers are all controlled by state-owned enterprises.
With a threshold of 0.5, 23.69% of the edges are left. Compared to the network with threshold 0.4, most edges with small correlation coefficients have been deleted. However, as Figure 4 shows, too many edges are still retained and the community structure is still not obvious. Applying the GN algorithm to this network gives a result similar to that for threshold 0.4, so the detection result is not ideal.
With the threshold adjusted to 0.6, as Figure 5 shows, the number of edges in the network is 8.14% of the original. Only a small number of edges, corresponding to the larger coefficients, are left. Compared to thresholds 0.4 and 0.5, this network better reflects the structure formed by strongly correlated stocks. Under the GN algorithm, the community structure of the SSE 180-index network is basically stable after the sixth partition. The whole network is divided into 13 small communities and 1 big community, and the remaining vertices are outliers. The biggest community includes 100 vertices after the sixth partition, as Table 3 shows.
Compared to the detection results with thresholds 0.4 and 0.5, most nodes now fall into the same biggest community. However, because the other communities contain so few vertices, it is difficult to find their common characteristics.
After the threshold is adjusted to 0.7, as Figure 6 shows, only the vertices with larger coefficients and their edges remain in the network. Compared with the previous networks, the community structure becomes obvious.

Conclusions
In this paper, the GN algorithm is applied to find the community structure of the Shanghai stock market in order to analyze the industrial characteristics inside communities. Trying different thresholds shows that as the threshold gets larger, the community structure of the network becomes more obvious. When the threshold is 0.7, the network is divided into 13 communities, of which the first four contain the most nodes.
The results of community structure detection indicate that, first of all, stocks belonging to the same industry tend to be in the same community, for example in real estate, manufacturing and mining. The most obvious case is the banking sector of the financial insurance industry. Second, stocks of the same sub-sector tend to be in the same community. For example, the community that includes the banking sector does not include financial trust or securities stocks. Finally, industries in the same supply chain tend to be in the same community. For example, the stocks of the various metal rolling and processing industries, which belong to the mining and manufacturing industries, are assigned to the same group.
In short, the price fluctuations of stocks in the same community are highly correlated, which suggests that these stocks are influenced by the same economic factors. A stock's price is therefore sensitive to price movements within its own industry and to the volatility of stock prices along the same production chain. Beyond the macroeconomic factors that affect all stocks, stocks in the same community are more susceptible to the effects of the same economic factors.

Figure 1. The Frequency Distribution Histogram and the Probability Density Distribution Curve

Figure 2. P-P Plot Test of Correlation Coefficients

Figure 3. The SSE 180-Index Sample Stock Network Topology with Threshold 0.4

Figure 5. The SSE 180-Index Sample Stock Network Topology with Threshold 0.6

Table 1.

The SSE 180-index component stock data cover the period from Jan. 9th, 2009 to Apr. 8th, 2011. To ensure integrity, the stocks that were listed after Jan. 9th, 2009 are deleted. After deletion, 164 vertices are left in the stock network. For each stock, the weekly closing price is the daily closing price on Friday. If a stock's Friday closing price is missing, we replace it with the closing price on the last available day of that week. If the data for the whole week are missing, we replace them with the previous week's data to ensure the consistency of the transaction data. After completion, there are 118 weeks of data, a total of 19,352 closing prices (164 stocks × 118 weeks). According to the Listing Corporation Industry Classification Guide, listed companies are classified into 13 categories. The 164 stocks analyzed in this paper fall into these categories, as Table 1 shows. According to Table 1, the categories with the most SSE 180-index component stocks are manufacturing (C), real estate (J), financial insurance (I) and mining (B). These four categories account for 72% of the 164 stocks; the others account for 28%.
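The rule for completing missing weekly prices can be sketched as follows. The data layout (one list of five daily closes per week, with `None` for a missing quote), the function name `weekly_closes` and the prices are all assumptions made for illustration.

```python
def weekly_closes(daily_by_week):
    """Return one closing price per week.

    daily_by_week: list of weeks, each a list of daily closes ordered from
    Monday to Friday, with None marking a missing quote.
    """
    weekly = []
    for week in daily_by_week:
        available = [p for p in week if p is not None]
        if available:
            weekly.append(available[-1])   # Friday, else last quoted day that week
        elif weekly:
            weekly.append(weekly[-1])      # whole week missing: reuse last week
        else:
            weekly.append(None)            # nothing earlier to carry forward
    return weekly

# Hypothetical daily closes for one stock over three weeks.
weeks = [
    [10.0, 10.1, 10.2, 10.3, 10.4],     # complete week
    [10.5, 10.6, None, None, None],     # suspended from Wednesday
    [None, None, None, None, None],     # no quotes at all
]
closes = weekly_closes(weeks)
print(closes)
```

Carrying the previous week's price forward yields a zero logarithm return for that week, which is one consequence of this completion rule to keep in mind when reading the correlation coefficients.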

Table 2. Number and Percentage of Edges Retained with Different Thresholds

Table 3. The Community Structure Distribution by GN Algorithm with Threshold 0.6

Table 4. Stocks in Community 1

Number of stocks in community 1: 25

As Table 4 shows, 68% of the stocks in community 1 belong to real estate development and management. Meanwhile, stock No. 79 belongs to the civil engineering construction sector of the construction industry, and stock No. 98 belongs to the IT industry and has close business contact with stock No. 79. Stock No. 79 and stock No. 98 are thus both closely related to the real estate industry. Community 1 shows that the real estate industry, especially the development and management sector together with the other sectors belonging to real estate, has strong industrial characteristics. The stock price fluctuations within this industry are highly correlated with each other but less affected by other industries.

Table 7. Stocks in Community 4

Community 4 is shown in Table 7. All the nodes belong to banking. Although bonds and insurance also belong to the financial insurance industry, they are not included in this community. Thus, the price fluctuations of banking stocks are little affected not only by other industries but also by the securities and futures sector and the insurance sector.