Indirect Shareholding Ratio Calculation Using Graph Techniques
This article explains how to compute indirect shareholding ratios between companies by generating synthetic relationship data, cleaning and normalizing it with multiprocessing, constructing a weighted directed graph using NetworkX, and applying a matrix‑based algorithm to derive the final ownership matrix.
The introduction presents a corporate customer graph example and asks how to obtain the shareholding ratio of Company A over Company D, proposing the use of graph techniques to calculate indirect ownership.
Algorithm steps are illustrated with diagrams showing the workflow for computing indirect shareholding ratios.
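As a concrete illustration of the idea (with made-up figures): if Company A holds 60% of B and B holds 50% of D, then A's indirect stake in D along that path is 0.6 × 0.5 = 0.3, and stakes along multiple distinct paths are summed. A minimal sketch using NetworkX, with hypothetical companies and ratios:

```python
import networkx as nx

# Toy ownership graph with hypothetical ratios (edge weight = direct stake)
G = nx.DiGraph()
G.add_weighted_edges_from([
    ('A', 'B', 0.6),   # A holds 60% of B
    ('B', 'D', 0.5),   # B holds 50% of D
    ('A', 'C', 0.2),   # A holds 20% of C
    ('C', 'D', 0.4),   # C holds 40% of D
])

def indirect_ratio(graph, source, target):
    """Sum, over every simple path, the product of edge weights along it."""
    total = 0.0
    for path in nx.all_simple_paths(graph, source, target):
        ratio = 1.0
        for u, v in zip(path, path[1:]):
            ratio *= graph[u][v]['weight']
        total += ratio
    return total

print(indirect_ratio(G, 'A', 'D'))  # 0.6*0.5 + 0.2*0.4 ≈ 0.38
```

The matrix method described later in the article computes the same path products for all company pairs at once.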
Data description explains that demo data is generated with Python's faker library, producing relationship records and target customer data. Sample code for generating edge data and node data is provided:
```python
import random

import pandas as pd
from faker import Faker

fake = Faker()

# Generate shareholding-ratio (edge) data
# edge_num: number of demo relationship records to generate
def demo_data_(edge_num):
    s = []
    for i in range(edge_num):
        # investing company, invested company, investment ratio, investment date
        s.append([fake.company(), fake.company(), random.random(),
                  fake.date(pattern="%Y-%m-%d", end_datetime=None)])
    demo_data = pd.DataFrame(s, columns=['start_company', 'end_company', 'weight', 'data_date'])
    print("-----demo_data describe-----")
    print(demo_data.info())
    print("-----demo_data head---------")
    print(demo_data.head())
    return demo_data

# Generate node (target customer) data
def node_data_(node_num):
    cust_list = [fake.company() for i in range(node_num)]
    node_data = pd.DataFrame(cust_list, columns=['cust_id']).drop_duplicates()
    print('Number of nodes:', len(node_data['cust_id'].unique()))
    node_data.to_csv('node_data.csv', index=False)
```

Data processing, which uses Python's multiprocessing module, removes self-investments, filters out records with empty company names, deduplicates repeated (start_company, end_company) pairs by keeping the most recent record, keeps only the latest among multiple records with weight > 0.5 for the same investee, and normalizes an investee's incoming weights when they sum to more than 1. The processing code is:
```python
import multiprocessing
import timeit

import pandas as pd

# Helper (not shown in the original excerpt): flag identical company names
def if_same(a, b):
    return 1 if a == b else 0

# Demo data processing
def rela_data_(demo_data):
    print('Number of raw records:', len(demo_data))
    # Remove self-investments
    demo_data['bool'] = demo_data.apply(lambda x: if_same(x['start_company'], x['end_company']), axis=1)
    demo_data = demo_data.loc[demo_data['bool'] != 1]
    # Remove empty company names
    demo_data = demo_data[(demo_data['start_company'] != '') & (demo_data['end_company'] != '')]
    # Sort by date and drop duplicate (start_company, end_company) pairs, keeping the latest
    demo_data = demo_data.sort_values(
        by=['start_company', 'end_company', 'data_date'], ascending=False
    ).drop_duplicates(keep='first', subset=['start_company', 'end_company']).reset_index()
    # Among multiple records with weight > 0.5, keep only the latest value
    demo_data = pd.concat([
        demo_data.loc[demo_data['weight'] <= 0.5],
        demo_data.loc[demo_data['weight'] > 0.5].sort_values(
            by=['end_company', 'data_date'], ascending=False
        ).drop_duplicates(keep='first', subset=['end_company', 'weight'])
    ]).reset_index()[['start_company', 'end_company', 'weight', 'data_date']]
    global demo_data_init
    demo_data_init = demo_data.copy()
    # Sum of shareholding ratios per investee
    demo_data_sum = demo_data[['end_company', 'weight']].groupby(['end_company']).sum()
    # Investees whose total shareholding ratio exceeds 1
    more_one_index = demo_data_sum.loc[demo_data_sum['weight'] > 1].index.unique()
    print('Investees with total shareholding ratio > 1:', len(more_one_index))
    # Normalize ratios > 1 in parallel (works on Linux; fails on Windows)
    items = more_one_index[:]
    p = multiprocessing.Pool(32)
    start = timeit.default_timer()
    b = p.map(do_something, items)
    p.close()
    p.join()
    end = timeit.default_timer()
    print('multi processing time:', str(end - start), 's')
    base_more_one = pd.read_csv('exchange.csv', header=None)
    base_more_one.columns = ['start_company', 'end_company', 'weight', 'data_date']
    # Investees whose total shareholding ratio is at most 1
    low_one_index = demo_data_sum.loc[demo_data_sum['weight'] <= 1].index
    base_low_one = pd.merge(demo_data, pd.DataFrame(low_one_index), on=['end_company'], how='inner')
    demo_data_final = pd.concat([base_low_one, base_more_one]).reset_index()[
        ['start_company', 'end_company', 'weight', 'data_date']].drop_duplicates()
    print('Number of records after processing:', len(demo_data_final))
    demo_data_final.to_csv('demo_data_final.csv', index=False)
    return demo_data_final

# Worker function for parallel normalization
def do_something(i):
    # Records for one investee whose ratios sum to more than 1
    exchange = demo_data_init.loc[demo_data_init['end_company'] == i].sort_values(
        by=['end_company', 'data_date'], ascending=False)
    # Rescale so the investee's incoming ratios sum to 1
    weight_sum = sum(exchange['weight'])
    exchange['weight'] = exchange['weight'] / weight_sum
    exchange.to_csv('exchange.csv', encoding='utf-8', index=False, header=False, mode='a')
    print('-----End of', i, '-----')
```

Graph construction uses NetworkX to build a directed weighted graph from the cleaned relationship data. The relevant code is:
```python
import networkx as nx

# Build the directed ownership graph
def graph_(rela_data):
    Graph = nx.DiGraph()
    for indexs in rela_data.index:
        Graph.add_weighted_edges_from([tuple(rela_data.loc[indexs].values)])
    return Graph

global Graph
Graph = graph_(rela_data[['start_company', 'end_company', 'weight']].drop_duplicates())
print('Number of nodes in the graph:', Graph.number_of_nodes())
print('Number of edges in the graph:', Graph.number_of_edges())
```

The model description introduces a matrix-based method to obtain the indirect shareholding ratio matrix, with a decay parameter C and iterative multiplication:
```python
import numpy as np

# Obtain the (indirect) shareholding ratio matrix
def sum_involution(ma, n_step):
    # Decay parameter (C = 1 means each extra hop is counted at full weight)
    C = 1
    mab = ma
    result = ma
    for _ in range(n_step - 1):
        # Next power of the direct-ownership matrix (one more hop)
        ma = ma.dot(mab).round(6)
        # Zero the diagonal: a company's stake in itself is not counted
        np.fill_diagonal(ma.values, 0, wrap=True)
        result = result + C * ma
    return result
```

An example of the model output is shown with a diagram illustrating the computed indirect ownership matrix.
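To make the computation concrete, the routine can be exercised on a tiny hypothetical matrix (the function is restated so the snippet runs standalone; companies and ratios are invented):

```python
import numpy as np
import pandas as pd

def sum_involution(ma, n_step):
    # Decay parameter (C = 1: extra hops counted at full weight)
    C = 1
    mab = ma
    result = ma
    for _ in range(n_step - 1):
        ma = ma.dot(mab).round(6)          # next hop: matrix power
        np.fill_diagonal(ma.values, 0, wrap=True)
        result = result + C * ma
    return result

# Hypothetical direct-ownership matrix: A -> B 60%, B -> D 50%
companies = ['A', 'B', 'D']
direct = pd.DataFrame(0.0, index=companies, columns=companies)
direct.loc['A', 'B'] = 0.6
direct.loc['B', 'D'] = 0.5

total = sum_involution(direct, 3)
print(total.loc['A', 'D'])  # 0.6 * 0.5 = 0.3, contributed by the two-hop term
```

The first-power term carries the direct stakes, and each further multiplication adds the stakes reachable through one more intermediate company.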
Future work mentions discovering hidden relationships, applying community detection (e.g., Louvain) for group segmentation, and using supervised learning with known group labels to tune the decay parameter C.
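The group-segmentation idea can be sketched with NetworkX's built-in Louvain implementation (available in NetworkX 2.8 and later); the graph below and its weights are invented for illustration, and tuning the decay parameter C against known group labels is left out:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Hypothetical ownership graph: two tightly knit groups joined by a weak cross-holding
G = nx.Graph()
G.add_weighted_edges_from([
    ('A', 'B', 0.6), ('A', 'C', 0.5), ('B', 'C', 0.4),  # group 1
    ('X', 'Y', 0.7), ('X', 'Z', 0.6), ('Y', 'Z', 0.5),  # group 2
    ('C', 'X', 0.05),                                    # weak bridge
])

# Louvain community detection; a fixed seed makes the partition reproducible
communities = louvain_communities(G, weight='weight', seed=42)
print([sorted(c) for c in communities])
```

With the weak bridge, Louvain separates the two ownership clusters, which is the kind of group segmentation the future-work section has in mind.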
The full source code is available at https://github.com/MO2T/1.Recognition_of_implicit_relationship.