block by riccardoscalco 37c06615ce51534de742

Twitter Influencers

Twitter user visibility by means of regular Markov chains.

This is an attempt to define user visibility on a specific topic. Briefly, tweets are collected via the Twitter streaming API, stored in an SQLite database and then processed to build a regular Markov chain. The steady-state distribution of the chain assigns a score to each Twitter user, which can be used to retrieve an ordered list of users.

Have a look at this paper and this other paper for further details about the mathematical methods.

Be careful: the procedure described here is experimental and is not meant to be used in production environments.

Fetching and storing tweets from the Twitter streaming API

Fill in the file tweet_stream.py with your Twitter credentials, and change the keywords and languages as you wish. Then start the script:

$ nohup python tweet_stream.py #Python3

The script saves an SQLite database containing information about each tweet: tweet id, language, text, creation date, user id, username, the mentions, hashtags and URLs it contains and, for retweets, the id and user id of the original tweet. The script keeps running until you stop it.
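To give an idea of what storing these fields can look like, here is a minimal sketch based on the standard-library sqlite3 module. It is not the code of tweet_stream.py: the table name, the column names and the store_tweet helper are all hypothetical, and the real script may rely on a different layout.

```python
import sqlite3

# Hypothetical schema: the table and column names are illustrative only
# and may differ from those created by tweet_stream.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id          TEXT PRIMARY KEY,
    lang              TEXT,
    text              TEXT,
    created_at        TEXT,
    user_id           TEXT,
    username          TEXT,
    mentions          TEXT,
    retweeted_id      TEXT,
    retweeted_user_id TEXT
)
"""

COLUMNS = ("tweet_id", "lang", "text", "created_at", "user_id",
           "username", "mentions", "retweeted_id", "retweeted_user_id")


def store_tweet(conn, tweet):
    """Insert one tweet (a dict keyed by column name), skipping duplicate ids."""
    sql = "INSERT OR IGNORE INTO tweets ({}) VALUES ({})".format(
        ", ".join(COLUMNS), ", ".join("?" for _ in COLUMNS))
    # Missing keys become NULL, e.g. the retweet fields of an original tweet.
    conn.execute(sql, [tweet.get(name) for name in COLUMNS])
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to keep the data
conn.execute(SCHEMA)

example_tweet = {
    "tweet_id": "1", "lang": "en", "text": "hello @bob",
    "created_at": "2014-10-02 06:02:23", "user_id": "10",
    "username": "alice", "mentions": "20",
}
store_tweet(conn, example_tweet)
store_tweet(conn, example_tweet)  # same tweet id, silently ignored
stored = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

Using the tweet id as primary key means that a tweet delivered twice by the stream is stored only once.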

Define an ordered list of Twitter users

Dependencies: before continuing, please install networkx, dataset and pykov.

Open a Python shell. The following commands create a directed graph G from the collected tweets:

>>> import analysis
>>> G, Q = analysis.getGraph('tweets-Thu-Oct--2-06:02:23-2014.db')

G is a networkx DiGraph object: nodes are Twitter user_ids and edges are links between them (expressed as tuples).

>>> type(G)
<class 'networkx.classes.digraph.DiGraph'>
>>> list(G.edges())[:3]
[('231907053', '1117325342'), ('2567615524', '928342410'), ('2567615524', '18396319')]
>>> list(G.nodes())[:3]
['2456481724', '231907053', '2567615524']

Q is a Python dictionary whose keys are links between Twitter users and whose values are the number of times each link has been observed:

>>> Q[('525630064', '282695161')]
3.0
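A dictionary of this shape can be built by counting ordered pairs of user ids. The sketch below only illustrates the idea and is not the code of analysis.getGraph: the count_links helper and the sample pairs are hypothetical, and which interactions (mentions, retweets) produce a link, and in which direction, is an assumption.

```python
from collections import Counter


def count_links(interactions):
    """Return {(source, target): count}, with float counts as in Q."""
    counts = Counter()
    for source, target in interactions:
        counts[(source, target)] += 1.0
    return dict(counts)


# Hypothetical input: one (source user id, target user id) pair per
# observed interaction.
interactions = [
    ("525630064", "282695161"),
    ("525630064", "282695161"),
    ("282695161", "14499829"),
    ("525630064", "282695161"),
]
Q_example = count_links(interactions)
```

The keys of such a dictionary are exactly the edges of the directed graph, so the graph itself can be built from them.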

The directed graph turns out to be disconnected, as expected. The lack of connectivity is a trivial consequence of the fact that each link is created independently of the others, so there is no reason to expect a connecting path between every pair of nodes.

The largest strongly connected component can be found with the following code:

>>> import networkx as nx
>>> scc = list(nx.strongly_connected_component_subgraphs(G))
>>> scc = sorted(scc,key = lambda graph: len(graph.nodes()))
>>> F = scc[-1]
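Note that strongly_connected_component_subgraphs is not available in recent NetworkX releases (as far as I know it was removed in version 2.4). A minimal equivalent, assuming a recent NetworkX and using a small made-up graph in place of G, could look like this:

```python
import networkx as nx

# Toy graph standing in for G: a, b and c form a cycle, d and e do not.
G_example = nx.DiGraph()
G_example.add_edges_from([
    ("a", "b"), ("b", "c"), ("c", "a"),  # strongly connected cycle
    ("c", "d"), ("e", "d"),              # d and e cannot reach the cycle
])

# strongly_connected_components yields one set of nodes per component;
# keep the largest set and extract the subgraph it induces.
largest = max(nx.strongly_connected_components(G_example), key=len)
F_example = G_example.subgraph(largest).copy()
```

The call to copy() returns an independent graph instead of a read-only view of G_example.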

The regular Markov chain is created on top of the largest component:

>>> T = analysis.getTransitionMatrix(F,Q)
>>> type(T)
<class 'pykov.Chain'>
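One natural way to turn link counts into a Markov chain is to normalize the counts of the links leaving each node, so that they sum to one. Whether analysis.getTransitionMatrix does exactly this is an assumption; the transition_probabilities helper and the sample counts below are hypothetical.

```python
from collections import defaultdict


def transition_probabilities(nodes, link_counts):
    """Row-normalize the counts of the links whose endpoints are both in nodes."""
    nodes = set(nodes)
    kept = {}
    totals = defaultdict(float)
    for (source, target), count in link_counts.items():
        # Links leaving the component are dropped before normalizing.
        if source in nodes and target in nodes:
            kept[(source, target)] = count
            totals[source] += count
    return {link: count / totals[link[0]] for link, count in kept.items()}


# Hypothetical counts; node "d" lies outside the component {a, b, c}.
Q_example = {
    ("a", "b"): 3.0, ("a", "c"): 1.0,
    ("b", "a"): 2.0,
    ("c", "a"): 1.0, ("c", "d"): 5.0,
}
P_example = transition_probabilities({"a", "b", "c"}, Q_example)
```

Here the link from a to b gets probability 3 / (3 + 1) = 0.75. As far as I can tell, a pykov.Chain is itself a dictionary keyed by such (state, state) pairs.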

The above chain is ergodic and its steady state can be easily derived. Users with the highest stationary probability are the most visible:

>>> s = T.steady().sort(True)
>>> s[:5]
[(u'14499829', 0.17468580328789371), (u'5402612', 0.037378364822550497), (u'427900496', 0.033809038083563836), (u'2195671183', 0.026123344415636889), (u'17899109', 0.022593710080353133)]
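For readers curious about what the steady state is, here is a minimal pure-Python sketch that approximates it by power iteration: start from the uniform distribution and apply the transition probabilities until the distribution stops changing. This is not how pykov computes it (pykov may well solve a linear system instead); the steady_state helper and the three-state chain are hypothetical.

```python
def steady_state(P, tol=1e-12, max_iter=10000):
    """Approximate the stationary distribution of the chain P by power iteration.

    P maps (source, target) pairs to transition probabilities.
    """
    states = sorted({state for link in P for state in link})
    pi = {state: 1.0 / len(states) for state in states}
    for _ in range(max_iter):
        new = dict.fromkeys(states, 0.0)
        for (source, target), prob in P.items():
            new[target] += pi[source] * prob
        if max(abs(new[s] - pi[s]) for s in states) < tol:
            return new
        pi = new
    return pi


# Hypothetical regular chain; the self-loop on "c" makes it aperiodic.
P_example = {
    ("a", "b"): 0.5, ("a", "c"): 0.5,
    ("b", "a"): 1.0,
    ("c", "a"): 0.5, ("c", "c"): 0.5,
}
pi = steady_state(P_example)
# Most visible states first, as in the sorted list above.
ranking = sorted(pi.items(), key=lambda item: item[1], reverse=True)
```

For this chain the exact solution is 0.4 for a and c, and 0.2 for b. Regularity matters here: on a periodic chain the iteration can oscillate forever instead of converging.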


analysis.py

tweet_stream.py