Analyse social media, news or blogging data by querying the original data sources with no need for data integration!
The Media Planning case study focuses on the analysis of social media, news or blogging data.
Interest in analyzing the interrelations between entities in huge networks such as the Web, geographical systems, or social networks is rapidly increasing, driving the creation of information management systems that can perform graph-oriented operations efficiently. In particular, there are complex scenarios where multiple data sources in different formats must be integrated and interrelated to extract valuable information from the relationships between documents, entity interactions, socio-economic data and user behaviours, among others. In this case, the graph store becomes the centrepiece of the data management architecture, where all the different data sources are integrated and connected after a deduplication process that identifies common entities across sources. But integrating huge amounts of data in different formats into the graph can be a very expensive process when only a few attributes are relevant for exploring the relationships, so the ability to process queries distributed across the original data sources becomes critical.
One use case for graph technologies at the centre of a pool of distinct data sources is media planning: the task of a media agency to identify the most appropriate media platform for a client's product or brand. Here the data sources can be of different kinds and formats: documents from blogging and micro-blogging systems, on-line news publishers, newspaper advertisements or company profiles; log files recording user behaviour on web sites and social network portals; lists of links capturing user interactions (messages) or group memberships; and structured (relational) data such as user profiles and socio-economic information. Graph mining techniques compute the reputation of people to identify viral leaders, classify user roles (generators, consumers, opinion makers, group members, followers, etc.), and extract sentiment and preferences. They also build an ontology of the domains and key concepts in the documents, and model how information propagates over time. For all of this, the graph database basically requires only the interconnections between the data entities and a few extra attributes to characterize them during construction and querying, not the full data of each entity.
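To make the last point concrete, here is a minimal sketch (all class and attribute names are illustrative, not the actual system's API) of a graph that keeps only entity links plus a few characterizing attributes, while each node carries a reference back to the full record in its original store:

```python
# Hypothetical lightweight graph: nodes hold only a (source, key) reference
# to the original datastore plus a few attributes (e.g. reputation, role);
# the full entity data never leaves its source.

class LightweightGraph:
    def __init__(self):
        self.nodes = {}   # node_id -> {"source": ..., "key": ..., "attrs": {...}}
        self.edges = {}   # node_id -> list of (target_id, relation)

    def add_entity(self, node_id, source, key, **attrs):
        # Only a few characterizing attributes live in the graph itself.
        self.nodes[node_id] = {"source": source, "key": key, "attrs": attrs}
        self.edges.setdefault(node_id, [])

    def add_relation(self, src, dst, relation):
        self.edges.setdefault(src, []).append((dst, relation))

    def neighbours(self, node_id, relation=None):
        return [t for (t, r) in self.edges.get(node_id, [])
                if relation is None or r == relation]

g = LightweightGraph()
g.add_entity("u1", source="relational", key="users/1", reputation=0.9, role="generator")
g.add_entity("u2", source="relational", key="users/2", reputation=0.4, role="follower")
g.add_entity("d1", source="document", key="blogs/post-17")
g.add_relation("u1", "d1", "authored")
g.add_relation("u2", "u1", "follows")
```

Any query that needs the full user profile or document body follows the `(source, key)` reference back to the relational or document store instead of duplicating that data in the graph.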
Once the relationship graph has been built and enriched, media planners can start their analysis by executing complex queries across the different data sources. In this use case, we mainly resolve the most relevant authors for a topic and the most important communities, with their influencers, for a given set of keywords. These queries touch four different data sources (document, graph, key-value and relational). Therefore, having a middleware (i.e., the common query engine) that integrates the results saves a lot of development time and makes the application more robust: in one statement you run a query over different datastores. Let's see the execution plan for retrieving the communities.
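The "most relevant authors for a topic" query can be sketched as a small mediator over mocked stores. Everything here — the store contents, the ranking rule, the function names — is a simplified assumption for illustration, not the actual query engine:

```python
# Hypothetical mediator answering "most relevant authors for these keywords"
# by combining three mocked stores in a single call: document store (matching
# docs), graph store (authorship edges) and key-value store (reputation).

documents = {                       # document store: doc_id -> keyword set
    "d1": {"electric", "cars"},
    "d2": {"electric", "bikes"},
}
authored_by = {"d1": "u1", "d2": "u2"}   # graph store: doc -> author edge
reputation = {"u1": 0.9, "u2": 0.4}      # key-value store: author -> score

def relevant_authors(keywords):
    # 1. Document store: documents containing all the keywords.
    docs = [d for d, kws in documents.items() if keywords <= kws]
    # 2. Graph store: follow the authorship edges.
    authors = {authored_by[d] for d in docs}
    # 3. Key-value store: rank the authors by reputation.
    return sorted(authors, key=lambda a: reputation[a], reverse=True)

print(relevant_authors({"electric", "cars"}))   # prints ['u1']
```

In the real system this composition is done once by the common query engine, so the application issues the whole cross-store query as one statement instead of hand-coding the three steps against three client libraries.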
On the other hand, updating this information incrementally is a very complex process: it requires deduplication, document indexation, and re-inferring the influence relationships and communities. Moreover, different datastores with different characteristics must be updated simultaneously. This is easily performed with the Holistic Transactional Manager and the Complex Event Processor.
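The deduplication step of such an incremental update can be illustrated with a toy entity-resolution routine. The normalization rule and names below are assumptions for illustration; a production system would use far richer matching:

```python
# Illustrative deduplication for incremental updates: a new record is matched
# against known entities by a normalized key, so the same person arriving from
# two different sources resolves to a single graph node. The matching rule
# (lowercase + whitespace collapse) is a deliberately naive assumption.

def normalize(name):
    return " ".join(name.lower().split())

known = {}  # normalized name -> canonical entity id

def upsert(name, source):
    key = normalize(name)
    if key not in known:
        known[key] = f"e{len(known) + 1}"   # first sighting: new canonical entity
    return known[key]                        # later sightings reuse the same node

a = upsert("Jane  Doe", source="twitter")
b = upsert("jane doe", source="news")
assert a == b  # both records map to one graph node
```

In the full pipeline this resolution, the document re-indexing, and the propagation of new influence edges would all be applied atomically across the datastores, which is exactly the role the text assigns to the Holistic Transactional Manager and the Complex Event Processor.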