Past Research Projects

INSITE: Center for Business Intelligence and Analytics is a research center that addresses the ever-growing volume, velocity, and variety of big data being generated by social media and Web 2.0.

This innovative center focuses on predictive analytics through the use of data generated from social media, internal transactional, sensor, and other emerging big data to provide analytics across multiple social media platforms. It provides visualization and real-time network analysis of interaction patterns gleaned from big data.

Previous INSITE and ADRG Research Projects

The sections below describe past INSITE and Advanced Database Research Group (ADRG) research projects:

Health Care Analytics

Two health care analytics research projects: Analysis of Patient Reported Health Outcomes from Online Health Communities, and Pharmacovigilance and Safety Signals: Analysis of Large Clinical Trials Datasets.

“Analysis of Patient Reported Health Outcomes from Online Health Communities” with Sanofi and Critical Path Institute (C-PATH)

The objective of this study is to collect patient-reported data from online communities to develop (1) the capability of bringing together extremely large, disparate data sets, (2) the standards needed to analyze and share this data, and (3) the subsequent health-related informatics analysis skills and tools.

Data is being extracted from reports that patients post in online health communities (OHCs).

INSITE is using innovative analytical techniques and developing visualizations and reports to demonstrate new insights into this data of potential value to the OHCs, as well as helping C-Path in its efforts to improve data standards and improve patient outcomes.

“Pharmacovigilance and Safety Signals: Analysis of Large Clinical Trials Datasets” with C-PATH and Oracle.

This project entails conducting an analysis of safety signals using longitudinal datasets collected from clinical trials on Alzheimer’s patients. The Coalition Against Major Diseases (CAMD) clinical trials database is possibly the first large-scale, heterogeneous research database of clinical trials information in standardized format that includes multiple products under development for treating Alzheimer’s and Parkinson’s diseases.

Researchers are currently limited to the use of custom programs in languages such as SAS for exploring this data. This collaboration among INSITE, C-Path, and Oracle will develop advanced data mining and analysis techniques, including new Bayesian methods, to explore their application to CAMD and provide medical insights into treatment of Alzheimer’s and Parkinson’s that may be uncovered as a result.

How Social is Social Bookmarking?

Social bookmarking services allow a user to make her personal collection of favorite web resources accessible to the public. The content of this collection can attract users of “similar minds” and therefore has tremendous potential to enable networking and collaboration. In this research, we analyzed a large dataset collected from one of the most popular social bookmarking services. To understand why there is a large gap between a user’s explicit network and her implicit user-user association networks based on common resources or common tags, we compared a user’s bookmarked resources and tags to those of her explicit network members. Our results suggest that a typical social bookmarking service user does not create her explicit network based on common interests. We discuss the implications of the gap between a user’s explicit network and implicit network and propose solutions to enhance and improve the “social” functions of social bookmarking services.
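The comparison between explicit and implicit networks can be illustrated with a small sketch. This is not the study's actual method or data: the user names, tag sets, the use of Jaccard overlap, and the 0.3 threshold are all assumptions chosen for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data: each user's set of bookmark tags.
tags = {
    "ana": {"python", "databases", "visualization"},
    "ben": {"cooking", "travel"},
    "cho": {"python", "databases", "statistics"},
}

# Hypothetical explicit network: the users each user chose to follow.
explicit = {"ana": {"ben"}}


def mean_overlap(user: str, others: set, tags: dict) -> float:
    """Average tag overlap between a user and a group of other users."""
    others = list(others)
    if not others:
        return 0.0
    return sum(jaccard(tags[user], tags[o]) for o in others) / len(others)


def implicit_neighbors(user: str, tags: dict, threshold: float = 0.3) -> set:
    """Users whose tag overlap with `user` meets the (assumed) threshold."""
    return {
        other
        for other in tags
        if other != user and jaccard(tags[user], tags[other]) >= threshold
    }


explicit_overlap = mean_overlap("ana", explicit["ana"], tags)
implicit = implicit_neighbors("ana", tags)
print(explicit_overlap, implicit)
```

In this toy data the user's explicit contact shares no tags with her, while the user with the most similar tags is not in her explicit network, which is the kind of gap the study describes.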

Organizing Social Bookmarking Tags Using a Network Analysis Approach

Social bookmarking tags are generated and shared by web users as online content metadata for different reasons. We believe these reasons can be used to distinguish tags from each other, and thus, can be used to organize the flat tag space into a faceted structure. In this research, we collected a large set of tags from Delicious and empirically analyzed them using a social network analysis approach. Our results show that tags can be organized into a faceted structure and these facets can be derived from the social network analysis.

Who Does What: Collaboration Patterns in the Wikipedia and their Impact on Data Quality

The quality of entries in the world's largest open-access online encyclopedia depends on how authors collaborate, UA Eller College Professor Sudha Ram finds.


iPlant Collaborative: A Cyberinfrastructure to Support Plant Biology

How do we feed a growing world? The human population is increasing, while farmland decreases and food cultivation competes with fuel production. In addition, climate change and energy sustainability impact agriculture, ecology, and biodiversity. Developing solutions to these problems means understanding how the organisms that contribute to our food, fuels, and ecosystem are shaped by the interactions between their genetics and the environment. By enabling biologists to do data-driven science by providing them with powerful computational infrastructure for handling huge datasets and complex analyses, iPlant fills a niche created by the computing epoch and a rapidly evolving world.   

Investigating Data Provenance in the Context of New Product Design and Development

Funding Agency: National Science Foundation
Funding Period: May 2005-2006
Amount: $244,404

Information is one of the biggest assets for most enterprises. In today's information age, almost every enterprise decision is based on a detailed analysis of data recorded in diverse sources ranging from structured databases to the World Wide Web. To ensure that data retrieved from different sources is used appropriately and within context, it is imperative that the provenance of the data be recorded and made available to its users. Provenance refers to the knowledge that enables a piece of data to be interpreted correctly. It is the essential ingredient that ensures that users of data (for whom the data may or may not have been originally intended) understand the background of the data. This includes elements such as who (person) or what (process) created the data, where it came from, how it was transformed, the assumptions made in generating it, and the processes used to modify it. This research team will investigate the semantics of data provenance and will develop an ontology to represent those semantics, including ways to automate the capture of provenance. Using new product design and development as the real-world domain, a partnership will be formed with a large defense contracting company, viz., Raytheon Missile Systems, located in Tucson, Arizona, to investigate these research issues. A testbed will be created to capture and use provenance and to evaluate the system's utility using a well-defined set of metrics. Raytheon has committed considerable resources in the form of personnel and access to software as needed for this research. The intellectual merit of this proposal stems from the theoretical framework for understanding and representing the semantics of data provenance. This is considerably different from existing work on provenance, which has mainly explored the 'where' and 'why' of provenance. This work will pave the way for understanding the extent to which provenance can be automatically captured.
The project has the potential for broader impacts on society. Most importantly, the development of techniques to represent, capture and deploy provenance has the potential to revolutionize the Department of Defense product development industry and other domains as well. The ultimate goal is to enable the development of autonomic and interoperable enterprise data management systems.
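The provenance elements listed above (who or what created the data, where it came from, how it was transformed, and the assumptions behind it) could be captured in a record like the following. This is a minimal sketch, not the project's ontology: the class name, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record of the provenance elements named in the text."""

    data_id: str                 # which piece of data this describes
    created_by: str              # who (person) or what (process) created it
    source: str                  # where the data came from
    transformations: tuple = ()  # how it was transformed, in order
    assumptions: tuple = ()      # assumptions made in generating it
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def with_transformation(self, step: str) -> "ProvenanceRecord":
        """Return a new record with one more transformation step appended."""
        return replace(self, transformations=self.transformations + (step,))


# Hypothetical example values from a product design setting.
original = ProvenanceRecord(
    data_id="part-42-mass",
    created_by="cad_export_process",
    source="design_database",
    assumptions=("mass excludes fasteners",),
)
converted = original.with_transformation("converted pounds to kilograms")
print(converted.transformations)
```

Keeping the record immutable means each modification produces a new record, so the history of the data is extended rather than overwritten.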

Modeling Business Rules

A business enterprise typically functions using business rules. A business rule is a statement that intends to assert the structure or control the behavior of the enterprise. From an information systems perspective, business rules function as constraints on a database, helping to ensure that the structure and content of the real world (sometimes referred to as the miniworld) is accurately incorporated into the database. It is important to elicit these rules during the analysis and design stage, since the captured rules are the basis for subsequent development of a business constraints repository. We present a taxonomy for set-based business rules, and describe a framework for modeling rules that constrain the cardinality of sets. Our proposed framework includes various types of constraints on a semantic model that supports abstractions like classification, generalization/specialization, aggregation, and association. Via our proposed framework, we show how explicitly capturing business rules will help bridge the semantic gap between the real world and its representation in an information system.
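A rule that constrains the cardinality of a set can be sketched as a small check. This is an illustration only, not the framework from the research: the class name, the rule, and its bounds are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CardinalityRule:
    """Hypothetical business rule bounding the number of members in a set."""

    name: str
    minimum: int = 0
    maximum: Optional[int] = None  # None means no upper bound

    def is_satisfied_by(self, members: Iterable) -> bool:
        """Check the number of distinct members against the bounds."""
        size = len(set(members))
        if size < self.minimum:
            return False
        return self.maximum is None or size <= self.maximum


# Assumed example rule: a project team has between two and five employees.
team_size_rule = CardinalityRule("project team size", minimum=2, maximum=5)

print(team_size_rule.is_satisfied_by({"ana", "ben", "cho"}))
print(team_size_rule.is_satisfied_by({"ana"}))
```

Stating the rule as data, rather than burying it in application logic, is what allows it to be collected into a constraints repository and checked against the database.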

Data Management for the Human Subjects Protection Program

The objective of this project is to develop a comprehensive data management system to support the business processes of the Human Subjects Protection Program (HSPP) at the University of Arizona. This project has been funded by the National Institutes of Health (NIH). The data management system will link disparate databases that currently exist in the HSPP office. The system will facilitate smooth, timely and standardized flow of data from widely disparate sites, streamline and integrate the information to be gleaned from these data, and facilitate the transfer of knowledge to a wider community. It will achieve higher efficiency and effectiveness in the processes used for screening proposals through the HSPP.

Mediators for Interoperability: The USM* System

Project Sponsors: National Science Foundation, National Aeronautics and Space Administration, and IBM

This research is aimed at developing a formal semantic model, a theoretical framework, and a methodology to facilitate interoperability among distributed and heterogeneous geographic database systems. The primary objective is to develop techniques to identify and resolve various data and schema level conflicts among multiple information sources. Set theory is used to formalize the semantic model (called USM*), which supports explicit modeling and representation of the complex nature of geographic data objects. A comprehensive framework for classifying semantic conflicts has been developed. The framework is then used as a basis for automating the detection and resolution of conflicts among heterogeneous databases. While the focus is on geographic databases, the work is applicable to non-geographic databases as well. A methodology for conflict detection and resolution has been developed to provide interoperability. The methodology is based on the concept of a “mediator.” Several types of mediators and an ontology called SCROL have been defined to provide mediation services.

A software toolkit called USM* that embeds these concepts has been implemented. This toolkit has several components including:

  • A modeling system that allows users to develop a semantic model to describe their local databases and the federated or global schema
  • An ontology definition system that allows users to define different types of semantic conflicts and also map from their local or federated schema to the ontology
  • A mapping tool to map from the federated schema to the local schemas
  • A query tool that allows the federated schema to be used in accessing the individual databases for queries.

The entire toolkit has been implemented in Java using a three-tier architecture. Oracle is used as the backend for storing the repository of metadata and ontology. The system can be accessed through any Java-enabled web browser. We have evaluated the utility of the system and tested it using several case studies. The results of our evaluation indicate that mediator-based systems hold great promise for integrating disparate data sources on the web.
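The mediator idea described above can be illustrated with a toy example that resolves one schema-level conflict (different attribute names) and one data-level conflict (different units) between two local sources. This sketch is in Python for brevity and does not reflect the USM* toolkit's Java implementation or SCROL; the class, attribute names, and sample rows are hypothetical.

```python
from typing import Callable, Dict, List, Optional


class Mediator:
    """Hypothetical mediator mapping local rows onto a federated schema."""

    def __init__(self) -> None:
        self._sources = []

    def register(
        self,
        rows: List[dict],
        attribute_map: Dict[str, str],
        converters: Optional[Dict[str, Callable]] = None,
    ) -> None:
        """Add a local source with its name mapping and value converters."""
        self._sources.append((rows, attribute_map, converters or {}))

    def query(self) -> List[dict]:
        """Return every registered row expressed in the federated schema."""
        results = []
        for rows, attribute_map, converters in self._sources:
            for row in rows:
                unified = {}
                for local_name, value in row.items():
                    if local_name not in attribute_map:
                        continue  # attribute has no federated counterpart
                    convert = converters.get(local_name)
                    federated_name = attribute_map[local_name]
                    unified[federated_name] = convert(value) if convert else value
                results.append(unified)
        return results


mediator = Mediator()
# Assumed source A stores elevation in feet under a different attribute name.
mediator.register(
    rows=[{"station": "A", "elev_ft": 1000.0}],
    attribute_map={"station": "site", "elev_ft": "elevation_m"},
    converters={"elev_ft": lambda feet: feet * 0.3048},
)
# Assumed source B already matches the federated units but not the names.
mediator.register(
    rows=[{"name": "B", "height_m": 250.0}],
    attribute_map={"name": "site", "height_m": "elevation_m"},
)
results = mediator.query()
print(results)
```

A query against the federated schema then sees a single `site` and `elevation_m` vocabulary, regardless of how each local database names or measures those attributes.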


Saguaro Digital Library for Natural Asset Management

The United States has been endowed with a tremendous amount of natural assets in the form of abundant fresh water, ecological habitats, forests, grasslands, fisheries, fertile lands, and climatic conditions. The Arizona Growing Smarter Initiative has been proposed by the State of Arizona in recognition of the value of these natural assets. As part of a new paradigm that is in evidence nationwide and in Arizona, we are recognizing that there is a close link between the economy and the environment and that our natural assets of the arid and semi-arid regions of the U.S. Southwest are facing increasing risk due to a variety of natural and human impacts. There is an immediate need to actively conserve biodiversity and protect our natural ecosystems in order to preserve the quality of human life. This requires the use of information technology for enhancing our understanding of the interdependence between the economy and the environment. In response to this need, we propose to develop the Saguaro Digital Library (SDL), a comprehensive digital library system providing a full range of services to facilitate our understanding of the impacts of natural and human environmental hazards, to provide models of environmental change that can access and utilize data, processing tools, and algorithms across the Internet, and to provide a wide range of users the ability to obtain quantitative measures of this change. The primary focus of the SDL is to facilitate the responsible stewardship of our natural assets and good ecosystem management. Our project will directly address the goals of the National Biological Information Infrastructure (NBII) by developing the capability to share data and resources so that biodiversity and ecosystem findings can be more readily applied in management and policy. The library will specifically provide decision support tools to improve monitoring of ecosystem status, better predict and mitigate change, and optimize sustainable productivity.
This proposal addresses the key issues of (a) interoperability among digital library collections, (b) harvesting of resources to provide a living and evolving digital library, (c) ensuring long-term sustainability of the library, and (d) addressing the needs of a wide variety of users, especially those who are not experts in the use of remotely sensed data and Geographic Information Systems (GIS). The ultimate goal of our project is to allow components of the digital library to evolve independently and yet be able to call on one another efficiently and conveniently. Thus, our digital library will support heterogeneous and federated collections of digital content, including data, metadata, models, tools, and algorithms. The Saguaro Digital Library is being developed by a consortium of University research groups and Federal and State agencies, in conjunction with industrial partners. State Agencies partnering in this proposal include the Arizona State Lands Department, the Arizona State Cartographers Office, The Arizona State Geological Survey, and The Arizona Geographic Information Council. Federal Agencies participating in the project include the United States Geological Survey (USGS) Cooperative Park Studies Unit, The US Army, The Rocky Mountain Research Station, Los Alamos National Laboratory, The Nature Conservancy, and the US National Park Service. Our industrial partners involved in the proposal include Online Computer Library Center Inc. (OCLC), Raytheon STX, and Simons International Corporation. Our K-12 partners include Lawrence Intermediate School, Fort Lowell Elementary School, and the Vail School District from Arizona.
The development of the library will be led by the Department of Management Information Systems in collaboration with other departments at the University of Arizona (UA), including the UA Library, Hydrology and Water Resources, Arid Land Studies, Geography, Electrical and Computer Engineering, the Arizona Regional Image Archive, and Renewable Natural Resources. We have a commitment and plan to sustain the library through the efforts of the UA USGS Biological Resources Division after the funding period.
