Management of Complex Data Structures: Upgrades to EUGene data software

Since it was first made available in 1998, the software program EUGene has evolved into a useful tool for scholars who analyze international conflict and crisis data by providing a common software infrastructure for dataset construction. Initially designed to generate data for variables used to test Bruce Bueno de Mesquita and colleagues' version of an expected utility theory of conflict initiation and escalation (e.g., Bueno de Mesquita and Lalman 1992), EUGene has developed into a widely used data management toolbox. The program makes routine a set of data preparation tasks that are cumbersome and difficult, and it keeps track of critical research design choices made by users. By freeing scholars from the need to perform technical data manipulations, it facilitates more advanced research and theorizing. While expected utility data creation was the initial driving force behind EUGene, later expansions have focused on increasing the program's ability to generate a variety of useful datasets quickly and easily. We initially developed and released EUGene as freeware in 1998 with support from NSF grant SBR-9601151. We expanded its functionality under NSF grants SES-9975115, SES-0213727, and SES-0452173 and made numerous additions without NSF funding. New releases, expansions, and bug fixes have been published at least annually since 1998.

We request two years of continuing funding to further expand the EUGene software. Continued program development will focus on a redesign of the data management engine to support new, more flexible datasets that enable scholars to account for extensive cross-case correlation, endogeneity, and spatial correlation within the datasets they use. We also plan to build in tools that allow users to construct uni- and multidimensional scales from existing data. Continued development of EUGene will facilitate both new research and the expansion of ongoing research agendas by many scholars. More specifically, in this expansion we focus on five key additions:
1. Develop an observational format that relaxes the current restriction to monadic or dyadic data.
2. Allow users to specify the unit of observation, expanding the scope beyond the current country-year restriction and relaxing the restriction to annual data by allowing the generation of datasets at any time interval the user chooses. Part of this improvement will be a unit code translation facility that allows the merging of ISO, COW, IMF, and other country coding schemes. Users will be able to select the temporal element of any output unit, e.g., daily, weekly, monthly, quarterly, or annual data. We will also add more flexibility to the input data, allowing the units under observation to be the country, the IGO, the terrorist group, the non-state actor, or any other systematic unit that can be specified by the user in a standard format.
3. Develop an expanded sampling engine to accommodate the greatly increased population of observations that the new data structures will generate.
4. Build in an expanded scaling and measurement engine to allow users to build scales and indices of a wide range of types.
5. Develop the data handling routines needed for spatial regression analysis.
In addition, EUGene 4.0 will include the most recent releases of an expanding variety of pre-existing datasets. This proposal primarily requests support for research assistants, software coders, and conference travel, with limited direct support for the Co-PIs. Below, we summarize how EUGene has developed to this point and then turn to a fuller discussion of the proposed software improvements and extensions.