Database update and final report.
6 Month Progress Report for Earthtime-EU Project ‘NUTS’
1 July 2014
David Lazarus and Johan Renaudie, Museum für Naturkunde, Berlin
This project (NSB-Upgrade of Taxonomy and Stratigraphy: NUTS) has several distinct goals. The two main ones are to 1) update the taxonomic name lists in the Neptune Sandbox Berlin (NSB) database of marine microfossil data, using IODP’s recently compiled Taxonomic Name Lists (TNLs); and 2) update and extend the scope of stratigraphic information in NSB. Three secondary goals are to 3) provide website access to the NSB database, 4) develop a stratigraphic analysis tool based on an older program (Age Depth Plot: ADP), and 5) identify entry errors and primary outliers (i.e., discrepant data present in the original sources) in the NSB occurrence data. Goals one and three have been achieved, substantial preparatory work has been done for goal two, and some progress has been made on goal five. Details are as follows:
The older NSB database taxonomic name lists have been updated to the new TNL content. This required modifying the core linking key-field structure of NSB to make it compatible with the IODP key-field standard, and was thus a major technical upgrade as well. Initial tests using NSB analysis algorithms identified ca. 65 (out of several thousand) TNL name records with incomplete link data; these have been flagged and will be corrected in the coming months. An initial test of the new system suggests that using the new taxonomy alone already increases recovery from existing occurrence data records by ca. 25%, though this is only a first provisional estimate.
Updating the stratigraphic information in NSB is a much more complex task and will take most of the remaining project time. A substantial first step has been to survey the highly scattered primary research literature for meta-information on IODP stratigraphic data, which is needed to evaluate data quality and set priorities for entry work. Forty-three of the more recent, high-quality legs have been formally evaluated, and for high- or medium-priority legs the data for individual sites have been surveyed and the metadata compiled. Occurrence data for 85 sites were individually examined. Scripts were created that searched for and downloaded information on over 600 relevant datasets from the PANGAEA database. All of this information is held in a newly created project management database (‘metanuts’). Workflows have been defined for converting the individual data sources to a common format and uploading them to NSB. We will soon advertise for a student helper who will carry out the bulk of the reformatting work. 178 range charts were identified as high priority, from Legs 154, 171B, 177, 198, 199, 207, 208, 303, 306, 320 and 321. Age models for these legs will be entered as well. We will extend NSB by adding the tables necessary to hold stratigraphic event information (bioevents, pmag or isotopic events). We will then upload the data produced by Earthtime as well as the events from the priority legs listed above. The website will also be modified by adding a second search page dedicated specifically to retrieving age models and stratigraphic events.
The website is now online at http://nsb-mfn-berlin.de. It has been programmed in Django (a web framework written in Python; we use Python 2.7). The interface so far allows the user to search occurrence data in NSB, with several options such as having the taxonomy resolved using the new TNL content. Users can access their downloaded datasets for 30 days. For security reasons, users at the moment need to obtain an account by email. Group roles have been created in the database cluster to facilitate the mass addition of new users in the future, and access for non-registered (anonymous) users is planned.
As a first step in data quality control in NSB we have examined duplicate entries. Originally ca. 7000 entries were identified as possible duplicates (same sample name and fossil group). After verification against the original publications, however, most of them turned out to be legitimate duplicates, i.e. originating from the investigations of two different authors or, more surprisingly, sometimes from duplicated analyses of the same sample in the original publication. The rest were simply mislabeled and have been corrected. Investigation will continue in order to identify and flag samples of suspected bad quality.
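The duplicate screening described above (same sample name and fossil group) can be illustrated with a minimal Python sketch. This is not the code used in the project: the record layout, the field names (sample, fossil_group, source) and the example values are all hypothetical, and the final decision on each flagged group was made by checking the original publications, which no script can replace.

```python
from collections import defaultdict


def find_possible_duplicates(records):
    """Group records by (sample name, fossil group) and return only
    the groups that contain more than one record."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["sample"], rec["fossil_group"])
        groups[key].append(rec)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


def triage(recs):
    """Rough first sorting of a flagged group (an assumption made for
    illustration): entries from different sources are likely legitimate,
    entries from the same source need a manual check."""
    sources = {rec["source"] for rec in recs}
    if len(sources) > 1:
        return "likely legitimate (different sources)"
    return "check manually (same source)"


# Hypothetical example records, not taken from NSB.
records = [
    {"sample": "sample-A", "fossil_group": "radiolarians", "source": "author-1"},
    {"sample": "sample-A", "fossil_group": "radiolarians", "source": "author-2"},
    {"sample": "sample-B", "fossil_group": "diatoms", "source": "author-1"},
    {"sample": "sample-B", "fossil_group": "diatoms", "source": "author-1"},
    {"sample": "sample-C", "fossil_group": "diatoms", "source": "author-3"},
]

flagged = find_possible_duplicates(records)
for key in sorted(flagged):
    print(key, "->", triage(flagged[key]))
```

In this sketch the triage step only sorts flagged groups into those that are probably legitimate and those that need inspection; mislabeled entries would still have to be found by hand.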
The above work is within the planned project schedule, and we currently anticipate completing the project as planned and on time, although complications may yet arise with aspects of data entry, hiring of the helper, etc.
Interim Update Report, January 2015: