[F2020:1-2] Enhance the ability of authors to cite data by 1) improving instructions to authors on webpages, 2) promotion through IRIS newsletter, 3) informing journals of the citation services provided, 4) investigating the use of tags on data distributed to users. Develop a DRAFT Data Licensing policy in coordination with major funding agencies and UNAVCO; consider legal advice.
Status: (March 2021) Instructions to authors on IRIS and FDSN web pages have been updated and an article on citation was included in the winter newsletter. Items 3 and 4 have not yet been addressed. The Joint Data Services committee of IRIS and UNAVCO met in January 2021 to discuss data licensing. It was recommended that no restrictions be imposed on the organization regarding the data that might be accepted. Carter has reached out to FDSN to discuss the issue with the executive committee.
(October 2021) The UNAVCO DS governance committee is recommending that a workshop proposal be prepared to address these issues and is inviting IRIS to be involved in this workshop. This item is expected to be linked to the citation workshop. In addition, the FDSN has been approached about preparing a policy statement on licensing, with a recommendation that all metadata be in the public domain and all data be either in the public domain or minimally licensed to require attribution.
(March 2022) The UNAVCO DS governance committee chairpersons (Julie and Suzan) have started to organize a workshop on Data Citation/Licensing.
(October 2022) The UNAVCO DS governance committee chairpersons (Julie and Suzan) organized a workshop on Data Citation/Licensing on 17, 19, and 21 October.
[F2020:3] Develop a DAS data directive that provides a consistent approach to requests for storing DAS data in the data repositories.
Status: (March 2021) IRIS has submitted an MSRI design proposal to address DAS data storage. (October 2021) The MSRI proposal was not asked to continue to the second round. More community involvement is needed and this will be sought through a community workshop. (March 2022) A draft proposal for a community workshop has been introduced to the DAS RCN working group on Data Management. Members of the organizing committee are being sought. It should be noted that IRIS and UNAVCO are continuing to work on better metadata/format that would be appropriate for accepting DAS data. This does not, however, solve the data volume issues. (October 2022) No progress.
[S2021:1] Investigate a way of finding data that “looks different”. What are the most common reasons for which data are tossed? Link to new “funny squiggles” paper (Ringler et al. 2021 preprint).
Status: (October 2021) The QAAC is considering this at their next meeting (winter 2021). (March 2022) No progress. The QAAC is still planning to discuss.
(October 2022) No progress.
[S2022:1] Wordsmith a list of principles for the data services governance committee and share it with Julie E., David M., Jerry.
Responsible: Jonathan A.
Status: (October 2022) A draft version was sent to Suzan for consideration.
[S2022:2] Begin to coordinate dates/times for a regular joint governance meeting beginning in May. Responsible: Carter
Status: (October 2022) As the merger will happen in two months’ time and the DS governance committees are in the process of selecting members for the new committee, this action item is recommended for removal.
Virtual Meetings on 12 and 14 October 2022
Present: DSSC members: Ebru Bozdag, Marine Denolle, Heather Ford, Jonathan Ajo-Franklin, Suzan van der Lee
Staff: Jerry Carter, Rob Casey, Chad Trabant, Gillian Sharer.
Reporters/Observers: Bruce Beaudoin, Eric Sandvol, Rebecca Rodd, Julie Elliott, David Mencin, Dan Auerbach, Rob Mellors, Adam Ringler
Approval of Spring 2022 Minutes:
The minutes from the Spring 2022 DSSC meeting were approved.
Requests to Store New Data – (Carter)
There were three requests to store new datasets:
• DAS from global earthquake experiment in Feb 2023 (Andreas Wuesterfeld)
Data from a community experiment. IRIS cannot accommodate DAS data in SEED or PH5. As the data are being stored in pubDAS for now, this is an opportunity for IRIS to experiment with ingesting DAS data because the volume is not huge. Jerry proposes to store the DAS data at IRIS as a format-agnostic “assembled data set”, which cannot take advantage of advanced search and discovery. However, getting the data in a chunk (assembled) is better than not having it. DSSC discussion was supportive of IRIS storing DAS data from this event.
• SEGY data from oil exploration (Joseph Dellinger)
Exploration data with a CC-BY-NC-SA license, which is not typical for IRIS data. There are 2 datasets: one of airgun data and one that captured an earthquake from ocean-bottom seismometers. The DSSC does not want to lose these data sets and wants to make them more findable. For now they can be accepted as assembled data sets; perhaps in the future some indexing and findability might be added.
• Antipodal earthquake database (Rhett Butler)
Derivative data product: catalog of antipodal earthquakes – there are too many stations to systematically find antipodal earthquakes. At the moment, all the tools exist at the IRIS DMC and they need to be put together to accomplish building the database. The script/code can live on github and be indexed in SeisCode on the IRIS web site. Action: Jerry will contact Rhett and recommend the development of a script which can be shared on github and indexed in SeisCode.
Director’s Report – (Carter)
• Jerry returned to full-time work and effectiveness after successful leave. He is working remotely, which works well.
• 100% uptime for the last 6 months, except for 1 month that was slightly lower. As is to be expected for any big data center, there are a few glitches, which are being worked on.
• One Data Services all-hands meeting: Staff from IRIS and UNAVCO attended to prepare for working as a single unit, following the example of working together on the Common Cloud Platform. The DS managers are setting the standard and serving as a role model for the overall IRIS-UNAVCO merger into EarthScope.
• CCP funding from NSF continues, but is not sufficient. Carry-over funds have supplemented this amount.
• Three new hires: Thaddeus Megow (cloud infrastructure); Bill Fassbinder, replacing Rick Benson (leads infrastructure section); Emily Maher, replacing Forrest Thompson (data engineer, focused on ingestion).
• DS are well prepared for merger. EarthScope DS Org chart will be presented on October 24 to DS staff. Org chart might evolve over time, but goal is to keep changes minimal.
• Deadlines: NSF solicitation for next geophysical facility - proposal due in first Q of 2024. Need to know what EarthScope DS facility will look like by middle of 2023.
• Worldwide Data Centers are watching EarthScope DS developments closely, including TileDB. Bring computing to the data (in the cloud). This has implications for other data centers, as data exchange will be happening less and back-end (data/computation server-side) technology might need to become more standardized.
• New data collections: specifically DAS. CCP should contribute to data formats and data handling efforts.
• SZ4D submitted SZNet proposal (UCSC). IRIS is a subawardee on this proposal for 1) making legacy volcano monitoring data (esp. from Chile) available, 2) building an umbrella web site with links to data and sample repositories of volcano monitoring data.
• Jerry sincerely thanks staff and deputy directors; they make the IRIS DMC great. He has much appreciated working with Chuck Meertens and David Mencin.
Section Status Reports: Operations – (Sharer/Trabant)
• Hardware transitions are going well. SeismiQuery had critical vulnerabilities and was disabled. It had an effect on the community as it has some functions that are not easily replaced by other tools. Shipments package improved and provides usage stats. A DASK engine was added.
• Emily Maher joined Gillian’s team as data engineer. MT (previously EM) data were re-archived to allow attribution. Smart nodal data ingestion was worked on. New data from the QZ and AB networks were archived (part of SNECCA). NV network data were converted to OW.
Section Status Reports: Quality Assurance – (Sharer)
• Nominal Response Library, version 2 released. Major improvement with postgres db. Maintainable + extendable. New web service for delivery of responses in RESP, StationXML, or “sub-XML” format. NRL also available in zip archive. A PDCC replacement will implement version 2. The STS6 response has not yet been added (Streckeisen will only distribute custom responses by serial number, not a nominal response. We will contact them again about this.)
Section Status Reports: QAAC report – (Sandvol /Sharer)
• Should QAAC exist or has it served its time? Eric S. queried previous chairs. A: QAAC has been important for providing community perspective and advice on MUSTANG and PIQQA.
• QAAC accomplishments:
o Creation of the MUSTANG quality assurance tools. MUSTANG is probably the most significant system undertaken by IRIS to assure quality. And it has been very successful. This accomplishment by itself is a strong argument to continue the QAAC.
o QAAC served a critically important role that helps to find balance between what users of the facility need and what maintainers of the facility think is important.
o Support and advice in development of PIQQA, a tool that can be used to help PASSCAL PI’s summarize the quality of the data acquired by their deployments.
o QAAC provides an important mechanism that balances the needs of different constituents that use the Data Management System.
• The committee discussed various options for the role, makeup, and functionality of an EarthScope QAAC that might be presented to the board. It was agreed that the QAAC is valuable and should continue in some form after the merger. An EarthScope QAAC would cut across data, instrumentation, and engagement.
Section Status Reports: Cyberinfrastructure – (Casey)
• Cyberinfrastructure upgraded our services to Tomcat 9 and openJDK 11. Upgrade was completed in August 2022.
• New usage stats pipeline. Web services log to this new system: accepts user ID code, return code, data content.
• LUCID: enhanced identity management, with UNAVCO. Single Sign On, Auth0 portal with CILogon. Users can log in with their own university’s credentials. Documentation is being developed. Can manage restricted datasets, and PIs can control data set access. People are signing up and using Auth0 to log in; it is a start, and more participation from community members is encouraged.
• Web services for PH5 data. PH5 is preferred format for active-source data. SEGY output format issues are fixed.
• Cybersecurity: logging tickets, collaboration with LLNL security team (offered guidance and performed penetration testing). Rob participates in a six-month cybersecurity workshop by TrustedCI (NSF). Most exploits logged at IRIS were related to legacy code and quickly resolved.
• An open source beta version of the Yasmine StationXML editor (GUI tool and a command-line tool) is on github. Resif and ISTI contributed to the tools. An NRL web service will be developed.
• Some discussion ensued about access to embargoed data by manuscript reviewers, who are asked by the journals to verify that shared data are indeed available, but the journals want them to do so in an overall anonymous way.
Section Status Reports: DMC Architecture and Products – (Trabant)
• MiniSEED 3 is in review (next generation data format). SeedLink (v4) is in review. SeedLink 4 can do identity management.
• Derivative products: Revision of Source Time Function product and Event Plots. Manoch is working on this. Products are being ported from MATLAB to Python 3. Software will be published on github and will be ready for the cloud.
• EMC-tools can handle projected coordinate systems. Bugs have been fixed in EMC web service, which is not yet released. EMC model explorer is a Jupyter Notebook that will be shared with community at Fall AGU Meeting. More notebooks will be developed in the future. The EMC data product remains popular.
• The EARS repository is not maintainable; the source code is lost and there is little value in continued operation as most stations have been saturated. The preservation plan is to Dockerize the old system and share on dockerhub.
• Download stats are now mapped instead of tabled.
• Lots of Jupyter notebooks will be published. The DSSC recommends that the engagement arm of EarthScope be informed as soon as they are ready, and maybe pulled into testing them. E.g. ROSES leaders would love to know about them and potentially use them.
DCC Reports: ASL DCC – (Ringler)
• ASL is on Isleta Pueblo, the quietest site in the US at 1 Hz. 40 staff: 15 USGS, 25 KBR. ASL runs 2/3 of the GSN; 1/3 is run by IDA. ASL also runs the US backbone, New England, intermountain west, and N4 networks. All data are contributed to IRIS. STS1 upgraded to STS6, 260 s low corner. T360 is the Nanometrics competitor to STS6. There are not many other BB instruments.
• Many stations upgraded, now it’s the turn of harder stations: Africa, Norway, etc. GSN station quality is good, also below 0.001 Hz.
• New station in Antarctica (QSPA upgrade?): 2500 m depth near IceCube neutrino detector.
• Articulated value of GSN to NSF: reviews of Geophysics paper. Lots of science results published by ASL/IDA authors: example: Hunga Tonga-Hunga Ha’apai eruption.
• Internship focused on select filmstrip records of interest for scanning.
• SRL special issue on Global Seismology upcoming.
DCC Reports: IDA DCC – (Mellors)
• IDA sends data to DMC via AWS cloud. All stations are up and sending data. Some issues in Kazakhstan, Pakistan, and Easter Island, but overall recovery from COVID-related issues. Data availability is good in general. Water in vault at Diego Garcia. 6 months of data has bad timing. No nearby stations. GSN committee weighed in about adding timing that could be a few seconds off.
• New metadata db (StationXML) – also using VPN for data transmission in the cloud. Looking for single internet provider for all stations (e.g. Starlink).
• New station in Uzbekistan. Rotational seismometer (blueseis) deployed at PFO.
• Cybersecurity requires increased effort: vulnerabilities can be unexpected.
• SRL special issue. Two sessions at SSA.
DCC Reports: PASSCAL Instrument Center Report – (Beaudoin)
• 2022 experiments: floodgate: 72 new ones in 2022, greatest since 2010. Many BB and Nodes, also in polar. Also MT experiments.
• New sensors and datalogger purchased, incl. 1000 Nodes. 85% of inventory is reported to NSF (for availability assessment).
• 1 PB NAS storage server purchased for Node data.
• PQL replaced with SQLX, a commercial version by Richard Boaz.
• MT capability is now fully functional. Receivers and coils are there. MT short course at SAGE/GAGE workshop. Additional equipment is acquired. October 2022 MT workshop next week at New Mexico Tech.
• PASSCAL contributed to ObsPy for interfacing with NRL. Nexus also utilizes NRL. Are there mutual benefits to leveraging PASSCAL and DMC software at DMC and PASSCAL, respectively?
Project Reports: CCP Project update (Trabant)
• CCP is prioritized; various teams have various tasks, and coordination is planned with PASSCAL on TileDB. GeoCrate is a modern data container for the cloud, and the CCP team would like to work with researchers, especially those using HPC. Phased set of milestones. MiniSEED will remain for a while. TileDB will likely be added.
• Old metadata will be retained to meet FAIR standards.
• Focus on Development teams, building in AWS.
• Exploring user direct access to data.
• Exploring process for transferring data to AWS.
Project Reports: Citation and Data Licensing (Elliott)
• Next week: workshop. Invited: publishers. Data policies, ethics policies, challenges, licensing, citation and attribution will be discussed.
Board Discussion Items (Van der Lee)
• Review progress against the five-year SAGE-II plan
- Core activities: data ingestion, curation, and distribution: 328 TiB of data in archive, 4PiB sent to users.
- Maintain Quality Assurance: MUSTANG was expanded and turned into a web service.
- Support scientific software: SAC (community code), SeisCode (repository), DMC programs and libraries are on Github. SAC had a license, but is now in public domain.
- Host multidomain data sets: Not all data (e.g. MT) can be squeezed into existing formats, which work for one data type but not for another; looking for more flexible formats.
- Develop shared data center in the cloud, provide seamless access to seismic and geodetic data: CCP was designed and is being built and utilizes AWS.
- Seamlessly integrate seismic and geodetic data to help the community generate integrated data products: Infrastructure exists. One repository for data from both data centers. This will be so when CCP is done, early 2024. Physical archive in Seattle might be decommissioned in March 2024. It is already at the end of its life, and on extended support.
- Expand Seismic Data Center Federation across world, incl. Africa & Asia: 24 federated data centers have registered with FDSN; they can be looked up on the FDSN web site (https://www.fdsn.org/datacenters/). None in Africa have registered; there is one each in Korea, Japan, etc. Some data sit in more than one data center, but only one is the primary source.
- Improve SEED and expand utility of SEED for other types of time series: SEED improvements were made (they overcome limitations, e.g. lengths of strings), but SEED was found to be incompatible with many data types. Looking for other “universal” formats. MiniSEED can still be an export format?
- Seamless access to high-frequency active and passive source seismic data: PH5 was developed (is HDF5, metadata are included in it rather than separated out as for TileDB) specifically for passive and active data from Nodes. HDF5 is not cloud friendly, currently looking for better formats.
- Support data formats that are useful in HPC environments: Looking at TileDB. Client-side ROVER can help with downloading very large data sets, even with intermittent connectivity (https://iris-edu.github.io/rover).
- Support domain-agnostic formats like GeoCSV: GeoCSV is supported as export format.
- Improve availability of on-line tutorials: Little progress because of major shift in data archiving practices, but working on Jupyter notebooks.
- Establish capability to support workflows in the cloud where the data reside: CCP provides capability.
- Generate higher-level data products: Many products developed.
• Develop any specific recommendations for SAGE-II work-plans and budgets for award years 6 and 7
- Complete CCP: Provides flexibility and scalability for multiple data/metadata formats from seismology, geodesy, and new types of data such as DAS as well as proximity to cloud compute resources
- Operate CCP: Working with international data centers on data services standards (e.g. web services) and on common systems and standardized ways of providing direct access.
- Training the user base: Find ways to provide a bridge/support/training between the massive data archive and the researchers/other users/computation. Education & training of researchers will be key:
▪ Collaborate with ROSES and other EarthScope Engagement programs; this can also be a testing ground for cloud-based access tools.
▪ Involve graduate students in providing the education & training, develop documentation. Employ graduate student interns at DMC to build valuable skills that are broader than just geophysical research and to build affinity with and understanding of DS.
- Large N ingestion & preparing for this (streaming, DAS, ubiquitous sensors like MyShake, hi rate GNSS): Storage and distribution of legacy data (historical, analog seismic data from microfilms and microfiches). People like Miaki Ishii and Tim Ahern are working on metadata and data formats. NSF will decide what to fund. Community needs to come up with justification for what data to scan and digitize.
• Document key directions and priorities (strategic plan) for your program (include non-SAGE activities as relevant),
- Transition of staff to EarthScope DS
- Complete full transition to the cloud-based system, including integrated data, data products, and user training.
- Respond to Facility Solicitation (NSF)
- Encourage, guide, and optimize user transition to cloud computing near to data via ROSES/SCOPED educational efforts & engagement of graduate students.
- Some discussion ensued about financial support for new “bring workflow to cloud data” users. It is about trying to create a small "energy barrier" to fully open use.
- Prepare for new data types (e.g., DAS and hi-rate GNSS)
- Develop new policies for data acceptance
• Enumerate key science accomplishments, justifications, objectives, and concerns that the program has/will facilitate (please note how these objectives tie to spending priorities for the program)
- Near 100% uptime
- Virtually all data-driven research in seismology and geodesy uses data findable via IRIS.
- Continual modernization of data formats, delivery methods, and other data and metadata access methods.
- Facilitated and supported the growing research fields of environmental seismology and geodesy
- Contributed to modern workforce development
- Maintained high standards and provided metrics of data and metadata quality
• Identify concerns or issues that should be brought to the attention of the IRIS and EarthScope Boards
- Moving data to the cloud requires investment in user training, which is a big deal and a big opportunity for workforce development.
- Identity management might decrease data usage and create unintended conflicts or consequences, as well as cause cloud expenses. Identity management that involves granular usage metrics can slow down computations with data in the cloud and data delivery to users.
- Discussion ensued about the goals of NSF when asking for identity management and about the value of data products not being directly linked to the frequency and volume of data used, and the value of data products for US research even if non-US researchers also created value from the data. More dialogue with NSF could be useful.
- Financial operations related to data flow and identity management in and out of the CCP are a big unknown and can have critical impacts on operational budgets.
- New data types (e.g. DAS) have huge storage needs.
- Data sources are often not cited, or are cited incompletely or incorrectly, in professional and other publications that used data via IRIS or UNAVCO.
- An idea was proposed to appoint postdocs at EarthScope who can do research with the EarthScope data. Pros and cons were discussed.
• Include a summary of, and guidance for, incorporating input from subcommittees / advisory committees that report to SCs.
- DSSC recommendation for QAAC: DSSC agrees that there is important work to be done in QA, which requires community input. DSSC also understands that having too many committees does not always promote good governance. Hence we recommend that the type of work currently done by QAAC be managed by an ad-hoc (“ephemeral”) committee or working group, which can be called into existence for the duration of the task at hand. Membership can be composed of a small number of members from various standing advisory committees as well as experts from the community, who bring the necessary expertise for the task at hand.
Briefings to DSSC (Carter)
• The Director provided briefings to the DSSC on funding plans/priorities from the recent NSF end-of-year supplementary funding request as well as on the EarthScope Consortium governance structure and member nomination process. The DSSC made recommendations for future committee memberships.