About the data population
Types of content
There are five main content types on the SSH Open Marketplace. Together they are considered representative of the large array of digital resources that can be found on this discovery platform.
Tools & services
This type of content covers digital services and products, such as software, applications, programs, websites, programming libraries and APIs, that make tasks easier to execute. Some can be accessed directly in a web browser, whereas others have to be installed locally.
Examples: (1) Gephi is visualisation and exploration software for all kinds of graphs and networks. (2) The DARIAH-DE Helpdesk (CLARIAH-DE, Text+) is a service whose team is familiar with research processes in the humanities and cultural sciences and can often help with questions that do not fall directly within the services or resources of these initiatives.
Training materials
Tutorials, lessons or didactic resources explaining how to perform an action or highlighting the potential learning outcomes gained from using that material.
Examples: (1) The lesson Cleaning Data with OpenRefine teaches the principles and practice of data cleaning and how to use the OpenRefine tool. (2) The SSHOC Webinar: CLARIN Hands-on Tutorial on Transcribing Interview Data focuses on the role of automatic speech recognition: what the opportunities are, what the pitfalls are, and where it can be applied successfully.
Workflows
Sequences of steps that can be performed on research data during their lifecycle. A workflow can draw on diverse tools, resources and methods, and the relevant resources are linked to each step.
Examples: (1) Extract textual content from images is a workflow of 13 steps, taken from the Standardization Survival Kit, for extracting textual content from images. (2) Intertextuality phenomena in European drama history is a workflow of 4 steps for analysing the relationships between the characters in a drama based on monologue/dialogue.
Datasets
A dataset is an organised collection of data. It is generally associated with a unique body of work, typically covers one topic at a time and is treated as a single unit by a computer.
Examples: (1) ParlaMint contains a collection of parliamentary datasets (corpora) in a number of languages and in a harmonised format, linguistically processed and indexed with popular concordancers. (2) CoFiF-Corpus for Finance is the first corpus comprising company reports in French; the documents are collected from the 60 largest French companies listed in France’s main stock indices, CAC40 and CAC Next 20.
Publications
Research results published in academic journals or non-peer-reviewed publication repositories such as Zenodo (our logic being that not all valuable academic production is necessarily in a traditional, peer-reviewed academic journal). The SSH Open Marketplace references only publications that can be connected to other resources (i.e. tools and services, training materials, workflows or datasets). As such, the SSH Open Marketplace is not an exhaustive collection of papers.
Examples: (1) PoetryLab. An Open Source Toolkit for the Analysis of Spanish Poetry Corpora is a conference paper presented at the DH2020 Conference; it includes an example of the use of the spaCy library (referenced as a tool in the SSH Open Marketplace). (2) The Music Encoding Initiative Guidelines are intended to serve as a reference tool for music encoders, and are linked to several workflows and related tools in the Marketplace.
Inclusion criteria for adding resources: 4 questions to ask
To keep the content of the SSH Open Marketplace high in quality and relevance, we have established inclusion criteria for entry into the discovery platform. These apply both to individual items added manually by users and to mass ingestions of entire sources (more on that below).
Here are four questions to ask before adding a resource to the SSH Open Marketplace, together with the inclusion criteria that follow from them:
- The relevance of the resource. The question to ask is: will this resource be relevant to the SSH scientific community? To be selected, any resource must fulfil two criteria: (1) scientific relevance and usefulness for SSH research and researchers and (2) pertinence to the digital methodologies used within the SSH landscape.
- The technical status of the resource. The question to ask is: is the resource current, supported, and ideally open? The SSH Open Marketplace favours the uptake of Open Science workflows and open research practices. Software resources are preferably built upon open source solutions.
Note: Given that the SSH Open Marketplace seeks to mirror actual research practices, commercial or non-current resources are also referenced where these are relevant for the scientific community.
- The degree of compliance with Open Science requirements of the resource. The question to ask is: is the resource FAIR (Findable, Accessible, Interoperable and Re-usable) or contributing to the uptake of Open Science best practices? The SSH Open Marketplace maximises the findability and re-use of data, and guides users towards tools, services or training materials that can help them in their FAIRification of workflows.
- The uniqueness of the resource. The question to ask is: is the resource already in the Marketplace? If yes, there is no need to add it again, either as an individual item or with a source. Even though duplicate entries are dealt with within the Marketplace, ingesting identical (or worse, almost identical) entries multiple times uses up valuable human resources that would be better put to use elsewhere, considering the limited timeframe and personnel of the project.
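The uniqueness question can be partly checked before an item is submitted. The following is only a minimal sketch, not the Marketplace's own duplicate handling: it assumes items are plain dictionaries with the hypothetical keys `label` and `url`, and it uses Python's standard `difflib` module with an illustrative similarity threshold to flag identical or almost identical entries.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def find_possible_duplicates(candidate, existing_items, threshold=0.9):
    """Return existing items that look like duplicates of `candidate`.

    Both `candidate` and the entries of `existing_items` are dicts with the
    hypothetical keys "label" and "url" (an assumption for this sketch).
    An item is flagged if its URL matches or its label is near-identical.
    """
    cand_label = normalise(candidate["label"])
    cand_url = candidate["url"].rstrip("/").lower()
    matches = []
    for item in existing_items:
        same_url = item["url"].rstrip("/").lower() == cand_url
        similarity = SequenceMatcher(
            None, cand_label, normalise(item["label"])
        ).ratio()
        if same_url or similarity >= threshold:
            matches.append(item)
    return matches


# Illustrative data only.
existing = [
    {"label": "Gephi", "url": "https://gephi.org/"},
    {"label": "OpenRefine", "url": "https://openrefine.org"},
]
print(find_possible_duplicates({"label": "gephi ", "url": "https://gephi.org"}, existing))
```

Any hit would still need a human decision, since two different resources can share a similar name.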
SSH Open Marketplace Core Principles
- Contextualization is one of the key pillars of the SSH Open Marketplace. The Marketplace is meant to provide a discovery portal for tools and services, while placing these tools and services in context via publications, training materials, datasets, and workflows. As such, these last four categories are indexed in the SSH Open Marketplace only insofar as they can be related to tools and services; items with no relation to a tool or service should not be accepted.
- Curation is, alongside contextualization, a key pillar of the SSH Open Marketplace. We believe that accurate and up-to-date content is key to making the SSH Open Marketplace a rich and useful discovery portal. As such, data population and curation are at the heart of our work. Once (meta)data have been ingested in the Marketplace, we curate and enrich them using a hybrid approach: automated checks are run on the ingested data, followed by manual review of the identified problems as well as of aspects that cannot be checked automatically. In this process, contributors and moderators play an essential role (see “How to contribute”).
- Quality of the (meta)data: higher-quality data increases the quality of the Marketplace. The more fields of the Marketplace data model an item covers, the better: an item with just a title and a link is inferior to one with a long description and perhaps a list of contributors. Likewise, the more good-quality metadata an item or source provides, the higher its overall quality. Note: currently, most publications lack metadata. This is a temporary situation: they were ingested automatically in order to identify relations with the other types of resources included in the SSH Open Marketplace.
- Focus on primary sources: in the Marketplace, we give priority to primary sources and exclude sources that only aggregate information that can be found elsewhere. Such “second-hand metadata” introduces additional points of failure, and using the primary source instead will likely result in higher-quality metadata.
- The technical interface: it is easier to import a source into the Marketplace if it offers a well-documented API. Considering the limited human resources of the team, sources that need a lot of additional work before they can be fed into the ingestion pipeline are placed very low on the priority list.
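The metadata-quality principle above (the more fields an item covers, the better) can be expressed as a simple coverage ratio. This is a rough sketch under stated assumptions: the field names in `EXPECTED_FIELDS` are hypothetical placeholders, not the actual Marketplace data model, and a field counts as covered whenever it holds a non-empty value.

```python
# Hypothetical field names for illustration; not the Marketplace data model.
EXPECTED_FIELDS = ("label", "description", "accessible_at", "contributors", "licenses")


def metadata_coverage(item, expected_fields=EXPECTED_FIELDS):
    """Return the share of expected fields that hold a non-empty value.

    `item` is a plain dict; missing keys, empty strings and empty lists
    all count as "not covered".
    """
    filled = [field for field in expected_fields if item.get(field)]
    return len(filled) / len(expected_fields)


# Illustrative items only.
sparse_item = {"label": "Some tool", "accessible_at": "https://example.org"}
rich_item = {
    "label": "Some tool",
    "description": "A longer description of what the tool does.",
    "accessible_at": "https://example.org",
    "contributors": ["A. Researcher"],
    "licenses": ["MIT"],
}

print(metadata_coverage(sparse_item))
print(metadata_coverage(rich_item))
```

With these sample items, the one with only a title and a link covers two of the five assumed fields, while the richer one covers all five. A ratio like this measures only how many fields are present, not whether their content is accurate, which is why manual curation remains necessary.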
For more questions, please [contact us](/contact).
Sources
During the development phase of the SSH Open Marketplace, we identified and prioritised trusted sources (i.e. sources compliant with the Core Principles defined above) from which to gather information to populate the SSH Open Marketplace. Over the three years of the SSHOC project, 15 sources have been ingested for a total of over 5,000 individual items. The sources are listed below.
TAPoR is a long-standing gateway to tools used for text analysis and retrieval.
The Programming Historian publishes novice-friendly, peer-reviewed tutorials that help humanists learn a wide range of digital tools, techniques, and workflows to facilitate research and teaching.
The Standardization Survival Kit (SSK) presents a collection of research use case scenarios illustrating best practices in Digital Humanities and Heritage research. The SSK website is now archived and all scenarios are directly hosted and visible from the SSH Open Marketplace.
The Language Resource Switchboard helps you find tools that can process your data.
The dblp computer science bibliography provides open bibliographic information on major computer science journals and proceedings. Only a subset of publications related to digital humanities is being ingested in the SSH Open Marketplace.
The EOSC Portal catalogue & marketplace is an integrated platform that allows easy access to plenty of services and resources for various research domains along with integrated data analytics tools. Only a subset of resources, relevant for Social Sciences and Humanities, is being ingested in the SSH Open Marketplace.
The Humanities Data website collects and presents datasets and recipes stemming from Digital Humanities projects.
The CESSDA Training Working Group, implementing one of the four strategic pillars of CESSDA, offers a wide variety of training in Research Data Management and data archiving to both researchers and data curators.
The CLARIN Resource Families are a number of curated collections of corpora and tools. They are manually put together by CLARIN with the aim of providing a user-friendly overview of existing resources within and outside the CLARIN infrastructure.
DARIAH-Campus is both a discovery framework and a hosting platform for DARIAH and DARIAH-affiliated training and education materials.
DARIAH member states contribute to the DARIAH distributed infrastructure with a diverse range of resources and services, and declare these in-kind contributions via the DARIAH contribution tool. A selected set of these contributions has been added to the SSH Open Marketplace.
The SSHOC service catalogue is the result of an effort by all SSHOC Work Packages to collect and consolidate the SSHOC services (or resources) on offer. As the implementation of the project progresses, the resources referenced in the catalogue are the most visible outputs of the SSHOC project.
The SSH Conversion Hub is an outcome of SSHOC WP3. It allows users to search for tools that convert from one (meta-)data/file format to another one, e.g. from CSV (comma-separated values) to TEI (Text Encoding Initiative).
The SSH Training Discovery Toolkit acts as an overview of relevant sources that hold (digital) material for trainers. For such sources, selected items that are useful for the SSH community are described in more detail and give a hint about what training material to expect from the source.
If you don’t see some of these sources in the Marketplace, it is because they are still in the process of being ingested. New sources are also added regularly (one to two sources per year). As there are some criteria to comply with, please check the Contribute section if you would like to suggest new sources.
The SSH Open Marketplace is maintained and will be further developed by three European Research Infrastructures - DARIAH, CLARIN and CESSDA - and their national partners. It was developed as part of the "Social Sciences and Humanities Open Cloud" SSHOC project, European Union's Horizon 2020 project call H2020-INFRAEOSC-04-2018, grant agreement #823782.