Observational/Hydrographic data of the South Atlantic Ocean published as LOD

This article describes the publication, as Linked Open Data, of occurrences of Southern Elephant Seals Mirounga leonina (Linnaeus, 1758) in two environments (marine and coastal). The data comprise hydrographic measurements from instrumented animals and observation data collected during censuses between 1990 and 2017. The data schema is based on the previously developed ontology BiGe-Onto and the new version of the Semantic Sensor Network ontology (SSN). We introduce the network of ontologies used to organize the data and the transformation process used to publish the dataset. In the use case, we develop an application to access and analyze the dataset. The linked open dataset and the related visualization tool turn the data into a resource that can be located by the international community, which increases the commitment to its sustainability. The data, coming from Península Valdés (UNESCO World Heritage), are available for interdisciplinary studies of management and conservation of marine and coastal protected areas, which demand reliable and updated data.


Introduction
In the ecology domain, research teams collect and store biological and environmental information over years or decades in database systems to answer their own queries. However, this information is isolated from other datasets with which it could interoperate and, in addition, is not ready to be accessed by machines. In marine science in particular, data collection is a process of cumulative logistic complexity, which makes it important to work on the curation and sustainability of the database in both the short and long term. It is of great benefit for scientific institutions to publish their datasets following the Linked Data principles [1], not only for interlinking and easy cross-referencing but also for other purposes that are not foreseen at the moment of publication. The state of the art in the last decade shows that, together with the technology to collect data, semantic interoperability has further grown in importance [2]. To meet Linked Data requirements, datasets must be described with rich metadata, such as controlled vocabularies, in a particular form (RDF) and published as a findable resource with a unique identifier.
* Corresponding author. E-mail: zarate@cenpat-conicet.gob.ar.
This paper integrates observational and hydrographic datasets based on the SOSA/SSN ontology [3] and the BiGe-Onto ontology 1 [4]. As far as we know, this work is the first to publish, as linked open data, occurrences of a species in two geographical environments (coastal and marine) collected over two decades. The data come from a research program focused on Southern Elephant Seals (SES) in Patagonia, Argentina: "Temporal and spatial distribution of the southern elephant seal colony in Península Valdés, Argentina" [5]. The program started in 1990 to study the ecology and life history strategies of SES, together with their foraging areas and dive behavior, and to contribute to understanding the effects on the species of changes in the ecosystem of the SW Atlantic Ocean. The research site is in Península Valdés (PV), which has been a UNESCO World Natural Heritage site since 1999 [6]. During the annual cycle, the SES come ashore to breed and molt. The rest of the year they are at sea, traveling long distances throughout their extensive migration (up to 8 months and a round trip of 12,000 km) and diving continuously to depths of 2000 meters or more. During their terrestrial phase they frequently revisit the sites of previous years [7,8]. Their behavior during the marine phase makes SES ideal carriers of devices that provide physical (i.e. hydrographic) profiles of the water column. To track the SES at sea, researchers use miniaturized animal-attached tags that relay data, a field known as biologging [9], which covers animal migration and oceanographic measurements [10]. The instruments deployed on the seals return, at a low cost, large volumes of hydrographic data from regions never studied directly by buoys or oceanographic vessels, and collect large amounts of information associated with the key habitats in the South Atlantic Ocean.
This paper is organized as follows: Section 2 describes the SES database. Section 3 briefly presents the network of ontologies. Section 4 shows examples of how the data are organized. Section 5 describes the populating processes and the links to other datasets. Section 6 shows the application developed to access and analyze the dataset. Finally, we conclude by presenting an analysis of our work and perspectives.

SES Database
Data are recorded from measurements of physical variables and locations obtained in two different stages. The first stage involves an annual census, which takes place during the breeding season of SES. The second stage starts at the end of the breeding season, when SES go back to sea to forage. Below we briefly describe how data are generated and recorded in each stage.
During the breeding season, the SES haul out on the beach to breed. The annual census, carried out on foot along the coast of the colony, is arduous but indispensable work for determining the distribution and trend of the population. The objective is to count each of the harems scattered on the beaches of PV to determine the number of offspring born in a season. Counts are carried out over 2-3 days at the peak of the breeding season (October 3-7), when most of the population is ashore. All the breeding groups are counted and located along 200 km of coastline, which is divided into sections, and each census taker is assigned a route. The census taker must count the animals and classify them by sex and age (males, females, and pups). Hereinafter, we will call the procedure of counting individuals in a certain place an Occurrence. Each occurrence is georeferenced (latitude and longitude), and the demographic data include date and time, group size, and the substrate where the SES are located. All information about these censuses is recorded in a field book and then uploaded into a MySQL database. Table 1 summarizes the most relevant fields for the conducted censuses.
At the end of the breeding season, SES go back to sea to forage. The trip is monitored by small computers designed and built by Wildlife Computers Inc. 2 with sensors that take measurements of the animal's location and immediate environment. The instrument is deployed while the seal is on land, before the migration into the sea begins. A Time Depth Recorder (TDR) records time, depth, and temperature every 30 minutes during the round foraging trip. The position is also registered when the seal ascends to the surface. Table 2 summarizes the fields that are most relevant in diving.
The census and the deployments of the instruments are carried out by the research team belonging to the Centre for the Study of Marine Systems, hosted in Puerto Madryn, Patagonia, Argentina (CESIMAR-CENPAT-CONICET) 3 . The institute is engaged in oceano-

Ontologies used to model observations and hydrographic profiles
The core of our network of ontologies is composed of SOSA/SSN [3] and BiGe-Onto [4], which can be used jointly for both hydrographic profiles and observational data. These ontologies are linked to others that describe different sub-domains, thus creating the network. The resulting network is mainly composed of the following:
- an ontology to describe the sensors used to measure hydrographic profiles
- an ontology to describe SES occurrences recorded during censuses
- an ontology to describe the associated measures
- an ontology to describe the locations and places of interest
- an ontology to describe the temporality of events
- an ontology to describe scientific publications
In this section, we briefly summarize the ontologies used for the publication of our dataset, indicating the concepts we reuse.
Semantic Sensor Network (SSN) Ontology: The Semantic Sensor Network (SSN) is a generic ontology for sensor observations. This ontology has been updated to become a W3C recommendation and now includes a lightweight core dedicated to sensor and actuator descriptions, called the Sensor, Observation, Sample, and Actuator (SOSA) pattern. The link between SSN and SOSA is described in [3]. The classes we have reused from the SOSA/SSN ontology are:
- sosa:Observation to describe the context of the measurements.
- sosa:FeatureOfInterest to specify the observed phenomenon. In our case, this is the sample of the water column registered by the SES during diving.
- sosa:ObservableProperty to specify the measured property of the observed phenomenon (temperature, depth and location).
- sosa:Platform to represent the platform hosting a sensor. In our case, the platform is always the SES.
- sosa:Sensor to describe the sensors hosted by the platform (e.g. TDR).
- sosa:Result to represent the measurement values from the sensors.
BiGe-Onto Ontology: BiGe-Onto is an ontology designed for modeling Biodiversity and Marine Biogeography data [4]. Its main concept is the occurrence. Given that the censuses record occurrences of SES at a specific time and place, we consider that BiGe-Onto fits the nature of our data. It reuses different vocabularies, most importantly Darwin Core (DwC) [11], which is the core one in BiGe-Onto. The main classes reused from it are dwc:Occurrence, dwc:Event, dwc:Taxon and dwc:Organism. Moreover, BiGe-Onto reuses foaf:Person, void:Dataset and dc:Location, among others.
Since BiGe-Onto mainly describes occurrences, which depend on other concepts to exist, we outline below some of the most important properties defined for relating such occurrences:
- bigeonto:associated: each occurrence is described based on the existence of an organism at a particular place and at a particular time. Organisms are related to a taxon by means of bigeonto:belongsTo.
- bigeonto:has_event: occurrences take place during a sampling event at a location given by bigeonto:has_location, which is in turn characterized (bigeonto:caracterizes) by a specific environment. The relations between bigeonto:Environment and EnvO classes are primarily controlled by the Relations Ontology (RO) 4 .
- dwciri:recordedBy: this property provides information about the people, groups, or organizations who have recorded the occurrence. It is reused from the DwC URI namespace and enables non-literal ranges, unlike its DwC analogue dwc:recordedBy.
- dwciri:inDataset: this object property links a dataset record to the dataset that contains it.
Quantity, Unit, Dimension and Type (QUDT): QUDT is a collection of OWL ontologies and vocabularies [12]. The QUDT schema defines the base classes, properties, and restrictions used for modeling physical quantities, units of measure, and their dimensions in various measurement systems. QUDT also contains a set of vocabularies that define units for different domains. We have reused the unit vocabulary, which categorizes units in different classes. This vocabulary also provides individuals that identify units, such as qudt:Meter or qudt:DegreeCelsius.
GeoSPARQL Ontology: GeoSPARQL [13] is an Open Geospatial Consortium (OGC) standard for supporting the representation and querying of geospatial data on the Semantic Web. As such, it is based on the OGC's Simple Features model, with some adaptations for RDF. GeoSPARQL designates a vocabulary for representing geospatial data in RDF, and it defines an extension to the SPARQL query language 5 for processing them, together with both a small ontology for representing features 6 and geometries 7 , and a number of SPARQL query predicates and functions. All these definitions are derived from other OGC standards so that they are well grounded and documented. Using the new standard should ensure two things: (1) if a data provider uses the spatial ontology in combination with an ontology of their domain, these data can be properly indexed and queried in spatial RDF stores; and (2) RDF-compliant triple stores should be able to properly process the majority of spatial RDF data. This ontology is used to describe the location of each occurrence and the beaches involved. We reuse the classes geo:Feature and geo:Geometry, and the associated properties geo:hasGeometry, geo:asWKT, etc.
5 https://www.w3.org/TR/sparql11-query/
6 A feature is simply any entity in the real world with some spatial location.
7 A geometry is any geometric shape, such as a point, polygon, or line, and is used as a representation of a feature's spatial location.
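To make the geo:asWKT representation concrete, the following minimal Python sketch formats positions as WKT literals. The function names and the sample coordinates are hypothetical and are not taken from the dataset; the use of MULTIPOINT for a track is likewise an assumption. The sketch relies on the GeoSPARQL default coordinate reference system (CRS84), in which longitude precedes latitude.

```python
from typing import Iterable, Tuple

# Datatype IRI defined by the GeoSPARQL standard for WKT literals.
WKT_DATATYPE = "http://www.opengis.net/ont/geosparql#wktLiteral"


def point_wkt(lon: float, lat: float) -> str:
    """Format one position as a WKT POINT (longitude first, then latitude)."""
    return f"POINT({lon:.4f} {lat:.4f})"


def track_wkt(positions: Iterable[Tuple[float, float]]) -> str:
    """Format a sequence of (lon, lat) positions as a WKT MULTIPOINT."""
    coords = ", ".join(f"({lon:.4f} {lat:.4f})" for lon, lat in positions)
    return f"MULTIPOINT({coords})"


def as_typed_literal(wkt: str) -> str:
    """Wrap a WKT string as an RDF typed literal (Turtle / N-Triples syntax)."""
    return f'"{wkt}"^^<{WKT_DATATYPE}>'


if __name__ == "__main__":
    # Hypothetical positions, for illustration only.
    print(as_typed_literal(point_wkt(-63.85, -42.5)))
    print(as_typed_literal(track_wkt([(-63.85, -42.5), (-60.1, -44.25)])))
```

The resulting literal is what a geo:Geometry instance would carry as the object of geo:asWKT.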
The W3C Time Ontology: The W3C Time ontology [14] enables the description of time instants and intervals. It is therefore useful for describing the timestamp, or the time associated with the observations of the SES made by the observers. We reuse the classes time:Interval and time:Instant, and the associated properties time:hasBeginning, time:hasEnd, time:inXSDDateTimeStamp, etc.

FRBR-aligned Bibliographic Ontology (FaBiO): FaBiO is an ontology [15] for recording and publishing on the Semantic Web descriptions of entities that are published or potentially publishable, and that contain or are referred to by bibliographic references, or entities used to define such bibliographic references. FaBiO classes are structured according to the FRBR schema of Works, Expressions, Manifestations and Items. Additional properties have been added to extend the FRBR data model by linking works and manifestations. We reuse the classes fabio:Expression, fabio:JournalArticle, fabio:Article, fabio:Presentation, fabio:Book, fabio:BookChapter, fabio:Notebook, and fabio:Dataset to identify the kind of document published. We also reuse associated properties such as prism:doi, prism:publicationDate, etc.
Additionally, we have used vocabularies of the oceanographic domain, such as the Natural Environment Research Council (NERC) Vocabulary Server [16], supported by the British Oceanographic Data Centre (BODC) 8 , which provides access to lists of standardized terms that cover a broad spectrum of disciplines of relevance to the oceanographic and wider community. The prefixes and their corresponding URIs are listed in Table 3.

Data model and URIs
Based on the network of ontologies described in the previous section, we are now able to create a dataset containing all the individuals describing hydrographic profiles and occurrences taken during the censuses. Now we explain the decisions taken to create resource URIs and we provide examples of resource descriptions.

Resource URIs for hydrographic profiles
This section presents the main URI design decisions and conventions used. Table 4 provides a summary of the main types of URIs that we generate. The first column presents the type of resource. The second column indicates the class that types the resources. The last column contains the name pattern used to generate the resource URIs. The base URI for our dataset is http://linkeddata.cenpat-conicet.gob.ar/resource/ and its prefix is base. Our generic name pattern to produce the URI for each object is Base URI +"/"+ nameOfClass +"/"+ objectIdentifier.
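The name pattern can be illustrated with a short Python sketch. The helper name resource_uri is hypothetical, and normalizing the slashes so that the trailing slash of the base URI is not doubled is an assumption about the intended result; the host name is the one used in the example links of this paper.

```python
from urllib.parse import quote

# Base URI of the dataset, written with the host name used in the example links.
BASE_URI = "http://linkeddata.cenpat-conicet.gob.ar/resource/"


def resource_uri(class_name: str, object_id: str, base: str = BASE_URI) -> str:
    """Build base + "/" + nameOfClass + "/" + objectIdentifier.

    Slashes are normalized so the base URI's trailing slash is not doubled,
    and both path parts are percent-encoded where necessary.
    """
    class_part = quote(class_name.strip("/"), safe="/")
    id_part = quote(str(object_id), safe="")
    return f"{base.rstrip('/')}/{class_part}/{id_part}"


if __name__ == "__main__":
    print(resource_uri("sensor/TDR", "id-19793"))
    print(resource_uri("person", "MZA"))
```

With these inputs the helper reproduces the sensor and person URIs quoted elsewhere in the text.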

Platform and Sensor
We consider each SES as an oceanographic sampling platform. The individual that represents the platform is an instance of the sosa:Platform class. Each sensor hosted by the SES is represented by an instance of the class sosa:Sensor. Figure 1 presents the description of the TDR sensor. The TDR is identified by a URI generated from the sensor type (e.g. TDR) plus the manufacturing number (19793). This URI is typed by the class sosa:Sensor. The sosa:hosts property links the sosa:Platform instance to the TDR URI. The sosa:observes property links the TDR URI to an instance of sosa:ObservableProperty, which in this case corresponds to depth. To explore the sensor URI visually, use the link http://linkeddata.cenpat-conicet.gob.ar/resource/sensor/TDR/id-19793
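The platform and sensor description can be sketched as a handful of RDF triples. The Python example below emits them as N-Triples using only the standard library. The platform and observable-property URIs (and the identifier SES-01) are hypothetical placeholders; only the sensor URI follows a link given in the text. In the W3C SOSA vocabulary the platform-to-sensor property is named sosa:hosts.

```python
from typing import List

SOSA = "http://www.w3.org/ns/sosa/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://linkeddata.cenpat-conicet.gob.ar/resource/"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialize one triple whose three terms are IRIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_sensor(platform_id: str, sensor_type: str,
                    serial: str, observed: str) -> List[str]:
    """Return the triples linking a platform, its sensor and the observed property."""
    platform = f"{BASE}platform/{platform_id}"       # hypothetical URI pattern
    sensor = f"{BASE}sensor/{sensor_type}/id-{serial}"
    prop = f"{BASE}observableProperty/{observed}"    # hypothetical URI pattern
    return [
        triple(platform, RDF_TYPE, SOSA + "Platform"),
        triple(sensor, RDF_TYPE, SOSA + "Sensor"),
        triple(prop, RDF_TYPE, SOSA + "ObservableProperty"),
        triple(platform, SOSA + "hosts", sensor),     # platform hosts the sensor
        triple(sensor, SOSA + "observes", prop),      # sensor observes the property
    ]


if __name__ == "__main__":
    for line in describe_sensor("SES-01", "TDR", "19793", "depth"):
        print(line)
```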

Observation
An observation describes the context of a measurement made by a sensor; in the case of TDRs, the measurements are location, time, depth and temperature. The properties sosa:hasFeatureOfInterest, sosa:hasResult, sosa:observedProperty and sosa:madeBySensor link each observation to its feature of interest, result (the measurement value or location), observed property, and sensor, respectively. We create an instance of the sosa:FeatureOfInterest class that represents the sample of the water column taken during a dive. GeoSPARQL is used to describe the precise location of the SES during the dive. As shown in Figure 2, the geometry of the trip made by the SES is a set of points expressed as a WKT string. This string is linked to a geo:Geometry instance by the geo:asWKT property. The geo:hasGeometry property links the sosa:Result instance to an instance of the geo:Geometry class. To explore the observation URI visually, use the link http://linkeddata.cenpat-conicet.gob.ar/page/observation/id-233/location
Figure 3 presents an observation produced by the TDR. Sometimes a measurement is related to a period of time. For example, the TDR measures the duration of a dive during an immersion. The property sosa:phenomenonTime links the sosa:Observation to an instance of the class time:Interval. The properties time:hasBeginning and time:hasEnd each point to an instance of the class time:Instant. To connect a time:Instant instance to an xsd:dateTime value, we use the time:inXSDDateTimeStamp property. The duration of the interval is described by an instance of the class time:Duration. To explore the phenomenon time URI visually, use the link http://linkeddata.cenpat-conicet.gob.ar/page/interval/sampleID-233
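As a hedged illustration of the temporal part of an observation, the following Python sketch serializes a time:Interval with its beginning and end instants in Turtle and computes the dive duration as an xsd:duration string. The function names, the interval IRI in the usage example, and the timestamps are hypothetical; the sketch does not reproduce the actual mapping used for the dataset, and it only computes the duration value rather than modeling the time:Duration instance.

```python
from datetime import datetime, timezone

TURTLE_PREFIXES = (
    "@prefix time: <http://www.w3.org/2006/time#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def xsd_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as an xsd:dateTimeStamp value in UTC."""
    if moment.tzinfo is None:
        raise ValueError("xsd:dateTimeStamp requires a timezone")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def iso_duration(begin: datetime, end: datetime) -> str:
    """Return the elapsed time between two instants as an xsd:duration string."""
    seconds = int((end - begin).total_seconds())
    if seconds < 0:
        raise ValueError("end precedes begin")
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    text = "PT"
    if hours:
        text += f"{hours}H"
    if minutes:
        text += f"{minutes}M"
    if secs or text == "PT":
        text += f"{secs}S"
    return text


def interval_turtle(interval_iri: str, begin: datetime, end: datetime) -> str:
    """Serialize a time:Interval with its two time:Instant boundaries."""
    def instant(moment: datetime) -> str:
        return ('[ a time:Instant ; time:inXSDDateTimeStamp '
                f'"{xsd_stamp(moment)}"^^xsd:dateTimeStamp ]')
    return (
        TURTLE_PREFIXES
        + f"<{interval_iri}> a time:Interval ;\n"
        + f"    time:hasBeginning {instant(begin)} ;\n"
        + f"    time:hasEnd {instant(end)} .\n"
    )


if __name__ == "__main__":
    # Hypothetical dive, for illustration only.
    start = datetime(2001, 10, 3, 12, 0, 0, tzinfo=timezone.utc)
    stop = datetime(2001, 10, 3, 12, 20, 30, tzinfo=timezone.utc)
    print(interval_turtle("http://example.org/interval/sample-1", start, stop))
    print(iso_duration(start, stop))
```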

Occurrence
Although the SES census could be modeled with SOSA/SSN, because a sosa:Sensor can be a human observer rather than electronic equipment, we have decided to use BiGe-Onto, since it was created to model species occurrences by making use of DwC. We believe that, if we want to share the results in an interdisciplinary way, it is necessary to respect the standard adopted by the biodiversity community. Using DwC will also allow this part of the dataset to be reused for more complex analyses such as marine spatial planning. Based on these considerations, we use the class dwc:Occurrence to represent the SES observations made during a census.
Figure 4 shows the observation of two female pups on October 3, 2001. To represent gender, we use the NERC URI (SDN:S10::S106). The property bigeonto:has_event connects the occurrence instance with the event instance (bigeonto:BioEvent). In the same way, each event instance is related to an instance of the geo:Geometry class through the bigeonto:has_location relationship. The dwciri:recordedBy property relates the occurrence instance to the instances of the foaf:Person class that performed the observation. Finally, the occurrence is associated with an instance of the class dwc:Organism through the bigeonto:associated property, and the organism belongs to a specific taxon (dwc:Taxon) whose scientific name is Mirounga leonina. The red rectangles represent the links generated for the taxon in DBpedia and Wikidata, as well as the identifier of the person in ORCID. To explore the occurrence URI visually, use the link http://linkeddata.cenpat-conicet.gob.ar/page/occurrence/ID-36202

Publication
Each publication is represented as an instance of fabio:Expression. These expressions are further divided by type of publication using the respective subclasses of fabio:Expression. For instance, published books are represented with the class fabio:Book, and so on. The properties shared by all publications include: date of publication (prism:publicationDate), title (dc:title), DOI (prism:doi), authors (dc:creator), abstract (dc:abstract) and file format (dc:format). Finally, documents reference (dc:references) the respective platforms (sosa:Platform) that were involved in the results reported in those documents. Figure 5 shows the modeling of a publication associated with a platform. To explore the publication URI visually, use the link http://linkeddata.cenpat-conicet.gob.ar/page/paper/ID-45

Data Transformation Process
To create Linked Open Data, the data contained in the SES database must be converted into RDF. As explained in Section 2, the measurements produced by sensors and the census data are stored in a MySQL server. Fields that are no longer used or that contain confidential data (for example, data that are still being processed) are excluded. The transformation is done with the D2RQ Platform 9 , which consists of:
- the D2RQ Mapping Language, used to write mappings between database tables and RDF vocabularies or OWL ontologies;
- the D2RQ Engine, a SPARQL-to-SQL rewriter that can evaluate SPARQL queries over the mapped database; and
- D2R Server, a web application that provides access to the database via the SPARQL protocol, as Linked Data, and via a simple HTML interface.
To see the complete mapping, use the link 10 to the project repository.
D2RQ runs in the back end at http://linkeddata.cenpat-conicet.gob.ar, where the structured data can be browsed; it also provides a SPARQL endpoint that can be accessed from other applications and a SPARQL explorer to query our database in a friendly manner. One advantage of D2RQ is that, once the mapping is written, it does not need to be rewritten when the database is updated. The key statistics presented in Table 5 were computed in September 2020.
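As a rough sketch of how another application might address the SPARQL endpoint, the Python snippet below builds a SPARQL Protocol GET request URL. The endpoint path /sparql is an assumption (the exact endpoint address is not given in this text), and the example query is illustrative rather than one of the queries developed for the dataset.

```python
from urllib.parse import urlencode

# Assumed endpoint location; the actual path is not stated in the text.
ENDPOINT = "http://linkeddata.cenpat-conicet.gob.ar/sparql"

# Illustrative query: count the resources typed as sosa:Sensor.
COUNT_SENSORS = """\
PREFIX sosa: <http://www.w3.org/ns/sosa/>
SELECT (COUNT(?sensor) AS ?n)
WHERE { ?sensor a sosa:Sensor . }
"""


def query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Build a SPARQL Protocol GET URL with the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(query_url(COUNT_SENSORS))
```

The resulting URL can be fetched with any HTTP client; no request is sent by the sketch itself.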

Interlinking
The external links were generated manually, using a MySQL table created for this purpose, which is then mapped using D2RQ. One column of this table holds the URI in our dataset, for example http://linkeddata.cenpat-conicet.gob.ar/resource/person/MZA, and another column holds the equivalent URI in an external dataset, such as orcid:0000-0001-8851-8602.
In the case of publications, the instances of the fabio:Expression class were linked to the OpenCitations dataset [17] when possible, as Figure 5 shows. The relationship between these URIs is expressed using the owl:sameAs property. Instances of sosa:ObservableProperty were linked to their corresponding URIs in the NERC dataset; for example, depth has an equivalent URI whose identifier is SDN:P01::ADEPZZ01, as shown in Figure 2.
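The link table described above can be pictured as rows of (local URI, external URI) pairs that are turned into owl:sameAs statements. The Python sketch below illustrates the idea outside of D2RQ; the function names are hypothetical, and the expansion of the orcid prefix to https://orcid.org/ is an assumption, since the prefix table is not reproduced here.

```python
from typing import Iterable, List, Tuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Assumed prefix expansion; the dataset's own prefix table may differ.
PREFIXES = {"orcid": "https://orcid.org/"}


def expand(curie_or_iri: str) -> str:
    """Expand a prefixed name such as orcid:0000-... into a full IRI."""
    prefix, separator, local = curie_or_iri.partition(":")
    if separator and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return curie_or_iri  # already a full IRI (or an unknown prefix)


def same_as_triples(rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Turn (local, external) rows into owl:sameAs N-Triples lines."""
    return [
        f"<{expand(local)}> <{OWL_SAME_AS}> <{expand(external)}> ."
        for local, external in rows
    ]


if __name__ == "__main__":
    link_rows = [
        ("http://linkeddata.cenpat-conicet.gob.ar/resource/person/MZA",
         "orcid:0000-0001-8851-8602"),
    ]
    for line in same_as_triples(link_rows):
        print(line)
```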
For external links referring to places, we use Geonames 11 , a geographical database that contains over 11.8 million geographical names. The structure behind the data is the Geonames ontology v3.2 12 , which closely resembles the flat-file structure. An individual in the database is an instance of type Feature and has a Feature Class (administrative divisions, populated places, etc.) and a Feature Code (subcategories of Feature Class), along with latitude, longitude, etc. In our case, the places where occurrences are recorded, such as beaches (instances of geo:Feature and dc:Location), were linked to Geonames whenever possible. An example can be seen in Figure 4, where the location corresponds to a beach in PV called Punta Buenos Aires, whose identifier is geonames:3863776.
To link instances of people (foaf:Person), we use Open Researcher and Contributor Identifiers (ORCIDs) [18], which are intended to uniquely identify researchers so that those individuals can be correctly credited for their research work and links can be provided to express their professional affiliations, see

Dataset availability
The SES dataset can be downloaded, navigated and queried using a SPARQL endpoint, and it is published under the Creative Commons Attribution 4.0 International (CC BY 4.0) 13 License. All the criteria for five-star Linked Data as defined in [19] are met: there is a description of the data online, the data are available in RDF, there are many links to structured vocabularies, and metadata about the collection are made available. Our dataset characteristics are listed in Table 6. To explore the dataset using the SPARQL endpoint, we have developed a set of queries that answer the questions researchers most commonly ask, for example the number of dives, trips, values of certain environmental variables, etc. Table 7 shows the developed queries and their corresponding links to the endpoint.

Use Case: Accessing and analyzing data from dives and censuses
One crucial aspect is how to access and analyze data, and especially how to get only the part of the data that is of interest for a given research question. To show how the dataset can be exploited, we developed a dashboard, https://cesimar/DiveAnalysisDashboard, that allows users to consult the statistics of the dives and the routes taken. We use the R flexdashboard 14 package together with a package 16 that allows the results of SPARQL SELECT queries to be imported directly into the statistical environment of R as a data frame. The following describes each module of the dashboard.
- Diving statistics: This module summarizes the diving statistics (maximum depths recorded, number of dives, maximum temperatures and number of platforms). The information for each of the sensors used is also detailed. The ggplot 17 library was used for the bar charts.
- Dive analysis: This module shows, by platform, the most important variables registered during dives. Temperatures and depths, as well as durations, can be displayed. The line chart was built using the plot_ly 18 library.
- Platform trips: This module retrieves the trips made by each platform and displays them on a map generated with the leaflet library 19 ; the map can be filtered by platform if necessary. A spatial cluster analysis using the dbscan 20 algorithm is also provided to understand the distribution of SES at sea. The user can configure its parameters to obtain the best fit.
- Census statistics: This module allows the data of the censuses carried out from 1990 to 2017 to be analyzed. Two charts were developed with ggplot: the first shows the annual population of SES grouped by category, while the second shows the trend of the SES breeding population.
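To clarify what the spatial cluster analysis computes, the following is a minimal Python sketch of the DBSCAN algorithm applied to (latitude, longitude) positions with a great-circle distance. It is not the dashboard's implementation, which relies on an R library; the function names, the parameter names eps_km and min_pts, and the sample positions are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees
NOISE = -1


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two positions, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def dbscan(points: Sequence[Point], eps_km: float, min_pts: int) -> List[int]:
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    def neighbours(i: int) -> List[int]:
        # All points within eps_km of point i, including i itself.
        return [j for j, q in enumerate(points)
                if haversine_km(points[i], q) <= eps_km]

    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already assigned or marked as noise
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE             # may later become a border point
            continue
        cluster += 1                      # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)   # j is also a core point: keep expanding
    return [NOISE if label is None else label for label in labels]


if __name__ == "__main__":
    # Hypothetical positions, for illustration only.
    positions = [(-42.0, -63.0), (-42.01, -63.0), (-42.0, -63.01),
                 (-50.0, -55.0), (-50.01, -55.0), (-50.0, -55.01),
                 (-30.0, -40.0)]
    print(dbscan(positions, eps_km=5.0, min_pts=3))
```

The two parameters correspond to the neighbourhood radius and the minimum number of positions needed to form a dense region, which are the kind of settings a user would tune.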

Discussion
This paper presents the publication as LOD of a biological and physical dataset collected for more than 20 years and stored with the initial objective of studying the influence of the environment on the foraging, reproductive performance and population trend of the SES. The dataset, initially available only to a small research group, is now available to a global community. Our Linked Open Data development improves the discoverability of the content of the database and could be applied to new knowledge-building and cross-disciplinary work. For example, we expect the hydrographic profiles to become a useful tool, together with physical samples resulting from other science programs, to assess ocean changes associated with climate change. The dataset comes from PV, a geographic region under conservation regulations by UNESCO, and there is a continuous demand from governmental authorities to develop spatial planning. This requirement helps the sustainability of the database, because it calls for highly accurate, up-to-date SES data and for user-friendly access to other databases. Coastal Management Planning and Marine Spatial Planning (MSP) are concerned with the management of the distribution of human activities in space and time in and around seas and oceans to achieve ecological, economic and societal objectives and outcomes [20]. The next steps will be to promote the use of vocabulary terms for discovering the databases of the CESIMAR institute, so that data are available and suitable for use in the regular review cycles of the MSP process. In addition, it would be desirable to access the physical datasets collected by tourist and commercial vessels that cover the same range in the southwest Atlantic Ocean. These hydrographic profiles could cover changes in the environment across the whole area of influence of the SES distribution.