Dataset Concept

The main aim of DAF is to provide a framework to manage data, regardless of their size and nature (from small vocabulary tables to big unstructured data). That is why we designed the DAF around an abstraction of the concept of a dataset: potentially interconnected logical entities made of metadata, data, storage options and “interactivity” capabilities. We shaped this abstraction to be generic enough to model batch and streaming data, whether structured, semi-structured or unstructured.
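
As a rough illustration of this abstraction, the sketch below models a dataset entity as a small Python data structure. Every name in it (metadata, schema, storage, interactivity, and the example values) is hypothetical and chosen only to mirror the ingredients listed above, not the actual DAF data model.

    # Minimal, illustrative sketch of the dataset abstraction; names are hypothetical.
    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Dataset:
        metadata: Dict[str, str]                                 # descriptive information (title, owner, theme, ...)
        schema: Dict[str, str]                                   # field name -> type / semantic annotation
        storage: List[str] = field(default_factory=list)         # e.g. ["hdfs_parquet", "elasticsearch"]
        interactivity: List[str] = field(default_factory=list)   # e.g. ["api", "spark", "dashboard"]

    weather = Dataset(
        metadata={"title": "Weather stations", "owner": "hypothetical_org"},
        schema={"station_id": "string", "measured_at": "timestamp", "temperature": "float"},
        storage=["hdfs_parquet"],
        interactivity=["api", "spark"],
    )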

Dataset lifecycle

Datasets follow a standard lifecycle, regardless of their nature and type.

  • Dataset entity creation: a dataset entity is created via a metadata management form, where the user specifies information about the dataset (following the required items of the DCAT-AP_IT standard), its data structure, the properties and annotations of its fields/attributes, and operational information that governs how the dataset is managed in the platform (where to store it, where to listen for incoming data, who can access it, etc.). A hypothetical creation payload is sketched after this list.
  • Ingestion: after the dataset entity has been created, a microservice activates an Apache NiFi pipeline that listens for new data to ingest. DAF is currently able to ingest data from SFTP (the default option for batch data) and via pull or push from an external web service.
  • Transformation and enrichment pipelines: before being stored in the appropriate storage engine, the data goes through several pipelines that enrich the incoming data. We are currently developing the following two pipelines:
    • Normalization pipeline, which applies DAF internal conventions to the raw incoming data, such as character encoding (UTF-8), handling of null entries, refactoring of codified fields like dates and URLs, and so on.
    • Standardization pipeline, which makes sure that fields marked as bound to a controlled vocabulary (via the semantic annotations made during the dataset entity creation phase) actually use the terms present in that vocabulary.
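
To make the creation step more concrete, the following sketch shows what a creation payload produced by the metadata management form might look like. The keys (dcatapit, dataschema, operational, vocabulary) and values are assumptions for illustration only; the real DAF catalogue schema may differ.

    # Hypothetical dataset-creation payload; keys and values are illustrative only.
    dataset_entity = {
        "dcatapit": {                      # descriptive metadata (DCAT-AP_IT required items)
            "name": "bike_sharing_milano",
            "title": "Bike sharing - Comune di Milano",
            "theme": "TRAN",
            "owner_org": "comune_di_milano",
        },
        "dataschema": [                    # data structure plus per-field semantic annotations
            {"name": "station_id", "type": "string"},
            {"name": "timestamp", "type": "timestamp", "format": "ISO-8601"},
            {"name": "city", "type": "string", "vocabulary": "anncsu_comuni"},  # bound to a controlled vocabulary
        ],
        "operational": {                   # how the platform should manage the dataset
            "storage": ["hdfs_parquet", "kudu"],
            "input_source": {"type": "sftp", "path": "/data/bike_sharing/"},
            "acl": ["comune_di_milano", "analysts"],
        },
    }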

All pipelines are designed to enrich the incoming raw data without modifying the original content: the steps described above add new fields containing the result of the applied transformation and, when applicable, an indicator of the “goodness” of that transformation.
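
The snippet below illustrates this “enrich, don’t overwrite” convention with a toy normalization step: the original field is kept and new fields carry the transformed value plus a quality flag. The field-name suffixes (__norm, __norm_quality) and the date format are assumptions for illustration, not the conventions actually used by the DAF pipelines.

    # Toy normalization step: enrich the record, never modify the original content.
    from datetime import datetime

    def normalize_date(record: dict, field: str) -> dict:
        enriched = dict(record)                      # keep the original field untouched
        try:
            parsed = datetime.strptime(record[field], "%d/%m/%Y")
            enriched[f"{field}__norm"] = parsed.strftime("%Y-%m-%d")   # normalized ISO date
            enriched[f"{field}__norm_quality"] = "ok"                  # "goodness" of the transformation
        except (KeyError, ValueError):
            enriched[f"{field}__norm"] = None
            enriched[f"{field}__norm_quality"] = "failed"
        return enriched

    print(normalize_date({"station_id": "42", "timestamp": "03/11/2017"}, "timestamp"))
    # {'station_id': '42', 'timestamp': '03/11/2017',
    #  'timestamp__norm': '2017-11-03', 'timestamp__norm_quality': 'ok'}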

  • Storage: once the dataset is ready, it is persisted using one or more of the storage mechanisms indicated during the dataset entity creation phase. We are currently working to support Parquet files on HDFS, HBase, Kudu, MongoDB and Elasticsearch.
  • E-gestion: data are consumed by the end user via APIs (implemented within the dataset manager microservice), Spark (accessible via Livy and Jupyter notebooks) and data analytics & visualization applications (we currently integrate with Metabase and Superset).
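
As an example of consumption through Spark from a Jupyter notebook, the sketch below reads a dataset persisted as Parquet on HDFS and runs a simple exploratory aggregation. The HDFS path and column names are hypothetical.

    # Reading a dataset persisted as Parquet on HDFS from a Spark session;
    # the path and columns below are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("daf-consumption-example").getOrCreate()

    df = spark.read.parquet("hdfs:///daf/ordinary/comune_di_milano/bike_sharing_milano")
    df.printSchema()

    # Typical exploratory query before handing the data to a visualization tool
    (df.groupBy("station_id")
       .agg(F.avg("available_bikes").alias("avg_bikes"))
       .orderBy(F.desc("avg_bikes"))
       .show(10))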

Types of datasets

From a logical point of view, the DAF manages two types of datasets:

  • Standard datasets describe phenomena that are common nationwide. They are thus defined as datasets of national relevance and are supported by the highest level of information/metadata. They follow a detailed set of rules and standardization mechanisms that make them homogeneous across data sources (there may be multiple data sources describing the same phenomenon, e.g. bike sharing can be analyzed using data coming from Milan, Turin, Rome, etc.).
  • Ordinary datasets have “owner” relevance, in the sense that they are defined and generated by a specific owner for its own usage. They do not follow a standard nationwide schema (data model), but the owner must specify metadata and information about the dataset before ingesting the data into the DAF Big Data Platform.
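
The contrast between the two types can be summarized with two illustrative descriptors: the standard dataset points to a shared nationwide data model and aggregates multiple sources, while the ordinary one carries the schema declared by its owner. All keys and values below are hypothetical.

    # Illustrative contrast between the two dataset types; keys are hypothetical.
    standard_dataset = {
        "name": "bike_sharing",
        "type": "standard",
        "schema_ref": "std/transport/bike_sharing",   # shared nationwide data model
        "sources": ["comune_di_milano", "comune_di_torino", "roma_capitale"],
    }

    ordinary_dataset = {
        "name": "parcheggi_centro_storico",
        "type": "ordinary",
        "owner": "comune_di_firenze",
        "schema": [                                    # owner-defined structure, declared at creation time
            {"name": "parking_id", "type": "string"},
            {"name": "free_spots", "type": "int"},
        ],
    }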