LUMI AI Factory Dataset-as-a-Service User Guide¶
What is LUMI AIF Dataset as a Service?¶
LUMI AI Factory Dataset as a Service (LUMI AIF DaaS) offers curated high-quality datasets close to high-performance computing resources, enabling researchers, innovators, and industry users to focus on research and development rather than data acquisition and infrastructure management. The datasets are usually accessible through a front-end portal and application programming interfaces (APIs).
You can find our current datasets in our data catalog. If you are interested in sharing your data as part of our curated collection, you will find more information in this guide.
LUMI AIF DaaS is a Minimum Viable Product (MVP)
LUMI AIF DaaS is currently a Minimum Viable Product (MVP). An MVP is the simplest version of a product that includes only the core features necessary to deliver value to early users and validate the product concept. Because LUMI AIF DaaS is built upon existing services, its user experience and available features are already quite advanced for an MVP.
LUMI AIF DaaS is not intended for archival or digital preservation purposes.
Who can use LUMI AI Factory Dataset as a Service? What can they use it for?¶
LUMI AI Factory Dataset as a Service is developed for European AI research and innovation in the private (especially startups and SMEs), academic, and public sectors. It serves two central user roles: data users and data providers.
Data users are individuals who use the datasets in their research or development work. They can
- search or browse the public data catalogue to find datasets
- apply to access a dataset if necessary
- use the dataset in their research
- combine datasets with their own data, if needed, and delete the data they've uploaded for their own use
Data providers are the organizations that make their datasets available via Dataset-as-a-Service. They can
- make large, widely interesting and high-quality datasets (and associated metadata) available for data users
- limit the use and level of publicity of the dataset
Do I have to pay to use the datasets?¶
The LUMI AI Factory Dataset as a Service datasets can be accessed and used free of charge. To access the LUMI supercomputer and to run jobs on LUMI using the DaaS datasets, you need to be a member of a project that has been granted resources on LUMI. For startups and SMEs, the easiest way to get started with LUMI is to use one of the computing packages of LUMI AI Factory.
The LUMI consortium countries have different policies for accessing LUMI. An overview of the access policies is provided on the LUMI Supercomputer Get Started page.
LUMI AIF DaaS supports access management for datasets. In the future, data providers and data users may agree on payment for the use of a dataset. Currently there are no such datasets in the service.
Feedback¶
We are building a new service and would be happy to hear about your experiences and suggestions on how we can make LUMI AI Factory Dataset as a Service better.
Finding and using datasets¶
To start using LUMI AI Factory Dataset as a Service, head to our data catalogue, find an interesting dataset and use it in your LUMI project or elsewhere. If you need help or want more information, check this guide. You can contact LUMI AI Factory user support to receive support at any step of the process.
Searching for datasets¶
The currently available datasets are listed in the data catalogue. Take a look; you don't need to have an account to browse it.
You can search for datasets by, for example:
- access type (open/restricted)
- year
- field of science
- keywords
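The filters listed above can be illustrated with a small local sketch. It is not the catalogue's API: the `DatasetRecord` fields, the `filter_datasets` helper and the example records are all hypothetical, made up only to show how the four filters combine (a record must match every filter that is set).

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical, simplified catalogue entry; not the real catalogue schema."""
    title: str
    access_type: str  # "open" or "restricted"
    year: int
    field_of_science: str
    keywords: list[str] = field(default_factory=list)


def filter_datasets(records, access_type=None, year=None,
                    field_of_science=None, keyword=None):
    """Return the records that match every filter that is not None."""
    matches = []
    for record in records:
        if access_type is not None and record.access_type != access_type:
            continue
        if year is not None and record.year != year:
            continue
        if (field_of_science is not None
                and record.field_of_science.lower() != field_of_science.lower()):
            continue
        if (keyword is not None
                and keyword.lower() not in (k.lower() for k in record.keywords)):
            continue
        matches.append(record)
    return matches


# Made-up example records, for illustration only.
CATALOGUE = [
    DatasetRecord("Example speech corpus", "open", 2024, "Languages", ["speech", "audio"]),
    DatasetRecord("Example satellite images", "restricted", 2023, "Earth sciences", ["imagery"]),
    DatasetRecord("Example text corpus", "open", 2023, "Languages", ["text", "nlp"]),
]

open_2023 = filter_datasets(CATALOGUE, access_type="open", year=2023)
print([record.title for record in open_2023])  # ['Example text corpus']
```

Leaving a filter as `None` means "do not filter on this property", which mirrors how optional facets typically behave in a catalogue search.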
When you open a dataset, you will see general information about the data, links to its documentation, and information on how to access it. Many datasets have open APIs or portals where you can browse the data. To access a dataset and to use it in your LUMI project, you will need an account. The LUMI guide describes the LUMI access process and requirements.
Applying to use a dataset¶
Some datasets have restricted access. Applying to use such datasets is currently done by e-mail. Please contact us by e-mail if you want to apply to use one or several of the current restricted datasets in LUMI AI Factory Dataset as a Service. In your e-mail, please include a short description of the use case you have in mind for the dataset.
Many datasets are open, so you can use them without applying for access. However, we are interested in the different use cases and can also help with accessing the open datasets, so we would appreciate it if you sent us an e-mail in such cases too.
You will be notified by e-mail when you have been granted or denied access to a dataset. After you have been granted access, LUMI AI Factory support can help you access the dataset so that you can use it in your project with LUMI resources.
Using datasets on LUMI¶
LUMI docs provides general information on software on LUMI and running jobs on LUMI. The AI software environment and Containerized Workflows developed by LUMI AI Factory could be especially relevant to AI users on LUMI.
Citation¶
Datasets have unique URLs for citation. Licensing information is embedded in the metadata and maintained by the data providers. You can get citation examples on the dataset's page by clicking the Copy Citation/References button.
Combining datasets and working with your own data¶
You can combine your own data with a Dataset-as-a-Service dataset, or use more than one dataset in your project if you have been granted access to several datasets.
If your own dataset contains personal data, remember to check the relevant parts of LUMI's Terms of Use. Uploading and deleting your own data is your own responsibility.
LUMI's data storage options are detailed in the user guide.
Sharing your dataset¶
We are always looking for new datasets that would be of value to AI research and innovation. If you have or know of a potential dataset, we'd be happy to hear from you! Please contact us by sending an email to LUMI AI Factory user support to tell us about your dataset. We will guide you through the selection and upload process. You can also find more information about the practicalities of sharing your dataset in this guide.
Why should I share my dataset?¶
LUMI AI Factory Dataset as a Service is a way for you to get visibility for your dataset. Our users are frontier developers and innovators of AI in Europe. Our vision is to empower AI start-ups, SMEs, academic researchers, and other public and private users to develop innovative AI models and applications. With this we aim to support trustworthy AI.
LUMI AIF DaaS is not a repository or a preservation service. It does not take away your ownership of data or remove your right to manage access to it. LUMI AIF DaaS facilitates access and findability. Publish your data to increase its impact, discover new use cases, meet compliance requirements, attract collaborators and customers, and participate in cutting‑edge scientific and industrial innovation.
What is required of a dataset¶
LUMI AI Factory Dataset as a Service offers a curated catalog. This means that every dataset offered has been through a process in which its suitability has been evaluated. The LUMI AI Factory curation criteria are evolving and will be presented to data providers during our cooperation. In summary, the datasets need to be relevant for AI usage, they need to be AI-ready, and the data providers need to agree to provide the data under a license or agreement that makes it possible to use it in AI innovation. If you have already followed good data management practices and you have a large dataset with some idea of its AI usage, your dataset is most likely good to go.
Every data provider needs to accept the Terms of Use of LUMI AI Factory Dataset as a Service. If your data contains personal data, a record of processing activities should also be documented. LUMI AIF DaaS accepts datasets containing personal data and other confidential datasets. Datasets containing special categories of personal data are not currently accepted.
Adding the dataset¶
When you have contacted LUMI AI Factory user support and your dataset has been accepted for LUMI AI Factory Dataset as a Service, you will be guided to add information about the dataset. The metadata about the dataset will be added to Fairdata services. You can do this yourself or we can do it for you. Comprehensive metadata gives data users detailed descriptions of the data and helps them discover the data they need. You can find a detailed description of the metadata upload process and the possible metadata fields in the Fairdata guide for adding dataset metadata.
In the dataset description, you will need to add documentation related to the dataset and ways to access the dataset. If the dataset is already accessible through a national repository or other services, you can add this access point. In addition, or if such an access point is not available, we can give you space to upload the data to LUMI-O. We will help you with the upload process where needed, but you can also check the LUMI-O user guide.
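Before an upload, it can be useful to record what you are about to transfer so that the copy can be checked afterwards. The following is a minimal sketch of one way to do that, not a step required by the service or part of the LUMI-O tooling: the `build_manifest` helper and the manifest layout are assumptions made for illustration, and the example files are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root):
    """Map each file path (relative to root, POSIX style) to its size and SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files do not have to fit in memory.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        manifest[path.relative_to(root).as_posix()] = {
            "bytes": path.stat().st_size,
            "sha256": digest.hexdigest(),
        }
    return manifest


# Demonstration on a temporary directory with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train").mkdir()
    (Path(tmp) / "train" / "part-0.txt").write_bytes(b"hello")
    (Path(tmp) / "README.txt").write_bytes(b"")
    for name, info in build_manifest(tmp).items():
        print(name, info["bytes"], info["sha256"][:12])
```

Running the same function on a downloaded copy and comparing the two dictionaries is a simple way to confirm that the files arrived unchanged.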
Reviewing applications from data users¶
Datasets can have restricted access. When a dataset is provided to LUMI AI Factory Dataset as a Service, the criteria for access to it are agreed on. If so agreed, the data provider can evaluate each proposed use case. Currently this process is conducted by e-mail.
You will receive data permit applications from potential data users by e-mail, and you can either accept or deny the requests by e-mail. If you have set definitive criteria for acceptance, the data permits can also be managed without your approval.
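To make the idea of definitive acceptance criteria concrete, here is a minimal sketch of how an application could be checked against criteria agreed in advance. The process is currently handled by e-mail, so this is not how the service works today; the `AccessCriteria` and `AccessApplication` classes, their fields and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCriteria:
    """Hypothetical criteria a data provider might agree on for a restricted dataset."""
    allowed_fields_of_science: frozenset
    require_use_case: bool = True


@dataclass(frozen=True)
class AccessApplication:
    """Hypothetical summary of what an applicant states in their e-mail."""
    applicant_email: str
    field_of_science: str
    use_case: str


def evaluate_application(application, criteria):
    """Return (approved, reasons); an empty reasons list means all criteria were met."""
    reasons = []
    if application.field_of_science not in criteria.allowed_fields_of_science:
        reasons.append("field of science is not covered by the access criteria")
    if criteria.require_use_case and not application.use_case.strip():
        reasons.append("use case description is missing")
    return (not reasons, reasons)


criteria = AccessCriteria(frozenset({"Computer and information sciences"}))
application = AccessApplication(
    applicant_email="researcher@example.org",
    field_of_science="Computer and information sciences",
    use_case="Fine-tuning a speech recognition model",
)
approved, reasons = evaluate_application(application, criteria)
print(approved, reasons)  # True []
```

Returning the reasons alongside the decision keeps a rejection explainable, which matters when the applicant is notified of the outcome.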
Withdrawing a dataset¶
If you want to withdraw your dataset, please contact LUMI AI Factory support. LUMI AI Factory Dataset as a Service also evaluates the usefulness of the datasets in the service at regular intervals. Datasets that are no longer useful for AI innovation can be removed from the catalog. In such cases, the data providers will be notified one month in advance.
Information about use of the dataset¶
LUMI AI Factory Dataset as a Service has limited information about the use of the datasets. The publication page of each dataset shows a views metric, which indicates how much interest your dataset has generated. The views metric is updated once a day.
If you have published the dataset as restricted, we will have information about the access granted. In your access criteria, you can also specify what information you require before access is granted (e.g. field of science, a description of the use case).