Several approaches and techniques in the literature
manage semi-structured and structured data
(Bhroithe et al. 2020; Alloghani et al. 2019; Aftab et
al. 2020; Ouaret et al. 2019). However, these works
address only two formats (structured and
semi-structured) and do not examine unstructured
data. In addition, most approaches that deal with
unstructured data focus only on textual data (Yafooz
and Fahad 2018).
The contributions of this paper are as follows.
First, we review the state of the art and present a
comprehensive view of data lake (DL) concepts.
Second, we introduce a new method for structuring
unstructured data, particularly in the health field,
where data often come in different formats. Third, we
construct an ontology that represents our Moroccan
data lake.
The remainder of this paper is organized as follows.
In Section 2, we review the related literature. In
Section 3, we present the formalization and data lake
architectures adopted by our approach, and we then
propose a procedure for partially structuring
unstructured data sources. In Section 4, we describe
our ontology-based data lake model, which enriches
the representation of unstructured data sources. In
Section 5, we present a COVID-19 case study. In
Section 6, we present the evaluation technique and
provide a critical discussion of our approach. Finally,
in Section 7, we conclude the paper.
2  RELATED LITERATURE 
2.1  State of the Art 
The data lake is a relatively recent concept,
introduced by James Dixon as an alternative to data
marts, which store data in silos (Alserafi et al. 2016).
To prevent a data lake from turning into a data
swamp, its data must be accompanied by metadata
(Sawadogo et al. 2019). The data lake model
demands that any raw data be combined with a set of
metadata. This represents a crucial competitive
differentiator for any data lake architecture.
Farrugia et al. (2016) proposed an approach to
managing data lakes based on the extraction of
metadata from an open-source data warehouse system
named Hive. To achieve this, they apply Social
Network Analysis techniques.
Various metadata classifications have been
introduced in the literature, and various metadata
models are used to design them. Among these models
is RDF. The strength of this model is its semantic
richness; its weakness is its complexity. Indeed, it
cannot maintain fast processing and analysis of
heterogeneous data.
A metadata model proposed by Oram is well-suited
for data lakes (Oram 2016). There is also the model
adopted by Zaloni (Ben Sharma 2018), considered
one of the leading companies in the data lake domain.
Zaloni adopts a three-way classification of metadata,
namely operational, technical, and business metadata.
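The three-way classification can be sketched as a small data structure. This is only an illustration, not Zaloni's implementation: the class names `MetadataCategory` and `MetadataRecord`, the dataset name, and the example keys (`format`, `ingested_at`, `owner`) are hypothetical and chosen only to show one entry per category.

```python
from dataclasses import dataclass, field
from enum import Enum


class MetadataCategory(Enum):
    """The three metadata categories named in the text."""
    OPERATIONAL = "operational"
    TECHNICAL = "technical"
    BUSINESS = "business"


@dataclass
class MetadataRecord:
    """Metadata attached to one raw dataset, grouped by category (hypothetical structure)."""
    dataset: str
    entries: dict = field(
        default_factory=lambda: {category: {} for category in MetadataCategory}
    )

    def add(self, category, key, value):
        """Store one key/value pair under the given category."""
        self.entries[category][key] = value

    def by_category(self, category):
        """Return a copy of all entries stored under one category."""
        return dict(self.entries[category])


# Illustrative usage: the dataset name and keys are assumptions, not taken from the paper.
record = MetadataRecord("patients.csv")
record.add(MetadataCategory.TECHNICAL, "format", "csv")
record.add(MetadataCategory.OPERATIONAL, "ingested_at", "2020-04-01")
record.add(MetadataCategory.BUSINESS, "owner", "health department")
```

Keeping the categories in an enumeration makes it impossible to file an entry under a category outside the three listed above.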
2.2  Data Lake Definition 
A data lake is described as an extensive system or
repository that stores heterogeneous raw data.
Although the diversity of related concepts poses a
significant issue, there is broad agreement in the
literature on the definition of data lakes: existing
definitions share the same vision, namely that a data
lake is a central repository of raw data stored in its
native format. For example, Hai et al. (2016)
define data lakes as “a
megadata repository that stores data in its native 
format and provides on-demand ingestion 
functionality using metadata description”. 
Terrizzano et al. (2015) use a definition of a data lake
provided by Madera and Laurent (2016) and assert
that “a data lake is a central repository containing
enormous amounts of raw data described by
metadata”. Thus, we observe strong agreement
concerning the definition of data lakes. In big data
analytics, user needs are not established at the initial
design stage. The data lake is a solution that emerged
with big data: it ingests raw data from different
sources and stores source data in its native format. It
enables data to be processed according to diverse
requirements, provides access to ready data for
various needs, and supervises data to ensure data
governance.
3  FORMALIZATION 
In this section, we describe the network model that we
use throughout the paper to manage the data lake.
This model represents a data lake as a set of data
sources:
$$DL = \{DS_1, DS_2, \ldots, DS_n \mid DS = \text{Data Source}\} \tag{1}$$
It is important to note that each data source $DS_k$ is
associated with a set of metadata, denoted by $M_k$.
We denote by $M_{DL}$ the set of metadata repositories
for the heterogeneous data sources stored in the lake.
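The formalization can be illustrated with a minimal sketch, under stated assumptions. The class names `DataSource` and `DataLake`, the representation of each metadata set as key/value pairs, and the modelling of the lake-wide metadata repository as a mapping from source name to metadata set are hypothetical choices made for illustration; the text does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataSource:
    """One data source DS_k together with its metadata set M_k."""
    name: str
    # M_k, assumed here to be a set of (key, value) pairs
    metadata: frozenset = field(default_factory=frozenset)


@dataclass
class DataLake:
    """DL = {DS_1, ..., DS_n}: a set of data sources."""
    sources: set = field(default_factory=set)

    def add(self, source):
        """Add one data source to the lake."""
        self.sources.add(source)

    def metadata_repository(self):
        """M_DL, modelled here (as an assumption) as a map from source name to M_k."""
        return {source.name: source.metadata for source in self.sources}


# Illustrative usage: source names and metadata values are assumptions.
lake = DataLake()
lake.add(DataSource("ds1", frozenset({("format", "pdf")})))
lake.add(DataSource("ds2", frozenset({("format", "csv"), ("language", "fr")})))
m_dl = lake.metadata_repository()
```

Making `DataSource` immutable lets the lake hold its sources in a true set, which mirrors the set notation of Equation (1).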