Big Data, Data Lakes, and Hadoop

Since I am unemployed, I decided I may as well make use of the time and learn some new things. The first topic I decided to catch up on was 'big data' and data lakes. I knew what data warehouses were; I have even designed and populated one, albeit many years ago. But I wasn't too sure what a data lake was or how data lakes related to big data.

A data warehouse is a schema-defined relational database that holds data for analysis and reporting purposes. The data in a data warehouse is structured data, meaning it is basically row and column data. Data is written to the warehouse by one or more programs that obtain the data from the source systems, massage it to comply with the warehouse schema, and then write it into the warehouse database. Various reporting programs then get the required data from the warehouse and present it (report it) as required.
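To make the warehouse flow concrete, here is a minimal sketch in Python of the "obtain, massage, write, report" steps. It is only an illustration under my own assumptions: it uses an in-memory SQLite database as a stand-in for a real warehouse, and the table name, column names, and sample records are all made up.

```python
import sqlite3

# Hypothetical records as they might arrive from a source system.
# The field names and values are invented for this sketch.
SOURCE_ROWS = [
    {"customer": " alice ", "amount": "19.99", "date": "2015-03-01"},
    {"customer": "BOB", "amount": "5.00", "date": "2015-03-01"},
    {"customer": "Alice", "amount": "10.01", "date": "2015-03-02"},
]


def transform(row):
    """Massage one source record so that it complies with the schema."""
    customer = row["customer"].strip().title()        # normalise the name
    amount_cents = round(float(row["amount"]) * 100)  # store money as integer cents
    return (customer, amount_cents, row["date"])


def load(conn, rows):
    """Create the (assumed) warehouse table and write the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "customer TEXT NOT NULL, "
        "amount_cents INTEGER NOT NULL, "
        "sale_date TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [transform(r) for r in rows],
    )
    conn.commit()


def report(conn):
    """A reporting query: total sales per customer, in cents."""
    cur = conn.execute(
        "SELECT customer, SUM(amount_cents) FROM sales "
        "GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()


def run_warehouse_demo():
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse database
    try:
        load(conn, SOURCE_ROWS)
        return report(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_warehouse_demo())
```

The point to notice is that the schema is fixed before any data arrives, so every record has to be cleaned up to fit it on the way in.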

Data warehouses can be large, but they are constrained by the storage and performance limits of the relational database they run on. Warehouses of many terabytes do exist, but they typically need specialised hardware or software to get there.

Like a data warehouse, a data lake also holds data for analysis and reporting. The three big differences between data lakes and data warehouses are that: (a) data lakes can contain any format of data, including unstructured data (e.g., log files, text files, document files); (b) data goes into the data lake in its native format (i.e., it is not massaged on the way in as data warehouse data is, nor is its format changed); and (c) data lakes span clusters of machines and can be massive, often going into the terabytes, sometimes the petabytes, and probably even exabytes.
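Point (b) is the one I found easiest to understand with a small example. The following is a rough Python sketch, not how any real data lake product works: a temporary directory plays the part of the lake, the file names and contents are invented, and the two read functions are hypothetical. Files are stored unchanged, and structure is only applied when something reads them.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical raw files in three different formats (contents are made up).
RAW_FILES = {
    "web.log": "2015-03-01 GET /home 200\n2015-03-01 GET /missing 404\n",
    "sales.csv": "customer,amount\nAlice,19.99\nBob,5.00\n",
    "event.json": '{"type": "signup", "user": "alice"}\n',
}


def ingest(lake_dir, files):
    """Store each file in its native format; no schema is applied."""
    for name, content in files.items():
        (lake_dir / name).write_text(content, encoding="utf-8")


def count_http_errors(lake_dir):
    """Interpret the log only at read time: count status codes >= 400."""
    errors = 0
    text = (lake_dir / "web.log").read_text(encoding="utf-8")
    for line in text.splitlines():
        status = int(line.split()[-1])  # last field is the status code
        if status >= 400:
            errors += 1
    return errors


def list_customers(lake_dir):
    """Interpret the CSV only at read time: pull out the customer column."""
    with (lake_dir / "sales.csv").open(newline="", encoding="utf-8") as f:
        return [row["customer"] for row in csv.DictReader(f)]


def run_lake_demo():
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)  # stand-in for a real data lake
        ingest(lake, RAW_FILES)
        stored = sorted(p.name for p in lake.iterdir())
        return stored, count_http_errors(lake), list_customers(lake)


if __name__ == "__main__":
    print(run_lake_demo())
```

Note that event.json is stored but never read here. Nothing has to know how a file will be used at the time it goes in, which is the opposite of the warehouse approach.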

What do you use to manage and maintain a data lake? Hadoop. Well, Hadoop is the main data lake management framework, and that includes the packaged variants of Hadoop from various vendors.

Hadoop and data lakes are about storing and analysing 'big data'.

Who has 'big data' and what is it used for? That would be the topic for a different posting.

It is pretty unlikely I will ever work on a 'big data' project, but it is still an interesting topic to look into.