Homework for Cloud Infrastructures and Architectures of Big Data Platforms

Last modified: 21.01.2021 by Linh Truong ([email protected])

This homework is not graded.

1 - Using Docker to deploy multiple nodes of MongoDB

The goal of this task is to help you become familiar with dynamically provisioning big data platform components using cloud technologies. We choose MongoDB as the component for this task because it is a common NoSQL database and does not require much effort to deploy and test. You can run MongoDB in a Docker container, and you can also run a MongoDB replica set using Docker Compose.
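As a sketch of the Docker Compose approach, a three-node replica set could be declared as follows. Note that the service names (`mongo1`–`mongo3`) and the replica set name (`rs0`) are our own choices for illustration, not something prescribed by the homework:

```yaml
# Sketch: three MongoDB nodes forming one replica set "rs0".
version: "3"
services:
  mongo1:
    image: mongo
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
    ports:
      - "27017:27017"   # expose one node to the host for testing
  mongo2:
    image: mongo
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
  mongo3:
    image: mongo
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
```

After the containers are up, the replica set still has to be initiated once, e.g., by opening a `mongosh` session on one node and calling `rs.initiate(...)` with the three members listed.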

  • Set up Docker and pull the MongoDB Docker image
  • Deploy a MongoDB instance using Docker
  • Write a program with three functions: (i) test if a MongoDB instance is running, (ii) kill/stop a MongoDB instance, and (iii) start a MongoDB instance
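A minimal sketch of such a management program, assuming the Docker CLI is available and the MongoDB instance runs in a container named `mongodb` (a hypothetical name chosen here; adapt it to your deployment):

```python
import subprocess

CONTAINER = "mongodb"  # hypothetical container name used in this sketch


def _docker(*args):
    """Run a docker CLI command; return (exit code, stdout)."""
    try:
        result = subprocess.run(
            ["docker", *args], capture_output=True, text=True
        )
        return result.returncode, result.stdout.strip()
    except FileNotFoundError:
        # Docker CLI is not installed/available.
        return 127, ""


def is_running(name=CONTAINER):
    """(i) Test if the MongoDB container is currently running."""
    code, out = _docker("inspect", "-f", "{{.State.Running}}", name)
    return code == 0 and out == "true"


def stop_instance(name=CONTAINER):
    """(ii) Kill/stop the MongoDB container."""
    code, _ = _docker("stop", name)
    return code == 0


def start_instance(name=CONTAINER):
    """(iii) Start the MongoDB container."""
    code, _ = _docker("start", name)
    return code == 0
```

All three functions simply wrap the Docker CLI, so they work the same way for any containerized service, which is the point of the task: managing service lifecycles, not MongoDB internals.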

We do not assume that you have mastered MongoDB. If you do not know MongoDB, you can still do the homework, as it is mainly about managing services (for big data platforms). In our tutorial code, there are some parts dealing with MongoDB that you might take a look at.

2 - Analyzing data concerns in a big data pipeline

Assume that you take the data from Airbnb Dataset and combine it with crime data (e.g., from the government) for recommending accommodations. Which data concerns (e.g., accuracy, price, license) are important?

3 - Multiple types of data

Consider that your big data platform must support the analysis of Avian Vocalizations from CA & NV, USA. Would you consider using different types of data storage/databases, where each storage system (e.g., a database or file storage) would store only one type of data?

4 - Partitioning

For storing the BTS data, should we partition the data based on the station or on the timestamp of the data?

5 - Distribution

Given the BTS monitoring scenario, e.g., the BTS data, do you think we need to distribute data and analysis across multiple places?