Big Data
Scientific Areas
Classification | Scientific Area
OFICIAL | Informatics
Occurrence: 2022/2023 - 1st semester
Study Cycles/Courses
Acronym | No. of Students | Study Plan | Curricular Years | UCN Credits | ECTS Credits | Contact Hours | Total Hours
BINF | 26 | Study Plan | 3 | - | 5 | 67.5 | 135
Teaching - Responsibilities
Working language
Portuguese
Objectives
This curricular unit provides knowledge of tools for the storage, processing and visualization of large volumes of data, and develops skills in building and testing efficient algorithms for Big Data, namely through the study of parallel programming paradigms, models, tools and languages.
At the end of the course, the student should be able to:
- Determine the solution to be applied and the instruments to be used in the storage, exploration and analysis of a large volume of data
- Select appropriate visualization options to summarize and extract knowledge from a large volume of data
- Understand the concept of parallel and distributed processing as a way to increase performance in data management and analysis
- Develop algorithms and models to solve problems that explore the management of concurrency, distribution and parallelism
- Recognize the different hardware architectures that support the operation of these algorithms
Learning outcomes and competences
Not applicable
Working mode
In-person
Program
1. Visualization of large data volumes
2. Large-scale storage
Non-relational databases (key-value, document-oriented, column family, graph-oriented)
Comparison between relational and non-relational databases (a document-store sketch follows this program)
3. Parallel Programming Models
Shared Memory Model
Thread Model
Distributed memory
Message passing model
Parallel Data Model (see the PySpark word-count sketch after this program)
Hybrid model
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)
4. Design of parallel programs (see the partitioning sketch after this program)
Automatic parallelization vs. Manual
Partitioning
Communications
Synchronization
Data dependencies
Load balancing
Granularity
I/O
Debugging
Performance analysis and tuning
5. Parallel Algorithms
Parallel Algorithms for Sequences and Strings
Parallel Algorithms for Trees and Graphs
Parallel Algorithms for Numerical/Scientific Computation
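
As a minimal illustration of the document-oriented storage covered in topic 2, the sketch below stores and queries JSON-like documents in MongoDB through the pymongo driver. The driver, connection string, and database/collection names are assumptions made for this example and are not part of the course materials.

    # Document-store sketch: assumes a local MongoDB instance and the pymongo driver.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    readings = client["bigdata_demo"]["sensor_readings"]  # illustrative names

    # Documents are schema-free JSON-like records; fields may differ between documents.
    readings.insert_many([
        {"station": "A", "temperature": 21.5, "humidity": 0.62},
        {"station": "B", "temperature": 19.1},
    ])

    # Field-level query, the document-oriented counterpart of a relational WHERE clause.
    for doc in readings.find({"temperature": {"$gt": 20}}):
        print(doc["station"], doc["temperature"])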
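
To make the parallel data model and the SPMD pattern of topic 3 concrete, here is a minimal word-count sketch in PySpark (one of the tools listed under Software). The input path is a placeholder; the same code is applied by every worker to its own partition of the data.

    # Data-parallel (SPMD) word count; assumes a local PySpark installation
    # and an input text file at the placeholder path below.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

    counts = (
        spark.sparkContext.textFile("data/corpus.txt")   # partitioned input
        .flatMap(lambda line: line.split())              # same code on every partition
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)                 # shuffle and aggregate by key
    )
    print(counts.take(10))
    spark.stop()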
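
The design issues of topic 4 (partitioning, load balancing, granularity) can be sketched with the Python standard library alone; multiprocessing is not among the listed course software and is used here only as an illustration.

    # Partitioning and load-balancing sketch with a process-based model.
    from multiprocessing import Pool

    def partial_sum(chunk):
        # Each worker runs the same code on its own partition (coarse granularity).
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        n_workers = 4
        # Strided partitioning keeps the chunks equally sized (static load balancing).
        chunks = [data[i::n_workers] for i in range(n_workers)]
        with Pool(n_workers) as pool:
            total = sum(pool.map(partial_sum, chunks))
        print(total)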
Mandatory Bibliography
Sadalage, P. J. and Fowler, M.; NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pearson Education, 2012
O'Neil, C. and Schutt, R.; Doing Data Science: Straight Talk from the Frontline, O'Reilly, 2013
Leskovec, J., Rajaraman, A. and Ullman, J.; Mining of Massive Datasets, Cambridge University Press, 2nd Ed., 2014
White, T.; Hadoop: The Definitive Guide, O'Reilly, 2015
Wilke, C. O.; Fundamentals of Data Visualization, O'Reilly, 2019
Pacheco, P.; An Introduction to Parallel Programming, 2nd Ed., Morgan Kaufmann, 2021
Kleppmann, M.; Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, O'Reilly, 2017
Complementary Bibliography
Knaflic, C. N.; Storytelling with Data, Wiley, 2015
Teaching methods and learning activities
The predominant teaching methodologies will be the presentation of concepts using slides and the demonstration of examples in the computer laboratory. Students will be constantly challenged to solve new problems based on the examples already demonstrated, and to reflect on the results and performance of the storage and processing techniques under study.
Software
PySpark
Python
MongoDB
Type of assessment
Distributed evaluation with final exam
Assessment Components
Designation | Weight (%)
Test | 30.00
Written work | 70.00
Total: | 100.00
Workload Components
Designation | Time (Hours)
Autonomous study | 82.50
Class attendance | 52.50
Total: | 135.00
Eligibility for exams
Not applicable
Formula for calculating the final grade
Continuous assessment
- Final grade = 30% * project1 + 35% * project2 + 35% * test (a worked example follows below)
Final assessment
- Final grade = 30% * project1 + 35% * project2 + 35% * exam
The 100% exam-based assessment regime does not apply (that is, an exception regime is used) because, given the learning objectives and the skills to be acquired, the student must develop a strong practical component in the use of tools for storing and processing big data.
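
For illustration only, a minimal calculation under the continuous assessment formula; the component marks are invented and assume the usual 0-20 grading scale.

    # Hypothetical worked example of the continuous-assessment formula.
    project1, project2, test = 14.0, 16.0, 12.0   # invented marks on a 0-20 scale
    final_grade = 0.30 * project1 + 0.35 * project2 + 0.35 * test
    print(round(final_grade, 2))  # 14.0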