Instructions on Assessment:
- You are expected to produce a word-processed answer to this assignment. Please use Arial font and a font size of 12. For SQL code and output, you can use Courier New font, which preserves SQL format and layout.
- You are required to use the Harvard Style of referencing and citation. The “Cite them right” guide is recommended for referencing and citation (Pears and Shields, 2008) and should be followed throughout your work where needed. Please do not include references to lecture notes.
Mapping to Programme Goals and Objectives
This assignment covers the following learning outcomes for the module:
Knowledge & Understanding:
2. Demonstrate deep knowledge of key concepts of data warehousing, data analytics, data standards, and data quality
Intellectual / Professional skills & abilities:
4. Appraise, analyse, design, develop and evaluate data warehousing and data analytics solutions using Oracle database system
Personal Values Attributes (Global / Cultural awareness, Ethics, Curiosity) (PVA):
5. Develop critical awareness of the responsibilities of database developer with respect to professional, legal, security and ethical issues individually or as part of a team
Assessment Regulations
You are advised to read the guidance for students regarding assessment policies (Northumbria, 2019). They are available online here.
Late submission of work
Where coursework is submitted late without approval, after the published hand-in deadline, the following penalties will apply.
For coursework submitted up to 1 working day (24 hours) after the published hand-in deadline without approval, 10% of the total marks available for the assessment (i.e. 100%) shall be deducted from the assessment mark.
Coursework submitted more than 1 working day (24 hours) after the published hand-in deadline without approval will be regarded as not having been completed. A mark of zero will be awarded for the assessment and the module will be failed, irrespective of the overall module mark.
These provisions apply to all assessments, including those assessed on a Pass/Fail basis.
The full policy can be found here.
Students must retain an electronic copy of this assignment (including ALL appendices), and it must be made available within 24 hours of a request for it to be re-submitted.
Academic Misconduct
The Assessment Regulations for Taught Awards (ARTA) contain the Regulations and procedures applying to cheating, plagiarism and other forms of academic misconduct.
The full policy is available here.
You are reminded that plagiarism, collusion and other forms of academic misconduct, as referred to in the Academic Misconduct procedure of the assessment regulations, are taken very seriously. Assignments in which evidence of plagiarism or other forms of academic misconduct is found may receive a mark of zero.
Criteria for success:
For textual components:
80-100% – The description will excellently cover all the specific topics requested. The written work will be fluent, clearly presented and of outstanding quality.
70-79% – The description will comprehensively cover all the specific topics requested. The written work will be fluent and clearly presented and of distinctive quality.
60-69% – The student will show a very good knowledge of the specific topics, with very good presentation skills and quality.
50-59% – The student will show an above average knowledge of the specific topics, with above average presentation skills and quality.
40-49% – There will be an inadequate description of a significant proportion of the topics requested. There will be no major failures in presentation clarity though partly inadequate.
Less than 40% – There will be little or no information conveyed in an intelligible manner on the specific topics requested.
For SQL and other database technical components:
80-100% – The students will produce exceptional models and solutions, and will demonstrate the use of notation/language, which have outstanding syntactic accuracy (e.g., following sound algorithms, standards, methods, error-free SQL code) with exceptional semantic relevance (e.g., are relevant to the requirements of the particular scenario).
70-79% – The students will produce fully complete models and solutions, and will demonstrate the use of notation/language, which have high syntactic accuracy, with high semantic relevance.
60-69% – The students will produce almost complete models and solutions, and will demonstrate the use of notation/language, which have appropriate syntactic accuracy and reasonably good semantic relevance.
50-59% – The students will produce fairly complete models and solutions, and will demonstrate the use of notation/language, which have adequate syntactic accuracy with reasonable semantic relevance.
40-49% – The students will produce models and solutions, and will demonstrate the use of notation/language, which have some syntactic accuracy and semantic relevance but are on balance inadequate as a whole.
Less than 40% – The students will not produce sufficient models and solutions, and/or will be unable to demonstrate the use of notation/language with significant syntactic accuracy and/or significant semantic relevance.
Assessment Background and Scenario
This assessment is based on two scenarios as follows:
- The Sales History (SH) Data Warehouse scenario. SH is a sample database schema provided by Oracle, which has been extensively used in Oracle’s Data Warehousing Guide (Lane, 2013). The details of this scenario are provided in Appendix 1.
- UniTel’s Customer Churn dataset described in Appendix 2.
Assignment Questions
Part 1: Data Warehousing Tasks (50 Marks)
This part is based on the Sales History scenario as described in Appendix 1.
You must submit all the SQL queries and any other code that you wrote in answering any of the tasks / questions (e.g., the use of Explain Plan statements for the queries and their outputs using Spooling or other suitable means).
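As an illustration of the kind of evidence expected, a query plan and its output might be captured along the following lines (the file name and sample query are illustrative only, not part of the requirement):

```sql
-- Capture all output to a file (SQL*Plus / SQLcl)
SPOOL part1_q1_output.txt

-- Generate the execution plan for a sample query over the SH tables
EXPLAIN PLAN FOR
SELECT c.cust_city, SUM(s.amount_sold)
FROM   sales s
JOIN   customers c ON s.cust_id = c.cust_id
GROUP  BY c.cust_city;

-- Display the plan, including the optimizer's cost estimates
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

SPOOL OFF
```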
- Study the index definitions in sh_idx.sql. Discuss in detail (using cost-based analysis) why these indexes (choose three different ones) are useful for answering queries over the SH2 and DWU versions of the database. You should not run the sh_idx.sql script at all.
(9 marks)
- Identify three new indexes and justify why they could be useful. Write the SQL code for creating these indexes under your DWU account. Give example queries with cost-based analysis for both DWU account (which will have the new indexes) and SH2 shared schema (which will NOT have any of your new indexes). Alternatively, you may choose to run the same queries on your DWU account before and after creating your proposed three indexes.
(9 marks)
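As a sketch only (the index, its name, and the sample query are illustrative assumptions, not a prescribed answer, and any proposal should be checked against the indexes already defined in sh_idx.sql), a new index and its cost comparison might take this shape:

```sql
-- Hypothetical bitmap index on a low-cardinality dimension attribute
CREATE BITMAP INDEX times_calyear_bix
  ON times (calendar_year);

-- Run the same query on SH2 and on DWU (after index creation)
-- and compare the optimizer costs in the two plans
EXPLAIN PLAN FOR
SELECT t.calendar_year, SUM(s.amount_sold)
FROM   sales s
JOIN   times t ON s.time_id = t.time_id
GROUP  BY t.calendar_year;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```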
- Given the two materialized views (MVs) defined in sh_cremv.sql, discuss in detail why these MVs are useful for users of the SH database. You should provide detailed examples of cost based analysis, e.g., using Explain Plan for running sample queries on both SH2 and DWU to illustrate your answer. You should not run the sh_cremv.sql script at all.
(8 marks)
- Identify three new MVs based on the base tables in the SH schema under your DWU account and justify why they would be useful for the users of your data warehouse. Write the SQL code for creating these MVs. Moreover, run sample queries on both SH2 and DWU to ensure that queries running on DWU will be re-written by Oracle to use your proposed three MVs instead of the base tables used in the sample queries. Note that you must not query your MVs directly in the FROM clause; let the Oracle Query Optimizer re-write the queries and answer them using your proposed MVs.
(12 marks)
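A new MV might be sketched as follows (the MV name and query are illustrative assumptions; query rewrite also depends on the QUERY_REWRITE_ENABLED setting, which is TRUE by default in current Oracle releases):

```sql
-- Hypothetical MV; ENABLE QUERY REWRITE lets the optimizer answer
-- matching base-table queries from the MV without it being named
CREATE MATERIALIZED VIEW sales_by_country_mv
  ENABLE QUERY REWRITE
AS
SELECT c.country_id, SUM(s.amount_sold) AS total_sold
FROM   sales s
JOIN   customers c ON s.cust_id = c.cust_id
GROUP  BY c.country_id;

-- This base-table query should then be rewritten to use the MV;
-- verify via the plan that a MAT_VIEW REWRITE ACCESS step appears
EXPLAIN PLAN FOR
SELECT c.country_id, SUM(s.amount_sold)
FROM   sales s
JOIN   customers c ON s.cust_id = c.cust_id
GROUP  BY c.country_id;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```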
- Prior to the introduction of the special aggregation function CUBE, it was not possible to express an aggregation over different levels within a single SQL statement without using the set operation UNION ALL. Every aggregation level needed its own SQL aggregation expression, operating on the exact same data set n times, once for each of the n different aggregation levels. With the introduction of CUBE in recent database systems, Oracle provided a single SQL command for handling the aggregation over different levels within a single SQL statement, not only improving the runtime of this operation but also reducing the number of internal operations necessary to run the query and reducing the workload on the system.
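The contrast can be sketched with a minimal two-attribute example over the SH tables (the attribute choice is illustrative and deliberately smaller than the task requires):

```sql
-- One statement computes all 2^2 = 4 grouping levels:
-- (year, category), (year), (category), and the grand total
SELECT t.calendar_year,
       p.prod_category,
       SUM(s.amount_sold) AS total_sold
FROM   sales    s
JOIN   times    t ON s.time_id = t.time_id
JOIN   products p ON s.prod_id = p.prod_id
GROUP  BY CUBE (t.calendar_year, p.prod_category);
-- Without CUBE, the same result requires four separate GROUP BY
-- queries combined with UNION ALL, each re-reading the joined data.
```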
- (a) Using CUBE, write an SQL query over the SH schema under your DWU account involving one fact table (SALES or COSTS), at least two dimension tables, and at least three grouping attributes. Provide output of successful execution of your query. Provide reasons why your query may be useful for users of the SH data warehouse.
(3 marks)
- (b) Using the set operation UNION ALL (and not CUBE), write an SQL query that produces the same result as the query in (a) above. Provide output of successful execution of your query.
(5 marks)
- (c) Using EXPLAIN PLAN, provide a detailed discussion analysing the costs of evaluating the above queries (i.e. with and without CUBE).
(4 marks)
Part 2: Data Mining Tasks (35 Marks)
This part is based on the UniTel scenario as described in Appendix 2. Moreover, you must use the DMUn Oracle Data Mining Account (where 1 <= n <= 75, e.g., DMU1, DMU2) allocated to your group.
Jessica is the customer relations manager at UniTel. She wants to know the likelihood of potential churn among the company’s customers based on previous experience, so that she may be able to take actions to retain them.
To help Jessica with her analysis, we need to investigate which algorithm is suitable for solving her problem. The data from last year are used as the training data, and the data from February of this year are taken as the testing data to verify the model’s accuracy. All columns of the data are used to build the model. Several algorithms could meet this requirement.
Oracle Data Mining (ODM) provides the following algorithms for classification:
- Decision Tree
- Naive Bayes
- Generalized Linear Models (GLM)
- Support Vector Machines (SVM)
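As a sketch of how one of these algorithms might be selected through ODM's PL/SQL API (the table, model, and column names below are hypothetical, not taken from Appendix 2):

```sql
-- Settings table naming the chosen algorithm (Decision Tree here)
CREATE TABLE churn_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);

INSERT INTO churn_settings (setting_name, setting_value)
VALUES ('ALGO_NAME', 'ALGO_DECISION_TREE');

-- Build a classification model on the training data
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'CHURN_DT_MODEL',   -- hypothetical model name
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'CHURN_TRAINING',   -- hypothetical training table
    case_id_column_name => 'CUSTOMER_ID',      -- hypothetical case id column
    target_column_name  => 'CHURN',            -- hypothetical target column
    settings_table_name => 'CHURN_SETTINGS');
END;
/
```

Swapping the setting value (e.g. to 'ALGO_NAIVE_BAYES' or 'ALGO_SUPPORT_VECTOR_MACHINES') selects a different classifier under the same build procedure.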