13 Assessment Briefings
- All assessments are to be submitted via Canvas
- Group assignments should be submitted by all team members
- Submissions must have a title page with subject name, student name(s) and ID numbers, and the title of the assessment
- All submitted files should follow this naming format:
- For individual assignments: StudentName_TaskName_Date (e.g. JaneSmith_Task1PartA_21032017.docx)
- For group assignments: TeamName_TaskName_Date (e.g. Team1_Task1PartA_21032017.docx)
- You can apply for an extension of up to 1 week (via email to the subject coordinator) with a valid reason
- Extensions for more than 1 week require a formal application for extension
- Unless extension arrangements are made, late submissions are penalised 10% for each day after the due date
- All marks and feedback comments will be recorded in Review
13.1 Assessment 1
Collaborative Development using Centralized Code Repositories
In 2017 a small group of motivated MDSI students wrote an e-book to help new students learn everything they needed to know about the MDSI program, as well as providing help and advice about how to approach their learning.
The e-book is falling into a state of disrepair, and for your first assessment you will be collaborating as part of a team to bring it back up to date, making it the go-to resource for new students. The MDSI Student Guide is great, but it is showing its age. Two years is a long time in data science (and at UTS), so there are some sections which need a rewrite:
- MDSI Technology makes reference to a bunch of tools which are no longer popular, and is missing some of the most popular technologies in use today.
- Data Futures was an elective in 2018, and one of the assessments required students to write an essay for inclusion in MDSI Student Guide (which was known as FlipAround at the time). There is some great material here, but the way it is included feels a bit awkward.
- The Guide for Contributors is a little light on details, and would benefit from a rewrite using some of the content from Data Science Practice!
- Despite making up 25% of the MDSI program, there is nothing in FlipAround about how to get the most out of your iLab projects.
- There is no information to help students when choosing electives to support their career objectives.
- There are many references to the Connected Intelligence Centre (CIC), which is where the MDSI program used to live. We’ve since moved to the Faculty of Transdisciplinary Innovation (FTDI) and this requires a bunch of updates.
- The book mentions our MDSI Slack but doesn’t talk about any of the Slack Conventions.
- The documentation about core courses and electives is out of date, and needs to be updated to include Deep Learning and Data Science Practice.
For this assignment, you and your team (3-5 students per team) will be tasked with selecting one of the above improvements and writing new content to bring the MDSI Student Guide back to its former glory. You will submit your assignment by raising a Pull Request with your improvements to the MDSI Student Guide.
You may select one of the above improvements, or propose your own based on your MDSI experience. If you think there is something that all new MDSI students need to know, please discuss your idea with the teaching team (get agreement from your team first) and we’ll let you know whether or not your suggestion is appropriate for the assessment. Your team will:
- Sign up for Bitbucket accounts with your student email
- Fork the MDSI Student Guide repository (one person per team), then add each member of your team as a collaborator.
- Clone the repository to each of your computers
- Use branching to make changes to the book without clashing with each other
- Use Pull Requests to discuss your changes to the master branch of your forked repository, and merge your changes as necessary
- Continue creating branches and using Pull Requests to merge your changes until the whole team is happy with the changes to the MDSI Student Guide
- Raise a Pull Request to merge the changes from your forked repository into the main repository
- Negotiate with the MDSI Student Guide Custodian (repository owner) who may ask for further changes before accepting your improvements
- Merge your work onto the master branch of the main repository using Bitbucket.
- Create a short report about your team’s use of branches and Pull Requests to collaborate through Git.
- Include a visualisation of the commits and branches (can be created using any tool you like - doesn’t have to be generated from data)
- Include information about the final changes your team made by using the git diff command
- Include links to each of your team’s Pull Requests, and to the final Pull Request in the MDSI Student Guide repository
- Expected length is approximately 1-2 typed pages (including images), but there is no formal word limit.
- Submit through Canvas.
Atlassian refers to this as the Forking Workflow, and you can read more about it in the Atlassian Git Documentation
Assessment Criteria
Your team will be assessed on the following criteria:
- Quality and clarity of written content submitted to the MDSI Student Guide, including appropriateness of format and communication style (10%)
- Appropriateness of commits and branches to collaborate within a team using Git, adhering to one of the documented workflows (20%)
- Clarity and efficiency of content review and change negotiation using Pull Requests, and successful incorporation of individual changes into the team’s master branch (30%).
- Each team member must raise at least one Pull Request.
- Clarity and efficiency of change summary using Pull Request to successfully negotiate and merge changes into the MDSI Student Guide master branch (20%).
- Your Pull Request will likely involve additional changes, negotiated with the MDSI Student Guide repository owner.
- Make sure to raise this Pull Request at least a week before the due date, so that you have enough time to negotiate, make edits, get approval, and merge your changes.
- Insightfulness in identifying appropriate Git workflow and persuasive rationale for selecting it (20%)
13.2 Assessment 2A
Programming Cheat Sheets
Whilst working through the assigned courses on DataCamp, you’ll be learning about different ways to achieve the same task, using both R and Python.
In order to help you incorporate and retain your new knowledge, you will be tasked with creating your own custom cheat sheet which you can refer to when you need to work with R or Python in future. This cheat sheet can take any form you like, and may be as long or as short as you like, with the following conditions:
- Your cheat sheet must provide R and Python 3 code examples for each task on your cheat sheet. You may use CRAN or PyPI packages as part of your examples, however you should limit yourself to the “popular” packages for each language.
- You must identify at least 10 data science tasks to include in your cheat sheet. For example:
- print() is not a data science task
- filtering a data frame is a data science task
- loading data from a CSV file is a data science task
- For each task, you must provide equivalent statements using both R and Python 3 (a rough sketch of what a single entry might look like follows this list)
- You may take inspiration from cheat sheets you find on the internet, however each code example must be written by you based on what you have learned from the DataCamp courses.
- You must include some explanation for each example, outlining what the code does, and when you would use it in a data science project
- You must use a consistent coding style throughout the assignment
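To make these conditions concrete, here is a rough sketch of what the Python half of a single cheat sheet entry might look like, covering two of the example tasks above. The file name and column name are invented for illustration, and your own entries would pair each snippet with its R equivalent and your own explanation.
# Python
import pandas as pd

# Task: loading data from a CSV file
# Reads a CSV file into a pandas data frame - the starting point of most analyses.
df = pd.read_csv("survey_results.csv")      # hypothetical file name

# Task: filtering a data frame
# Keeps only the rows that satisfy a condition, e.g. subsetting before analysis.
recent = df[df["year"] >= 2018]             # hypothetical column name
print(recent.head())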
Expected length is approximately 2-5 typed pages (including images and code snippets), but there is no formal word limit. You must submit your assignment through Canvas.
Assessment Criteria
You will be assessed on the following criteria:
- Relevance of at least 10 common data science tasks identified (10%)
- Correctness of identification of equivalent code snippets using R and Python 3 (50%)
- Consistency of programming style for all R snippets (10%)
- Consistency of programming style for all Python snippets (10%)
- Clarity of explanations (20%)
13.3 Assessment 2B
MDSI Slack Analysis
MDSI Students have been using Slack since 2016 to chat with friends, collaborate with teammates, assist with troubleshooting, and share resources.
Slack allows administrators to download chat log data for all public conversations on the MDSI Slack account (this excludes direct messages and private channels - your private messages remain private). For this assignment we have downloaded the full chat logs from the start of the MDSI Slack until March 2019, and made a subset of the data available in a SQL database for you to analyse. This assessment will task you with preparing an end-to-end data science project using all three of the languages we have learned so far in the course. Students must:
- Write an appropriate SQL query to construct a dataset for analysis (an illustrative query appears after the connection snippets in the next section)
- You must use at least one join
- You must use at least one filter
- Use Python or R to connect to the database, and execute your SQL query to load the dataset directly into Python or R
- Use R and Python to analyse and present insights about the dataset
- You must pass data between R and Python at least once (a rough sketch of one way to do this appears below)
- You must produce at least one visualisation in either language
Outside of these requirements, your choice of how to analyse the data and what insights to present is up to you. You may refer to the assessment criteria below for clear information about what is expected, keeping in mind that the primary aim of this assessment is to demonstrate your ability to create a meaningful piece of analysis using a combination of R, Python and SQL.
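If you are unsure how to satisfy the requirement to pass data between R and Python, there are a few common options: in an R Markdown or notebook setting the reticulate package (on the R side) is the usual choice, and from the Python side the rpy2 package can hand a pandas data frame to an R session. The snippet below is a minimal sketch using rpy2 3.x with a made-up data frame; use whichever mechanism best fits your workflow.
# Python
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# a toy data frame standing in for your own analysis results
counts = pd.DataFrame({"channel": ["general", "random"], "messages": [120, 45]})

# convert the pandas data frame to an R data frame and make it visible to R
with localconverter(ro.default_converter + pandas2ri.converter):
    r_counts = ro.conversion.py2rpy(counts)
ro.globalenv["counts"] = r_counts

# run an R command against the transferred data
print(ro.r("summary(counts)"))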
13.3.1 Connecting to the database
The Slack logs have been pre-processed and loaded into an AWS PostgreSQL RDS database. There are three tables in this database:
- users - contains publicly-viewable data about each user in the MDSI Slack instance.
- channels - contains data about each public channel in the MDSI Slack instance.
- messages - contains all messages posted in public channels in the MDSI Slack instance.
To connect to the database, you may like to refer to these code snippets for R and Python (connection credentials will be provided in class):
# R
library(RPostgreSQL)
con <- dbConnect(drv = dbDriver('PostgreSQL'),
                 host = <host>,
                 port = <port>,
                 user = <user>,
                 password = <password>,
                 dbname = 'mdsislack')
users <- dbGetQuery(con, "select * from users")
print(users)
dbDisconnect(con)
# Python
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://<user>:<password>@/mdsislack?host=<host>&port=<port>")
con = engine.connect()
rs = con.execute('select * from users')
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
print(df)
con.close()
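Your own query will replace the simple select above, and must include at least one join and one filter. As a rough illustration only (the join and filter columns below are assumed - check the real table schemas once you have connected), a query like this can be loaded straight into pandas using the engine created in the Python snippet above:
# Python
import pandas as pd

# join each message to the channel it was posted in, keeping a single channel
# NOTE: id, channel_id and name are assumed column names - verify against the actual schema
query = """
select m.*, c.name as channel_name
from messages m
join channels c on c.id = m.channel_id
where c.name = 'general'
"""
df = pd.read_sql(query, engine)   # 'engine' is the SQLAlchemy engine created above
print(df.head())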
Assessment Criteria
You will be assessed on the following criteria:
- Correctness and appropriateness of SQL query (10%)
- Readability and consistency of style for SQL query (10%)
- Appropriateness of use of Python or R to successfully load dataset from database using a SQL query (10%)
- Efficiency and conciseness of use of Python to perform analysis tasks on the dataset (10%)
- Consistency and clarity of Python code style (10%)
- Efficiency and conciseness of use of R to perform analysis tasks on the dataset (10%)
- Consistency and clarity of R code style (10%)
- Appropriateness of use of R and Python commands to reliably transfer data between languages (10%)
- Insightfulness and clarity of written explanations (20%)
13.4 Assessment 3
Reproducibility and Risk Audit
Throughout your MDSI course, you’ve probably done more than a few projects that haven’t aged very well. You have likely found yourself in a second year course learning about a new way of doing something and thought about how it would have helped you on a previous assignment. You have probably hoped that no one ever looks at those old assignments again, because with everything you now know, you look back on those assignments with a tinge of embarrassment.
Luckily for you, this assessment task will give you a chance to go back and right past wrongs! Your individual assignment is to select one of your own projects from a previous class (e.g. DSI, iLab) and conduct a reproducibility and risk audit. This report will present recommendations for how to improve the reproducibility of the analysis, along with general commentary on reproducibility best practice. You’ll be required to:
- identify risks and reproducibility issues that may prevent someone else reproducing your work
- recommend specific, detailed strategies to improve the project by reducing or removing those risks (a small sketch of two such strategies follows this list)
- think about how tools and techniques from Data Science Practice can help improve reproducibility: version control, code style and layout, using code for every step of your analysis, literate programming, containerisation, well-defined APIs, etc
- you may like to take the opportunity to research one of the many open source tools currently in development that aim to solve some of the most common reproducibility challenges: conda, packrat, mlflow, etc
- there is no shortage of blog posts, opinion pieces and tutorials online with more ideas about how to improve reproducibility
- re-write the core analysis work from the project you have selected, using literate programming techniques (notebooks) and focusing on writing code which can be clearly understood and easily reproduced by others
- include a section outlining the benefits of containers and a guide on how to use them to improve the reproducibility of data science projects.
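As a small, concrete illustration of the kind of low-effort strategy you might recommend (the seed value and package choices are purely illustrative), fixing random seeds and recording the exact environment are two common first steps:
# Python
import random
import sys

import numpy as np
import pandas as pd

# 1. Fix every source of randomness so the analysis produces identical results on re-runs
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# 2. Record the environment alongside your results so that others can rebuild it
print("Python:", sys.version.split()[0])
print("numpy:", np.__version__, "| pandas:", pd.__version__)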
For your assessment, you must:
- Prepare a report detailing the reproducibility and risk issues created by a previous analytical task, including recommendations for how to remediate the analytical task and ensure reproducibility.
- Identify reproducibility issues present in the analysis
- Identify reproducibility risks which could become issues
- Detail how these issues could cause problems for reproducibility if someone else wanted to recreate the findings of the project
- Suggest remediation strategies for all issues presented
- Re-write the core components of the analytical task (the same one discussed in the report) using literate programming techniques (notebooks)
- We suggest attaching this as an appendix to the report
- Articulate how containers could be used to eliminate or reduce certain types of reproducibility risks
- Outline what containers are and how they operate
- Outline how containers can be used to create and share data science environments
- Specify how one or more containers could be designed for your project, and how it would improve items called out in the audit
- You may write and submit a Dockerfile for your project if you like, although this is not required for the assessment
Assessment Criteria
You will be individually assessed on the following criteria:
- Insightfulness, clarity and persuasiveness of audit findings, and appropriateness of remedial recommendations. (60%)
- Quality and level of alignment of re-written analysis code to the body of the report. (20%)
- Persuasiveness of justification of the use of containerisation as a strategy for reducing certain reproducibility risks (20%)