Teaching. Foundations

Jargon Busting

Term: Description
Dataset: A collection of related sets of information that is composed of separate elements, but can be manipulated as a unit by a computer.
Pivot table: A pivot table can automatically sort, count, total or average the data stored in one table or spreadsheet, displaying the results in a second table showing the summarized data. We'll be covering some of these processes in the next couple of exercises.
Data science: An interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
Formula: An expression telling the computer what mathematical operation to perform upon a specific value. (Instructor: demonstrate a formula in a spreadsheet.)
Average: A number expressing the central or typical value in a set of data, in particular the mode, median, or (most commonly) the mean, which is calculated by dividing the sum of the values in the set by their number.
Algorithm: A list of rules to follow in order to solve a problem.
Big data: Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.
Regular expression: A sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text. (Instructor: show some examples of regular expressions.)
Data protection: Data protection principles govern how your personal information is used by organisations, businesses or the government.
Open data: Open data is data that can be freely used, re-used and redistributed by anyone, subject only, at most, to the requirement to attribute and share alike.
Aggregation: A process in which information is gathered and expressed in a summary form, for purposes such as statistical analysis.
Anonymisation: The process of turning data into a form which does not identify individuals and where identification is not likely to take place.
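
A few of the terms above (average, aggregation and regular expression) can be made concrete with a short Python sketch. The loan records, branch names and sample sentence are invented for illustration and do not come from any real dataset.

```python
import re
import statistics
from collections import Counter

# Hypothetical dataset: one record per loan, invented for illustration.
loans = [
    {"branch": "Central", "days_on_loan": 7},
    {"branch": "Central", "days_on_loan": 14},
    {"branch": "North", "days_on_loan": 7},
    {"branch": "North", "days_on_loan": 21},
    {"branch": "North", "days_on_loan": 7},
]

days = [loan["days_on_loan"] for loan in loans]

# Average: three common measures of the "typical" value.
mean_days = statistics.mean(days)      # sum of the values divided by their number
median_days = statistics.median(days)  # middle value once the values are sorted
mode_days = statistics.mode(days)      # most frequently occurring value

# Aggregation: summarise many records as a count per branch.
loans_per_branch = Counter(loan["branch"] for loan in loans)

# Regular expression: find four-digit numbers (here, years) in a longer text.
text = "First published 1847; reprinted 1902 and 2014."
years = re.findall(r"\b\d{4}\b", text)

print(mean_days, median_days, mode_days)
print(dict(loans_per_branch))
print(years)
```

The pattern `\b\d{4}\b` reads as "a word boundary, exactly four digits, a word boundary", so it picks out standalone four-digit numbers and ignores longer runs of digits.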

We'll now spend some time on a combination of best practice and generic skills.

The Computer is Stupid

This does not mean that the computer isn't useful. Given a repetitive task, an enumerative task, or a task that relies on memory, it can produce results faster than you or I can. But the computer only does what you tell it to. If it throws up an error, it is often not your fault. Rather, in most cases the computer has failed to interpret what you mean, because it can only work with what it knows (ergo, it is bad at interpreting).

This is not to say that the people who told the computer what to tell you when it doesn't know what to do couldn't have done a better job with error messages, for they could. It isn't the computer's fault that it is giving you an archaic and incomprehensible error message; it is a human being's.

To do: put an example of an error message here.
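
As one possible illustration (a sketch added for teaching purposes, with an invented function and invented numbers), the Python snippet below asks the computer to add a piece of text to a number. A person would guess that "120" means 120; the computer does not guess, and reports a `TypeError` instead.

```python
def add_visitor_counts(monday, tuesday):
    """Add two daily visitor counts (hypothetical helper for this example)."""
    return monday + tuesday


try:
    # "120" is text (a string), while 95 is a number: the computer will not
    # guess that the text was meant to be read as a number.
    add_visitor_counts("120", 95)
except TypeError as error:
    message = f"{type(error).__name__}: {error}"
    print(message)

# Telling the computer exactly what we mean resolves the problem.
total = add_visitor_counts(int("120"), 95)
print(total)
```

The exact wording of the message depends on the Python version, but it will name the kind of error and the types involved, which is usually enough to work out what the computer could not interpret.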

Why take an automated or computational approach?

Otherwise known as the 'why not do it manually?' question. There are still plenty of things that we do manually that a machine could do in an automated way, because either:

• a) We don't know how to automate the task; or
• b) We're unlikely to repeat the task, so automating it would take longer than doing it by hand.

However, once you know you'll need to repeat a task, you have a compelling reason to consider automating it. This is one of the main areas in which programmatic ways of working, outside of IT service environments, are changing library practice. Andromeda Yelton, a US-based librarian closely involved in the Code4Lib movement, put together an excellent American Library Association Library Technology Report called Coding for Librarians: Learning by Example.

The report is pitched at a real-world relevance level. In it, Andromeda describes scenarios that library professionals told her about, in which learning a little programming, usually ad hoc, had made a difference to their work, to the work of their colleagues, and to the work of their library.

Main lessons:

• Borrow, Borrow, and Borrow again. This is a mainstay of programming and a practice common to all skill levels, from professional programmers to people like us hacking around in libraries;
• The correct language to learn is the one that works in your local context. There truly isn’t a best language, just languages with different strengths and weaknesses, all of which incorporate the same fundamental principles;
• Consider the role of programming in professional development, both your own and that of those you manage;
• Knowing (even a little) code helps you evaluate projects that use code. Programming can seem alien, but knowing some code makes you better at judging the quality of software development, or at planning activities that include software development;
• Automate to make the time to do something else! Taking the time to gather together even the simplest programming skills can save time to do more interesting stuff! (even if often that more interesting stuff is learning more programming skills …)

Why Automate?: see Geeks and repetitive tasks image.

Working with data often involves repetitive actions, which can be a strain when using a mouse. It's always worth learning keyboard shortcuts.

Shortcut: What it does
Ctrl + s: Saves the current document
Ctrl + c: Copies the current selection to the clipboard
Ctrl + x: Cuts the current selection to the clipboard
Ctrl + v: Pastes the current clipboard contents to the current document
Ctrl + Shift + End: In spreadsheets, highlights to the bottom of a column

...and many, many more.

Plain text formats are your friend

Why? Because computers can process them!

If you want computers to be able to process your stuff, try, where possible, to get in the habit of using platform-agnostic formats such as .txt for notes and .csv or .tsv for tabulated data (the latter pair are simply tabular formats stored as plain text, with values separated by commas and tabs respectively).

CSV and Text Files

A CSV (comma-separated values) file stores data in a table-structured format. Opened in a spreadsheet program, a CSV looks like a garden-variety spreadsheet, but it has a .csv extension and is really a text file containing information separated by commas, hence the name.
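
To see that a CSV really is just text, the sketch below uses Python's standard `csv` module to turn a small table into comma-separated and tab-separated text and then reads it back. The book titles and the `to_delimited_text` helper are invented for this example.

```python
import csv
import io

# Hypothetical table: a header row followed by two records.
rows = [
    ["title", "year", "copies"],
    ["Middlemarch", "1871", "3"],
    ["Bleak House, Vol. 1", "1852", "2"],
]


def to_delimited_text(table, delimiter):
    """Write a table to plain text, one row per line, using the given separator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(table)
    return buffer.getvalue()


csv_text = to_delimited_text(rows, ",")   # what a .csv file would contain
tsv_text = to_delimited_text(rows, "\t")  # what a .tsv file would contain

print(csv_text)

# Reading the text back recovers exactly the same table.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Note that the title containing a comma is wrapped in double quotes in the CSV text, which is how the format tells a comma inside a value apart from a comma between values.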

These plain text formats are preferable to the proprietary formats used as defaults by Microsoft Office because they can be opened by many software packages and have a strong chance of remaining viewable and editable in the future. Most standard office suites include the option to save files in .txt, .csv and .tsv formats, meaning you can continue to work with familiar software and still take appropriate action to make your work accessible.

Compared to .doc or .xls, these formats have the additional benefit of containing only machine-readable elements. Whilst using bold, italics, and colouring to signify headings or to make a visual connection between data elements is common practice, these display-orientated annotations are not (easily) machine-readable. As a result, they cannot be queried or searched, and they are not appropriate for large quantities of information.
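
One way to keep such information machine-readable is to record it as data rather than as formatting. In the sketch below, a hypothetical `status` column stands in for what might otherwise have been a cell colour, so the computer can query it directly. The titles and status values are made up.

```python
import csv
import io

# Hypothetical data: the status is written in its own column instead of
# being signalled by colouring the row in a spreadsheet.
csv_text = """title,status
Middlemarch,missing
Bleak House,on shelf
Jane Eyre,missing
"""

records = list(csv.DictReader(io.StringIO(csv_text)))

# Because the status is data, it can be searched and counted.
missing_titles = [record["title"] for record in records
                  if record["status"] == "missing"]

print(missing_titles)
```

The same query works whether the file holds three rows or three hundred thousand, which is not true of scanning a sheet by eye for highlighted cells.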

Markdown

Though it is likely that notation schemes will emerge from existing individual practice, existing schemes are available to represent headers, breaks, and so on. One such scheme is Markdown, a lightweight markup language. Markdown files (.md) are machine readable, human readable, and used in many contexts: GitHub, for example, renders text via Markdown. An excellent Markdown cheat sheet is available on GitHub for those who wish to follow, or adapt, this existing scheme.
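
As a small illustration of "machine readable", the sketch below pulls the headings out of a piece of Markdown text with a regular expression. The sample notes are invented, and the pattern only handles the common `#` style of heading; it is not a full Markdown parser.

```python
import re

# Hypothetical Markdown notes, invented for this example.
markdown_text = """# Meeting notes

## Actions

Some *emphasised* text.

## Next meeting
"""

# A heading line starts with one to six '#' characters, then spaces or tabs,
# then the heading text. re.MULTILINE makes ^ and $ match at each line.
heading_pattern = re.compile(r"^(#{1,6})[ \t]+(.*)$", re.MULTILINE)

# Each heading becomes a (level, title) pair.
headings = [(len(hashes), title.strip())
            for hashes, title in heading_pattern.findall(markdown_text)]

print(headings)
```

Because the heading level is expressed in the text itself, a program can build an outline of the notes without any word processor being involved.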

This set of notes is all written in Markdown. See the raw Markdown files, stored in GitHub.

Notepad++ is recommended for Windows users as a tool to write Markdown text in, though it is by no means essential for working with .md files. Mac or Unix users may find Komodo Edit, TextWrangler, Kate, or Atom helpful.

Naming files sensible things is good

Working with data is made easier by structuring your stuff in a consistent and predictable manner.

Examining URLs is a good way of thinking about why structuring data in a consistent and predictable manner might be useful in your work. Good URLs represent with clarity the content of the page they identify, either by containing semantic elements or by using a single data element found across a set or majority of pages.

Typical examples of the former are the URLs used by news websites or blogging services. WordPress URLs often follow the format:

ROOT/YYYY/MM/DD/words-of-title-separated-by-hyphens

A similar style is used by news organisations such as The Guardian newspaper:

ROOT/SUB_ROOT/YYYY/MMM/DD/words-describing-content-separated-by-hyphens

http://www.theguardian.com/uk-news/2014/feb/20/rebekah-brooks-rupert-murdoch-phone-hacking-trial
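
Because the URL follows a predictable pattern, a few lines of Python can split it back into its parts. This is a sketch that assumes the ROOT/SUB_ROOT/YYYY/MMM/DD/words pattern shown above; the `parse_article_url` function and the `MONTHS` lookup are names made up for this example.

```python
from datetime import date
from urllib.parse import urlparse

# Three-letter month abbreviations mapped to month numbers (jan -> 1, ...).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def parse_article_url(url):
    """Split a ROOT/SUB_ROOT/YYYY/MMM/DD/words-of-title URL into its parts."""
    parts = urlparse(url)
    section, year, month, day, slug = parts.path.strip("/").split("/")
    return {
        "root": parts.netloc,
        "section": section,
        "date": date(int(year), MONTHS[month], int(day)),
        "keywords": slug.split("-"),
    }


article = parse_article_url(
    "http://www.theguardian.com/uk-news/2014/feb/20/"
    "rebekah-brooks-rupert-murdoch-phone-hacking-trial"
)
print(article["section"], article["date"], article["keywords"])
```

A URL that does not follow the pattern would make the function fail, which is exactly the point: consistency is what makes the structure usable by a machine.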

What we learn from these examples is that a combination of semantic description and data elements makes for consistent and predictable data structures that are readable both by humans and machines. Transferring this to your stuff makes it easier to browse, to search, and to query.

In practice, the structure of a good archive might look something like this:

A base or root directory, perhaps called 'work'.

A series of subdirectories such as 'events', 'data', 'projects', et cetera.

Within each of these is a series of directories, one for each event, dataset or project. Introducing a naming convention here that includes a date element keeps the information organised without the need for subdirectories by, say, year or month.
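
The layout described above can be sketched in Python with `pathlib`. The directory names, dates and the `dated_directory` helper are all invented for illustration, and the directories are created in a temporary scratch folder so that nothing is left behind on your machine.

```python
import tempfile
from datetime import date
from pathlib import Path


def dated_directory(root, category, day, label):
    """Return root/category/YYYY-MM-DD_label (a hypothetical naming convention)."""
    return Path(root) / category / f"{day.isoformat()}_{label}"


with tempfile.TemporaryDirectory() as scratch:
    root = Path(scratch) / "work"
    planned = [
        dated_directory(root, "events", date(2014, 2, 20), "data-workshop"),
        dated_directory(root, "data", date(2013, 11, 5), "loans-survey"),
        dated_directory(root, "projects", date(2014, 1, 15), "catalogue-cleanup"),
    ]
    for directory in planned:
        directory.mkdir(parents=True)

    # List what was created, relative to the 'work' root.
    listing = sorted(path.relative_to(root).as_posix()
                     for path in root.glob("*/*"))

print(listing)
```

Writing the date as YYYY-MM-DD means that sorting the names alphabetically also sorts them chronologically, which is why no separate year or month folders are needed.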

All this should help you remember something you were working on when you come back to it later (call it real world preservation). The crucial bit for our purposes, however, is the file naming convention you choose. The name of a file is important to ensuring it and its contents are easy to identify. 'Data.xlsx' doesn't fulfil this purpose. A title that describes the data does. And adding a dating convention to the file name, associating derived data with base data through file names, and using directory structures to aid comprehension strengthens those connections.
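
One possible naming convention, sketched in Python: a date, a short description and a version label joined with underscores. The convention itself, the `build_name` function and the example description are assumptions made for illustration, not a standard.

```python
import re
from datetime import date


def build_name(day, description, version="base", extension="csv"):
    """Build a file name of the form YYYY-MM-DD_description_version.extension."""
    # Lower-case the description and replace anything that is not a letter
    # or digit with a hyphen, so the name is safe on any platform.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{day.isoformat()}_{slug}_{version}.{extension}"


collected = date(2014, 2, 20)
base_file = build_name(collected, "Loans by branch")
derived_file = build_name(collected, "Loans by branch",
                          version="derived-monthly-totals")

print(base_file)
print(derived_file)

# The shared prefix ties the derived file back to the base data.
shared_prefix = base_file.rsplit("_", 1)[0]
is_linked = derived_file.startswith(shared_prefix)
```

Generating names with a function, rather than typing them by hand each time, is a simple way to keep the convention consistent across a whole project.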

Key Points

• Data structures should be consistent and predictable
• Consider using semantic elements or data identifiers to name data directories