Thursday, July 10, 2014

Analysing Excel

Over the past few days, I've been revisiting some of the links which I had for my research but discarded. These are connected to people and companies who analyse spreadsheets and suggest how they can be improved. One person with whom I was in contact is Felienne Hermans, who is a researcher at Delft University of Technology; an offshoot from her PhD research is the company Infotron whose site allows one to upload an Excel spreadsheet and have it analysed for syntactic errors. Whilst sites like these are very interesting, they are not what I am primarily concerned with.

I decided to take a spreadsheet (belonging to someone else as mine are too simple) and analyse it at the Infotron site. Here are some of the results:

Sheet!CellFormulaRefactoring
Levels!J125 J124*3 Consider placing 3 in separate cell
Levels!J126 J124*6 Consider placing 6 in separate cell
Combined!F28 95298+91404 Consider placing 91404, 95298 in separate cells
Combined!F46 451583+218684 Consider placing 218684, 451583 in separate cells
Combined!F53 30306+59191 Consider placing 30306, 59191 in separate cells
International!E52 549788+9594 Consider placing 9594, 549788 in separate cells

What this report is saying is that one should never have a formula consisting of naked values (such as 3, 6 or 95298); instead, a formula should consists solely of cell references, like J124*J125. When one wants to know what J125 is, one goes to row 125 and sees what the row's name is. In programming, this is known as never using magic constants; for example, if one is writing a standard playing card game, then one should never use the naked value 52, but rather declare a constant "decksize = 52' then always refer to the constant decksize.

I discussed the report with the person who created the spreadsheet, who said that the spreadsheet was based on data whose source is Priority. I pointed out that there should be no problem as a worksheet will be created with (at least) two cells per row: description and value. When one has to add two separate values together, then one uses the cell references which hint at the cell's description. "And if I have to change a value or add something which doesn't feature in the original spreadsheet?", he asked, presumably referring to transactions which haven't yet been posted in Priority. The answer is just as simple: add another row to the spreadsheet, where the description is something like "Provision for rent (not yet posted)", and the value is whatever it is.

This activity overcomes the problem which I listed in my research proposal as referential integrity: it is not possible to know the source of a datum in the spreadsheet; it could come from ERP, it could be a personal projection, it may be invented or it may be the result of an earlier error. A malicious user could enter false data into a spreadsheet in order to support fraudulent activity.

As mentioned above, a tool such as Infotron's overcomes syntactic errors; it wouldn't detect a user adding a row with the description 'fraudulent activity' and the value '10000'! That, as they say, is left as an exercise for the reader.

Last week (June 2014), the Institute of Chartered Accountants in England and Wales (ICAEW) launched their Twenty principles for good spreadsheet practice. In this document the ICAEW recommend that organisations adopt a standard. The document contains several examples of good practice, but of course ignores the fact that maybe the program from which the data originated might have better data manipulation tools.

No comments: