ESI Africa Issue 2 2015

DATA MANAGEMENT

THINK ‘data science’ to manage energy distribution

Data science is an emerging field and plays an integral part in the so-called ‘big data’ drive, where the challenge is to extract value from vast amounts of data. This article aims to provide a backdrop and case study for the application of data science thinking in the energy distribution sector.

Technology giants such as Google, Apple, Facebook and Amazon, along with social media companies, generate on the order of petabytes of user data daily, and these volumes are growing rapidly. Moreover, the Internet of Things (IoT) is also contributing to high volumes of data, as a wide variety of devices, sensors, systems and services connect to the internet in an effort to achieve greater value by exchanging information more efficiently.

THE HIDDEN VALUE OF YOUR DATA

Possibly the primary reason data volumes are growing is the advances made in physics and engineering, which allow progressively faster information processing and greater storage capacity. Consequently, companies now gather and store more data than they can effectively manage in terms of business potential. This is where data science aims to bridge the gap between the business opportunity and all the data.

The need to analyse extremely large amounts of information, in some cases in near real-time, in order to derive value from it is undoubtedly increasing with this data explosion. Data scientists specialising in machine learning aim to build algorithms capable of detecting patterns in the data (hidden information). These patterns can be used to better understand the underlying dynamics captured in digital form, or to develop data products for real-time systems that mimic or enhance human information processing tasks. Machine learning also empowers us to address uncertainty.

WHAT DO UTILITIES NEED TO ACHIEVE THIS?
The data needs to be stored on proper data management platforms that scale well and provide high-speed processing, particularly for machine learning applications. Platforms from the open source community that are gaining popularity include Hadoop, with its two-stage MapReduce paradigm; Apache Spark, with its in-memory iterative computation advantages; and Cluster Map Reduce, a Hadoop-like framework for distributed environments. Alternative platforms are emerging, but the choice of platform will depend on factors such as the business objective (end applications), data structures and machine learning algorithms.

Apache Spark, for example, was originally developed at UC Berkeley, is designed to run on top of the Hadoop Distributed File System (HDFS) and fits into the Hadoop open-source community, which has received code contributions from over 30 companies, including Yahoo and Intel. This framework promises much higher performance than Hadoop MapReduce for machine learning algorithms. The pitfalls to consider when using open source may include supplementary custom code development and technical complications in the ecosystem, which require experts to manage, deploy and monitor.

From an analytical point of view, more data does not always make a big difference (this is highly dependent on the application) and does not guarantee that more insights will be gained. However, some machine learning algorithms require ample training data to help them generalise well over the true underlying regularities in what is being modelled. It has been shown that simple modelling methods coupled with more data can outperform more complex modelling methods, although this depends on the underlying system’s dynamics, data types and data quality. Data quality deserves heavy emphasis, and approximately 80-90% of all effort should revolve around it.
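As a minimal sketch of what such data quality work can look like in practice, the Python snippet below counts three common problems in interval meter readings: missing values, out-of-range values and duplicate records. The `Reading` structure, the `quality_report` function and the 50 kWh plausibility limit are illustrative assumptions and are not taken from any particular utility system.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Reading:
    """One interval reading from a meter (hypothetical structure)."""
    meter_id: str
    timestamp: str          # ISO 8601 string, e.g. "2015-03-01T00:30"
    kwh: Optional[float]    # None marks a missing value


def quality_report(readings: Iterable[Reading], max_kwh: float = 50.0) -> dict:
    """Count basic data quality problems in a collection of readings.

    max_kwh is an assumed plausibility limit per interval; a real limit
    would depend on the meter type and the interval length.
    """
    report = {"total": 0, "missing": 0, "out_of_range": 0, "duplicate": 0}
    seen = set()
    for r in readings:
        report["total"] += 1
        key = (r.meter_id, r.timestamp)
        if key in seen:
            # Same meter and timestamp already seen: count as a duplicate
            # and skip the other checks for this record.
            report["duplicate"] += 1
            continue
        seen.add(key)
        if r.kwh is None:
            report["missing"] += 1
        elif r.kwh < 0 or r.kwh > max_kwh:
            report["out_of_range"] += 1
    return report


sample = [
    Reading("M1", "2015-03-01T00:00", 1.2),
    Reading("M1", "2015-03-01T00:30", None),   # missing value
    Reading("M1", "2015-03-01T00:30", 1.1),    # duplicate meter/timestamp
    Reading("M2", "2015-03-01T00:00", -3.0),   # negative consumption
    Reading("M2", "2015-03-01T00:30", 0.9),
]
print(quality_report(sample))
```

A report of this kind is only a starting point: it shows where the problems are, while deciding how to repair or exclude the affected records remains a judgement that depends on the application.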
Most ‘big data’ efforts are likely to end up in a ‘garbage in, garbage out’ scenario if data quality or data consistency is neglected.

WHAT IS THE POTENTIAL VALUE TO ENERGY DISTRIBUTION?

Many of the technology giants mentioned previously use machine learning to improve the user experience on their platforms by feeding user data into recommendation engines. Some of these recommendation engines