CIOs may be the key to helping organizations understand and improve the efficiency of digital transformation efforts—or at least, the data they can access is. Read on to see what CIO says about the future of performance optimization through machine learning.
Source code is the new printing press, the new coal, the new oil, the new assembly line; the generator of the next new economy — some call it the fourth industrial revolution. From the auto industry manufacturing self-driving cars with millions of lines of code to doctors performing surgery with robots halfway around the world, source code is everywhere.
With software security breaches costing millions of dollars and an estimated $3 trillion global GDP loss coming from developer inefficiency, businesses are only just beginning to understand how critical their code and the processes that manage it really are. Just as businesses audit their financial statements, their processes and even their assembly lines, it is becoming just as critical, if not more so, to audit their software portfolios. That’s where machine learning on code comes in.
With enough data, machine learning can solve challenging problems for many industries. It already powers facial recognition for automated photo tagging and movie-recommendation engines based on user preferences; applied to source code, it is poised to help create code bases that are more secure and easier to maintain.
Currently, companies have no easy way to measure progress on key digital transformation initiatives, such as adopting a new logging system, making a major API change or completing painful projects like becoming GDPR compliant. Their code is constantly changing and is often fragmented across different repositories and programming languages, making it very hard to have any visibility into the state of the whole codebase. The task is getting even more difficult as the growing use of open-source code brings in external dependencies and as monoliths are split into ever-smaller microservices.
Treating code as the rich dataset that it is
As we turn everything into data in an effort to better understand the processes that surround us, from open government to open source, Code as Data is inevitable. Code as Data is about extracting insights from code repositories, including the source and all of the versions it went through before reaching its current state. Code as Data tasks include code retrieval, language classification, program parsing, token extraction and other language-agnostic analyses that allow us to compute metrics, see how they evolve over time and predict future trends.
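As a rough illustration of two of these tasks, language classification and token extraction, here is a minimal Python sketch. It assumes a repository snapshot is available as a mapping from file path to source text; the extension table and the function names are hypothetical, and real tools typically inspect file contents and use proper parsers rather than file extensions and a regular expression.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension-to-language table (an assumption for this sketch);
# production classifiers also look at file contents, not just the extension.
EXTENSION_LANGUAGES = {
    ".py": "Python",
    ".js": "JavaScript",
    ".go": "Go",
    ".java": "Java",
}

# Very rough notion of a token: identifiers and keywords only.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def classify_language(path):
    """Guess the language of a file from its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_LANGUAGES.get(suffix, "Unknown")


def extract_tokens(source):
    """Return the identifier-like tokens found in a piece of source text."""
    return IDENTIFIER.findall(source)


def language_breakdown(snapshot):
    """Count identifier tokens per language for one snapshot (path -> source)."""
    counts = Counter()
    for path, source in snapshot.items():
        counts[classify_language(path)] += len(extract_tokens(source))
    return counts
```

Running `language_breakdown` on the snapshot at each commit would give one data point per version, which is the kind of metric whose evolution could then be charted over time.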
For instance, source{d} has been developing a platform that leverages machine learning to automate code review for developers while helping executives measure engineering effectiveness and inform their IT strategy based on data rather than feelings. It can track framework and programming language adoption, and help management with hiring decisions. Cumbersome questions such as “How far are we with our migration from Angular to Angular 2?” can be easily answered. Codebase sanity can be checked; for every commit, the technology can make sure that the code respects predefined technical guidelines and is free of the most common security vulnerabilities such as SQL injection or API key leaks. Semmle, another startup in the space, goes even further by discovering new types of source-code vulnerabilities.
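To make the idea of a per-commit check concrete, here is a minimal Python sketch that scans the added lines of a unified diff for a few suspicious patterns. It is not how source{d} or Semmle work; the rule names and regular expressions are illustrative assumptions, and simple pattern matching like this produces both false positives and false negatives.

```python
import re

# Illustrative rules only (assumptions for this sketch), not a complete
# or authoritative list of vulnerability signatures.
RULES = {
    "possible AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hard-coded API key or secret": re.compile(
        r"""\b(api[_-]?key|secret)\b\s*[:=]\s*["'][^"']{8,}["']""",
        re.IGNORECASE,
    ),
    "SQL built by string concatenation": re.compile(
        r"""execute\(\s*["'].*["']\s*\+""",
        re.IGNORECASE,
    ),
}


def scan_added_lines(diff_text):
    """Return (line_number, rule_label) pairs for suspicious added lines.

    Line numbers refer to positions within the diff text, starting at 1.
    """
    findings = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect lines the commit adds; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for label, pattern in RULES.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings
```

A check of this kind could run on every commit and fail the build whenever `scan_added_lines` returns a non-empty list.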
Source-code repository analysis can also reveal information about the developers writing the code. Team dynamics can be highlighted by analyzing commit times and content: managers can identify when software engineers are the most productive, arranging meetings and encouraging cross-team collaboration accordingly. Looking at trends in programming languages and frameworks can inform hiring managers about what type of talent to hire and what upskilling resources they can provide. Adding source code as a new dataset in enterprises’ data warehouses and visualization platforms such as Power BI, Looker or Tableau will provide everyone in the engineering organization with a whole new level of source-code and development-process observability.
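As a small sketch of the commit-time analysis described above, the Python snippet below counts commits per hour of the day. It assumes commit timestamps have already been exported as ISO 8601 strings; the function names are hypothetical, and commit counts are only a crude proxy for when engineers are productive.

```python
from collections import Counter
from datetime import datetime


def commits_per_hour(timestamps):
    """Count commits by the hour of day (0-23) in each timestamp's own offset."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def busiest_hours(timestamps, top=3):
    """Return the hours with the most commits, most active first."""
    return [hour for hour, _ in commits_per_hour(timestamps).most_common(top)]
```

The resulting hour-by-hour counts are the sort of table that could be loaded into a data warehouse and charted alongside other engineering metrics.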
Learning from source code to build better tooling
Yet the most exciting aspect of looking at code as a dataset is that it can be used to train machine-learning models that can automate many different repetitive tasks for developers. We’re already starting to see new machine-learning-based applications for assisted code review or suggestions on GitHub. Imagine how much time developers could save if bots were to remind them of style or naming conventions, or detect similar code anywhere from the project level down to the function level. There is also a class of tasks where automation performs better than humans can, for instance finding whether a piece of code duplicates an existing dependency or any popular open source project. A human cannot memorize millions of lines of code, while this is an easy win for a good algorithm.
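One simple way to approximate duplicate detection, offered here only as an illustrative sketch and not as the method any particular tool uses, is to break each piece of code into overlapping runs of tokens (shingles) and compare the resulting sets with Jaccard similarity. The tokenizer, the shingle size and the function names below are assumptions chosen for brevity.

```python
import re

# Crude tokenizer: identifiers, numbers, then any other non-space character.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S")


def shingles(source, size=5):
    """Return the set of overlapping token runs of the given size."""
    tokens = TOKEN.findall(source)
    # Inputs shorter than `size` yield a single, shorter shingle.
    count = max(len(tokens) - size + 1, 1)
    return {tuple(tokens[i:i + size]) for i in range(count)}


def similarity(a, b, size=5):
    """Jaccard similarity of two code fragments' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Because whitespace is discarded during tokenization, two copies of a function that differ only in formatting score 1.0. Comparing one fragment against millions of lines would need an index, such as hashing the shingles, rather than pairwise comparison.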
Taking this further, the future of software engineering may lie in training and managing machine-learning models and getting them to do the coding work that humans are currently doing. For instance, Diffblue uses machine learning to automatically write unit tests for your code. Unlike humans, computers can work 24/7 and easily identify patterns or flag issues across very large codebases. These new machine-learning-based tools will enable developers to build better and faster software as a team by focusing on what’s really important and leaving the non-essential tasks to bots. These machine-learning models and applications are the building blocks for the next-generation developer tools that will forever change the way students and developers learn programming as well as how they write and review code. It’s inevitable: Machine learning on code is the next frontier of a whole new series of software building tools.
This article was written by Sylvain Kalache from CIO and was legally licensed through the NewsCred publisher network. Please direct all licensing questions to legal@newscred.com.