Harmony and analytics are two terms not often found together, especially when data science work is carried out by a team of diverse data professionals. A data science team is often made up of people with different backgrounds and skill sets, ranging from the machine learning specialist to the master Python coder to the beginning data analyst. Successfully building and executing a data science project of any size requires harmony among all team members. Everyone needs to work effectively and efficiently, using the tools they know best.
GigaOM recently had the opportunity to discuss the dynamics of building data teams with Florian Douetteau, CEO of Dataiku. Douetteau offered some sage advice and pointed out the challenges facing those trying to build such teams.
Douetteau said, “One of the biggest challenges is finding, hiring, and keeping people with a background in machine learning. There is a high demand for experienced data scientists, meaning that it can cost a lot to hire one, especially considering the opportunities data scientists have.”
Douetteau added, “Yet so many of those data professionals are specialized in their area of expertise, meaning that it is difficult for a business to maximize the return on investment of hiring a data scientist. In other words, to build a team, businesses need to move beyond a single individual’s domain of expertise and hire individuals with different skill sets and enable them to work cooperatively and productively.”
The growing shortage of data scientists, along with the closed nature of many analytics tools, has made building effective teams a near impossibility. Yet all is not lost. Douetteau said, “There exists a vast ecosystem of open source tools that are available to the masses, which can help to level the playing field, and bring data analytics capabilities to professionals of all stripes.”
However, much like the cola wars of the 1980s, analytics tool sets come in an almost infinite variety of flavors and formulas, and tastes differ, which makes standardizing on any one of them difficult. That conundrum can only be solved by creating harmony among team members and their tools of choice. In data science, harmony means that people can work with the tools they prefer and that some mechanism exists to orchestrate those tools.
Douetteau said, “Using and supporting open source solutions enables a company to widen the pool of potential candidates for the data team – instead of sticking to people who can only code in Python (for example), the company can hire people with different skill sets in different open source technologies that can substitute for other interchangeable technologies.”
Naturally, orchestration and harmony cannot happen without a conductor, and in the world of data science, that conductor takes the form of a software platform that fuels interoperability and tears down barriers. Douetteau explained, “We like to think of Dataiku as a ‘control room’ of the wildly dynamic and diverse plethora of open source technologies: whether you use Hive or Pig, or code in Python, R or Scala, Dataiku will let you use whatever solution you already have and know and seamlessly integrate it with the next step in the process. And because we’ve built a visual interface on top of these open source solutions, you can still use many of these solutions even if you don’t know how to code in a particular language, or at all.”
A Closer Look at Dataiku:
Dataiku implements that control room as Dataiku Data Science Studio (DSS), a platform that embraces the philosophy of open source and bridges open source technologies so that teams can choose their tools without sacrificing collaboration. Dataiku DSS connects to more than 25 different data storage systems, both closed source and open source, including SQL Server, HDFS, and NoSQL databases.
Dataiku DSS also supports numerous programming languages, allowing data professionals to work with the programming tools of their choice and still have connectivity to the data shared by the team. Critical features such as team knowledge sharing, change management, and project monitoring further fuel collaboration, while eliminating silos of operation.
Dataiku DSS’s platform approach centralizes open source elements, creating an environment where team knowledge is shared, and never lost when teams are reconfigured. What’s more, integrated to-do lists, document sharing, and unified logs make it easier to onboard new team members, as well as perform forensics on previous projects.
Simply put, open source tools may become the lifeblood of enterprise data science projects, but without proper orchestration, that lifeblood, along with communal knowledge, is sure to be lost over a short period of time.
Douetteau added, “We often say that the bleeding edge of data science algorithms and architecture, which is often driven by large internet companies such as Facebook or LinkedIn, is only slightly ahead of what is being open sourced, whether directly by these companies or via original development or reverse engineering. Having a team that keeps up with open source developments also tends to mean that they’ll be more likely to find innovative solutions to the problems you’re solving.”