# Introduction to Data Science

Data Science is a discipline that combines aspects of statistics, mathematics, programming, and domain expertise. Data Scientists solve big problems in public and private sectors.

Some common definitions of Data Science are as follows:

• A powerful new approach to make discoveries from data
• An automated way to analyze enormous amounts of data and extract information
• A new discipline that combines aspects of statistics, mathematics, programming, and visualization to turn data into information

## Components of Data Science

When we combine domain expertise and scientific methods with technology, we get Data Science.

### Domain Expertise and Scientific Methods

Data Scientists collect data and explore, analyze, and visualize it. They apply mathematical and statistical models to find patterns and solutions in the data. Domain Expertise and Scientific Methods can be divided into three parts:

• Analysis – Analysis can be of three types:
  • Descriptive: Study a dataset to decipher the details
  • Predictive: Create a model based on existing information to predict outcomes and behavior
  • Prescriptive: Suggest actions for a given situation using the collected information
• Mathematical and Statistical Models
• Scientific Tools and Methods
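
The three types of analysis can be illustrated with a small sketch. The sales figures, the `stock_on_hand` threshold, and the function names below are all hypothetical, chosen only to show how a descriptive summary, a predictive model (here a simple least-squares line), and a prescriptive rule might fit together.

```python
from statistics import mean, median

# Hypothetical monthly sales figures (illustrative data, not from a real source).
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def describe(values):
    """Descriptive analysis: summarize what the data looks like."""
    return {"mean": mean(values), "median": median(values),
            "min": min(values), "max": max(values)}


def fit_line(xs, ys):
    """Predictive analysis: fit y = slope * x + intercept by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


def prescribe(forecast, stock_on_hand):
    """Prescriptive analysis: suggest an action from the forecast (toy rule)."""
    return "reorder" if forecast > stock_on_hand else "hold"


summary = describe(sales)
slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept
action = prescribe(forecast_month_7, stock_on_hand=150.0)
print(summary, forecast_month_7, action)
```

A real project would typically use a statistics or machine learning library for the predictive step; the hand-written line fit is used here only to keep the sketch self-contained.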

### Technology in Data Science

Traditionally, data scientists performed their analysis in a lab-like setup, which was tedious and time-consuming. Modern tools and technologies have made data processing and analytics faster and more efficient. These include:

• Python Language
• Data Processing Tools
• Operating System
• Application Design
• Library

These technologies help Data Scientists to:

• Build and train machine learning models – This is called Machine Learning
• Manipulate data with technology
• Extract information from data
• Build data tools, applications, and services

An important caveat: data analysis that relies only on technology and domain knowledge, without mathematical and statistical knowledge, often leads to incorrect patterns and wrong interpretations. This can cause serious damage to businesses.

## Roles and Responsibilities of a Data Scientist

• They use data acquisition to collect data sets from the real world.
• They implement data wrangling with data tools and modern technologies; this includes data cleansing, data manipulation, data discovery, and data pattern identification.
• The next step is to create and train models for machine learning.
• They then design mathematical or statistical models.
• After designing a data model, they represent it using data visualization techniques.
• The next task is to prepare a data report.
• After the report is prepared, they finally create data products and services.
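
As a rough illustration of the wrangling step, the sketch below cleanses a handful of records and then summarizes them. The records, field names, and the `wrangle` and `report` helpers are hypothetical; real projects would usually rely on a dedicated data processing library.

```python
from statistics import mean

# Hypothetical raw records, standing in for data acquired from the real world.
raw_records = [
    {"city": "Bengaluru ", "temp_c": "31.5"},
    {"city": "bengaluru", "temp_c": "30.5"},
    {"city": "Mumbai", "temp_c": ""},        # missing value
    {"city": "Mumbai", "temp_c": "33.0"},
    {"city": "Mumbai", "temp_c": "33.0"},    # exact duplicate
]


def wrangle(records):
    """Cleanse and normalize: drop rows with missing values and duplicates."""
    seen, clean = set(), []
    for row in records:
        city = row["city"].strip().title()
        if not row["temp_c"]:
            continue  # data cleansing: skip rows with a missing reading
        key = (city, float(row["temp_c"]))
        if key in seen:
            continue  # data cleansing: skip duplicates
        seen.add(key)
        clean.append({"city": city, "temp_c": key[1]})
    return clean


def report(records):
    """Data discovery: average temperature per city."""
    by_city = {}
    for row in records:
        by_city.setdefault(row["city"], []).append(row["temp_c"])
    return {city: mean(temps) for city, temps in by_city.items()}


clean = wrangle(raw_records)
city_report = report(clean)
print(city_report)
```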

## Basic Skills of a Data Scientist

• Ask the Right Questions – Data scientists should ask the right questions, for which they need domain expertise, the curiosity to learn and create concepts, and the ability to communicate questions effectively to domain experts.
• Understand Data Structures – Data scientists should think analytically to understand the hidden patterns in a data structure.
• Interpret and Wrangle Data – They should wrangle the data by removing redundant and irrelevant data collected from various sources.
• Apply Statistical and Mathematical Methods – Statistical thinking and the ability to apply mathematical methods are important traits for data scientists.
• Visualize and Communicate with Stakeholders – Data should be visualized with graphics and proper storytelling to summarize and communicate the analytical results to the audience.

To acquire these skills, data scientists should follow a clear roadmap. It’s important that they adopt the required tools and techniques, such as Python and its libraries. They should build projects using real-world data sets from sources such as data.gov.in, the Open Government Data (OGD) Platform, and so on. They should also build data-driven applications for digital services and data products.

## Data Science and Big Data

Data scientists work with different types of datasets for various purposes. Now that big data is generated every second through different media, the role of data science has become more important. So you need to know what big data is and how you are connected to it in order to figure out a way to make it work for you.

Every time you

• record your heartbeat through your phone’s biometric sensors
• post or tweet on a social network
• create a blog or website
• switch on your phone’s GPS
• load or view an image, video, or audio

you are generating data about yourself, your preferences, and your lifestyle. Big data is a collection of these and a lot more data that the world is constantly creating. In this age of the Internet of Things (IoT), big data is a reality and a need.

### Three V’s

Big data is usually characterized by the three V’s:

• Volume – Volume refers to the enormous amount of data generated from various sources.
• Velocity – Big data is also characterized by velocity: huge amounts of data flow at tremendous speed from different devices, sensors, and applications. Dealing with it requires efficient and timely data processing.
• Variety – Variety is the third V because big data comes in different formats: structured, semi-structured, and unstructured.

Structured data is usually referred to as RDBMS data, which can be stored and retrieved easily through SQL queries. The data is usually neatly arranged in tables. Think, for example, of the database of students in a university.

Semi-structured data usually takes the form of XML or JSON documents and NoSQL databases.

Text files, images, videos, and other multimedia content are examples of unstructured data.
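
The difference between the three formats can be seen in a short sketch. It uses Python's built-in `sqlite3` and `json` modules; the table layout, student names, and sample text are hypothetical and serve only to contrast how each kind of data is accessed.

```python
import json
import sqlite3

# Structured: rows in a relational table, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO students (name, dept) VALUES (?, ?)",
    [("Asha", "Physics"), ("Ravi", "Maths"), ("Meera", "Physics")],
)
physics_students = [
    name for (name,) in conn.execute(
        "SELECT name FROM students WHERE dept = ? ORDER BY name", ("Physics",)
    )
]
conn.close()

# Semi-structured: JSON is self-describing, but fields may vary per record.
doc = json.loads(
    '{"name": "Asha", "courses": [{"code": "PH101"}, {"code": "PH205", "lab": true}]}'
)
course_codes = [course["code"] for course in doc["courses"]]

# Unstructured: free text has no schema; any structure must be derived.
note = "Asha asked about the lab schedule. The lab schedule changed twice."
word_count = len(note.split())

print(physics_students, course_codes, word_count)
```

Note how the structured data is reached through a fixed schema, the semi-structured data by walking keys that may or may not be present, and the unstructured text only after some processing (here, naive whitespace splitting).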

In short, big data is a very large information store, usually kept on distributed systems such as Hadoop clusters. To be able to use it, we have to find a way to extract the right information and data patterns from it. That’s where data science comes in. Data science helps to build information-driven enterprises.

## Applications of Data Science in Different Sectors

Let’s go on to see the applications of data science in different sectors.

Internet platforms such as Google, Yahoo, and Facebook collect a lot of data every day, which is why they have some of the most advanced data centers spread across the world. Having data centers all over the world, and not just in the US, helps these companies serve their international customers better and faster, with lower network latency. The data centers also help them deal effectively with the enormous amount of data.

So what do all these different sectors do with all this big data? Their teams of data scientists analyze the raw data with the help of modern algorithms and data models to turn it into information. They then use this information to build digital services, data products, and information-driven – now, let’s see how these products and services work.

Suppose that you are a data scientist based in Bengaluru. It’s quite likely that you would want to join a group or build connections with people related to data science in Bengaluru. With the help of data science, LinkedIn looks at your profile, your posts and likes, the city you are from, the people you are connected to, and the groups you belong to. It then matches all that information with its own database to provide you with the information that is most relevant to you. This information could take the form of news updates that you might be interested in, industry connections or professional groups that you might want to get in touch with, or even job postings related to your field and designation. These are all examples of data services.

Let’s now look at something that we use every day: the Google search engine. Its search algorithm allows machine learning models to provide relevant search recommendations even as the user types in a query. This feature is called autocomplete, and it is an excellent example of how powerful machine learning can be. Several factors influence this feature. The first is query volume. Google’s algorithms identify unique and verifiable users that search for a particular keyword on the web and, based on that, build a query volume. For instance, "Republican debate 2016" and "Ebola threat CDC" (CDC being the Centers for Disease Control and Prevention) have been some of the most common user queries.

Another important factor is geographical location. The algorithms tag a query with the locations from which it is generated, which makes the query volume location-specific. This is a very important feature because it allows Google to provide relevant search recommendations to a user based on his or her location. And then, of course, the algorithms consider the actual words and phrases that the user types in: they take those words and crawl the web looking for similar instances. The algorithms also try to filter or scrub out inappropriate content. For instance, sexual, violent, or terrorism-related content, hate speech, and legal cases are scrubbed out from the search recommendations.
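
The factors described above (query volume from unique users, location tagging, prefix matching, and content filtering) can be combined in a toy sketch. This is not Google's algorithm; the query log, the `BLOCKED_TERMS` set, and the function names are all hypothetical, and a production system would be far more sophisticated.

```python
from collections import defaultdict

# Hypothetical query log: (user_id, location, query). Illustrative only.
query_log = [
    ("u1", "IN", "data science course"),
    ("u1", "IN", "data science course"),   # same user again: counted once
    ("u2", "IN", "data science jobs"),
    ("u3", "IN", "data science jobs"),
    ("u4", "US", "data science salary"),
    ("u5", "US", "data science salary"),
    ("u6", "US", "data science jobs"),
    ("u7", "US", "data violent content"),
]
BLOCKED_TERMS = {"violent"}  # stand-in for an inappropriate-content filter


def build_query_volume(log):
    """Count unique users per (location, query) pair."""
    users = defaultdict(set)
    for user_id, location, query in log:
        users[(location, query)].add(user_id)
    return {key: len(ids) for key, ids in users.items()}


def suggest(prefix, location, volume, limit=3):
    """Return allowed queries matching the prefix, most popular in the location first."""
    candidates = [
        (count, query)
        for (loc, query), count in volume.items()
        if loc == location
        and query.startswith(prefix.lower())
        and not BLOCKED_TERMS & set(query.split())
    ]
    # Sort by descending volume, then alphabetically for stable ties.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [query for _, query in candidates[:limit]]


volume = build_query_volume(query_log)
print(suggest("data", "IN", volume))
print(suggest("data", "US", volume))
```

The same prefix yields different suggestions per location because the volume is keyed by location, and the blocked query never appears regardless of how often it was searched.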

### Health Care Industry

But how does data science help you today? Even the healthcare industry is beginning to tap into the various applications of data science. To understand this, let’s look at wearable devices.

These devices have biometric sensors and a built-in processor to gather data from your body while you are wearing them. They transmit this data to a big data analytics platform via an IoT gateway. Ideally, the platform collects hundreds of thousands of data points, and the collected data is ingested into the system for further processing. The platform applies data models created by data scientists and extracts the information that is relevant to you.

It sends the information to an engagement dashboard, where you can see how many steps you walked, what your heart rate was over a period of time, how well you slept, how many calories you burned, and so on. Knowing such details helps you set personal goals for a healthy lifestyle and can reduce overall healthcare and insurance costs.
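
As a minimal sketch of how raw data points might be reduced to dashboard figures, consider the example below. The readings, their tuple layout, and the `summarize` function are hypothetical; an actual analytics platform would process far more data and use models created by data scientists.

```python
from statistics import mean

# Hypothetical readings from a wearable: (minute, heart_rate_bpm, steps_in_minute).
readings = [
    (0, 72, 0),
    (1, 75, 40),
    (2, 90, 110),
    (3, 95, 120),
    (4, 80, 30),
]


def summarize(data):
    """Reduce raw data points to the figures a dashboard might show."""
    heart_rates = [bpm for _, bpm, _ in data]
    return {
        "total_steps": sum(steps for _, _, steps in data),
        "avg_heart_rate": mean(heart_rates),
        "max_heart_rate": max(heart_rates),
    }


dashboard = summarize(readings)
print(dashboard)
```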