In this article, I am going to give a brief introduction to Data Science. Data science is all about understanding data and using it to solve complex business problems. Its main goal is to uncover hidden patterns in raw data. To achieve this goal, data scientists use various tools, machine learning principles, and algorithms. This in turn allows organizations to manage costs, grow their market, and increase efficiency. At the end of this article, you will understand the following topics.
Inferential statistics use measurements from the sample of subjects in the experiment to compare the treatment groups and make generalizations about the larger population of subjects. There are many types of inferential statistics and each is appropriate for a specific research design and sample characteristics.
Descriptive statistics summarize the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalizable to the broader population.
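To make the distinction concrete, here is a minimal sketch using only the Python standard library. The two groups and their measurements are made-up numbers, and the helper names `describe` and `welch_t` are my own. The first function only summarizes a sample (descriptive), while the second computes Welch's t statistic, one of many possible inferential statistics for comparing two treatment groups.

```python
import math
import statistics

# Hypothetical measurements for two treatment groups (made-up numbers).
group_a = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3]
group_b = [6.2, 5.9, 6.4, 6.0, 6.5, 6.1]


def describe(sample):
    """Descriptive statistics: summarize the sample itself."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation
    }


def welch_t(sample_1, sample_2):
    """Inferential statistic: Welch's t for the difference between two group means."""
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)
    standard_error = math.sqrt(var_1 / len(sample_1) + var_2 / len(sample_2))
    return (mean_1 - mean_2) / standard_error


print(describe(group_a))
print(describe(group_b))
print(welch_t(group_a, group_b))
```

A t statistic far from zero suggests the difference between the group means is large relative to the sampling noise. Turning it into a p-value requires the t distribution, which this sketch leaves out.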
Abstractly, vectors are objects that can be added together (to form new vectors) and that can be multiplied by scalars (i.e., numbers), also to form new vectors.
Concretely (for us), vectors are points in some finite-dimensional space. Although you might not think of your data as vectors, they are a good way to represent numeric data.
For example, if you have the heights, weights, and ages of a large number of people, you can treat your data as three-dimensional vectors (height, weight, age).
If you’re teaching a class with four exams, you can treat student grades as four-dimensional vectors (exam1, exam2, exam3, exam4).
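The two operations from the abstract definition can be sketched with plain Python lists. The function names `vector_add` and `scalar_multiply` and the sample values are assumptions chosen for illustration, not part of any library.

```python
def vector_add(v, w):
    """Add two vectors componentwise; both must have the same length."""
    if len(v) != len(w):
        raise ValueError("vectors must be the same length")
    return [v_i + w_i for v_i, w_i in zip(v, w)]


def scalar_multiply(c, v):
    """Multiply every component of v by the number c."""
    return [c * v_i for v_i in v]


# Hypothetical data: (height, weight, age) for one person,
# and (exam1, exam2, exam3, exam4) for one student.
height_weight_age = [70, 170, 40]
grades = [95, 80, 75, 62]

print(vector_add([1, 2, 3], [4, 5, 6]))
print(scalar_multiply(2, [1, 2, 3]))
```

Lists keep the idea visible, but for real workloads an array library is usually a better fit.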
Matrices are the building blocks of data science. ... In its most basic form, a matrix is a collection of numbers arranged in a rectangular array of rows and columns. It can represent an image, a network, or even an abstract structure. For example, a matrix with 3 rows and 4 columns is a rectangular array of 12 numbers.
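As a small illustration, a matrix with 3 rows and 4 columns can be represented as a list of rows. The helper names `shape`, `get_row`, and `get_column` and the values in `matrix` are assumptions made for this sketch.

```python
# A matrix with 3 rows and 4 columns, stored as a list of rows (made-up values).
matrix = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]


def shape(m):
    """Return (number of rows, number of columns)."""
    num_rows = len(m)
    num_cols = len(m[0]) if m else 0
    return num_rows, num_cols


def get_row(m, i):
    """Return row i as a vector."""
    return m[i]


def get_column(m, j):
    """Return column j as a vector."""
    return [row[j] for row in m]


print(shape(matrix))
print(get_row(matrix, 0))
print(get_column(matrix, 1))
```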
The probability plot (Chambers et al., 1983) is a graphical technique for assessing whether or not a data set follows a given distribution such as the normal or Weibull. The data are plotted against a theoretical distribution in such a way that the points should form approximately a straight line.
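The coordinates of a normal probability plot can be sketched with the standard library alone. The function name `normal_probability_points` is hypothetical, and the plotting position `(i - 0.5) / n` is one common convention that I am assuming here; other conventions exist. Each sorted observation is paired with the standard normal quantile at its plotting position, so data that are roughly normal should give points close to a straight line.

```python
from statistics import NormalDist


def normal_probability_points(data):
    """Pair each sorted observation with a theoretical standard normal quantile."""
    ordered = sorted(data)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # assumed plotting position; other conventions exist
        points.append((standard_normal.inv_cdf(p), value))
    return points


# Made-up sample; each tuple is (theoretical quantile, observed value).
for quantile, observed in normal_probability_points([4.8, 5.1, 5.0, 5.6, 4.5, 5.3]):
    print(round(quantile, 3), observed)
```

Plotting the observed values against the theoretical quantiles with any charting tool gives the probability plot itself.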
Linear Regression models the relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line).
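For the case of a single independent variable, the best-fit line can be sketched with ordinary least squares. The function name `fit_line` and the data points are assumptions for illustration. The slope is the covariance of X and Y divided by the variance of X, and the intercept makes the line pass through the point of means.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must be the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

With several independent variables the same idea applies, but the fit is usually computed with matrix methods rather than these closed-form sums.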
Autocorrelation function (ACF) plot: Autocorrelation refers to how correlated a time series is with its own past values. The ACF plot shows that correlation for each lag, up to and including a chosen maximum lag.
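The values behind an ACF plot can be sketched as follows. The function names `autocorrelation` and `acf` and the sample series are assumptions. The estimator used here divides the lagged sum of products by the overall sum of squared deviations, which is a common choice, though not the only one.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of the series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


def acf(series, max_lag):
    """Autocorrelations for lags 0 through max_lag, inclusive."""
    return [autocorrelation(series, lag) for lag in range(max_lag + 1)]


# Made-up series that rises and falls in a regular pattern.
series = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5]
for lag, value in enumerate(acf(series, 3)):
    print(lag, round(value, 3))
```

The value at lag 0 is always 1, because a series is perfectly correlated with itself.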
In linear algebra, the Singular Value Decomposition (SVD) of a matrix is a factorization of that matrix into three matrices. It has some interesting algebraic properties and conveys important geometrical and theoretical insights about linear transformations. It also has some important applications in data science.
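Written out for a real matrix, the factorization takes the following standard form. The symbols A, U, Σ, V, m, and n are notation introduced here for illustration and do not come from the text above.

```latex
A = U \Sigma V^{\mathsf{T}},
\qquad
A \in \mathbb{R}^{m \times n},\quad
U \in \mathbb{R}^{m \times m},\quad
\Sigma \in \mathbb{R}^{m \times n},\quad
V \in \mathbb{R}^{n \times n}

U^{\mathsf{T}} U = I_m,
\qquad
V^{\mathsf{T}} V = I_n,
\qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \ldots),
\quad
\sigma_1 \ge \sigma_2 \ge \cdots \ge 0
```

Here U and V are orthogonal matrices, and Σ is a rectangular diagonal matrix whose nonnegative diagonal entries are the singular values of A.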
Big data is a term used to describe data sets that are too large or complex for typical relational databases to capture, manage, and process in a timely manner. Big data has one or more of the following characteristics: high volume, high velocity, or high variety. Data complexity is being driven by new forms and sources of data from artificial intelligence (AI), mobile devices, social media, and the Internet of Things (IoT).
In this article, I am going to discuss the 5 Vs of Big Data. Please read our previous article, where we discussed What is Big Data and the history of Big Data in detail. At the end of this article, you will understand everything about the 5 Vs of Big Data in detail.