eStatBookU

Introduction to Data Science and Artificial Intelligence using 『eStat』, R, and Python

This book introduces data science and artificial intelligence using 『eStat』, R and Python.

Project leader: Professor Jung Jin Lee, email: jjlee@ssu.ac.kr
Soongsil University, Korea
ADA University, Azerbaijan
New Uzbekistan University, Uzbekistan

This work is in the public domain. Therefore, it can be copied and reproduced without limitation. However, we would appreciate the citation of 『eStat』, http://www.estat.me.

『eStat』 is web-based freeware for statistics education that can be used anytime and anywhere on a PC, tablet, or mobile phone.

Basic operation of 『eStat』 [pdf] [Video]

R is a free software environment for statistical computing and graphics.

R site and download: https://www.r-project.org
Basic operation of R [pdf]

Python is a free, general-purpose programming language that is widely used for data analysis and machine learning.

Python site and download: https://www.python.org
Basic operation of Python

Table of Contents [book]

Chapter 1 Data science and artificial intelligence [book]

1.1 Statistics, data science, machine learning, and artificial intelligence
1.2 General process of data analysis
1.3 Data classification
1.4 Software programs for data analysis
1.5 References
1.6 Exercise

Chapter 2 Data visualization [book]

2.1 Visualization of qualitative data
     2.1.1 Visualization of raw data of a categorical variable
     2.1.2 Visualization of frequency data of a categorical variable
     2.1.3 Visualization of text data
2.2 Visualization of quantitative data
     2.2.1 Visualization of a single quantitative variable
     2.2.2 Visualization of two or more quantitative variables
2.3 R and Python practice
2.4 Exercise

Chapter 3 Data summary and transformation [book]

3.1 Categorical data summary using tables
     3.1.1 Frequency table for a single variable
     3.1.2 Two-dimensional frequency table for two variables
     3.1.3 Multi-dimensional frequency table for several variables
3.2 Quantitative data summary using measures
     3.2.1 Measures for a single quantitative variable
     3.2.2 Measures for several quantitative variables
     3.2.3 Similarity measures of observations
3.3 Data manipulation and transformation
3.4 Dimension reduction: Principal component analysis
3.5 R and Python practice
3.6 Exercise

Chapter 4 Probability and distribution [book]

4.1 Probability
     4.1.1 Calculation rules of probability and conditional probability
     4.1.2 Bayes theorem
4.2 Random variable and distribution
     4.2.1 Binomial distribution
     4.2.2 Normal distribution
4.3 Multivariate probability distribution
4.4 Estimation of a distribution
4.5 Exercise

Chapter 5 Testing hypothesis and regression [book]

5.1 Sampling distribution and estimation
     5.1.1 Sampling distribution of sample means
     5.1.2 Estimation of a population mean
5.2 Testing hypothesis for a population mean
5.3 Testing hypothesis for two population means
5.4 Testing hypothesis for several population means: Analysis of variance
5.5 Regression analysis
     5.5.1 Correlation analysis
     5.5.2 Simple linear regression
     5.5.3 Multiple linear regression
5.6 R and Python practice
5.7 Exercise

Chapter 6 Supervised machine learning for categorical data [book]

6.1 Basic concepts of supervised machine learning and classification
     6.1.1 Evaluation measures of classification model
     6.1.2 Splitting method for training and testing data
6.2 Decision tree model
     6.2.1 Decision tree algorithm
     6.2.2 Selection of a variable for branching
     6.2.3 Categorization of a continuous variable
     6.2.4 Overfitting and pruning decision tree
     6.2.5 R and Python practice - Decision tree
6.3 Naive Bayes classification model
     6.3.1 Bayes classification model
     6.3.2 Naive Bayes classification model for categorical data
     6.3.3 Stepwise variable selection
     6.3.4 R and Python practice - Naive Bayes classification
6.4 Evaluation and comparison of classification model
     6.4.1 Evaluation of classification model
     6.4.2 Comparison of classification models
6.5 Exercise

Chapter 7 Supervised machine learning for continuous data [book]

7.1 Bayes classification model
     7.1.1 R and Python practice - Bayes classification
7.2 Logistic regression model
     7.2.1 R and Python practice - Logistic regression
7.3 Nearest neighbor classification model
     7.3.1 R and Python practice - Nearest neighbor classification
7.4 Neural network model
     7.4.1 Single-layer neural network
     7.4.2 Multilayer neural network
     7.4.3 Artificial intelligence
     7.4.5 R and Python practice - Neural network
7.5 Support vector machine model
     7.5.1 Linear support vector machine
     7.5.2 Nonlinear support vector machine
     7.5.3 R and Python practice - Support vector machine
7.6 Ensemble model
     7.6.1 Bagging
     7.6.2 R and Python practice - Bagging
     7.6.3 Boosting
     7.6.4 R and Python practice - Boosting
     7.6.5 Random forest
     7.6.6 R and Python practice - Random forest
7.7 Classification of multiple groups
7.8 Exercise

Chapter 8 Unsupervised machine learning [book]

8.1 Basic concepts of unsupervised machine learning and clustering
8.2 Hierarchical clustering model
     8.2.1 Method of linkage
     8.2.2 R and Python practice - Hierarchical clustering
8.3 K-Means clustering model
     8.3.1 R and Python practice - K-Means clustering
8.4 Exercise

Chapter 9 Artificial intelligence and other applications [book]

9.1 Artificial intelligence, machine learning, and deep learning
9.2 Text mining
9.3 Web data mining
9.4 Multimedia data mining
9.5 Spatial data analysis

Authors of eBook and developers of 『eStat』

Jung Jin Lee
Emeritus Professor, Soongsil University, Korea
Professor, ADA University, Azerbaijan
Ph.D. in Operations Research, Case Western Reserve University
M.S. in Statistics, Seoul National University
B.S. in Computer Science and Statistics, Seoul National University
President, Korean Statistical Society
Vice President, International Association for Statistical Computing
Council Member, International Statistical Institute (ISI)

Tae Rim Lee
Emeritus Professor, Korea National Open University
Ph.D. in Statistics, Choongang University
M.S. in Statistics, Seoul National University
B.S. in Computer Science and Statistics, Seoul National University
Vice President, Korean Statistical Society
Vice President, International Association for Statistics Education
Vice President, International Biometric Society

Geunseog Kang
Emeritus Professor, Soongsil University, Korea
Ph.D. in Statistics, University of Wisconsin - Madison
M.S. in Statistics, Seoul National University
B.S. in Computer Science and Statistics, Seoul National University

Sung Soo Kim
Emeritus Professor, Korea National Open University
Ph.D. in Statistics, Seoul National University
M.S. in Statistics, Seoul National University
B.S. in Computer Science and Statistics, Seoul National University

Heon Jin Park
Professor, Inha University
Ph.D. in Statistics, Iowa State University
M.S. in Statistics, Seoul National University
B.S. in Computer Science and Statistics, Seoul National University
President, Korean Data Mining Society
Dean, College of Natural Science, Inha University

Song Yong Sim
Hallym University
Ph.D. in Statistics, University of Wisconsin - Madison
M.S. in Statistics, Seoul National University
B.S. in Computer Science and Statistics, Seoul National University

Yoon Dong Lee
Professor, Sogang University
Ph.D. in Statistics, Iowa State University
M.S. in Statistics, Seoul National University
B.S. in Computer Science and Statistics, Seoul National University

Hyun Jo You
Professor, Chungnam National University
Ph.D. in Statistics, Soongsil University
M.S., Ph.D. in Linguistics, Seoul National University
B.S. in Micro-biology,

Hulisi Ogut
Professor, ADA University, Azerbaijan
Ph.D. in Management Science, University of Texas at Dallas
M.S., University of Texas at Dallas
M.A., Boston University
B.A., Bilkent University

Preface

Over the last half century, computer science has evolved at a tremendous rate, bringing about previously unimaginable changes in many areas of society and enriching our lives. The recent merging of computer science with communication technologies has created a digital revolution, called the 4th Industrial Revolution, that will lead to further change.

The 4th Industrial Revolution aims at super-connectedness, super-intelligence, and super-forecasting, and it will bring many revolutionary changes to our lives. The revolution will help us solve many problems, but it will also give us new challenges to solve. The biggest challenge is the analysis and utilization of Big Data.

The analysis of Big Data draws on multiple disciplines, such as Statistics, Mathematics, and Computer Science, together with application areas such as Management; this combined field is called Data Science. Data Science is primarily based on traditional statistical methods and applied mathematics, and it requires a great deal of data manipulation using computer software. Widely used packages such as R, SAS, and SPSS require some professional training. The authors of this book have developed 『eStat』 over many years to help students at all levels learn Data Science easily.

This book introduces basic data visualization in Chapter 2 and data summary and transformation in Chapter 3. Chapters 4 and 5 review basic statistical models for big data analysis. Chapters 6 and 7 discuss models of supervised machine learning, and Chapter 8 discusses models of unsupervised machine learning. Chapter 9 introduces artificial intelligence and other applications of data science.

I thank everyone who has worked together to develop 『eStat』 over the past few years. I also thank the internet communities that helped us during the development of 『eStat』.

Spring 2025

Project Leader: Jung Jin Lee