[Artificial Intelligence / Machine Learning] Handwritten Digit Recognition Using k-NN Algorithm

Scientific Computing & Data Science

cinema4dr12 2016. 6. 19. 21:37

In the previous post (k-Nearest Neighbor Algorithm) we wrote the k-NN algorithm in R. This time we will use that code to write an R script that recognizes handwritten digits.

Data Preparation

First, prepare data consisting of the handwritten digits 0 to 9. Two groups are needed: one is used as the Training Dataset and the other as the Test Dataset.

The data comes from the material distributed with "Machine Learning in Action" by Manning Publications.

To download the data, click [here].

Inside the download, the data is located at MLiA_SourceCode/machinelearninginaction/Ch02/digits.zip. Extract digits.zip into your R project folder so that it sits directly under the project root, giving the following hierarchy:

{YOUR_R_PROJECT_ROOT}/digits/testDigits, and

{YOUR_R_PROJECT_ROOT}/digits/trainingDigits.
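
Before running anything it can help to confirm that the two folders ended up where the scripts expect them. The following is a minimal sketch, not part of the original code: `check_digits_layout` is a hypothetical helper name, and the demo builds a throwaway folder under `tempdir()` so that it runs without the real data.

```r
# Hypothetical helper: report whether each expected folder exists
# and how many files it contains.
check_digits_layout <- function(root = ".") {
  dirs <- file.path(root, "digits", c("trainingDigits", "testDigits"))
  data.frame(
    path   = dirs,
    exists = dir.exists(dirs),
    # list.files() returns character(0) for a missing folder, so the count is 0
    files  = vapply(dirs, function(d) length(list.files(d)), integer(1),
                    USE.NAMES = FALSE),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Demo on a temporary folder (assumption: stands in for the real project root)
demo_root <- file.path(tempdir(), "layout_demo")
dir.create(file.path(demo_root, "digits", "trainingDigits"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "digits", "testDigits"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, "digits", "trainingDigits", "0_0.txt"))

layout <- check_digits_layout(demo_root)
print(layout)
```

Calling `check_digits_layout(".")` from the project root should then report both folders as existing, each with a non-zero file count.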

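Preparing Test Data

If you want to test digits you have written yourself, in addition to the test data provided with "Machine Learning in Action", see "Converting a Binary Image to a Text File".

Note that the R functions in this post require images that are 32 pixels wide and 32 pixels high, that is, 32 × 32. If you would rather not write the conversion code yourself, download the following source code: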

ConverImageToText.R

ConverImageToText.R

####################################################################################################
# @function : ConvertImageToText
# @author : Geol Choi, phD
# @email  : cinema4dr12@gmail.com
# @date   : 06/10/2016
####################################################################################################
ConvertImageToText <- function(imgName, threshold, fileName) {
  library("EBImage");
  
  img <- readImage(imgName);
  
  # convert input image to grayscale image
  colorMode(img) <- Grayscale;
  
  # resize the image to 32-by-32
  img = resize(img, w = 32, h = 32);
  
  # make a binary image
  img_bin <- img > threshold;
  
  # display the binary image to plot
  display(img_bin, method="raster");
 
  # extract the first color element of the grayscaled image 
  img_new <- img_bin[,,1];
  
  # EBImage indexes pixels as [x, y]: dim 1 is the width, dim 2 the height
  width <- dim(img_new)[1];
  height <- dim(img_new)[2];
  
  # if "TRUE" set the value to "0" and if "FALSE" set the value to "1"
  img_new[img_new == "TRUE"] <- "0";
  img_new[img_new == "FALSE"] <- "1";
  
  # write into file
  cat(paste(img_new[,1], sep = "", collapse = ""), file = fileName, sep = "\n");
  for(i in 2:height) {
    cat(paste(img_new[,i], sep = "", collapse = ""), file = fileName, sep="\n", append = TRUE);
  }
}
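
The thresholding and text conversion can also be tried without EBImage. The following is a minimal base-R sketch of the same idea on a plain numeric matrix; `binarize_to_lines` is a hypothetical helper, and it assumes gray values in [0, 1] stored as [x, y], with bright pixels written as "0" and dark pixels as "1" as in the function above.

```r
# Hypothetical helper: convert a grayscale matrix (indexed [x, y], values in
# [0, 1]) into text lines, one line per image row, "1" marking dark pixels.
binarize_to_lines <- function(gray, threshold = 0.5) {
  bits <- ifelse(gray > threshold, "0", "1")
  # each column of an [x, y] matrix holds one image row
  apply(bits, 2, paste, collapse = "")
}

# 4 x 4 toy image: a dark vertical stroke at x = 2 on a bright background
gray <- matrix(0.9, nrow = 4, ncol = 4)
gray[2, ] <- 0.1

txt_lines <- binarize_to_lines(gray)
print(txt_lines)
```

Every line of the toy result has a single "1" in the second position, which is the vertical stroke. The lines could then be saved with `writeLines(txt_lines, fileName)`.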

k-NN Algorithm

We will use the k-NN algorithm for handwriting recognition. For how to write the R code that implements k-NN, see the previous post, "k-Nearest Neighbor Algorithm".

If you would rather not write it yourself, download the following code.

kNN.R

Copy this file into your own R project folder.

kNN.R

#####################################################
# @function: kNN() - k-nearest neighbor algorithm
# @input:  
#   - df  : data frame for training data set
#   - inX : vector for test data
#   - k   : k-nearest neighbor
# @return:  result classifier
#####################################################
# @author: Geol Choi, ph.D
# @email:  cinema4dr12@gmail.com
# @date:   16/06/2016
#####################################################

kNN <- function(df, inX, k) {
  
  # extract group and label
  len <- dim(df)[2] - 1;
  
  # initialize matrix
  dataSet <- matrix(ncol = len, nrow = dim(df)[1]);
  
  for(i in 1:len) {
    dataSet[,i] <- as.matrix(df[,i]);
  }
  
  # set classes (last column of df)
  labels <- df[,dim(df)[2]];
  
  # size of dataset
  dataSetSize <- dim(dataSet)[1];
  
  # create test matrix
  testMat = matrix(nrow = dataSetSize, ncol = length(inX));
  
  for(i in 1:dataSetSize) {
    testMat[i,] <- inX;
  }
  
  # difference between testMat and dataSet
  diffMat <- testMat - dataSet;
  
  # squared matrix difference
  sqDiffMat <- diffMat**2.0;
  
  # row sums of sqDiffMat
  sqDistances <- rowSums(sqDiffMat, na.rm = FALSE, dims = 1);
  
  # order of index
  sortedDistIndicies <- sort.int(sqDistances, index.return = TRUE)$ix;
  
  result <- NULL;
  
  for(i in 1:k) {
    iLabel <- labels[sortedDistIndicies[i]];
    result <- c(result, iLabel);
  }
  
  uniqueLabels <- unique(labels);
  
  # initialize data frame
  ResultClass <- data.frame(matrix(ncol = 2, nrow = 1));
  names(ResultClass) <- c("LABEL", "COUNT");
  
  for(i in 1:length(uniqueLabels)) {
    label <- uniqueLabels[i];
    count <- sum(result == label);
    ResultClass[i,] <- c(label, count);
  }
  
  # transform COUNT to numeric data
  ResultClass[,2] <- as.numeric(ResultClass$COUNT);
  
  # sorting by COUNT (descending order)
  ResultClass <- ResultClass[order(-ResultClass$COUNT),];
  
  return(ResultClass$LABEL[1]);
}
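
To see the distance-and-vote logic in isolation, here is a compact sketch on toy data. It is not the author's kNN.R: `knn_sketch` is a hypothetical, vectorized variant that takes the feature matrix and the label vector separately and returns the winning label as a character string. Like the function above, it ranks neighbors by squared Euclidean distance, which gives the same ordering as the Euclidean distance itself.

```r
# Hypothetical compact k-NN, for illustration only
knn_sketch <- function(train, labels, x, k) {
  # sweep() subtracts the query vector x from every row of train
  diffs <- sweep(train, 2, x)
  sq_dist <- rowSums(diffs^2)
  # labels of the k nearest training rows
  nearest <- labels[order(sq_dist)[seq_len(k)]]
  votes <- table(nearest)
  # label with the most votes (the first one in case of a tie)
  names(votes)[which.max(votes)]
}

# Two well-separated clusters labeled 0 and 1
train <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
labels <- c(0, 0, 0, 1, 1, 1)

pred_a <- knn_sketch(train, labels, c(0.2, 0.1), k = 3)
pred_b <- knn_sketch(train, labels, c(5.5, 5.2), k = 3)
print(c(pred_a, pred_b))
```

The first query point lies next to the cluster labeled 0 and the second next to the cluster labeled 1, so the two predictions differ accordingly.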

Handwritten Digit Recognition

Now let's recognize handwritten digits using the Training Dataset and the Test Dataset.

The overall algorithm is simple:

Step 1. Convert the color image to grayscale.

Step 2. Resize the grayscale image to 32 × 32.

Step 3. Apply a threshold to convert the image to binary text.

Step 4. Flatten the 2D binary matrix produced in Step 3 into a 1D vector.

Step 5. Store the 1D vector of each training sample as one row of a matrix.

Step 6. The resulting matrix has one row per training sample and 32 × 32 = 1024 columns.

Step 7. Store a label vector holding the digit (0 to 9) of each training sample.

Step 8. For each test sample, compute the distance to every training sample and sort the distances in ascending order.

Step 9. Among the k nearest training samples, count the occurrences of each label and output the most frequent label (class) as the recognized digit.
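
Steps 4 to 7 can be sketched in a few lines. The helpers below are hypothetical (`read_digit_vector` and `label_from_filename` do not appear in the original code), and the demo writes a synthetic 32 × 32 file into `tempdir()` so that it runs without the real data set. The file-name convention, digit first and then an underscore, is the one the full script relies on.

```r
# Hypothetical helper: read a 32-line text file of 0/1 characters and flatten
# it row by row into a single integer vector (Step 4).
read_digit_vector <- function(path) {
  rows <- readLines(path)
  as.integer(unlist(strsplit(rows, "")))
}

# Hypothetical helper: "3_12.txt" -> 3, the part before the first underscore
# (Step 7).
label_from_filename <- function(path) {
  as.integer(strsplit(basename(path), "_")[[1]][1])
}

# Synthetic 32 x 32 sample (assumption: stands in for a real digit file)
demo_dir <- file.path(tempdir(), "digit_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_file <- file.path(demo_dir, "3_12.txt")
writeLines(rep(strrep("01", 16), 32), demo_file)

x <- read_digit_vector(demo_file)
y <- label_from_filename(demo_file)
print(length(x))
print(y)
```

With a vector of file paths `files`, the training matrix of Steps 5 and 6 could then be assembled as `do.call(rbind, lapply(files, read_digit_vector))`.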

The complete code is as follows:

ClassifyHandWrittenDigit.R

#####################################################
# @function: ClassifyHandWrittenDigit()
# @input:   NONE
# @return:  NONE
#####################################################
# @author: Geol Choi, ph.D
# @email:  cinema4dr12@gmail.com
# @date:   16/06/2016
#####################################################

ClassifyHandWrittenDigit <- function() {
  
  source('kNN.R');
  
  ###############################################
  # GENERATE DATA.FRAME FOR TRAINING DATA
  ###############################################
  
  # file list
  trainingDigitsPath <- "./digits/trainingDigits/";
  fileList <- list.files(path = trainingDigitsPath, all.files = TRUE);
  
  nRow <- 0;
  pixelCounts <- 32 * 32;
  rowElems <- matrix(ncol = pixelCounts, nrow = length(fileList) - 2);
  labels <- NULL;
  
  ## fill in matrix with binary data from fileList
  ## (start at 3 to skip the "." and ".." entries that all.files = TRUE returns)
  for(i in 3:length(fileList)) {
    nRow <- nRow + 1;
    fileName <- fileList[i];
    tmp <- strsplit(fileName, "_");
    digit <- tmp[[1]][1];
    labels[nRow] <- as.integer(digit);
    fileName <- paste(trainingDigitsPath, fileName, sep = "");
    
    ## create file connection
    con <- file(description = fileName, open = "r");
    line = readLines(con);
    long = length(line);
    
    ## new row elements
    newRowElem <- NULL;
    for (j in 1:long) {
      tmp <- line[j];
      tmp <- as.integer(unlist(strsplit(tmp,"")));
      newRowElem <- c(newRowElem, tmp);
    }
    
    ## close file connection
    close(con);
    
    ## add new row elements
    rowElems[nRow,] <- newRowElem;
  }
  
  ## combine labels to df
  df <- as.data.frame(rowElems);
  df$CLASS <- labels;
  
  ###############################################
  # k-NN TEST FOR HANDWRITTEN DIGITS RECOGNITION
  ###############################################
  
  testDigitsPath <- "./digits/testDigits/";
  fileList <- list.files(path = testDigitsPath, all.files = TRUE);
  
  errCount <- 0;
  trials <- 0;
  
  for(i in 3:length(fileList)) {
    fileName <- fileList[i];
    tmp <- strsplit(fileName, "_");
    digit <- as.integer(tmp[[1]][1]);
    classNum <- as.integer(digit);
    fileName <- paste(testDigitsPath, fileName, sep = "");
    
    ## create file connection
    con <- file(description = fileName, open = "r");
    line = readLines(con);
    long = length(line);
    
    ## new row elements
    inX <- NULL;
    for (j in 1:long) {
      tmp <- line[j];
      tmp <- as.integer(unlist(strsplit(tmp,"")));
      inX <- c(inX, tmp);
    }
    
    ## close file connection
    close(con);
    
    classifierResult <- kNN(df, inX, 3);
    
    if(digit != as.integer(classifierResult)) {
      errCount <- errCount + 1;
    }
    
    trials <- trials + 1;
    prt <- sprintf("Trials: %d , Classifier result: %d , Real answer : %d",
                   as.integer(trials),
                   as.integer(classifierResult),
                   as.integer(digit));
    print(prt);
  }
  
  print("==========================================================");
  prt <- sprintf("Error counts: %d out of %d trials.", errCount, trials);
  print(prt);
  # report the error rate as a percentage of the trials actually run
  prt <- sprintf("Error rate: %f %%", 100 * errCount / trials);
  print(prt);
}

Try working through the code yourself with the algorithm described above.

Results

Running the code produces the following.

Input

> source('D:/MyProjects/PROGRAMMING/R/Working/Projects/011-MachineLearning/ClassifyHandWrittenDigit.R')
> ClassifyHandWrittenDigit()

Output

.............
[1] "Trials: 938 , Classifier result: 9 , Real answer : 9"
[1] "Trials: 939 , Classifier result: 9 , Real answer : 9"
[1] "Trials: 940 , Classifier result: 9 , Real answer : 9"
[1] "Trials: 941 , Classifier result: 9 , Real answer : 9"
[1] "Trials: 942 , Classifier result: 9 , Real answer : 9"
[1] "Trials: 943 , Classifier result: 9 , Real answer : 9"
[1] "Trials: 944 , Classifier result: 9 , Real answer : 9"
[1] "Trials: 945 , Classifier result: 9 , Real answer : 9"
[1] "Trials: 946 , Classifier result: 9 , Real answer : 9"
[1] "=========================================================="
[1] "Error counts: 12 out of 946 trials."
[1] "Error rate: 1.268499 %"

As the output shows, only 12 of the 946 test samples were misclassified, which is an error rate of about 1.27% (12/946 ≈ 0.0127).

Put the other way, the accuracy is 98.7%, which is quite high.
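
The relationship between the error count, the error rate, and the accuracy can be checked with a few lines of arithmetic. The sketch below only reuses the two numbers from the run above (12 errors in 946 trials); the variable names are illustrative.

```r
err_count <- 12
trials <- 946

error_rate   <- err_count / trials   # fraction of misclassified samples
error_pct    <- 100 * error_rate     # the same value as a percentage
accuracy_pct <- 100 - error_pct

cat(sprintf("Error rate: %.2f %%, accuracy: %.1f %%\n", error_pct, accuracy_pct))
```

The fraction 0.0127 and the percentage 1.27% describe the same result; only the percentage should carry a "%" sign.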
