Table of contents

Volume 971

2018

International Conference on Data and Information Science 5–6 December 2017, Telkom University, Indonesia

Accepted papers received: 15 February 2018
Published online: 05 April 2018

Preface

011001
The following article is Open access

ICoDIS

In the name of Allah, The Most Gracious, The Most Merciful.

This inaugural conference is organized by the School of Computing, Telkom University, and supported by the Indonesia Data Science Society (IDSS) and the Indonesia Association of Computational Linguistics (INACL). Advances in computing technology have driven people to generate vast amounts of data, with a size and variety never before experienced in the history of computing. The need to process and analyze big data attracts researchers' interest in proposing solutions. ICoDIS is organized to gather researchers to disseminate their relevant work on data science, computational linguistics, and information science.

011002

All papers published in this volume of Journal of Physics: Conference Series have been peer reviewed through processes administered by the proceedings Editors. Reviews were conducted by expert referees to the professional and scientific standards expected of a proceedings journal published by IOP Publishing.

Papers

Data Science

012001

In this paper, we propose a design of a fuzzy relational database that deals with conditional probability relations using fuzzy relational calculus. Several previous studies have examined equivalence classes in fuzzy databases using similarity or approximate relations. Investigating fuzzy dependency using equivalence classes is an interesting topic. Our goal is to introduce a formulation of a fuzzy relational database model using the relational calculus on the category of fuzzy relations. We also introduce general formulas of the relational calculus for database operations such as 'projection', 'selection', 'injection' and 'natural join'. Using the fuzzy relational calculus and conditional probabilities, we introduce notions of equivalence class, redundancy, and dependency in the theory of fuzzy relational databases.

012002

A crowdfunding platform is a place where startups publicly present their ideas in order to get their projects funded. Crowdfunding platforms such as Kickstarter are becoming popular today: they provide an efficient way for startups to get funded without liabilities, and they offer a variety of project categories in which to participate. A safety procedure is available to ensure a low-risk environment: a project promoted by a startup must reach its funding target, and if it fails to reach the target, no investment takes place. This motivates startups to be more active in promoting or disseminating their project ideas, and it also protects investors from losing money. The objective of this study is to predict the success of proposed projects and to map investor trends using a data mining framework. To achieve this objective, we propose three models. The first model predicts whether a project will succeed or fail using K-Nearest Neighbour (KNN). The second model predicts the number of successful projects using an Artificial Neural Network (ANN). The third model maps investor trends in project investment using the K-Means clustering algorithm. KNN gives 99.04% model accuracy, the best ANN configuration has 16-14-1 neuron layers and a learning rate of 0.2, and K-Means gives the best separation with 6 clusters. The results of these models can help startups or investors make decisions regarding startup investment.

012003

Cancer is a leading cause of death worldwide, although a significant proportion of cases can be cured if detected early. In recent decades, microarray technology has taken an important role in the diagnosis of cancer. Using data mining techniques, microarray data classification can improve the accuracy of cancer diagnosis compared to traditional techniques. Microarray data are characterized by a small number of samples but a very high dimension. This poses a challenge for researchers to provide microarray data classification solutions with high performance in both accuracy and running time. This research proposes using Principal Component Analysis (PCA) as a dimension reduction method along with a Support Vector Machine (SVM) optimized by kernel functions as the classifier for microarray data. The proposed scheme was applied to seven data sets using 5-fold cross validation, and then evaluated and analysed in terms of both accuracy and running time. The results showed that the scheme can obtain 100% accuracy for the Ovarian and Lung Cancer data when Linear and Cubic kernel functions are used. In terms of running time, PCA greatly reduced the running time for every data set.

012004

Cancer is one of the deadliest diseases: according to WHO data, there were more than 8.8 million deaths caused by cancer in 2015, and this number will increase every year if the disease is not addressed earlier. Microarray data have become one of the most popular subjects of cancer-identification studies in the health field, since they can be used to examine levels of gene expression in cell samples and so analyse thousands of genes simultaneously. Using data mining techniques, we can classify microarray samples to identify whether or not they indicate cancer. In this paper we discuss research that applies data mining techniques to microarray data, such as Support Vector Machine (SVM), Artificial Neural Network (ANN), Naive Bayes, k-Nearest Neighbor (kNN), and C4.5, together with a simulation of the Random Forest algorithm with dimension reduction using Relief. The paper reports the performance measure (accuracy) of the classification algorithms (SVM, ANN, Naive Bayes, kNN, C4.5, and Random Forest). The results show that the accuracy of the Random Forest algorithm is higher than that of the other classification algorithms. It is hoped that this paper can provide some information about the speed, accuracy, performance, and computational cost of each data mining classification technique on microarray data.

012005

Infertility in the female reproductive system can result from inhibition of the follicle maturation process, which causes numerous follicles to form, a condition called polycystic ovaries (PCO). PCO detection is still performed manually by a gynecologist, who counts the number and size of follicles in the ovaries, so it takes a long time and demands high accuracy. In general, PCO can be detected by stereological calculation or by feature extraction and classification. In this paper, we design a system to classify PCO using feature extraction (the Gabor Wavelet method) and a Competitive Neural Network (CNN). CNN was selected because it combines the Hamming Net and the MaxNet, so that data can be classified based on the specific characteristics of ultrasound data. In system testing, the Competitive Neural Network achieved its highest accuracy of 80.84% with a processing time of 60.64 seconds (using 32 feature vectors and weight and bias values of 0.03 and 0.002, respectively).

012006

With the internet, anyone can publish their creations as digital data simply and inexpensively, and the data can be accessed easily by everyone. However, a problem arises when someone else claims the creation as their own property or modifies part of it. This makes copyright protection necessary; one example is watermarking of digital images. Applying a watermarking technique to digital data, especially images, enables total invisibility when the watermark is inserted in a carrier image. The carrier image will not undergo any decrease in quality, and the inserted image will not be affected by attacks. In this paper, watermarking is implemented on digital images using Singular Value Decomposition based on the Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT), with the expectation of good watermarking performance. In this case, a trade-off arises between the invisibility and the robustness of the image watermark. In the embedding process, the watermarked image has good quality for a scaling factor < 0.1. The quality of the watermarked image at decomposition level 3 is better than at level 2 and level 1. Embedding the watermark in the low-frequency band is robust to Gaussian blur, rescaling, and JPEG compression, whereas embedding in the high-frequency band is robust to Gaussian noise.

012007

Customer churn has become a significant problem and a challenge for telecommunication companies such as PT. Telkom Indonesia. It is necessary to evaluate how serious the churn problem is so that the company's management can devise appropriate strategies to minimize churn and retain customers. The customer churn data categorized as churn Atas Permintaan Sendiri (APS, at the customer's own request) in this company are imbalanced, and this issue is one of the challenging tasks in machine learning.

This study investigates how to handle class imbalance in churn prediction using a combination of the Synthetic Minority Over-Sampling Technique (SMOTE) and Random Under-Sampling (RUS) with the Bagging method to improve churn prediction performance. The dataset used is Broadband Internet data collected from Telkom Regional 6 Kalimantan.

The research first uses data preprocessing, with the sampling techniques SMOTE and RUS, to balance the imbalanced dataset and to select features, and then builds a churn prediction model using the Bagging method and C4.5.
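
The resampling idea named in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data points, the helper names `smote` and `random_undersample`, and the choice of `k` are all assumptions made for illustration. SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, while RUS simply discards a random part of the majority class.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class (RUS)."""
    return random.Random(seed).sample(majority, n_keep)

# Made-up two-feature data: 3 minority (churn) and 10 majority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
majority = [(float(i), float(i % 3)) for i in range(5, 15)]

new_points = smote(minority, n_new=3)
balanced_minority = minority + new_points
balanced_majority = random_undersample(majority, n_keep=6)
```

After both steps the two classes have six samples each; a bagged classifier would then be trained on the balanced set.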

012008

Process mining is a data analytics approach to discover and analyse process models based on the real activities captured in information systems. There is a growing body of literature on process mining in healthcare, including oncology, the study of cancer. In earlier work we found 37 peer-reviewed papers describing process mining research in oncology, with a regular complaint being the limited availability and accessibility of datasets with suitable information for process mining. Publicly available datasets are one option, and this paper describes the potential to use MIMIC-III for process mining in oncology. MIMIC-III is a large open access dataset of de-identified patient records. There are 134 publications listed as using the MIMIC dataset, but none of them have used process mining. The MIMIC-III dataset has 16 event tables which are potentially useful for process mining, and this paper demonstrates the opportunities to use MIMIC-III for process mining in oncology. Our research applied the L* lifecycle method to provide a worked example showing how process mining can be used to analyse cancer pathways. The results and data quality limitations are discussed along with opportunities for further work and reflection on the value of MIMIC-III for reproducible process mining research.

012009

Microblogging sites have millions of people sharing their thoughts daily because of their characteristically short and simple manner of expression. Sentiment analysis is often used to analyse customer opinions regarding brand images or products. However, not all sentiment generated by existing machine-based algorithms yields satisfying results. This is mostly due to the uniformity of the informal language used in social media sentences. This condition also occurred with Telkom UData in our preliminary study, where the machine-based approach provided less than optimal results in analysing sentiment. This research offers a concept with human interaction using crowdsourcing, where people are involved in analysing sentiments while forming a new training dataset at the same time. The results show that sarcastic and contradictory sentences can be recognized by humans and used as new training datasets for further machine learning. In these experiments, the approach is likely to increase the accuracy of the sentiments in UData, moving them from neutral to positive or negative polarity, by up to 39%. We also simulated a trust concept through sociometry to ensure that the crowdsource workers are trusted and capable enough in analysing sentiments on social media.

012010

Dengue hemorrhagic fever (DHF) is a disease caused by the dengue virus of the genus Flavivirus, family Flaviviridae. Indonesia is the country with the highest number of dengue cases in Southeast Asia. In addition to mosquitoes as vectors and humans as hosts, environmental and social factors also contribute to the spread of dengue fever. Preventing an epidemic requires fast and accurate action, which is possible only with appropriate information about the occurrence of the epidemic. Complete and accurate information on the spread pattern of endemic areas is therefore necessary so that precautions can be taken as early as possible. Information on dispersal patterns can be obtained by various methods based on empirical and theoretical considerations. One such method is based on the estimated number of infected patients in a region across space and time. In the first step of this research, the number of DHF patients in 2016 to 2018 is predicted from 2010 to 2015 data using GSTAR(1, 1). In the second phase, the distribution pattern of dengue-affected areas is predicted. Then, based on the characteristics of DHF epidemic trends (down, stable, or rising), the distribution patterns of dengue fever areas are analysed with IDW and Kriging (ordinary and universal Kriging). The difference between IDW and Kriging lies in the initial process that underlies the prediction. The experimental results show that the dispersion patterns of dengue epidemic areas obtained with IDW and Ordinary Kriging are similar over the period studied.
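
As a rough illustration of the interpolation step, and assuming IDW here denotes inverse distance weighting, the sketch below estimates a value at an unsampled location as a weighted average of known samples, with weights proportional to 1/distance^power. The sample coordinates and case counts are made up, and the function name `idw` and the default `power=2.0` are assumptions, not details taken from the paper.

```python
import math

def idw(known, target, power=2.0):
    """Inverse distance weighting: estimate the value at `target` as a
    distance-weighted average of known (x, y, value) samples."""
    num = 0.0
    den = 0.0
    for x, y, value in known:
        d = math.hypot(target[0] - x, target[1] - y)
        if d == 0.0:
            return value  # target coincides with a sample point
        w = 1.0 / d ** power  # closer samples get larger weights
        num += w * value
        den += w
    return num / den

# Hypothetical case counts at three region centroids (made-up numbers).
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 30.0), (0.0, 2.0, 50.0)]
estimate_mid = idw(samples, (1.0, 0.0))
estimate_at_sample = idw(samples, (0.0, 0.0))
```

Unlike Kriging, this weighting depends only on distance and does not first fit a model of spatial correlation to the data, which is one way to read the abstract's remark about the differing initial process.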

012011

Microarray technology is a technology that is able to read the structure of genes. Analysis is important for this technology in order to decide which attributes are more important than others. Microarray technology can obtain cancer information for diagnosis from a person's genes. Preparing microarray data is a major problem and takes a long time because microarray data contain a high number of insignificant and irrelevant attributes. A method is therefore needed to reduce the dimension of microarray data without eliminating the important information in each attribute. This research uses Mutual Information to reduce the dimension. The system is built with a machine learning approach, specifically Bayes' theorem, which uses a statistical and probabilistic approach. Combining both methods is expected to be powerful for microarray data classification. The experimental results show that the system classifies microarray data well, with the highest F1-scores of 91.06% using a Bayesian Network and 88.85% using Naïve Bayes.
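
A minimal sketch of mutual-information feature ranking is given below, assuming discretized feature values; the paper's own discretization and selection threshold are not stated in the abstract, and the names `mutual_information` and `rank_features` are hypothetical. Each feature is scored by I(X; Y) = sum over (x, y) of p(x, y) log2[p(x, y) / (p(x) p(y))] against the class label, and the highest-scoring features are kept.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def rank_features(columns, labels):
    """Return feature indices sorted by decreasing mutual information."""
    scores = [mutual_information(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)

# Made-up toy data: two binary features and a binary class label.
labels = [0, 0, 1, 1]
features = [
    [0, 1, 0, 1],  # statistically independent of the label
    [0, 0, 1, 1],  # identical to the label
]
ranking = rank_features(features, labels)
```

In this toy case the second feature carries one full bit of information about the label and the first carries none, so the second is ranked first.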

012012

Nowadays, the electrooculogram is regarded as one of the most important biomedical signals for measuring and analyzing eye movement patterns. It is therefore helpful in designing an EOG-based Human Computer Interface (HCI). In this research, electrooculography (EOG) data were obtained from five volunteers. The EOG data were then preprocessed before feature extraction methods were employed to further reduce the dimensionality of the data. Three feature extraction approaches were put forward, namely statistical parameters, autoregressive (AR) coefficients using the Burg method, and power spectral density (PSD) using the Yule-Walker method. These features then became inputs to both an artificial neural network (ANN) and a support vector machine (SVM). The performance of each combination of feature extraction method and classifier was presented and analyzed. It was found that statistical parameters + SVM achieved the highest classification accuracy of 69.75%.

012013

Visualizing time series big data has become very important. In this paper we discuss a new analysis method called "statistical shape analysis" or "geometry driven statistics" applied to time series statistical data in economics. We analyse the changes in agriculture value added and industry value added (percentage of GDP) from 2000 to 2010 in Asia. We handle the data as a set of landmarks on a two-dimensional image to see the deformation using the principal components. The key point of the method is the principal components of the given formation, which are eigenvectors of its bending energy matrix. The local deformation can be expressed as a set of non-affine transformations, which give us information about the local differences between 2000 and 2010. Because a non-affine transformation can be decomposed into a set of partial warps, we present the partial warps visually. Statistical shape analysis is widely used in biology but, in economics, no application can be found. In this paper, we investigate its potential for analysing economic data.

012014

The challenge for a face-based security system is how to detect facial image falsification such as facial image spoofing. Spoofing occurs when someone tries to pose as a registered user to obtain illegal access and gain advantage from the protected system. This research implements a facial image spoofing detection method based on image texture analysis. The proposed texture analysis method combines the Local Binary Pattern (LBP) and Gray Level Co-occurrence Matrix (GLCM) methods. The experimental results show that spoofing detection using the combination of LBP and GLCM achieves a higher detection rate than using only the LBP feature or only the GLCM feature.
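
To make the LBP half of the texture descriptor concrete, the sketch below computes the basic 8-neighbour LBP code of a 3x3 patch and a 256-bin histogram over an image. It covers only the basic LBP variant, not GLCM or the paper's classifier; the bit ordering (clockwise from the top-left neighbour) is one common convention chosen here as an assumption, and the pixel values are made up.

```python
def lbp_code(patch):
    """Basic 8-neighbour LBP code of a 3x3 grayscale patch.

    Each neighbour contributes a 1 bit if it is >= the centre pixel.
    Bits are assigned clockwise starting at the top-left neighbour
    (an assumed convention; other orderings are equally valid)."""
    center = patch[1][1]
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist

# Made-up 3x3 patch with centre value 6.
patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
code = lbp_code(patch)

# A flat 4x4 image: every neighbour equals the centre, so every code is 255.
flat_hist = lbp_histogram([[3] * 4 for _ in range(4)])
```

The histogram is what would typically be passed on, alongside GLCM statistics, as a feature vector for the spoofing classifier.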

012015

k-Path centrality is regarded as an effective centrality measure in which an influential node is estimated as a node that information paths pass through frequently. In this paper, k-Path centrality is computed with a randomized-algorithm approach in order to: (1) determine the ranking of influential users on the social media platform Twitter; and (2) ascertain the influence of the parameter α on the calculation of k-Path centrality. The findings show that k-Path centrality with a randomized-algorithm approach can be used to rank users by their influence on the dissemination of information on Twitter. Furthermore, the findings show that the parameter α influences the running time and the ranking results: the smaller the value of α, the longer the running time, yet the more stable the ranking results.

012016

Polycystic Ovary Syndrome (PCOS) is a reproductive disorder that causes irregular menstrual periods. The hormones insulin and androgen play major roles in this disorder. The syndrome should be detected early, since it can lead to more serious conditions such as cardiovascular disease, diabetes, and obesity. It is detected by analyzing ovary morphology and by hormone tests. The more economical test is identifying the ovary morphology using ultrasonography. To classify whether an ovary is normal or has polycystic ovary (PCO) follicles, the analysis is done manually by a gynecologist. This paper designs a system to detect PCO that uses the Gabor Wavelet method for feature extraction and an Elman Neural Network to classify PCO and non-PCO. The Elman Neural Network was chosen because it contains a context layer to recall the previous state. This paper compares the accuracy and processing time for each dataset and also tests the Elman network's parameters, such as layer delay, hidden layer, and training function. Based on the tests in this paper, the highest accuracy is 78.1% with 32 features.

012017

The paper discusses the prediction of the Jakarta Composite Index (JCI) on the Indonesia Stock Exchange. The study uses 1286 days of JCI historical data to predict the value of the JCI one day ahead. This paper proposes making predictions in two stages. The first stage uses Fuzzy Time Series (FTS) to predict the values of ten technical indicators, and the second stage uses Support Vector Regression (SVR) to predict the value of the JCI one day ahead, resulting in a hybrid FTS-SVR prediction model. The performance of this combined model is compared with that of a single-stage prediction model using SVR only. Ten technical indicators are used as input for each model.

012018

Cancer is one of the major causes of morbidity and mortality worldwide. There is therefore a need for a system that can analyze and identify whether a person is suffering from cancer using microarray data derived from the patient's deoxyribonucleic acid (DNA). However, microarray data have thousands of attributes, which makes data processing challenging. This is often referred to as the curse of dimensionality. This study therefore builds a system capable of detecting whether or not a patient has cancer. The algorithms used are a Genetic Algorithm for feature selection and a Momentum Backpropagation Neural Network as the classification method, with data from the Kent Ridge Bio-medical Dataset. In system testing, the system detected Leukemia and Colon Tumor with a best accuracy of 98.33% for the colon tumor data and 100% for the leukemia data. The Genetic Algorithm as a feature selection algorithm improves system accuracy, from 64.52% to 98.33% for the colon tumor data and from 65.28% to 100% for the leukemia data, and the use of momentum parameters accelerates convergence in the training process of the neural network.

Information Science

012019

This paper compares numerical solutions of the steady and unsteady state heat distribution models on a projectile. The best location for installing the MEMS on the projectile is investigated based on the surface temperature. Two numerical iteration methods, Jacobi and Gauss-Seidel, are used to solve the steady state heat distribution model on the projectile. The results of Jacobi and Gauss-Seidel are identical, but the iteration cost of the two methods differs: Jacobi's method requires 350 iterations, whereas Gauss-Seidel requires 188 iterations and is therefore faster. The comparison between the simulation with the steady state model and a reference unsteady state model is satisfactory. Moreover, the best candidate location for installing the MEMS on the projectile is the point T(10, 0), which has the lowest temperature among the points considered. The temperatures at T(10, 0) obtained with Jacobi and Gauss-Seidel for scenarios 1 and 2 are 307 and 309 kelvin, respectively.
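
The difference in iteration cost between the two methods can be reproduced on a toy problem. The sketch below solves the steady-state heat (Laplace) equation on a small square grid; the geometry, boundary temperatures, grid size, and tolerance are assumptions for illustration and do not match the paper's projectile model. Jacobi updates every cell from the previous sweep, while Gauss-Seidel reuses values already updated in the current sweep.

```python
def solve_laplace(n, method, tol=1e-6, max_iter=100000):
    """Solve the steady-state heat equation on an n x n grid with the top
    edge held at 100 and the other edges at 0. Returns (grid, iterations)."""
    u = [[0.0] * n for _ in range(n)]
    for j in range(n):
        u[0][j] = 100.0
    for it in range(1, max_iter + 1):
        # Jacobi reads from a frozen copy of the previous sweep;
        # Gauss-Seidel reads from the grid that is being updated in place.
        src = [row[:] for row in u] if method == "jacobi" else u
        change = 0.0
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new = 0.25 * (src[i - 1][j] + src[i + 1][j]
                              + src[i][j - 1] + src[i][j + 1])
                change = max(change, abs(new - u[i][j]))
                u[i][j] = new
        if change < tol:  # stop when the largest update is below tolerance
            return u, it
    return u, max_iter

jacobi_grid, jacobi_iters = solve_laplace(12, "jacobi")
gs_grid, gs_iters = solve_laplace(12, "gauss-seidel")
print(jacobi_iters, gs_iters)
```

Both runs converge to essentially the same temperature field, with Gauss-Seidel needing fewer sweeps, which is consistent with the pattern reported in the abstract.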

012020

In this paper, a numerical implementation of the 1D Variational Boussinesq (VB) wave model in a staggered grid scheme is discussed. The staggered grid scheme is based on the idea proposed by Stelling & Duinmeijer (2003), who implemented the scheme for the non-dispersive Shallow Water Equations in conservative form. Here, we extend the staggered scheme to the VB wave model. To test the accuracy of the implementation, we simulate the propagation of a solitary wave and compare it against the analytical solution. Moreover, to test the dispersiveness of the model, we simulate a standing wave and compare it against the analytical solution. The simulation results show good agreement with the analytical solutions.

012021

Online transportation services are known for their accessibility, transparency, and affordable tariffs. These points give online transportation advantages over existing conventional transportation services. Online transportation is an example of a disruptive technology that changes the relationship between customers and companies. In Indonesia, competition among online transportation providers is high, so the companies must maintain and monitor their service level. To understand their position, we apply both sentiment analysis and multiclass classification to customer opinions. From negative sentiments, we can identify problems and establish problem-solving priorities. As a case study, we use the most popular online transportation providers in Indonesia: Gojek and Grab. Since many customers actively give compliments and complaints about the companies' service level on Twitter, we collected 61,721 tweets in Bahasa during one month of observation. We apply Naive Bayes and Support Vector Machine methods to see which model performs best for our data. The results reveal that Gojek has better service quality, with 19.76% positive and 80.23% negative sentiments, than Grab, with 9.2% positive and 90.8% negative. Gojek's highest problem-solving priority concerns application problems, while Grab's concerns unusable promos. The overall result shows that the general problems of both providers are related to the accessibility dimension, which indicates a lack of capability to provide good digital access to end users.

012022

A simulation of breaking waves using the Navier-Stokes equations via the moving particle semi-implicit (MPS) method over a closed domain is given. The results show that parallel computing on a multicore architecture using the OpenMP platform can reduce the computational time to almost half of the serial time. A comparison between two computer architectures (AMD and Intel) is performed. The Intel architecture performs better than the AMD architecture in CPU time. In efficiency, however, the AMD architecture is slightly higher than the Intel. For the simulation with 1512 particles, the CPU times using Intel and AMD are 12662.47 and 28282.30, respectively. With the same number of particles, the efficiency is 50.09% for AMD and up to 49.42% for Intel.

012023

The impact of a dam-break wave on an erodible embankment with a steep slope has been studied recently using both experimental and numerical approaches. In this paper, the semi-implicit staggered scheme for approximating the shallow water-Exner model is elaborated to describe erodible sediment on a steep slope. This scheme is known as a robust scheme for approximating the shallow water-Exner model. The results are in good agreement with the experimental data. Among the comparisons of numerical results with experimental data, those using slopes Φ = 59.04 and Φ = 41.42 with Grass formula coefficients Ag = 2 × 10⁻⁵ and Ag = 10⁻⁵, respectively, are closest to the experiment. This paper can be seen as an additional validation of the semi-implicit staggered scheme in the paper of Gunawan et al. (2015).

012024

A recommender system is software that provides personalized recommendations suited to users' needs. Recommender systems have been widely implemented in various domains, including tourism. One approach to more personalized recommendations is the use of contextual information. This paper proposes an ontology-based context-aware recommender system in the tourism domain. The system recommends tourist destinations using user preferences for tourism categories and contextual information such as user location, the weather around tourist destinations, and the closing time of each destination. Based on the evaluation, the system has an accuracy of 0.94 (item recommendation precision evaluated by an expert) and 0.58 (measured implicitly from system-end user interaction). Based on the evaluation of user satisfaction, the system provides a satisfaction level of more than 0.7 (on a scale of 0 to 1) for the speed of providing liked recommendations (PE), informative description of recommendations (INF), and user trust (TR).

012025

In this article we conduct a theoretical security analysis of the Megrelishvili protocol, a linear algebra-based key agreement between two participants. We study the computational complexity of the Megrelishvili vector-matrix problem (MVMP), a mathematical problem that strongly relates to the security of the Megrelishvili protocol. In particular, we investigate the asymptotic upper bounds for the running time and memory requirement of the MVMP that involves a diagonalizable public matrix. Specifically, we devise a diagonalization method for solving the MVMP that is asymptotically faster than all previously existing algorithms. We also find an important counterintuitive result: the use of a primitive matrix in the Megrelishvili protocol makes the protocol more vulnerable to attacks.

012026

Increasing the number of layers in the shallow water equations (SWE) produces a more dynamic model than the one-layer SWE model. The two-layer 1D SWE model has a different density for each layer. This makes the model more dynamic and natural; in the ocean, for instance, the density of water decreases from the bottom to the surface. Here, the source-centered hydrostatic reconstruction (SCHR) numerical scheme is used to approximate the solution of the two-layer 1D SWE model, since this scheme has been proved to satisfy the mathematical properties of the shallow water equations. Additionally, the SCHR algorithm is adapted to multicore architectures. The simulation of runup caused by an underwater avalanche is elaborated here. The results show that the runup depends on the ratio of the densities of the layers. Moreover, with a grid size of Nx = 8000, the speedup and efficiency obtained with 2 threads are 1.74779 times and 87.3896%, respectively, whereas with 4 threads they are 2.93132 times and 73.2830%, respectively.
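
The reported efficiencies follow from the usual definition of parallel efficiency as speedup divided by the number of threads. The short check below uses only the speedup figures quoted in the abstract; the function name `efficiency_percent` is a hypothetical helper.

```python
def efficiency_percent(speedup, threads):
    """Parallel efficiency as a percentage: 100 * speedup / threads."""
    return 100.0 * speedup / threads

# Speedup values quoted in the abstract for Nx = 8000.
eff_2_threads = efficiency_percent(1.74779, 2)
eff_4_threads = efficiency_percent(2.93132, 4)
print(round(eff_2_threads, 4), round(eff_4_threads, 4))
```

Both values agree with the abstract's 87.3896% and 73.2830% to within rounding of the quoted speedups, and they show the typical drop in efficiency as more threads are added.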

012027

The process of shoreline change due to sediment transport by littoral drift is studied in this paper. The Pelnard-Considère model is commonly adopted. This model is based on the principle of sediment conservation, without diffraction. In this research, we adopt the Pelnard-Considère equation with diffraction and implement a numerical scheme based on the finite volume method. Shoreline development in a groyne system is then simulated. As a case study, we consider Sanur Beach, Bali, Indonesia, where Google Earth photos show changes of the coastline caused by sediment trapped in a groyne system.

012028

The aim of this paper is to investigate the performance of the OpenACC platform for computing a 2D radial dam break. The shallow water equations are used to describe and simulate the 2D radial dam break with the finite volume method (FVM) using the HLLE flux. OpenACC is a parallel computing platform based on GPU cores. In this research, the platform is used to minimize the computational time of the numerical scheme. The results show that using OpenACC reduces the computational time. For the dry and wet radial dam-break simulations using 2048 grids, the parallel computational times are 575.984 s and 584.830 s, respectively. These results show the success of OpenACC when compared with the serial times of the dry and wet radial dam-break simulations, which are 28047.500 s and 29269.40 s, respectively.
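
The abstract gives serial and parallel times but not the resulting speedup. Assuming the usual definition (serial time divided by parallel time), the ratio can be derived from the quoted timings as follows; the function name `speedup` is a hypothetical helper and the derived figures are arithmetic on the abstract's numbers, not results stated by the authors.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup of a parallel run relative to the serial run."""
    return serial_seconds / parallel_seconds

# Timings quoted in the abstract for 2048 grids.
dry_speedup = speedup(28047.500, 575.984)
wet_speedup = speedup(29269.40, 584.830)
print(round(dry_speedup, 2), round(wet_speedup, 2))
```

Under this definition the GPU runs are on the order of fifty times faster than the serial runs for both the dry and wet cases.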

012029
The following article is Open access

Constraint-based data cleaning captures data violations of a set of rules called data quality rules. The rules consist of integrity constraints and data edits. Structurally they are similar: each rule contains a left-hand side and a right-hand side. Previous research proposed a data repair algorithm for integrity constraint violations. The algorithm uses an undirected hypergraph to represent rule violations. Nevertheless, this algorithm cannot be applied to data edits because of their different rule characteristics. This study proposes GraDit, a repair algorithm for data edit rules. First, we use a bipartite directed hypergraph as the model representation of all defined rules. This representation is used to capture the interaction between violated rules and clean rules. In addition, we propose an undirected graph as the violation representation. Our experimental study shows that the algorithm with an undirected graph as the violation representation model gives better data quality than the algorithm with an undirected hypergraph as the representation model.

012030
The following article is Open access

Survey data collected from year to year undergo metadata changes. However, they need to be stored in an integrated way so that statistical data can be obtained faster and more easily. A data warehouse (DW) can be used to address this need. However, variables change in every period, which a DW cannot accommodate: a traditional DW cannot handle variable changes via Slowly Changing Dimension (SCD). Previous research handled variable changes in a DW by managing metadata with a multiversion DW (MVDW). The MVDW is designed using the relational model. Some studies have also found that a nonrelational model in a NoSQL database gives faster read times than the relational model. Therefore, we propose changes to metadata management using NoSQL. This study proposes a DW model to manage change and algorithms to retrieve data with metadata changes. Evaluation of the proposed model and algorithms shows that a database with the proposed design can retrieve data with metadata changes properly. This paper contributes to comprehensive analysis of data with metadata changes (especially survey data) in integrated storage.

012031
The following article is Open access

Ontology is used for knowledge representation, while a database is used to record facts in a KMS (Knowledge Management System). In most applications, data are managed in a database system and updated through the application, and are then transformed into knowledge as needed. Once a domain conceptor defines the knowledge in the ontology, the application and database can be generated from the ontology. Most existing frameworks generate the application from its database. In this research, the ontology is used to generate the application. As the data are updated through the application, a mechanism is designed to trigger an update to the ontology so that the application can be rebuilt based on the newest ontology. With this approach, a knowledge engineer has full flexibility to renew the application based on the latest ontology without depending on a software developer. In many cases, the concept needs to be updated when the data change. The framework is built and tested in a Spring Java environment. A case study was conducted to prove the concept.

012032
The following article is Open access

The goal of this paper is to analyze the parallel performance of the OpenMP platform on a coupled shallow water-sediment concentration model. The sediment model is coupled with the shallow water model, which generates the water flow. In this paper, the convection-diffusion equation is used to describe the sediment movement. The numerical results show that the OpenMP platform satisfactorily reduces the computational time. Indeed, the parallel time is shorter than the serial time even as the number of discrete points increases. With a grid size of 1600, the speedup is 1.583 times and the efficiency is 39.57%. Moreover, the average efficiency over 5 experiments is 38.9%.

012033
The following article is Open access

Shallow water equations, commonly referred to as Saint-Venant equations, are used to model fluid phenomena. These equations can be solved numerically using several methods, such as the Lattice Boltzmann method (LBM), SIMPLE-like methods, the finite difference method, Godunov-type methods, and the finite volume method. In this paper, the shallow water equations are approximated using the LBM, known as LABSWE, and the parallel performance of the simulation using OpenMP is evaluated. To evaluate the performance of the parallel algorithm with 2 and 4 threads, ten different grid sizes Lx and Ly are considered. The results show that using the OpenMP platform, the computational time for solving LABSWE can be decreased. For instance, using a grid size of 1000 × 500, the speedup of 2 and 4 threads is observed to be 93.54 s and 333.243 s, respectively.

012034
The following article is Open access

The simulation of an erodible dambreak using the two-layer shallow water equations and the SCHR scheme is elaborated in this paper. The results show that the two-layer SWE model is in good agreement with the experimental data obtained at the Université Catholique de Louvain, Louvain-la-Neuve. Moreover, results for the parallel algorithm on multicore architectures are given. The results show that Computer I, with an Intel(R) Core(TM) i5-2500 quad-core CPU, has the best performance in reducing the computational time. Moreover, Computer III, with an AMD A6-5200 quad-core APU, is observed to have higher speedup and efficiency. The speedup and efficiency of Computer III with 3200 grids are 3.716050530 times and 92.9%, respectively.

012035
The following article is Open access

The existence of waterfalls in many nations, such as Indonesia, offers potential that can be developed to fulfill national electricity demand. The mechanical flow energy of a waterfall can be used to generate electricity. This mechanical energy can be studied by simulating the waterfall flow using the 2-D smoothed particle hydrodynamics (SPH) method. The SPH method is suitable for simulating waterfall flow because it produces particle movement that mimics the characteristics of a fluid. In this paper, the SPH method is used to solve the Navier-Stokes and continuity equations, which are the core equations of fluid motion. The governing equations of fluid flow are used to obtain the acceleration, velocity, density, and position of the SPH particles, together with the Leapfrog time-stepping method. With these equations, the waterfall flow can be simulated and the mechanical energy analyzed as desired. The mechanical energy generated by the waterfall flow is calculated and analyzed based on the mass, height, and velocity of each SPH particle.
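
The last step of the abstract, computing mechanical energy from the mass, height, and velocity of each particle, can be illustrated with the textbook formula E = (1/2) m v^2 + m g h summed over particles. This is a minimal sketch under that assumption; the paper may use a different reference height or weighting, and the particle values below are hypothetical.

```python
G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def mechanical_energy(mass, height, velocity):
    """Kinetic plus potential energy of one 2-D particle, in joules."""
    vx, vy = velocity
    kinetic = 0.5 * mass * (vx * vx + vy * vy)
    potential = mass * G * height
    return kinetic + potential


def total_mechanical_energy(particles):
    """Sum the mechanical energy over an iterable of (mass, height, velocity) tuples."""
    return sum(mechanical_energy(m, h, v) for m, h, v in particles)


# Hypothetical particles: (mass in kg, height in m, (vx, vy) in m/s).
particles = [
    (1.0, 10.0, (0.0, 0.0)),   # at rest at the top of the fall
    (1.0, 5.0, (3.0, -4.0)),   # falling, speed 5 m/s
]
print(f"total mechanical energy: {total_mechanical_energy(particles):.2f} J")
```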

012036
The following article is Open access

In this paper, a parallel implementation of an elliptic solver for the 1D Boussinesq model is presented. The numerical solution of the Boussinesq model is obtained by applying a staggered grid scheme to the continuity, momentum, and elliptic equations of the model. The tridiagonal system arising from the numerical scheme for the elliptic equation is solved by the cyclic reduction algorithm. The parallel implementation of cyclic reduction is executed on multicore processors with shared memory architectures using OpenMP. To measure the performance of the parallel program, the number of grids is varied from 2^8 to 2^14. Two numerical test cases, the propagation of a solitary wave and of a standing wave, are used to evaluate the parallel program. The numerical results are verified against the analytical solutions for the solitary and standing waves. The best speedup for the solitary and standing wave test cases is about 2.07 with 2^14 grids and 1.86 with 2^13 grids, respectively, using 8 threads. Moreover, the best efficiency of the parallel program is 76.2% and 73.5% for the solitary and standing wave test cases, respectively.
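
To make the cyclic reduction step concrete, here is a minimal serial sketch of the algorithm for a tridiagonal system a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]. It is written recursively in plain Python for readability and is not the authors' OpenMP implementation; the function name and the test system are illustrative. It assumes a[0] = 0, c[n-1] = 0, and a matrix (for example a diagonally dominant one) for which no pivot becomes zero.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursive cyclic (odd-even) reduction.

    a: sub-diagonal (a[0] == 0), b: diagonal, c: super-diagonal (c[-1] == 0),
    d: right-hand side. Returns the solution as a list of floats.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    odd = list(range(1, n, 2))
    ra, rb, rc, rd = [], [], [], []
    # Eliminate the even-indexed unknowns from every odd-indexed row.
    # Each pass of this loop is independent, which is what a parallel version exploits.
    for i in odd:
        alpha = -a[i] / b[i - 1]
        has_next = i + 1 < n
        gamma = -c[i] / b[i + 1] if has_next else 0.0
        ra.append(alpha * a[i - 1])
        rb.append(b[i] + alpha * c[i - 1] + (gamma * a[i + 1] if has_next else 0.0))
        rc.append(gamma * c[i + 1] if has_next else 0.0)
        rd.append(d[i] + alpha * d[i - 1] + (gamma * d[i + 1] if has_next else 0.0))

    # The reduced system is again tridiagonal and about half the size.
    x_odd = cyclic_reduction(ra, rb, rc, rd)

    x = [0.0] * n
    for k, i in enumerate(odd):
        x[i] = x_odd[k]
    # Back-substitute to recover the even-indexed unknowns.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x


# Hypothetical diagonally dominant test system with a known solution.
n = 7
a = [0.0] + [1.0] * (n - 1)
b = [4.0] * n
c = [1.0] * (n - 1) + [0.0]
x_true = [float(i + 1) for i in range(n)]
d = [
    (a[i] * x_true[i - 1] if i > 0 else 0.0)
    + b[i] * x_true[i]
    + (c[i] * x_true[i + 1] if i + 1 < n else 0.0)
    for i in range(n)
]
x = cyclic_reduction(a, b, c, d)
print(x)
```

Each reduction level roughly halves the system, so a problem with 2^k unknowns needs about k levels, which is one reason grid counts that are powers of two are a natural choice for this algorithm.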

Computational Linguistics

012037
The following article is Open access

Al-Hadith is a collection of the words, deeds, provisions, and approvals of Rasulullah Shallallahu Alaihi wa Salam, and is the second fundamental source of law in Islam after Al-Qur'an. As these are the fundamentals of Islam, Muslims must learn, memorize, and practice Al-Qur'an and Al-Hadith. One venerable Imam who was also a narrator of Al-Hadith is Imam Bukhari. He spent over 16 years compiling about 2602 Hadith (without repetition) and over 7000 Hadith with repetition. Automatic text categorization is the task of developing software tools that are able to classify text or hypertext documents under pre-defined categories or subject codes [1]. The algorithm used is Random Forest, which is a development of the Decision Tree. In this final project research, the author built a system that categorizes text documents containing Hadith narrated by Imam Bukhari into several categories such as suggestion, prohibition, and information. For evaluation, K-fold cross validation with F1-score is used, and the result is 90%.

012038
The following article is Open access

Most existing name matching methods were developed for the English language and so reflect the characteristics of that language. To date, no method has been designed and implemented specifically for Indonesian names. The purpose of this thesis is to develop an Indonesian name matching dataset as a contribution to academic research and to propose a suitable feature set that combines the context of name strings with their permute-winkler score. Machine learning classification algorithms are used to perform name matching. In the experiments, using a tuned Random Forest algorithm with the proposed features improves matching performance by approximately 1.7% and reduces the misclassifications of state-of-the-art methods by up to 70%. This improved performance makes the matching system more effective and reduces the risk of misclassified matches.

012039
The following article is Open access

The presence of a negation word can change the polarity of a text; if it is not handled properly, it will affect the performance of sentiment classification. Negation words in Indonesian are 'tidak', 'bukan', 'belum' and 'jangan'. There are also conjunctions that can reverse the actual value, such as 'tetapi' or 'tapi'. Unigram features have shortcomings in dealing with negation because they treat the negation word and the negated words as separate words. A general approach to negation handling in English text adds the tag 'NEG_' to the words following a negation until the first punctuation mark. But this may tag words that are not negated, and the approach does not handle negation and conjunction in the same sentence. This study proposes a rule-based method that determines which words are negated by adapting the syntactic rules of negation in Indonesian to determine the scope of negation. Adapting the syntactic rules and tagging with 'NEG_', using an SVM classifier with an RBF kernel, gives better performance than the other experiments. In terms of average F1-score, the proposed method improves on the baselines by 1.79% (baseline without negation handling) and 5% (baseline with existing negation handling) on a dataset in which all tweets contain negation words. On the second dataset, in which tweets contain varying numbers of negation words, it improves on the baselines by 2.69% (without negation handling) and 3.17% (with existing negation handling).
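
The general 'NEG_' approach described in the abstract (the existing baseline, not the rule-based method the paper proposes) can be sketched in a few lines. The negation words are those listed in the abstract; the tokenisation, the punctuation set, and the example sentence are assumptions made for illustration.

```python
import re

NEGATION_WORDS = {"tidak", "bukan", "belum", "jangan"}
PUNCTUATION = re.compile(r"^[.,;:!?]+$")  # assumed set of scope-ending punctuation


def tag_negation(tokens):
    """Prefix 'NEG_' to every token after a negation word, up to the next punctuation."""
    tagged = []
    in_scope = False
    for token in tokens:
        if PUNCTUATION.match(token):
            in_scope = False          # punctuation closes the negation scope
            tagged.append(token)
        elif token.lower() in NEGATION_WORDS:
            in_scope = True           # the negation word itself is left untagged
            tagged.append(token)
        elif in_scope:
            tagged.append("NEG_" + token)
        else:
            tagged.append(token)
    return tagged


# Hypothetical tweet: "I do not like this film, but the actors are good".
tokens = "saya tidak suka film ini , tapi aktornya bagus".split()
print(tag_negation(tokens))
```

If the comma were missing from the example, 'tapi', 'aktornya', and 'bagus' would also receive the tag, which illustrates the over-tagging problem the abstract attributes to this approach.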

012040
The following article is Open access

Tracking human development and humanitarian action has been enhanced by the growth of social media. Twitter is a data source with potential, when used alongside data from surveys, especially the national census, to understand the situation on the ground and track changes. In Indonesia, a country with one of the highest Twitter penetration rates, we seize this opportunity by using Twitter data to produce more timely insights and to enhance evidence-based decision-making. Despite social media's limitations, namely representativeness and validity, we are able to show its potential by looking at case studies on five different topics: (a) food and agriculture, (b) public health, (c) economic well-being, (d) urban resilience, and (e) humanitarian action. We observe that the insights gained by using Twitter data were derived not only from the content of posts, such as understanding public opinion or sentiment, but also from activities related to it, for instance the location and time-stamp of the post, which furthers our real-time understanding of the situation and user behavior changes. In this paper, we also briefly explain "social listener", a social media monitoring tool used by the Government of Indonesia to understand citizen opinions in social media related to government priorities.

012041
The following article is Open access

Language is used to express not only facts, but also emotions. Emotions are noticeable from behavior up to the social media statuses written by a person. Analysis of emotions in text is done in a variety of media such as Twitter. This paper studies the classification of emotions on Twitter using a Bayesian network because of its ability to model uncertainty and relationships between features. The result is two models based on Bayesian networks: Full Bayesian Network (FBN) and Bayesian Network with Mood Indicator (BNM). FBN is a massive Bayesian network where each word is treated as a node. The study shows that the method used to train FBN is not very effective at creating the best model and performs worse than Naive Bayes. The F1-score for FBN is 53.71%, while for Naive Bayes it is 54.07%. BNM is proposed as an alternative method that is based on an improvement of Multinomial Naive Bayes and has much lower computational complexity than FBN. Even though it is not better than FBN, the resulting model successfully improves on the performance of Multinomial Naive Bayes. The F1-score for the Multinomial Naive Bayes model is 51.49%, while for BNM it is 52.14%.

012042
The following article is Open access

SMS (Short Message Service) is one of the communication services that remains a main choice, even though phones now offer various applications. With the development of other communication media, some countries have lowered SMS rates to retain the interest of mobile users. This has resulted in an increase in spam SMS sent by several parties, for example for advertisement. Given the multilingual nature of documents in SMS messages, the Web, and elsewhere, effective multilingual or cross-lingual processing techniques are becoming increasingly important. In this research, the messages are first preprocessed and then represented as a graph model, after which they are classified using the GKNN method. The maximum accuracy obtained is 98.86, with training data and testing data both in Indonesian, K = 10, and a threshold of 0.001.

012043
The following article is Open access

Research on semantic argument classification requires a large amount of semantically labeled data, called a corpus. Because building a corpus is costly and time-consuming, many recent studies have used an existing corpus as training data to conduct semantic argument classification on a new domain. However, previous studies have shown that performance decreases significantly when the training and testing data come from different domains. The main problem arises when an argument found in the testing data is not found in the training data. This research performs semantic argument classification on a new domain, the English translation of the Quran, using the Propbank corpus as training data. To recognize the new argument in the training data, this research proposes four new features that extend the argument features in the training data. Using a linear SVM, the experiment shows that augmenting the baseline system with some combinations of the proposed features improves the performance of semantic argument classification on Quran data using the Propbank corpus as training data.

012044
The following article is Open access

The Sundanese language is the second most widely used local language in Indonesia. Currently, Sundanese is rarely used, since Indonesian is the national language and is used in everyday conversation. We built a Sundanese lexical database based on WordNet and Indonesian WordNet as an alternative way to preserve the language as part of the local culture. WordNet was chosen because the Sundanese language has three levels of word delivery, called the language code of conduct. Web users participated in this research to specify Sundanese semantic relations, and an expert linguist validated the relations. The merge methodology was implemented in this experiment. Some words have equivalents in WordNet, while others do not, since some words do not exist in another culture.

012045
The following article is Open access

Folksonomy, as one result of the collaborative tagging process, has been acknowledged for its potential to improve the categorization and searching of web resources. However, folksonomy contains ambiguities such as synonymy and polysemy, as well as differing levels of abstraction, known as the generality problem. To maximize its potential, methods for associating folksonomy tags with semantics and structural relationships have been proposed, such as ontology learning. This paper evaluates our previous work on ontology learning using a gold-standard evaluation approach, in comparison with a notable state-of-the-art work and several baselines. The results show that our method is comparable to the state-of-the-art work, which further validates our approach, previously validated using a task-based evaluation approach.

012046
The following article is Open access

As one of the guidelines for Muslim life, a hadith can be viewed, based on the meaning of its sentence(s), as a suggestion for doing something, a suggestion for not doing something, or just information without any suggestion. In this paper, we classify the Bahasa translation of hadith into these three categories using a machine learning approach. We tried stemming and stopword removal in preprocessing, and TF-IDF of unigrams, bigrams, and trigrams as the extracted features. As the classifier, we compared SVM and Neural Network. Since the categories are new, we created a baseline classifier using a simple rule-based string matching technique in order to compare against the results of the machine learning pipelines. The rule-based algorithm conditions on the occurrence of words such as "janganlah" and "sholatlah" to determine the category. The baseline method achieved an F1-score of 0.69, while the best F1-score from the machine learning approach was 0.88, produced by the SVM model with a linear kernel.
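
The rule-based baseline is described only briefly, so the following is a hedged sketch of what a string-matching classifier of this kind might look like. Only the cue words "janganlah" and "sholatlah" come from the abstract; the category labels, the order in which the rules are checked, and the example sentences are assumptions, and the actual rule set of the paper is certainly larger.

```python
import re

# Cue words: the two entries below are the examples quoted in the abstract.
# A real rule set would contain many more (assumption).
PROHIBITION_CUES = {"janganlah"}
SUGGESTION_CUES = {"sholatlah"}


def classify_hadith(text):
    """Assign a category by checking for cue words; fall back to 'information'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & PROHIBITION_CUES:      # checked first (assumed priority)
        return "prohibition"
    if words & SUGGESTION_CUES:
        return "suggestion"
    return "information"


# Hypothetical example sentences in Bahasa Indonesia.
examples = [
    "Janganlah kamu marah.",
    "Sholatlah kamu tepat waktu.",
    "Ini adalah sebuah kabar.",
]
for sentence in examples:
    print(f"{classify_hadith(sentence):12s} {sentence}")
```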

012047
The following article is Open access

Abdul Baquee Muhammad [1] built a corpus covering the AlQur'an domain, WordNet, and a dictionary. He initiated the development of knowledge about AlQur'an and about the relatedness between texts in AlQur'an. The path-based measurement method proposed by Liu, Zhou and Zheng [3] has never been used in the AlQur'an domain. Using an AlQur'an translation dataset, this research tests the path-based measurement method proposed by Liu, Zhou and Zheng [3] in the AlQur'an domain to obtain similarity values and to measure their correlation.

In this study, a degree value is proposed for modifying the path-based method proposed in previous research. The degree value is the number of links owned by an LCS (lowest common subsumer) node in a taxonomy. The links owned by a node represent the semantic relationships that the node has in the taxonomy. Using the degree value to modify the path-based method is expected to increase the correlation value obtained.

After running experiments with the proposed method, the similarity measurements achieve a fairly good correlation on 200 word pairs derived from the noun POS of SimLex-999. The correlation value obtained is 93.3%, which indicates a very strong correlation. For POS other than nouns, however, the vocabulary in WordNet is incomplete, so many word pairs have a similarity value of zero and the correlation value is low.

012048
The following article is Open access

One of the important aspects of human-to-human communication is understanding the emotion of each party. Recently, interactions between humans and computers have continued to develop, especially affective interaction, in which emotion recognition is an important component. This paper presents our extended work on emotion recognition for spoken Indonesian to identify four main classes of emotion: Happy, Sad, Angry, and Contentment, using a combination of acoustic/prosodic features and lexical features. We construct an emotion speech corpus from Indonesian television talk shows, where the situations are as close as possible to natural situations. After constructing the emotion speech corpus, the acoustic/prosodic and lexical features are extracted to train the emotion model. We employ machine learning algorithms such as Support Vector Machine (SVM), Naive Bayes, and Random Forest to obtain the best model. The experimental results on the testing data show that the best model, an SVM with an RBF kernel, recognizes the four emotion classes with an F-measure of 0.447 using only the acoustic/prosodic features and an F-measure of 0.488 using both acoustic/prosodic and lexical features.

012049
The following article is Open access

Deep learning is a new era of machine learning techniques that essentially imitate the structure and function of the human brain. It is a development of deeper Artificial Neural Networks (ANN) that use more than one hidden layer. Deep Learning Neural Networks have a great ability to recognize patterns in various data types such as pictures, audio, text, and many more. In this paper, the authors try to measure that ability by applying the algorithm to text classification. The classification task is done by considering the sentiment content of a text, which is also called sentiment analysis. Using several combinations of text preprocessing and feature extraction techniques, we compare the modelling results of the Deep Learning Neural Network with two other commonly used algorithms, Naïve Bayes and Support Vector Machine (SVM). The comparison uses Indonesian text data with balanced and unbalanced sentiment composition. Based on the experimental simulation, the Deep Learning Neural Network clearly outperforms Naïve Bayes and SVM and offers a better F-1 score, while the feature extraction technique that most improves the modelling result is Bigram.

012050
The following article is Open access

Support Vector Machine, commonly called SVM, is a method that can be used to classify data. SVM separates data from 2 different classes with a hyperplane. In this study, a system for Arabic speech recognition was built using SVM. Two kinds of speakers were tested during development: dependent speakers and independent speakers. The system achieves an accuracy of 85.32% for dependent speakers and 61.16% for independent speakers.

012051
The following article is Open access

Paraphrase identification is an important process within natural language processing. The idea is to automatically recognize phrases that have different forms but the same meaning. For example, if we input the query "causing fire hazard", the computer has to recognize that this query has the same meaning as "the cause of fire hazard". Paraphrasing is an activity that reveals the meaning of an expression, writing, or speech using different words or forms, especially to achieve greater clarity. In this research we focus on classifying whether two Indonesian sentences are paraphrases of each other or not. There are four steps in this research: preprocessing, feature extraction, classifier building, and performance evaluation. Preprocessing consists of tokenization, non-alphanumeric removal, and stemming. After preprocessing, we conduct feature extraction in order to build new features from the given dataset. There are two kinds of features in the research, syntactic features and semantic features. Syntactic features consist of a normalized Levenshtein distance feature, a term-frequency based cosine similarity feature, and an LCS (Longest Common Subsequence) feature. Semantic features consist of a Wu and Palmer feature and a Shortest Path feature. We use Bayesian Networks as the method for training the classifier. The parameter estimation that we use is MAP (Maximum A Posteriori). For structure learning of the Bayesian Network DAG (Directed Acyclic Graph), we use the BDeu (Bayesian Dirichlet equivalent uniform) scoring function, and for finding the DAG with the best BDeu score, we use the K2 algorithm. In the evaluation step we perform cross-validation. The average results from testing the classifier are as follows: Precision 75.2%, Recall 76.5%, F1-Measure 75.8%, and Accuracy 75.6%.
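
The three syntactic features named in the abstract are standard string measures and can be sketched with the Python standard library alone. The abstract does not say whether the features are computed over characters or over tokens, nor how the LCS feature is normalized, so the choices below (token level, distance divided by the longer length, raw LCS length) are assumptions; the function names are illustrative. The example pair is the English one quoted in the abstract.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the length of the longer sequence (0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def tf_cosine(tokens_a, tokens_b):
    """Cosine similarity between raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        curr = [0]
        for j, item_b in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if item_a == item_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


s1 = "causing fire hazard".split()
s2 = "the cause of fire hazard".split()
features = {
    "normalized_levenshtein": normalized_levenshtein(s1, s2),
    "tf_cosine": tf_cosine(s1, s2),
    "lcs_length": lcs_length(s1, s2),
}
print(features)
```

Without stemming, "causing" and "cause" count as different tokens, which is one reason the pipeline in the abstract applies stemming before feature extraction.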

012052
The following article is Open access

We present our work in the area of sentiment analysis for the Indonesian language. We focus on building automatic semantic orientation using available resources in Indonesian. In this research, we used an Indonesian corpus of 9 million words from kompas.txt and tempo.txt that was manually tagged and annotated with a part-of-speech tagset. We then constructed a dataset by taking all the adjectives from the corpus and removing the adjectives with no orientation. The resulting set contains 923 adjectives. The system includes several steps, such as text pre-processing and clustering. The text pre-processing aims to increase the accuracy. Finally, the clustering method assigns each word to its related sentiment, positive or negative. With improvements to the text pre-processing, an accuracy of 72% can be achieved.

012053
The following article is Open access

The biggest challenge for e-Commerce in understanding its market is to chart its level of service quality according to customer perception. Collecting user perception through online user reviews is considered faster than direct sampling. To understand the service quality level, sentiment analysis is used to classify the reviews into positive and negative sentiment for five dimensions of electronic service quality (e-Servqual). As a case study, we use Tokopedia, one of the biggest e-Commerce services in Indonesia. We obtained online review comments about Tokopedia service quality over several months of observation. The Naïve Bayes classification method is applied because of its high accuracy and its support for large-scale data processing. The results reveal that the personalization and reliability dimensions require more attention because they have high negative sentiment. Meanwhile, the trust and web design dimensions have high positive sentiment, which indicates very good service. The responsiveness dimension has a balance of positive and negative sentiment.

012054
The following article is Open access

The relations between scientific papers are very useful for researchers who want to see the interconnection between papers quickly. By observing the inter-article relationships, researchers can identify, among others, the weaknesses of existing research, performance improvements achieved to date, and tools or data typically used in research in specific fields. So far, methods that have been developed to detect paper relations include machine learning and rule-based methods. However, a problem still arises in the process of sentence extraction from scientific paper documents, which is still done manually. This manual process makes the detection of scientific paper relations slower and inefficient. To overcome this problem, this study performs automatic sentence extraction, while the paper relations are identified based on the citation sentences. The performance of the built system is then compared with that of the manual extraction system. The analysis results suggest that automatic sentence extraction achieves a very high level of performance in the detection of paper relations, close to that of manual sentence extraction.

012055
The following article is Open access

This paper evaluates whether Part-of-Speech tagging for the formal Indonesian language can be used for tagging Indonesian tweets. In this study, we add five tags that reflect social media attributes to the existing original tagset. Automatic POS tagging is done with a stratified training process using 1000, 1600, and 1800 annotated tweets. The process achieves up to 66.36% accuracy. The experiment with the original tagset gives slightly better accuracy (67.39%) than the experiment with the five additional tags, but loses the important information provided by the five additional tags. POS-Tagging for Informal Language (Study in Indonesian Tweets).

012056
The following article is Open access

With technological advances, all kinds of information about movies are available on the internet. If the information is processed properly, quality information can be obtained. This research proposes to classify sentiments in movie review documents. It uses the Support Vector Machine (SVM) method because SVM can classify high-dimensional data, which matches the text data used in this research. Support Vector Machine is a popular machine learning technique for text classification because it learns from a collection of previously classified documents and can provide good results. In terms of dataset composition, the 90-10 split gives the best result, 85.6%. In terms of SVM kernel, the linear kernel with constant 1 gives the best result, 84.9%.

012057
The following article is Open access

A variety of language resources already exist online. Unfortunately, since many language resources have usage restrictions, it is virtually impossible for each user to negotiate with every language resource provider when combining several resources to achieve the intended purpose. To increase the accessibility and usability of language resources (dictionaries, parallel texts, part-of-speech taggers, machine translators, etc.), we proposed the Language Grid [1]; it wraps existing language resources as atomic services and enables users to create new services by combining the atomic services, and reduces the negotiation costs related to intellectual property rights [4]. Our slogan is "language services from language resources." We believe that modularization with recombination is the key to creating a full range of customized language environments for various user communities.