Data Preprocessing 1: Data Cleaning (Data Mining Theory 2)

Data Quality: Why Preprocess?

Data quality is judged along several dimensions:
- Accuracy: whether the data is correct or incorrect
- Completeness: whether data is unrecorded or inaccessible
- Consistency: whether some data is outdated or dangling
- Timeliness
- Believability
- Interpretability: how easily the data can be understood

Real-life data is dirty: it is exposed to numerous machine, human, and computer errors, as well as transmission faults.
- Incomplete data: some attribute values are missing, or only aggregate data is available. e.g., Occupation = " " (not entered).
- Noisy data: the data contains noise, errors, or outliers. e.g., Salary = "−10" (an error).
- Inconsistent data: data from different sources disagrees.
  - Age = "42", Date of Birth = "03/07/2010"
  - Old grading: "1, 2, 3"; new grading: "A, B, C"
  - Discrepancies between duplicate records
- Intentional problems: January 1st is recorded for everyone whose birth date is unknown.

Missing Data

Data is not always available. e.g., some records were never captured, such as customer income levels that were not recorded at the time of sale.

Missing data generally arises in the following situations:
- Hardware failures
- Data deleted because it was inconsistent with other data
- Data not entered because it was unclear
- Data not considered important at the time of entry
- Changes to the data not recorded

Missing data has to be handled. The options are:
- Ignoring it: records with missing data are not processed and are treated as if they did not exist. How this affects the results depends on the data mining method used, and that effect should be understood beforehand.
- Filling in missing data manually: not always possible, and it can be very time-consuming and costly.
- Filling in missing data automatically:
  - Creating a new class for all missing values (such as "unknown")
  - Using the attribute mean
  - Using the mean of the records that belong to the same class
  - Inferring the value with a Bayesian formula or a decision tree

Noisy Data

Noise: random errors in measured values.

Incorrect attribute values can occur in the following situations:
- Errors in data collection tools
- Data entry problems
- Data transmission problems
- Technology limitations
- Naming inconsistencies

Other situations that require data cleaning:
- Duplicate records
- Missing data
- Inconsistent data

Handling Noisy Data

- Binning: the data is sorted and divided into equal-frequency bins. The values in each bin are then replaced (smoothed) using one of: the bin mean, the bin median, the bin boundaries.
- Regression: values are smoothed or filled in by fitting the data to regression functions.
- Clustering: outliers are found and cleaned.
- Combined computer and human inspection: suspicious values are detected automatically and checked by a person (e.g., to deal with possible outliers).

Data Cleaning as a Process

Detecting discrepancies in the data:
- Using metadata (e.g., domain, range, dependency, distribution)
- Checking for field overloading
- Checking rules on the data (unique, consecutive, null)
- Using commercial software
  - Data scrubbing: checking simple field information with rules (e.g., postal code, spell-check)
  - Data auditing: extracting rules from the data and identifying the records that violate them (e.g., finding outliers through correlation or clustering)

Data migration and integration:
- Data migration tools: allow transformations to be specified
- ETL (Extraction/Transformation/Loading) tools: allow transformations to be managed, usually through a graphical interface

Running the two tasks in an integrated way:
- Iterative and interactive (e.g., Potter's Wheel)

Cleaning overloaded fields:
- Chaining
- Coupling
- Multipurpose

Şadi Evren ŞEKER
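Filling missing values with per-class means, one of the automatic options listed above, could be sketched roughly as follows. This is only an illustration in plain Python: the function name `fill_with_class_means`, the `segment` and `income` field names, and the sample customer records are all made up for the example, and a missing value is assumed to be stored as `None`.

```python
from collections import defaultdict
from statistics import mean


def fill_with_class_means(records, value_key, class_key):
    """Return copies of the records with missing values filled by class means.

    Assumption: a missing value is represented as None. If a class has no
    observed values at all, the overall mean is used as a fallback.
    """
    observed = [r[value_key] for r in records if r[value_key] is not None]
    if not observed:
        raise ValueError("no observed values to compute a mean from")
    overall_mean = mean(observed)

    # Collect the observed values of each class.
    by_class = defaultdict(list)
    for r in records:
        if r[value_key] is not None:
            by_class[r[class_key]].append(r[value_key])
    class_means = {c: mean(vals) for c, vals in by_class.items()}

    filled = []
    for r in records:
        r = dict(r)  # copy, so the input records stay untouched
        if r[value_key] is None:
            r[value_key] = class_means.get(r[class_key], overall_mean)
        filled.append(r)
    return filled


# Illustrative records; the income of two customers was not recorded.
customers = [
    {"segment": "retail", "income": 30000},
    {"segment": "retail", "income": None},
    {"segment": "retail", "income": 50000},
    {"segment": "corporate", "income": 90000},
    {"segment": "corporate", "income": None},
]
filled = fill_with_class_means(customers, "income", "segment")
```

Using the class mean instead of the overall mean keeps a filled value closer to similar records, but it still shrinks the variance of the attribute, so the effect on the chosen data mining method should be checked.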
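The binning step described above (sort, split into equal-frequency bins, then replace each value by the bin mean or the nearest bin boundary) can be illustrated with a small sketch. The function names and the sample values below are assumptions chosen for the example, not part of the original notes, and ties in boundary smoothing are resolved toward the lower boundary as an arbitrary choice.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    if not 1 <= n_bins <= len(values):
        raise ValueError("n_bins must be between 1 and the number of values")
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Illustrative values only.
prices = [21, 4, 28, 9, 15, 25, 8, 34, 24, 26, 21, 29]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Smoothing by the median would follow the same pattern as `smooth_by_means`, with the bin median in place of the mean.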