Infoworks 6.0.0

Glossary

Infoworks Terms and Phrases

Compute ("Compute Template," "Cluster Template," "Cluster")As used by Infoworks, compute refers to a configuration within an environment template referring to a compute provider available from the data engine configured for this environment. An individual compute configuration may serve as a template for an ephemeral cluster to be provisioned on demand. Or, it may be a reference to a persistent cluster, which is either provisioned and continuously available, or is controlled via a cluster action node within a related workflow.
Data EngineAs used by Infoworks, a data engine is a provider of data management services, which runs over distributed compute. Data engines may reflect a data lake architecture, a data warehouse architecture, or expose features blending both approaches. As multiple engines power an airliner, data engines support the data ingestion, transformation, and migration services automated by Infoworks.
Data ModelAs used by Infoworks, a data model is a data table generated as an output of a transformation pipeline, in storage available to the environment in which the transformation pipeline is run.
Data SourceAs used by Infoworks, data source refers to the metadata and data ingested through a connection defined and made based on a source configuration. Data ingested as a data source may serve as a source data table within a transformation pipeline.
Data TableAs used by Infoworks, a data table is either data ingested as a data source, or data subsequently modified by a transformation pipeline into a new data model. Either form of data table may be used as source data for a transformation pipeline. A transformation pipeline must have one, and may have many, data tables.
DomainAs used by Infoworks, a domain is a named set of logically related users and data sources, to support collaboration on transformation pipelines and workflow orchestrations built within the domain. Pipeline extensions are also assigned by domain.
EnvironmentAs used by Infoworks, an environment is a one specific data engine running within a specific cloud service provider, configured to work with a specific set of storages and computes. An environment is defined by an environment template.
Environment Template ("Environment Configuration")As used by Infoworks, an environment template is set of configurations comprised of settings related to (1) a cloud service provider, (2) a data engine in this cloud, including its related metastore, (3) one or more storages available to this engine, in this cloud, and (4) one or more computes available to this engine, in this cloud.
External Data SourceAs used by Infoworks, an external data source is one of over two hundred distinct data management systems and formats for which a data source can be configured by Infoworks to ingest, transform, and migrate data within and across environments.
External Data TargetAs used by Infoworks, an external data target is a cloud data warehouse, cloud data lake, or other external data management system, to which Infoworks may export data via transformation pipeline, if intermediate transformation is required, or sync-to-target, if direct export of ingested data without transformation, is required.
Incremental IngestionAs used by Infoworks, incremental ingestion is the process of moving data from an external data source to a storage within an environment, in batches scheduled to optimize overall system behavior, and to synchronize changing data sets across environments.
IngestionAs used by Infoworks, ingestion is the process of moving data from an external data source to storage within an environment, for further transformation
Ingestion ConfigurationAs used by Infoworks, an ingestion configuration is the result of a design process based on metadata crawled from an external data source. An ingestion configuration may be run as an ad hoc or scheduled job, and may be incorporated within a workflow either via UI or API.
JobAs used by Infoworks, a job is a set of related tasks defined by 1) an ingestion configuration, whether for a single data table or table group, 2) a transformation pipeline, and/or 3) a Sync-to-target execution. Jobs are defined via UI or API, compiled to environment-specific code, then passed to compute available for a data engine within the relevant environment for execution. Jobs are commonly executed as part of a workflow.
Metadata CrawlAs used by Infoworks, metadata crawl is a process that discovers schema, types, and other relevant metadata from an external data source, to support designing ingestion configurations.
MigrationAs used by Infoworks, migration is the process of moving data from an external data source, or from storage within an environment, to storage in a different environment, for the purpose of use or transformation by an external system.
OnboardAs used by Infoworks, onboarding refers to ingesting data from an external data source to storage in a target environment. As a feature set, onboarding is focused on the ingestion configurations users design from information derived from metadata crawls, along with governance features for the ingested data.
OperationalizeAs used by Infoworks, operationalizing refers to the scheduling, logical flow, monitoring, execution, and re-execution of ingestion, transformation, and migration jobs. As a feature set, operationalizing is focused on the workflow orchestrations users design either through the Infoworks UI or API.
Pipeline ExtensionAs used by Infoworks, a pipeline extension is a custom-designed transformation type made available for use by transformation pipelines within a domain to which the pipeline extension has been applied.
PrepareAs used by Infoworks, preparing refers to automating the transformation, cleansing, enrichment, and modeling of data to support known and discovered use cases and processes. As a feature set, preparing is focused on the transformation pipelines users design to transform one or more data tables into one or more new data models.
Source ConfigurationAs used by Infoworks, a source configuration is a named and configured reference to an external data management system. Through connections made based on this configuration, Infoworks executes a metadata crawl to derive information supporting the subsequent design of one or many ingestion jobs.
StorageAs used by Infoworks, a storage is a physical location for holding binary data, configured for availability to a specific environment. An environment can have any number of related storages. Storage is assigned when designing ingestion, transformation, and migration jobs for an environment.
Sync-to-targetAs used by Infoworks, sync-to-target is a feature enabling the direct export of a data table or table group to an external data target, without use of an intermediate transformation pipeline.
Table GroupAs used by Infoworks, a named set of data tables to be ingested as a single job.
TargetAs used by Infoworks, a target is a terminal node, within a transformation pipeline, which names and defines a data model to which transformed data will be output. A transformation pipeline must have one, and may have many, targets.
TransformationAs used by Infoworks, transformation refers to merging, cleansing, and enriching data through one or more transformation pipelines. The term may also be used more narrowly, to refer to a single node within a transformation pipeline as a transformation.
Transformation PipelineAs used by Infoworks, a transformation pipeline is an individual, visually defined sequence of interrelated transformations, drawing on one or more data tables, to generate one or more data models, each time the transformation pipeline is run as a job. When executed, a transformation pipeline is converted to SQL (and related code extensions), appropriate to the data engine for which it is designed.
Workflow ("Workflow Orchestration," "Orchestration")As used by Infoworks, a workflow is a set of ingestion, transformation, and migration processes, which are logically sequenced and scheduled by the workflow. A workflow may be defined visually or via REST API, and run either via UI or API.
Workflow RunAs used by Infoworks, a workflow run is one execution of a coordinated set of workflow tasks which have been defined visually via UI as a workflow, or via REST API.
Workflow TaskAs used by Infoworks, a workflow task is a job, shell script, notification, or related behavior, coordinated as one part of a workflow.
WorkloadAs used by Infoworks, the term workload abstractly describes a set of data ingestion, transformation, and migration processes, orchestrated as one or more jobs within one or more workflows, to collectively accomplish a specified data managed goal.

Data Management Industry Terms and Phrases

Analytics Agility: The ability to quickly and efficiently transform data to support newly discovered analytic requirements.
Apache Hadoop: An open source technology implementing a MapReduce algorithm to divide and distribute segments of very large data sets across many nodes in a cluster for analysis, increasing the total processing speed through parallelization, then stitching the distributed results into a final set.
API (application programming interface): An endpoint, or set of endpoints, exposed by one software system to support direct communication from another. Commonly, one or more values will be passed to an API, which will then respond with a behavior, and may pass back one or more values to the system which called the API.
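
For example, a minimal sketch of calling a REST API from Python using the requests library; the endpoint, token, and payload below are hypothetical placeholders, not an actual Infoworks API:

# Call a hypothetical REST endpoint and read back the values it returns.
import requests

response = requests.post(
    "https://api.example.com/v1/jobs",                  # hypothetical endpoint
    headers={"Authorization": "Bearer <token>"},         # placeholder credential
    json={"job_type": "ingestion", "table": "orders"},   # values passed to the API
)
response.raise_for_status()
print(response.json())                                   # values passed back to the caller
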
Automation: Optimizing human-involved processes by using configuration, not code, to implement and run best practice software solutions.
AWS (Amazon Web Services): The Amazon cloud service provider, which among many other services, hosts EMR, a distributed data lake style management system based on Apache Spark, and supported by Infoworks. AWS also supports Redshift, a distributed data warehouse style data management system.
Azure: The Microsoft cloud service provider, which among many other services, hosts HDInsight, a distributed data lake style management system based on Apache Hadoop. Azure also supports Azure Databricks, a data lake style data management system supported by Infoworks, and Azure Synapse, a data warehouse style data management system.
Best Practice: An industry term referring to a specific business process and/or software implementation or configuration which is widely accepted as an optimal approach to solving a recurring problem.
Big Data: Data domains in which the total volume, variety, and velocity of data points to be ingested and analyzed exceeds the capacity of a traditional monolithic RDBMS, and instead requires clustered computing solutions.
Change Data Capture (CDC): A set of design patterns by which the pre-change and post-change state of a record are captured, with timestamps and similar metadata, to enable robust logging, pause/restart, and similar features, as part of a data ingestion or export process.
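
For example, a minimal sketch of one captured change in Python, assuming a simple dictionary-based representation; the table and field names are illustrative:

# One captured change, holding the pre-change and post-change record images.
from datetime import datetime, timezone

change_event = {
    "table": "customers",
    "operation": "UPDATE",
    "before": {"customer_id": 42, "email": "old@example.com"},
    "after": {"customer_id": 42, "email": "new@example.com"},
    "captured_at": datetime.now(timezone.utc).isoformat(),
}
# Downstream consumers can replay events like this to keep a target in sync,
# or use captured_at to pause and restart an ingestion from a known point.
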
Closed Source Software: Closed source software delivers binary products compiled from source code, which itself is retained as confidential work product. The compiled binary code may be used in whatever manner is allowed by the license under which it is distributed and/or sold. In these ways, it is the opposite of open source software.
Cloud Data Lake (CDL): Software implementing a data lake architecture from within a cloud service provider. Migrating data from on-premise data sources to a cloud data lake or cloud data warehouse, or between these architectures, is a core function of Infoworks.
Cloud Data Migration (CDM): The automated processes by which on-premise data is selectively or comprehensively moved to cloud service providers, commonly into data lake or data warehouse architectures, or both, depending on the structure and strategy needed to support intended workloads. Cloud Data Migration is a core Infoworks function.
Cloud Data Warehouse (CDW): Software implementing a data warehouse architecture from within a cloud service provider. Historically, most data warehouse software ran as an on-premise system. Cloud Data Migration, the migration of data from on-premise data management systems of all types, including data warehouses (e.g., Teradata) and distributed data processing systems (e.g., Hadoop), is a core Infoworks function.
Cloud Service Provider (CSP): A vendor offering computing infrastructure and software as a service (e.g., GCP, Azure, AWS, etc.).
Coarse Data: Data which has not been cleansed.
Compute Provider: A predefined processing resource capable of being assigned to a job to run a workload.
Configuration, not code: An industry phrase referring to a trend towards abstracting tasks, which previously required code, to interfaces and applications which need only to be provided relevant settings, then run, to generate the same outcomes which previously required designing, writing, and maintaining custom code.
CSV (Comma Separated Value) File: A structured data file containing text written as rows and columns, to express tabular data, where each value in a row is separated by a comma. Commonly, but not necessarily, the first row lists column headers relating to values in the same ordinal position in subsequent rows. Common variants include tab and pipe ( | ) separated values. CSV files may be ingested by Infoworks.
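
For example, a minimal sketch of reading such a file with Python's standard csv module; the file and column names are illustrative:

# Read a header-first CSV file, mapping each row to {header: value}.
import csv

with open("orders.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["order_id"], row["total"])

# Tab- or pipe-separated variants can be read by passing delimiter="\t" or delimiter="|".
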
Cube: A common data warehouse design pattern in which a fact table, along with a set of related dimension tables, is designed, and all relevant measurements (aggregations) across all possible rows have been pre-calculated for fast lookup.
Data Architecture: The systems and processes available to deliver the storage and compute resources needed to manage ongoing workloads related to the structured data and unstructured data in an enterprise data ecosystem.
Data Cleansing: The process of detecting, correcting, and potentially removing incomplete, inaccurate, incorrect, or irrelevant data from a table or database. Uncleansed data is commonly described as dirty or coarse.
Data Compute: Compute resources within a data engine which are available for assignment to a workload. These data compute resources are typically persistent clusters or ephemeral clusters.
Data Engine: A system designed to perform big data operations and analytics using clustered computing infrastructure. Infoworks abstractly automates data engines, ensuring long term flexibility for enterprises using these tools. Data engines generally fall into two categories: those based on Apache Hadoop and those based on Apache Spark. Hadoop based data engines include Cloudera, Hortonworks, MapR, and HDInsight. Spark based data engines include Databricks, GCP Dataproc, AWS EMR, Snowflake, and GCP BigQuery.
Data Enrichment: A process by which fields and records may be modified to add additional data and/or structure during an ingestion or transformation process.
Data Fabric: A technical product marketing term describing a data architecture exposing, to end users, a searchable, self-service set of data services which provide consistent capabilities to integrate, aggregate, prepare, govern, and orchestrate all data types, exposed via endpoints from across a hybrid, multi-cloud environment. Infoworks helps enable data fabric architectures, which are themselves integrated from many systems.
Data Governance: A set of processes by which data access, data change, and related regulatory compliance procedures can be monitored and controlled.
Data Ingestion: The process of transferring data from a source (e.g., RDBMS table, CSV file, API, etc.) to a data lake (e.g., Infoworks). The transfer process may be variously configured. For example, a small table with few changes over time may be fully loaded, and periodically replaced. By contrast, a large table may be incrementally ingested, and its ongoing updates captured by CDC.
Data Lake: A repository of unstructured, structured, or semi-structured data, ingested from many sources, available for machine learning, similar use cases, and for transformation into structures needed to support workloads which have been newly discovered via machine learning. A data lake may be on-premise, in the cloud, or hybrid, and may be part of a modern data architecture.
Data Lakehouse: An environment in which data structured using a data warehouse architecture (for business reporting) is simultaneously available in an unstructured data lake architecture (for machine learning). Different vendors implement this architecture differently. For example, Cloud Data Warehouse vendor Snowflake can enable a "snowpark" of unstructured data, available for machine learning, as part of its onboarding process. And Cloud Data Lake vendor Databricks can enable a "lakehouse" of structured data via a virtual data warehouse exposed over an underlying data lake.
Data Mart: A subset of a data warehouse focused around a specific business domain, built to increase access speed for that domain's analytics.
Data Operations: Often abbreviated as "DataOps," this industry term refers to an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics, by automating the flow and transformation of data from source to target. While DataOps began as a set of best practices, it has matured to become a new and independent approach to data analytics.
Data Pipeline: An industry term referring to the comprehensive flow of data from a source to a target, including physical ingestion and transformation, as well as scheduling and latency operations related to these processes. In Infoworks, a transformation pipeline handles only the transformation aspects of this flow. Ingestion and workflow orchestration are handled by separate aspects of the product.
Data Transformation: The process of merging, cleansing, and enriching data to support new use cases. Data transformation is commonly, but not necessarily or entirely, accomplished through the use of SQL. See transformation pipeline.
Data Warehouse: A repository of structured data, ingested from many sources, and pre-transformed into structures needed to support known business intelligence workloads, and related analytics. It may be on-premise, in the cloud, or hybrid. A data warehouse may be part of a modern data architecture. Commonly, warehoused data will be denormalized to better support standardized analytical tasks.
Data Warehouse Automation: The process of applying automation to data ingestion and transformation processes relating to a data warehouse.
Data-Driven Organization: An organization which has built an enterprise data architecture to proactively gather, share, and mine data, towards supporting business decisions before specific needs have been identified.
Dataproc: The GCP distribution of Apache Spark.
Denormalization: The process of enriching records with redundant data from other tables, forming a single composite record, eliminating the need to join related records to access composite data, and thus optimizing for greater access speed. See normalization.
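
For example, a minimal sketch of denormalization in Python using pandas; the table and column names are illustrative:

# Enrich order rows with redundant customer attributes so no join is needed at query time.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "region": ["West", "East"]})
orders = pd.DataFrame({"order_id": [10, 11], "customer_id": [1, 2], "total": [99.0, 45.0]})

denormalized = orders.merge(customers, on="customer_id", how="left")
print(denormalized)  # each order row now carries its customer's region directly
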
Digital Transformation: A change process by which automated data ingestion and data transformation workloads are implemented to integrate legacy data systems into a modern data architecture, creating an agile, comprehensive data platform, through which enterprise-wide data is pervasively gathered, measured, described, located, and consumed by business users, to drive business efficiencies.
Dirty Data: Data which has not been cleansed.
Distributed Compute: A compute architecture by which workloads are scattered among multiple individual processing nodes, then gathered back to assemble and return results, when workload processing is complete. This architecture enables the processing of data workloads too large for an individual computing node.
EDO2 (Enterprise Data Operations and Orchestration): A technology by which the onboarding, preparation, and operationalization of any number of data sources and targets may be automated, enabling the efficient implementation of a hybrid, multi-cloud data fabric.
EDW Migration: See Cloud Data Migration.
ELT (Extract, Load, Transform): A data migration pattern by which applications read (extract) data from a data source (commonly OLTP systems), and insert (load) it directly to a data lake, where it may later be cleansed, enriched, targeted for machine learning, or have structure added to the data (transform).
EMR: The AWS (Amazon Web Services) distribution of Hadoop.
Enterprise Cloud Bridge (ECB): An on-premise Infoworks component that crawls metadata and ingests data into an Infoworks managed cloud data lake or cloud data warehouse.
Ephemeral Cluster: A set of virtual compute nodes, designed to run together as a cluster within a distributed data management system (e.g., Apache Spark, etc.), which are instantiated ("spun up") based on a configuration template then, once running, are assigned to perform a specified job. When the job is complete, the ephemeral cluster is decommissioned ("spun down"), to minimize compute resource cost. By spinning down rather than idling, ephemeral clusters are more cost-efficient for certain job processing patterns than persistent clusters.
ETL (Extract, Transform, Load): A data migration pattern by which applications read (extract) data from a data source (commonly OLTP systems), denormalize, cleanse, and enrich the data (transform), then insert the transformed data (load) to a data warehouse.
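
For example, a minimal sketch of the three ETL steps in Python; the functions and data are illustrative stand-ins, not a production implementation:

# Extract rows from a source, transform them in flight, then load the result.
def extract():
    return [{"name": " Ada ", "amount": "100"}, {"name": "Bob", "amount": "250"}]

def transform(rows):
    # cleanse (trim names) and normalize types before loading
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in rows]

def load(rows):
    for r in rows:
        print("loading into warehouse:", r)  # stand-in for a real load step

load(transform(extract()))

In the ELT pattern, by contrast, the raw extracted rows would be loaded first, and the transform step deferred to the target system.
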
Expert Systems: A system capturing the process flow of a particular problem domain, such that if particular factors are exhibited, then a particular decision should be made. Expert systems are fundamentally different from ML, in that they simply transcribe - more or less accurately - a human-generated decision making model (heuristic). By contrast, ML develops its own heuristic based on analyzing many examples and observing the relationships between them.
Flume: A technology for ingesting streaming data into Hadoop.
GCP (Google Cloud Platform): The Google cloud service provider, which among many other services, hosts Dataproc, a distributed data lake style management system based on Apache Spark, and supported by Infoworks. It also supports BigQuery, a distributed data warehouse style data management system.
Granularity: The level of specificity at which facts are recorded in a table (e.g., total_daily_sales vs. one sales_id for each product_id sold). Higher granularity may be more flexible, but also increases storage and analytical effort.
Hadoop Migration: The process of moving existing workloads and data from on-premise Hadoop implementations to a cloud data warehouse or cloud data lake.
HDFS: A technology for distributing data files across a cluster, relying on a central NameNode to manage access to potentially thousands of DataNode instances.
Hive: A technology which provides an abstraction layer over HDFS data stores, by rewriting SQL (HiveQL) statements into corresponding Hadoop MapReduce (or Spark, Tez, etc.) jobs, executing those jobs, and returning a resulting dataset or response. Historically focused on high throughput ETL for data engineers.
Hybrid, Multi-Cloud: A digital infrastructure in which on-premise compute infrastructure participates in common, secure networks with other systems and services exposed through one or more cloud service providers.
IaaS (infrastructure as a service): The provisioning of compute and storage resources by a cloud service provider.
Impala: A technology, similar to Hive, providing an abstraction layer over HDFS data stores, scattering SQL statements for parallel processing on individual HDFS nodes, then gathering and assembling results. Historically focused on low latency interactive query, for analysts. Less fault tolerant than Hive.
JDBC: An open standard Java API that manages connecting to a database, issuing queries and commands, and handling result sets obtained from the database.
JSON (Javascript Object Notation) File: A structured data file containing text written in Javascript Object Notation, enabling the expression of more complex, hierarchical relationships than CSV notation. JSON files may be ingested by Infoworks.
Kafka: An open source distributed event-streaming platform used to build data pipelines, streaming analytics, and data integration capabilities.
Logical Data Warehouse: A recent industry term describing a logical (virtual) structure created over disparate sources, with pre-integrated analytical tools, governance, and unified metadata, to provide a "single version of truth" for data warehouse use cases, while underlying data remains available for use cases better suited to a data lake or other architecture. A core design goal is "store once, share many."
Machine Learning (ML): The application of standard algorithms to prepared data sets in order to detect patterns and derive predictive insights supporting ongoing decision-making processes. There are six steps in the standard ML lifecycle:

  1. Acquisition
  2. Preparation (discover, structure, sequence, clean, enrich)
  3. Analysis/Labelling
  4. Algorithm Training
  5. Testing
  6. Deployment

MapReduce: An open source technology for distributing jobs and their tasks across the nodes of an HDFS file system, for processing in a scatter-gather architecture.
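
For example, a minimal word-count sketch of the map and reduce phases in plain Python; this is illustrative only, since a real MapReduce job distributes these steps across cluster nodes:

from collections import defaultdict

documents = ["big data big compute", "data lake data warehouse"]

# Map: emit (word, 1) pairs from each input split
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/Reduce: gather pairs by key and sum the counts
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 3, 'compute': 1, 'lake': 1, 'warehouse': 1}
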
Merge: An ETL pattern in which a target table is incrementally updated by comparing source and target data, and either updating or inserting target rows from source data, as needed.
Metadata: Data which describes other data, and falls into three broad categories: descriptive, administrative, and structural. Descriptive metadata enables discovery, identification, and selection (e.g., title, author, subject, keywords, etc.). Administrative metadata enables management (e.g., permissions, timestamps, etc.). Structural metadata identifies data types and relationships (e.g., column headers, data types, foreign keys, indexes, etc.).
Model: An algorithmic decision making framework which is trained by relevant data exposure so that, based on a given set of inputs (A), it is capable of predicting an output (B) within a targeted degree of statistical accuracy.
Modern Data Architecture: An industry term describing systems integrated to expose both structured data and unstructured data to appropriate, corresponding workloads, typically deployed over hybrid, multi-cloud infrastructure.
Normalization: The process of organizing data within related tables, to reduce redundancy, by placing records with logically common attributes in the same table, then defining dependencies from those records to other tables holding related records (e.g., a Customer table may be related to an Order table). This process tends to improve data consistency and reduce total storage cost, but it also reduces access speed, while increasing the cost and complexity of data retrieval. Successively more comprehensive degrees of normalization have been defined as Normal Forms (First Normal Form, Second Normal Form, etc.). See denormalization.
OLAP (Online Analytical Processing): Umbrella term for applications loading and reading denormalized facts and their related dimensions for analytical use. OLAP data is commonly stored in 2NF relational tables, OLAP cubes, Hadoop, etc.
OLTP (Online Transactional Processing): Umbrella term for applications creating, reading, updating, and deleting business-critical facts. OLTP data is commonly stored in 3NF relational tables.
On-Premise: A term commonly referring to compute infrastructure directly owned and maintained by its end user.
Open Source Software (OSS): A term which refers to how software is licensed and distributed. Open source software provides access to uncompiled source code, and may be used, re-used, modified, and recompiled without cost, provided the user complies with the terms of its license. In this way, it is the opposite of closed source software.
Orchestration of Data Operations: The process of coordinating, scheduling, monitoring, and controlling the ongoing state of a set of data ingestion, transformation, optimization, cataloging, and export operations.
Parallelization of Data Loads: A technique by which data ingestion or export is configured to run via two or more connections, to increase total throughput speed.
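
For example, a minimal sketch in Python using concurrent.futures; the partitions and load function are illustrative stand-ins:

# Load several data slices over concurrent connections rather than one by one.
from concurrent.futures import ThreadPoolExecutor

partitions = ["2024-01", "2024-02", "2024-03", "2024-04"]

def load_partition(partition):
    return f"loaded {partition}"  # stand-in for ingesting or exporting one slice

with ThreadPoolExecutor(max_workers=4) as pool:  # four concurrent workers
    for result in pool.map(load_partition, partitions):
        print(result)
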
Parquet: A column-oriented data storage format widely used in the Hadoop ecosystem.
Persistent Cluster: A set of virtual compute nodes, designed to run together as a cluster within a distributed data management system (e.g., Apache Spark, etc.), which are maintained in a running state to be available for assignment to specified jobs. When the job is complete, the persistent cluster remains available for the next job. By remaining available, and avoiding spin-up and spin-down times, persistent clusters are more cost-efficient for certain job processing patterns than ephemeral clusters.
Pig: A technology and high-level data manipulation syntax (Pig Latin) for generating Hadoop MapReduce jobs, to reduce reliance on lower-level Java coding.
Pipeline: The full sequence of data migration related activity, from ingestion, through transformation, to the scheduling and orchestration of these operations. See transformation pipeline.
Platform: A computing environment providing a comprehensive set of tools for defining and executing processes specific to the business, content, or technical domain to which the platform applies (e.g., personal computer operating system, cloud service provider, RDBMS, storage network, etc.). Commonly, platforms overlap and interrelate in their capabilities.
RDBMS (relational database management system): A technology enabling the use of SQL for defining, creating, reading, updating, and deleting records from tables, while expressing relationships among these tables to support querying for rows comprised of data which resides, in parts, among multiple tables in the database. Many additional user, process, access latency, and security related features are commonly packaged as part of an RDBMS product.
SaaS (software as a service): Software available from a cloud service provider, generally billed by compute resources used. It may be more efficient than on-premise software, in terms of total cost of ownership.
Slowly Changing Dimensions: Dimensions related to facts that change over time without following any inherent pattern (e.g., product_price). There are six established patterns by which SCD may be tracked within a Data Warehouse.
Solution: As used in the software industry, this term commonly refers to a set of tools which have been integrated to solve problems in a specific business domain, using automation and configuration, not code. The differences between tools and solutions can be described along a spectrum.
Source Type Extension: A process by which data types assigned to values in a data source may be converted to a corresponding data type in a target data destination, with no loss of precision. This technique can optimize data access from the target by streamlining diverse source data types to common types in the target.
Speed To Value: A phrase describing a primary value proposition of the automation provided by systems like Infoworks. Cloud Data Migration adds significant value to enterprise data operations, and Infoworks speeds the process of achieving this value.
SQL (Structured Query Language): A classic database technology, expressed as a language, with both standard forms and many vendor specific variants, for defining, populating, and querying the contents of relational data structures. In Infoworks, transformation pipelines provide a high level abstraction for generating complex SQL statements, for many different environments, integrated with and supported by related technologies.
Sqoop: An open source technology for ingesting RDBMS data into Hadoop.
Structured Data: Data sets for which one or more records share a common, well organized format, often but not necessarily described by external metadata (e.g., an RDBMS table, CSV or JSON file).
Tool: As used in the software industry, this term commonly refers to a software implementation which is highly focused to solve a relatively granular problem. Commonly, a tool will be designed to expose an API enabling the tool to be incorporated as part of a wider solution. The differences between tools and solutions can be described along a spectrum.
Truncate and Load: An ETL pattern in which the target table is truncated (a DDL-based fast delete of all contents) and fully re-loaded with source data.
Unstructured Data: Data sets for which no readily discernible common format is available without complex introspection (e.g., a folder of various business documents, or of log files generated by disparate systems).
Workflow Orchestration: The process of optimizing related ingestion and transformation workloads by coordinating the runtimes and resources assigned to their underlying jobs.

Data Management Technical Terms

Data Definition Language (DDL): The aspect of SQL which applies to defining, creating, and modifying database objects, including tables and their data types, indexes, and users.
Dimension Table: A table comprised of attributes related to a fact, providing the metrics by which those related facts may be measured as to the who, what, when, where, why, how, and how many of those facts. Can be visualized as a star of dimension fields related to a particular fact field. Where dimension fields may be shared by multiple facts, this relationship can be visualized as a constellation. As a generalization, quantitative (numeric) fields tend to be facts, while qualitative (text) fields tend to be dimensions.
Endpoint: The physical address of a service, commonly expressed as a URL.
Fact Table: A table comprised of granular data points to be analyzed by a set of related dimensions (who, what, when, where, why, how, how many). The granularity of this table is the level at which the underlying business fact occurs (e.g., one sale of a specific product). Commonly comprised of a specific data point (aka fact or measure) (e.g., units_sold), and a dimension key (e.g., product_id) by which that measure can be related to its various dimensions of measurement. As a generalization, quantitative (numeric) fields tend to be facts (aka metrics or measurements), while qualitative (text) fields tend to be dimensions (aka attributes or descriptions).
Field: One data element in a record.
Hive Bucketing: A technique by which records within any given partition (e.g., State within Orders, City within State within Orders, or even just Orders) are stored in one of a specified number of data files, called buckets, rather than a single data file for each partition. The destination bucket for each record stored in this partition is determined by a hash of its key. Like partitioning, this technique increases the potential for parallel rather than sequential table scans. Also note that this technique may be implemented either alternately or in addition to partitioning.

(Orders table clustered by 3 buckets)

  /Orders
    part-0000
    part-0001
    part-0002
    etc.

(Orders table clustered by 3 buckets, with partitioning by State)

  /Orders
    /Alaska
      part-0000
      part-0001
      part-0002
      etc.
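
A minimal sketch of the bucket assignment idea in Python, using the built-in hash as a stand-in for Hive's hashing; the key values are illustrative:

NUM_BUCKETS = 3

def bucket_for(order_id):
    # the bucket index determines which part-000N file receives the record
    return hash(order_id) % NUM_BUCKETS

for order_id in (101, 102, 103, 104):
    print(order_id, "-> bucket", bucket_for(order_id))
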

Hive Partitioning: A technique by which data is physically organized into folders, called partitions, based on values in a designated partition column. For example, an Orders table could be partitioned by State values. In a subsequent query filtering results by State, only records from the matching State folder would be loaded and processed for this query.

Note the relevance of partition column cardinality. High cardinality results in a higher number of smaller partitions, which must be weighed against the system's capacity for parallel processing: too many partitions in a cluster with too little parallel capacity to efficiently process all those partitions will impede, not improve, performance.

Partitions can also be hierarchically nested. For example, Orders could be partitioned by City within State (or perhaps by Year, Month, Day). In a query filtering both by City and State, an even smaller and more efficient set of files would be loaded into memory, to process the rest of the query. This technique increases the potential for parallel rather than sequential table scans.

  /Orders
    /Alaska
      /Anchorage
      /Fairbanks
      etc.
    /Arkansas
    etc.

ML Algorithms (Supervised Learning):

  Linear Regression - used to predict numeric values
  Logistic Regression - used to predict binary outcomes
  Decision Tree - used for categorization problems
  Random Forest - improved decision tree
  Naïve Bayes - prediction from independent predictors
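
As an illustration of the first of these, a minimal scikit-learn sketch in Python; the feature and label values are made up:

# Fit a linear regression to known examples, then predict a numeric value.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]      # feature, e.g., units produced
y = [10.0, 20.0, 30.0, 40.0]  # label, e.g., observed cost

model = LinearRegression().fit(X, y)
print(model.predict([[5]]))   # predicted cost for 5 units -> [50.]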

ML Categories:

  Supervised - Given A, predict B. The machine is trained to identify a type of data sharing a common label (e.g., face) by analyzing a volume of data points that both match (e.g., a face) and do not match (e.g., not a face) the target label. The distinguishing feature of this category is that the desired output (e.g., faces) is known in advance.

  Reinforcement - The machine is trained to classify data points based on their interaction with a common environmental model, wherein successful interactions are rewarded, and the model updated to prefer data points exhibiting the features of previously successful data points. For example, if a red apple were found successful based on its characteristics (color, shape), a green apple would be more likely to be flagged as an apple based on its shape, despite the difference in color.

  Unsupervised - The machine is trained to classify data points according to common measures detected during the learning process. For example, fruit might be classified by its color, size, surface texture, prevalence within the set, etc.

  Transfer - A model trained for one use case (e.g., sort cats from dogs) is applied to a new use case (e.g., sort fruits from vegetables).

ML Cost Benefit: Do the benefits of the repeated decisions made based on predictive insights outweigh the cost of acquiring data and training the required models? ML is most valuable to ongoing processes and their decisions, and much less useful with occasional or ad hoc decisions (because they provide too little data to analyze).
ML Data Cleanliness: The degree to which average data inputs within a set are properly formatted, and are accurate relative to the question to be analyzed (e.g., do all inputs record temperature data by the same measure, F or C?).
ML Data Completeness: The extent to which all factors necessary for analysis are represented within the target data set (e.g., if you are analyzing for product defects, and heat can cause defects, do your inputs include temperature data? Do you have the right and sufficient columns to describe the problem?)
ML Data Coverage: The degree to which all possible variant states to be analyzed are represented within a data set (e.g., if you are analyzing for product defects caused by heat, motion, light, and impact, does the target set include multiple examples of defective states caused by each of these factors? Do your rows of data cover all the possible variants?)
ML Example: A single item comprised of one or more features, which are distinct data elements describing the example, and a label, which is the definitive numeric value for, or text classification of, this example. ML models are used to predict the label for a proposed example based on patterns learned from the examples provided during training. Features are descriptions that allow algorithms to figure out a label; labels are the values to be predicted.

ML Libraries (Python): Matplotlib - generates charts from data. Seaborn - visualization of the distribution within a statistical model (e.g., heat maps). Scikit-Learn - mining and analysis tool extending NumPy and Matplotlib. Pandas - data wrangling tool enabling manipulation, aggregation, and some visualization. NumPy - operations on numeric arrays and matrices.
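
For example, a minimal sketch combining two of these libraries; the data is made up, with pandas wrangling a small table and Matplotlib rendering the chart:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [120, 150, 90]})
monthly = df.set_index("month")["sales"]

monthly.plot(kind="bar")          # pandas delegates the chart to Matplotlib
plt.ylabel("sales")
plt.savefig("monthly_sales.png")  # write the chart to an image file
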
Record: A set of related fields. See row.
Row: A set of related fields (attributes) expressed as one element within a data table.
SCD 0: Changes to source records are not tracked in the Data Warehouse. Applied when source dimensions are no longer relevant to the purpose of the DW.
SCD 1: Changes to source records are propagated to the Data Warehouse, but no history is retained about the change. Applied when only the current dimension will ever be relevant to the facts.
SCD 2: Changes to source records are propagated to the Data Warehouse as new rows, with a timestamp indicating the effective date of the add/change. Any further changes to this record cause another new row to be added for the same row key, with updated fields, and an updated effective date timestamp. A current/active record field may also be applied to such records, to assist with indexing and filtering for current records, though currency may also be calculated based on date values. Applied when historic data may be relevant to future analyses. (A minimal sketch of this pattern follows the SCD entries below.)
SCD 3: Changes to source records are propagated to the Data Warehouse, by adding new columns to an existing record (e.g., address field and previous_address field), along with an effective date field for the current state of the current fields. Rarely used, because only one change (previous from current) is tracked, and because it complicates coding and reporting, by requiring the use of alternative field names when accessing historical data.
SCD 4: Changes to source records are propagated to the Data Warehouse, by updating the current record to new values, with an effective date timestamp, and also migrating the previous values to a related history table, which includes both start and end date timestamps for this record. May be used as an alternative to SCD 2, if there is a need to control the size of the primary tables (e.g., high volume transaction tables, where there could be hundreds of updates per day / hour / minute).
SCD 6 (1+2+3): Changes to source records are propagated to the Data Warehouse, by adding a new record to the existing table which includes both current and previous columns for relevant fields, as well as effective date timestamps for changes, along with a current flag. Not commonly used, as it is effectively redundant in its ability to access historical data, by either using the previous value field, or accessing based on effective date timestamps, or based on using the current field flags.
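
As referenced under SCD 2 above, a minimal sketch of Type 2 row versioning in Python; the dimension rows and column names are illustrative:

from datetime import date

dimension = [
    {"customer_id": 42, "city": "Austin",
     "effective_date": date(2023, 1, 1), "is_current": True},
]

def apply_scd2(rows, key, changes, as_of):
    for row in rows:
        if row["customer_id"] == key and row["is_current"]:
            row["is_current"] = False  # retire the previous version
    rows.append({"customer_id": key, **changes,
                 "effective_date": as_of, "is_current": True})

apply_scd2(dimension, 42, {"city": "Denver"}, date(2024, 6, 1))
print(dimension)  # both the historical row and the new current row are retained
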
Table: A set of one or more records arranged in rows (records) and columns (fields, attributes).