Dimensionality reduction on Dirac

Device: Dirac-1

Introduction

Machine learning problems often start with a large number of features. A dimensionality reduction approach can reduce the number of features, which is especially useful when the features feed an unsupervised algorithm such as clustering. In what follows, we present a tutorial on using QCi's technology to implement a QUBO-based Singular Value Decomposition (SVD) algorithm. Unlike most other Dirac tutorials, this one does not focus on solving a computationally hard (for example, NP-hard) problem: singular value decompositions can be performed in much the same way using classical matrix diagonalization routines (for example, those available in LAPACK or ARPACK, which can be accessed through SciPy, MATLAB, and Mathematica) instead of QUBO solvers. The motivation here is to show the versatility of our hardware and the underlying QUBO model.
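For reference, the classical route is only a few lines. The sketch below is not part of the tutorial's workflow; it uses an arbitrary random matrix to show SciPy's LAPACK-backed SVD and an ARPACK-based eigensolver doing the same job.

import numpy as np
from scipy.linalg import svd              # dense SVD (LAPACK)
from scipy.sparse.linalg import eigsh     # iterative eigensolver (ARPACK)

# Toy data: 200 samples, 10 features (illustrative only)
rng = np.random.default_rng(0)
F = rng.normal(size=(200, 10))

# Classical SVD of the data matrix
U, s, Vt = svd(F, full_matrices=False)

# Largest eigenvalue of F^T F via ARPACK; it equals the square of the
# largest singular value of F
w, _ = eigsh(F.T @ F, k=1, which="LA")
print(np.isclose(w[0], s[0] ** 2))        # True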

Importance

Singular value decomposition, and the closely related task of matrix diagonalization, is key to many applications. One of those applications is the one studied here, principal component analysis, which provides a method to extract the variables that are most important for explaining the data. The first component is effectively the direction in a high-dimensional space along which the data vary the most; informally, it can be thought of as the combination of variables that provides the most information about the data points. The second component explains the next largest share of the variation, and so on. Dimensionality reduction is thus the task of building a small set of derived quantities that retains most of the information in the original data while being much smaller than each full data point. While we will not discuss it in detail here, this can be thought of as taking a subspace of a much higher-dimensional space by effectively fitting a high-dimensional ellipse to the data and keeping the directions in which the ellipse is largest. This process is easily visualized in two dimensions (see, for example, the Wikipedia article on principal component analysis) but cannot be visualized in the higher dimensions that are of practical interest.

Applications

As an unsupervised method, principal component analysis is well suited to exploring data without prior knowledge of its structure (a review can be found here). The example here analyzes prescription data, but there are many other potential applications for principal component analysis. For example, the technique is used extensively in the chemical analysis of food samples, where the number of chemical tests can be reduced to only the most relevant ones. Another example is portfolio optimization, where diversification can be achieved by choosing stocks that contribute strongly to different principal components. A very different application is in atmospheric science, where the most important modes of atmospheric movement (weather patterns) can be identified by finding principal components corresponding to sets of regions where quantities such as pressure are highly correlated. In all of these settings, however, care must be taken in how the method is applied.

Methodology

Principal Component Analysis (PCA) is a dimensionality reduction technique based on a Singular Value Decomposition (SVD) of the data matrix. Consider a dataset with $N$ samples and $d$ features, represented by an $N \times d$ matrix $F$. We can decompose $F$ as,

F = U D V^T

where $U$ is an $N \times d$ matrix with orthonormal columns, $D$ is a square diagonal $d \times d$ matrix, and $V$ is a $d \times d$ orthogonal matrix, that is, a matrix with orthonormal columns and rows. Note that the columns of $U$ form an orthonormal basis $\{\mathbf{u}_k\}_{k \in \{0, 1,..., d-1\}}$ for the vector space spanned by the columns of $F$. It can be shown that the larger the $k$-th diagonal element of $D$, the larger the contribution of the basis element $\mathbf{u}_k$. Thus, one can rank the basis elements $\mathbf{u}_k$ by the value of $D_{kk}$ and choose a subset of the most important ones; this yields a lower-dimensional representation of the data matrix $F$. Assuming that $D_{00} > D_{11} > ... > D_{d-1,d-1}$, $\mathbf{u}_0$ is said to be the first principal component of the original matrix $F$.
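As a quick sanity check of the decomposition and of ranking components by the diagonal of $D$, the sketch below uses NumPy on an arbitrary random matrix (purely illustrative; not part of the tutorial's pipeline):

import numpy as np

# Illustrative check of F = U D V^T on a toy 100 x 5 matrix
rng = np.random.default_rng(42)
F = rng.normal(size=(100, 5))

U, s, Vt = np.linalg.svd(F, full_matrices=False)    # thin SVD: U is N x d
print(np.allclose(F, U @ np.diag(s) @ Vt))          # True: decomposition holds
print(np.allclose(U.T @ U, np.eye(5)))              # columns of U are orthonormal

# Keeping only the k components with the largest D_kk gives the best
# rank-k approximation of F in the least-squares sense
k = 2
F_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(F - F_k), np.sqrt(np.sum(s[k:] ** 2)))  # these agree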

It can be shown that

\mathbf{u}_k = \frac{1}{\sqrt{\lambda_k}} F \mathbf{v}_k

where $\lambda_k$ and $\mathbf{v}_k$ denote the $k$-th eigenvalue and normalized eigenvector of the symmetric $d \times d$ matrix $G := F^T F$. Note too that $\mathbf{v}_k$ is in fact the $k$-th column of matrix $V$ and that $\sqrt{\lambda_k}$ is the $k$-th diagonal element of matrix $D$. Finding the first principal component of $F$ is then equivalent to finding the eigenvector of $G$ corresponding to its maximum eigenvalue, that is, the eigenvector of $-G$ corresponding to its minimum eigenvalue. An estimate of the first principal component of $F$ can then be obtained by solving a QUBO of the form,

\min_{\mathbf{q}} -\mathbf{q}^T G \mathbf{q}

where $\mathbf{q}$ is a $d$-dimensional binary vector. The first principal component of $F$ is,

\mathbf{u}_0 = \frac{1}{\sqrt{\lambda_0}} F \mathbf{v}_0

where $\lambda_0 = \mathbf{v}_0^T G \mathbf{v}_0$ and $\mathbf{v}_0 = \frac{\mathbf{q}}{\|\mathbf{q}\|}$. One can then replace each column of $F$ with its component orthogonal to $\mathbf{u}_0$ and repeat the above process to get the second principal component $\mathbf{u}_1$. This can be repeated to get any desired number of principal components of $F$, say $\mathbf{u}_0, \mathbf{u}_1, ..., \mathbf{u}_{d^\prime}$, where $d^\prime < d$. To remove a principal component $\mathbf{u}$ from a feature matrix $F$, we have,

F^{new} = F - \mathbf{u} \mathbf{u}^T F

Note that this is an approximation to a full PCA, because the vector $\mathbf{q}$ in the above equation is binary. The approximation is equivalent to quantizing the directions in feature space, so that PCA can only choose from a potentially large but finite set of directions, namely those corresponding to $d$-dimensional binary vectors $\mathbf{q}$.
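Before moving to Dirac, the whole procedure can be prototyped classically. The sketch below is illustrative only: it solves the QUBO $\min_{\mathbf{q}} -\mathbf{q}^T G \mathbf{q}$ by brute-force enumeration (feasible only because the toy $d$ is small), applies the deflation step $F \leftarrow F - \mathbf{u}\mathbf{u}^T F$, and compares the resulting directions with NumPy's exact SVD. On Dirac-1, the enumeration is replaced by the QUBO solver used later in this tutorial.

import itertools
import numpy as np

def first_pc_binary(F):
    """Estimate the first principal component using a binary direction vector."""
    d = F.shape[1]
    G = F.T @ F
    best_q, best_val = None, np.inf
    for bits in itertools.product([0, 1], repeat=d):
        if not any(bits):
            continue                          # skip the all-zero vector
        q = np.array(bits, dtype=float)
        val = -q @ G @ q                      # QUBO objective
        if val < best_val:
            best_q, best_val = q, val
    v0 = best_q / np.linalg.norm(best_q)      # v0 = q / ||q||
    lam0 = v0 @ G @ v0                        # lambda0 = v0^T G v0
    u0 = F @ v0 / np.sqrt(lam0)               # u0 = F v0 / sqrt(lambda0)
    return u0 / np.linalg.norm(u0)

# Toy data: 50 samples, 6 features (illustrative only)
rng = np.random.default_rng(1)
F = rng.normal(size=(50, 6))

F_work = F.copy()
components = []
for _ in range(3):
    u = first_pc_binary(F_work)
    components.append(u)
    F_work = F_work - np.outer(u, u @ F_work)  # deflation: F <- F - u u^T F

# Overlap with the exact left singular vectors; values closer to 1 mean the
# quantized-direction approximation found a similar direction
U_exact, s, Vt = np.linalg.svd(F, full_matrices=False)
for i, u in enumerate(components):
    print(abs(u @ U_exact[:, i]))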

Medicare Prescription Data

We implement this approach using a publicly available dataset on opioid prescriptions in the United States. The dataset can be found at https://www.cms.gov/data-research/statistics-trends-and-reports/medicare-provider-utilization-payment-data/part-d-prescriber

Clean data

We start by cleaning the dataset.

In [2]:

import pandas as pd

# Input
INP_FILE = "Medicare_Provider_Utilization_and_Payment_Data__Part_D_Prescriber_Summary_Table_CY2014__50001-NNN__ANON.csv"
OUT_FILE = "cleaned_medicare_data.csv"

# Continuous variables
CON_VARS = [
    "total_claim_count", "total_30_day_fill_count", "total_drug_cost",
    "total_day_supply", "bene_count", "total_claim_count_ge65",
    "total_30_day_fill_count_ge65", "total_drug_cost_ge65",
    "total_day_supply_ge65", "bene_count_ge65", "brand_claim_count",
    "brand_drug_cost", "generic_claim_count", "generic_drug_cost",
    "other_claim_count", "other_drug_cost", "mapd_claim_count",
    "mapd_drug_cost", "pdp_claim_count", "pdp_drug_cost",
    "lis_claim_count", "lis_drug_cost", "nonlis_claim_count",
    "nonlis_drug_cost", "opioid_claim_count", "opioid_drug_cost",
    "opioid_day_supply", "opioid_bene_count", "opioid_prescriber_rate",
    "antibiotic_claim_count", "antibiotic_drug_cost", "antibiotic_bene_count",
    "hrm_claim_count_ge65", "hrm_drug_cost_ge65", "hrm_bene_count_ge65",
    "antipsych_claim_count_ge65", "antipsych_drug_cost_ge65",
    "antipsych_bene_count_ge65", "average_age_of_beneficiaries",
    "beneficiary_age_less_65_count", "beneficiary_age_65_74_count",
    "beneficiary_age_75_84_count", "beneficiary_age_greater_84_count",
    "beneficiary_female_count", "beneficiary_male_count",
    "beneficiary_race_white_count", "beneficiary_race_black_count",
    "beneficiary_race_asian_pi_count", "beneficiary_race_hispanic_count",
    "beneficiary_race_nat_ind_count", "beneficiary_race_other_count",
    "beneficiary_nondual_count", "beneficiary_dual_count",
    "beneficiary_average_risk_score",
]

# Valid values for categorical variables
VALID_PROVIDER_MI = list("AMJLRSEDCBKPWHTGFNVIOYZUQX")
VALID_GEN = ["F", "M", "Other", "Unknown"]
VALID_ENTITIES = ["I", "O"]
VALID_DESC_FLAGS = ["S", "T"]
VALID_ENROLLS = ["E", "N", "O"]

# Some utilities
def convert_to_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

def convert_to_int(x):
    try:
        return int(float(x))
    except (TypeError, ValueError):
        return None

# Read data
df = pd.read_csv(INP_FILE, on_bad_lines="skip", low_memory=False)

# Clean categorical variables
df["nppes_provider_mi"] = df["nppes_provider_mi"].fillna("Unknown")
df["nppes_provider_mi"] = df["nppes_provider_mi"].apply(
    lambda x: x if x in VALID_PROVIDER_MI else "Unknown"
)

df["nppes_credentials"] = df["nppes_credentials"].fillna("Unknown")
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: str(x).replace(".", "")
)
cred_hash = {
    "MEDICAL DOCTOR": "MD",
    "NURSE PRACTITIONER": "NP",
}
df["nppes_credentials"] = df["nppes_credentials"].apply(
    lambda x: cred_hash[x] if x in cred_hash else x,
)

df["nppes_provider_gender"] = df["nppes_provider_gender"].fillna("Unknown")
df["nppes_provider_gender"] = df["nppes_provider_gender"].apply(
    lambda x: x if x in VALID_GEN else "Other",
)

df["nppes_entity_code"] = df["nppes_entity_code"].apply(
    lambda x: x if x in VALID_ENTITIES else "Unknown",
)

df["nppes_provider_zip5"] = df["nppes_provider_zip5"].fillna("Unknown")

df["nppes_provider_country"] = df["nppes_provider_country"].apply(
    lambda x: "US" if x == "US" else "Other",
)

df["description_flag"] = df["description_flag"].apply(
    lambda x: x if x in VALID_DESC_FLAGS else "Unknown",
)

df["medicare_prvdr_enroll_status"] = df["medicare_prvdr_enroll_status"].apply(
    lambda x: x if x in VALID_ENROLLS else "Unknown",
)

# Impute missing beneficiary counts (a valid count cannot be zero) with the
# mean count for the provider's specialty
df["bene_count"] = df["bene_count"].apply(convert_to_int).fillna(-1)
tmp_df = df.groupby(
    "specialty_description", as_index=False,
)["bene_count"].mean()
bene_count_hash = dict(
    zip(
        tmp_df["specialty_description"],
        tmp_df["bene_count"],
    )
)
df["bene_count"] = df.apply(
    lambda x: x["bene_count"]
    if x["bene_count"] > 0
    else bene_count_hash[x["specialty_description"]],
    axis=1,
)

# Convert continuous variables to float, filling missing values with 0.0
for item in CON_VARS:
    df[item] = df[item].apply(convert_to_float).fillna(0.0)

# Filter out invalid states
df = df[
    ~df["nppes_provider_state"].isin(["XX", "E", "N", "S"])
]

# Output
df.to_csv(OUT_FILE, index=False)

Generate features

We then generate features. Each categorical feature is encoded by replacing the category with the mean of a few important opioid-related variables over that category.
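To make the encoding concrete, here is a toy sketch of the same idea on a small hypothetical frame (the column names mimic the real data; the numbers are made up):

import pandas as pd

# Hypothetical toy frame: encode the categorical column "state" by the
# mean of a numeric column within each category, then merge it back
toy = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX"],
    "opioid_claim_count": [10, 20, 5, 15, 30],
})
enc = toy.groupby("state", as_index=False)["opioid_claim_count"].mean().rename(
    columns={"opioid_claim_count": "state_opioid_claim_count"}
)
toy = toy.merge(enc, how="left", on="state")
print(toy)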

In [3]:

import pandas as pd

# Input
INP_FILE = "cleaned_medicare_data.csv"
OUT_FILE = "medicare_features.csv"

# Categorical variables to be embedded
CAT_VARS = [
    "nppes_provider_mi",
    # "nppes_credentials",  # This is rather messy, so ignoring it.
    "nppes_provider_gender",
    "nppes_entity_code",
    "nppes_provider_city",
    "nppes_provider_zip5",
    # "nppes_provider_country",  # Almost all cases are US
    "specialty_description",
    "medicare_prvdr_enroll_status",
    "nppes_provider_state",
]

# Continuous variables
CON_VARS = [
    "total_claim_count", "total_30_day_fill_count", "total_drug_cost",
    "total_day_supply", "bene_count", "total_claim_count_ge65",
    "total_30_day_fill_count_ge65", "total_drug_cost_ge65",
    "total_day_supply_ge65", "bene_count_ge65", "brand_claim_count",
    "brand_drug_cost", "generic_claim_count", "generic_drug_cost",
    "other_claim_count", "other_drug_cost", "mapd_claim_count",
    "mapd_drug_cost", "pdp_claim_count", "pdp_drug_cost",
    "lis_claim_count", "lis_drug_cost", "nonlis_claim_count",
    "nonlis_drug_cost", "opioid_claim_count", "opioid_drug_cost",
    "opioid_day_supply", "opioid_bene_count", "antibiotic_claim_count",
    "antibiotic_drug_cost", "antibiotic_bene_count", "hrm_claim_count_ge65",
    "hrm_drug_cost_ge65", "hrm_bene_count_ge65", "antipsych_claim_count_ge65",
    "antipsych_drug_cost_ge65", "antipsych_bene_count_ge65",
    "average_age_of_beneficiaries", "beneficiary_age_less_65_count",
    "beneficiary_age_65_74_count", "beneficiary_age_75_84_count",
    "beneficiary_age_greater_84_count", "beneficiary_female_count",
    "beneficiary_male_count", "beneficiary_race_white_count",
    "beneficiary_race_black_count", "beneficiary_race_asian_pi_count",
    "beneficiary_race_hispanic_count", "beneficiary_race_nat_ind_count",
    "beneficiary_race_other_count", "beneficiary_nondual_count",
    "beneficiary_dual_count", "beneficiary_average_risk_score",
]

# Read the cleaned data
df = pd.read_csv(INP_FILE, low_memory=False)

# Embed categorical features: replace each category with the mean of a few
# opioid-related variables within that category
embedded_cat_features = []
for item in CAT_VARS:
    tmp_df = df.groupby(item, as_index=False).agg(
        {
            "opioid_claim_count": "mean",
            "opioid_drug_cost": "mean",
            "opioid_day_supply": "mean",
            "opioid_bene_count": "mean",
            "opioid_prescriber_rate": "mean",
        }
    ).rename(
        columns={
            "opioid_claim_count": "%s_opioid_claim_count" % item,
            "opioid_drug_cost": "%s_opioid_drug_cost" % item,
            "opioid_day_supply": "%s_opioid_day_supply" % item,
            "opioid_bene_count": "%s_opioid_bene_count" % item,
            "opioid_prescriber_rate": "%s_opioid_prescriber_rate" % item,
        }
    )
    df = df.merge(tmp_df, how="left", on=item)
    embedded_cat_features += [
        "%s_opioid_claim_count" % item,
        "%s_opioid_drug_cost" % item,
        "%s_opioid_day_supply" % item,
        "%s_opioid_bene_count" % item,
        "%s_opioid_prescriber_rate" % item,
    ]

# Keep the provider id, the continuous variables, and the embedded
# categorical features; drop everything else
df = df[["npi"] + CON_VARS + embedded_cat_features]

# Write features file
df.to_csv(OUT_FILE, index=False)

Dimensionality Reduction

Once the features are generated, we can implement the above-mentioned SVD algorithm. We start by importing some libraries, setting some parameters, and loading the features into a Pandas dataframe.

In [4]:

# Import libs
import sys
import os
import time

import numpy as np
import pandas as pd

from qci_client import QciClient

# Define some parameters
FEATURES_FILE = "medicare_features.csv"
REDUCED_DIM = 3

# Read features
df = pd.read_csv(FEATURES_FILE, low_memory=False)

We now collect the feature names, excluding the provider identifier npi, and count them; the original dimension is printed alongside the target reduced dimension,

In [5]:

feature_names = list(set(df.columns) - {"npi"})
orig_dim = len(feature_names)

print(
    "Original dimension is %d; reduced dimension will be %d" % (
        orig_dim,
        REDUCED_DIM,
    )
)

Out [ ]:

Original dimension is 93; reduced dimension will be 3

We now instantiate the client and define a function that computes the first principal component of the feature matrix by solving a QUBO on Dirac-1,

In [6]:

token = "your_token"
api_url = "https://api.qci-prod.com"
qci = QciClient(api_token=token, url=api_url)

In [7]:

def get_first_principal_comp(F):
    """Get the first principal component of the feature matrix F by solving a QUBO on Dirac-1."""

    qubo = -np.matmul(F.transpose(), F)

    # Make sure the matrix is symmetric to machine precision
    qubo = 0.5 * (qubo + qubo.transpose())

    # Create the QUBO file json object and upload it
    qubo_json = {
        "file_name": "qubo_tutorial.json",
        "file_config": {
            "qubo": {"data": qubo, "num_variables": orig_dim},
        }
    }
    response_json = qci.upload_file(file=qubo_json)
    qubo_file_id = response_json["file_id"]

    # Set up the job json
    job_params = {
        "device_type": "dirac-1",
        "alpha": 1.0,
        "num_samples": 20,
    }
    job_json = qci.build_job_body(
        job_type="sample-qubo",
        job_params=job_params,
        qubo_file_id=qubo_file_id,
        job_name="tutorial_eqc1",
        job_tags=["tutorial_eqc1"],
    )
    print(job_json)

    # Run the optimization job
    job_response_json = qci.process_job(
        job_body=job_json,
    )
    print(job_response_json)

    results = job_response_json["results"]
    energies = results["energies"]
    samples = results["solutions"]
    print("Energies:", energies)

    # Take the lowest-energy solution as the binary vector q
    q = np.array(samples[0])
    assert len(q) == orig_dim, "Inconsistent solution size!"

    # v0 = q / ||q||
    fct = np.linalg.norm(q)
    if fct > 0:
        fct = 1.0 / fct
    v0 = fct * q
    v0 = v0.reshape((v0.shape[0], 1))

    # lambda0 = v0^T G v0, where G = -qubo
    lambda0 = np.matmul(
        np.matmul(v0.transpose(), -qubo),
        v0,
    )[0][0]
    assert lambda0 >= 0, "Unexpected negative eigenvalue!"

    # u0 = F v0 / sqrt(lambda0), then normalize
    fct = np.sqrt(lambda0)
    if fct > 0:
        fct = 1.0 / fct
    u0 = fct * np.matmul(F, v0)
    u0 = u0.reshape(-1)

    fct = np.linalg.norm(u0)
    if fct > 0:
        fct = 1.0 / fct
    u0 = fct * u0

    return u0

One can get the first REDUCED_DIM principal components by applying the above function iteratively, removing each component from F before extracting the next,

In [8]:

F = np.array(df[feature_names])

U = []
for i in range(min(REDUCED_DIM, F.shape[1])):
    u = get_first_principal_comp(F)
    U.append(u)

    # Remove the extracted component: F <- F - u u^T F
    u = u.reshape((u.shape[0], 1))
    F = F - np.matmul(
        u,
        np.matmul(u.transpose(), F),
    )

Out [ ]:

{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba43298263204a36574d0'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}
2024-05-08 09:11:30 - Dirac allocation balance = 0 s (unmetered)
2024-05-08 09:11:30 - Job submitted: job_id='663ba432d448b017e54f94a8'
2024-05-08 09:11:31 - QUEUED
2024-05-08 09:11:33 - RUNNING
2024-05-08 09:16:54 - COMPLETED
2024-05-08 09:16:57 - Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '663ba432d448b017e54f94a8', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba43298263204a36574d0'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T16:11:30.979Z', 'queued_at_rfc3339nano': '2024-05-08T16:11:30.98Z', 'running_at_rfc3339nano': '2024-05-08T16:11:31.644Z', 'completed_at_rfc3339nano': '2024-05-08T16:16:52.435Z'}, 'job_result': {'file_id': '663ba57498263204a36574d2', 'device_usage_s': 281}}, 'status': 'COMPLETED', 'results': {'counts': [19, 1], 'energies': [-2802848646149886000, -2802846722004537300], 'solutions': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}}
Energies: [-2802848646149886000, -2802846722004537300]
{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba5a298263204a36574d4'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}
2024-05-08 09:17:38 - Dirac allocation balance = 0 s (unmetered)
2024-05-08 09:17:38 - Job submitted: job_id='663ba5a2d448b017e54f94a9'
2024-05-08 09:17:38 - QUEUED
2024-05-08 09:17:41 - RUNNING
2024-05-08 09:23:02 - COMPLETED
2024-05-08 09:23:04 - Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '663ba5a2d448b017e54f94a9', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba5a298263204a36574d4'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T16:17:38.906Z', 'queued_at_rfc3339nano': '2024-05-08T16:17:38.909Z', 'running_at_rfc3339nano': '2024-05-08T16:17:39.763Z', 'completed_at_rfc3339nano': '2024-05-08T16:22:59.746Z'}, 'job_result': {'file_id': '663ba6e398263204a36574d8', 'device_usage_s': 279}}, 'status': 'COMPLETED', 'results': {'counts': [13, 4, 2, 1], 'energies': [-60631112719794140, -60631112719794140, -60631108424826850, -60631052590252000], 'solutions': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]]}}
Energies: [-60631112719794140, -60631112719794140, -60631108424826850, -60631052590252000]
{'job_submission': {'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba70d98263204a36574da'}}, 'device_config': {'dirac-1': {'num_samples': 20}}, 'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1']}}
2024-05-08 09:23:41 - Dirac allocation balance = 0 s (unmetered)
2024-05-08 09:23:41 - Job submitted: job_id='663ba70dd448b017e54f94ab'
2024-05-08 09:23:41 - QUEUED
2024-05-08 09:37:48 - RUNNING
2024-05-08 09:43:10 - COMPLETED
2024-05-08 09:43:12 - Dirac allocation balance = 0 s (unmetered)
{'job_info': {'job_id': '663ba70dd448b017e54f94ab', 'job_submission': {'job_name': 'tutorial_eqc1', 'job_tags': ['tutorial_eqc1'], 'problem_config': {'quadratic_unconstrained_binary_optimization': {'qubo_file_id': '663ba70d98263204a36574da'}}, 'device_config': {'dirac-1': {'num_samples': 20}}}, 'job_status': {'submitted_at_rfc3339nano': '2024-05-08T16:23:41.935Z', 'queued_at_rfc3339nano': '2024-05-08T16:23:41.936Z', 'running_at_rfc3339nano': '2024-05-08T16:37:48.722Z', 'completed_at_rfc3339nano': '2024-05-08T16:43:08.614Z'}, 'job_result': {'file_id': '663bab9c98263204a36574de', 'device_usage_s': 279}}, 'status': 'COMPLETED', 'results': {'counts': [12, 8], 'energies': [-31525704197734400, -31525704197734400], 'solutions': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]}}
Energies: [-31525704197734400, -31525704197734400]

We save the resulting components and print their shape,

In [9]:

U = np.array(U)
print(U.shape)

np.save("U.npy", U)

Out [ ]:

(3, 1022499)

For comparison, we also compute the principal components classically using scikit-learn,

In [10]:

from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

F = np.array(df[feature_names])

pca_classical = PCA(n_components=REDUCED_DIM)
U_classical = pca_classical.fit_transform(F)
U_classical = normalize(U_classical, axis=0, norm="l2")

and compare the results to those of the above-mentioned approach,

In [12]:

U = np.load("U.npy")
#U = U_classical.transpose()

print(U.shape)
print(U_classical.shape)

print(
    abs(
        np.diag(
            np.matmul(U, U_classical)
        )
    )
)

Out [ ]:

(3, 1022499)
(1022499, 3)
[0.9101745  0.93294887 0.82830476]

Conclusion

In this tutorial, we have demonstrated the versatility of our Dirac-1 machine and its underlying QUBO model. Rather than solving an NP-hard optimization problem, we showed that our device can be used for dimensionality reduction through principal component analysis, a powerful matrix-based data-science method. For more traditional QUBO applications, see here, including a QUBO-based machine learning method known as QBoost. You could also try out these devices on your own problems.

Like portfolio optimization, QBoost, and feature selection, this tutorial is a variation on the theme of exploiting the correlation structure of an underlying dataset. Such problems arise repeatedly both because they have important applications and because, given the importance of two-body correlations, they are naturally expressed as QUBOs. Trying one of these tutorials is a natural next step.
