Summary
Cryptocurrencies like Bitcoin and Ethereum are gaining popularity due to their decentralized nature and lack of central control. These digital currencies provide a high degree of user anonymity, making it difficult for attackers to identify the actual individuals behind addresses and trace funds transferred between users. However, these features of blockchain-based cryptocurrencies also pose challenges, as they can facilitate criminal activities and fraudulent transactions. Detecting such illicit actions or entities within cryptocurrencies proves to be a significant challenge for security agencies and financial authorities. As a response to this challenge, a comprehensive system for identifying fraudulent entities in cryptocurrency systems has been developed. This system comprises two primary modules: (1) off-chain monitoring and (2) on-chain monitoring. Off-chain monitoring involves artificial-intelligence (AI)-based real-time surveillance of the World Wide Web (WWW), dark web searches, and social media analytics to detect fraudulent entities. Subsequently, it issues alerts to prevent individuals from engaging in transactions with such entities. Through off-chain analysis, a significant set of fraudulent cryptocurrency addresses is extracted, which aids in on-chain monitoring. Contrary to off-chain analysis, which identifies fraudulent addresses before any cryptocurrency fraud occurs, the on-chain monitoring module detects fraudulent entities after the fraud has taken place in the blockchain. Using on-chain analysis, machine-learning (ML)-based models have been developed for detecting fraudulent addresses in Bitcoin and Ethereum. Additionally, a function to identify mixer and tumbler services has been created, facilitating the identification of money-laundering activities involving cryptocurrencies. These results demonstrate promising outcomes in terms of both correctness and real-time capability.
Introduction
Fraud has been a persistent issue throughout human history; only the methods of committing it have evolved. With the advancement of technology, fraudulent activities have become more sophisticated, making them increasingly challenging to identify. In recent decades, identifying fraud has garnered significant attention and discussion. Banks have made substantial financial investments to detect fraudulent transactions occurring within their networks. In the traditional fiat currency system, banks and government authorities manage and supervise fund movements. They have implemented a new generation of security measures to address these risks [1]. Since the fiat currency system is regulated and controlled, involves recognized customers, and is monitored, it is not easy for criminals to engage in financial fraud.
Contrary to fiat currency, committing fraud using cryptocurrency is comparatively easier due to features provided by cryptocurrency systems, such as user anonymity and decentralization. All cryptocurrencies (e.g., Bitcoin and Ethereum) employ decentralized blockchain technology to execute transactions and record them in a public ledger. While everyone has access to the ledger, no one has control over it. User identities are anonymized and represented as a long random number called a public key or address. A user can generate multiple addresses to receive and transfer cryptocurrency coins. These features make cryptocurrencies a strong alternative to the fiat currency system, resulting in an increasing number of cryptocurrency users each day. However, they also contribute to a rise in the overall fraud ratio. Although everyone has access to all transactions and authorities can trace them, the challenge lies in detecting fraud and uncovering the actual identity behind it due to the anonymity provided. Typically, fraudulent users generate a new address each time to receive and transmit coins, aiming for increased privacy and avoiding being identified.
Cryptocurrencies not only facilitate fraud but also various other criminal activities like money laundering, terrorism financing, drug trafficking, child trafficking, bribery, and ransom. According to a report by Chainalysis [2], illicit addresses received over $24 billion in 2023, constituting 0.42% of overall transactions. The report highlighted a significant increase in ransomware and dark net crimes. Additionally, Elliptic reported that illicit entities laundered $2.7 billion worth of coins in 2022 using cross-chain methods, with North Korea’s hacking organization alone responsible for over $900 million [3]. The inherent features of cryptocurrencies, such as anonymity, the changing nature of user addresses, and the lack of central control, make it challenging to identify and track criminals and their transactions within the cryptocurrency system.
Government authorities, agencies, and various companies are actively engaged in detecting cryptocurrency fraud and criminal activities and identifying real entities associated with them through on-chain analysis. However, these agencies conduct manual analyses and transaction tracing with the assistance of human experts to ascertain the legitimacy of a given transaction or entity, aiming to identify the actual individuals or organizations behind illicit activities. Some of these organizations utilize computer algorithms for analysis and detection, although their monitoring processes are not always in real time. Additionally, contemporary criminals employ sophisticated third-party services, such as mixers and tumblers, to enhance privacy and obscure the traceability of their fund transfers. For instance, money launderers leverage mixers and tumblers that offer protection by mixing coins through multiple transactions involving several users. This mixing mechanism significantly complicates the task for authorities attempting to trace funds within the cryptocurrency system and identify their source.
Furthermore, the majority of cryptocurrency monitoring bureaus overlook off-chain monitoring. Most frauds are perpetrated by scammers who encourage individuals to invest in cryptocurrency and purchase counterfeit products or services. To achieve this, scammers create deceptive websites or advertise on social media, providing cryptocurrency addresses to receive coins. The early detection of these off-chain platforms, advertisements, and crypto-addresses can help prevent fraud. While some monitoring organizations engage in manual detection, the process is slow and cannot keep up with the rapid creation of such deceptive content.
To address these challenges, an AI-based comprehensive system has been designed to identify scams, fraudulent entities, and services supporting money laundering, such as mixers. This system comprises two major modules: (1) off-chain monitoring and (2) on-chain monitoring. The objective of the off-chain module is to prevent cryptocurrency fraud before it occurs, while the on-chain module detects an illicit entity after the crime is committed. The off-chain module involves real-time monitoring of the WWW to identify cryptocurrency phishing websites and extract associated cryptocurrency wallet addresses. It also establishes connections between the extracted wallet addresses and the dark web, tracing them on social media platforms. Upon identifying any suspicious website, social media content, or crypto-address, the system issues an alert to deter individuals from engaging in transactions with such entities. Through off-chain analysis, a significant set of fraudulent cryptocurrency addresses can be extracted, which aids on-chain monitoring.
On the other hand, the on-chain monitoring module conducts real-time surveillance of cryptocurrency blockchains to detect fraud and money-laundering activities. To achieve this, ML-based models for detecting fraudulent addresses in Ethereum and Bitcoin have been developed. Additionally, a function to identify mixer and tumbler services has been created, facilitating the identification of money-laundering activities involving cryptocurrencies. The on-chain module also enables users to assess the risk level associated with a given address, providing insights into the potential risk of trading with that address. Furthermore, address clustering has been performed to group all addresses associated with a single entity. Finally, the system’s correctness and its ability to operate in a real-time environment have been evaluated. Results demonstrate promising outcomes in terms of both correctness and real-time efficiency.
The remainder of this article is organized as follows. The next section titled Preliminaries provides an understanding of basic concepts, including the workings of Bitcoin and Ethereum, the definition of mixers and tumblers, and an exploration of how fraud is committed by scammers. The Proposed System section follows, examining the technical details of the system, encompassing both on-chain and off-chain monitoring modules. The results of the system evaluation are presented in the section titled System Evaluation and Results, followed by Conclusions.
Preliminaries
In this section, the preliminary concepts needed to understand the overall system are discussed. How Bitcoin, Ethereum, and mixers operate is then described. Furthermore, the overall scenario of fraud execution in the cryptocurrency domain is reviewed.
Bitcoin
Bitcoin is a decentralized cryptocurrency system that facilitates peer-to-peer transfers of Bitcoins without involving any central party. In contrast to traditional banking systems, Bitcoin operates without a central controlling or regulatory authority. Each Bitcoin user possesses a pair (or multiple pairs) of a public key and a corresponding private key, generated through secure mechanisms. The public key serves as the user’s address for receiving Bitcoins, while the corresponding private key is used to sign transactions that transfer or withdraw those Bitcoins to other users. The private key is kept confidential and is known only to the user, who employs it to withdraw or transfer the coins received at the corresponding public key. This approach ensures user anonymity within the Bitcoin system, making it challenging to identify the real identities, such as individuals or organizations, behind public keys or addresses. The combination of user anonymity and decentralized control has contributed to Bitcoin’s popularity. However, these features also pose risks, as they can be exploited for criminal activities such as fraud, money laundering, child trafficking, and more.
Consider a simple Bitcoin transfer scenario, as shown in Figure 1, between users Un, Ux, Uy, and Uz who have public and private key pairs (PUn,PRn), (PUx,PRx), (PUy,PRy), and (PUz,PRz), respectively. Ux has received 15 Bitcoins from Un in two transactions, TD11 and TD111. Now, Ux wants to transfer 15 Bitcoins to Uy and generates a transaction TD22 by specifying the receiver’s address as PUy and the input transactions’ references (i.e., the transactions in which those Bitcoins were received) as TD11 and TD111. Ux signs TD22 using private key PRx and broadcasts it to the Bitcoin network. When a Bitcoin miner receives TD22, it verifies the signature, checks that TD11 and TD111 were never spent in any earlier transaction, and then adds TD22 to a block, along with other transactions, to mine it. Once any miner node successfully mines the block, the block is added to the blockchain, and the transaction is successfully committed. In the same way, Uy transfers Bitcoins to Uz in the transaction TD33. Throughout the process, no one knows who is behind the PUn, PUx, PUy, and PUz addresses (i.e., Un, Ux, Uy, and Uz are never revealed). Specific details of the mining process are not provided here; however, Al-Farsi et al. [4] explain the working of the Bitcoin system and the whole mining process in a very comprehensive way.
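The double-spending check in this scenario can be illustrated with a minimal Python sketch. It is a simplification and not how a Bitcoin node is implemented: the `Transaction` class and the `accept` function are hypothetical names, whole transactions are referenced as inputs (as in the example above) rather than individual outputs, and signature verification is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    tx_id: str       # identifier such as "TD22"
    inputs: tuple    # ids of the earlier transactions being spent
    receiver: str    # receiver's address, e.g. "PUy"


def accept(tx, known, spent):
    """Accept tx only if every referenced input exists and is unspent.

    `known` holds ids of committed transactions, `spent` the ids that
    have already been used as an input. Both sets are updated in place.
    """
    if any(ref not in known or ref in spent for ref in tx.inputs):
        return False
    spent.update(tx.inputs)
    known.add(tx.tx_id)
    return True
```

With `known = {"TD11", "TD111"}` and an empty `spent` set, TD22 is accepted, a later transaction that tries to reuse TD11 is rejected, and TD33 can then spend TD22.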
Ethereum
Ethereum is the second-most popular cryptocurrency after Bitcoin, introduced in 2013 by Vitalik Buterin in his white paper [5]. Like Bitcoin, Ethereum operates on a decentralized network and prioritizes user anonymity. However, unlike Bitcoin, Ethereum employs proof-of-stake (PoS) [6] for block mining, transaction validation, and security, as opposed to proof-of-work (PoW). Notably, Ethereum recently transitioned from PoW to PoS for its mining protocol. Unlike PoW, PoS is quite efficient, as it does not keep all nodes busy in the mining process. PoS randomly selects a single node in the network to validate all the transactions in a block and add them to the Ethereum network. Moreover, Ethereum’s transfer message only supports one sender and one receiver address in a single transaction. In contrast, a Bitcoin transaction can involve multiple senders and receivers in a single transaction.
Mixers and Tumblers
Mixers, also known as tumblers, are cryptocurrency services provided by third parties to enhance the privacy of transactions for cryptocurrency users. The Bitcoin blockchain, being public, allows anyone to view all transactions, enabling tracking of fund sources. Through sophisticated transaction analysis, monitoring, and tracking, the real person behind an address can be identified. To make it more challenging to trace funds to a specific address, mixers combine coins from various sources by repeatedly sending and receiving funds using multiple addresses in a single transaction. They obfuscate coins by receiving them from a user; gathering and mixing them using multiple source and destination addresses in one or more transactions, as illustrated in Figure 2; and then sending them to the user-provided address. This method makes it difficult for attackers to backtrack the funds received in a transaction by an address. From the perspective of cryptocurrency users, mixers offer privacy. However, it is essential to note that mixing can be exploited by criminals for money laundering and concealing their sources. Consequently, many national security authorities aim to identify transactions involving mixing services.
Fraud Scenarios
Most cryptocurrency frauds are perpetrated by scammers who employ various tactics to deceive users. These individuals convince users to send them Bitcoins in exchange for promised services, products, or financial gains. However, once the coins are received, no such offerings are provided. The anonymity feature inherent in Bitcoin makes it challenging to trace and apprehend these fraudulent actors. Most scams occur through phishing attacks, wherein scammers create deceptive websites (referred to as phishing websites), send fraudulent emails, utilize social media and other platforms for misleading advertisements, and encourage individuals to invest in their platform or acquire services/products by transferring Bitcoins to a provided Bitcoin address. Early detection of such phishing attacks can aid in identifying potential sources of fraud, allowing for proactive measures to be taken to prevent cryptocurrency-related scams.
Proposed System
To monitor fraudulent activities in cryptocurrencies and prevent them through an early-detection mechanism, a comprehensive system that performs both real-time, off-chain monitoring and on-chain monitoring has been developed. Figure 3 illustrates the overall layered architecture of the system. At the current stage, the focus is on monitoring two major cryptocurrencies, Bitcoin and Ethereum. Off-chain monitoring primarily focuses on early fraud detection by identifying cryptocurrency-related scamming/phishing websites and monitoring social media and the dark web. On the other hand, real-time, on-chain monitoring is designed to identify fraudulent transactions as they occur in the cryptocurrency blockchain. This system identifies mixers to help control money laundering, detects fraudulent transactions, and clusters addresses to group those belonging to the same entity. Additionally, risk analysis is performed on addresses to assign a risk level, indicating whether an address belongs to a legitimate user or a scammer. Intercomponent analysis between each of the off-chain and on-chain submodules has been conducted. Both NoSQL database management systems (DBMS), such as MongoDB, and relational DBMS, such as PostgreSQL, are utilized to store data, features, intermediate findings, and results. For data analysis at both on-chain and off-chain levels, signature-based pattern detection algorithms, statistical methods, and ML are employed. This section delves into the technical details of each individual module of off-chain and on-chain monitoring.
Data Collection, Preprocessing, and Features Engineering
The detection models are data driven, requiring an extensive amount of data for initial analysis and model development. These data were collected from various public sources and Telegram groups. Some of the datasets were built in-house. Data on cryptocurrency scamming websites for off-chain analysis were sourced from industry experts who conducted manual analyses to identify cryptocurrency scams. Initially, a dataset comprising 500 deceptive websites and 200 nondeceptive websites (resembling scams but not fraudulent) was compiled. Using this dataset as a foundation, AI-based clustering techniques were employed to expand the list of deceptive websites to 10,000. For on-chain analysis, the data generated by Toyoda et al. [7] containing addresses of 3,199 mixers and 23,114 nonmixers (26,313 total) were obtained. The labels were primarily assigned through heuristics and clustering techniques. Nonmixer samples include addresses from services that may resemble mixers, such as exchanges, faucets, high-yield investment programs, pools, markets, and gambling platforms. For fraudulent transaction analysis, a list of 12,146 fraudulent and legitimate Ethereum addresses was obtained from Kaggle [8].
After removing 246 duplicated entries, 5,054 fraudulent addresses and 6,846 nonfraudulent addresses were left. Additionally, a set of fraudulent addresses was gathered from off-chain sources, including Telegram chats, scamming websites, the dark web, and social media.
Usually, the data are not refined; they may contain null or missing values, unnormalized values, repeated data, and unbalanced samples. Consequently, data preprocessing was conducted to refine the data and prepare it for analysis. Both the Bitcoin mixers dataset and Ethereum dataset were imbalanced, indicating an unequal distribution of samples across categories. Therefore, when training an AI-based model for mixer detection, only 3,601 random nonmixer addresses from the mixer dataset were selected to achieve a somewhat balanced dataset.
In contrast, for the Fraudulent dataset, the SMOTE [9] resampling mechanism was employed to equalize the number of both fraudulent and nonfraudulent addresses to 6,846 each. Additionally, zero-variance attributes, abnormal outside samples, and outliers were eliminated from all datasets using box-plot analysis. To address missing and null values, the data were partitioned, based on class labels, and k-means clustering (with k = 10) was applied to each partition to identify subgroups. Subsequently, the mean value of all attributes in each subgroup was calculated, and the missing values were replaced with the mean value of the corresponding attribute. Furthermore, to prevent poor performance due to nonstandardized attribute values, Gaussian normalization was applied, ensuring zero mean and unit variance. Values were transformed using the Yeo-Johnson power transformer method [10] with in-place computation [11].
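As an illustration of the last two steps, the following Python sketch applies the Yeo-Johnson transform to a single value and then standardizes a list of values to zero mean and unit variance. It is a minimal sketch, not the authors' pipeline: the function names are hypothetical, and the parameter `lam` is passed in by hand, whereas in practice it is usually estimated from the data (for example, by maximum likelihood).

```python
import math
from statistics import mean, pstdev


def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of one value y with parameter lam."""
    if y >= 0:
        if abs(lam) < 1e-12:                 # lam == 0 branch
            return math.log1p(y)
        return ((y + 1) ** lam - 1) / lam
    if abs(lam - 2) < 1e-12:                 # lam == 2 branch
        return -math.log1p(-y)
    return -(((-y + 1) ** (2 - lam)) - 1) / (2 - lam)


def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:                           # constant attribute
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

For `lam = 1` the transform leaves every value unchanged, which is a convenient sanity check; other values of `lam` compress or stretch the tails of a skewed attribute before standardization.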
On-chain datasets comprise many parameters. Mixer and nonmixer addresses are characterized by 36 features, while Ethereum addresses have 32 parameters (excluding “address” and the “class” parameter). Utilizing all these parameters for on-chain, real-time monitoring may result in significant overhead. Therefore, to reduce the number of parameters in the Ethereum dataset, correlation and feature importance scores were employed. Certain attribute pairs with strong interlinear relationships, such as ratioRecSent and receivedTransactions, maxValReceived and avgValReceived, and ratioSentTotal and ratioRecTotal, were identified. One attribute from each pair was removed by choosing the one with a stronger correlation to the class attribute. Subsequently, the Gini index [12] (a method for determining feature importance) was applied to the remaining 29 attributes to select the top 16 attributes. The feature importance scores for these attributes are depicted in Figure 4.
To eliminate unimportant features from the Bitcoin mixer dataset, basic signature and pattern, statistical, correlation, and information gain analyses were conducted [13]. The information gain of an attribute A represents the amount of information gained about a class variable from observing attribute A. In simple terms, it is the measure used in a decision tree to identify the best parameter to split on. Through information gain analysis, each feature was categorized as either important or nonimportant. The top eight most important features that can effectively distinguish mixer and nonmixer addresses were then selected. Table 1 shows the selected features and their corresponding information gain values. Later, pattern and flow analysis of Bitcoin transactions was performed by constructing a large graph using Neo4j [14]. Specifically, the focus was on transactions containing a mixer’s address in the sender list (inputs) or receiving list (outputs).
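Information gain can be made concrete with a short Python sketch. It computes the entropy of the class labels and subtracts the weighted entropy that remains after grouping the samples by the value of one attribute. The sketch assumes a discrete attribute (continuous features would first need to be binned or split on a threshold), and the function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels):
    """Entropy of labels minus the entropy remaining after observing values."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    remaining = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remaining
```

An attribute that separates mixers from nonmixers perfectly on a balanced sample has a gain of 1 bit, while an attribute that is independent of the class has a gain of 0.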
Findings revealed that most of the mixer’s transactions exhibited “fan-in,” “fan-out,” or “scatter-gather” patterns. In a fan-in pattern, the transaction involves multiple senders and one receiver, while the fan-out pattern features one sender and many receivers. In a scatter-gather (or gather-scatter) scenario, there is a many-to-many relationship between the inputs and outputs of the transaction. Graphs depicting the analysis of transactions involving the mixer address “1PzuVHgrSH7rRJNttzgknuomMLohX54dCB” are illustrated in Figures 5 and 6. Figure 5 shows all the Bitcoin exchanges (represented by blue nodes) by the mixer (represented by the orange node). Each blue node contains the information about transactions where Bitcoins were received and subsequently sent to other addresses. To further analyze inputs and outputs, a blue node can be expanded, as demonstrated in Figure 6. Based on this transaction pattern and flow analysis, three additional features were added to the list of important features, namely “number of fan-in patterns,” “number of fan-out patterns,” and “number of scatter-gather patterns,” making a total of 11 features.
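The three pattern-count features can be sketched in Python as follows. The thresholds are an assumption made for illustration: "many" is taken to mean more than one, whereas a practical detector would likely use higher cutoffs, since ordinary Bitcoin payments often have two outputs (payment and change). The transaction layout (a dict with `inputs` and `outputs` lists) is also hypothetical.

```python
from collections import Counter


def classify_pattern(n_inputs, n_outputs):
    """Label a transaction by the shape of its inputs and outputs."""
    if n_inputs > 1 and n_outputs == 1:
        return "fan-in"
    if n_inputs == 1 and n_outputs > 1:
        return "fan-out"
    if n_inputs > 1 and n_outputs > 1:
        return "scatter-gather"
    return "other"


def pattern_counts(transactions):
    """Count how often each pattern occurs in an address's transactions."""
    return Counter(
        classify_pattern(len(tx["inputs"]), len(tx["outputs"]))
        for tx in transactions
    )
```

Calling `pattern_counts` on all transactions that involve a given address yields the counts used as the three additional features.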
Real-Time, Off-Chain Monitoring
The motivation for real-time, off-chain monitoring is to prevent fraud before it occurs. Through the surveillance of the web, dark web, and social media, cryptocurrency wallet addresses associated with scams and fraudulent activities are identified. These efforts aim to thwart attempts to collect coins through deceptive means. The overall flow of the off-chain monitoring system is illustrated in Figure 7.
Web Monitoring
This system provides real-time monitoring of the web to detect any fake websites that attempt to persuade users to invest in cryptocurrency or purchase products/services at a lower price. Scammers often provide cryptocurrency wallet addresses for users to make payments. This system extracts newly created websites daily and assesses which ones are scams. To achieve this, the system initially filters all cryptocurrency-related websites by applying a signature- and pattern-matching algorithm to the website’s content and basic features. Subsequently, complex features, such as the age of the website, region, and daily visitor count, are extracted from the filtered websites, and the term frequency-inverse document frequency (TF-IDF) vector of the content is computed. This vector, along with the feature set, is then provided to the AI-based deceptive website detection module to determine whether the website is deceptive or nondeceptive. The system also thoroughly scans each website flagged as a scam to extract the Bitcoin addresses provided by the scammer. The system generates alerts whenever it identifies a scam website on the web and stores the website details, along with the corresponding extracted wallet address, in a blacklist.
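The address-extraction step can be approximated with regular expressions, as in the Python sketch below. This is an assumption about one possible approach, not the system's actual extractor: the patterns only match the general shape of legacy (Base58) and bech32 Bitcoin addresses and do not verify checksums, so the matches are candidates that would still need validation.

```python
import re

# Legacy addresses: start with 1 or 3, followed by Base58 characters
# (the Base58 alphabet omits 0, O, I, and l).
LEGACY = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")
# Bech32 addresses: "bc1" followed by the bech32 character set
# (which omits 1, b, i, and o).
BECH32 = re.compile(r"\bbc1[ac-hj-np-z02-9]{11,71}\b")


def extract_addresses(page_text):
    """Return candidate Bitcoin addresses in order of first appearance."""
    found = []
    for pattern in (LEGACY, BECH32):
        for match in pattern.finditer(page_text):
            if match.group() not in found:
                found.append(match.group())
    return found
```

Applied to page text that contains the mixer address quoted earlier in this article, the function returns that address once, even if it appears several times on the page.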
The deceptive website detection module employs a random forest ML model, trained and extensively tested on an in-house labeled dataset, as previously discussed in the Data Collection, Preprocessing, and Features Engineering section of this article. Occasionally, scammers may create multiple similar websites to launch an effective phishing campaign. There is also the possibility that a group of scammers works together to launch phishing attacks using a combination of scamming websites. Thus, this system identifies groups of websites that are similar to a given scamming website by employing a k-means clustering approach. The system also identifies websites that share the same wallet address. To date, this system has scanned more than 250 million websites created since 2014 [15], at a rate of 250,000 per day. Overall, more than 67,000 websites have been detected as deceptive, with a detection rate exceeding 500 per day.
Dark Web Monitoring
This system also monitors the dark web to identify cryptocurrency addresses. It searches for traces of wallet addresses blacklisted by the scamming website detector within the dark web, thereby enhancing the confidence level of the system. SOS Intelligence Limited’s application programming interface (API) [16] is employed to extract all the active dark web uniform resource locators (commonly known as URLs) daily and monitor them regularly. Whenever the system finds the cryptocurrency address on the dark web, it retrieves the contents of the corresponding dark web page for further investigation.
Social Media Monitoring
As with dark web monitoring, content on social media platforms such as X is also monitored to compile all the wallet addresses discussed there. To raise the confidence level for the scammers’ wallet address list (blacklisted addresses), the system checks whether any of these addresses are reported on social media. This module is at the initial stage of development.
Real-Time, On-Chain Monitoring
Bitcoin and Ethereum do not have a central authority to control their operations and monitor them. Therefore, a real-time monitoring system is required for these cryptocurrencies to generate alerts if a new transaction involves any suspicious entity in either the receiving side or sending side. Specifically, the aim of the real-time monitoring system is to identify fraudulent entities and mixer services.
The overall flow of the on-chain fraudulent entity and mixer detection is illustrated in Figure 8. The system is connected to the Ethereum and Bitcoin blockchain nodes, which always have the latest version of the blockchains and update whenever a new block is created in the chain. This system processes each newly created block and prepares a list of all input and output addresses in each transaction. Each address is looked up in the database to determine whether the address exists in the record (i.e., whether the address has already been analyzed in the past). If the address does not exist in the database, the system extracts and computes all the address features, as previously mentioned in the Data Collection, Preprocessing, and Features Engineering section of this article. These features are then provided to the AI-based classification models to determine whether the address is legitimate or fraudulent or if it is a mixer. The system generates alerts to users if there is a mixer or a fraudulent entity involved in any transaction. Also, by utilizing graph-based clustering and address connection analysis, the system determines the group of addresses that may belong to a single specific entity. In the same way, the system also assigns a risk score to each entity to guide users in avoiding trades with high-risk addresses.
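The per-block flow described above can be summarized in a short Python sketch. Everything in it is illustrative: the block layout, the `seen` dictionary standing in for the database, and the `extract_features` and `classify` callables standing in for the feature computation and the trained models are all hypothetical placeholders.

```python
def process_block(block, seen, extract_features, classify):
    """Classify unseen addresses in a new block and collect alerts.

    `seen` maps address -> label and plays the role of the database.
    `classify` is expected to return "legitimate", "fraudulent", or "mixer".
    """
    alerts = []
    for tx in block["transactions"]:
        for address in tx["inputs"] + tx["outputs"]:
            if address not in seen:
                # New address: compute its features and classify it once.
                seen[address] = classify(extract_features(address))
            if seen[address] in ("fraudulent", "mixer"):
                alerts.append((tx["id"], address, seen[address]))
    return alerts
```

Because labels are cached in `seen`, features are computed only the first time an address appears, which is what keeps the per-block cost low in this sketch.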
Mixers and Tumblers Detection
Mixers assist money launderers not only in concealing the origin of their illicit funds, which may be obtained through fraud, ransom, scams, or other unlawful activities, but also in obscuring the recipients of these funds in the form of cryptocurrency coins. Consequently, identifying mixing services is crucial for unveiling money-laundering activities within the realm of cryptocurrencies. Current approaches in the literature either lack high accuracy due to the dynamic nature of mixing methods or are insufficiently efficient for real-time monitoring. This system achieved high accuracy by building ML-based distributed gradient-boosted decision trees (GBDT) using the extreme gradient boosting (XGBoost) library [17, 18]. The mixer detection model [19] consists of multiple small decision trees that operate in parallel to perform decision-making. Furthermore, an extensive analysis of mixers showed that only 11 features are important, which is a comparatively small number. A smaller number of attribute computations and a parallel decision-making approach through multiple distributed trees ensure real-time processing capabilities.
Fraud Detection
Off-chain monitoring is employed to prevent fraud before it happens. The other way to deal with fraudulent entities is to detect them and their transactions on-chain by monitoring the cryptocurrency blockchain in real time. This system performs fraudulent entity detection [20] by employing a trained AI model.
To select the best ML model, four tree-based learners were tested: classification and regression trees (CART), random forest, light gradient-boosting machine (LGBM), and GBDT built with XGBoost. These models were assessed using 10-fold cross-validation on a refined version of an Ethereum dataset, which consisted of both fraudulent and nonfraudulent entities (wallet addresses) described by 16 features. The correctness results for each of these models are presented in Table 2. Ultimately, the distributed GBDT learner was chosen, as it demonstrated exceptional performance. The model was retrained on the entire dataset, and validation was conducted using various hyperparameter configurations. The optimal hyperparameters were determined to be [colsample_bytree = 0.7, learning_rate = 0.5, max_depth = 4, n_estimators = 200, and subsample = 0.9]. Eventually, the model was deployed in a real-time environment, where the GBDT-based fraud-detection model exhibited significantly higher accuracy with these hyperparameters.
Wallets and Address Clustering
In cryptocurrencies, an address is essentially a public key. A user can have multiple addresses and generate as many pairs of public and private keys as desired. Once an address (i.e., a public key) receives coins in a transaction, the corresponding private key can be employed to access the received coins, either for transfer or withdrawal. This feature of cryptocurrencies enables criminals to generate and utilize multiple addresses for sending and receiving coins, thus avoiding identification. Therefore, a graph-based algorithm has been developed to identify the set of addresses that may pertain to a single entity, particularly a fraudulent one. The algorithm accomplishes this grouping by tracing all the transactions associated with a given address, both upward and downward. A heuristic approach has been utilized, and rules for addresses to be part of the same group have been established. For instance, a straightforward rule is as follows: if two addresses are used as inputs in a single transaction, they belong to the same entity. The algorithm also applies more complex rules.
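The straightforward rule mentioned above (addresses used together as inputs of one transaction belong to the same entity) can be sketched with a union-find structure in Python. The sketch covers only this single rule; the more complex rules are not described in this article and are therefore not modeled. The function name and the input layout (one list of input addresses per transaction) are assumptions.

```python
def cluster_addresses(transactions):
    """Group addresses that co-occur as inputs of the same transaction.

    `transactions` is an iterable of input-address lists, one per transaction.
    Returns a list of address sets, sorted for reproducible output.
    """
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        if not inputs:
            continue
        find(inputs[0])                     # register single-input addresses
        for other in inputs[1:]:
            union(inputs[0], other)

    clusters = {}
    for address in list(parent):
        clusters.setdefault(find(address), set()).add(address)
    return sorted(clusters.values(), key=sorted)
```

The grouping is transitive: if A and B appear together in one transaction and B and C in another, all three addresses end up in the same cluster.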
Address Risk Factor Computation
In addition to detecting fraudulent entities, a risk model that assigns risk scores to addresses has also been developed, providing cryptocurrency users with information about the level of risk when trading with a particular address. Once again, a graph-based approach is utilized to analyze addresses, focusing on the transaction network of each address and its interactions with known fraudulent or high-risk addresses. The risk scores are determined by the distance of an address from identified fraudulent or high-risk entities within the network.
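A distance-based score of this kind can be sketched in Python with a multi-source breadth-first search. The mapping from distance to score used here, 1 / (1 + distance), is purely an assumption for illustration, as is treating the transaction network as an undirected graph; the article does not specify the scoring function.

```python
from collections import deque


def risk_scores(edges, flagged):
    """Score each address by its hop distance to the nearest flagged address.

    `edges` is a list of (address, address) pairs that transacted with
    each other; `flagged` is a collection of known fraudulent addresses.
    Flagged addresses score 1.0; unreachable addresses score 0.0.
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    distance = {address: 0 for address in flagged}
    queue = deque(flagged)
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)

    nodes = set(graph) | set(flagged)
    return {
        address: 1.0 / (1 + distance[address]) if address in distance else 0.0
        for address in nodes
    }
```

An address that transacted directly with a flagged address scores 0.5 under this mapping, and the score decays with every additional hop.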
System Evaluation and Results
Overall, the system comprises six modules. However, in this evaluation, the focus is on three major modules of the comprehensive system: (1) real-time web monitoring, (2) mixers detection, and (3) fraudulent entity detection. These modules are assessed in terms of correctness and efficiency.
Implementation Environment
Currently, the back-end, real-time monitoring modules operate on multiple local machines. The results, reports, and alerts are generated through a cloud portal [15]. Three local machines are utilized: two are equipped with 16 GB of random access memory (RAM) and 20 central processing unit (CPU) cores at 2.10 GHz, whereas the third is a more powerful Dell Precision 7920 machine with 128 GB of RAM and 52 2.10-GHz Intel Xeon(R) Gold 6230R processors. All systems run the Ubuntu 22.04.3 long-term support (LTS) operating system.
For real-time, on-chain monitoring, transaction data are extracted from two platforms: Bitquery [21] and Blockchain Explorer [22]. For off-chain monitoring, the website list is obtained from the WHOIS database [23], the dark net data are extracted using SOS Intelligence APIs [16], and the social media data are collected using the Twitter API [24]. The ML-based detection models are trained, tested, evaluated, and implemented in Python using the XGBoost library [25].
Correctness
The correctness of the AI-based detection models is evaluated with k-fold cross-validation, where k = 10. With this validation approach, the entire dataset is divided into k mutually exclusive portions of equal size. The model is trained and tested k times; in each trial, it is trained on k−1 of the portions and tested on the remaining one, so that every portion serves as the test set exactly once. Finally, the results from all trials are aggregated.
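The splitting step of k-fold cross-validation can be sketched in plain Python as follows. The function name `k_fold_indices` is hypothetical, and the sketch assumes the samples have already been shuffled; whether the authors shuffled or stratified the data is not stated in the article.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.

    Fold sizes differ by at most one when n_samples is not a multiple of k.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        # Everything outside the current test fold is used for training.
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
for train, test in folds[:2]:
    print(len(train), "training samples,", len(test), "test samples")
```

Each trial would fit the model on `train`, evaluate it on `test`, and the k sets of results would then be aggregated, for example by averaging.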
To assess the system's correctness from all perspectives, accuracy, precision, and recall were computed with equations 1, 2, and 3, respectively, from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TP is the number of positive instances (frauds/mixers/scams) correctly identified as positive, TN is the number of negative instances (nonfrauds/nonmixers/nonscams) correctly identified as negative, FP is the number of negative instances incorrectly labeled as positive, and FN is the number of positive instances incorrectly labeled as negative.
Accuracy = (TP + TN)/(TP + TN + FP + FN). (1)
Precision = TP/(TP + FP). (2)
Recall = TP/(TP + FN). (3)
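Equations 1 to 3 translate directly into code. In the minimal sketch below, the function name `classification_metrics` and the confusion-matrix counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the article. Returning 0.0 for an empty denominator is a convention assumed here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for 1,000 evaluated addresses.
accuracy, precision, recall = classification_metrics(tp=90, tn=880, fp=10, fn=20)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these counts, accuracy is 970/1,000 = 0.97, precision is 90/100 = 0.90, and recall is 90/110, or about 0.818. The example shows how a high accuracy can coexist with a noticeably lower recall when the positive class is rare.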
The AI-based detection models demonstrate high accuracy. The off-chain monitoring module identifies cryptocurrency scams and phishing websites with an accuracy exceeding 90%, whereas the on-chain modules exhibit even greater accuracy, surpassing 96% in detecting mixers and fraudulent entities across both Bitcoin and Ethereum (as detailed in Table 3). Notably, this study indicates that reducing the feature sets (to 11 features for mixer detection in Bitcoin and 16 features for fraud detection in Ethereum) has no adverse impact on accuracy. This is corroborated by the consistent accuracy observed with both the full and the reduced feature sets.
Efficiency
The designed system is capable of operating in a real-time environment. One of the local machines is dedicated to processing all the websites newly uploaded to the web each day. On average, the machine processes more than 250,000 websites daily and determines which ones are scams in just 4–6 hours.
In terms of on-chain monitoring, real-time cryptocurrency transaction data are collected from external blockchain nodes that hold the current state of the cryptocurrency blockchains. The bottleneck in mixer and fraudulent entity detection is extracting data from these nodes, as the time depends on the current network speed, bandwidth, delay, server computation power, and other network factors at the client and server ends. In most cases, it takes less than 200 ms per address to retrieve the basic parameters. On the local machine, by contrast, it takes less than 1 ms to compute the selected 24 features from the basic ones. Once the features are computed, the detection time is negligible (i.e., less than 0.5 ms); it took around 20 ms to detect fraudulent entities from 2,739 instances.
Given these results, and with an in-house blockchain node running on the local machine, mixer services and fraudulent entities can be detected even when cryptocurrency transactions are generated at a very high rate.
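The timing figures quoted in this section can be put on a common scale with a few lines of arithmetic. The sketch below only restates numbers from the text; because they are given as "more than" or "less than," the results are bounds rather than exact measurements, and the variable names are illustrative.

```python
# Figures quoted in the text (bounds, not exact measurements).
WEBSITES_PER_DAY = 250_000
PROCESSING_HOURS = (4, 6)
RETRIEVAL_MS = 200.0   # per address, from external nodes
FEATURE_MS = 1.0       # per address, feature computation
DETECTION_MS = 0.5     # per address, model inference
BATCH_MS = 20.0        # reported time for one batch
BATCH_SIZE = 2_739     # instances in that batch


def websites_per_second(count, hours):
    """Average throughput when `count` websites are processed in `hours` hours."""
    return count / (hours * 3600)


throughput = {h: websites_per_second(WEBSITES_PER_DAY, h) for h in PROCESSING_HOURS}
per_instance_ms = BATCH_MS / BATCH_SIZE
budget_ms = RETRIEVAL_MS + FEATURE_MS + DETECTION_MS
retrieval_share = RETRIEVAL_MS / budget_ms

print({h: round(rate, 1) for h, rate in throughput.items()})
print(round(per_instance_ms, 4), round(retrieval_share, 3))
```

The web-monitoring figures correspond to roughly 11.6 to 17.4 websites per second, and the batch result to about 0.0073 ms per instance. At the quoted upper bounds, data retrieval accounts for more than 99% of the per-address time, which is consistent with the observation that a local blockchain node would remove the main bottleneck.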
Conclusions
This article presents a novel AI-based cryptocurrency monitoring system designed to identify scams, fraudulent entities, and mixers. The system aids in criminal investigations related to activities like money laundering, child trafficking and abuse, gambling, ransom, Ponzi schemes, and others. The system comprises two major modules: (1) off-chain monitoring and (2) on-chain monitoring. Off-chain monitoring involves real-time surveillance of the WWW to identify cryptocurrency phishing websites and extract associated cryptocurrency wallet addresses. Furthermore, it establishes the connection between the extracted wallet addresses and the dark web and traces them on social media.
On the other hand, on-chain monitoring performs real-time surveillance of cryptocurrency blockchains to detect fraud and money-laundering activities (by detecting mixer service involvement in a transaction). This module also enables users to assess the risk level associated with a given address, providing insight into the potential risks of trading with that address. Furthermore, clustering is applied to group all addresses associated with a single entity. In summary, off-chain monitoring helps prevent cryptocurrency fraud before it occurs, while on-chain monitoring detects fraud after it has been committed within the cryptocurrency blockchain.
Acknowledgments
The current work is supported by the Atlantic Innovation Fund and Mitacs (funding no. IT24468).
Note
This article is exclusively written by the authors. AI-based tools like ChatGPT and Grammarly are employed solely for the purpose of detecting and correcting typos and grammar errors.
References
- Lévesque, F. L., S. Chiasson, A. Somayaji, and J. Fernandez. “Technological and Human Factors of Malware Attacks: A Computer Security Clinical Trial Approach.” ACM Transactions on Privacy and Security, vol. 21, no. 4, 2018.
- Chainalysis Team. “2024 Crypto Crime Trends: Illicit Activity Down as Scamming and Stolen Funds Fall, but Ransomware and Darknet Markets See Growth.” https://www.chainalysis.com/blog/2024-crypto-crime-report-introduction/, accessed February 2024.
- Elliptic Enterprises Limited. “The State of Cross-Chain Crime 2023.” https://www.elliptic.co/resources/state-of-cross-chain-crime-2023, accessed February 2024.
- Al-Farsi, S., M. M. Rathore, and S. Bakiras. “Security of Blockchain-Based Supply Chain Management Systems: Challenges and Opportunities.” Applied Sciences, vol. 11, no. 12, p. 5585, 2021.
- Buterin, V. “Ethereum Whitepaper.” Ethereum, https://ethereum.org, 2014.
- King, S., and S. Nadal. “PPCoin: Peer-to-Peer Crypto-Currency With Proof-of-Stake.” Self-published paper, 19 August 2012.
- Toyoda, K., T. Ohtsuki, and P. T. Mathiopoulos. “Multi-Class Bitcoin-Enabled Service Identification Based on Transaction History Summarization.” 2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), pp. 1153–1160, Halifax, NS, Canada, 2018.
- Escobero, G. “Ethereum-Fraud-Dataset.” Kaggle, https://www.kaggle.com/datasets/gescobero/ethereum-fraud-dataset?resource=download, accessed February 2024.
- Camacho, L., G. Douzas, and F. Bação. “Geometric SMOTE for Regression.” Expert Systems With Applications, vol. 193, no. 2, p. 116387, January 2022.
- Yeo, I.-K., and R. A. Johnson. “A New Family of Power Transformations to Improve Normality or Symmetry.” Biometrika, vol. 87, no. 4, pp. 954–959, December 2000.
- Rein, S., and M. Reisslein. “Low-Memory Wavelet Transforms for Wireless Sensor Networks: A Tutorial.” IEEE Communications Surveys & Tutorials, vol. 13, no. 2, pp. 291–307, 2011.
- Hasell, J. “Measuring Inequality: What Is the Gini Coefficient?” Our World in Data, https://ourworldindata.org/what-is-the-gini-coefficient, 30 June 2023.
- Azhagusundari, B., and A. S. Thanamani. “Feature Selection Based on Information Gain.” International Journal of Innovative Technology and Exploring Engineering (IJITEE), vol. 2, issue 2, pp. 18–21, January 2013.
- Neo4j, Inc. “GenAI Apps, Grounded in Your Data.” https://neo4j.com/, accessed February 2024.
- Gray Wolf Analytics. “StaySafeCrypto: Analyze and Discover Deceptive Activities in Cryptocurrency.” https://staysafecrypto.com/, accessed February 2024.
- SOS Intelligence Limited. “Business Risk Insight Using Cyber Threat Intelligence.” https://sosintel.co.uk/, accessed February 2024.
- Chen, T., and C. Guestrin. “XGBoost: A Scalable Tree Boosting System.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, August 2016.
- Brownlee, J. XGBoost With Python: Gradient Boosted Trees With XGBoost and scikit-learn. Machine Learning Mastery, 2016.
- Rathore, M. M., S. Chaurasia, and D. Shukla. “Mixers Detection in Bitcoin Network: A Step Towards Detecting Money Laundering in Crypto-Currencies.” 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, pp. 5775–5782, 2022.
- Rathore, M. M., S. Chaurasia, D. Shukla, and P. Anand. “Detection of Fraudulent Entities in Ethereum Cryptocurrency: A Boosting-Based Machine Learning Approach.” 2023 Global Communications Conference, Kuala Lumpur, Malaysia, 2023.
- Bitquery Inc. “Bitquery: Blockchain API and Crypto Data Products.” https://bitquery.io/, accessed February 2024.
- Blockchain.com, Inc. “Blockchain Explorer APIs.” https://www.blockchain.com/, accessed February 2024.
- WHOIS API, Inc. “WHOIS API Offers Unified & Consistent Data.” https://whois.whoisxmlapi.com/, accessed February 2024.
- X Corp. “Twitter API.” X Developer Platform, https://developer.twitter.com/en/docs/twitter-api, accessed February 2024.
- XGBoost Developers. “XGBoost Documentation.” dmlc XGBoost, https://xgboost.readthedocs.io/en/stable/, accessed February 2024.
Biographies
Dhirendra Shukla is a professor and Dr. J. Herbert Smith Atlantic Canada Opportunities Agency chair in technology management and entrepreneurship at the University of New Brunswick (UNB), Canada, where he uses his telecom industry expertise and academic background to promote a bright future for UNB. His nominations as a finalist for Industry Champion by KIRA and Progress Media’s Innovation in Practice Award show his tireless efforts and vision. Dhirendra was a finalist for the RBC Top 25 Canadian Immigrant Award. He received the 2017 Entrepreneur Promotion Award from Startup Canada and the 2018 Outstanding Educator Award from the Association of Professional Engineers and Geoscientists of New Brunswick. Dr. Shukla holds a Ph.D. from King’s College London.
Muhammad Mazhar Ullah Rathore is a postdoctoral researcher at the University of New Brunswick, Canada, where he researches Big Data Analytics, the Internet of Things, Smart Systems, Network Traffic Analysis and Monitoring, Remote Sensing, Smart Cities, Urban Planning, Intrusion Detection, and Information Security and Privacy. He serves as guest editor for various journals and is a professional member of the Institute of Electrical and Electronics Engineers and the Association for Computing Machinery. Dr. Rathore holds a Ph.D. in computer science and engineering from Kyungpook National University, South Korea, and a master’s degree in computer and communication security from the National University of Sciences and Technology, Pakistan.