CycleML

GitHub entry. *Published with customer permission.*

**Introduction**

This project was commissioned by Ergotech, a company which specializes in factory automation software. The source material for all electronics, from CPUs to solar panels, is the semiconductor "wafer". A wafer undergoes a suite of technological processes in a facility called a tool, controlled by the customer's SECS/GEM software. The processing time can be considerable and the result is not always successful. During the process a tool generates an abundance of data. The customer decided to introduce ML to gain more insight into the current cycle and, potentially, to prevent costly wafer losses. In this project my task was to create an ML component that predicts the cycle processing time of a wafer batch given the batch parameters.

The idea was to build an online learning system which generates cycle time predictions and improves with each new batch. There are 3 input parameters: equipment name, recipe and wafer count. The time can be assumed linear in wafer count for each tool/recipe. Obviously there is a purely statistical solution to the problem; however, ML was chosen as the preferred way, with the prospect of extending it to other related problems. It was a requirement to use TensorFlow/Python for this project.

**Model**

Given the requirements, linear regression was chosen. There were two major design options, which I describe in the following two paragraphs.

As the first option, the tool and recipe could be combined into one ID parameter `i`, which denotes a separate model. A simple model is then trained and stored for each ID. Let `m` denote the number of tool/recipe combinations, `W` the wafer count and `Y` the outcome:

$$Y_i = \alpha_i + \beta_i * W_i, \quad\quad\quad i\in \{1, ..., m\}$$

Alternatively, I could use all 3 initial parameters as inputs to a single model that applies to all tools/recipes. Tool and recipe become two categorical features, while wafer count is a continuous feature. Let's combine tool and recipe into one categorical feature `T` and transform it into features `X_1` ... `X_m`:

$$ T\in \{1,...,m\} \mapsto X_1,...,X_m: X_i = \begin{cases} 1 & \text{if } T = i\\ 0 & \text{otherwise}\end{cases}$$

then

$$Y = \displaystyle\sum_{i=1}^{m} \alpha_i * X_i + \displaystyle\sum_{i=1}^{m} \beta_i * X_i * W$$

For example, if T = 2, then

$$Y = \alpha_2 + \beta_2 * W$$
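To make the one-hot construction concrete, here is a minimal NumPy sketch (the coefficient values and function names are illustrative, not from the project code):

```python
import numpy as np

def one_hot(t, m):
    """Map a combined tool/recipe ID t in {1, ..., m} to features X_1..X_m."""
    x = np.zeros(m)
    x[t - 1] = 1.0
    return x

# Illustrative per-ID coefficients alpha_i, beta_i for m = 3 combinations
alpha = np.array([4.0, 5.0, 6.0])
beta = np.array([0.3, 0.5, 0.7])

def predict(t, w, m=3):
    """Y = sum_i alpha_i * X_i + sum_i beta_i * X_i * W"""
    x = one_hot(t, m)
    return np.dot(alpha, x) + np.dot(beta, x) * w

# With T = 2 the sums collapse to the per-ID form: Y = alpha_2 + beta_2 * W
print(predict(2, 18))  # 5.0 + 0.5 * 18 = 14.0
```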

The first model was preferred as the simpler one. In addition, it allows adding a new tool/recipe setting at any time. The cost is the code to store m different models, which is acceptable.

To train the model the following algorithm was chosen. It is essentially Mini-Batch Gradient Descent, wrapped to work as an online training algorithm. By requirement, one sample arrives per application run. So, on receiving sample number N, the application performs the following steps:

1. Load the previously trained model, which was trained on the first N - TAIL samples, where TAIL = N % (BATCH_SIZE + 1). BATCH_SIZE is set to 30.

2. Load the samples from the last batch, add the newly received sample and train the model on the resulting TAIL samples.

3. Save the model only if TAIL == BATCH_SIZE.

In this way the model acquires information from the entire training set, but on each run it is trained on only the TAIL (at most 30) most recent samples.
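The bookkeeping in the steps above can be sketched in a few lines (an illustrative model of the control flow; the function names are mine):

```python
BATCH_SIZE = 30

def tail_length(n):
    """Number of samples held outside the last saved model state
    after sample number n, per the TAIL = N % (BATCH_SIZE + 1) rule."""
    return n % (BATCH_SIZE + 1)

def handle_sample(n):
    """Decide what a single application run does for sample number n."""
    tail = tail_length(n)
    train_on = tail              # the model is retrained on the tail samples
    save = (tail == BATCH_SIZE)  # persist the model only on a full batch
    return train_on, save

# Samples accumulate in the tail; the model is saved once per 31 samples,
# when the tail reaches the full batch size.
print(handle_sample(30))  # (30, True)
print(handle_sample(15))  # (15, False)
```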

For one specific ID, the advantages are the following:

1. There is no need to store all training samples (only the TAIL). Each model state is the result of one pass over the data.

2. If the parameters being inferred change (e.g. the recipe parameters change), the model will eventually adapt to this.

3. Compared to Stochastic Gradient Descent (with an optimal learning rate), it converges a little more slowly. However, SGD does not work well with a single pass over the data.
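The single-pass convergence behaviour can be reproduced with a small synthetic experiment (a sketch under assumed data and learning rate, not the project code):

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic cycle-time data: y = 5 + 0.5 * w + noise (assumed, not real data)
w = rng.uniform(5, 30, size=300)
y = 5.0 + 0.5 * w + rng.normal(scale=2.0, size=300)

def mse(b0, b1):
    """Mean squared error of the line b0 + b1 * w over the whole set."""
    return np.mean((b0 + b1 * w - y) ** 2)

def minibatch_gd(batch_size=30, lr=0.001, passes=1):
    """Plain mini-batch gradient descent; passes=1 is one pass over the data."""
    b0 = b1 = 0.0
    for _ in range(passes):
        for start in range(0, len(w), batch_size):
            wb, yb = w[start:start + batch_size], y[start:start + batch_size]
            err = b0 + b1 * wb - yb
            b0 -= lr * 2 * np.mean(err)
            b1 -= lr * 2 * np.mean(err * wb)
    return b0, b1

initial = mse(0.0, 0.0)
b0, b1 = minibatch_gd()
print(mse(b0, b1) < initial)  # True: the fit improves within a single pass
```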

*Figure: convergence comparison, MSE vs. number of iterations: Mini-Batch (one pass) and SGD (a few passes).*

The graphs below show the application's effectiveness when running on test data (see User Interface). The absolute error fluctuates around the expected value of 0.5 * dispersion, where dispersion is the maximum delta of the random distribution of points around the hypothetical curve modelled by the test data.

*Figure: absolute error vs. sample number, for dispersion = 4 and dispersion = 30.*

**Implementation**

According to the requirement, TensorFlow was used to implement the linear regression. The code snippet below shows a manual variable setup; alternatively one could use `tf.contrib.learn.LinearRegressor`.

```
self.x = tf.placeholder(tf.float32, name="x")
self.y_act = tf.placeholder(tf.float32, name="y_act")
"""input validity checking"""
x_shape = tf.shape(self.x)
y_act_shape = tf.shape(self.y_act)
self.assert_0 = tf.assert_equal(x_shape, y_act_shape, [x_shape, y_act_shape])
self.b0 = tf.Variable([0], dtype=tf.float32, trainable=True, name="b0")
self.b1 = tf.Variable([0], dtype=tf.float32, trainable=True, name="b1")
"""building the model"""
self.y = self.b0 + self.b1 * self.x
error = tf.squared_difference(self.y, self.y_act)
"""loss and training step"""
self.loss = tf.reduce_mean(error)
self.train = tf.train.GradientDescentOptimizer(0.001).minimize(self.loss)
```

For the storage of the TAIL data the `TFRecords` format is used. Binary formats are efficient in terms of access time and storage size. The format is not self-describing, so both the reading and the writing ends must be aware of the structure of the data.

```
writer = tf.python_io.TFRecordWriter(get_path(tool_recipe))
"""batch data"""
for i in range(0, len(self.wafer_counts)):
    xval = self.wafer_counts[i]
    yval = self.cycle_times[i]
    example = tf.train.Example(
        features=tf.train.Features(
            feature={
                'x': tf.train.Feature(
                    float_list=tf.train.FloatList(value=[xval])),
                'y': tf.train.Feature(
                    float_list=tf.train.FloatList(value=[yval])),
            }))
    serialized = example.SerializeToString()
    writer.write(serialized)
writer.close()
```

To generate train/test data a small console utility was created. It generates data points in either `.csv` or `TFRecords` format according to the preferred parameters. There are clusters around 10 and 20 wafers, as in real data.

`python generate.py --b0 5 --b1 0.5 --dispersion 3 --x_max 30 --tool_recipe RECIPE20_15 --csv`

**User interface**

The program is a console application and the general functionality consists of 2 commands. One wafer batch run and the corresponding cycle time observation are handled by a pair of commands. First, the enclosing system declares the next datapoint and reports the upcoming batch size (18) along with the tool/recipe name (centura_nano13-1). The application reports the predicted cycle time. The Mean Absolute Error is calculated over the last 10 measurements to give a hint about prediction accuracy. This is supposed to happen before batch processing.

`python main.py --next_datapoint 18.0 centura_nano13-1 --verbose`

```
Loading the model...
No pre-trained model yet
Tail batch length 10
Predicted cycle time (s): 33.0722
MAE(last 10)(s): 5.1159
```

Second, the user or the enclosing system reports the actual cycle time for the current batch. The system acquires the result and trains the model to improve further predictions. This is supposed to happen after batch processing is done.

`python main.py --finish_datapoint 29.6703167567 centura_nano13-1`

```
Acquired cycle time 29.6703167567 for wafer count 18.0
Absolute error: 3.401857315565625
DateTime, WaferCount, Predicted, Actual, AbsError, MeanAbsError(10), SequenceNo
Jun 23 17:48:02 2017, 18.0, 33.0722, 29.6703, 3.4019, 5.1159, 0
```
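The MAE(last 10) figure in the log can be maintained with a fixed-size window, for example (an illustrative sketch, not the project code):

```python
from collections import deque

class RollingMAE:
    """Mean absolute error over the last `window` predictions."""
    def __init__(self, window=10):
        self.errors = deque(maxlen=window)

    def update(self, predicted, actual):
        """Record one prediction/observation pair and return the current MAE."""
        self.errors.append(abs(predicted - actual))
        return sum(self.errors) / len(self.errors)

mae = RollingMAE()
print(round(mae.update(33.0722, 29.6703167567), 4))  # 3.4019, as in the log
```

Once the window is full, the oldest error is dropped automatically, so the metric tracks recent prediction quality rather than the whole history.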

For each tool a separate log is written, containing all measurements/predictions.

**Conclusion**

The predictive system was developed according to the target problem. Absolute error values close to the statistical distribution of the data are obtained after around ten measurements. The system behaves as an online training system, using the Mini-Batch algorithm internally. The TensorFlow library and its binary record format were utilized. The designed interface is suitable for easy integration into a larger system. The project was successfully shipped to the customer.