
[Dataset] OTTO – Multi-Objective Recommender System

carayoon 2022. 11. 30. 16:05

- Dataset Introduction

 

(링크 : https://www.kaggle.com/competitions/otto-recommender-system/data?select=test.jsonl)

 


 

  • The modeling goal for this dataset is to predict e-commerce clicks, cart additions, and orders. We therefore need to build a multi-objective recommender system from the logs of prior sessions.
  • The train data contains the full e-commerce session data.
  • In the test phase, we must predict, for each session, the aid (article id / product id) labels paired with each event type. (e.g., predictions take the form of session_number - event_type - aids label pairs)
session_type        predicted labels (aids)
12899779_clicks     129004 126836 118524
12899779_carts      129004 126836 118524
12899779_orders     129004 126836 118524
12899780_clicks     129004 126836 118524
12899780_carts      129004 126836 118524
  • The event types are clicks, carts, and orders.
  • Each event within a session has a timestamp (ts) value.
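The table above maps directly to the submission file format: one row per session/type pair, with up to 20 space-separated aids as the label string. A minimal sketch of building such a frame (column names follow the table above; the aid values are placeholders, not real predictions):

```python
import pandas as pd

sessions = [12899779, 12899780]
predicted_aids = [129004, 126836, 118524]  # placeholder predictions

# One row per (session, event type) pair, labels capped at 20 aids.
rows = []
for session in sessions:
    for event_type in ("clicks", "carts", "orders"):
        rows.append({
            "session_type": f"{session}_{event_type}",
            "labels": " ".join(str(aid) for aid in predicted_aids[:20]),
        })

submission = pd.DataFrame(rows)
print(submission.head())
```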

 

- Evaluation Method

 

Scoring uses a weighted Recall.

As the scoring formula shows, the Recall on orders carries the highest weight, so predicting the products a user actually orders is the most important sub-task.

Up to 20 aid labels may be predicted per row.
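The formula (an image in the original post) is, per the competition's evaluation page, a weighted sum of per-type recall@20: score = 0.10·R_clicks + 0.30·R_carts + 0.60·R_orders, where ground-truth set sizes are capped at 20. A minimal sketch of that metric (function names and toy data are mine):

```python
def recall_at_20(predictions, ground_truth):
    """predictions / ground_truth: dicts mapping session -> iterable of aids."""
    hits, total = 0, 0
    for session, truth in ground_truth.items():
        truth = set(truth)
        preds = list(predictions.get(session, []))[:20]  # at most 20 labels count
        hits += len(truth & set(preds))
        total += min(len(truth), 20)  # ground truth capped at 20
    return hits / total if total else 0.0

def weighted_score(preds_by_type, truth_by_type):
    weights = {"clicks": 0.10, "carts": 0.30, "orders": 0.60}
    return sum(w * recall_at_20(preds_by_type[t], truth_by_type[t])
               for t, w in weights.items())

# Toy example: perfect on clicks and orders, half recall on carts.
preds = {"clicks": {1: [10, 20]}, "carts": {1: [10]}, "orders": {1: [30]}}
truth = {"clicks": {1: [10]}, "carts": {1: [10, 40]}, "orders": {1: [30]}}
print(weighted_score(preds, truth))  # 0.10*1.0 + 0.30*0.5 + 0.60*1.0 = 0.85
```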

 

 

Multi objective CF

0. Data Overview

In [1]:
import json
import pandas as pd
from pathlib import Path
import os
import random
import numpy as np
from datetime import timedelta
In [25]:
DATA_PATH = Path('/Users/soyoon-yoon/Kaggle mining/Multi_CF')
TRAIN_PATH = DATA_PATH/'train.jsonl'
TEST_PATH = DATA_PATH/'test.jsonl'
In [8]:
sample_size = 10000
chunks = pd.read_json(TRAIN_PATH, lines=True, chunksize = sample_size)
In [9]:
with open(TRAIN_PATH, 'r') as f:
    print(f"We have {len(f.readlines()):,} lines in the training data")
We have 2,963,606 lines in the training data
In [12]:
for c in chunks:
    sample_train_df = c
    print(c)
    break
       session                                             events
10000    10000  [{'aid': 1033792, 'ts': 1659305201724, 'type':...
10001    10001  [{'aid': 476264, 'ts': 1659305201745, 'type': ...
10002    10002  [{'aid': 1754433, 'ts': 1659305201774, 'type':...
10003    10003  [{'aid': 1536959, 'ts': 1659305201875, 'type':...
10004    10004  [{'aid': 287161, 'ts': 1659305201899, 'type': ...
...        ...                                                ...
19995    19995  [{'aid': 1481519, 'ts': 1659305842045, 'type':...
19996    19996  [{'aid': 1109584, 'ts': 1659305842183, 'type':...
19997    19997  [{'aid': 1647277, 'ts': 1659305842315, 'type':...
19998    19998  [{'aid': 753948, 'ts': 1659305842328, 'type': ...
19999    19999  [{'aid': 1690380, 'ts': 1659305842502, 'type':...

[10000 rows x 2 columns]
In [13]:
sample_train_df
Out[13]:
session events
10000 10000 [{'aid': 1033792, 'ts': 1659305201724, 'type':...
10001 10001 [{'aid': 476264, 'ts': 1659305201745, 'type': ...
10002 10002 [{'aid': 1754433, 'ts': 1659305201774, 'type':...
10003 10003 [{'aid': 1536959, 'ts': 1659305201875, 'type':...
10004 10004 [{'aid': 287161, 'ts': 1659305201899, 'type': ...
... ... ...
19995 19995 [{'aid': 1481519, 'ts': 1659305842045, 'type':...
19996 19996 [{'aid': 1109584, 'ts': 1659305842183, 'type':...
19997 19997 [{'aid': 1647277, 'ts': 1659305842315, 'type':...
19998 19998 [{'aid': 753948, 'ts': 1659305842328, 'type': ...
19999 19999 [{'aid': 1690380, 'ts': 1659305842502, 'type':...

10000 rows × 2 columns

In [45]:
sample_train_df.loc[10543, :]['events'][:10]
Out[45]:
[{'aid': 602784, 'ts': 1659305228407, 'type': 'clicks'},
 {'aid': 602784, 'ts': 1659305254095, 'type': 'carts'},
 {'aid': 1456023, 'ts': 1659305269630, 'type': 'clicks'},
 {'aid': 1456023, 'ts': 1659305313768, 'type': 'carts'},
 {'aid': 1466811, 'ts': 1659305321905, 'type': 'clicks'},
 {'aid': 1456023, 'ts': 1659305346764, 'type': 'clicks'},
 {'aid': 602784, 'ts': 1659305500999, 'type': 'orders'},
 {'aid': 1456023, 'ts': 1659305500999, 'type': 'orders'},
 {'aid': 884043, 'ts': 1659305559962, 'type': 'clicks'},
 {'aid': 1343414, 'ts': 1659305600039, 'type': 'clicks'}]
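This session shows the typical funnel: the same aid is clicked, carted, then ordered. A small sketch tallying event types per aid, assuming an `events` list shaped like the output above (the list here is a trimmed copy of it):

```python
from collections import defaultdict

events = [
    {"aid": 602784, "ts": 1659305228407, "type": "clicks"},
    {"aid": 602784, "ts": 1659305254095, "type": "carts"},
    {"aid": 602784, "ts": 1659305500999, "type": "orders"},
    {"aid": 1456023, "ts": 1659305269630, "type": "clicks"},
]

# Per-aid funnel: how many clicks/carts/orders each article received.
funnel = defaultdict(lambda: {"clicks": 0, "carts": 0, "orders": 0})
for e in events:
    funnel[e["aid"]][e["type"]] += 1

print(dict(funnel))
```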
In [22]:
sample_train_df.set_index('session', drop=True, inplace=True)
sample_train_df.head()
Out[22]:
events
session
10000 [{'aid': 1033792, 'ts': 1659305201724, 'type':...
10001 [{'aid': 476264, 'ts': 1659305201745, 'type': ...
10002 [{'aid': 1754433, 'ts': 1659305201774, 'type':...
10003 [{'aid': 1536959, 'ts': 1659305201875, 'type':...
10004 [{'aid': 287161, 'ts': 1659305201899, 'type': ...
In [37]:
example_session = sample_train_df.iloc[100].item()

time_elapsed = example_session[-1]["ts"] - example_session[0]["ts"]

# The timestamp is in milliseconds since 00:00:00 UTC on 1 January 1970
print(f'The first session elapsed: {str(timedelta(milliseconds=time_elapsed))} \n')
The first session elapsed: 24 days, 0:33:11.682000 
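Since `ts` is in milliseconds since the Unix epoch, individual values can also be turned into readable datetimes with pandas (using a `ts` value from the sample session above):

```python
import pandas as pd

ts_ms = 1659305201724  # a 'ts' value from the sample above
dt = pd.to_datetime(ts_ms, unit="ms")  # interpret as ms since the epoch (UTC)
print(dt)  # 2022-07-31 22:06:41.724000
```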

In [38]:
# Count the frequency of actions within the session
action_counts = {}
for action in example_session:
    action_counts[action['type']] = action_counts.get(action['type'], 0) + 1  
print(f'The first session contains the following frequency of actions: {action_counts}')
The first session contains the following frequency of actions: {'clicks': 116, 'carts': 6}
In [28]:
with open(TEST_PATH, 'r') as f:
    print(f"We have {len(f.readlines()):,} lines in the test data")
We have 1,671,803 lines in the test data
In [27]:
sample_size = 150

chunks = pd.read_json(TEST_PATH, lines=True, chunksize = sample_size)

for c in chunks:
    sample_test_df = c
    break
In [31]:
sample_test_df.loc[0, 'events']
#sample_train_df.loc[10000, :]['events']
Out[31]:
[{'aid': 59625, 'ts': 1661724000278, 'type': 'clicks'}]
In [43]:
sample_test_df.loc[2, 'events']
Out[43]:
[{'aid': 141736, 'ts': 1661724000559, 'type': 'clicks'},
 {'aid': 199008, 'ts': 1661724022851, 'type': 'clicks'},
 {'aid': 57315, 'ts': 1661724170835, 'type': 'clicks'},
 {'aid': 194067, 'ts': 1661724246188, 'type': 'clicks'},
 {'aid': 199008, 'ts': 1661780623778, 'type': 'clicks'},
 {'aid': 199008, 'ts': 1661781274081, 'type': 'clicks'},
 {'aid': 199008, 'ts': 1661781409993, 'type': 'carts'},
 {'aid': 199008, 'ts': 1661804151788, 'type': 'clicks'},
 {'aid': 199008, 'ts': 1662060028567, 'type': 'clicks'},
 {'aid': 199008, 'ts': 1662060064706, 'type': 'clicks'},
 {'aid': 918667, 'ts': 1662060160406, 'type': 'clicks'}]
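A common next step is to flatten the nested `events` lists into one long DataFrame with one row per event, e.g. via `explode` plus dict expansion. A sketch on a toy frame shaped like `sample_test_df` (the events are copied from the outputs above):

```python
import pandas as pd

# Toy frame shaped like sample_test_df: one row per session,
# 'events' holds a list of {'aid', 'ts', 'type'} dicts.
df = pd.DataFrame({
    "session": [0, 2],
    "events": [
        [{"aid": 59625, "ts": 1661724000278, "type": "clicks"}],
        [{"aid": 141736, "ts": 1661724000559, "type": "clicks"},
         {"aid": 199008, "ts": 1661781409993, "type": "carts"}],
    ],
})

# One row per event, then expand each event dict into its own columns.
long = df.explode("events", ignore_index=True)
long = pd.concat(
    [long["session"], pd.json_normalize(long["events"].tolist())], axis=1
)
print(long)
```

This long format is usually much easier to aggregate (e.g. counts per aid and type) than the nested JSON lines.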