Monday, September 4, 2017

Python: extracting column from np array and making it flat

>>> import numpy as np
>>> A = np.array([[1,2,3,4],[5,6,7,8]])

>>> A
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

>>> A[:,2] # returns the third column
>>> import pandas as pd
>>> aa = pd.DataFrame(A[:,2].ravel())
>>> aa
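
For reference, the column slice is already one-dimensional here, so ravel() changes nothing; it becomes useful when the slice keeps a second axis, e.g. A[:,2:3]:

>>> A[:,2]
array([3, 7])
>>> A[:,2:3].ravel()
array([3, 7])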

MATLAB: reading a .mat file in Python

>>> import scipy.io as sio
>>> test = sio.loadmat('test.mat')
>>> test
{'a': array([[[  1.,   4.,   7.,  10.],
        [  2.,   5.,   8.,  11.],
        [  3.,   6.,   9.,  12.]]]),
 '__version__': '1.0',
 '__header__': 'MATLAB 5.0 MAT-file, written by
 Octave 3.6.3, 2013-02-17 21:02:11 UTC',
 '__globals__': []}
>>> oct_a = test['a']
>>> oct_a
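
The values come back as ordinary NumPy arrays, so the usual attributes apply; for the array shown above:

>>> oct_a.shape
(1, 3, 4)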

Friday, August 18, 2017

Multi-class vs Multi-label classification

Nice definitions, taken from this link:
https://stats.stackexchange.com/questions/11859/what-is-the-difference-between-multiclass-and-multilabel-problem/168945#168945

Multiclass classification means a classification task with more than two classes; e.g., classify a set of images of fruits which may be oranges, apples, or pears. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time.
Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.
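
In scikit-learn terms, the difference shows up in the shape of the target: a multiclass target has one label per sample, while a multilabel target is a binary indicator matrix. A small sketch (the labels are made up):

from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# multiclass: exactly one label per sample
fruits = ["apple", "pear", "orange", "apple"]
print(LabelEncoder().fit_transform(fruits))         # [0 2 1 0]

# multilabel: a (possibly empty) set of labels per sample
topics = [["religion", "politics"], ["finance"], []]
print(MultiLabelBinarizer().fit_transform(topics))  # one row per document, one column per topic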

Sunday, August 6, 2017

Python: removing non-ASCII characters from text

yourstring = yourstring.encode('ascii', 'ignore').decode('ascii')
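
A quick example of the effect (note that in Python 2 the string must already be a unicode object for this to work):

>>> 'naïve café'.encode('ascii', 'ignore').decode('ascii')
'nave caf'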

python: newspaper articles collection

import newspaper
et_paper = newspaper.build('http://cnn.com/', memoize_articles=False)

# list the collected article URLs
#for article in et_paper.articles:
#    print(article.url)

print(et_paper.size())

# list the category/section URLs that were discovered
for category in et_paper.category_urls():
    print(category)
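
To actually fetch the text of one of the collected articles (the index 0 is arbitrary; download() and parse() are the standard newspaper calls):

article = et_paper.articles[0]
article.download()
article.parse()
print(article.title)
print(article.text[:200])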


Wednesday, August 2, 2017

Python: selecting a column based on the values in another column and saving it to a CSV file

Example input (labeled_data.csv):
index,column1_name,column2_name
1,asdb,2
2,asdfsf,1
3,asdasdfasd,1
4,dgfg,2

Result (column1_name for the rows where column2_name == 2):
asdb
dgfg




import numpy as np
import pandas as pd
#reading the data
dataframe = pd.read_csv('labeled_data.csv')

for i, col in enumerate(dataframe.columns):
    print(i, col)


newData = dataframe.loc[dataframe['column2_name'] == 2]
newData2 = newData['column1_name']
newData2.to_csv("foo2.csv", index=False)

Python: reading the headers of the csv file

import numpy as np
import pandas as pd
#reading the data
hate_speech = pd.read_csv('labeled_data.csv')

for i, col in enumerate(hate_speech.columns):
    print(i, col)

Monday, July 31, 2017

sklearn: sklearn.cross_validation deprecated

sklearn.cross_validation is deprecated.

Now we have to use sklearn.model_selection
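
For example, the usual train_test_split import just moves to the new module:

#from sklearn.cross_validation import train_test_split   # old, deprecated
from sklearn.model_selection import train_test_split      # new location, same function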

Monday, July 24, 2017

Oracle VM: folder sharing and mounting

sudo mount -t vboxsf "<shared folder name in VirtualBox>" "<mount point in the guest>"

sudo mount -t vboxsf mydocuments /media/mydocuments 

word2vec: some explanation

https://en.wikipedia.org/wiki/Word2vec
https://code.google.com/archive/p/word2vec/

Thursday, June 22, 2017

Jena:How to initialize log4j properly?

While setting up log4j properly is great for "real" projects, you might want a quick-and-dirty solution, e.g. if you're just testing a new library.
If so, a call to the static method
org.apache.log4j.BasicConfigurator.configure();
will set up basic logging to the console, and the error messages will be gone.


Text taken from
https://stackoverflow.com/questions/1140358/how-to-initialize-log4j-properly

Friday, June 16, 2017

Problems in machine learning

The original post, "Machine Learning Done Wrong" by Cheng-Tao Chu, is reproduced below.


Machine Learning Done Wrong

Statistical modeling is a lot like engineering.
In engineering, there are various ways to build a key-value storage, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms to build a classifier, and each algorithm makes a different set of assumptions about the data.
When dealing with small amounts of data, it’s reasonable to try as many algorithms as possible and to pick the best one since the cost of experimentation is low. But as we hit “big data”, it pays off to analyze the data upfront and then design the modeling pipeline (pre-processing, modeling, optimization algorithm, evaluation, productionization) accordingly.
As pointed out in my previous post, there are dozens of ways to solve a given modeling problem. Each model assumes something different, and it’s not obvious how to navigate and identify which assumptions are reasonable. In industry, most practitioners pick the modeling algorithm they are most familiar with rather than pick the one which best suits the data. In this post, I would like to share some common mistakes (the don't-s). I’ll save some of the best practices (the do-s) in a future post.

1. Take default loss function for granted

Many practitioners train and pick the best model using the default loss function (e.g., squared error). In practice, off-the-shelf loss functions rarely align with the business objective. Take fraud detection as an example. When trying to detect fraudulent transactions, the business objective is to minimize the fraud loss. The off-the-shelf loss function of binary classifiers weighs false positives and false negatives equally. To align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount. Also, data sets in fraud detection usually contain highly imbalanced labels. In these cases, bias the loss function in favor of the rare case (e.g., through up/down sampling).
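
As a small scikit-learn sketch of that idea (the weights below are made up, not tuned for any real fraud data; X_train, y_train and transaction_amounts are placeholders), class_weight and sample_weight are the usual hooks for making the loss asymmetric:

from sklearn.linear_model import LogisticRegression

# penalize mistakes on the rare fraud class (label 1) 20x more than on the majority class
clf = LogisticRegression(class_weight={0: 1, 1: 20})

# or weight each transaction by its dollar amount when fitting:
# clf.fit(X_train, y_train, sample_weight=transaction_amounts)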

2. Use plain linear models for non-linear interaction

When building a binary classifier, many practitioners immediately jump to logistic regression because it's simple. But many also forget that logistic regression is a linear model, so non-linear interactions among predictors need to be encoded manually. Returning to fraud detection, high-order interaction features like "billing address = shipping address and transaction amount < $50" are required for good model performance. So one should prefer non-linear models like SVMs with a kernel or tree-based classifiers that bake in higher-order interaction features.
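
A minimal sketch of the two routes in scikit-learn (the feature layout is hypothetical: column 0 is a billing-equals-shipping flag, column 1 the transaction amount; the labels are invented):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = np.array([[1, 30.0], [0, 30.0], [1, 500.0], [0, 500.0], [1, 20.0], [0, 800.0]])
y = np.array([1, 0, 0, 0, 1, 0])

# route 1: hand-craft the interaction so the linear model can see it
interaction = ((X[:, 0] == 1) & (X[:, 1] < 50)).astype(float)
LogisticRegression().fit(np.column_stack([X, interaction]), y)

# route 2: let a kernel SVM (or a tree ensemble) learn the non-linearity itself
SVC(kernel='rbf').fit(X, y)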

3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take the example of revenue forecasting. If unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. But if the outliers are due to mechanical error, measurement error or anything else that’s not generalizable, it’s a good idea to filter out these outliers before feeding the data to the modeling algorithm.
Some models are more sensitive to outliers than others. For instance, AdaBoost might treat those outliers as "hard" cases and put tremendous weight on them, while a decision tree might simply count each outlier as one misclassification. If the data set contains a fair amount of outliers, it's important to either use a modeling algorithm that is robust to outliers or filter the outliers out.
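
One common way to filter such outliers before modeling, a sketch using the 1.5*IQR rule (the revenue numbers are invented):

import numpy as np

revenue = np.array([100.0, 102.0, 98.0, 105.0, 99.0, 1000.0])  # last value is a suspicious spike
q1, q3 = np.percentile(revenue, [25, 75])
iqr = q3 - q1
keep = (revenue >= q1 - 1.5 * iqr) & (revenue <= q3 + 1.5 * iqr)
clean = revenue[keep]  # the spike is dropped; investigate it separately if it might be real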

4. Use high variance model when n<<p

SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when the data has n << p (number of samples << number of features) -- common with, for example, medical data -- the richer feature space implies a much higher risk of overfitting the data. In fact, high variance models should be avoided entirely when n << p.
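
A bare-bones illustration of the safer choice in scikit-learn (hyperparameters omitted on purpose):

from sklearn.svm import SVC

# richer implicit feature space; easy to overfit when n << p
clf_flexible = SVC(kernel='rbf')
# lower-variance alternative for the n << p regime
clf_safer = SVC(kernel='linear')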

5. L1/L2/... regularization without standardization

Applying L1 or L2 penalties to large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of the importance of standardizing features before applying regularization.
Returning to fraud detection, imagine a linear regression model with a transaction amount feature. Without regularization, if the unit of transaction amount is dollars, the fitted coefficient will be around 100 times larger than the fitted coefficient if the unit were cents. With regularization, since the L1/L2 penalties punish larger coefficients more, the transaction amount will get penalized more if the unit is dollars. Hence, the regularization is biased and tends to penalize features on smaller scales. To mitigate the problem, standardize all the features and put them on an equal footing as a preprocessing step.
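
A minimal sketch of the fix in scikit-learn: put the scaler and the penalized model in one pipeline so the penalty sees every feature on the same scale (X_train and y_train are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# standardization happens before the L2-penalized fit, so dollars vs. cents no longer matters
model = make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', C=1.0))
# model.fit(X_train, y_train)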

6. Use linear model without considering multi-collinear predictors

Imagine building a linear model with two variables X1 and X2, and suppose the ground truth model is Y = X1 + X2. Ideally, if the data is observed with a small amount of noise, the linear regression solution would recover the ground truth. However, if X1 and X2 are collinear, then as far as most optimization algorithms are concerned, Y = 2*X1, Y = 3*X1 - X2 or Y = 100*X1 - 99*X2 are all equally good. The problem might not be detrimental, as it doesn't bias the estimation. However, it does make the problem ill-conditioned and makes the coefficient weights uninterpretable.
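
A quick synthetic demonstration of the instability (the noise levels are arbitrary):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x1 = rng.randn(200)
x2 = x1 + 1e-6 * rng.randn(200)        # X2 is almost perfectly collinear with X1
y = x1 + x2 + 0.01 * rng.randn(200)    # ground truth: Y = X1 + X2

coef = LinearRegression().fit(np.column_stack([x1, x2]), y).coef_
print(coef)  # predictions are fine, but how the weight splits between X1 and X2 is essentially arbitrary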

7. Interpreting absolute value of coefficients from linear or logistic regression as feature importance

Because many off-the-shelf linear regression packages return a p-value for each coefficient, many practitioners believe that for linear models, the bigger the absolute value of the coefficient, the more important the corresponding feature is. This is rarely true, because (a) changing the scale of a variable changes the absolute value of its coefficient, and (b) if features are multi-collinear, coefficients can shift from one feature to another. Also, the more features the data set has, the more likely the features are multi-collinear and the less reliable it is to read feature importance off the coefficients.
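
Point (a) is easy to see with a two-line experiment (synthetic data):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.rand(50, 1)
y = 3 * x[:, 0] + 0.1 * rng.randn(50)

print(LinearRegression().fit(x, y).coef_)        # roughly [3.]
print(LinearRegression().fit(x / 100, y).coef_)  # roughly [300.]: same model, "more important" looking coefficient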
So there you go: 7 common mistakes when doing ML in practice. This list is not meant to be exhaustive but merely to provoke the reader to consider modeling assumptions that may not be applicable to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions -- not just the one you’re most familiar with.
If you like the post, you can follow me (@chengtao_chu) on Twitter or subscribe to my blog "ML in the Valley". Also, special thanks to Ian Wong (@ihat) for reading a draft of this.

scipy installation: No lapack/blas

Solution
Install linear algebra libraries from repository (for Ubuntu),
sudo apt-get install gfortran libopenblas-dev liblapack-dev
Then install SciPy, either from source (after downloading it): python setup.py install, or via pip:
pip install scipy

Wednesday, June 7, 2017

SFTP to Amazon EC2

http://angus.readthedocs.io/en/2014/amazon/transfer-files-between-instance.html

Friday, May 19, 2017

Keras: Hyperparameter optimization example 1

import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

from keras.utils import np_utils
def one_hot_encode_object_array(arr):
    '''One hot encode a numpy array of objects (e.g. strings)'''
    uniques, ids = np.unique(arr, return_inverse=True)
    return np_utils.to_categorical(ids, len(uniques))
def create_model():
    model = Sequential()
    model.add(Dense(12, input_dim=4, activation='relu'))
    model.add(Dense(3, activation='sigmoid'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

seed = 7
numpy.random.seed(seed)
dataframe = pd.read_csv("iris.csv",header=None)
dataset = dataframe.values

X = dataset[:,0:4].astype(float)
Y = dataset[:,4]

train_X, test_X, train_y, test_y = train_test_split(X, Y, train_size=0.5, random_state=1)
#print X
Y_train = one_hot_encode_object_array(train_y)
Y_test = one_hot_encode_object_array(test_y)
#Y = one_hot_encode_object_array(Y1)
#print Y
model = KerasClassifier(build_fn=create_model, verbose=0)
batch_size = [10,20,40]
epochs =[10,50,100]
param_grid= dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result=grid.fit(train_X,Y_train)

print("Best:%f using %s" %(grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds,params):
    print("%f (%f) with: %r" % (mean, stdev, param))


-----------------------------------------------------------------------------
Results
Best:0.826667 using {'epochs': 100, 'batch_size': 10}
0.266667 (0.018856) with: {'epochs': 10, 'batch_size': 10}
0.640000 (0.198662) with: {'epochs': 50, 'batch_size': 10}
0.826667 (0.099778) with: {'epochs': 100, 'batch_size': 10}
0.413333 (0.049889) with: {'epochs': 10, 'batch_size': 20}
0.506667 (0.082192) with: {'epochs': 50, 'batch_size': 20}
0.773333 (0.147271) with: {'epochs': 100, 'batch_size': 20}
0.266667 (0.018856) with: {'epochs': 10, 'batch_size': 40}
0.320000 (0.299333) with: {'epochs': 50, 'batch_size': 40}
0.746667 (0.131993) with: {'epochs': 100, 'batch_size': 40}
 

Keras: Optimizing number of hidden layers
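
A minimal sketch of one way to do this, reusing the iris data and the KerasClassifier/GridSearchCV pattern from the previous post. The hidden_layers argument is just an illustrative name; the scikit-learn wrapper passes any matching keyword from param_grid through to build_fn:

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def create_model(hidden_layers=1):
    # stack a variable number of hidden layers before the output layer
    model = Sequential()
    model.add(Dense(12, input_dim=4, activation='relu'))
    for _ in range(hidden_layers - 1):
        model.add(Dense(12, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=create_model, epochs=50, batch_size=10, verbose=0)
grid = GridSearchCV(estimator=model, param_grid={'hidden_layers': [1, 2, 3]})
# grid_result = grid.fit(train_X, Y_train)   # same train_X / Y_train as in the post above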

Saturday, May 6, 2017

hadoop: Container is running beyond virtual memory limits

From the error message, you can see that you're using more virtual memory than your current limit of 1.0 GB. This can be resolved in two ways:
Disable Virtual Memory Limit Checking
YARN will simply ignore the limit; in order to do this, add this to your yarn-site.xml:
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
  <description>Whether virtual memory limits will be enforced for containers.</description>
</property>
The default for this setting is true.
Increase Virtual Memory to Physical Memory Ratio
In your yarn-site.xml, change this to a higher value than is currently set:
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>5</value>
  <description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.</description>
</property>
The default is 2.1.
You could also increase the amount of physical memory you allocate to a container.
Make sure you don't forget to restart YARN after you change the config.


hadoop: hadoop auxservice: mapreduce_shuffle does not exist

Add this to yarn-site.xml; when you set the MapReduce framework to YARN, it starts looking for these values.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Wednesday, May 3, 2017

Python: read all files in a directory and copy the text into one file

from os import listdir
from os.path import isfile, join

mypath = "./puzzles/p"
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

#text = []
for file in onlyfiles:
    with open(join(mypath, file), 'r') as f:
        text = f.readlines()
        #text = [l for l in text if "ROW" in l]
        # append this file's lines to the combined output file
        with open("out.txt", "a") as f1:
            f1.writelines(text)

Keras: installation with Tensorflow and opencv

$ mkvirtualenv keras_tf
$ workon keras_tf

$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.12.1-cp27-none-linux_x86_64.whl
#$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.12.0rc2-py2-none-any.whl
The wheel URL should match your OS and Python version; pick the appropriate one from the TensorFlow download page.

$ pip install --upgrade $TF_BINARY_URL


$ pip install numpy scipy
$ pip install scikit-learn
$ pip install pillow

$ pip install h5py


$ pip install keras

Before we get too far, we should check the contents of our keras.json configuration file. You can find this file in ~/.keras/keras.json.

$ gedit ~/.keras/keras.json

add  "image_dim_ordering": "tf" in the file and file contents should look lik


{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}


You might be wondering what exactly image_dim_ordering controls.
Using TensorFlow, images are represented as NumPy arrays with the shape (height, width, depth), where the depth is the number of channels in the image.
However, if you are using Theano, images are instead assumed to be represented as (depth, height, width).
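
To double-check which ordering your install is actually using, you can query the backend (in newer Keras versions this function is replaced by image_data_format()):

from keras import backend as K
print(K.image_dim_ordering())  # 'tf' for TensorFlow-style (height, width, depth)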

Find cv2.so
$ cd /
$ sudo find . -name '*cv2.so*'
./Users/adrianrosebrock/.virtualenvs/cv/lib/python2.7/site-packages/cv2.so
./Users/adrianrosebrock/.virtualenvs/gurus/lib/python2.7/site-packages/cv2.so
./Users/adrianrosebrock/.virtualenvs/keras_th/lib/python2.7/site-packages/cv2.so
./usr/local/lib/python2.7/site-packages/cv2.so

and symlink it into the virtual environment:


$ cd ~/.virtualenvs/keras_tf/lib/python2.7/site-packages/
$ ln -s /usr/local/lib/python2.7/site-packages/cv2.so cv2.so
$ cd ~

------------------------------------------------

References
Content taken from