朝小闇's Blog

The moon over the sea is the moon in the sky; the one before my eyes is the one in my heart

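Some Problems Encountered When Doing Data Analysis with Python

Workflow

The goal of this analysis was to count, across a large number of txt files, the total number of occurrences of a set of words, and to look up each file name in a separate large two-dimensional array (its first column stores file names; the matching second- and third-column values are added to an Excel sheet).

The analysis was done in Python. The main workflow is:

  1. Read the contents of the text files a1 and a2 (Chinese sentiment vocabularies: a positive word list and a negative word list) and store them as key-value pairs in two large dictionaries. Each key is a word from the list and each value is initialized to 0; the dictionaries are then used to match words in the target texts and count their occurrences: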

    includesPositive = {}
    includesNegative = {}

    # Load the positive word list: one word per line, count initialized to 0
    def readPositiveFile(path):
        txt = open(path, 'r', encoding='UTF-8').readlines()
        for key in txt:
            key = key.replace('\n', '')
            print(key)
            includesPositive[key] = 0

    # Load the negative word list in the same way
    def readNegativeFile(path):
        txt = open(path, 'r', encoding='UTF-8').readlines()
        for key in txt:
            key = key.replace('\n', '')
            print(key)
            includesNegative[key] = 0
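  2. Read the contents of text file b into a large two-dimensional array. Each line of the file holds three comma-separated strings and is stored as one row of the array: the first string is a file name, the second a numeric code, and the third a time. The special characters \n and * are filtered out: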

    # Problem 1: a global variable assigned inside a function loses its value once the function returns
    data = [[] for i in range(4149)]

    # Read the file line by line and split each line on the separator into one row of the 2D array
    def data_storage(path):
        txt = open(path, 'r', encoding='UTF-8').readlines()
        i = 0

        for row in txt:
            row = row.replace('\n', '')
            row = row.replace('*', '')
            data[i] = row.split(',')
            print(data[i])
            i = i + 1
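  3. Walk through all files under the target folder, automatically skipping files that are not in txt format or whose path contains the keyword 英文 ("English"), and record the file paths: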

    import os

    # Walk through every file under the given directory
    def ergodic():
        paths = r'.\00\005'
        # paths = r'.\test'
        fns = [os.path.join(root, fn) for root, dirs, files in os.walk(paths) for fn in files]
        for path in fns:
            if path[len(path) - 4:len(path)].lower() == '.txt':
                print(path)
                # Skip the English versions
                if path.find("英文") == -1:
                    search_word(path)
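  4. For each valid file path obtained from the loop:

    1. Extract the file name (without the extension) from the path into the string name and strip special characters from it. This name is matched against the file names in the two-dimensional array from step 2; on a match, the numeric code and time of that row are recorded;
    2. Filter special characters;
    3. Create a dictionary wordAndNum to store the total occurrences of positive and negative words;
    4. Read the whole file at that path into the string txt;
    5. Segment txt with jieba and store the result in words. Compare each word against the positive vocabulary includesPositive and the negative vocabulary includesNegative; on a match, add it to the positive count dictionary countsOfPositive or the negative count dictionary countsOfNegative, then store the key-value pairs of each dictionary in itemsOfPositive and itemsOfNegative respectively;
    6. Iterate over those items, add up the occurrences of all positive words and of all negative words into the wordAndNum dictionary, and compute the TONE value;
    7. Create a list li holding the file name, the matched numeric code, the matched time, the positive total, the negative total and TONE, then append li to the table;
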
    def search_word(path):
        name = get_name(path)
        # Strip special characters that would interfere with matching
        name = name.replace('-', '')
        name = name.replace('_', '')
        name = name.replace('*', '')

        data_num = match_name(name)
        data_match = ['', '']
        if data_num == 10000:
            data_match = ['NO', 'NO']
        else:
            print(data_num)
            print(data[data_num][0])
            data_match[0] = data[data_num][1]
            data_match[1] = data[data_num][2]
            print(data_match)
            # Delete the matched row
            del data[data_num]
            print("Rows left in the match source: " + str(len(data)) + ", keep going!")

        wordAndNum = {'Positive': 0, 'Negative': 0, 'TONE': 0}

        # Problem 2: an encoding error while reading ends the process immediately
        encoding = detectCode(path)
        print(encoding)
        print(path)
        if encoding == "UTF-16":
            txt = open(path, "r", encoding="utf-16-le").read()
        elif encoding == "UTF-8":
            txt = open(path, "r", encoding="utf-8").read()
        else:
            # Unknown encoding: fall back to reading raw bytes
            txt = open(path, "rb").read()

        words = jieba.lcut(txt)
        countsOfPositive = {}
        countsOfNegative = {}
        for word in words:
            # Count the words that appear in the positive and negative vocabularies
            if word in includesPositive:
                countsOfPositive[word] = countsOfPositive.get(word, 0) + 1
            if word in includesNegative:
                countsOfNegative[word] = countsOfNegative.get(word, 0) + 1

        dictsOfPositive = dict(countsOfPositive.items())
        itemsOfPositive = dictsOfPositive.items()

        dictsOfNegative = dict(countsOfNegative.items())
        itemsOfNegative = dictsOfNegative.items()

        print(itemsOfPositive)
        print(itemsOfNegative)

        # Each item is a (word, count) pair; index 1 is the count
        for item in itemsOfPositive:
            ls = list(item)
            # print(ls)
            for i in range(len(ls)):
                if i % 2 == 1:
                    wordAndNum['Positive'] = wordAndNum['Positive'] + ls[i]

        for item in itemsOfNegative:
            ls = list(item)
            for i in range(len(ls)):
                if i % 2 == 1:
                    wordAndNum['Negative'] = wordAndNum['Negative'] + ls[i]
        # Problem 3: the denominator must not be 0
        if wordAndNum['Positive'] + wordAndNum['Negative'] != 0:
            wordAndNum['TONE'] = abs(
                (wordAndNum['Positive'] - wordAndNum['Negative']) / (wordAndNum['Positive'] + wordAndNum['Negative']))
        else:
            wordAndNum['TONE'] = 0

        li = [name, data_match[0], data_match[1]]
        for key in wordAndNum:
            li.append(key)
            li.append(str(wordAndNum[key]))
        # fo = open("002.csv", "a"); "a" appends, "w" overwrites
        fo = open("年报数据.csv", "a")
        fo.write(",".join(li) + "\n")
        fo.close()
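
Steps 5 and 6 above boil down to counting how many tokens fall into each vocabulary. The following is a minimal sketch of that counting only, assuming the text has already been segmented into a list of tokens (for example by jieba.lcut). The function name count_sentiment and the sample words are hypothetical and not part of the original script.

```python
from collections import Counter

def count_sentiment(tokens, positive_words, negative_words):
    """Return (positive_total, negative_total, positive_hits, negative_hits)."""
    counts = Counter(tokens)  # word -> number of occurrences in the text
    pos_hits = {w: n for w, n in counts.items() if w in positive_words}
    neg_hits = {w: n for w, n in counts.items() if w in negative_words}
    return sum(pos_hits.values()), sum(neg_hits.values()), pos_hits, neg_hits

# Made-up sample tokens, standing in for the output of a word segmenter
sample_tokens = ["业绩", "增长", "亏损", "增长", "风险"]
pos_total, neg_total, pos_hits, neg_hits = count_sentiment(
    sample_tokens, {"增长"}, {"亏损", "风险"})
print(pos_total, neg_total)
```

Summing the dictionary values directly gives the same totals as walking the (word, count) pairs by index.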

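Problems and Solutions

Problem 1: a global variable assigned inside a function loses its value once the function returns

Detailed description:

The two-dimensional list data stores the file names, numeric codes and times read from file b; the three strings on each line of the file become one row of data. The file contents are first read into the list txt, one line per element, so the length of txt is known and data can be sized automatically to match. So I wrote the following: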

data = []
def data_storage(path):
    txt = open(path, 'r', encoding='UTF-8').readlines()
    data = [[] for i in range(len(txt))]
    i = 0
    for row in txt:
        row = row.replace('\n','')
        row = row.replace('*','')
        data[i] = row.split(',')
        print(data[i])
        i = i + 1

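That is, I defined the global variable data, and inside the function I first allocated its space and then assigned each row. As a result, after the function returned, the global data was still empty.

Cause:

The problem comes from rebinding data inside the function. Without a global declaration, Python treats any name assigned in a function body as local to that function, so data = [[] for i in range(len(txt))] creates a new local variable that shadows the global data = []. Everything written to it is discarded when the function returns, and the global list is never touched.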
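
The difference between rebinding a name and mutating the object it refers to can be seen in a few lines. This is an illustrative sketch only; the names rows, slots, rebind and mutate are made up for the demonstration and do not appear in the original script.

```python
rows = []            # module-level (global) list
slots = [None] * 3   # module-level list with preallocated slots

def rebind():
    # Assignment makes `rows` a local name; the global list is untouched.
    rows = ["a", "b"]
    return rows

def mutate():
    # Item assignment only looks the name up, so it changes the global list in place.
    slots[0] = "a"

local_result = rebind()
mutate()
print(rows)    # []
print(slots)   # ['a', None, None]
```

This is also why a preallocated global list keeps its contents: data[i] = ... mutates the existing list instead of rebinding the name.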

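Solution:

What the current code does: give the global variable a concrete size at the very start. The advantage is that it is simple and needs no thought; the drawback is that it is not very friendly, because the size is hard-coded.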
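
A friendlier alternative, sketched here under assumptions, is to let the function build the list and return it, so no size has to be known in advance. The name load_rows and the sample lines are hypothetical; skipping blank lines is an added assumption, not something the original code does. Declaring global data as the first statement of the original function would be another way to make the assignment reach the global variable.

```python
def load_rows(lines):
    """Turn comma-separated lines into a list of rows, stripping newlines and '*'."""
    rows = []
    for line in lines:
        cleaned = line.replace('\n', '').replace('*', '')
        if cleaned:  # assumption: blank lines carry no data and are skipped
            rows.append(cleaned.split(','))
    return rows

# Made-up sample lines in the "file name,code,time" layout described earlier
data = load_rows(["report*01,000001,2019-03-01\n", "report02,000002,2019-04-15\n"])
print(len(data))
```

With a real file, the lines would come from open(path, encoding='UTF-8').readlines(), as in the original function.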

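Problem 2: the process ends immediately on an encoding error while reading a file

Detailed description:

The run was initially interrupted partway through the data, with an error saying the content could not be decoded when the file was opened as UTF-16. I assumed the file was actually UTF-8, but it turned out that the file really was UTF-16 and that some data in it was not valid UTF-16, which is what triggered the error. My solution was simply to delete the offending file (half joking).

While working on this error I tried reading the text in binary mode. The program then ran normally, but in the end I could not get the data I wanted, so I looked up the differences between binary reading and the Unicode encodings UTF-8 and UTF-16.

Further notes: reading a file in binary mode versus reading it with a Unicode encoding:

Every file is ultimately stored as bytes. Reading a file in binary mode returns those bytes without any processing. Unicode assigns a number (a code point) to every character, and UTF-8 and UTF-16 are two ways of encoding those numbers as bytes: UTF-16 uses two bytes (16 bits) for most characters and four bytes for the rest, while UTF-8 uses one to four bytes. A file written in one encoding must be decoded with that same encoding. Put simply, binary mode gives you a bytes object, while text mode with an encoding gives you a decoded string. So files that hold raw binary data are best read in binary mode, while text files, being strings encoded as bytes, are best read in text mode with the correct encoding; otherwise errors are likely. Both UTF-8 and UTF-16 can represent English and Chinese alike: UTF-8 is more compact for English (one byte per ASCII character), whereas most common Chinese characters take three bytes in UTF-8 and two in UTF-16.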
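
One way to keep a single damaged file from stopping the whole run is to decode the bytes yourself and fall back step by step. The sketch below is not what the original script does (it relies on chardet); it uses only the standard library, and the function name read_text and the candidate list ("utf-8", "gb18030") are assumptions chosen for Chinese text. UTF-32 files are not handled.

```python
import codecs

def read_text(path, candidates=("utf-8", "gb18030")):
    """Read a text file as str, tolerating unknown or slightly damaged encodings."""
    with open(path, "rb") as f:
        raw = f.read()
    # A byte order mark, when present, identifies the encoding directly.
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig", errors="replace")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # errors="replace" turns undecodable bytes into U+FFFD instead of raising
        return raw.decode("utf-16", errors="replace")
    # Otherwise try each candidate strictly and keep the first one that fits.
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: never crash, mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

Because the result is always a str, the text can go straight to the word segmenter, and a file with a few bad bytes loses only those characters rather than being deleted.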

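Problem 3: the denominator must not be 0

Detailed description:

In the expression (wordAndNum['Positive'] - wordAndNum['Negative']) / (wordAndNum['Positive'] + wordAndNum['Negative']), both terms of the denominator are greater than or equal to 0, but the denominator can still be 0 when both are 0. In this data set the sentiment vocabularies contain only Chinese words while some of the texts are English versions, so the denominator did come out as 0. I therefore filtered those files out by path and, to guard against the same error recurring, added a simple check.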
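
The guarded calculation can be isolated in a small helper. This is a sketch of the same formula the script uses, including its abs(); the function name tone is hypothetical.

```python
def tone(positive, negative):
    """Return |P - N| / (P + N), or 0 when no sentiment word was found."""
    total = positive + negative
    if total == 0:
        # No positive or negative words at all (e.g. an English text): avoid dividing by zero
        return 0
    return abs((positive - negative) / total)

print(tone(3, 1))   # 0.5
print(tone(0, 0))   # 0
```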

The complete code is as follows:

# -*- coding: utf-8 -*-
import os
import jieba

import chardet

includesPositive = {}
includesNegative = {}
data = [[] for i in range(4149)]

def readPositiveFile(path):
    txt = open(path, 'r', encoding='UTF-8').readlines()
    for key in txt:
        key = key.replace('\n', '')
        print(key)
        includesPositive[key] = 0

def readNegativeFile(path):
    txt = open(path, 'r', encoding='UTF-8').readlines()
    for key in txt:
        key = key.replace('\n', '')
        print(key)
        includesNegative[key] = 0

# Read the file line by line and split each line on the separator into one row of the 2D array
def data_storage(path):
    txt = open(path, 'r', encoding='UTF-8').readlines()
    # data = [[] for i in range(len(txt))]
    i = 0

    for row in txt:
        row = row.replace('\n', '')
        row = row.replace('*', '')
        data[i] = row.split(',')
        print(data[i])
        i = i + 1
    # print(len(data))
    # del data[1]
    # print(len(data))

# Guess the file encoding from (at most) the first 200000 bytes
def detectCode(path):
    with open(path, 'rb') as file:
        raw = file.read(200000)
        dicts = chardet.detect(raw)
    return dicts["encoding"]

def parse():
    # Raw strings keep the backslashes in the Windows paths literal
    readPositiveFile(r'F:\PythonSpiders\Word\_file\_positive_simplified.txt')
    readNegativeFile(r'F:\PythonSpiders\Word\_file\_negative_simplified.txt')
    # print(len(includesPositive))
    # print(includesPositive.keys())
    # print(len(includesNegative))
    # print(includesNegative.keys())
    data_storage('0.txt')
    ergodic()

# Walk through every file under the given directory
def ergodic():
    paths = r'.\00\005'
    # paths = r'.\test'
    fns = [os.path.join(root, fn) for root, dirs, files in os.walk(paths) for fn in files]
    for path in fns:
        if path[len(path) - 4:len(path)].lower() == '.txt':
            print(path)
            # Skip the English versions
            if path.find("英文") == -1:
                search_word(path)

def search_word(path):
    name = get_name(path)
    # Strip special characters that would interfere with matching
    name = name.replace('-', '')
    name = name.replace('_', '')
    name = name.replace('*', '')

    data_num = match_name(name)
    data_match = ['', '']
    if data_num == 10000:
        data_match = ['NO', 'NO']
    else:
        print(data_num)
        print(data[data_num][0])
        data_match[0] = data[data_num][1]
        data_match[1] = data[data_num][2]
        print(data_match)
        # Delete the matched row
        del data[data_num]
        print("Rows left in the match source: " + str(len(data)) + ", keep going!")

    wordAndNum = {'Positive': 0, 'Negative': 0, 'TONE': 0}

    encoding = detectCode(path)
    print(encoding)
    print(path)
    if encoding == "UTF-16":
        txt = open(path, "r", encoding="utf-16-le").read()
    elif encoding == "UTF-8":
        txt = open(path, "r", encoding="utf-8").read()
    else:
        # Unknown encoding: fall back to reading raw bytes
        txt = open(path, "rb").read()
    words = jieba.lcut(txt)
    countsOfPositive = {}
    countsOfNegative = {}
    for word in words:
        # Count the words that appear in the positive and negative vocabularies
        if word in includesPositive:
            countsOfPositive[word] = countsOfPositive.get(word, 0) + 1
        if word in includesNegative:
            countsOfNegative[word] = countsOfNegative.get(word, 0) + 1

    dictsOfPositive = dict(countsOfPositive.items())
    itemsOfPositive = dictsOfPositive.items()

    dictsOfNegative = dict(countsOfNegative.items())
    itemsOfNegative = dictsOfNegative.items()

    print(itemsOfPositive)
    print(itemsOfNegative)

    # Each item is a (word, count) pair; index 1 is the count
    for item in itemsOfPositive:
        ls = list(item)
        # print(ls)
        for i in range(len(ls)):
            if i % 2 == 1:
                wordAndNum['Positive'] = wordAndNum['Positive'] + ls[i]

    for item in itemsOfNegative:
        ls = list(item)
        # print(ls)
        for i in range(len(ls)):
            if i % 2 == 1:
                wordAndNum['Negative'] = wordAndNum['Negative'] + ls[i]
    if wordAndNum['Positive'] + wordAndNum['Negative'] != 0:
        wordAndNum['TONE'] = abs(
            (wordAndNum['Positive'] - wordAndNum['Negative']) / (wordAndNum['Positive'] + wordAndNum['Negative']))
    else:
        wordAndNum['TONE'] = 0

    li = [name, data_match[0], data_match[1]]
    for key in wordAndNum:
        li.append(key)
        li.append(str(wordAndNum[key]))
    # fo = open("002.csv", "a"); "a" appends, "w" overwrites
    fo = open("年报数据.csv", "a")
    fo.write(",".join(li) + "\n")
    fo.close()

# Compare name with the file name of each row in data; return the row index
# on a match, or the fixed value 10000 when nothing matches
def match_name(name):
    print(name)
    print(len(data))
    for i in range(len(data)):
        # Skip preallocated rows that were never filled
        if data[i] and name == data[i][0]:
            print(data[i][0])
            return i
    return 10000

# Return the file name (no directory, no 4-character extension)
def get_name(path):
    nameNum = last_char(path)
    name = path[nameNum + 1:len(path) - 4]
    return name

# Return the index of the last backslash in the string
def last_char(text):
    lastNum = 0
    num = 0
    char = "\\"
    for ch in text:
        if char == ch:
            lastNum = num
        num = num + 1
    return lastNum

if __name__ == '__main__':
    parse()
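
The helper functions at the end of the listing (get_name, last_char, match_name) can be expressed more compactly with the standard library and a dictionary. The sketch below is an alternative under assumptions, not the script's own code: the names stem, build_index and lookup are hypothetical, and ntpath is used so that backslash-separated paths are parsed the same way on any platform (on Windows, os.path is the same module). Unlike the linear scan, a dictionary keeps only the last row when a file name occurs twice.

```python
import ntpath

def stem(path):
    """File name without directory or extension, for backslash-separated paths."""
    return ntpath.splitext(ntpath.basename(path))[0]

def build_index(rows):
    """Map file name -> (code, time) for every row that has at least three fields."""
    return {row[0]: (row[1], row[2]) for row in rows if len(row) >= 3}

def lookup(index, name):
    """Return and remove the match for name, or ('NO', 'NO') when there is none."""
    return index.pop(name, ('NO', 'NO'))

# Made-up rows in the "file name, code, time" layout; the empty row is ignored
index = build_index([["report01", "000001", "2019-03-01"], []])
print(lookup(index, stem(r'.\00\005\report01.txt')))
```

Popping the entry mirrors the del data[data_num] step, and the lookup no longer needs a sentinel such as 10000.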

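Problem 4: string output with print

When a message is built with +, print must be given a single string. print("num=" + x) with x = 1 does not work, because a str and an int cannot be concatenated (Python raises TypeError); the variable x has to be converted first, as in print("num=" + str(x)), for the output to be correct.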
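
A short sketch of the usual ways to combine text and a number, for comparison; the variable names are made up for the example.

```python
x = 1
line_concat = "num=" + str(x)      # explicit conversion, as described above
line_fstring = f"num={x}"          # f-string (Python 3.6+) converts the value itself
line_format = "num={}".format(x)   # str.format does the same

print(line_concat)
print(line_fstring)
print("num=", x, sep="")           # print also accepts several arguments
```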

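A little bonus

Here is some code for batch-converting pdf to txt. To be honest, this simple conversion is not very complete and is only good for fun; dedicated software does a better job: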

# -*- coding: utf-8 -*-
import sys
import importlib
import os

importlib.reload(sys)
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfinterp import PDFTextExtractionNotAllowed

'''
Parse pdf text and save it to a txt file
'''

# path = r'001.pdf'

# Walk through every file under the given directory
def parse():
    paths = r'F:\年报00\年报01\年报'
    fns = [os.path.join(root, fn) for root, dirs, files in os.walk(paths) for fn in files]
    for path in fns:
        if path[len(path)-4:len(path)].lower() == '.pdf':
            print(path)
            parse_next(path)

# Return the output file name (same name with a .txt extension)
def get_name(path):
    nameNum = last_char(path)
    name = path[nameNum+1:len(path)-4] + '.txt'
    return name

# Return the index of the second-to-last backslash in the string
def last_char(text):
    beforeNum = 0
    afterNum = 0
    num = 0
    char = "\\"
    for ch in text:
        if char == ch:
            beforeNum = afterNum
            afterNum = num
        num = num + 1
    return beforeNum

def parse_next(path):
    name = get_name(path)
    fp = open(path, 'rb')  # open in binary read mode
    # Create a pdf parser from the file object
    parser = PDFParser(fp)
    # Create a PDF document
    doc = PDFDocument()
    # Connect the parser and the document object
    parser.set_document(doc)
    doc.set_parser(parser)

    # Supply the initial password;
    # with no password, an empty string is used
    doc.initialize()

    # Check whether the document allows text extraction; if not, raise
    if not doc.is_extractable:
        raise PDFTextExtractionNotAllowed
    else:
        # Create a PDF resource manager to manage shared resources
        rsrcmgr = PDFResourceManager()
        # Create a PDF device object
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        # Create a PDF interpreter object
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Loop over the pages, handling one page at a time
        for page in doc.get_pages():  # doc.get_pages() returns the page list
            interpreter.process_page(page)
            # Receive the LTPage object for this page
            layout = device.get_result()
            # layout is an LTPage object holding the objects parsed from this page,
            # typically LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal and so on;
            # to get the text, read it from the text box objects
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    with open(name, 'a', encoding='utf-8') as f:
                        results = x.get_text()
                        f.write(results + '\n')

if __name__ == '__main__':
    parse()